glam/docs/GRAPH_SCORE_INHERITANCE.md
2025-12-26 14:30:31 +01:00


# Graph Score Inheritance in Hybrid Retrieval
## Overview
The Heritage RAG system uses a **hybrid retrieval** approach that combines:
1. **Vector search** (semantic similarity via embeddings)
2. **Knowledge graph expansion** (SPARQL-based relationship discovery)

This document explains the **graph score inheritance** feature that ensures vector search results benefit from knowledge graph relationships.
## The Problem
Before graph score inheritance, the hybrid retrieval had a scoring gap:

| Result Source | Vector Score | Graph Score | Combined Score |
|---------------|--------------|-------------|----------------|
| Vector search results | 0.5-0.8 | **0.0** | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |

**Why this happened:**
- Vector search finds institutions semantically similar to the query
- Graph expansion finds **different** institutions (same city/type) with different GHCIDs
- Since GHCIDs don't match, no direct merging occurs
- Vector results always dominate because `combined = 0.7 * vector + 0.3 * graph`
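
The gap follows directly from the weighting. A minimal sketch, assuming the 0.7/0.3 weights quoted above; `combined_score` is a hypothetical helper for illustration, not a function from the codebase:

```python
VECTOR_WEIGHT = 0.7  # weights quoted in this document
GRAPH_WEIGHT = 0.3


def combined_score(vector_score: float, graph_score: float) -> float:
    """Weighted sum used for ranking (hypothetical helper)."""
    return VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * graph_score


# Weakest vector hit (graph score 0.0) vs. strongest graph-only hit (vector score 0.0).
weakest_vector_hit = combined_score(0.5, 0.0)
strongest_graph_hit = combined_score(0.0, 0.8)

print(f"{weakest_vector_hit:.2f} vs {strongest_graph_hit:.2f}")  # prints 0.35 vs 0.24
```

Even the best graph-only result lands below the weakest vector result, so graph expansion could never change the top of the ranking.
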
**Example before fix:**
```
Query: "Archieven in Amsterdam"
1. Stadsarchief Amsterdam | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief | V:0.675 G:0.000 C:0.472
3. The Black Archives | V:0.636 G:0.000 C:0.445
```
The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.
## The Solution: Graph Score Inheritance
Vector results now **inherit** graph scores from related institutions found via graph expansion.
### How It Works
```
1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores

2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
           └── These get graph_score=0.8 (same_city)
           └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = higher rank!
```
### Inheritance Factor
```python
INHERITANCE_FACTOR = 0.5 # Inherit 50% of related institutions' graph scores
```
This means:
- Same-city institutions (graph_score=0.8) → inherited score of **0.40**
- Same-type institutions (graph_score=0.5) → inherited score of **0.25**
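
The two figures above can be reproduced with a few lines. This is a sketch under the assumption that the inherited score is the plain average of the related institutions' graph scores multiplied by the factor; `inherited_score` is a hypothetical helper name:

```python
INHERITANCE_FACTOR = 0.5


def inherited_score(related_graph_scores: list[float]) -> float:
    """Average the related institutions' graph scores, then scale by the factor."""
    if not related_graph_scores:
        return 0.0  # nothing to inherit from
    average = sum(related_graph_scores) / len(related_graph_scores)
    return average * INHERITANCE_FACTOR


same_city = inherited_score([0.8, 0.8])  # two same-city neighbours
same_type = inherited_score([0.5])       # one same-type neighbour
mixed = inherited_score([0.8, 0.5])      # one of each

print(f"{same_city:.3f} {same_type:.3f} {mixed:.3f}")
```

Because the scores are averaged, a vector result with a mix of same-city and same-type neighbours ends up between the two values listed above.
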
## Implementation Details
### File Location
```
/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py
```
### Key Method: `_combine_and_rank()`
Located at lines ~1539-1671, this method:
1. **Creates lookup by GHCID** for merging
2. **Handles direct merges** when graph result GHCID matches vector result
3. **Builds inheritance map** tracking which vector results each graph result was expanded from
4. **Applies inheritance** by calculating inherited scores for vector results
5. **Computes combined scores** with the formula: `0.7 * vector + 0.3 * graph`
### Code Structure
```python
# Excerpt: method of HybridRetriever; RetrievedInstitution is defined elsewhere in the codebase.
def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""
    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()

    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)

    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}
    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )

    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)

    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )

    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
```
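
To experiment with the logic outside the service, the method can be approximated as a free function. The sketch below is not the production code: `Institution` is a hypothetical stand-in for `RetrievedInstitution` with only the fields used here, the weights are passed as parameters instead of read from `self`, and the GHCIDs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Institution:
    """Hypothetical stand-in for RetrievedInstitution with only the fields used here."""
    ghcid: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)


def combine_and_rank(
    vector_results: list[Institution],
    graph_results: list[Institution],
    k: int,
    vector_weight: float = 0.7,
    graph_weight: float = 0.3,
    inheritance_factor: float = 0.5,
) -> list[Institution]:
    """Standalone version of the merge, inheritance, and ranking steps."""
    results = {inst.ghcid: inst for inst in vector_results}
    # Graph scores each vector result may inherit, keyed by the vector result's GHCID.
    inheritable: dict[str, list[float]] = {ghcid: [] for ghcid in results}

    for inst in graph_results:
        if inst.ghcid in results:
            # Direct merge: the same institution was found by both retrieval paths.
            existing = results[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
            continue
        results[inst.ghcid] = inst
        for seed_ghcid in inst.related_institutions:
            if seed_ghcid in inheritable:
                inheritable[seed_ghcid].append(inst.graph_score)

    for ghcid, scores in inheritable.items():
        if scores:
            inherited = sum(scores) / len(scores) * inheritance_factor
            results[ghcid].graph_score = max(results[ghcid].graph_score, inherited)

    for inst in results.values():
        inst.combined_score = (
            vector_weight * inst.vector_score + graph_weight * inst.graph_score
        )
    return sorted(results.values(), key=lambda i: i.combined_score, reverse=True)[:k]


ranked = combine_and_rank(
    vector_results=[Institution("VEC-A", vector_score=0.641),
                    Institution("VEC-B", vector_score=0.623)],
    graph_results=[Institution("GRAPH-X", graph_score=0.8, related_institutions=["VEC-A"]),
                   Institution("GRAPH-Y", graph_score=0.8, related_institutions=["VEC-A"])],
    k=3,
)
for inst in ranked:
    print(f"{inst.ghcid:8} G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

In this toy input, `VEC-A` inherits a graph score of 0.4 from its two same-city neighbours, while `VEC-B` has no related graph results and keeps a graph score of 0.0.
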
### Graph Expansion Scores
The `_expand_via_graph()` method assigns these base scores:
| Expansion Type | Graph Score | SPARQL Pattern |
|----------------|-------------|----------------|
| Same city | 0.8 | `?s schema:location ?loc . ?loc hc:cityCode ?cityCode` |
| Same institution type | 0.5 | `?s hc:institutionType ?type` |
## Results
### Before (Graph Score = 0.0)
```
Query: "Welke musea zijn er in Utrecht?"
1. Centraal Museum | V:0.589 G:0.000 C:0.412
2. Museum Speelklok | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht | V:0.641 G:0.000 C:0.449
```
### After (Graph Score Inherited)
```
Query: "Welke musea zijn er in Utrecht?"
1. Universiteitsmuseum Utrecht | V:0.641 G:0.400 C:0.569
2. Museum Speelklok | V:0.591 G:0.400 C:0.534
3. Centraal Museum | V:0.589 G:0.400 C:0.532
```
**Key improvements:**
- Graph scores now **0.400** (inherited from same-city museums)
- Combined scores **increased by 0.12**, or roughly 27-29% (for example 0.412 → 0.532)
- Ranking now considers **geographic relevance**
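
The before and after columns can be recomputed from the vector scores alone. A small check, assuming the documented 0.7/0.3 weights and an inherited graph score of 0.4 for every result:

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3
INHERITED_GRAPH_SCORE = 0.4  # 0.8 (same city) * 0.5 inheritance factor

vector_scores = {
    "Universiteitsmuseum Utrecht": 0.641,
    "Museum Speelklok": 0.591,
    "Centraal Museum": 0.589,
}

recomputed = {}
for name, vector_score in vector_scores.items():
    before = VECTOR_WEIGHT * vector_score
    after = before + GRAPH_WEIGHT * INHERITED_GRAPH_SCORE
    recomputed[name] = (round(before, 3), round(after, 3))
    gain = (after - before) / before  # relative increase of the combined score
    print(f"{name:28} {before:.3f} -> {after:.3f} (+{gain:.1%})")
```

Every result gains the same absolute amount (0.3 × 0.4 = 0.12), so the relative gain is slightly larger for results with lower vector scores.
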
### More Examples
```
Query: "Bibliotheken in Den Haag"
1. Centrale Bibliotheek | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek | V:0.676 G:0.400 C:0.593
3. Huis van het Boek | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436 <- No inheritance (different city)
```
## Configuration
### Weights (in `HybridRetriever.__init__`)
```python
self.vector_weight = 0.7 # Semantic similarity importance
self.graph_weight = 0.3 # Knowledge graph importance
```
### Inheritance Factor
```python
INHERITANCE_FACTOR = 0.5 # In _combine_and_rank()
```
**Tuning considerations:**
- Higher factor (0.6-0.8): Stronger influence from graph relationships
- Lower factor (0.3-0.4): More conservative, vector similarity dominates
- Current value (0.5): Balanced approach
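
To get a feel for the tuning range, the effect of the factor on the combined score can be computed directly. This sketch assumes a vector result that starts with a graph score of 0.0 and inherits only from same-city institutions (graph score 0.8); `combined_boost` is a hypothetical helper:

```python
GRAPH_WEIGHT = 0.3
SAME_CITY_SCORE = 0.8


def combined_boost(inheritance_factor: float, related_score: float = SAME_CITY_SCORE) -> float:
    """Amount added to the combined score of a vector result that had graph_score 0.0."""
    return GRAPH_WEIGHT * related_score * inheritance_factor


boosts = {factor: round(combined_boost(factor), 3) for factor in (0.3, 0.5, 0.8)}
print(boosts)  # {0.3: 0.072, 0.5: 0.12, 0.8: 0.192}
```

With vector scores in the 0.5-0.8 range, neighbouring results often differ by only a few hundredths, so even the conservative setting can reorder them.
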
## Logging
The implementation includes detailed logging for debugging:
```python
# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")
# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
```
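
The variables in those log calls are not shown in this document. The sketch below illustrates one way they might be produced; the logger name, the `log_inheritance` helper, the tuple layout, and the way `LOG_LEVEL` is read from the environment are all assumptions rather than the service's actual setup:

```python
import logging
import os

# Assumption: the service maps a LOG_LEVEL environment variable onto the logging level.
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())
logger = logging.getLogger("hybrid_retriever")  # hypothetical logger name


def log_inheritance(boosts: list[tuple[str, float, float, int]]) -> list[str]:
    """Log one DEBUG line per boosted result and one INFO summary; return the GHCIDs logged.

    Each tuple is (ghcid, old_graph_score, new_graph_score, number_of_related_results).
    """
    for ghcid, old, new, n in boosts:
        logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
    ghcids = [ghcid for ghcid, *_ in boosts]
    if ghcids:
        logger.info(f"Graph inheritance applied to {len(ghcids)} vector results: {ghcids[:3]}...")
    return ghcids


logged = log_inheritance([("NL-UT-UTR-M-CM", 0.0, 0.4, 2)])
```
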
**Check logs on production:**
```bash
ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"
```
## API Response Structure
The graph score is exposed in the API response under `scores.graph`; for vector results it is now populated via inheritance:
```json
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
```
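
A client can use these fields to sanity-check a response. The sketch below parses the sample payload shown above and verifies that each combined score matches the documented 0.7/0.3 weighting; `is_consistent` and its tolerance are illustrative choices, not part of the API:

```python
import json

# Sample payload copied from the response shown above.
payload = json.loads("""
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
""")


def is_consistent(scores: dict, vector_weight: float = 0.7, graph_weight: float = 0.3,
                  tolerance: float = 0.001) -> bool:
    """True if the combined score equals the weighted sum, allowing for 3-decimal rounding."""
    expected = vector_weight * scores["vector"] + graph_weight * scores["graph"]
    return abs(expected - scores["combined"]) <= tolerance


checks = {r["ghcid"]: is_consistent(r["scores"]) for r in payload["retrieved_results"]}
print(checks)
```
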
## Deployment
**File to deploy:**
```bash
scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/
```
**Restart service:**
```bash
ssh root@91.98.224.44 "systemctl restart glam-rag-api"
```
**Verify:**
```bash
curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
-H 'Content-Type: application/json' \
-d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
```
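
The inline Python in the verify command is dense. The same formatting as a standalone function may be easier to adapt; this is a sketch that assumes the response shape documented in the API Response Structure section, and it runs against a hand-written sample rather than real API output:

```python
def format_rows(response: dict, limit: int = 5) -> list[str]:
    """One line per result: name truncated/padded to 30 characters, then the graph score."""
    return [
        f"{result['name'][:30]:30} G:{result['scores']['graph']:.2f}"
        for result in response["retrieved_results"][:limit]
    ]


# Hand-written sample in the documented response shape (not real API output).
sample = {"retrieved_results": [{"name": "Centraal Museum", "scores": {"graph": 0.4}}]}
rows = format_rows(sample)
print("\n".join(rows))
```

A graph score of 0.00 on every row after deployment would suggest that inheritance is not being applied.
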
## Related Files
| File | Purpose |
|------|---------|
| `hybrid_retriever.py` | Main implementation with `_combine_and_rank()` |
| `dspy_heritage_rag.py` | RAG pipeline that calls `retriever.search()` |
| `main.py` | FastAPI endpoints serving the RAG API |
## Future Improvements
1. **Dynamic inheritance factor**: Adjust based on query type (geographic vs. thematic)
2. **Multi-hop expansion**: Inherit from institutions 2+ hops away
3. **Weighted inheritance**: Weight by relationship type (same_city=0.8, same_type=0.5)
4. **Negative inheritance**: Penalize results unrelated to graph findings
---
**Last Updated:** 2025-12-24

**Implemented:** 2025-12-23

**Status:** Production (archief.support)