Graph Score Inheritance in Hybrid Retrieval
Overview
The Heritage RAG system uses a hybrid retrieval approach that combines:
- Vector search (semantic similarity via embeddings)
- Knowledge graph expansion (SPARQL-based relationship discovery)
This document explains the graph score inheritance feature that ensures vector search results benefit from knowledge graph relationships.
The Problem
Before graph score inheritance, the hybrid retrieval had a scoring gap:
| Result Source | Vector Score | Graph Score | Combined Score |
|---|---|---|---|
| Vector search results | 0.5-0.8 | 0.0 | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |
Why this happened:
- Vector search finds institutions semantically similar to the query
- Graph expansion finds different institutions (same city/type) with different GHCIDs
- Since GHCIDs don't match, no direct merging occurs
- Vector results always dominate because combined = 0.7 * vector + 0.3 * graph
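The dominance of vector results follows directly from the weights. The check below uses the score ranges from the table above; the function name `combined` is illustrative and not taken from the codebase.

```python
VECTOR_WEIGHT = 0.7
GRAPH_WEIGHT = 0.3


def combined(vector_score: float, graph_score: float) -> float:
    """Weighted sum used for ranking (weights as stated in this document)."""
    return VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * graph_score


# Weakest typical vector-only hit versus strongest graph-only hit.
weakest_vector_only = combined(0.5, 0.0)
strongest_graph_only = combined(0.0, 0.8)

print(f"weakest vector-only:  {weakest_vector_only:.2f}")
print(f"strongest graph-only: {strongest_graph_only:.2f}")
print("vector-only always wins:", weakest_vector_only > strongest_graph_only)
```

Even the weakest vector-only result in the typical range outranks the strongest graph-only result, so graph findings could not influence the top of the list.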
Example before fix:
Query: "Archieven in Amsterdam"
1. Stadsarchief Amsterdam | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief | V:0.675 G:0.000 C:0.472
3. The Black Archives | V:0.636 G:0.000 C:0.445
The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.
The Solution: Graph Score Inheritance
Vector results now inherit graph scores from related institutions found via graph expansion.
How It Works
1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores
2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
           └── These get graph_score=0.8 (same_city)
           └── They track: related_institutions=[Inst_A.ghcid]
3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4
4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = higher rank!
Inheritance Factor
INHERITANCE_FACTOR = 0.5 # Inherit 50% of related institutions' graph scores
This means:
- Same-city institutions (graph_score=0.8) → inherited score of 0.40
- Same-type institutions (graph_score=0.5) → inherited score of 0.25
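The arithmetic behind these values can be sketched as a small helper. The function name `inherited_score` is hypothetical; the averaging and the factor mirror the description above. The mixed case (one same-city and one same-type neighbour) only illustrates how the mean behaves and is not a documented result.

```python
INHERITANCE_FACTOR = 0.5


def inherited_score(related_scores: list[float], factor: float = INHERITANCE_FACTOR) -> float:
    """Mean graph score of the related institutions, scaled by the factor."""
    if not related_scores:
        return 0.0  # nothing to inherit from
    return sum(related_scores) / len(related_scores) * factor


same_city = inherited_score([0.8, 0.8])  # two same-city neighbours
same_type = inherited_score([0.5])       # one same-type neighbour
mixed = inherited_score([0.8, 0.5])      # illustrative mixed case

print(f"same city: {same_city:.3f}")
print(f"same type: {same_type:.3f}")
print(f"mixed:     {mixed:.3f}")
```

Because the score is a mean, adding more neighbours of the same kind does not raise the inherited score; only the kind of relationship does.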
Implementation Details
File Location
/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py
Key Method: _combine_and_rank()
Located at lines ~1539-1671, this method:
- Creates lookup by GHCID for merging
- Handles direct merges when graph result GHCID matches vector result
- Builds an inheritance map that tracks which vector results each graph result was expanded from
- Applies inheritance by calculating inherited graph scores for vector results
- Computes combined scores with the formula:
0.7 * vector + 0.3 * graph
Code Structure
def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""
    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()

    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)

    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}
    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )

    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)

    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )

    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
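To try the logic in isolation, the method can be exercised with a stand-in data class. In this sketch, `Inst` is a hypothetical, stripped-down substitute for `RetrievedInstitution` (only the fields the method touches), `combine_and_rank` is a free-function condensation of the method above, and the GHCIDs and vector scores are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    """Hypothetical stand-in for RetrievedInstitution."""
    ghcid: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)


def combine_and_rank(vector_results, graph_results, k,
                     vector_weight=0.7, graph_weight=0.3, factor=0.5):
    by_ghcid = {inst.ghcid: inst for inst in vector_results}
    # seed GHCID -> graph scores of institutions expanded from that seed
    inherited: dict[str, list[float]] = {g: [] for g in by_ghcid}

    for inst in graph_results:
        if inst.ghcid in by_ghcid:
            # Direct merge: the same institution was found by both retrievers.
            existing = by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            by_ghcid[inst.ghcid] = inst
            for seed in inst.related_institutions:
                if seed in inherited:
                    inherited[seed].append(inst.graph_score)

    # Each seed inherits the scaled mean of its neighbours' graph scores.
    for seed, scores in inherited.items():
        if scores:
            inst = by_ghcid[seed]
            inst.graph_score = max(inst.graph_score, sum(scores) / len(scores) * factor)

    for inst in by_ghcid.values():
        inst.combined_score = (
            vector_weight * inst.vector_score + graph_weight * inst.graph_score
        )
    return sorted(by_ghcid.values(), key=lambda i: i.combined_score, reverse=True)[:k]


# A scores slightly below B on vector similarity but has two same-city neighbours.
vector_hits = [Inst("A", vector_score=0.60), Inst("B", vector_score=0.62)]
graph_hits = [
    Inst("X", graph_score=0.8, related_institutions=["A"]),
    Inst("Y", graph_score=0.8, related_institutions=["A"]),
]

ranked = combine_and_rank(vector_hits, graph_hits, k=4)
for inst in ranked:
    print(f"{inst.ghcid} V:{inst.vector_score:.3f} G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

In this made-up case A overtakes B despite its lower vector score, because only A was used as a seed that produced graph neighbours.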
Graph Expansion Scores
The _expand_via_graph() method assigns these base scores:
| Expansion Type | Graph Score | SPARQL Pattern |
|---|---|---|
| Same city | 0.8 | ?s schema:location ?loc . ?loc hc:cityCode ?cityCode |
| Same institution type | 0.5 | ?s hc:institutionType ?type |
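One way to keep these base scores in a single place is a lookup keyed by expansion reason. This is an illustrative sketch: the keys `same_city` and `same_type` follow the labels used in this document, while the names `EXPANSION_SCORES` and `base_graph_score` and the zero fallback are assumptions, not taken from `_expand_via_graph()`.

```python
EXPANSION_SCORES: dict[str, float] = {
    "same_city": 0.8,
    "same_type": 0.5,
}


def base_graph_score(expansion_reason: str) -> float:
    """Return the base score for a reason; unknown reasons score 0.0 (assumed)."""
    return EXPANSION_SCORES.get(expansion_reason, 0.0)


print(base_graph_score("same_city"))
print(base_graph_score("same_type"))
print(base_graph_score("unknown"))
```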
Results
Before (Graph Score = 0.0)
Query: "Welke musea zijn er in Utrecht?"
1. Centraal Museum | V:0.589 G:0.000 C:0.412
2. Museum Speelklok | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht | V:0.641 G:0.000 C:0.449
After (Graph Score Inherited)
Query: "Welke musea zijn er in Utrecht?"
1. Universiteitsmuseum Utrecht | V:0.641 G:0.400 C:0.569
2. Museum Speelklok | V:0.591 G:0.400 C:0.534
3. Centraal Museum | V:0.589 G:0.400 C:0.532
Key improvements:
- Graph scores are now 0.400 (inherited from same-city museums)
- Combined scores increased by roughly 27-29% (for example, 0.412 → 0.532 for Centraal Museum)
- Ranking now considers geographic relevance
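The combined scores and relative gains can be recomputed from the vector scores listed above. This is a quick check, assuming the 0.7/0.3 weights and an inherited graph score of 0.4 for every result.

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3
INHERITED_GRAPH_SCORE = 0.4

# (name, vector score) taken from the Utrecht query above.
results = [
    ("Universiteitsmuseum Utrecht", 0.641),
    ("Museum Speelklok", 0.591),
    ("Centraal Museum", 0.589),
]

gains = []
for name, vector in results:
    before = VECTOR_WEIGHT * vector
    after = before + GRAPH_WEIGHT * INHERITED_GRAPH_SCORE
    gain = (after - before) / before * 100  # relative increase in percent
    gains.append(gain)
    print(f"{name:28} {before:.3f} -> {after:.3f} (+{gain:.1f}%)")
```

The absolute boost is the same for all three results (0.3 * 0.4 = 0.12), so the relative gain is largest for the result with the lowest vector score.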
More Examples
Query: "Bibliotheken in Den Haag"
1. Centrale Bibliotheek | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek | V:0.676 G:0.400 C:0.593
3. Huis van het Boek | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436 <- No inheritance (different city)
Configuration
Weights (in HybridRetriever.__init__)
self.vector_weight = 0.7 # Semantic similarity importance
self.graph_weight = 0.3 # Knowledge graph importance
Inheritance Factor
INHERITANCE_FACTOR = 0.5 # In _combine_and_rank()
Tuning considerations:
- Higher factor (0.6-0.8): Stronger influence from graph relationships
- Lower factor (0.3-0.4): More conservative, vector similarity dominates
- Current value (0.5): Balanced approach
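To get a feel for the tuning range, the sweep below recomputes one result from the Utrecht example for several factors. It assumes a single kind of neighbour (same city, graph score 0.8) and the 0.7/0.3 weights; the variable names are illustrative.

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3
SAME_CITY_SCORE = 0.8
vector_score = 0.589  # Centraal Museum, from the Utrecht example

rows = []
for factor in (0.3, 0.4, 0.5, 0.6, 0.8):
    graph = SAME_CITY_SCORE * factor
    combined = VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * graph
    rows.append((factor, graph, combined))
    print(f"factor={factor:.1f}  graph={graph:.2f}  combined={combined:.3f}")
```

For a same-city neighbour, each 0.1 step in the factor moves the combined score by 0.3 * 0.8 * 0.1 = 0.024, which is comparable to the vector-score gaps between adjacent results in the examples above.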
Logging
The implementation includes detailed logging for debugging:
# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")
# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
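For local experiments, the same two log lines can be reproduced with the standard `logging` module. This is a sketch under assumptions: it supposes the level is read from the `LOG_LEVEL` environment variable, the logger name `hybrid_retriever_demo` is hypothetical, and the logged values are placeholders.

```python
import logging
import os

# Assumption: the log level comes from LOG_LEVEL, defaulting to INFO.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("hybrid_retriever_demo")  # hypothetical logger name

# Placeholder values standing in for what _combine_and_rank() would compute.
ghcid, old, new, n = "NL-UT-UTR-M-CM", 0.0, 0.4, 2

logger.info("Graph inheritance applied to %d vector results: %s", 1, [ghcid])
# Emitted only when the effective level is DEBUG.
logger.debug("Inheritance: %s graph_score: %.3f -> %.3f (from %d related)", ghcid, old, new, n)
```

Running it with `LOG_LEVEL=DEBUG` set should show both lines; without it, only the INFO line appears.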
Check logs on production:
ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"
API Response Structure
The graph score is exposed in the API response:
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,  // <-- Now populated via inheritance
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
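A client can read these fields with the standard `json` module. The sketch below parses a payload of the same shape (without the inline annotation, since JSON has no comments) and lists results that carry a graph score. Treating a positive graph score together with a non-empty `related_institutions` list as a sign of inheritance is an assumption made for this example.

```python
import json

# Same shape as the response above, minus the inline comment.
payload = """
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
"""

response = json.loads(payload)

# Keep results that have a graph score and at least one related institution.
boosted = [
    r for r in response["retrieved_results"]
    if r["scores"]["graph"] > 0 and r["related_institutions"]
]

for r in boosted:
    related = len(r["related_institutions"])
    print(f"{r['name']}: graph={r['scores']['graph']:.3f} via {related} related")
```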
Deployment
File to deploy:
scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/
Restart service:
ssh root@91.98.224.44 "systemctl restart glam-rag-api"
Verify:
curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
-H 'Content-Type: application/json' \
-d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
Related Files
| File | Purpose |
|---|---|
| hybrid_retriever.py | Main implementation with _combine_and_rank() |
| dspy_heritage_rag.py | RAG pipeline that calls retriever.search() |
| main.py | FastAPI endpoints serving the RAG API |
Future Improvements
- Dynamic inheritance factor: Adjust based on query type (geographic vs. thematic)
- Multi-hop expansion: Inherit from institutions 2+ hops away
- Weighted inheritance: Weight by relationship type (same_city=0.8, same_type=0.5)
- Negative inheritance: Penalize results unrelated to graph findings
Last Updated: 2025-12-24
Implemented: 2025-12-23
Status: Production (archief.support)