# Graph Score Inheritance in Hybrid Retrieval

## Overview

The Heritage RAG system uses a **hybrid retrieval** approach that combines:

1. **Vector search** (semantic similarity via embeddings)
2. **Knowledge graph expansion** (SPARQL-based relationship discovery)

This document explains the **graph score inheritance** feature, which ensures that vector search results benefit from knowledge graph relationships.

## The Problem

Before graph score inheritance, the hybrid retrieval had a scoring gap:

| Result Source | Vector Score | Graph Score | Combined Score |
|---------------|--------------|-------------|----------------|
| Vector search results | 0.5-0.8 | **0.0** | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |

**Why this happened:**

- Vector search finds institutions semantically similar to the query
- Graph expansion finds **different** institutions (same city/type) with different GHCIDs
- Since the GHCIDs don't match, no direct merging occurs
- Vector results always dominate because `combined = 0.7 * vector + 0.3 * graph`, and each result carries only one of the two scores

**Example before fix:**

```
Query: "Archieven in Amsterdam"

1. Stadsarchief Amsterdam | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief | V:0.675 G:0.000 C:0.472
3. The Black Archives     | V:0.636 G:0.000 C:0.445
```

The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.

## The Solution: Graph Score Inheritance

Vector results now **inherit** graph scores from related institutions found via graph expansion.

### How It Works

```
1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores

2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
           └── These get graph_score=0.8 (same_city)
           └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = higher rank!
```

### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # Inherit 50% of related institutions' graph scores
```

This means:

- Same-city institutions (graph_score=0.8) → inherited score of **0.40**
- Same-type institutions (graph_score=0.5) → inherited score of **0.25**

## Implementation Details

### File Location

```
/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py
```

### Key Method: `_combine_and_rank()`

Located at lines ~1539-1671, this method:

1. **Creates a lookup by GHCID** for merging
2. **Handles direct merges** when a graph result's GHCID matches a vector result
3. **Builds an inheritance map** tracking which vector results each graph result was expanded from
4. **Applies inheritance** by calculating inherited scores for vector results
5. **Computes combined scores** with the formula `0.7 * vector + 0.3 * graph`

### Code Structure

```python
def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""

    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()

    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)

    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}

    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )

    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)

    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )

    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
```

### Graph Expansion Scores

The `_expand_via_graph()` method assigns these base scores:

| Expansion Type | Graph Score | SPARQL Pattern |
|----------------|-------------|----------------|
| Same city | 0.8 | `?s schema:location ?loc . ?loc hc:cityCode ?cityCode` |
| Same institution type | 0.5 | `?s hc:institutionType ?type` |

## Results

### Before (Graph Score = 0.0)

```
Query: "Welke musea zijn er in Utrecht?"

1. Centraal Museum             | V:0.589 G:0.000 C:0.412
2. Museum Speelklok            | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht | V:0.641 G:0.000 C:0.449
```

### After (Graph Score Inherited)

```
Query: "Welke musea zijn er in Utrecht?"

1. Universiteitsmuseum Utrecht | V:0.641 G:0.400 C:0.569
2. Museum Speelklok            | V:0.591 G:0.400 C:0.534
3. Centraal Museum             | V:0.589 G:0.400 C:0.532
```

**Key improvements:**

- Graph scores are now **0.400** (inherited from same-city museums)
- Combined scores **increased by 0.12**, which is `0.3 * 0.400` (for example 0.412 → 0.532, roughly 29%)
- Ranking now considers **geographic relevance**

### More Examples

```
Query: "Bibliotheken in Den Haag"

1. Centrale Bibliotheek         | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek      | V:0.676 G:0.400 C:0.593
3. Huis van het Boek            | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard   | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436  <- No inheritance (different city)
```

## Configuration

### Weights (in `HybridRetriever.__init__`)

```python
self.vector_weight = 0.7  # Semantic similarity importance
self.graph_weight = 0.3   # Knowledge graph importance
```

### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # In _combine_and_rank()
```

**Tuning considerations:**

- Higher factor (0.6-0.8): Stronger influence from graph relationships
- Lower factor (0.3-0.4): More conservative, vector similarity dominates
- Current value (0.5): Balanced approach

## Logging

The implementation includes detailed logging for debugging:

```python
# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")

# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
```

**Check logs on production:**

```bash
ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"
```

## API Response Structure

The graph score is exposed in the API response. The `scores.graph` field is now populated via inheritance:

```json
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
```

## Deployment

**File to deploy:**

```bash
scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
    root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/
```

**Restart service:**

```bash
ssh root@91.98.224.44 "systemctl restart glam-rag-api"
```

**Verify:**

```bash
curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
  -H 'Content-Type: application/json' \
  -d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
  python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
```

## Related Files

| File | Purpose |
|------|---------|
| `hybrid_retriever.py` | Main implementation with `_combine_and_rank()` |
| `dspy_heritage_rag.py` | RAG pipeline that calls `retriever.search()` |
| `main.py` | FastAPI endpoints serving the RAG API |

## Future Improvements

1. **Dynamic inheritance factor**: Adjust based on query type (geographic vs. thematic)
2. **Multi-hop expansion**: Inherit from institutions 2+ hops away
3. **Weighted inheritance**: Weight by relationship type (same_city=0.8, same_type=0.5)
4. **Negative inheritance**: Penalize results unrelated to graph findings

---

**Last Updated:** 2025-12-24
**Implemented:** 2025-12-23
**Status:** Production (archief.support)
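
## Appendix: Standalone Inheritance Sketch

The inheritance arithmetic described in this document can be tried in isolation with the sketch below. It is not the production code: `Inst` is a hypothetical, simplified stand-in for `RetrievedInstitution`, the module-level constants mirror the documented weights (0.7 / 0.3) and inheritance factor (0.5), and the GHCIDs (`SEED-1`, `NEIGHBOR-CITY`, and so on) are made-up placeholders. It shows one seed with a single same-city neighbor and another seed with mixed same-city and same-type neighbors, to illustrate how the average is taken before the factor is applied.

```python
from dataclasses import dataclass, field

# Values taken from this document; the names themselves are illustrative.
VECTOR_WEIGHT = 0.7
GRAPH_WEIGHT = 0.3
INHERITANCE_FACTOR = 0.5


@dataclass
class Inst:
    """Hypothetical, simplified stand-in for RetrievedInstitution."""
    ghcid: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)
    combined_score: float = 0.0


def combine_and_rank(vector_results, graph_results, k):
    """Merge both result lists, apply graph score inheritance, return top k."""
    results = {inst.ghcid: inst for inst in vector_results}
    # seed GHCID -> graph scores of the institutions expanded from that seed
    inherited = {ghcid: [] for ghcid in results}

    for inst in graph_results:
        if inst.ghcid in results:
            # Direct merge: the same institution was found by both retrievers.
            existing = results[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New institution from the graph: record it for its seeds.
            results[inst.ghcid] = inst
            for seed in inst.related_institutions:
                if seed in inherited:
                    inherited[seed].append(inst.graph_score)

    # Each seed inherits the average neighbor score, scaled by the factor.
    for seed, scores in inherited.items():
        if scores:
            boost = sum(scores) / len(scores) * INHERITANCE_FACTOR
            results[seed].graph_score = max(results[seed].graph_score, boost)

    for inst in results.values():
        inst.combined_score = (
            VECTOR_WEIGHT * inst.vector_score + GRAPH_WEIGHT * inst.graph_score
        )

    return sorted(results.values(), key=lambda x: x.combined_score, reverse=True)[:k]


vector_results = [
    Inst("SEED-1", vector_score=0.589),
    Inst("SEED-2", vector_score=0.641),
]
graph_results = [
    # Same-city neighbor (0.8) expanded from both seeds.
    Inst("NEIGHBOR-CITY", graph_score=0.8, related_institutions=["SEED-1", "SEED-2"]),
    # Same-type neighbor (0.5) expanded from SEED-1 only.
    Inst("NEIGHBOR-TYPE", graph_score=0.5, related_institutions=["SEED-1"]),
]

ranked = combine_and_rank(vector_results, graph_results, k=3)
for inst in ranked:
    print(f"{inst.ghcid:14} V:{inst.vector_score:.3f} "
          f"G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

In this run `SEED-2` inherits `0.8 * 0.5 = 0.4`, while `SEED-1` inherits `avg([0.8, 0.5]) * 0.5 = 0.325`. Because scores are averaged, a weaker same-type neighbor pulls the inherited score below the 0.40 that a same-city neighbor alone would give, which may be worth keeping in mind when evaluating the "weighted inheritance" idea under Future Improvements.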