# Graph Score Inheritance in Hybrid Retrieval

## Overview

The Heritage RAG system uses a **hybrid retrieval** approach that combines:
1. **Vector search** (semantic similarity via embeddings)
2. **Knowledge graph expansion** (SPARQL-based relationship discovery)

This document explains the **graph score inheritance** feature that ensures vector search results benefit from knowledge graph relationships.

## The Problem

Before graph score inheritance, the hybrid retrieval had a scoring gap:

| Result Source | Vector Score | Graph Score | Combined Score |
|---------------|--------------|-------------|----------------|
| Vector search results | 0.5-0.8 | **0.0** | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |

**Why this happened:**
- Vector search finds institutions semantically similar to the query
- Graph expansion finds **different** institutions (same city/type) with different GHCIDs
- Since GHCIDs don't match, no direct merging occurs
- Vector results always dominate because `combined = 0.7 * vector + 0.3 * graph`

**Example before fix:**
```
Query: "Archieven in Amsterdam"

1. Stadsarchief Amsterdam  | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief  | V:0.675 G:0.000 C:0.472
3. The Black Archives      | V:0.636 G:0.000 C:0.445
```

The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.
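
The gap follows directly from the weighting. As a minimal sketch, assuming the 0.7 and 0.3 weights from the formula above and a hypothetical helper `combined_score` (an illustration, not a function from the codebase), the combined-score ranges in the table can be reproduced like this:

```python
VECTOR_WEIGHT = 0.7  # weights taken from the formula above
GRAPH_WEIGHT = 0.3


def combined_score(vector: float, graph: float) -> float:
    """Weighted sum of the two scores (hypothetical helper for illustration)."""
    return VECTOR_WEIGHT * vector + GRAPH_WEIGHT * graph


# A vector-only hit never receives a graph contribution...
vector_only = [round(combined_score(v, 0.0), 3) for v in (0.5, 0.8)]
# ...and a graph-only hit is capped by the smaller graph weight.
graph_only = [round(combined_score(0.0, g), 3) for g in (0.5, 0.8)]

print(vector_only)  # [0.35, 0.56]
print(graph_only)   # [0.15, 0.24]
```

Even the best graph-only result (0.24) stays below the weakest vector-only result (0.35), so graph findings could not influence the top of the ranking.
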
## The Solution: Graph Score Inheritance

Vector results now **inherit** graph scores from related institutions found via graph expansion.

### How It Works

```
1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores

2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
           └── These get graph_score=0.8 (same_city)
           └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = higher rank!
```
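
Step 3 is the core of the feature. The following is a minimal sketch of that calculation on its own, assuming a hypothetical helper name `inherit_graph_score`; the averaging, the 0.5 factor, and the `max` come from the diagram above:

```python
from statistics import fmean

INHERITANCE_FACTOR = 0.5


def inherit_graph_score(current: float, related_scores: list[float],
                        factor: float = INHERITANCE_FACTOR) -> float:
    """Return the graph score after inheritance (illustrative helper)."""
    if not related_scores:
        return current  # nothing was expanded from this result
    inherited = fmean(related_scores) * factor
    return max(current, inherited)  # inheritance never lowers an existing score


print(inherit_graph_score(0.0, [0.8, 0.8]))  # 0.4
print(inherit_graph_score(0.5, [0.8]))       # 0.5 (existing score is already higher)
```

Taking the maximum means a result that already has a direct graph score keeps it unless the inherited value is larger.
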
### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # Inherit 50% of related institutions' graph scores
```

This means:
- Same-city institutions (graph_score=0.8) → inherited score of **0.40**
- Same-type institutions (graph_score=0.5) → inherited score of **0.25**
- A mix of both is averaged before the factor is applied, e.g. one same-city and one same-type relation → (0.8 + 0.5) / 2 × 0.5 = **0.325**

## Implementation Details

### File Location

```
/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py
```

### Key Method: `_combine_and_rank()`

Located at lines ~1539-1671, this method:

1. **Creates lookup by GHCID** for merging
2. **Handles direct merges** when graph result GHCID matches vector result
3. **Builds inheritance map** tracking which vector results each graph result was expanded from
4. **Applies inheritance** calculating inherited scores for vector results
5. **Computes combined scores** with the formula: `0.7 * vector + 0.3 * graph`

### Code Structure

```python
def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""

    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()

    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)

    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}

    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )

    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)

    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )

    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
```
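
To experiment with this logic outside the service, the steps can be restated as a standalone function. This is only a sketch under stated assumptions: the `RetrievedInstitution` dataclass below is a hypothetical minimal stand-in with just the fields the ranking touches, the free function `combine_and_rank` is not the production method, and the short IDs and graph-only results in the demo are made up. The vector scores are taken from the Utrecht example in the Results section.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedInstitution:
    """Hypothetical minimal stand-in: only the fields the ranking logic touches."""
    ghcid: str
    name: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)


def combine_and_rank(vector_results, graph_results, k,
                     vector_weight=0.7, graph_weight=0.3, inheritance_factor=0.5):
    """Standalone restatement of the steps above (not the production method)."""
    by_ghcid = {inst.ghcid: inst for inst in vector_results}
    # seed GHCID -> graph scores of the results that were expanded from it
    inherited: dict[str, list[float]] = {ghcid: [] for ghcid in by_ghcid}

    for inst in graph_results:
        if inst.ghcid in by_ghcid:
            existing = by_ghcid[inst.ghcid]  # direct merge
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            by_ghcid[inst.ghcid] = inst
            for seed in inst.related_institutions:
                if seed in inherited:
                    inherited[seed].append(inst.graph_score)

    for seed, scores in inherited.items():
        if scores:
            boost = sum(scores) / len(scores) * inheritance_factor
            by_ghcid[seed].graph_score = max(by_ghcid[seed].graph_score, boost)

    for inst in by_ghcid.values():
        inst.combined_score = (vector_weight * inst.vector_score
                               + graph_weight * inst.graph_score)

    return sorted(by_ghcid.values(), key=lambda i: i.combined_score, reverse=True)[:k]


# Demo data: IDs and the two graph-only results are invented for illustration.
vector_hits = [
    RetrievedInstitution("CM", "Centraal Museum", vector_score=0.589),
    RetrievedInstitution("MS", "Museum Speelklok", vector_score=0.591),
    RetrievedInstitution("UM", "Universiteitsmuseum Utrecht", vector_score=0.641),
]
graph_hits = [
    RetrievedInstitution("X1", "Same-city result 1", graph_score=0.8,
                         related_institutions=["CM", "MS", "UM"]),
    RetrievedInstitution("X2", "Same-city result 2", graph_score=0.8,
                         related_institutions=["CM", "MS", "UM"]),
]

ranked = combine_and_rank(vector_hits, graph_hits, k=3)
for inst in ranked:
    print(f"{inst.name:<28} G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

With this input each vector hit inherits a graph score of 0.400, and the combined scores come out as 0.569, 0.534 and 0.532. The two graph-only results score 0.24 and fall outside the top 3.
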
### Graph Expansion Scores

The `_expand_via_graph()` method assigns these base scores:

| Expansion Type | Graph Score | SPARQL Pattern |
|----------------|-------------|----------------|
| Same city | 0.8 | `?s schema:location ?loc . ?loc hc:cityCode ?cityCode` |
| Same institution type | 0.5 | `?s hc:institutionType ?type` |

## Results

### Before (Graph Score = 0.0)

```
Query: "Welke musea zijn er in Utrecht?"

1. Centraal Museum              | V:0.589 G:0.000 C:0.412
2. Museum Speelklok             | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht  | V:0.641 G:0.000 C:0.449
```

### After (Graph Score Inherited)

```
Query: "Welke musea zijn er in Utrecht?"

1. Universiteitsmuseum Utrecht  | V:0.641 G:0.400 C:0.569
2. Museum Speelklok             | V:0.591 G:0.400 C:0.534
3. Centraal Museum              | V:0.589 G:0.400 C:0.532
```

**Key improvements:**
- Graph scores now **0.400** (inherited from same-city museums)
- Combined scores **increased by 0.12** each (0.3 × 0.400), a relative gain of roughly 27-29% (e.g. 0.412 → 0.532)
- Ranking now considers **geographic relevance**

### More Examples

```
Query: "Bibliotheken in Den Haag"

1. Centrale Bibliotheek          | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek       | V:0.676 G:0.400 C:0.593
3. Huis van het Boek             | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard    | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other)  | V:0.623 G:0.000 C:0.436  <- No inheritance (different city)
```
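
Rows 4 and 5 show inheritance changing the order: row 5 has the higher vector score but ranks lower because it inherited nothing. A quick check with the documented 0.7/0.3 weights (the variable names here are illustrative, not taken from the codebase):

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3

# (name, vector_score, graph_score) copied from rows 4 and 5 above
rows = [
    ("Bibliotheek Hoeksche Waard", 0.613, 0.400),
    ("Centrale Bibliotheek (other)", 0.623, 0.000),
]

by_vector = sorted(rows, key=lambda r: r[1], reverse=True)
by_combined = sorted(
    rows,
    key=lambda r: VECTOR_WEIGHT * r[1] + GRAPH_WEIGHT * r[2],
    reverse=True,
)

print([name for name, *_ in by_vector])    # vector score alone favours row 5
print([name for name, *_ in by_combined])  # combined score favours row 4
```

With these weights, an inherited graph score of 0.4 adds 0.12 to the combined score, which offsets a vector-score deficit of up to about 0.17 (0.12 / 0.7).
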
## Configuration

### Weights (in `HybridRetriever.__init__`)

```python
self.vector_weight = 0.7  # Semantic similarity importance
self.graph_weight = 0.3   # Knowledge graph importance
```

### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # In _combine_and_rank()
```
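
A rough way to gauge how much this factor moves a score is to recompute one result under several values. The sketch below assumes a vector hit whose only related results are same-city (base graph score 0.8) and uses the vector score 0.589 from the Utrecht example; `combined_after_inheritance` is a hypothetical helper, not part of the codebase.

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3
SAME_CITY_SCORE = 0.8  # base graph score for same-city expansion


def combined_after_inheritance(vector_score: float, factor: float) -> float:
    """Combined score for a vector hit that inherits only from same-city results."""
    inherited = SAME_CITY_SCORE * factor
    return VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * inherited


for factor in (0.3, 0.5, 0.8):
    score = combined_after_inheritance(0.589, factor)
    print(f"factor={factor:.1f} inherited={SAME_CITY_SCORE * factor:.2f} combined={score:.3f}")
```
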
**Tuning considerations:**
- Higher factor (0.6-0.8): Stronger influence from graph relationships
- Lower factor (0.3-0.4): More conservative, vector similarity dominates
- Current value (0.5): Balanced approach

## Logging

The implementation includes detailed logging for debugging:

```python
# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")

# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
```

**Check logs on production:**
```bash
ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"
```

## API Response Structure

The graph score is exposed in the API response. The `scores.graph` field is now populated via inheritance:

```json
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
```
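
A client can read the three scores straight from this structure. The sketch below parses a sample payload shaped like the response above (values copied from it) and prints one line per result in the `V:/G:/C:` notation used in this document; `summarize` is a hypothetical helper, and any fields beyond those shown above are not assumed.

```python
import json

# Sample payload shaped like the response above (values copied from it)
payload = json.loads("""
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
""")


def summarize(results: list[dict]) -> list[str]:
    """Format each result as 'name V:... G:... C:...' (illustrative helper)."""
    lines = []
    for r in results:
        s = r["scores"]
        lines.append(
            f"{r['name']} V:{s['vector']:.3f} G:{s['graph']:.3f} C:{s['combined']:.3f}"
        )
    return lines


for line in summarize(payload["retrieved_results"]):
    print(line)  # Centraal Museum V:0.589 G:0.400 C:0.532
```
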
## Deployment

**File to deploy:**
```bash
scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
    root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/
```

**Restart service:**
```bash
ssh root@91.98.224.44 "systemctl restart glam-rag-api"
```

**Verify:**
```bash
curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
  -H 'Content-Type: application/json' \
  -d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
  python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
```

## Related Files

| File | Purpose |
|------|---------|
| `hybrid_retriever.py` | Main implementation with `_combine_and_rank()` |
| `dspy_heritage_rag.py` | RAG pipeline that calls `retriever.search()` |
| `main.py` | FastAPI endpoints serving the RAG API |

## Future Improvements

1. **Dynamic inheritance factor**: Adjust based on query type (geographic vs. thematic)
2. **Multi-hop expansion**: Inherit from institutions 2+ hops away
3. **Weighted inheritance**: Weight by relationship type (same_city=0.8, same_type=0.5)
4. **Negative inheritance**: Penalize results unrelated to graph findings

---

**Last Updated:** 2025-12-24
**Implemented:** 2025-12-23
**Status:** Production (archief.support)