glam/docs/GRAPH_SCORE_INHERITANCE.md

# Graph Score Inheritance in Hybrid Retrieval

## Overview

The Heritage RAG system uses a **hybrid retrieval** approach that combines:
1. **Vector search** (semantic similarity via embeddings)
2. **Knowledge graph expansion** (SPARQL-based relationship discovery)

This document explains the **graph score inheritance** feature that ensures vector search results benefit from knowledge graph relationships.

## The Problem

Before graph score inheritance, hybrid retrieval had a scoring gap: vector results carried no graph score, and graph results carried no vector score.

| Result Source | Vector Score | Graph Score | Combined Score |
|---------------|--------------|-------------|----------------|
| Vector search results | 0.5-0.8 | **0.0** | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |

**Why this happened:**
- Vector search finds institutions semantically similar to the query
- Graph expansion finds **different** institutions (same city/type) with different GHCIDs
- Since GHCIDs don't match, no direct merging occurs
- Vector results always dominate because `combined = 0.7 * vector + 0.3 * graph`: a graph-only result tops out at 0.24, below the 0.35 floor of a vector-only result

**Example before fix:**
```
Query: "Archieven in Amsterdam"

1. Stadsarchief Amsterdam      | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief      | V:0.675 G:0.000 C:0.472
3. The Black Archives          | V:0.636 G:0.000 C:0.445
```

The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.
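
The gap follows directly from the weighting formula. The snippet below is a minimal sketch rather than the retriever's own code: `combined_score` is a hypothetical helper, and the weights are the 0.7/0.3 values quoted above.

```python
VECTOR_WEIGHT = 0.7  # weights quoted in this document
GRAPH_WEIGHT = 0.3


def combined_score(vector_score: float, graph_score: float) -> float:
    """Weighted sum used for ranking (hypothetical helper for illustration)."""
    return VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * graph_score


# Vector scores from the "Archieven in Amsterdam" example.
# Before the fix, every vector result had graph_score = 0.0.
before_fix = {
    "Stadsarchief Amsterdam": 0.659,
    "Noord-Hollands Archief": 0.675,
    "The Black Archives": 0.636,
}
for name, vector_score in before_fix.items():
    print(f"{name:25} C:{combined_score(vector_score, 0.0):.3f}")

# Best case for a graph-only result versus worst case for a vector-only result,
# using the 0.5-0.8 score ranges from the table above.
best_graph_only = combined_score(0.0, 0.8)
worst_vector_only = combined_score(0.5, 0.0)
print(f"graph-only max {best_graph_only:.2f} < vector-only min {worst_vector_only:.2f}")
```

Under these ranges, even the strongest graph-only result cannot outrank the weakest vector-only one.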

## The Solution: Graph Score Inheritance

Vector results now **inherit** graph scores from related institutions found via graph expansion.

### How It Works

```
1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores

2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
       └── These get graph_score=0.8 (same_city)
       └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector_score + 0.3 * 0.4   (+0.12 over the pre-inheritance score)
```

### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # Inherit 50% of related institutions' graph scores
```

This means:
- Same-city institutions (graph_score=0.8) → inherited score of **0.40**
- Same-type institutions (graph_score=0.5) → inherited score of **0.25**
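
The arithmetic can be checked with a few lines of Python. This is a sketch under the assumption that inheritance averages all related scores before scaling, as the `_combine_and_rank()` excerpt in this document does; `inherited_graph_score` is a hypothetical name, not a function from the codebase.

```python
from statistics import mean

INHERITANCE_FACTOR = 0.5  # value quoted above


def inherited_graph_score(related_scores: list[float],
                          factor: float = INHERITANCE_FACTOR) -> float:
    """Average the related institutions' graph scores, then scale by the factor."""
    if not related_scores:
        return 0.0  # nothing to inherit from
    return mean(related_scores) * factor


print(inherited_graph_score([0.8, 0.8]))  # two same-city neighbours
print(inherited_graph_score([0.5]))       # one same-type neighbour
print(inherited_graph_score([0.8, 0.5]))  # mixed neighbours are averaged first
```

Because the scores are averaged, a mix of same-city and same-type neighbours lands between 0.25 and 0.40 rather than at either extreme.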

## Implementation Details

### File Location

```
/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py
```

### Key Method: `_combine_and_rank()`

Located at lines ~1539-1671, this method:

1. **Creates a lookup by GHCID** for merging
2. **Handles direct merges** when a graph result's GHCID matches a vector result
3. **Builds an inheritance map** recording, for each vector result, the graph results that were expanded from it
4. **Applies inheritance** by computing an inherited score for each vector result
5. **Computes combined scores** with the formula `0.7 * vector + 0.3 * graph`

### Code Structure

```python
def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""

    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()

    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)

    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}

    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )

    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)

    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )

    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
```
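
To experiment with the logic outside the service, the method can be approximated as a standalone function. The sketch below makes several assumptions: `RetrievedInstitution` is a reduced stand-in for the real class, the two graph neighbours are invented placeholders, and the pairing of the GHCIDs `NL-UT-UTR-M-MS` and `NL-UT-UTR-M-UMUU` with museum names is a guess. Only the vector scores and the weights come from this document.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedInstitution:
    """Hypothetical stand-in for the real class, reduced to the fields used here."""
    ghcid: str
    name: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)


def combine_and_rank(vector_results, graph_results, k,
                     vector_weight=0.7, graph_weight=0.3, inheritance_factor=0.5):
    """Standalone rewrite of the steps above, for experimentation only."""
    by_ghcid = {inst.ghcid: inst for inst in vector_results}
    # Seed GHCID -> graph scores of the institutions expanded from that seed.
    inherited_from: dict[str, list[float]] = {ghcid: [] for ghcid in by_ghcid}

    for inst in graph_results:
        if inst.ghcid in by_ghcid:
            # Direct merge: keep the higher graph score.
            existing = by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
            continue
        by_ghcid[inst.ghcid] = inst
        for seed in inst.related_institutions:
            if seed in inherited_from:
                inherited_from[seed].append(inst.graph_score)

    for seed, scores in inherited_from.items():
        if scores:
            inherited = sum(scores) / len(scores) * inheritance_factor
            by_ghcid[seed].graph_score = max(by_ghcid[seed].graph_score, inherited)

    for inst in by_ghcid.values():
        inst.combined_score = (vector_weight * inst.vector_score
                               + graph_weight * inst.graph_score)
    return sorted(by_ghcid.values(), key=lambda i: i.combined_score, reverse=True)[:k]


# Vector scores from the Utrecht example; the GHCID-to-name pairing is partly assumed.
vector_results = [
    RetrievedInstitution("NL-UT-UTR-M-CM", "Centraal Museum", vector_score=0.589),
    RetrievedInstitution("NL-UT-UTR-M-MS", "Museum Speelklok", vector_score=0.591),
    RetrievedInstitution("NL-UT-UTR-M-UMUU", "Universiteitsmuseum Utrecht", vector_score=0.641),
]
seeds = [inst.ghcid for inst in vector_results]
# Two made-up same-city neighbours found only by graph expansion.
graph_results = [
    RetrievedInstitution("EXAMPLE-1", "Graph neighbour 1", graph_score=0.8,
                         related_institutions=seeds),
    RetrievedInstitution("EXAMPLE-2", "Graph neighbour 2", graph_score=0.8,
                         related_institutions=seeds),
]

ranked = combine_and_rank(vector_results, graph_results, k=3)
for inst in ranked:
    print(f"{inst.name:28} V:{inst.vector_score:.3f} "
          f"G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

Passing the weights and the factor as parameters is a choice made for this sketch so that they can be varied in experiments; the method shown above reads them from `self` and a local constant instead.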

### Graph Expansion Scores

The `_expand_via_graph()` method assigns these base scores:

| Expansion Type | Graph Score | SPARQL Pattern |
|----------------|-------------|----------------|
| Same city | 0.8 | `?s schema:location ?loc . ?loc hc:cityCode ?cityCode` |
| Same institution type | 0.5 | `?s hc:institutionType ?type` |
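
How expansion rows might be turned into scored graph results can be sketched without running SPARQL. Everything below is illustrative: the rows stand in for query bindings, the identifiers are placeholders, `score_expansion_rows` is a hypothetical helper, and keeping the strongest relationship when several apply is an assumption that this document does not confirm.

```python
EXPANSION_SCORES = {"same_city": 0.8, "same_type": 0.5}  # base scores from the table above


def score_expansion_rows(rows):
    """Group (seed_ghcid, found_ghcid, reason) rows into scored graph results."""
    results = {}
    for seed_ghcid, found_ghcid, reason in rows:
        entry = results.setdefault(
            found_ghcid,
            {"graph_score": 0.0, "expansion_reason": None, "related_institutions": []},
        )
        score = EXPANSION_SCORES[reason]
        if score > entry["graph_score"]:
            # Assumption: when several relationships apply, keep the strongest.
            entry["graph_score"] = score
            entry["expansion_reason"] = reason
        if seed_ghcid not in entry["related_institutions"]:
            # Remember which vector result this institution was expanded from.
            entry["related_institutions"].append(seed_ghcid)
    return results


# Placeholder identifiers standing in for SPARQL bindings.
rows = [
    ("SEED-A", "FOUND-X", "same_city"),
    ("SEED-A", "FOUND-Y", "same_type"),
    ("SEED-B", "FOUND-X", "same_type"),
]
expanded = score_expansion_rows(rows)
for ghcid, entry in expanded.items():
    print(ghcid, entry)
```

The `related_institutions` list built here plays the role of the seed tracking that `_combine_and_rank()` later uses for inheritance.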

## Results

### Before (Graph Score = 0.0)

```
Query: "Welke musea zijn er in Utrecht?"

1. Centraal Museum              | V:0.589 G:0.000 C:0.412
2. Museum Speelklok             | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht  | V:0.641 G:0.000 C:0.449
```

### After (Graph Score Inherited)

```
Query: "Welke musea zijn er in Utrecht?"

1. Universiteitsmuseum Utrecht  | V:0.641 G:0.400 C:0.569
2. Museum Speelklok             | V:0.591 G:0.400 C:0.534
3. Centraal Museum              | V:0.589 G:0.400 C:0.532
```

**Key improvements:**
- Graph scores now **0.400** (inherited from same-city museums)
- Combined scores **increased by 0.12 each, or roughly 27-29%** (e.g. 0.412 → 0.532)
- Ranking now considers **geographic relevance**

### More Examples

```
Query: "Bibliotheken in Den Haag"

1. Centrale Bibliotheek         | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek      | V:0.676 G:0.400 C:0.593
3. Huis van het Boek            | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard   | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436  <- No inheritance (different city)
```
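
The last two rows show inheritance changing the order, not just the scores. The sketch below re-ranks the five results by vector score alone and by combined score, using only the numbers listed above and the 0.7/0.3 weights; the tuple layout is an illustration, not the API's data model.

```python
# (name, vector_score, graph_score) from the "Bibliotheken in Den Haag" example
results = [
    ("Centrale Bibliotheek", 0.697, 0.4),
    ("Koninklijke Bibliotheek", 0.676, 0.4),
    ("Huis van het Boek", 0.630, 0.4),
    ("Bibliotheek Hoeksche Waard", 0.613, 0.4),
    ("Centrale Bibliotheek (other)", 0.623, 0.0),
]


def combined(row):
    """Weighted sum with the 0.7 / 0.3 weights used throughout this document."""
    _, vector_score, graph_score = row
    return 0.7 * vector_score + 0.3 * graph_score


by_vector = sorted(results, key=lambda row: row[1], reverse=True)
by_combined = sorted(results, key=combined, reverse=True)

print("By vector score only:")
for rank, row in enumerate(by_vector, start=1):
    print(f"  {rank}. {row[0]}")

print("By combined score:")
for rank, row in enumerate(by_combined, start=1):
    print(f"  {rank}. {row[0]}")
```

On vector similarity alone, "Centrale Bibliotheek (other)" would rank fourth; with inheritance it drops to fifth, behind a result with a lower vector score.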

## Configuration

### Weights (in `HybridRetriever.__init__`)

```python
self.vector_weight = 0.7  # Semantic similarity importance
self.graph_weight = 0.3   # Knowledge graph importance
```

### Inheritance Factor

```python
INHERITANCE_FACTOR = 0.5  # In _combine_and_rank()
```

**Tuning considerations:**
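- Higher factor (0.6-0.8): graph relationships have a stronger influence on ranking
- Lower factor (0.3-0.4): more conservative; vector similarity dominates
- Current value (0.5): a balanced middle ground; same-city neighbours yield a graph score of 0.40, which adds 0.12 to the combined score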
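
To gauge the effect of a different factor before changing it, the boost to the combined score can be tabulated. This is a sketch assuming same-city neighbours only (graph score 0.8) and the current graph weight of 0.3; `combined_boost` is a hypothetical helper.

```python
SAME_CITY_SCORE = 0.8  # base graph score for same-city expansion
GRAPH_WEIGHT = 0.3     # current graph weight


def combined_boost(factor: float) -> float:
    """Amount added to the combined score when all neighbours are same-city."""
    return GRAPH_WEIGHT * SAME_CITY_SCORE * factor


for factor in (0.3, 0.5, 0.8):
    graph_score = SAME_CITY_SCORE * factor
    print(f"factor={factor:.1f}  graph={graph_score:.2f}  boost={combined_boost(factor):.3f}")
```

With these inputs the boost comes to 0.072, 0.120, and 0.192 respectively, so a change in the factor shifts combined scores by less than a tenth in either direction.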

## Logging

The implementation includes detailed logging for debugging:

```python
# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")

# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
```

**Check logs on production:**
```bash
ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"
```

## API Response Structure

The graph score is exposed in the API response under `scores.graph`, which is now populated via inheritance:

```json
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
```
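
A client can sanity-check a response by recomputing the combined score from its parts. The sketch below assumes the response shape shown above and the 0.7/0.3 weights; `inconsistent_results` and its tolerance (chosen to absorb three-decimal rounding) are hypothetical.

```python
import json

VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3  # weights from HybridRetriever.__init__

response_text = """
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
"""


def inconsistent_results(payload: dict, tolerance: float = 0.0015) -> list[str]:
    """Return GHCIDs whose combined score does not match the weighted sum."""
    bad = []
    for item in payload["retrieved_results"]:
        scores = item["scores"]
        expected = VECTOR_WEIGHT * scores["vector"] + GRAPH_WEIGHT * scores["graph"]
        if abs(expected - scores["combined"]) > tolerance:
            bad.append(item["ghcid"])
    return bad


payload = json.loads(response_text)
print(inconsistent_results(payload))  # an empty list means every score is consistent
```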

## Deployment

**File to deploy:**
```bash
scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
    root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/
```

**Restart service:**
```bash
ssh root@91.98.224.44 "systemctl restart glam-rag-api"
```

**Verify:**
```bash
curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
  -H 'Content-Type: application/json' \
  -d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
  python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
```

## Related Files

| File | Purpose |
|------|---------|
| `hybrid_retriever.py` | Main implementation with `_combine_and_rank()` |
| `dspy_heritage_rag.py` | RAG pipeline that calls `retriever.search()` |
| `main.py` | FastAPI endpoints serving the RAG API |

## Future Improvements

1. **Dynamic inheritance factor**: Adjust based on query type (geographic vs. thematic)
2. **Multi-hop expansion**: Inherit from institutions 2+ hops away
3. **Weighted inheritance**: Weight by relationship type (same_city=0.8, same_type=0.5)
4. **Negative inheritance**: Penalize results unrelated to graph findings

---

**Last Updated:** 2025-12-24
**Implemented:** 2025-12-23
**Status:** Production (archief.support)