Graph Score Inheritance in Hybrid Retrieval

Overview

The Heritage RAG system uses a hybrid retrieval approach that combines:

  1. Vector search (semantic similarity via embeddings)
  2. Knowledge graph expansion (SPARQL-based relationship discovery)

This document explains the graph score inheritance feature that ensures vector search results benefit from knowledge graph relationships.

The Problem

Before graph score inheritance, the hybrid retrieval had a scoring gap:

| Result Source | Vector Score | Graph Score | Combined Score |
|---|---|---|---|
| Vector search results | 0.5-0.8 | 0.0 | 0.35-0.56 |
| Graph expansion results | 0.0 | 0.5-0.8 | 0.15-0.24 |

Why this happened:

  • Vector search finds institutions semantically similar to the query
  • Graph expansion finds different institutions (same city/type) with different GHCIDs
  • Since GHCIDs don't match, no direct merging occurs
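  • Vector results always dominate: with combined = 0.7 * vector + 0.3 * graph, a graph-only result (vector = 0) cannot exceed 0.24, while a vector-only result scores at least 0.35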
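
The gap follows directly from the weights. A minimal sketch using the score ranges from the table above; the function name `combined_score` is illustrative and not taken from the codebase:

```python
VECTOR_WEIGHT = 0.7
GRAPH_WEIGHT = 0.3


def combined_score(vector: float, graph: float) -> float:
    """Weighted sum used for ranking: 0.7 * vector + 0.3 * graph."""
    return VECTOR_WEIGHT * vector + GRAPH_WEIGHT * graph


# Weakest vector-only result vs. strongest graph-only result.
worst_vector_only = combined_score(vector=0.5, graph=0.0)
best_graph_only = combined_score(vector=0.0, graph=0.8)

print(f"worst vector-only: {worst_vector_only:.2f}")  # 0.35
print(f"best graph-only:   {best_graph_only:.2f}")    # 0.24
```

Even the weakest vector-only result outranks the strongest graph-only one, so without inheritance the graph signal cannot change the order among vector results.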

Example before fix:

Query: "Archieven in Amsterdam"

1. Stadsarchief Amsterdam      | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief      | V:0.675 G:0.000 C:0.472
3. The Black Archives          | V:0.636 G:0.000 C:0.445

The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.

The Solution: Graph Score Inheritance

Vector results now inherit graph scores from related institutions found via graph expansion.

How It Works

1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores
   
2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
       └── These get graph_score=0.8 (same_city)
       └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = 0.7 * vector + 0.12 (previously 0.7 * vector)

Inheritance Factor

INHERITANCE_FACTOR = 0.5  # Inherit 50% of related institutions' graph scores

This means:

  • Same-city institutions (graph_score=0.8) → inherited score of 0.40
  • Same-type institutions (graph_score=0.5) → inherited score of 0.25
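
The inherited score is the mean of the related institutions' graph scores scaled by the factor. A small sketch of that calculation; the helper name `inherited_score` is hypothetical, while the formula mirrors the one described in this document:

```python
INHERITANCE_FACTOR = 0.5


def inherited_score(related_scores: list[float],
                    factor: float = INHERITANCE_FACTOR) -> float:
    """Average the related institutions' graph scores, then scale by the factor."""
    if not related_scores:
        return 0.0  # nothing to inherit from
    return (sum(related_scores) / len(related_scores)) * factor


same_city = inherited_score([0.8, 0.8])  # two same-city neighbours
same_type = inherited_score([0.5])       # one same-type neighbour
mixed = inherited_score([0.8, 0.5])      # one of each

print(f"same city: {same_city:.3f}")
print(f"same type: {same_type:.3f}")
print(f"mixed:     {mixed:.3f}")
```

Because the scores are averaged rather than summed, a vector result with many related institutions does not inherit more than one with few; only the mix of relationship types matters.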

Implementation Details

File Location

/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py

Key Method: _combine_and_rank()

Located at lines ~1539-1671, this method:

  1. Creates a lookup by GHCID for merging
  2. Handles direct merges when a graph result's GHCID matches a vector result
  3. Builds an inheritance map that tracks which vector results each graph result was expanded from
  4. Applies inheritance by computing an inherited graph score for each vector result
  5. Computes combined scores with the formula 0.7 * vector + 0.3 * graph

Code Structure

def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""
    
    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()
    
    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)
    
    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}
    
    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )
    
    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)
    
    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )
    
    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
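
To experiment with this logic outside the service, it can be exercised with a stand-in dataclass. This is only a sketch: `Inst` is a hypothetical minimal substitute for `RetrievedInstitution` carrying just the fields the method reads, `combine_and_rank` is a standalone rewrite of the method, and the GHCIDs "A", "B", "X", "Y" are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    """Hypothetical stand-in for RetrievedInstitution (only the fields used here)."""
    ghcid: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)


def combine_and_rank(vector_results, graph_results, k,
                     vector_weight=0.7, graph_weight=0.3, inheritance_factor=0.5):
    """Standalone version of the steps above, for experimentation only."""
    by_ghcid = {inst.ghcid: inst for inst in vector_results}
    inherited = {ghcid: [] for ghcid in by_ghcid}  # seed ghcid -> related graph scores

    for inst in graph_results:
        if inst.ghcid in by_ghcid:
            # Direct merge: keep the higher graph score.
            existing = by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New institution from the graph: remember which seeds it came from.
            by_ghcid[inst.ghcid] = inst
            for seed in inst.related_institutions:
                if seed in inherited:
                    inherited[seed].append(inst.graph_score)

    for seed, scores in inherited.items():
        if scores:
            boost = sum(scores) / len(scores) * inheritance_factor
            by_ghcid[seed].graph_score = max(by_ghcid[seed].graph_score, boost)

    for inst in by_ghcid.values():
        inst.combined_score = (vector_weight * inst.vector_score
                               + graph_weight * inst.graph_score)

    return sorted(by_ghcid.values(), key=lambda i: i.combined_score, reverse=True)[:k]


vector = [Inst("A", vector_score=0.60), Inst("B", vector_score=0.62)]
graph = [
    Inst("X", graph_score=0.8, related_institutions=["A"]),
    Inst("Y", graph_score=0.8, related_institutions=["A"]),
]
ranked = combine_and_rank(vector, graph, k=4)
for inst in ranked:
    print(f"{inst.ghcid} V:{inst.vector_score:.3f} "
          f"G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

In this toy input, "A" has the lower vector score but overtakes "B" because two same-city institutions were expanded from it; "B" has no related graph results and keeps a graph score of 0.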

Graph Expansion Scores

The _expand_via_graph() method assigns these base scores:

| Expansion Type | Graph Score | SPARQL Pattern |
|---|---|---|
| Same city | 0.8 | ?s schema:location ?loc . ?loc hc:cityCode ?cityCode |
| Same institution type | 0.5 | ?s hc:institutionType ?type |
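
The base scores can be pictured as a lookup keyed by expansion reason. The sketch below is an assumption about shape, not the actual body of `_expand_via_graph()`: the dictionary name, the helper `base_graph_score`, and the choice to take the maximum when several reasons match are all illustrative.

```python
# Base scores from the table above; the reason labels are assumed.
EXPANSION_SCORES = {
    "same_city": 0.8,
    "same_type": 0.5,
}


def base_graph_score(reasons: list[str]) -> float:
    """Return the strongest base score among the matched expansion reasons."""
    # Unknown reasons contribute 0.0; an empty list also yields 0.0.
    return max((EXPANSION_SCORES.get(r, 0.0) for r in reasons), default=0.0)


print(base_graph_score(["same_type", "same_city"]))
print(base_graph_score(["same_type"]))
print(base_graph_score([]))
```

Taking the maximum matches how direct merges keep the higher of two graph scores in `_combine_and_rank()`.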

Results

Before (Graph Score = 0.0)

Query: "Welke musea zijn er in Utrecht?"

1. Centraal Museum              | V:0.589 G:0.000 C:0.412
2. Museum Speelklok             | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht  | V:0.641 G:0.000 C:0.449

After (Graph Score Inherited)

Query: "Welke musea zijn er in Utrecht?"

1. Universiteitsmuseum Utrecht  | V:0.641 G:0.400 C:0.569
2. Museum Speelklok             | V:0.591 G:0.400 C:0.534
3. Centraal Museum              | V:0.589 G:0.400 C:0.532

Key improvements:

  • Graph scores are now 0.400 (inherited from same-city museums)
  • Combined scores increased by roughly 27-29% (e.g. 0.412 → 0.532 for Centraal Museum)
  • Ranking now considers geographic relevance
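
The combined scores in both listings can be recomputed from the vector scores alone. A short check, assuming the default weights of 0.7 and 0.3 and an inherited graph score of 0.4 for every result; it also prints the relative gain per institution:

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.7, 0.3
INHERITED_GRAPH_SCORE = 0.4  # avg same-city score 0.8 * inheritance factor 0.5

vector_scores = {
    "Universiteitsmuseum Utrecht": 0.641,
    "Museum Speelklok": 0.591,
    "Centraal Museum": 0.589,
}

gains = {}
for name, v in vector_scores.items():
    before = VECTOR_WEIGHT * v  # graph score was 0.0
    after = VECTOR_WEIGHT * v + GRAPH_WEIGHT * INHERITED_GRAPH_SCORE
    gains[name] = (after - before) / before * 100
    print(f"{name:28} {before:.3f} -> {after:.3f} (+{gains[name]:.0f}%)")
```

Since every result here inherits the same graph score, the absolute boost is identical (0.12) and the order among these three is still decided by the vector score; the boost matters when only some results have related institutions.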

More Examples

Query: "Bibliotheken in Den Haag"

1. Centrale Bibliotheek         | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek      | V:0.676 G:0.400 C:0.593
3. Huis van het Boek            | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard   | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436  <- No inheritance (different city)

Configuration

Weights (in HybridRetriever.__init__)

self.vector_weight = 0.7  # Semantic similarity importance
self.graph_weight = 0.3   # Knowledge graph importance

Inheritance Factor

INHERITANCE_FACTOR = 0.5  # In _combine_and_rank()

Tuning considerations:

  • Higher factor (0.6-0.8): Stronger influence from graph relationships
  • Lower factor (0.3-0.4): More conservative, vector similarity dominates
  • Current value (0.5): Balanced approach
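
To get a feel for the factor before changing it, the effect on a single result can be swept offline. This is an illustrative sketch: the function name and the sample inputs (vector score 0.6, two same-city neighbours at 0.8) are assumptions chosen for the example.

```python
def combined_after_inheritance(vector_score, related_scores, factor,
                               vector_weight=0.7, graph_weight=0.3):
    """Combined score of a vector result whose only graph score is inherited."""
    if not related_scores:
        return vector_weight * vector_score
    inherited = sum(related_scores) / len(related_scores) * factor
    return vector_weight * vector_score + graph_weight * inherited


# Sweep a conservative, the current, and an aggressive factor.
sweep = {f: combined_after_inheritance(0.6, [0.8, 0.8], f) for f in (0.3, 0.5, 0.8)}
for factor, score in sweep.items():
    print(f"factor={factor:.1f} -> combined={score:.3f}")
```

With the graph weight fixed at 0.3 and a maximum base score of 0.8, the factor moves the combined score by at most 0.24, so it tunes the ordering among close vector results rather than overriding large vector score differences.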

Logging

The implementation includes detailed logging for debugging:

# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")

# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
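
How the service maps `LOG_LEVEL` onto its loggers is not shown in this document. One common pattern, given here purely as an assumption, reads the environment variable once at startup; the logger name `hybrid_retriever` is hypothetical.

```python
import logging
import os

# Assumption: the level name comes from the LOG_LEVEL environment variable.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=level_name,  # basicConfig accepts level names such as "DEBUG"
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("hybrid_retriever")  # hypothetical logger name

# Emitted only when LOG_LEVEL=DEBUG; arguments are formatted lazily.
logger.debug(
    "Inheritance: %s graph_score: %.3f -> %.3f (from %d related)",
    "NL-UT-UTR-M-CM", 0.0, 0.4, 2,
)
```

Passing the values as arguments instead of using an f-string avoids building the message when the DEBUG level is disabled.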

Check logs on production:

ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"

API Response Structure

The graph score is exposed in the API response:

{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,    // <-- Now populated via inheritance
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
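
A client can read the scores straight from this structure. The sketch below parses a sample payload shaped like the response above (without the inline annotation, since JSON does not allow comments) and prints it in the V/G/C style used in this document; the helper name `summarize` is illustrative.

```python
import json

# Sample payload mirroring the response structure shown above.
payload = json.loads("""
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
""")


def summarize(results):
    """Format one line per result: name, then vector/graph/combined scores."""
    lines = []
    for r in results:
        s = r["scores"]
        lines.append(
            f"{r['name']:<30} | V:{s['vector']:.3f} "
            f"G:{s['graph']:.3f} C:{s['combined']:.3f}"
        )
    return lines


for line in summarize(payload["retrieved_results"]):
    print(line)
```

A non-zero `graph` value on a result returned by vector search indicates that the score was set by a direct merge or by inheritance.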

Deployment

File to deploy:

scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
    root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/

Restart service:

ssh root@91.98.224.44 "systemctl restart glam-rag-api"

Verify:

curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
  -H 'Content-Type: application/json' \
  -d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
  python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"

| File | Purpose |
|---|---|
| hybrid_retriever.py | Main implementation with _combine_and_rank() |
| dspy_heritage_rag.py | RAG pipeline that calls retriever.search() |
| main.py | FastAPI endpoints serving the RAG API |

Future Improvements

  1. Dynamic inheritance factor: Adjust based on query type (geographic vs. thematic)
  2. Multi-hop expansion: Inherit from institutions 2+ hops away
  3. Weighted inheritance: Weight by relationship type (same_city=0.8, same_type=0.5)
  4. Negative inheritance: Penalize results unrelated to graph findings

Last Updated: 2025-12-24
Implemented: 2025-12-23
Status: Production (archief.support)