kempersc/glam

kempersc ca219340f2 enrich entries

2025-12-26 14:30:31 +01:00

Graph Score Inheritance in Hybrid Retrieval

Overview

The Heritage RAG system uses a hybrid retrieval approach that combines:

Vector search (semantic similarity via embeddings)
Knowledge graph expansion (SPARQL-based relationship discovery)

This document explains the graph score inheritance feature that ensures vector search results benefit from knowledge graph relationships.

The Problem

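Before graph score inheritance was introduced, hybrid retrieval had a scoring gap: vector results never received a graph score, and graph results never received a vector score.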

Result Source	Vector Score	Graph Score	Combined Score
Vector search results	0.5-0.8	0.0	0.35-0.56
Graph expansion results	0.0	0.5-0.8	0.15-0.24

Why this happened:

Vector search finds institutions semantically similar to the query
Graph expansion finds different institutions (same city/type) with different GHCIDs
Since GHCIDs don't match, no direct merging occurs
Vector results always dominate because combined = 0.7 * vector + 0.3 * graph, so within the score ranges above a graph-only result cannot outscore a vector-only one
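The gap can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the weights and score ranges from the table above; the function name `combined_score` is illustrative and not taken from the codebase.

```python
VECTOR_WEIGHT = 0.7
GRAPH_WEIGHT = 0.3


def combined_score(vector_score: float, graph_score: float) -> float:
    """Weighted sum used for ranking (illustrative helper)."""
    return VECTOR_WEIGHT * vector_score + GRAPH_WEIGHT * graph_score


# Weakest vector-only hit vs. strongest graph-only hit from the table.
weakest_vector_hit = combined_score(0.5, 0.0)
strongest_graph_hit = combined_score(0.0, 0.8)

print(f"weakest vector-only result:  {weakest_vector_hit:.2f}")
print(f"strongest graph-only result: {strongest_graph_hit:.2f}")
```

Even the best graph-only result lands below the weakest vector-only result, which is why graph information had no visible effect on the ranking.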

Example before fix:

Query: "Archieven in Amsterdam"

1. Stadsarchief Amsterdam      | V:0.659 G:0.000 C:0.461
2. Noord-Hollands Archief      | V:0.675 G:0.000 C:0.472
3. The Black Archives          | V:0.636 G:0.000 C:0.445

The graph expansion was finding related institutions in Amsterdam, but that information wasn't reflected in the scores.

The Solution: Graph Score Inheritance

Vector results now inherit graph scores from related institutions found via graph expansion.

How It Works

1. Vector Search
   └── Returns: [Inst_A, Inst_B, Inst_C] with vector_scores
   
2. Graph Expansion (for top 5 vector results)
   └── For Inst_A in Amsterdam:
       └── SPARQL finds: [Inst_X, Inst_Y] also in Amsterdam
       └── These get graph_score=0.8 (same_city)
       └── They track: related_institutions=[Inst_A.ghcid]

3. Inheritance Calculation
   └── Inst_A inherits from [Inst_X, Inst_Y]:
       inherited_score = avg([0.8, 0.8]) * 0.5 = 0.4
   └── Inst_A.graph_score = max(0.0, 0.4) = 0.4

4. Combined Scoring
   └── Inst_A.combined = 0.7 * vector + 0.3 * 0.4 = 0.7 * vector + 0.12 (previously 0.7 * vector)

Inheritance Factor

INHERITANCE_FACTOR = 0.5  # Inherit 50% of related institutions' graph scores

This means:

Same-city institutions (graph_score=0.8) → inherited score of 0.40
Same-type institutions (graph_score=0.5) → inherited score of 0.25
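The two values above follow from averaging the related institutions' graph scores and scaling by the factor. The sketch below assumes that rule; `inherited_graph_score` is a hypothetical helper name, and the mixed case is an extrapolation from the averaging step rather than a documented result.

```python
from statistics import mean

INHERITANCE_FACTOR = 0.5


def inherited_graph_score(related_scores: list[float],
                          factor: float = INHERITANCE_FACTOR) -> float:
    """Average the related graph scores, then scale by the inheritance factor."""
    if not related_scores:
        return 0.0  # nothing to inherit from
    return mean(related_scores) * factor


same_city = inherited_graph_score([0.8, 0.8])  # two same-city neighbours
same_type = inherited_graph_score([0.5])       # one same-type neighbour
mixed = inherited_graph_score([0.8, 0.5])      # one of each (extrapolated case)

print(f"same city: {same_city:.3f}")
print(f"same type: {same_type:.3f}")
print(f"mixed:     {mixed:.3f}")
```

Because the scores are averaged rather than summed, adding more same-city neighbours does not raise the inherited score above 0.40.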

Implementation Details

File Location

/Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py

Key Method: `_combine_and_rank()`

Located at lines ~1539-1671, this method:

Creates a lookup by GHCID for merging
Handles direct merges when a graph result's GHCID matches a vector result
Builds an inheritance map that tracks which vector results each graph result was expanded from
Applies inheritance by calculating inherited scores for vector results
Computes combined scores with the formula 0.7 * vector + 0.3 * graph

Code Structure

def _combine_and_rank(
    self,
    vector_results: list[RetrievedInstitution],
    graph_results: list[RetrievedInstitution],
    k: int
) -> list[RetrievedInstitution]:
    """Combine vector and graph results with weighted scoring and graph inheritance."""
    
    # 1. Create lookup by GHCID
    results_by_ghcid: dict[str, RetrievedInstitution] = {}
    vector_ghcids = set()
    
    # 2. Add vector results
    for inst in vector_results:
        results_by_ghcid[inst.ghcid] = inst
        vector_ghcids.add(inst.ghcid)
    
    # 3. Build inheritance map: vector_ghcid -> [(related_ghcid, graph_score, reason)]
    inheritance_map: dict[str, list[tuple[str, float, str]]] = {g: [] for g in vector_ghcids}
    
    for inst in graph_results:
        if inst.ghcid in results_by_ghcid:
            # Direct merge
            existing = results_by_ghcid[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            # New from graph - track for inheritance
            results_by_ghcid[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inheritance_map:
                    inheritance_map[seed_ghcid].append(
                        (inst.ghcid, inst.graph_score, inst.expansion_reason)
                    )
    
    # 4. Apply inheritance
    INHERITANCE_FACTOR = 0.5
    for vector_ghcid, related_list in inheritance_map.items():
        if related_list:
            inst = results_by_ghcid[vector_ghcid]
            related_scores = [score for _, score, _ in related_list]
            inherited_score = (sum(related_scores) / len(related_scores)) * INHERITANCE_FACTOR
            inst.graph_score = max(inst.graph_score, inherited_score)
    
    # 5. Calculate combined scores
    for inst in results_by_ghcid.values():
        inst.combined_score = (
            self.vector_weight * inst.vector_score +
            self.graph_weight * inst.graph_score
        )
    
    return sorted(results_by_ghcid.values(), key=lambda x: x.combined_score, reverse=True)[:k]
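To try the logic outside the retriever, the method can be approximated as a standalone function. The sketch below is not the production code: `RetrievedInstitution` is a hypothetical stand-in with only the fields the method touches, the GHCIDs are placeholders, and the same-city graph result is invented for illustration. The vector scores come from the Utrecht query in the Results section.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedInstitution:
    """Hypothetical stand-in with only the fields the method reads or writes."""
    ghcid: str
    name: str
    vector_score: float = 0.0
    graph_score: float = 0.0
    combined_score: float = 0.0
    related_institutions: list[str] = field(default_factory=list)
    expansion_reason: str = ""


def combine_and_rank(vector_results, graph_results, k,
                     vector_weight=0.7, graph_weight=0.3, inheritance_factor=0.5):
    """Standalone approximation of the merge, inheritance, and ranking steps."""
    results = {inst.ghcid: inst for inst in vector_results}
    # Only vector results can inherit, so the map is keyed by their GHCIDs.
    inherited_from = {ghcid: [] for ghcid in results}

    for inst in graph_results:
        if inst.ghcid in results:
            # Direct merge: keep the higher graph score.
            existing = results[inst.ghcid]
            existing.graph_score = max(existing.graph_score, inst.graph_score)
        else:
            results[inst.ghcid] = inst
            for seed_ghcid in inst.related_institutions:
                if seed_ghcid in inherited_from:
                    inherited_from[seed_ghcid].append(inst.graph_score)

    for ghcid, scores in inherited_from.items():
        if scores:
            inherited = sum(scores) / len(scores) * inheritance_factor
            results[ghcid].graph_score = max(results[ghcid].graph_score, inherited)

    for inst in results.values():
        inst.combined_score = (vector_weight * inst.vector_score
                               + graph_weight * inst.graph_score)

    return sorted(results.values(), key=lambda x: x.combined_score, reverse=True)[:k]


# Placeholder GHCIDs; vector scores are taken from the Utrecht example.
vector_hits = [
    RetrievedInstitution("VEC-1", "Centraal Museum", vector_score=0.589),
    RetrievedInstitution("VEC-2", "Museum Speelklok", vector_score=0.591),
    RetrievedInstitution("VEC-3", "Universiteitsmuseum Utrecht", vector_score=0.641),
]
# Invented graph-expansion result that points back at all three seeds.
graph_hits = [
    RetrievedInstitution("GRAPH-1", "Hypothetical same-city museum", graph_score=0.8,
                         related_institutions=["VEC-1", "VEC-2", "VEC-3"],
                         expansion_reason="same_city"),
]

ranked = combine_and_rank(vector_hits, graph_hits, k=3)
for rank, inst in enumerate(ranked, start=1):
    print(f"{rank}. {inst.name:28} | V:{inst.vector_score:.3f} "
          f"G:{inst.graph_score:.3f} C:{inst.combined_score:.3f}")
```

Under these assumptions the three vector results each inherit a graph score of 0.400, and the graph-only result (combined 0.240) falls outside the top 3.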

Graph Expansion Scores

The `_expand_via_graph()` method assigns these base scores:

Expansion Type	Graph Score	SPARQL Pattern
Same city	0.8	`?s schema:location ?loc . ?loc hc:cityCode ?cityCode`
Same institution type	0.5	`?s hc:institutionType ?type`

Results

Before (Graph Score = 0.0)

Query: "Welke musea zijn er in Utrecht?"

1. Centraal Museum              | V:0.589 G:0.000 C:0.412
2. Museum Speelklok             | V:0.591 G:0.000 C:0.414
3. Universiteitsmuseum Utrecht  | V:0.641 G:0.000 C:0.449

After (Graph Score Inherited)

Query: "Welke musea zijn er in Utrecht?"

1. Universiteitsmuseum Utrecht  | V:0.641 G:0.400 C:0.569
2. Museum Speelklok             | V:0.591 G:0.400 C:0.534
3. Centraal Museum              | V:0.589 G:0.400 C:0.532

Key improvements:

Graph scores are now 0.400 (inherited from same-city museums)
Combined scores increased by 0.12, or roughly 27-29% (e.g., 0.412 → 0.532)
Ranking now considers geographic relevance

More Examples

Query: "Bibliotheken in Den Haag"

1. Centrale Bibliotheek         | V:0.697 G:0.400 C:0.608
2. Koninklijke Bibliotheek      | V:0.676 G:0.400 C:0.593
3. Huis van het Boek            | V:0.630 G:0.400 C:0.561
4. Bibliotheek Hoeksche Waard   | V:0.613 G:0.400 C:0.549
5. Centrale Bibliotheek (other) | V:0.623 G:0.000 C:0.436  <- No inheritance (different city)

Configuration

Weights (in `HybridRetriever.__init__`)

self.vector_weight = 0.7  # Semantic similarity importance
self.graph_weight = 0.3   # Knowledge graph importance

Inheritance Factor

INHERITANCE_FACTOR = 0.5  # In _combine_and_rank()

Tuning considerations:

Higher factor (0.6-0.8): Stronger influence from graph relationships
Lower factor (0.3-0.4): More conservative, vector similarity dominates
Current value (0.5): Balanced approach
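To get a feel for these ranges, the sketch below computes how much a single same-city neighbour (graph_score=0.8) adds to a vector result's combined score for a few factor values. It assumes the vector result has no graph score of its own; `boost_from_inheritance` is an illustrative name, not part of the codebase.

```python
GRAPH_WEIGHT = 0.3
SAME_CITY_SCORE = 0.8


def boost_from_inheritance(factor: float,
                           related_score: float = SAME_CITY_SCORE) -> float:
    """Amount the inherited graph score adds to the combined score."""
    return GRAPH_WEIGHT * related_score * factor


for factor in (0.3, 0.5, 0.8):
    boost = boost_from_inheritance(factor)
    print(f"factor={factor:.1f} -> combined score boost {boost:.3f}")
```

Since typical vector scores in the examples differ by only a few hundredths, even the lower factors produce a boost larger than the gap between neighbouring vector results.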

Logging

The implementation includes detailed logging for debugging:

# INFO level (always visible)
logger.info(f"Graph inheritance applied to {len(inheritance_boosts)} vector results: {ghcids}...")

# DEBUG level (when LOG_LEVEL=DEBUG)
logger.debug(f"Inheritance: {ghcid} graph_score: {old:.3f} -> {new:.3f} (from {n} related)")
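How the service maps `LOG_LEVEL` to a logger level is not shown here. One plausible setup, offered only as a sketch, reads the environment variable at startup; the `configure_logging` function and the `hybrid_retriever` logger name are assumptions.

```python
import logging
import os

# Levels accepted from the environment; anything else falls back to INFO.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging() -> logging.Logger:
    """Configure logging from LOG_LEVEL (assumed convention)."""
    level = LEVELS.get(os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger("hybrid_retriever")  # hypothetical logger name
    logger.setLevel(level)
    return logger


logger = configure_logging()
logger.info("Graph inheritance applied to %d vector results", 3)
# Emitted only when LOG_LEVEL=DEBUG.
logger.debug("Inheritance: %s graph_score: %.3f -> %.3f (from %d related)",
             "NL-UT-UTR-M-CM", 0.0, 0.4, 2)
```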

Check logs on production:

ssh root@91.98.224.44 "journalctl -u glam-rag-api --since '5 minutes ago' | grep -i inheritance"

API Response Structure

The graph score is exposed in the API response under scores.graph, which is now populated via inheritance:

{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {
        "vector": 0.589,
        "graph": 0.400,
        "combined": 0.532
      },
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
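A client can use these fields to confirm that a result's combined score matches the documented weights. The sketch below checks the sample response above; the helper name `scores_are_consistent` and the tolerance (chosen to absorb rounding to three decimals) are assumptions.

```python
import json
import math

# Sample payload mirroring the response structure shown above.
sample_response = json.loads("""
{
  "retrieved_results": [
    {
      "ghcid": "NL-UT-UTR-M-CM",
      "name": "Centraal Museum",
      "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532},
      "related_institutions": ["NL-UT-UTR-M-MS", "NL-UT-UTR-M-UMUU"]
    }
  ]
}
""")


def scores_are_consistent(result: dict, vector_weight: float = 0.7,
                          graph_weight: float = 0.3,
                          tolerance: float = 0.001) -> bool:
    """Check combined == 0.7 * vector + 0.3 * graph, allowing for rounding."""
    scores = result["scores"]
    expected = vector_weight * scores["vector"] + graph_weight * scores["graph"]
    return math.isclose(scores["combined"], expected, abs_tol=tolerance)


for result in sample_response["retrieved_results"]:
    status = "ok" if scores_are_consistent(result) else "MISMATCH"
    print(f"{result['ghcid']}: {status}")
```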

Deployment

File to deploy:

scp /Users/kempersc/apps/glam/src/glam_extractor/api/hybrid_retriever.py \
    root@91.98.224.44:/opt/glam-backend/rag/glam_extractor/api/

Restart service:

ssh root@91.98.224.44 "systemctl restart glam-rag-api"

Verify:

curl -s -X POST 'https://archief.support/api/rag/dspy/query' \
  -H 'Content-Type: application/json' \
  -d '{"question": "Musea in Rotterdam", "language": "nl"}' | \
  python3 -c "import sys,json; r=json.load(sys.stdin)['retrieved_results']; print('\n'.join(f\"{x['name'][:30]:30} G:{x['scores']['graph']:.2f}\" for x in r[:5]))"
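The inline Python in the verify command is dense. The sketch below spells out the same formatting as a function that works on an already parsed response, so it can be read or reused without the shell quoting. It makes no network call, and `format_top_results` is an illustrative name.

```python
def format_top_results(payload: dict, limit: int = 5) -> list[str]:
    """Return one line per result: name padded to 30 chars, then the graph score."""
    lines = []
    for result in payload["retrieved_results"][:limit]:
        name = result["name"][:30]
        graph = result["scores"]["graph"]
        lines.append(f"{name:30} G:{graph:.2f}")
    return lines


# Sample payload shaped like the API response; values are from the Utrecht example.
sample_payload = {
    "retrieved_results": [
        {"name": "Centraal Museum",
         "scores": {"vector": 0.589, "graph": 0.400, "combined": 0.532}},
    ]
}

for line in format_top_results(sample_payload):
    print(line)
```

A graph score of 0.00 on every line would suggest that inheritance is not being applied.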

Related Files

File	Purpose
`hybrid_retriever.py`	Main implementation with `_combine_and_rank()`
`dspy_heritage_rag.py`	RAG pipeline that calls `retriever.search()`
`main.py`	FastAPI endpoints serving the RAG API

Future Improvements

Dynamic inheritance factor: Adjust based on query type (geographic vs. thematic)
Multi-hop expansion: Inherit from institutions 2+ hops away
Weighted inheritance: Weight by relationship type (same_city=0.8, same_type=0.5)
Negative inheritance: Penalize results unrelated to graph findings

Last Updated: 2025-12-24
Implemented: 2025-12-23
Status: Production (archief.support)
