2025-12-16 11:57:34 +01:00


RAG Performance Guide

Overview

This document provides performance benchmarks and optimization guidelines for the multi-database RAG (Retrieval-Augmented Generation) pipeline used in bronhouder.nl.

Database Architecture

The RAG pipeline integrates four databases, each optimized for different query patterns:

| Database | Purpose | Vectors/Records | Typical Latency |
|---|---|---|---|
| Qdrant | Vector semantic search | 27,000+ vectors | 38-44ms (cached) |
| Oxigraph | RDF/SPARQL knowledge graph | 148,000+ triples | 36-143ms (cached) |
| TypeDB 3.0 | Graph relationships | 27,452 custodians | 200-230ms (connection overhead) |
| PostgreSQL/PostGIS | Source data & geospatial | Full dataset | 11-14ms (cached) |

Performance Benchmarks

Tested Configuration

  • Server: Hetzner Cloud (91.98.224.44)
  • Date: December 2025
  • Test: /api/rag/query endpoint with generate_answer: false (retrieval only, no LLM answer generation)

Results by Source Combination

| Sources | First Query | Cached Query | Notes |
|---|---|---|---|
| qdrant only | ~2,200ms | 38-44ms | First query loads the embedding model |
| qdrant + sparql | 357-382ms | 36-143ms | Recommended default |
| typedb only | ~1,750ms | 200-230ms | Connection warmup on the first query |
| qdrant + typedb | ~320ms | 200-250ms | TypeDB adds ~200ms overhead |
| postgis only | ~160ms | 11-14ms | Very fast, but needs geographic context |
| qdrant + postgis | n/a | n/a | ~73ms, reported without a first/cached split; fast combination |
| All sources | ~2,100ms | 230-330ms | TypeDB dominates latency |

Key Insights

  1. Qdrant + SPARQL is the best choice for most queries (357-382ms on the first query, 36-143ms cached)
  2. TypeDB adds ~200ms of overhead due to connection management
  3. PostGIS is the fastest source but is only useful for geographic queries
  4. The first query pays a ~2s penalty while the embedding model loads

Default: Semantic + Knowledge Graph

```json
{
  "sources": ["qdrant", "sparql"],
  "k": 10,
  "generate_answer": true
}
```

Best for: General heritage institution queries, name searches, collection queries.

Relationship Queries

```json
{
  "sources": ["qdrant", "sparql", "typedb"],
  "k": 10,
  "generate_answer": true
}
```

Best for: Complex relationship queries, organizational hierarchies, custodian networks.

Geographic Queries

```json
{
  "sources": ["qdrant", "postgis"],
  "k": 10,
  "generate_answer": true
}
```

Best for: Location-based queries (e.g. "museums in Amsterdam") and proximity searches.

Maximum Coverage (Slower)

```json
{
  "sources": ["qdrant", "sparql", "typedb", "postgis"],
  "k": 10,
  "generate_answer": true
}
```

Best for: Comprehensive searches where coverage matters more than speed (cached queries take 230-330ms and first queries ~2.1s, with TypeDB dominating latency).
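A client that picks one of these profiles programmatically could use a small helper like the one below. This is a minimal Python sketch. The profile names and build_request() are hypothetical, because the API itself only receives the sources list. The latency_critical flag applies the troubleshooting advice to exclude TypeDB when latency matters.

```python
import json
from typing import Any

# Source lists for the four profiles above. The profile names are labels
# invented for this sketch; the API itself only receives the "sources" list.
PROFILES: dict[str, tuple[str, ...]] = {
    "default": ("qdrant", "sparql"),
    "relationship": ("qdrant", "sparql", "typedb"),
    "geographic": ("qdrant", "postgis"),
    "maximum_coverage": ("qdrant", "sparql", "typedb", "postgis"),
}


def build_request(
    question: str,
    profile: str = "default",
    k: int = 10,
    generate_answer: bool = True,
    latency_critical: bool = False,
) -> dict[str, Any]:
    """Build a request body for POST /api/rag/query from a named profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    sources = list(PROFILES[profile])
    if latency_critical:
        # TypeDB adds roughly 200ms of connection overhead, so drop it.
        sources = [s for s in sources if s != "typedb"]
    return {
        "question": question,
        "sources": sources,
        "k": k,
        "generate_answer": generate_answer,
    }


if __name__ == "__main__":
    body = build_request("museums in Amsterdam", profile="geographic")
    print(json.dumps(body, indent=2))
```

All four profiles set generate_answer to true. End-to-end latency therefore includes LLM answer generation on top of the retrieval times benchmarked above, which were measured with generate_answer: false.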

Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                             User Query                              │
│                       "museums in Amsterdam"                        │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            Source Router                            │
│                  (based on selectedSources array)                   │
└─────────────────────────────────────────────────────────────────────┘
                                   │
      ┌──────────────────┬─────────┴─────────┬──────────────────┐
      ▼                  ▼                   ▼                  ▼
┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│   Qdrant   │     │  Oxigraph  │     │   TypeDB   │     │  PostGIS   │
│ (Semantic) │     │  (SPARQL)  │     │  (Graph)   │     │(Geospatial)│
│    ~40ms   │     │   ~100ms   │     │   ~200ms   │     │    ~12ms   │
└────────────┘     └────────────┘     └────────────┘     └────────────┘
      │                  │                   │                  │
      └──────────────────┴─────────┬─────────┴──────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Result Merger & Ranker                        │
│         (deduplication, relevance scoring, top-k selection)         │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Answer Generator (LLM)                        │
│                (Claude via DSPy, only if requested)                 │
└─────────────────────────────────────────────────────────────────────┘
```
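The router and merger stages can be sketched in Python as follows. This is an illustration under stated assumptions, not the service's code. The guide does not say whether sources are queried concurrently, how scores from different stores are normalized, or how duplicates are detected. This sketch makes three assumptions:

- Each retriever is an async function, and the selected retrievers run concurrently.
- Duplicate hits share a doc_id.
- When hits are duplicated, the one with the highest score wins.

Hit, route_and_merge() and make_fake() are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Hit:
    doc_id: str   # assumed identifier shared across sources
    score: float  # assumed comparable across sources (e.g. normalized to 0-1)
    source: str


Retriever = Callable[[str, int], Awaitable[list[Hit]]]


async def route_and_merge(
    question: str,
    sources: list[str],
    retrievers: dict[str, Retriever],
    k: int = 10,
) -> list[Hit]:
    # Source Router: call only the selected retrievers, concurrently.
    selected = [retrievers[s] for s in sources if s in retrievers]
    per_source = await asyncio.gather(*(r(question, k) for r in selected))

    # Result Merger: deduplicate by doc_id, keeping the highest-scoring hit.
    best: dict[str, Hit] = {}
    for hits in per_source:
        for hit in hits:
            current = best.get(hit.doc_id)
            if current is None or hit.score > current.score:
                best[hit.doc_id] = hit

    # Ranker: top-k selection by descending score.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:k]


def make_fake(source: str, scored_ids: list[tuple[str, float]]) -> Retriever:
    """Stand-in retriever returning fixed hits, for demonstration only."""
    async def retrieve(question: str, k: int) -> list[Hit]:
        return [Hit(doc_id, score, source) for doc_id, score in scored_ids][:k]
    return retrieve


if __name__ == "__main__":
    demo = {
        "qdrant": make_fake("qdrant", [("rijksmuseum", 0.91), ("stedelijk", 0.84)]),
        "sparql": make_fake("sparql", [("rijksmuseum", 0.88), ("eye", 0.70)]),
    }
    top = asyncio.run(route_and_merge("museums in Amsterdam", ["qdrant", "sparql"], demo, k=2))
    print([(h.doc_id, h.source) for h in top])
```

Keeping the maximum score is only one possible deduplication rule. Summing or reciprocal-rank fusion are common alternatives when the sources score on different scales.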

Optimization History

December 2025: SPARQL LLM Bypass

Problem: Queries with sources=["qdrant", "sparql"] were taking 7-8 seconds.

Root Cause: The retrieve_from_sparql() function redundantly called the Claude LLM through DSPy (~7s) to generate a SPARQL query, even though graph expansion was already handled by a batch SPARQL query in retrieve_from_qdrant().

Solution: Modified /opt/glam-backend/rag/main.py (line ~736) to skip LLM-based SPARQL retrieval:

```python
# Excerpt from the loop over requested sources (preceding branches omitted)
elif source == DataSource.SPARQL:
    # OPTIMIZATION: Skip LLM-based SPARQL generation for RAG queries.
    # Graph expansion is already handled in retrieve_from_qdrant() via batch SPARQL.
    logger.debug("SPARQL source selected - graph expansion handled by Qdrant retrieval")
    continue
```

Result: Query time dropped from 7-8s to 350-380ms, roughly a 20x improvement.
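The guide does not show the batch SPARQL query that retrieve_from_qdrant() issues. One common way to expand many entities in a single round trip is a VALUES block, and the Python sketch below builds such a query. The query shape, the example IRIs and the function name are assumptions. In particular, fetching all outgoing triples of every hit is one possible shape, not necessarily the one the service uses.

```python
from typing import Iterable

# Characters that may not appear inside a SPARQL IRIREF (<...>).
_FORBIDDEN = set('<>"{}|^`\\')


def build_batch_expansion_query(entity_iris: Iterable[str], limit: int = 500) -> str:
    """Build one SPARQL SELECT that expands every entity in a single request."""
    iris = list(dict.fromkeys(entity_iris))  # de-duplicate, keep order
    if not iris:
        raise ValueError("no entities to expand")
    for iri in iris:
        # Reject IRIs that could break out of the <...> syntax.
        if any(c in _FORBIDDEN or ord(c) <= 0x20 for c in iri):
            raise ValueError(f"unsafe IRI: {iri!r}")
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "SELECT ?entity ?p ?o WHERE {\n"
        f"  VALUES ?entity {{ {values} }}\n"
        "  ?entity ?p ?o .\n"
        f"}} LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical IRIs standing in for identifiers attached to Qdrant hits.
    hits = ["https://example.org/custodian/1", "https://example.org/custodian/2"]
    print(build_batch_expansion_query(hits))
```

A single templated query like this costs one triple-store round trip. The removed code path spent ~7s per query on LLM-based query generation.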

Caching Behavior

Server-Side Caching

  • Semantic Cache: Embeds each question and reuses the cached result of a sufficiently similar earlier query
  • TTL: 15 minutes default
  • Scope: Per-query result caching
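The guide documents only the cache's behavior: questions are embedded, similar questions match, and entries expire after 15 minutes. A minimal in-memory Python sketch of that idea follows. Cosine similarity, the 0.95 threshold, the linear scan and the SemanticCache name are all assumptions. The embedding function is injected, so the sketch runs without the service's model; toy_embed() is a stand-in for demonstration only.

```python
import math
import time
from typing import Any, Callable, Optional

Embedder = Callable[[str], list[float]]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory cache keyed by question meaning rather than exact text."""

    def __init__(
        self,
        embed: Embedder,
        threshold: float = 0.95,       # assumed similarity cut-off
        ttl_seconds: float = 15 * 60,  # documented 15-minute default TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        # Each entry: (question vector, time stored, cached result).
        self.entries: list[tuple[list[float], float, Any]] = []

    def get(self, question: str) -> Optional[Any]:
        now = self.clock()
        # Evict expired entries before matching.
        self.entries = [e for e in self.entries if now - e[1] < self.ttl]
        query_vec = self.embed(question)
        best, best_sim = None, self.threshold
        for vec, _, result in self.entries:
            sim = cosine(query_vec, vec)
            if sim >= best_sim:  # keep the most similar entry above threshold
                best, best_sim = result, sim
        return best

    def put(self, question: str, result: Any) -> None:
        self.entries.append((self.embed(question), self.clock(), result))

    def clear(self) -> None:
        """Drop everything, analogous to a clear_all invalidation."""
        self.entries.clear()


def toy_embed(text: str) -> list[float]:
    """Toy bag-of-words embedder for the demo; not a real embedding model."""
    vocab = ["museums", "amsterdam", "archives", "utrecht"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


if __name__ == "__main__":
    cache = SemanticCache(toy_embed)
    cache.put("museums in Amsterdam", {"count": 5})
    print(cache.get("Museums in amsterdam"))  # same meaning: hit
    print(cache.get("archives in Utrecht"))   # unrelated: miss
```

A production cache would normally index the vectors (for example in a vector store) instead of scanning every entry.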

Client-Side Caching (IndexedDB)

  • Database: GLAM_SemanticCache
  • Purpose: Reduces redundant API calls
  • Clear: Use "Clear cache" button in UI

Cache Invalidation

```bash
# Server-side cache clear
curl -X POST "http://localhost:8010/api/cache/invalidate" \
  -H "Content-Type: application/json" \
  -d '{"clear_all": true}'

# Or restart the service
systemctl restart glam-rag-api
```

Monitoring

Service Status

```bash
ssh root@91.98.224.44
# Then, on the server:
systemctl status glam-rag-api
journalctl -u glam-rag-api -n 50 --no-pager
```

Performance Testing

```bash
# Test a specific source combination
curl -s -X POST "http://localhost:8010/api/rag/query" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "museums in Amsterdam",
    "sources": ["qdrant", "sparql"],
    "k": 5,
    "generate_answer": false
  }' | jq '{time_ms: .query_time_ms, count: (.results | length)}'
```

Expected Results

Example output for a cached query (first queries are slower):

```json
{
  "time_ms": 45.2,
  "count": 5
}
```
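To reproduce the first-query versus cached-query split from the benchmark table, a small harness can repeat the same retrieval-only request and compare the server-reported query_time_ms. The Python sketch below uses only the request and response fields shown above. run_benchmark(), summarize() and the repeat count are illustrative choices, and measured values will differ per deployment.

```python
import json
import statistics
import urllib.request

# Endpoint from the examples above; adjust host/port for your deployment.
URL = "http://localhost:8010/api/rag/query"

# Source combinations from the benchmark table.
COMBINATIONS = [
    ["qdrant"],
    ["qdrant", "sparql"],
    ["typedb"],
    ["qdrant", "typedb"],
    ["postgis"],
    ["qdrant", "postgis"],
    ["qdrant", "sparql", "typedb", "postgis"],
]


def query_once(sources: list[str], question: str, k: int = 5) -> float:
    """Send one retrieval-only query and return the server-reported time (ms)."""
    body = json.dumps({
        "question": question,
        "sources": sources,
        "k": k,
        "generate_answer": False,
    }).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return float(json.load(response)["query_time_ms"])


def summarize(times_ms: list[float]) -> dict[str, float]:
    """Split one run into the first query and the repeated (cached) queries."""
    if len(times_ms) < 2:
        raise ValueError("need at least two measurements")
    cached = times_ms[1:]
    return {
        "first_ms": times_ms[0],
        "cached_min_ms": min(cached),
        "cached_max_ms": max(cached),
        "cached_median_ms": statistics.median(cached),
    }


def run_benchmark(question: str = "museums in Amsterdam", repeats: int = 5) -> None:
    """Print a first/cached summary for each source combination."""
    for sources in COMBINATIONS:
        times = [query_once(sources, question) for _ in range(repeats)]
        print(" + ".join(sources), summarize(times))
```

Call run_benchmark() on the server. If each combination's first measurement should be uncached, clear the cache between combinations (for example with the invalidate request above). After a service restart, the very first call presumably also includes the embedding-model load.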

Troubleshooting

Slow SPARQL Queries (>1s)

  1. Check that the LLM bypass is active in rag/main.py
  2. Verify that Oxigraph is running: systemctl status oxigraph
  3. Check the query for complex graph patterns

TypeDB Connection Timeout

  1. TypeDB adds ~200ms of connection overhead; this is expected
  2. The first query after an idle period may take longer (~1.7s)
  3. Consider excluding TypeDB for latency-critical queries

High First-Query Latency

  1. The embedding model loads on the first query (~2s)
  2. This is expected behavior
  3. Subsequent queries reuse the already loaded model

API Reference

Query Endpoint

POST /api/rag/query

Request Body

```json
{
  "question": "string",
  "sources": ["qdrant", "sparql", "typedb", "postgis"],
  "k": 10,
  "generate_answer": true,
  "language": "en"
}
```

Response

```json
{
  "question": "museums in Amsterdam",
  "answer": "There are several museums...",
  "results": [...],
  "sources_used": ["qdrant", "sparql"],
  "query_time_ms": 45.2
}
```
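A typed Python wrapper for these bodies can catch malformed requests before they reach the server. Field names come from the examples above. The following are assumptions:

- The defaults, taken from the recommended default configuration rather than from documented server defaults.
- The validation rules.
- The class names.

The structure of individual results entries is not documented, so the sketch keeps them as plain dictionaries.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

VALID_SOURCES = {"qdrant", "sparql", "typedb", "postgis"}


@dataclass
class RagQueryRequest:
    question: str
    # Defaults mirror the recommended default profile (assumption).
    sources: list[str] = field(default_factory=lambda: ["qdrant", "sparql"])
    k: int = 10
    generate_answer: bool = True
    language: str = "en"

    def __post_init__(self) -> None:
        unknown = set(self.sources) - VALID_SOURCES
        if unknown:
            raise ValueError(f"unknown sources: {sorted(unknown)}")
        if self.k < 1:
            raise ValueError("k must be at least 1")

    def to_json(self) -> str:
        return json.dumps(asdict(self))


@dataclass
class RagQueryResponse:
    question: str
    answer: Optional[str]
    results: list[dict[str, Any]]
    sources_used: list[str]
    query_time_ms: float

    @classmethod
    def from_json(cls, text: str) -> "RagQueryResponse":
        data = json.loads(text)
        return cls(
            question=data["question"],
            # Assumed to be missing or null when generate_answer is false.
            answer=data.get("answer"),
            results=data.get("results", []),
            sources_used=data.get("sources_used", []),
            query_time_ms=float(data["query_time_ms"]),
        )


if __name__ == "__main__":
    request = RagQueryRequest("museums in Amsterdam", generate_answer=False)
    print(request.to_json())
```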