
# RAG Performance Guide

## Overview

This document provides performance benchmarks and optimization guidelines for the multi-database RAG (Retrieval-Augmented Generation) pipeline used in bronhouder.nl.

## Database Architecture

The RAG pipeline integrates four databases, each optimized for different query patterns:

| Database | Purpose | Vectors/Records | Typical Latency |
|----------|---------|-----------------|-----------------|
| **Qdrant** | Vector semantic search | 27,000+ vectors | **38-44ms** (cached) |
| **Oxigraph** | RDF/SPARQL knowledge graph | 148,000+ triples | **36-143ms** (cached) |
| **TypeDB 3.0** | Graph relationships | 27,452 custodians | **200-230ms** (connection overhead) |
| **PostgreSQL/PostGIS** | Source data & geospatial | Full dataset | **11-14ms** (cached) |

## Performance Benchmarks

### Tested Configuration
- Server: Hetzner Cloud (91.98.224.44)
- Date: December 2025
- Test: `/api/rag/query` endpoint with `generate_answer: false` (retrieval only; LLM answer generation is excluded from these timings)

### Results by Source Combination

| Sources | First Query | Cached Query | Notes |
|---------|------------:|-------------:|-------|
| **qdrant** only | ~2,200ms | 38-44ms | First query loads embedding model |
| **qdrant + sparql** | 357-382ms | 36-143ms | **✅ Recommended default** |
| **typedb** only | ~1,750ms | 200-230ms | Connection warmup on first |
| **qdrant + typedb** | ~320ms | 200-250ms | TypeDB adds ~200ms overhead |
| **postgis** only | ~160ms | 11-14ms | Very fast, needs geo context |
| **qdrant + postgis** | ~73ms | not reported | Fast combination |
| **All sources** | ~2,100ms | 230-330ms | TypeDB dominates latency |

### Key Insights

1. **Qdrant + SPARQL is optimal for most queries** (~350ms on the first query, as low as ~40ms cached)
2. **TypeDB adds ~200ms overhead** due to connection management
3. **PostGIS is the fastest source** but is only useful for geographic queries
4. **The first query pays a ~2s penalty** while the embedding model loads

## Recommended Configurations

### Default: Semantic + Knowledge Graph
```json
{
  "sources": ["qdrant", "sparql"],
  "k": 10,
  "generate_answer": true
}
```
Best for: General heritage institution queries, name searches, collection queries.

### Relationship Queries
```json
{
  "sources": ["qdrant", "sparql", "typedb"],
  "k": 10,
  "generate_answer": true
}
```
Best for: Complex relationship queries, organizational hierarchies, custodian networks.

### Geographic Queries
```json
{
  "sources": ["qdrant", "postgis"],
  "k": 10,
  "generate_answer": true
}
```
Best for: Location-based queries, "museums in Amsterdam", proximity searches.

### Maximum Coverage (Slower)
```json
{
  "sources": ["qdrant", "sparql", "typedb", "postgis"],
  "k": 10,
  "generate_answer": true
}
```
Best for: Comprehensive searches where accuracy is more important than speed.
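
The four configurations above can be captured as named presets so that callers do not hand-assemble the `sources` array. The following is a minimal sketch: the preset names and the `build_request` helper are illustrative assumptions, not part of the existing API; only the request fields come from this guide.

```python
from typing import Dict, List

# Source combinations copied from the configurations above.
# The preset names themselves are hypothetical labels.
PRESETS: Dict[str, List[str]] = {
    "default": ["qdrant", "sparql"],
    "relationship": ["qdrant", "sparql", "typedb"],
    "geographic": ["qdrant", "postgis"],
    "maximum": ["qdrant", "sparql", "typedb", "postgis"],
}


def build_request(question: str, preset: str = "default",
                  k: int = 10, generate_answer: bool = True) -> dict:
    """Build a request body for POST /api/rag/query from a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    return {
        "question": question,
        "sources": list(PRESETS[preset]),  # copy so callers cannot mutate the preset
        "k": k,
        "generate_answer": generate_answer,
    }
```

For example, `build_request("museums in Amsterdam", preset="geographic")` yields a body with `"sources": ["qdrant", "postgis"]`.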

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                          User Query                                  │
│                    "museums in Amsterdam"                            │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Source Router                                  │
│              (based on selectedSources array)                        │
└─────────────────────────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│    Qdrant     │       │   Oxigraph    │       │    TypeDB     │
│  (Semantic)   │       │   (SPARQL)    │       │    (Graph)    │
│   ~40ms       │       │   ~100ms      │       │   ~200ms      │
└───────────────┘       └───────────────┘       └───────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Result Merger & Ranker                            │
│         (deduplication, relevance scoring, top-k selection)          │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Answer Generator (LLM)                           │
│              (Claude via DSPy, only if requested)                    │
└─────────────────────────────────────────────────────────────────────┘
```
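
The fan-out and merge stages in the diagram can be sketched roughly as follows. This is not the code in `rag/main.py`: the retriever callables, the hit shape (`id` and `score` keys), and the choice to query sources concurrently are all assumptions made for illustration, and the sketch further assumes that scores from different sources are comparable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]  # assumed shape: {"id": str, "score": float, ...}
Retriever = Callable[[str, int], List[Hit]]  # hypothetical: (question, k) -> hits


def merge_results(batches: List[List[Hit]], k: int) -> List[Hit]:
    """Deduplicate hits by id (keeping the highest score) and return the top k."""
    best: Dict[str, Hit] = {}
    for batch in batches:
        for hit in batch:
            current = best.get(hit["id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]


def query_sources(question: str, sources: List[str],
                  retrievers: Dict[str, Retriever], k: int = 10) -> List[Hit]:
    """Route the question to the selected sources, then merge and rank."""
    selected = [retrievers[name] for name in sources]
    # Run retrievers in parallel so total latency tracks the slowest source.
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        batches = list(pool.map(lambda retrieve: retrieve(question, k), selected))
    return merge_results(batches, k)
```

With concurrent fan-out, the slowest selected source bounds retrieval latency, which is consistent with TypeDB dominating the "All sources" timings above.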

## Optimization History

### December 2025: SPARQL LLM Bypass
**Problem**: Queries with `sources=["qdrant", "sparql"]` were taking 7-8 seconds.

**Root Cause**: The `retrieve_from_sparql()` function was redundantly calling the DSPy/Claude LLM (~7s) to generate SPARQL queries, even though graph expansion was already handled by batch SPARQL in `retrieve_from_qdrant()`.

**Solution**: Modified `/opt/glam-backend/rag/main.py` (line ~736) to skip LLM-based SPARQL retrieval:

```python
elif source == DataSource.SPARQL:
    # OPTIMIZATION: Skip LLM-based SPARQL generation for RAG queries.
    # Graph expansion is already handled in retrieve_from_qdrant() via batch SPARQL.
    logger.debug("SPARQL source selected - graph expansion handled by Qdrant retrieval")
    continue
```

**Result**: Query time reduced from 7-8s to 350-380ms (roughly a 20x improvement).

## Caching Behavior

### Server-Side Caching
- **Semantic Cache**: Embeds questions and matches similar queries
- **TTL**: 15 minutes default
- **Scope**: Per-query result caching
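
To illustrate how a semantic cache with a TTL behaves, here is a minimal in-memory sketch. It is not the server's implementation: the `SemanticCache` class, the injected `embed` function, the linear scan, and the 0.95 cosine-similarity threshold are assumptions; only the 15-minute TTL comes from this guide.

```python
import math
import time
from typing import Any, Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 ttl_seconds: float = 15 * 60,   # TTL from this guide
                 threshold: float = 0.95,        # assumed similarity cutoff
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._embed = embed
        self._ttl = ttl_seconds
        self._threshold = threshold
        self._clock = clock
        self._entries: List[Tuple[List[float], float, Any]] = []

    def put(self, question: str, value: Any) -> None:
        """Store a result under the question's embedding with a timestamp."""
        self._entries.append((list(self._embed(question)), self._clock(), value))

    def get(self, question: str) -> Optional[Any]:
        """Return the most similar unexpired result, or None on a miss."""
        now = self._clock()
        # Drop entries older than the TTL before matching.
        self._entries = [e for e in self._entries if now - e[1] < self._ttl]
        query_vec = self._embed(question)
        best_value, best_sim = None, self._threshold
        for vec, _, value in self._entries:
            sim = cosine(query_vec, vec)
            if sim >= best_sim:
                best_value, best_sim = value, sim
        return best_value
```

Because matching is by similarity rather than exact text, a rephrased question can return a cached result; a stricter threshold trades hit rate for precision.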

### Client-Side Caching (IndexedDB)
- **Database**: `GLAM_SemanticCache`
- **Purpose**: Reduces redundant API calls
- **Clear**: Use the "Clear cache" button in the UI

### Cache Invalidation
```bash
# Server-side cache clear
curl -X POST "http://localhost:8010/api/cache/invalidate" \
  -H "Content-Type: application/json" \
  -d '{"clear_all": true}'

# Or via service restart
systemctl restart glam-rag-api
```

## Monitoring

### Service Status
```bash
ssh root@91.98.224.44
systemctl status glam-rag-api
journalctl -u glam-rag-api -n 50 --no-pager
```

### Performance Testing
```bash
# Test specific source combination
curl -s -X POST "http://localhost:8010/api/rag/query" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "museums in Amsterdam",
    "sources": ["qdrant", "sparql"],
    "k": 5,
    "generate_answer": false
  }' | jq '{time_ms: .query_time_ms, count: (.results | length)}'
```

### Expected Results
```json
{
  "time_ms": 45.2,
  "count": 5
}
```
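
To compare first-query and cached latency in one run, the curl test above can be wrapped in a small script. This is a sketch using only the standard library; the `benchmark` and `post_query` helpers are hypothetical names, and it assumes that repeating the same question is served from the cache after the first call.

```python
import json
import statistics
import urllib.request
from typing import Callable, Dict, List

URL = "http://localhost:8010/api/rag/query"  # endpoint from the curl example above


def post_query(payload: dict) -> dict:
    """Send one query to the RAG endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def benchmark(sources: List[str], runs: int = 5,
              send: Callable[[dict], dict] = post_query) -> Dict[str, float]:
    """Repeat one query and report first-call vs. median repeat latency."""
    if runs < 2:
        raise ValueError("runs must be at least 2 to measure repeat latency")
    payload = {
        "question": "museums in Amsterdam",
        "sources": sources,
        "k": 5,
        "generate_answer": False,  # retrieval only, as in the benchmarks above
    }
    # Server-reported time per call, taken from the query_time_ms field.
    times = [send(payload)["query_time_ms"] for _ in range(runs)]
    return {"first_ms": times[0], "cached_median_ms": statistics.median(times[1:])}
```

Calling `benchmark(["qdrant", "sparql"])` on the server returns a dictionary with `first_ms` and `cached_median_ms`. Restart the service or invalidate the cache first if a true cold-start number is needed.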

## Troubleshooting

### Slow SPARQL Queries (>1s)
1. Check if LLM bypass is active in `rag/main.py`
2. Verify Oxigraph is running: `systemctl status oxigraph`
3. Check for complex graph patterns in query

### TypeDB Connection Timeout
1. TypeDB adds ~200ms of connection overhead; this is expected.
2. The first query after an idle period may take longer (~1.7s).
3. Consider excluding TypeDB from latency-critical queries.

### High First-Query Latency
1. Embedding model loads on first query (~2s)
2. This is expected behavior
3. Subsequent queries will use cached model

## API Reference

### Query Endpoint
```
POST /api/rag/query
```

### Request Body
```json
{
  "question": "string",
  "sources": ["qdrant", "sparql", "typedb", "postgis"],
  "k": 10,
  "generate_answer": true,
  "language": "en"
}
```

### Response
```json
{
  "question": "museums in Amsterdam",
  "answer": "There are several museums...",
  "results": [...],
  "sources_used": ["qdrant", "sparql"],
  "query_time_ms": 45.2
}
```
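
On the client side, the response can be parsed into a typed structure. The `RagResponse` dataclass below is a hypothetical helper, not part of the API; it assumes `answer` may be absent or null when `generate_answer` is `false`, which this guide does not state explicitly.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RagResponse:
    question: str
    results: List[Dict[str, Any]]
    sources_used: List[str]
    query_time_ms: float
    answer: Optional[str] = None  # assumed to be missing for retrieval-only queries

    @classmethod
    def from_json(cls, data: Dict[str, Any]) -> "RagResponse":
        """Build a RagResponse from the decoded JSON body of /api/rag/query."""
        return cls(
            question=data["question"],
            results=list(data.get("results", [])),
            sources_used=list(data.get("sources_used", [])),
            query_time_ms=float(data["query_time_ms"]),
            answer=data.get("answer"),
        )
```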

## Related Documentation

- [Architecture](./01-architecture.md) - System components
- [Retrieval Patterns](./06-retrieval-patterns.md) - Hybrid search strategies
- [SPARQL Templates](./07-sparql-templates.md) - Query patterns
- [Evaluation](./08-evaluation.md) - Metrics and benchmarks