glam/.opencode/SPARQL_PREDICATE_ARCHITECTURE.md
2026-01-08 15:56:28 +01:00


# SPARQL Predicate Architecture
## Overview
The GLAM RAG system uses two different predicate URI styles that coexist:
1. **LinkML Schema** - Uses semantic URIs from base ontologies
2. **RAG SPARQL Queries** - Uses custom `hc:` prefixed predicates
This document explains why this dual system exists and how it's handled.
---
## The Two Predicate Systems
### 1. LinkML Schema Predicates (Semantic URIs)
The LinkML schema in `schemas/20251121/linkml/` uses `slot_uri` properties that map to established ontology vocabularies:
| Slot | slot_uri | Ontology |
|------|----------|----------|
| `custodian_type` | `org:classification` | W3C Organization Ontology |
| `settlement` | `schema:location` | Schema.org |
| `country` | `schema:addressCountry` | Schema.org |
| `name` | `skos:prefLabel` | SKOS |
**Rationale**: Semantic interoperability with linked data ecosystems (Europeana, Wikidata, etc.)
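To make the mapping concrete, the sketch below expands the `slot_uri` values from the table into full predicate IRIs. The namespace IRIs are the commonly published ones for these vocabularies and are an assumption here; the schema's own prefix declarations take precedence if they differ (for example `https://schema.org/`). The helper names `expand_curie` and `slot_iri` are hypothetical and not part of the LinkML tooling.

```python
# Namespace IRIs for the vocabularies in the table (assumed; check the schema's prefixes).
PREFIXES = {
    "org": "http://www.w3.org/ns/org#",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Slot name -> slot_uri, copied from the table above.
SLOT_URIS = {
    "custodian_type": "org:classification",
    "settlement": "schema:location",
    "country": "schema:addressCountry",
    "name": "skos:prefLabel",
}


def expand_curie(curie: str) -> str:
    """Expand a prefixed name such as 'skos:prefLabel' to a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


def slot_iri(slot: str) -> str:
    """Return the full predicate IRI that a slot serializes to."""
    return expand_curie(SLOT_URIS[slot])
```

For instance, `slot_iri("name")` resolves to the SKOS `prefLabel` IRI, which is the predicate an external consumer would see in the serialized RDF.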
### 2. RAG SPARQL Predicates (Custom hc: prefix)
The RAG system generates SPARQL queries using custom `hc:` prefixed predicates:
| Predicate | Purpose |
|-----------|---------|
| `hc:institutionType` | Filter by heritage type (M, L, A, G, etc.) |
| `hc:settlementName` | Filter by city name |
| `hc:subregionCode` | Filter by province/state (NL-NH, NL-GE) |
| `hc:countryCode` | Filter by country (ISO 3166-1 alpha-2) |
| `hc:ghcid` | Global Heritage Custodian Identifier |
**Rationale**: Simplified, consistent predicates for RAG query generation
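As an illustration of how these predicates combine in a generated query, here is a minimal sketch of a query builder. The namespace IRI in `HC_NAMESPACE` is a placeholder, the function names are hypothetical, and the code assumes that type codes and city names are stored as plain string literals; none of this is taken from the actual RAG templates.

```python
# Placeholder namespace IRI (assumption); use the IRI the knowledge graph actually declares.
HC_NAMESPACE = "https://example.org/hc/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_institution_query(institution_type: str, settlement: str, limit: int = 10) -> str:
    """Build a SELECT query filtering by heritage type code and city name."""
    return (
        f"PREFIX hc: <{HC_NAMESPACE}>\n"
        "SELECT ?institution ?ghcid WHERE {\n"
        f'  ?institution hc:institutionType "{escape_literal(institution_type)}" ;\n'
        f'               hc:settlementName "{escape_literal(settlement)}" ;\n'
        "               hc:ghcid ?ghcid .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Museums ("M") in Amsterdam.
    print(build_institution_query("M", "Amsterdam"))
```

The escaping covers only backslashes and double quotes; input containing newlines or other control characters would need additional handling before being placed in a literal.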
---
## Why Two Systems?
### Historical Context
1. **LinkML Schema** was designed for semantic web interoperability and RDF serialization
2. **RAG Queries** evolved independently for efficient knowledge graph querying
3. The Oxigraph knowledge graph stores data using the `hc:` namespace
### Technical Trade-offs
| Aspect | Semantic URIs | Custom hc: URIs |
|--------|---------------|-----------------|
| **Interoperability** | ✅ Standards-compliant | ❌ Project-specific |
| **Query Simplicity** | ❌ Long URIs | ✅ Short, memorable |
| **LLM Generation** | ❌ Harder to generate | ✅ Easier patterns |
| **Validation** | ✅ LinkML tooling | ⚠️ Custom validation |
---
## How SPARQLValidator Handles This
The `SPARQLValidator` class in `backend/rag/template_sparql.py` accepts predicates from both systems. Its constructor builds a single allow-list as the union of three predicate sets:
```python
class SPARQLValidator:
    def __init__(self):
        # 1. Core RAG predicates (always included)
        hc_predicates = set(self._FALLBACK_HC_PREDICATES)

        # 2. Schema predicates from OntologyLoader (semantic URIs).
        #    `ontology` is the OntologyLoader instance; its setup is omitted in this excerpt.
        schema_predicates = ontology.get_predicates()
        if schema_predicates:
            hc_predicates = hc_predicates | schema_predicates

        # 3. External predicates (base ontology URIs)
        self._all_predicates = hc_predicates | self.VALID_EXTERNAL_PREDICATES
```
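The combined set can then serve as an allow-list for generated queries. The following is a simplified sketch of that idea, not the actual `SPARQLValidator` logic: the three sets hold a few illustrative entries, and `unknown_predicates` is a hypothetical helper. It does not parse SPARQL, so it checks every prefixed name in the query and cannot tell predicate position from object position.

```python
import re

# Illustrative stand-ins for the three sets the constructor combines.
FALLBACK_HC_PREDICATES = {"hc:institutionType", "hc:settlementName", "hc:countryCode"}
SCHEMA_PREDICATES = {"org:classification", "skos:prefLabel"}
VALID_EXTERNAL_PREDICATES = {"rdfs:label", "schema:name"}

ALL_PREDICATES = FALLBACK_HC_PREDICATES | SCHEMA_PREDICATES | VALID_EXTERNAL_PREDICATES

IRI_REF = re.compile(r"<[^<>\s]*>")
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
PREFIXED_NAME = re.compile(r"\b[A-Za-z][\w-]*:[A-Za-z_][\w-]*")


def unknown_predicates(query: str) -> set[str]:
    """Return prefixed names in the query that are not in the allow-list."""
    # Remove IRIs and string literals first so their contents are not scanned.
    stripped = STRING_LITERAL.sub('""', IRI_REF.sub("", query))
    return set(PREFIXED_NAME.findall(stripped)) - ALL_PREDICATES
```

An empty result means every prefixed name was recognized; a non-empty result lists the names a generated query would need corrected before execution.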
### Predicate Categories
| Category | Count | Source |
|----------|-------|--------|
| Core RAG predicates | 12 | `_FALLBACK_HC_PREDICATES` |
| Schema predicates | 286 | OntologyLoader (LinkML) |
| External predicates | ~40 | `VALID_EXTERNAL_PREDICATES` |
---
## Future Considerations
### Option A: Unify to Semantic URIs (Recommended Long-term)
1. Update Oxigraph data to use semantic URIs
2. Update RAG query templates to use `org:classification` etc.
3. Deprecate custom `hc:` predicates
**Pros**: Single source of truth, better interoperability
**Cons**: Migration effort, breaking changes
### Option B: Maintain Dual System
1. Keep custom `hc:` predicates for RAG queries
2. Add URI mapping layer in Oxigraph (CONSTRUCT queries)
3. Document both systems
**Pros**: No breaking changes
**Cons**: Ongoing maintenance, potential confusion
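One way to picture the mapping layer from Option B is a generated `CONSTRUCT` query that re-emits `hc:` triples under their semantic predicates. The sketch below is an assumption-laden illustration: only the `hc:institutionType` to `org:classification` pairing is stated in this document, the other two pairings are guesses based on the tables above, the `hc:` namespace IRI is a placeholder, and `build_construct_query` is a hypothetical helper.

```python
PREFIXES = {
    "hc": "https://example.org/hc/",  # placeholder namespace IRI (assumption)
    "org": "http://www.w3.org/ns/org#",
    "schema": "http://schema.org/",
}

HC_TO_SEMANTIC = {
    "hc:institutionType": "org:classification",  # pairing stated under Option C
    "hc:settlementName": "schema:location",  # assumed pairing
    "hc:countryCode": "schema:addressCountry",  # assumed pairing
}


def build_construct_query(mapping: dict[str, str]) -> str:
    """Build one CONSTRUCT query that re-emits hc: triples under semantic predicates."""
    prefix_lines = [f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items()]
    template, branches = [], []
    for i, (hc_pred, semantic_pred) in enumerate(mapping.items()):
        # Each pair gets its own object variable; template triples whose
        # variable is unbound in a given solution are left out of the result.
        template.append(f"  ?s {semantic_pred} ?o{i} .")
        branches.append(f"  {{ ?s {hc_pred} ?o{i} }}")
    return "\n".join(
        prefix_lines
        + ["CONSTRUCT {"] + template + ["}", "WHERE {"]
        + ["\n  UNION\n".join(branches)]
        + ["}"]
    )


if __name__ == "__main__":
    print(build_construct_query(HC_TO_SEMANTIC))
```

The query copies object values unchanged, so any difference in value conventions between the two vocabularies would still need separate handling.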
### Option C: Namespace Aliasing
Configure Oxigraph to treat `hc:institutionType` as equivalent to `org:classification`:
```sparql
# RDF triple declaring the two properties equivalent. It changes query results
# only if an OWL-aware reasoner or a query-rewriting step applies it.
hc:institutionType owl:equivalentProperty org:classification .
```
**Pros**: Transparent to RAG system
**Cons**: Reasoning overhead, complexity
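Where no reasoner is available, a similar effect can be approximated at query time with SPARQL 1.1 property-path alternatives, rewriting `hc:institutionType` to `(hc:institutionType|org:classification)`. The sketch below shows that rewrite as a plain text substitution; `add_alternatives` and `EQUIVALENTS` are hypothetical names, and the approach is an illustration rather than something the RAG system is known to do.

```python
import re

# Equivalence stated in the Option C example; further pairs would be added here.
EQUIVALENTS = {"hc:institutionType": "org:classification"}


def add_alternatives(query: str, equivalents: dict[str, str]) -> str:
    """Rewrite each mapped predicate into a (custom|semantic) property path."""
    if not equivalents:
        return query
    # Longest names first so a shorter name never shadows a longer one.
    names = sorted(equivalents, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w:-])(?:" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: f"({m.group(0)}|{equivalents[m.group(0)]})", query)
```

The rewritten query must also declare the `org:` prefix. The substitution does not skip string literals and is not idempotent, so it should be applied once to a freshly generated query.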
---
## Current State (January 2026)
- **SPARQLValidator**: Accepts both predicate systems ✅
- **SynonymResolver**: Uses OntologyLoader for type codes ✅
- **SchemaAwareSlotValidator**: Uses validation rules JSON ✅
- **Oxigraph**: Uses `hc:` namespace for data storage
---
## Related Files
| File | Purpose |
|------|---------|
| `backend/rag/template_sparql.py` | SPARQLValidator, OntologyLoader |
| `data/validation/sparql_validation_rules.json` | Enum definitions, mappings |
| `schemas/20251121/linkml/modules/slots/*.yaml` | LinkML slot definitions |
| `.opencode/rules/slot-centralization-and-semantic-uri-rule.md` | Rule 38 |
---
## References
- [W3C Organization Ontology](https://www.w3.org/TR/vocab-org/)
- [Schema.org](https://schema.org/)
- [SKOS](https://www.w3.org/TR/skos-reference/)
- [LinkML slot_uri documentation](https://linkml.io/linkml/schemas/uris-and-mappings.html)