glam/.opencode/SPARQL_PREDICATE_ARCHITECTURE.md
2026-01-08 15:56:28 +01:00

SPARQL Predicate Architecture

Overview

The GLAM RAG system uses two different predicate URI styles that coexist:

  1. LinkML Schema - Uses semantic URIs from base ontologies
  2. RAG SPARQL Queries - Uses custom hc: prefixed predicates

This document explains why this dual system exists and how it's handled.


The Two Predicate Systems

1. LinkML Schema Predicates (Semantic URIs)

The LinkML schema in schemas/20251121/linkml/ uses slot_uri properties that map to established ontology vocabularies:

| Slot | slot_uri | Ontology |
|------|----------|----------|
| custodian_type | org:classification | W3C Organization Ontology |
| settlement | schema:location | Schema.org |
| country | schema:addressCountry | Schema.org |
| name | skos:prefLabel | SKOS |

Rationale: Semantic interoperability with linked data ecosystems (Europeana, Wikidata, etc.)
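
As a rough illustration of what these slot_uri values resolve to, the sketch below expands each CURIE from the table into a full IRI. The namespace IRIs are the standard ones for org:, schema: and skos:; the schema may bind schema: to the https variant instead, so treat that entry as an assumption. The names PREFIXES, SLOT_URIS, expand_curie and slot_iri are hypothetical helpers, not part of the codebase.

```python
# Standard namespace IRIs for the prefixes in the table above.
# Assumption: schema: is bound to the http form; the project may use https.
PREFIXES = {
    "org": "http://www.w3.org/ns/org#",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Slot name -> slot_uri, copied from the table above.
SLOT_URIS = {
    "custodian_type": "org:classification",
    "settlement": "schema:location",
    "country": "schema:addressCountry",
    "name": "skos:prefLabel",
}


def expand_curie(curie: str) -> str:
    """Expand a prefix:local CURIE to a full IRI using PREFIXES."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


def slot_iri(slot: str) -> str:
    """Return the full predicate IRI that a slot serializes to."""
    return expand_curie(SLOT_URIS[slot])


if __name__ == "__main__":
    for slot in SLOT_URIS:
        print(f"{slot:15} {slot_iri(slot)}")
```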

2. RAG SPARQL Predicates (Custom hc: prefix)

The RAG system generates SPARQL queries using custom hc: prefixed predicates:

| Predicate | Purpose |
|-----------|---------|
| hc:institutionType | Filter by heritage type (M, L, A, G, etc.) |
| hc:settlementName | Filter by city name |
| hc:subregionCode | Filter by province/state (NL-NH, NL-GE) |
| hc:countryCode | Filter by country (ISO 3166-1 alpha-2) |
| hc:ghcid | Global Heritage Custodian Identifier |

Rationale: Simplified, consistent predicates for RAG query generation


Why Two Systems?

Historical Context

  1. LinkML Schema was designed for semantic web interoperability and RDF serialization
  2. RAG Queries evolved independently for efficient knowledge graph querying
  3. The Oxigraph knowledge graph stores data using the hc: namespace

Technical Trade-offs

| Aspect | Semantic URIs | Custom hc: URIs |
|--------|---------------|-----------------|
| Interoperability | Standards-compliant | Project-specific |
| Query Simplicity | Long URIs | Short, memorable |
| LLM Generation | Harder to generate | Easier patterns |
| Validation | LinkML tooling | ⚠️ Custom validation |

How SPARQLValidator Handles This

The SPARQLValidator class in backend/rag/template_sparql.py includes BOTH predicate systems:

def __init__(self):
    # 1. Core RAG predicates (always included)
    hc_predicates = set(self._FALLBACK_HC_PREDICATES)
    
    # 2. Schema predicates from OntologyLoader (semantic URIs)
    # `ontology` is assumed to be an OntologyLoader instance obtained
    # earlier (its construction is not shown here)
    schema_predicates = ontology.get_predicates()
    if schema_predicates:
        hc_predicates = hc_predicates | schema_predicates
    
    # 3. External predicates (base ontology URIs)
    # The final set is the union of all three categories
    self._all_predicates = hc_predicates | self.VALID_EXTERNAL_PREDICATES

Predicate Categories

| Category | Count | Source |
|----------|-------|--------|
| Core RAG predicates | 12 | _FALLBACK_HC_PREDICATES |
| Schema predicates | 286 | OntologyLoader (LinkML) |
| External predicates | ~40 | VALID_EXTERNAL_PREDICATES |
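
The three-way merge above can be sketched as a self-contained class. This is an illustration under stated assumptions, not the real SPARQLValidator: the predicate sets are small placeholder subsets, schema predicates are passed in rather than loaded through OntologyLoader, and unknown_predicates is a hypothetical method. For simplicity it treats every prefixed name in the query as a candidate predicate, so class names in object position would be checked too.

```python
import re

# Matches prefixed names such as hc:institutionType or skos:prefLabel.
_PREFIXED_NAME = re.compile(r"\b[A-Za-z][\w-]*:[A-Za-z_][\w-]*")


class SketchValidator:
    """Illustrative stand-in for SPARQLValidator (names are assumptions)."""

    # Placeholder subsets; the real sets live in template_sparql.py.
    _FALLBACK_HC_PREDICATES = frozenset({
        "hc:institutionType", "hc:settlementName", "hc:subregionCode",
        "hc:countryCode", "hc:ghcid",
    })
    VALID_EXTERNAL_PREDICATES = frozenset({"rdfs:label"})

    def __init__(self, schema_predicates=None):
        # 1. Core RAG predicates are always accepted.
        predicates = set(self._FALLBACK_HC_PREDICATES)
        # 2. Schema predicates are merged in when available.
        if schema_predicates:
            predicates |= set(schema_predicates)
        # 3. External predicates complete the accepted set.
        self._all_predicates = predicates | self.VALID_EXTERNAL_PREDICATES

    def unknown_predicates(self, query: str) -> list:
        """Return prefixed names in the query that are not accepted."""
        # Drop string literals and <...> IRIs so their contents are ignored.
        stripped = re.sub(r'"[^"]*"', '""', query)
        stripped = re.sub(r"<[^>]*>", "", stripped)
        found = set(_PREFIXED_NAME.findall(stripped))
        return sorted(found - self._all_predicates)


if __name__ == "__main__":
    validator = SketchValidator(schema_predicates={"org:classification"})
    query = (
        "PREFIX hc: <https://example.org/hc/>\n"
        'SELECT ?s WHERE { ?s hc:institutionType "M" ; hc:bogus ?x . }'
    )
    print(validator.unknown_predicates(query))
```

Because the accepted set is a plain union, a query may mix hc: predicates and semantic URIs freely; the validator never checks that a query uses one system consistently.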

Future Considerations

Option A: Migrate to Semantic URIs

  1. Update Oxigraph data to use semantic URIs
  2. Update RAG query templates to use semantic predicates such as org:classification
  3. Deprecate custom hc: predicates

Pros: Single source of truth, better interoperability
Cons: Migration effort, breaking changes

Option B: Maintain Dual System

  1. Keep custom hc: predicates for RAG queries
  2. Add URI mapping layer in Oxigraph (CONSTRUCT queries)
  3. Document both systems

Pros: No breaking changes
Cons: Ongoing maintenance, potential confusion
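
One way the mapping layer in step 2 might look is a CONSTRUCT query generated from a predicate mapping, sketched below. Only the hc:institutionType to org:classification pair is stated in this document; any further pairs would need to be confirmed against the schema. HC_NS is a placeholder for the real hc: namespace, and build_construct_query is a hypothetical helper.

```python
# Placeholder: the real IRI bound to the hc: prefix is not given here.
HC_NS = "https://example.org/hc/"

PREFIXES = {
    "hc": HC_NS,
    "org": "http://www.w3.org/ns/org#",
}

# hc: predicate -> semantic predicate. Only this pair appears in the document.
HC_TO_SEMANTIC = {"hc:institutionType": "org:classification"}


def build_construct_query(mapping: dict) -> str:
    """Build a CONSTRUCT query that re-emits hc: triples with semantic URIs."""
    used = sorted({name.split(":")[0] for pair in mapping.items() for name in pair})
    prologue = "\n".join(f"PREFIX {p}: <{PREFIXES[p]}>" for p in used)
    # One template triple and one UNION branch per mapped predicate.
    # Template triples whose variables are unbound are skipped by SPARQL,
    # so each branch only produces its own output triple.
    template = "\n".join(
        f"  ?s {sem} ?o{i} ." for i, sem in enumerate(mapping.values())
    )
    branches = "\n  UNION\n".join(
        f"  {{ ?s {hc} ?o{i} }}" for i, hc in enumerate(mapping.keys())
    )
    return f"{prologue}\nCONSTRUCT {{\n{template}\n}}\nWHERE {{\n{branches}\n}}"


if __name__ == "__main__":
    print(build_construct_query(HC_TO_SEMANTIC))
```

The constructed triples could be served as a derived view for external consumers while the RAG queries keep reading the stored hc: triples.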

Option C: Namespace Aliasing

Configure Oxigraph to treat hc:institutionType as equivalent to org:classification:

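# Turtle axiom declaring the two properties equivalent. This is an OWL
# statement, not a SPARQL property path; it only affects query results if
# a reasoner or a query-rewriting step applies it.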
hc:institutionType owl:equivalentProperty org:classification .

Pros: Transparent to RAG system
Cons: Reasoning overhead, complexity
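
If no reasoner is available, a similar effect could be approximated at query time by rewriting each aliased predicate into a SPARQL 1.1 alternation path. The sketch below is one possible approach, not something the project implements. EQUIVALENTS contains only the pair named above and expand_equivalents is a hypothetical helper. It does plain text substitution, so it would also touch matches inside string literals, and the caller must still declare the org: prefix in the query.

```python
import re

# hc: predicate -> equivalent semantic predicate (the pair named above).
EQUIVALENTS = {"hc:institutionType": "org:classification"}


def expand_equivalents(query: str, equivalents: dict = EQUIVALENTS) -> str:
    """Rewrite aliased predicates as (hc:x|other:y) alternation paths."""
    if not equivalents:
        return query
    # Longest names first so one name cannot shadow a longer one.
    names = sorted(equivalents, key=len, reverse=True)
    pattern = re.compile("|".join(rf"\b{re.escape(n)}\b" for n in names))

    def repl(match):
        name = match.group(0)
        return f"({name}|{equivalents[name]})"

    return pattern.sub(repl, query)


if __name__ == "__main__":
    print(expand_equivalents('?s hc:institutionType "M" .'))
```

Alternation paths are valid only in query patterns, so a rewrite like this applies to WHERE clauses and not to CONSTRUCT templates or update data.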


Current State (January 2026)

  • SPARQLValidator: Accepts both predicate systems
  • SynonymResolver: Uses OntologyLoader for type codes
  • SchemaAwareSlotValidator: Uses validation rules JSON
  • Oxigraph: Uses hc: namespace for data storage

| File | Purpose |
|------|---------|
| backend/rag/template_sparql.py | SPARQLValidator, OntologyLoader |
| data/validation/sparql_validation_rules.json | Enum definitions, mappings |
| schemas/20251121/linkml/modules/slots/*.yaml | LinkML slot definitions |
| .opencode/rules/slot-centralization-and-semantic-uri-rule.md | Rule 38 |

References