glam/.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md
2025-12-14 17:09:55 +01:00

Custodian YAML Files Are the Single Source of Truth

Rule: AGENTS.md Rule 22
Status: ACTIVE
Created: 2025-12-12

Summary

The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.

The Data Hierarchy

data/custodian/*.yaml          <- SINGLE SOURCE OF TRUTH (edit this!)
       |
       v
     +------------+----------+------------+------------+
     |            |          |            |            |
     v            v          v            v            v
Ducklake     PostgreSQL   TypeDB    Oxigraph       Qdrant
(analytics)  (geo API)    (graph)   (RDF/SPARQL)   (vector)
       |
       v
REST API responses             <- DERIVED (serve from databases)
       |
       v
Frontend display               <- DERIVED (render from API)

ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.

The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| Ducklake | Analytics, aggregations | Import from YAML → Query |
| PostgreSQL | Geographic API, PostGIS | Import from YAML → Serve API |
| TypeDB | Graph queries, relationships | Import from YAML → Graph traversal |
| Oxigraph | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| Qdrant | Vector search, semantic | Import from YAML → Similarity search |

Critical Rule: No Database-Level Enrichment

🚨 DATABASES MUST NEVER:

  • Add new fields not present in YAML
  • Modify existing data without a corresponding YAML update
  • Store enrichment results directly
  • Create "derived" or "computed" fields that aren't in YAML

DATABASES SHOULD ONLY:

  • Import/sync data FROM custodian YAML files
  • Serve data that exists in YAML
  • Provide specialized query capabilities (spatial, graph, vector, SPARQL)
  • Create indexes for performance (not new data)
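The one-way flow described above can be sketched as a rebuild step that recreates a derived table from already-loaded custodian records. This is a minimal illustration only: SQLite stands in for the real backends, the `institutions` table layout and the function name `rebuild_institutions` are hypothetical, and the records are assumed to be dicts shaped like the example YAML later in this document (`ghcid.ghcid_current`, `name`, `location.city`).

```python
import sqlite3


def rebuild_institutions(conn: sqlite3.Connection, custodians: list) -> int:
    """Recreate the derived table using only values present in the YAML records."""
    conn.execute("DROP TABLE IF EXISTS institutions")
    conn.execute(
        "CREATE TABLE institutions (ghcid TEXT PRIMARY KEY, name TEXT, city TEXT)"
    )
    rows = [
        (
            rec["ghcid"]["ghcid_current"],
            rec.get("name"),
            (rec.get("location") or {}).get("city"),
        )
        for rec in custodians
    ]
    conn.executemany("INSERT INTO institutions VALUES (?, ?, ?)", rows)
    # An index improves lookup speed but introduces no new data.
    conn.execute("CREATE INDEX idx_institutions_city ON institutions (city)")
    conn.commit()
    return len(rows)
```

Because the table is dropped and rebuilt on every run, anything written to it outside the YAML files disappears at the next import, which is the behavior the rule asks for.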

What This Means for AI Agents

When Enriching Data

  1. ALWAYS write enrichment data to the custodian YAML file first
  2. NEVER write data directly to the database without updating YAML
  3. VALIDATE data quality before writing (for social media links, see .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md)
  4. VERIFY that API responses match YAML file contents

File Location

Custodian files are located at:

data/custodian/{GHCID}.yaml

Example: data/custodian/NL-NH-MID-M-AMW.yaml
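
As a small sketch, the path convention above can be wrapped in a helper so that every pipeline resolves files the same way. The name `custodian_path` and the checks on the GHCID are assumptions made for illustration; the actual GHCID format rules are not defined in this document, so the helper only rejects values that would point outside the directory.

```python
from pathlib import Path

CUSTODIAN_DIR = Path("data/custodian")


def custodian_path(ghcid: str, base_dir: Path = CUSTODIAN_DIR) -> Path:
    """Return the YAML path for a GHCID, rejecting values that could escape base_dir."""
    if not ghcid or "/" in ghcid or "\\" in ghcid or ghcid in {".", ".."}:
        raise ValueError(f"not a usable GHCID: {ghcid!r}")
    return base_dir / f"{ghcid}.yaml"
```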

Data Categories

All of these MUST be stored in custodian YAML files:

Category YAML Section Example
Basic metadata Root level name, ghcid, institution_type
Location location: or locations: city, country, coordinates
Identifiers identifiers: isil_code, wikidata_id
Social media social_media: facebook, twitter, instagram
Opening hours opening_hours: Day-by-day schedule
Contact info Root level phone, email, website
Google Maps data google_maps_enrichment: place_id, rating, reviews
Wikidata data wikidata_enrichment: Claims, sitelinks
Web scrape data web_enrichment: Scraped metadata

Correct Enrichment Workflow

import yaml

def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: always update the YAML file first."""

    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, "r", encoding="utf-8") as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    # (validate_enrichment is assumed to be defined elsewhere in the pipeline)
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record
    # NOTE: dict.update() replaces whole top-level sections; use a deep,
    # additive merge where existing keys must be preserved (Rule 5).
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    with open(yaml_path, "w", encoding="utf-8") as f:
        yaml.dump(custodian, f, default_flow_style=False,
                  allow_unicode=True, sort_keys=False)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)

    return custodian
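
Step 3 of the workflow above uses `dict.update()`, which replaces an entire top-level section (for example `social_media`) if the enrichment contains the same key. Where enrichment has to stay additive, a recursive merge along the following lines may be a better fit. This is a sketch under assumptions: the name `merge_additive` is hypothetical, existing values always win, and an existing `None` is treated as a gap that may be filled.

```python
import copy


def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with keys from `new` added, never overwriting."""
    merged = copy.deepcopy(existing)
    for key, value in new.items():
        if key not in merged or merged[key] is None:
            # Missing (or null) in the record: safe to add.
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # Both sides are sections: merge them key by key.
            merged[key] = merge_additive(merged[key], value)
        # Otherwise the existing value is kept; conflicts are left for review.
    return merged
```

With this helper, `custodian = merge_additive(custodian, validated_data)` could take the place of the `update()` call without touching values already in the file.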

Anti-Pattern: Ghost Data

Ghost data is data that appears in API responses but doesn't exist in the custodian YAML file.

How Ghost Data Happens

# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid)  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!

Detecting Ghost Data

# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}

# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)

# Conclusion: Ghost data detected! API has data that YAML doesn't.
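
The manual check above could also be automated by comparing an API response with the corresponding YAML record once both are loaded as dicts. The sketch below assumes that the API response mirrors the YAML structure key for key, which this document does not guarantee; `find_ghost_fields` is a hypothetical helper name.

```python
def find_ghost_fields(api_record: dict, yaml_record: dict, prefix: str = "") -> list:
    """List dotted paths whose value in the API response has no match in the YAML."""
    ghosts = []
    for key, api_value in api_record.items():
        path = f"{prefix}{key}"
        if key not in yaml_record:
            # The API serves a field that the YAML file does not contain.
            ghosts.append(path)
        elif isinstance(api_value, dict) and isinstance(yaml_record[key], dict):
            ghosts.extend(find_ghost_fields(api_value, yaml_record[key], f"{path}."))
        elif api_value != yaml_record[key]:
            # Same field, different value: the database has drifted from YAML.
            ghosts.append(path)
    return ghosts
```

An empty list means every field in the response can be traced back to the YAML file; any returned path is a candidate for the resolution steps that follow.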

Resolving Ghost Data

  1. If data is valid: Add it to the YAML file
  2. If data is invalid: Remove it from the database
  3. NEVER leave ghost data in a database without a YAML source

Validation Requirements

Before writing any data to custodian YAML:

  1. Check for empty/null values - Don't write empty strings
  2. Validate URLs - Ensure they're well-formed
  3. Validate social media - See .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md
  4. Check for duplicates - Don't add duplicate entries
  5. Preserve existing data - Enrichment is additive (Rule 5)
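
Checks 1, 2, 4 and 5 above can be sketched for a `social_media` section as follows. The helper names are hypothetical, and the platform-specific rules from .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md (check 3) are not reproduced here; this only filters out empty, malformed, duplicate, or already-present entries.

```python
from urllib.parse import urlparse


def is_well_formed_url(value) -> bool:
    """True for http(s) URLs that have a host part."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_social_media(candidates: dict, existing: dict) -> dict:
    """Return only the candidate links that are safe to add to `existing`."""
    accepted = {}
    seen_urls = {u for u in existing.values() if isinstance(u, str)}
    for platform, url in candidates.items():
        if not isinstance(url, str) or not url.strip():
            continue  # check 1: skip empty or null values
        url = url.strip()
        if not is_well_formed_url(url):
            continue  # check 2: skip malformed URLs
        if platform in existing or url in seen_urls:
            continue  # checks 4 and 5: no duplicates, never overwrite
        accepted[platform] = url
        seen_urls.add(url)
    return accepted
```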

Example YAML Structure

# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM

location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852

social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum

opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"

google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"

provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"

Related Rules

  • Rule 5: Data enrichment is ADDITIVE ONLY
  • Rule 21: Data Fabrication is Strictly Prohibited
  • Rule 23: Social Media Link Validation

See Also

  • .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md - Social media validation rules
  • .opencode/DATA_PRESERVATION_RULES.md - Data preservation guidelines
  • AGENTS.md - Complete agent instructions