kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00

6.5 KiB


Custodian YAML Files Are the Single Source of Truth

Rule: AGENTS.md Rule 22
Status: ACTIVE
Created: 2025-12-12

Summary

The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.

The Data Hierarchy

data/custodian/*.yaml          <- SINGLE SOURCE OF TRUTH (edit this!)
       |
       v
+-----------+-----------+---------+------------+
|           |           |         |            |
v           v           v         v            v
Ducklake    PostgreSQL  TypeDB    Oxigraph     Qdrant
(analytics) (geo API)   (graph)   (RDF/SPARQL) (vector)
       |
       v
REST API responses             <- DERIVED (serve from databases)
       |
       v
Frontend display               <- DERIVED (render from API)

ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.

The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| Ducklake | Analytics, aggregations | Import from YAML → Query |
| PostgreSQL | Geographic API, PostGIS | Import from YAML → Serve API |
| TypeDB | Graph queries, relationships | Import from YAML → Graph traversal |
| Oxigraph | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| Qdrant | Vector search, semantic | Import from YAML → Similarity search |
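The one-way flow in the table can be sketched as a full rebuild of a derived table from already-loaded YAML records. This is only an illustration under stated assumptions: `sqlite3` stands in for the real backends, and the `institutions` table, its columns, and `rebuild_derived_table` are hypothetical names, not the project's actual import code.

```python
import sqlite3


def rebuild_derived_table(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Drop and recreate a derived table from YAML-loaded records.

    `records` are dicts as produced by yaml.safe_load for each custodian file.
    The table receives no data from any other source.
    """
    conn.execute("DROP TABLE IF EXISTS institutions")
    conn.execute(
        "CREATE TABLE institutions (ghcid TEXT PRIMARY KEY, name TEXT, city TEXT)"
    )
    rows = [
        (
            rec["ghcid"]["ghcid_current"],
            rec.get("name"),
            (rec.get("location") or {}).get("city"),
        )
        for rec in records
    ]
    conn.executemany("INSERT INTO institutions VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical record mirroring the example YAML structure in this document
records = [
    {
        "name": "Rijksmuseum",
        "ghcid": {"ghcid_current": "NL-NH-AMS-M-RM"},
        "location": {"city": "Amsterdam", "country": "NL"},
    }
]
conn = sqlite3.connect(":memory:")
imported = rebuild_derived_table(conn, records)
```

Rebuilding from scratch, rather than patching rows in place, is one way to guarantee that nothing survives in the derived store unless it came from YAML.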

Critical Rule: No Database-Level Enrichment

🚨 DATABASES MUST NEVER:

Add new fields not present in YAML
Modify existing data without YAML update
Store enrichment results directly
Create "derived" or "computed" fields that aren't in YAML

✅ DATABASES SHOULD ONLY:

Import/sync data FROM custodian YAML files
Serve data that exists in YAML
Provide specialized query capabilities (spatial, graph, vector, SPARQL)
Create indexes for performance (not new data)

What This Means for AI Agents

When Enriching Data

ALWAYS write enrichment data to the custodian YAML file first
NEVER write data directly to the database without updating YAML
VALIDATE data quality before writing (see Validation Requirements and .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md)
VERIFY that API responses match YAML file contents

File Location

Custodian files are located at:

data/custodian/{GHCID}.yaml

Example: data/custodian/NL-NH-MID-M-AMW.yaml
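A small helper can keep path construction in one place. The sketch below assumes only the `data/custodian/{GHCID}.yaml` layout stated above; `custodian_path` and `CUSTODIAN_DIR` are hypothetical names, and the check is a basic guard against path separators, not a validator for the GHCID format itself.

```python
from pathlib import Path

CUSTODIAN_DIR = Path("data/custodian")


def custodian_path(ghcid: str, base_dir: Path = CUSTODIAN_DIR) -> Path:
    """Return the YAML path for a GHCID, rejecting values that could escape base_dir."""
    if not ghcid or "/" in ghcid or "\\" in ghcid or ghcid in {".", ".."}:
        raise ValueError(f"invalid GHCID: {ghcid!r}")
    return base_dir / f"{ghcid}.yaml"


path = custodian_path("NL-NH-MID-M-AMW")
# path.as_posix() == "data/custodian/NL-NH-MID-M-AMW.yaml"
```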

Data Categories

All of these MUST be stored in custodian YAML files:

| Category | YAML Section | Example |
|----------|--------------|---------|
| Basic metadata | Root level | `name`, `ghcid`, `institution_type` |
| Location | `location:` or `locations:` | `city`, `country`, `coordinates` |
| Identifiers | `identifiers:` | `isil_code`, `wikidata_id` |
| Social media | `social_media:` | `facebook`, `twitter`, `instagram` |
| Opening hours | `opening_hours:` | Day-by-day schedule |
| Contact info | Root level | `phone`, `email`, `website` |
| Google Maps data | `google_maps_enrichment:` | `place_id`, `rating`, `reviews` |
| Wikidata data | `wikidata_enrichment:` | Claims, sitelinks |
| Web scrape data | `web_enrichment:` | Scraped metadata |

Correct Enrichment Workflow

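import yaml


def validate_enrichment(enrichment_data: dict) -> dict:
    """Placeholder validator: drops empty/null values only.

    Replace with the project's real checks (see Validation Requirements).
    """
    return {k: v for k, v in enrichment_data.items() if v not in (None, "", [], {})}


def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: always update YAML first."""

    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, "r", encoding="utf-8") as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record.
    # NOTE: dict.update() replaces existing top-level keys wholesale, so
    # validated_data should carry only new keys or complete sections (Rule 5).
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    # sort_keys=False keeps the existing key order instead of alphabetizing it.
    with open(yaml_path, "w", encoding="utf-8") as f:
        yaml.dump(custodian, f, default_flow_style=False,
                  allow_unicode=True, sort_keys=False)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)

    return custodian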
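Step 3 of the workflow above merges at the top level only. When an enrichment targets a nested section such as `social_media:`, a recursive additive merge may be closer to the intent of Rule 5. The following is a minimal sketch, assuming that "additive" means existing values are never overwritten; `merge_additive` is a hypothetical helper, not part of the project.

```python
import copy


def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with keys from `new` added.

    Values already present are never overwritten; nested mappings are
    merged key by key.
    """
    merged = copy.deepcopy(existing)
    for key, value in new.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_additive(merged[key], value)
        # Otherwise the key already holds a value: keep the existing one.
    return merged


custodian = {
    "name": "Rijksmuseum",
    "social_media": {"facebook": "https://www.facebook.com/rijksmuseum/"},
}
enrichment = {
    "name": "Some other name",
    "social_media": {"instagram": "https://www.instagram.com/rijksmuseum/"},
}
result = merge_additive(custodian, enrichment)
# result["name"] is still "Rijksmuseum"; social_media now has both links.
```

Returning a copy rather than mutating in place keeps the original record available for comparison before the YAML file is rewritten.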

Anti-Pattern: Ghost Data

Ghost data is data that appears in API responses but doesn't exist in the custodian YAML file.

How Ghost Data Happens

# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid)  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!

Detecting Ghost Data

# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}

# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)

# Conclusion: Ghost data detected! API has data that YAML doesn't.
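The same comparison can be automated once both sides are loaded as dictionaries. This sketch assumes the API response and the YAML record share key names, which may not hold if the API reshapes fields; `find_ghost_fields` is a hypothetical helper. It also reports values that differ, since API responses are expected to match the YAML contents.

```python
def find_ghost_fields(api_data: dict, yaml_data: dict, prefix: str = "") -> list[str]:
    """List dotted paths in api_data that are missing from, or differ from, yaml_data."""
    ghosts = []
    for key, api_value in api_data.items():
        path = f"{prefix}{key}"
        if key not in yaml_data:
            ghosts.append(path)  # present in API, absent from YAML
        elif isinstance(api_value, dict) and isinstance(yaml_data[key], dict):
            ghosts.extend(find_ghost_fields(api_value, yaml_data[key], f"{path}."))
        elif api_value != yaml_data[key]:
            ghosts.append(path)  # value does not match the YAML source
    return ghosts


# Hypothetical data reproducing the scenario above
api_response = {
    "name": "Example Museum",
    "social_media": {"facebook": "https://www.facebook.com/facebook"},
}
yaml_record = {"name": "Example Museum"}
ghosts = find_ghost_fields(api_response, yaml_record)
# ghosts == ["social_media"]
```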

Resolving Ghost Data

If data is valid: Add it to the YAML file
If data is invalid: Remove it from the database
NEVER: Leave ghost data in database without YAML source

Validation Requirements

Before writing any data to custodian YAML:

Check for empty/null values - Don't write empty strings
Validate URLs - Ensure they're well-formed
Validate social media - See .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md
Check for duplicates - Don't add duplicate entries
Preserve existing data - Enrichment is additive (Rule 5)
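The first, second, and fourth checks can be sketched with the standard library alone. This is an illustration, not the project's validator: `is_well_formed_url` and `clean_enrichment` are hypothetical names, and the URL check covers well-formedness only. It would still accept a generic link such as https://www.facebook.com/facebook, so the social media rules in .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md remain a separate step.

```python
from urllib.parse import urlparse


def is_well_formed_url(value: str) -> bool:
    """True for http(s) URLs that have a host; does not check reachability."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def clean_enrichment(data: dict, url_fields: set[str]) -> dict:
    """Drop empty values, malformed URLs in url_fields, and duplicate list items."""
    cleaned = {}
    for key, value in data.items():
        if value is None or value == "" or value == [] or value == {}:
            continue  # never write empty/null values
        if key in url_fields and not (isinstance(value, str) and is_well_formed_url(value)):
            continue  # skip malformed URLs
        if isinstance(value, list):
            deduped = []
            for item in value:
                if item not in deduped:
                    deduped.append(item)  # keep first occurrence, preserve order
            value = deduped
        cleaned[key] = value
    return cleaned


candidate = {
    "facebook": "https://www.facebook.com/rijksmuseum/",
    "twitter": "",
    "instagram": "not a url",
    "tags": ["museum", "museum", "art"],
}
cleaned = clean_enrichment(candidate, {"facebook", "twitter", "instagram"})
# cleaned keeps only "facebook" and the de-duplicated "tags" list.
```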

Example YAML Structure

# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM

location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852

social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum

opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"

google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"

provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"

Related Rules

Rule 5: Data enrichment is ADDITIVE ONLY
Rule 21: Data Fabrication is Strictly Prohibited
Rule 23: Social Media Link Validation
