# Custodian YAML Files Are the Single Source of Truth

**Rule**: AGENTS.md Rule 22

**Status**: ACTIVE

**Created**: 2025-12-12

## Summary

The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.

## The Data Hierarchy

```
data/custodian/*.yaml               <- SINGLE SOURCE OF TRUTH (edit this!)
                            |
                            v
    +------------+----------+-----------+-----------+
    |            |          |           |           |
    v            v          v           v           v
Ducklake     PostgreSQL  TypeDB    Oxigraph      Qdrant
(analytics)  (geo API)   (graph)   (RDF/SPARQL)  (vector)
                            |
                            v
REST API responses                  <- DERIVED (serve from databases)
                            |
                            v
Frontend display                    <- DERIVED (render from API)
```

**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**

### The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |

### Critical Rule: No Database-Level Enrichment

**🚨 DATABASES MUST NEVER:**
- Add new fields not present in YAML
- Modify existing data without YAML update
- Store enrichment results directly
- Create "derived" or "computed" fields that aren't in YAML

**✅ DATABASES SHOULD ONLY:**
- Import/sync data FROM custodian YAML files
- Serve data that exists in YAML
- Provide specialized query capabilities (spatial, graph, vector, SPARQL)
- Create indexes for performance (not new data)

## What This Means for AI Agents

### When Enriching Data

1. **ALWAYS** write enrichment data to the custodian YAML file first
2. **NEVER** write data directly to the database without updating YAML
3. **VALIDATE** data quality before writing (see `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`)
4. **VERIFY** that API responses match YAML file contents

### File Location

Custodian files are located at:

```
data/custodian/{GHCID}.yaml
```

Example: `data/custodian/NL-NH-MID-M-AMW.yaml`

## Data Categories

All of these MUST be stored in custodian YAML files:

| Category | YAML Section | Example |
|----------|--------------|---------|
| Basic metadata | Root level | `name`, `ghcid`, `institution_type` |
| Location | `location:` or `locations:` | `city`, `country`, `coordinates` |
| Identifiers | `identifiers:` | `isil_code`, `wikidata_id` |
| Social media | `social_media:` | `facebook`, `twitter`, `instagram` |
| Opening hours | `opening_hours:` | Day-by-day schedule |
| Contact info | Root level | `phone`, `email`, `website` |
| Google Maps data | `google_maps_enrichment:` | `place_id`, `rating`, `reviews` |
| Wikidata data | `wikidata_enrichment:` | Claims, sitelinks |
| Web scrape data | `web_enrichment:` | Scraped metadata |

## Correct Enrichment Workflow

```python
import yaml

def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: Always update YAML first."""
    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, 'r', encoding='utf-8') as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    # (validate_enrichment is assumed to be defined elsewhere; it is not part of this snippet)
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record
    # (dict.update is a shallow merge: a section in validated_data replaces
    # the existing section of the same name wholesale)
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    with open(yaml_path, 'w', encoding='utf-8') as f:
        yaml.dump(custodian, f, default_flow_style=False, allow_unicode=True)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)

    return custodian
```

## Anti-Pattern: Ghost Data

**Ghost data** is data that appears in API responses but doesn't exist in the custodian YAML file.
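To make the definition concrete, here is a minimal sketch of a ghost-data check. It assumes the API response mirrors the key structure of the YAML record, which may not hold for every endpoint. The function name `find_ghost_fields` and the sample records are hypothetical and not part of the project.

```python
from __future__ import annotations

from typing import Any


def find_ghost_fields(
    api_data: dict[str, Any], yaml_data: dict[str, Any], prefix: str = ""
) -> list[str]:
    """Return dotted paths of values the API serves that the YAML record lacks.

    Hypothetical helper: assumes both inputs use the same key names.
    """
    ghosts: list[str] = []
    for key, api_value in api_data.items():
        path = f"{prefix}{key}"
        if key not in yaml_data:
            # The API has a field with no YAML source at all.
            ghosts.append(path)
        elif isinstance(api_value, dict) and isinstance(yaml_data[key], dict):
            # Compare nested sections key by key.
            ghosts.extend(find_ghost_fields(api_value, yaml_data[key], f"{path}."))
        elif api_value != yaml_data[key]:
            # The field exists in YAML but the API serves a different value.
            ghosts.append(path)
    return ghosts


# Hypothetical sample data for illustration only.
yaml_record = {"name": "Example Museum", "location": {"city": "Amsterdam"}}
api_response = {
    "name": "Example Museum",
    "location": {"city": "Amsterdam"},
    "social_media": {"facebook": "https://www.facebook.com/facebook"},
}

print(find_ghost_fields(api_response, yaml_record))  # ['social_media']
```

An empty result means every value in the response can be traced back to the YAML record. A non-empty result lists the paths to investigate.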
### How Ghost Data Happens

```python
# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid)  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!
```

### Detecting Ghost Data

```bash
# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}

# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)

# Conclusion: Ghost data detected! API has data that YAML doesn't.
```

### Resolving Ghost Data

1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **NEVER**: Leave ghost data in database without YAML source

## Validation Requirements

Before writing any data to custodian YAML:

1. **Check for empty/null values** - Don't write empty strings
2. **Validate URLs** - Ensure they're well-formed
3. **Validate social media** - See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`
4. **Check for duplicates** - Don't add duplicate entries
5. **Preserve existing data** - Enrichment is additive (Rule 5)

## Example YAML Structure

```yaml
# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM

location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852

social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum

opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"

google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"

provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"
```

## Related Rules

- **Rule 5**: Data enrichment is ADDITIVE ONLY
- **Rule 21**: Data Fabrication is Strictly Prohibited
- **Rule 23**: Social Media Link Validation

## See Also

- `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` - Social media validation rules
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation guidelines
- `AGENTS.md` - Complete agent instructions
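
For reference, the first two checks under Validation Requirements (no empty values, well-formed URLs) can be sketched as a `validate_enrichment` helper of the kind the enrichment workflow calls. This is an illustrative assumption, not the project's actual implementation. The `URL_FIELDS` set is hypothetical, built from field names used in this document. The social media rules in `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` and the duplicate check are not covered.

```python
from urllib.parse import urlparse

# Hypothetical set of fields expected to hold URLs (names taken from this document).
URL_FIELDS = {"website", "facebook", "twitter", "instagram", "youtube"}


def is_well_formed_url(value: str) -> bool:
    """Syntax check only: http(s) scheme plus a host. Does not test reachability."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_enrichment(enrichment_data: dict) -> dict:
    """Return a copy without empty values or malformed URLs (illustrative sketch)."""
    cleaned = {}
    for key, value in enrichment_data.items():
        if isinstance(value, dict):
            nested = validate_enrichment(value)
            if nested:  # skip sections that end up empty
                cleaned[key] = nested
        elif value is None or (isinstance(value, str) and not value.strip()):
            continue  # requirement 1: no empty or null values
        elif key in URL_FIELDS and not (
            isinstance(value, str) and is_well_formed_url(value)
        ):
            continue  # requirement 2: URL fields must be well-formed
        else:
            cleaned[key] = value
    return cleaned


# Hypothetical input for illustration only.
sample = {
    "website": "https://example.org/",
    "phone": "",
    "social_media": {"facebook": "not a url", "twitter": None},
}

print(validate_enrichment(sample))  # {'website': 'https://example.org/'}
```

Dropping invalid values rather than raising keeps a single bad field from blocking the rest of an enrichment. A stricter pipeline might prefer to raise and log instead.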