Custodian YAML Files Are the Single Source of Truth
Rule: AGENTS.md Rule 22
Status: ACTIVE
Created: 2025-12-12
Summary
The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.
The Data Hierarchy
data/custodian/*.yaml            <- SINGLE SOURCE OF TRUTH (edit this!)
      |
      v
+------------+-----------+---------+--------------+
|            |           |         |              |
v            v           v         v              v
Ducklake     PostgreSQL  TypeDB    Oxigraph       Qdrant
(analytics)  (geo API)   (graph)   (RDF/SPARQL)   (vector)
      |
      v
REST API responses               <- DERIVED (serve from databases)
      |
      v
Frontend display                 <- DERIVED (render from API)
ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.
The Five Database Backends
| Database | Purpose | Data Flow |
|---|---|---|
| Ducklake | Analytics, aggregations | Import from YAML → Query |
| PostgreSQL | Geographic API, PostGIS | Import from YAML → Serve API |
| TypeDB | Graph queries, relationships | Import from YAML → Graph traversal |
| Oxigraph | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| Qdrant | Vector search, semantic | Import from YAML → Similarity search |
Critical Rule: No Database-Level Enrichment
🚨 DATABASES MUST NEVER:
- Add new fields not present in YAML
- Modify existing data without YAML update
- Store enrichment results directly
- Create "derived" or "computed" fields that aren't in YAML
✅ DATABASES SHOULD ONLY:
- Import/sync data FROM custodian YAML files
- Serve data that exists in YAML
- Provide specialized query capabilities (spatial, graph, vector, SPARQL)
- Create indexes for performance (not new data)
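One way to honour these rules in an import script is to build each database row as a pure projection of the YAML record. The sketch below is illustrative only: `project_from_yaml` and the `wanted` column list are hypothetical names, not part of any existing pipeline.

```python
from typing import Any


def project_from_yaml(custodian: dict[str, Any], wanted: list[str]) -> dict[str, Any]:
    """Build a database row using only top-level values already present in the YAML record."""
    # Keys missing from the YAML record are skipped; they are never
    # filled with defaults or computed values on the database side.
    return {key: custodian[key] for key in wanted if key in custodian}


record = {"name": "Rijksmuseum", "institution_type": "MUSEUM"}
row = project_from_yaml(record, ["name", "institution_type", "social_media"])
```

Here `row` contains only `name` and `institution_type`; the absent `social_media` section is simply left out rather than invented.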
What This Means for AI Agents
When Enriching Data
- ALWAYS write enrichment data to the custodian YAML file first
- NEVER write data directly to the database without updating YAML
- VALIDATE data quality before writing (see social media validation)
- VERIFY that API responses match YAML file contents
File Location
Custodian files are located at:
data/custodian/{GHCID}.yaml
Example: data/custodian/NL-NH-MID-M-AMW.yaml
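A small helper can keep this path convention in one place. This is a minimal sketch: `custodian_path` and `CUSTODIAN_DIR` are hypothetical names, and the check for path separators is an added safety assumption rather than a documented requirement.

```python
from pathlib import Path

CUSTODIAN_DIR = Path("data/custodian")  # directory named in this rule


def custodian_path(ghcid: str) -> Path:
    """Return the custodian YAML path for a GHCID."""
    # Assumption: a GHCID never contains path separators or "..",
    # so such values are rejected instead of resolved.
    if not ghcid or "/" in ghcid or "\\" in ghcid or ".." in ghcid:
        raise ValueError(f"Invalid GHCID: {ghcid!r}")
    return CUSTODIAN_DIR / f"{ghcid}.yaml"


path = custodian_path("NL-NH-MID-M-AMW")
```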
Data Categories
All of these MUST be stored in custodian YAML files:
| Category | YAML Section | Example |
|---|---|---|
| Basic metadata | Root level | name, ghcid, institution_type |
| Location | location: or locations: | city, country, coordinates |
| Identifiers | identifiers: | isil_code, wikidata_id |
| Social media | social_media: | facebook, twitter, instagram |
| Opening hours | opening_hours: | Day-by-day schedule |
| Contact info | Root level | phone, email, website |
| Google Maps data | google_maps_enrichment: | place_id, rating, reviews |
| Wikidata data | wikidata_enrichment: | Claims, sitelinks |
| Web scrape data | web_enrichment: | Scraped metadata |
Correct Enrichment Workflow
import yaml

def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: Always update YAML first."""
    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, 'r', encoding='utf-8') as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    # (validate_enrichment is assumed to be defined elsewhere in the pipeline)
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    # sort_keys=False keeps the existing key order instead of sorting alphabetically
    with open(yaml_path, 'w', encoding='utf-8') as f:
        yaml.dump(custodian, f, default_flow_style=False,
                  allow_unicode=True, sort_keys=False)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)
    return custodian
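Note that `custodian.update(validated_data)` is a shallow merge: if the enrichment contains a `social_media` key, the whole existing section is replaced. If enrichment is meant to be additive (Rule 5), a recursive merge that never overwrites existing values may be closer to the intent. The following is a sketch under that reading of Rule 5; `merge_additive` is a hypothetical helper, not an existing function.

```python
from typing import Any


def merge_additive(existing: dict[str, Any], new: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `existing` with `new` merged in, never overwriting existing values."""
    merged = dict(existing)
    for key, value in new.items():
        if key not in merged:
            merged[key] = value  # new key: add it
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_additive(merged[key], value)  # recurse into sections
        # Otherwise the key already has a value, which is kept as-is.
    return merged


existing = {"social_media": {"facebook": "https://www.facebook.com/rijksmuseum/"}}
new = {"social_media": {"facebook": "https://example.org/other",
                        "instagram": "https://www.instagram.com/rijksmuseum/"}}
merged = merge_additive(existing, new)
```

In this example the existing `facebook` link is preserved and only `instagram` is added. Whether a conflicting value should be skipped silently, logged, or flagged for review is a design decision this sketch leaves open.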
Anti-Pattern: Ghost Data
Ghost data is data that appears in API responses but doesn't exist in the custodian YAML file.
How Ghost Data Happens
# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid)  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!
Detecting Ghost Data
# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}
# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)
# Conclusion: Ghost data detected! API has data that YAML doesn't.
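The same comparison can be automated once the API response and the YAML record are both loaded as dictionaries. The sketch below assumes the API response uses the same key names as the YAML file, which may not hold for every endpoint; `find_ghost_paths` is a hypothetical helper.

```python
from typing import Any


def find_ghost_paths(api_data: dict[str, Any], yaml_data: dict[str, Any],
                     prefix: str = "") -> list[str]:
    """List dotted paths whose API value is missing from, or differs from, the YAML record."""
    ghosts: list[str] = []
    for key, value in api_data.items():
        path = f"{prefix}{key}"
        if key not in yaml_data:
            ghosts.append(path)  # present in the API, absent from YAML
        elif isinstance(value, dict) and isinstance(yaml_data[key], dict):
            ghosts.extend(find_ghost_paths(value, yaml_data[key], prefix=f"{path}."))
        elif value != yaml_data[key]:
            ghosts.append(path)  # API value has no matching YAML source
    return ghosts


api_response = {"name": "Example Museum",
                "social_media": {"facebook": "https://www.facebook.com/facebook"}}
yaml_record = {"name": "Example Museum"}
ghosts = find_ghost_paths(api_response, yaml_record)
```

For the record above, `ghosts` is `["social_media"]`, matching the manual check with `curl` and `grep`.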
Resolving Ghost Data
- If data is valid: Add it to the YAML file
- If data is invalid: Remove it from the database
- NEVER: Leave ghost data in database without YAML source
Validation Requirements
Before writing any data to custodian YAML:
- Check for empty/null values - Don't write empty strings
- Validate URLs - Ensure they're well-formed
- Validate social media - See .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md
- Check for duplicates - Don't add duplicate entries
- Preserve existing data - Enrichment is additive (Rule 5)
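The first two checks can be sketched with the standard library alone. This is an illustrative sketch, not the project's validator: `clean_enrichment`, `is_well_formed_url`, and the `url_fields` parameter are hypothetical, and the social media and duplicate checks are not covered here.

```python
from typing import Any
from urllib.parse import urlparse


def is_well_formed_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_enrichment(data: dict[str, Any], url_fields: set[str]) -> dict[str, Any]:
    """Drop empty values and malformed URLs before anything is written to YAML."""
    cleaned: dict[str, Any] = {}
    for key, value in data.items():
        if value is None or value == "" or value == {} or value == []:
            continue  # never write empty or null values
        if key in url_fields and not (isinstance(value, str) and is_well_formed_url(value)):
            continue  # skip values that are not well-formed URLs
        cleaned[key] = value
    return cleaned


raw = {"website": "https://example.org", "email": "", "facebook": "not a url", "phone": None}
cleaned = clean_enrichment(raw, url_fields={"website", "facebook"})
```

Only the `website` entry survives in this example. A well-formed URL is not necessarily a correct one, so the dedicated social media validation still applies afterwards.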
Example YAML Structure
# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM
location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852
social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum
opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"
google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"
provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"
Related Rules
- Rule 5: Data enrichment is ADDITIVE ONLY
- Rule 21: Data Fabrication is Strictly Prohibited
- Rule 23: Social Media Link Validation
See Also
- .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md - Social media validation rules
- .opencode/DATA_PRESERVATION_RULES.md - Data preservation guidelines
- AGENTS.md - Complete agent instructions