# Custodian YAML Files Are the Single Source of Truth

**Rule**: AGENTS.md Rule 22
**Status**: ACTIVE
**Created**: 2025-12-12

## Summary

The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.

## The Data Hierarchy

```
                   data/custodian/*.yaml  <- SINGLE SOURCE OF TRUTH (edit this!)
                             |
                             v
     +-----------+-----------+-----------+-----------+
     |           |           |           |           |
     v           v           v           v           v
 Ducklake   PostgreSQL    TypeDB     Oxigraph     Qdrant
(analytics)  (geo API)    (graph)  (RDF/SPARQL)  (vector)
                             |
                             v
                    REST API responses    <- DERIVED (serve from databases)
                             |
                             v
                     Frontend display     <- DERIVED (render from API)
```

**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**

### The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Semantic vector search | Import from YAML → Similarity search |

### Critical Rule: No Database-Level Enrichment

**🚨 DATABASES MUST NEVER:**
- Add new fields not present in YAML
- Modify existing data without a corresponding YAML update
- Store enrichment results directly
- Create "derived" or "computed" fields that aren't in YAML

**✅ DATABASES SHOULD ONLY:**
- Import/sync data FROM custodian YAML files
- Serve data that exists in YAML
- Provide specialized query capabilities (spatial, graph, vector, SPARQL)
- Create indexes for performance (not new data)

## What This Means for AI Agents

### When Enriching Data

1. **ALWAYS** write enrichment data to the custodian YAML file first
2. **NEVER** write data directly to a database without first updating the YAML file
3. **VALIDATE** data quality before writing (see `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for social media links)
4. **VERIFY** that API responses match YAML file contents

### File Location

Custodian files are located at:
```
data/custodian/{GHCID}.yaml
```

Example: `data/custodian/NL-NH-MID-M-AMW.yaml`

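As a small illustration, the path for a given GHCID can be built with `pathlib`. The helper name `custodian_path` and the `CUSTODIAN_DIR` constant are assumptions made for this sketch; only the `data/custodian/{GHCID}.yaml` layout comes from this rule.

```python
from pathlib import Path

# Directory layout taken from this rule; the constant name is hypothetical.
CUSTODIAN_DIR = Path("data/custodian")


def custodian_path(ghcid: str, base_dir: Path = CUSTODIAN_DIR) -> Path:
    """Return the YAML path for a GHCID (hypothetical helper)."""
    if not ghcid or "/" in ghcid or "\\" in ghcid:
        # Reject empty IDs and anything that could escape the custodian directory.
        raise ValueError(f"invalid GHCID: {ghcid!r}")
    return base_dir / f"{ghcid}.yaml"


print(custodian_path("NL-NH-MID-M-AMW").as_posix())  # data/custodian/NL-NH-MID-M-AMW.yaml
```

Rejecting path separators keeps a malformed identifier from pointing outside the custodian directory.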
## Data Categories

All of these MUST be stored in custodian YAML files:

| Category | YAML Section | Example |
|----------|--------------|---------|
| Basic metadata | Root level | `name`, `ghcid`, `institution_type` |
| Location | `location:` or `locations:` | `city`, `country`, `coordinates` |
| Identifiers | `identifiers:` | `isil_code`, `wikidata_id` |
| Social media | `social_media:` | `facebook`, `twitter`, `instagram` |
| Opening hours | `opening_hours:` | Day-by-day schedule |
| Contact info | Root level | `phone`, `email`, `website` |
| Google Maps data | `google_maps_enrichment:` | `place_id`, `rating`, `reviews` |
| Wikidata data | `wikidata_enrichment:` | Claims, sitelinks |
| Web scrape data | `web_enrichment:` | Scraped metadata |

## Correct Enrichment Workflow

```python
import yaml


def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: Always update YAML first."""

    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, 'r', encoding='utf-8') as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    # (validate_enrichment is assumed to be defined elsewhere in the project)
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record
    # Caution: dict.update() replaces existing top-level keys, so
    # validated_data must not overwrite existing values (Rule 5)
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    with open(yaml_path, 'w', encoding='utf-8') as f:
        yaml.dump(custodian, f, default_flow_style=False, allow_unicode=True)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)

    return custodian
```

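Because `dict.update()` replaces any top-level key that already exists, a workflow that must honor Rule 5 (additive enrichment) may want a merge that only fills in missing values. The sketch below is one possible approach, not the project's actual implementation; `merge_additive` is a hypothetical helper name.

```python
from copy import deepcopy


def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with keys from `new` added, never overwritten.

    Nested dicts are merged recursively; any value already present in
    `existing` wins over the value in `new`.
    """
    merged = deepcopy(existing)
    for key, value in new.items():
        if key not in merged:
            merged[key] = deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_additive(merged[key], value)
        # Otherwise keep the existing value untouched.
    return merged


custodian = {
    "name": "Rijksmuseum",
    "social_media": {"twitter": "https://twitter.com/rijksmuseum"},
}
enrichment = {
    "name": "Some other name",
    "social_media": {"instagram": "https://www.instagram.com/rijksmuseum/"},
}
result = merge_additive(custodian, enrichment)
print(result["name"])                  # Rijksmuseum
print(sorted(result["social_media"]))  # ['instagram', 'twitter']
```

In this sketch the existing `name` survives and only the missing `instagram` entry is added. Whether conflicting values should be skipped silently, logged, or raised as errors is a policy decision the sketch leaves open.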
## Anti-Pattern: Ghost Data

**Ghost data** is data that appears in API responses but doesn't exist in the custodian YAML file.

### How Ghost Data Happens

```python
# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid)  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!
```

### Detecting Ghost Data

```bash
# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}

# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)

# Conclusion: Ghost data detected! API has data that YAML doesn't.
```

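The same comparison can be automated. The following sketch assumes the API response and the YAML file have already been loaded into plain dictionaries with the same nesting; `find_ghost_fields` is a hypothetical helper, and how the two dictionaries are obtained is left to the caller.

```python
def find_ghost_fields(api_data: dict, yaml_data: dict, prefix: str = "") -> list[str]:
    """List dotted paths that carry a value in the API data but are absent from the YAML data."""
    ghosts = []
    for key, api_value in api_data.items():
        path = f"{prefix}{key}"
        if api_value in (None, "", [], {}):
            continue  # Empty API values are not treated as ghost data.
        if key not in yaml_data:
            ghosts.append(path)
        elif isinstance(api_value, dict) and isinstance(yaml_data[key], dict):
            # Both sides have this section: compare its contents recursively.
            ghosts.extend(find_ghost_fields(api_value, yaml_data[key], prefix=f"{path}."))
    return ghosts


api_response = {
    "name": "Example Museum",
    "social_media": {"facebook": "https://www.facebook.com/facebook"},
}
yaml_record = {"name": "Example Museum"}
print(find_ghost_fields(api_response, yaml_record))  # ['social_media']
```

This sketch only reports fields that are missing from the YAML side. Values that exist in both places but differ are a separate consistency problem and are not flagged here.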
### Resolving Ghost Data

1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **NEVER**: Leave ghost data in database without YAML source

## Validation Requirements

Before writing any data to custodian YAML:

1. **Check for empty/null values** - Don't write empty strings
2. **Validate URLs** - Ensure they're well-formed
3. **Validate social media** - See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`
4. **Check for duplicates** - Don't add duplicate entries
5. **Preserve existing data** - Enrichment is additive (Rule 5)

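A minimal sketch of checks 1 and 2, using only the standard library. The function name matches the `validate_enrichment` call in the enrichment workflow, but its body here is an assumption, not the project's implementation. The social media and duplicate checks (3 and 4) are deliberately left out because their rules live in `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`.

```python
from urllib.parse import urlparse


def is_well_formed_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_enrichment(data: dict) -> dict:
    """Drop empty values and malformed URLs; recurse into nested sections."""
    cleaned = {}
    for key, value in data.items():
        if isinstance(value, dict):
            nested = validate_enrichment(value)
            if nested:  # Skip sections that end up empty.
                cleaned[key] = nested
        elif value is None or (isinstance(value, str) and not value.strip()):
            continue  # Check 1: never write empty or null values.
        elif isinstance(value, str) and "://" in value and not is_well_formed_url(value):
            continue  # Check 2: skip strings that look like URLs but are malformed.
        else:
            cleaned[key] = value
    return cleaned


sample = {
    "phone": "",
    "social_media": {
        "facebook": "https://www.facebook.com/rijksmuseum/",
        "twitter": "http:///missing-host",
        "instagram": None,
    },
}
print(validate_enrichment(sample))
# {'social_media': {'facebook': 'https://www.facebook.com/rijksmuseum/'}}
```

Treating any string containing `://` as a URL candidate is a heuristic chosen for brevity; a real validator would more likely decide per field name.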
## Example YAML Structure

```yaml
# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM

location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852

social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum

opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"

google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"

provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"
```

## Related Rules

- **Rule 5**: Data enrichment is ADDITIVE ONLY
- **Rule 21**: Data Fabrication is Strictly Prohibited
- **Rule 23**: Social Media Link Validation

## See Also

- `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` - Social media validation rules
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation guidelines
- `AGENTS.md` - Complete agent instructions