glam/.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md
2025-12-14 17:09:55 +01:00

# Custodian YAML Files Are the Single Source of Truth
**Rule**: AGENTS.md Rule 22
**Status**: ACTIVE
**Created**: 2025-12-12
## Summary
The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.
## The Data Hierarchy
```
data/custodian/*.yaml         <- SINGLE SOURCE OF TRUTH (edit this!)
              |
              v
+-------------+-------------+-------------+-------------+
|             |             |             |             |
v             v             v             v             v
Ducklake      PostgreSQL    TypeDB        Oxigraph      Qdrant
(analytics)   (geo API)     (graph)       (RDF/SPARQL)  (vector)
              |
              v
REST API responses            <- DERIVED (serve from databases)
              |
              v
Frontend display              <- DERIVED (render from API)
```
**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**
### The Five Database Backends
| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |
### Critical Rule: No Database-Level Enrichment
**🚨 DATABASES MUST NEVER:**
- Add new fields not present in YAML
- Modify existing data without a corresponding YAML update
- Store enrichment results directly
- Create "derived" or "computed" fields that aren't in YAML
**✅ DATABASES SHOULD ONLY:**
- Import/sync data FROM custodian YAML files
- Serve data that exists in YAML
- Provide specialized query capabilities (spatial, graph, vector, SPARQL)
- Create indexes for performance (not new data)
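
One way to keep a backend strictly derived is to rebuild it from the parsed YAML records on every sync, so that anything written only to the database does not survive. The sketch below uses Python's built-in `sqlite3` as a stand-in for any of the five backends. The function name `rebuild_derived_table`, the table name `institutions`, and the column layout are illustrative assumptions, not the project's actual schema; the records are assumed to follow the `ghcid.ghcid_current` layout shown in the example YAML.

```python
import json
import sqlite3


def rebuild_derived_table(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Drop and recreate a derived table from already-parsed custodian records.

    `records` is assumed to hold one dict per data/custodian/*.yaml file,
    e.g. as returned by yaml.safe_load.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS institutions")
        conn.execute(
            "CREATE TABLE institutions (ghcid TEXT PRIMARY KEY, name TEXT, record TEXT)"
        )
        for record in records:
            ghcid = record["ghcid"]["ghcid_current"]
            conn.execute(
                "INSERT INTO institutions (ghcid, name, record) VALUES (?, ?, ?)",
                (ghcid, record.get("name"), json.dumps(record, ensure_ascii=False)),
            )
        # An index adds query speed, not new data.
        conn.execute("CREATE INDEX idx_institutions_name ON institutions (name)")
    return len(records)
```

Because the table is dropped and refilled, a row that was inserted directly into the database disappears on the next sync unless it also exists in YAML.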
## What This Means for AI Agents
### When Enriching Data
1. **ALWAYS** write enrichment data to the custodian YAML file first
2. **NEVER** write data directly to the database without updating YAML
3. **VALIDATE** data quality before writing (see `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`)
4. **VERIFY** that API responses match YAML file contents
### File Location
Custodian files are located at:
```
data/custodian/{GHCID}.yaml
```
Example: `data/custodian/NL-NH-MID-M-AMW.yaml`
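
A small helper can build this path in one place. The sketch below is an assumption for illustration: the name `custodian_path` and the `CUSTODIAN_DIR` constant are hypothetical, and the only check applied is that the GHCID cannot escape the directory (the GHCID format itself is not validated here).

```python
from pathlib import Path

# Assumed location, taken from the path pattern above.
CUSTODIAN_DIR = Path("data/custodian")


def custodian_path(ghcid: str, base_dir: Path = CUSTODIAN_DIR) -> Path:
    """Return the YAML path for a GHCID, rejecting values that contain path separators."""
    if not ghcid or ghcid in (".", "..") or any(sep in ghcid for sep in ("/", "\\")):
        raise ValueError(f"Not a usable GHCID: {ghcid!r}")
    return base_dir / f"{ghcid}.yaml"
```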
## Data Categories
All of these MUST be stored in custodian YAML files:
| Category | YAML Section | Example |
|----------|--------------|---------|
| Basic metadata | Root level | `name`, `ghcid`, `institution_type` |
| Location | `location:` or `locations:` | `city`, `country`, `coordinates` |
| Identifiers | `identifiers:` | `isil_code`, `wikidata_id` |
| Social media | `social_media:` | `facebook`, `twitter`, `instagram` |
| Opening hours | `opening_hours:` | Day-by-day schedule |
| Contact info | Root level | `phone`, `email`, `website` |
| Google Maps data | `google_maps_enrichment:` | `place_id`, `rating`, `reviews` |
| Wikidata data | `wikidata_enrichment:` | Claims, sitelinks |
| Web scrape data | `web_enrichment:` | Scraped metadata |
## Correct Enrichment Workflow
```python
import yaml


def enrich_custodian(ghcid: str, enrichment_data: dict) -> dict:
    """Correct workflow: always update the YAML file first."""
    # Step 1: Read existing YAML
    yaml_path = f"data/custodian/{ghcid}.yaml"
    with open(yaml_path, 'r', encoding='utf-8') as f:
        custodian = yaml.safe_load(f) or {}

    # Step 2: Validate enrichment data
    # (validate_enrichment is assumed to be defined elsewhere in the project)
    validated_data = validate_enrichment(enrichment_data)

    # Step 3: Merge into custodian record
    # Note: dict.update() replaces top-level keys that already exist
    custodian.update(validated_data)

    # Step 4: Write back to YAML (MANDATORY!)
    with open(yaml_path, 'w', encoding='utf-8') as f:
        yaml.dump(custodian, f, default_flow_style=False, allow_unicode=True)

    # Step 5: Optionally import to database for API serving
    # import_to_database(ghcid, custodian)
    return custodian
```
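
Note that `dict.update()` in Step 3 replaces any top-level key that already exists, which can discard earlier enrichment and conflicts with the additive-only rule (Rule 5). A possible alternative is a merge that only fills gaps. The function below is a sketch under stated assumptions: the name `merge_additive` is hypothetical, `None` is treated as "missing", and conflicting scalar values keep the existing value rather than being resolved.

```python
def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with values from `new` added, never replaced."""
    merged = dict(existing)
    for key, value in new.items():
        if key not in merged or merged[key] is None:
            # Missing (or null) in the existing record: safe to add.
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # Recurse so nested sections such as social_media are merged per key.
            merged[key] = merge_additive(merged[key], value)
        elif isinstance(merged[key], list) and isinstance(value, list):
            # Append only items that are not already present.
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        # Otherwise the existing value wins; the conflict is left for review.
    return merged
```

With such a helper, Step 3 could read `custodian = merge_additive(custodian, validated_data)` instead of calling `update()`.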
## Anti-Pattern: Ghost Data
**Ghost data** is data that appears in API responses but doesn't exist in the custodian YAML file.
### How Ghost Data Happens
```python
# WRONG - Writing directly to database
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    ("https://www.facebook.com/facebook", ghcid),  # Garbage data!
)
# Now API returns data that doesn't exist in YAML!
```
### Detecting Ghost Data
```bash
# Step 1: Query API
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}
# Step 2: Check YAML file
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist!)
# Conclusion: Ghost data detected! API has data that YAML doesn't.
```
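
The same comparison can be automated once both sides are loaded as dictionaries. The sketch below assumes the API response mirrors the YAML key structure, which may not hold for every endpoint; the function name `find_ghost_fields` is hypothetical.

```python
def find_ghost_fields(api_record: dict, yaml_record: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose API value is absent from, or differs from, the YAML record."""
    ghosts = []
    for key, api_value in api_record.items():
        path = f"{prefix}{key}"
        if key not in yaml_record:
            # The API serves a field the YAML file does not have.
            ghosts.append(path)
        elif isinstance(api_value, dict) and isinstance(yaml_record[key], dict):
            # Compare nested sections key by key.
            ghosts.extend(find_ghost_fields(api_value, yaml_record[key], prefix=f"{path}."))
        elif api_value != yaml_record[key]:
            # Same field, different value: the database drifted from YAML.
            ghosts.append(path)
    return ghosts
```

An empty result means every field in the API response is backed by the YAML record; any returned path needs one of the resolutions described next.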
### Resolving Ghost Data
1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **NEVER**: Leave ghost data in the database without a YAML source
## Validation Requirements
Before writing any data to custodian YAML:
1. **Check for empty/null values** - Don't write empty strings
2. **Validate URLs** - Ensure they're well-formed
3. **Validate social media** - See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`
4. **Check for duplicates** - Don't add duplicate entries
5. **Preserve existing data** - Enrichment is additive (Rule 5)
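
The first two checks can be sketched with the standard library alone. This is a minimal illustration, not the project's validator: the names `is_well_formed_url` and `clean_enrichment` and the default set of URL fields are assumptions, only flat dictionaries are handled, and the platform-specific social media rules from `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` are not covered.

```python
from urllib.parse import urlparse

# Assumed set of fields that must hold URLs; adjust to the real schema.
DEFAULT_URL_FIELDS = frozenset({"website", "facebook", "twitter", "instagram"})


def is_well_formed_url(value: str) -> bool:
    """True if `value` parses as an http(s) URL with a host."""
    try:
        parsed = urlparse(value)
    except ValueError:  # e.g. malformed IPv6 literals
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_enrichment(data: dict, url_fields: frozenset = DEFAULT_URL_FIELDS) -> dict:
    """Drop empty values and malformed URLs from a flat enrichment dict."""
    cleaned = {}
    for key, value in data.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "" or value == [] or value == {}:
            continue  # never write empty or null values
        if key in url_fields and not (isinstance(value, str) and is_well_formed_url(value)):
            continue  # skip URLs that are not well-formed
        cleaned[key] = value
    return cleaned
```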
## Example YAML Structure
```yaml
# data/custodian/NL-NH-AMS-M-RM.yaml
name: Rijksmuseum
ghcid:
  ghcid_current: NL-NH-AMS-M-RM
institution_type: MUSEUM
location:
  city: Amsterdam
  country: NL
  coordinates:
    latitude: 52.3600
    longitude: 4.8852
social_media:
  facebook: https://www.facebook.com/rijksmuseum/
  twitter: https://twitter.com/rijksmuseum
  instagram: https://www.instagram.com/rijksmuseum/
  youtube: https://www.youtube.com/@Rijksmuseum
opening_hours:
  monday: "09:00-17:00"
  tuesday: "09:00-17:00"
  wednesday: "09:00-17:00"
  thursday: "09:00-17:00"
  friday: "09:00-17:00"
  saturday: "09:00-17:00"
  sunday: "09:00-17:00"
google_maps_enrichment:
  place_id: ChIJeVCRJE8JxkcR...
  rating: 4.7
  total_ratings: 52847
  enrichment_date: "2025-12-10T00:00:00Z"
provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_date: "2025-11-01T00:00:00Z"
```
## Related Rules
- **Rule 5**: Data enrichment is ADDITIVE ONLY
- **Rule 21**: Data Fabrication is Strictly Prohibited
- **Rule 23**: Social Media Link Validation
## See Also
- `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` - Social media validation rules
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation guidelines
- `AGENTS.md` - Complete agent instructions