# Data Governance Guide

This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.

## Single Source of Truth

### Custodian Data

The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data.

```
                     data/custodian/*.yaml   <- SINGLE SOURCE OF TRUTH
                               |
                               v
     +------------+------------+------------+------------+
     |            |            |            |            |
     v            v            v            v            v
  Ducklake   PostgreSQL     TypeDB      Oxigraph      Qdrant
(analytics)   (geo API)     (graph)   (RDF/SPARQL)   (vector)
                               |
                               v
                      REST API responses     <- DERIVED (serve from databases)
                               |
                               v
                       Frontend display      <- DERIVED (render from API)
```

**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**

### The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |

### Schema Definition

The LinkML schema files in `schemas/20251121/linkml/` are the **SINGLE SOURCE OF TRUTH** for the ontology definition.

All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.

## Data Quality Rules

### Rule 1: All Enrichment Must Be Written to YAML

When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.

**Correct workflow**:

```
1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml  <- MANDATORY
4. Import to database (optional)
```

**Never do this**:

```
# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")
```

### Rule 2: Social Media Link Validation

Social media links must point to the specific institution's page, NOT to generic platform pages.

**Invalid (must reject)**:

- `facebook.com/` - Generic homepage
- `facebook.com/facebook` - Facebook's own page
- `twitter.com/` - Generic homepage
- `twitter.com/twitter` - Twitter's own account

**Valid (can store)**:

- `facebook.com/rijksmuseum/` - Institution's page
- `twitter.com/rijksmuseum` - Institution's account

See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for validation patterns.

### Rule 3: No Data Fabrication

All data must be real and verifiable. Never create placeholder or fake data.

**Forbidden**:

- Inventing names, titles, or descriptions
- Creating fictional URLs or identifiers
- Generating fallback data when extraction fails

**Allowed**:

- Returning `null` or empty fields for missing data
- Skipping records that cannot be extracted
- Logging extraction failures

### Rule 4: Data Enrichment is Additive

Never delete enriched data. Enrichment operations should only add or update, not remove.

**Exception**: Invalid/garbage data (like generic social media links) can be removed as it was never valid enrichment.

### Rule 5: Provenance Tracking

All data must include provenance metadata:

```yaml
provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
```

## Ghost Data Detection

"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.

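
To make the definition concrete, here is a minimal sketch of a field-level comparison between one API response and the matching custodian record, both already loaded as dictionaries. The function name `find_ghost_fields`, the sample field names, and the sample values are assumptions for illustration only; they are not part of the project's tooling or data. The sketch only reports values that are missing from the YAML side, not values that differ.

```python
from typing import Any


def find_ghost_fields(
    api_record: dict[str, Any],
    yaml_record: dict[str, Any],
    prefix: str = "",
) -> list[str]:
    """Return dotted paths of values the API serves but the YAML record lacks."""
    ghosts: list[str] = []
    for key, api_value in api_record.items():
        path = f"{prefix}{key}"
        # Empty API values cannot be ghost data, so skip them.
        if api_value is None or api_value == "" or api_value == [] or api_value == {}:
            continue
        yaml_value = yaml_record.get(key)
        if isinstance(api_value, dict) and isinstance(yaml_value, dict):
            # Both sides are nested mappings: compare them field by field.
            ghosts.extend(find_ghost_fields(api_value, yaml_value, prefix=f"{path}."))
        elif yaml_value is None:
            # The API has a value, the YAML has none: ghost data.
            ghosts.append(path)
    return ghosts


# Hypothetical sample records (illustrative values only).
api_record = {
    "name": "Rijksmuseum",
    "social": {
        "facebook": "facebook.com/rijksmuseum/",
        "twitter": "twitter.com/rijksmuseum",
    },
    "phone": None,
    "opening_hours": "hypothetical value",
}
yaml_record = {
    "name": "Rijksmuseum",
    "social": {"facebook": "facebook.com/rijksmuseum/"},
}

print(find_ghost_fields(api_record, yaml_record))
# → ['social.twitter', 'opening_hours']
```

Each reported path then needs one of the resolutions described in this section: add the value to the YAML file if it is valid, or remove it from the database if it is not.
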
### How to Detect

```bash
# Query the API
curl -s "http://localhost:8002/institution/{GHCID}" | jq '.field'

# Check the YAML file
grep "field:" data/custodian/{GHCID}.yaml

# If API returns data but YAML doesn't have it = Ghost data!
```

### How to Resolve

1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **Never**: Leave ghost data without resolution

## File Naming Conventions

### Custodian Files

```
data/custodian/{GHCID}.yaml
```

Example: `data/custodian/NL-NH-AMS-M-RM.yaml`

### Person Entity Files

```
data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json
```

Example: `data/custodian/person/entity/john-smith-12345_20251212T000000Z.json`

## Validation Checklist

Before committing enrichment data:

- [ ] Data written to custodian YAML file
- [ ] Provenance metadata included
- [ ] Social media links validated (no generic URLs)
- [ ] No fabricated/placeholder data
- [ ] Existing data preserved (additive enrichment)
- [ ] API responses match YAML file content

## Related Documentation

| Document | Purpose |
|----------|---------|
| `AGENTS.md` | AI agent instructions with all rules |
| `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` | Rule 22 detailed documentation |
| `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` | Rule 23 validation patterns |
| `.opencode/DATA_PRESERVATION_RULES.md` | Rule 5 data preservation |
| `.opencode/DATA_FABRICATION_PROHIBITION.md` | Rule 21 anti-fabrication |

## Quick Reference

| Rule | Summary |
|------|---------|
| **Rule 5** | Data enrichment is ADDITIVE ONLY |
| **Rule 23** | Validate social media links |
| **Rule 21** | Never fabricate data |
| **Rule 22** | Custodian YAML is single source of truth |
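
The social media link rule is easy to check automatically. The sketch below covers only the invalid and valid examples listed in this guide. The function name `is_institution_link` and the `PLATFORM_OWN_SLUGS` table are assumptions made for illustration; the authoritative patterns live in `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` and may be stricter.

```python
from urllib.parse import urlparse

# Assumption: platform-owned account slugs to reject, taken from the
# invalid examples in this guide. The real pattern list may be longer.
PLATFORM_OWN_SLUGS = {
    "facebook.com": {"facebook"},
    "twitter.com": {"twitter"},
}


def is_institution_link(url: str) -> bool:
    """Return True if the URL names a specific account, not a generic platform page."""
    # urlparse only fills in netloc when a scheme or a leading "//" is present.
    parsed = urlparse(url if "//" in url else f"//{url}")
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return False  # generic homepage, e.g. facebook.com/
    # Reject the platform's own account, e.g. twitter.com/twitter.
    return segments[0].lower() not in PLATFORM_OWN_SLUGS.get(host, set())


for candidate in ("facebook.com/", "twitter.com/twitter", "facebook.com/rijksmuseum/"):
    print(candidate, is_institution_link(candidate))
```

A link that fails this check should be rejected before it is written to the custodian YAML file, rather than stored and cleaned up later.
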