Data Governance Guide
This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.
Single Source of Truth
Custodian Data
The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data.
data/custodian/*.yaml <- SINGLE SOURCE OF TRUTH
|
v
+------------+-----------+--------+-------------+
|            |           |        |             |
v            v           v        v             v
Ducklake     PostgreSQL  TypeDB   Oxigraph      Qdrant
(analytics)  (geo API)   (graph)  (RDF/SPARQL)  (vector)
|
v
REST API responses <- DERIVED (serve from databases)
|
v
Frontend display <- DERIVED (render from API)
ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.
The Five Database Backends
| Database | Purpose | Data Flow |
|---|---|---|
| Ducklake | Analytics, aggregations | Import from YAML → Query |
| PostgreSQL | Geographic API, PostGIS | Import from YAML → Serve API |
| TypeDB | Graph queries, relationships | Import from YAML → Graph traversal |
| Oxigraph | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| Qdrant | Vector search, semantic | Import from YAML → Similarity search |
Schema Definition
The LinkML schema files in schemas/20251121/linkml/ are the SINGLE SOURCE OF TRUTH for the ontology definition.
All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.
Data Quality Rules
Rule 1: All Enrichment Must Be Written to YAML
When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.
Correct workflow:
1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml <- MANDATORY
4. Import to database (optional)
Never do this:
# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")
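The correct workflow can be sketched as a small function that enforces the order of the four steps. This is a minimal sketch, not the project's actual pipeline: `enrich_custodian` and the four callables it receives (`fetch`, `validate`, `write_yaml`, `import_to_db`) are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def enrich_custodian(
    ghcid: str,
    fetch: Callable[[str], Optional[dict]],
    validate: Callable[[dict], bool],
    write_yaml: Callable[[str, dict], None],
    import_to_db: Optional[Callable[[str], None]] = None,
) -> bool:
    """Run the enrichment steps in order; return True if the YAML was written.

    All four callables are hypothetical placeholders, not project APIs.
    """
    data = fetch(ghcid)                      # 1. fetch from external source
    if data is None or not validate(data):   # 2. validate data quality
        return False                         # nothing is written on failure
    write_yaml(ghcid, data)                  # 3. MANDATORY: update the YAML file
    if import_to_db is not None:
        import_to_db(ghcid)                  # 4. optional: re-import from YAML
    return True
```

Passing only the GHCID to `import_to_db` is a deliberate choice in this sketch: the importer is expected to read the YAML file that was just written, so the database cannot receive data the YAML lacks.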
Rule 2: Social Media Link Validation
Social media links must point to the specific institution's page, NOT to generic platform pages.
Invalid (must reject):
- facebook.com/ - Generic homepage
- facebook.com/facebook - Facebook's own page
- twitter.com/ - Generic homepage
- twitter.com/twitter - Twitter's own account
Valid (can store):
- facebook.com/rijksmuseum/ - Institution's page
- twitter.com/rijksmuseum - Institution's account
See .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md for validation patterns.
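The valid and invalid examples above can be checked with a short function. This is only a sketch covering the two platforms listed here; the authoritative patterns live in the referenced validation document, and the `PLATFORM_OWN_SLUGS` table and `is_institution_link` name are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: hosts and "own account" slugs taken only from the examples above.
PLATFORM_OWN_SLUGS = {
    "facebook.com": {"facebook"},
    "twitter.com": {"twitter"},
}


def is_institution_link(url: str) -> bool:
    """Return True if the URL names a specific account on a known platform."""
    # urlparse only fills in netloc when a scheme or leading // is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in PLATFORM_OWN_SLUGS:
        return False  # platform not covered by this sketch
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return False  # generic homepage, e.g. facebook.com/
    # Reject the platform's own page, e.g. twitter.com/twitter.
    return segments[0].lower() not in PLATFORM_OWN_SLUGS[host]
```

Links that fail this check should be rejected before anything is written to the custodian YAML file.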
Rule 3: No Data Fabrication
All data must be real and verifiable. Never create placeholder or fake data.
Forbidden:
- Inventing names, titles, or descriptions
- Creating fictional URLs or identifiers
- Generating fallback data when extraction fails
Allowed:
- Returning null or empty fields for missing data
- Skipping records that cannot be extracted
- Logging extraction failures
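In code, the allowed behaviours amount to returning `None` and logging when a value cannot be extracted, rather than substituting a default. A minimal sketch, assuming a hypothetical scraped record with `website` and `ghcid` keys (neither the function nor the key names come from the project):

```python
import logging
from typing import Optional

logger = logging.getLogger("enrichment")


def extract_website(raw: dict) -> Optional[str]:
    """Return the scraped website, or None if it is missing.

    Hypothetical helper: the 'website' and 'ghcid' keys are assumptions.
    """
    value = raw.get("website")
    if isinstance(value, str) and value.strip():
        return value.strip()
    # Log the failure and return None; never invent a fallback URL.
    logger.warning("website could not be extracted for %s", raw.get("ghcid"))
    return None
```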
Rule 4: Data Enrichment is Additive
Never delete enriched data. Enrichment operations should only add or update, not remove.
Exception: Invalid/garbage data (like generic social media links) can be removed as it was never valid enrichment.
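One way to keep enrichment additive is to merge new values into the existing record instead of replacing it. The sketch below assumes records are plain nested dictionaries; `merge_additive` is a hypothetical helper, and it does not handle the garbage-removal exception, which would remain a separate, explicit step.

```python
def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with values from `new` added or updated.

    Keys are never removed, and a None value in `new` never overwrites data.
    """
    merged = dict(existing)
    for key, value in new.items():
        if value is None:
            continue  # missing data must not erase an existing value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_additive(merged[key], value)  # merge nested maps
        else:
            merged[key] = value  # add a new key or update an existing one
    return merged
```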
Rule 5: Provenance Tracking
All data must include provenance metadata:
provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
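A provenance block of this shape can be checked before it is written. The sketch below makes several assumptions: every field except `confidence_score` is treated as required, timestamps follow the format of the example above, and `data_source` values are not checked because the list above is not exhaustive. `provenance_problems` is a hypothetical name.

```python
from datetime import datetime
from typing import List

DATA_TIERS = {
    "TIER_1_AUTHORITATIVE",
    "TIER_2_VERIFIED",
    "TIER_3_CROWD_SOURCED",
    "TIER_4_INFERRED",
}
# Assumption: only confidence_score is optional.
REQUIRED_FIELDS = ("data_source", "data_tier", "extraction_date", "extraction_method")


def provenance_problems(provenance: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not provenance.get(name)
    ]
    tier = provenance.get("data_tier")
    if tier and tier not in DATA_TIERS:
        problems.append(f"unknown data_tier: {tier}")
    date = provenance.get("extraction_date")
    if date:
        try:
            # Assumed format, matching the example "2025-12-12T00:00:00Z".
            datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
        except (TypeError, ValueError):
            problems.append(f"bad extraction_date: {date}")
    score = provenance.get("confidence_score")
    if score is not None and not (
        isinstance(score, (int, float)) and 0.0 <= score <= 1.0
    ):
        problems.append(f"confidence_score out of range: {score}")
    return problems
```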
Ghost Data Detection
"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.
How to Detect
# Query the API
curl -s "http://localhost:8002/institution/{GHCID}" | jq '.field'
# Check the YAML file
grep "field:" data/custodian/{GHCID}.yaml
# If API returns data but YAML doesn't have it = Ghost data!
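The same comparison can be done for all fields at once, given an API response and the corresponding YAML record already loaded as dictionaries. This is a rough sketch: it looks only at top-level keys and assumes the API uses the same field names as the YAML file, which may not hold for every endpoint. `find_ghost_fields` is a hypothetical helper.

```python
from typing import List


def _is_empty(value) -> bool:
    """Treat None, empty strings, empty lists, and empty maps as 'no data'."""
    return value is None or value == "" or value == [] or value == {}


def find_ghost_fields(api_record: dict, yaml_record: dict) -> List[str]:
    """Return top-level fields that carry data in the API but not in the YAML."""
    return sorted(
        key
        for key, value in api_record.items()
        if not _is_empty(value) and _is_empty(yaml_record.get(key))
    )
```

Each field name returned is a candidate for ghost data and needs to be resolved one way or the other.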
How to Resolve
- If data is valid: Add it to the YAML file
- If data is invalid: Remove it from the database
- Never: Leave ghost data without resolution
File Naming Conventions
Custodian Files
data/custodian/{GHCID}.yaml
Example: data/custodian/NL-NH-AMS-M-RM.yaml
Person Entity Files
data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json
Example: data/custodian/person/entity/john-smith-12345_20251212T000000Z.json
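Both naming patterns can be generated rather than typed by hand. The helper names below are hypothetical; the sketch assumes the timestamp is a timezone-aware datetime and is rendered in UTC in the compact form shown in the example.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def custodian_path(ghcid: str) -> PurePosixPath:
    """Path of the custodian YAML file for a GHCID."""
    return PurePosixPath("data/custodian") / f"{ghcid}.yaml"


def person_entity_path(linkedin_slug: str, when: datetime) -> PurePosixPath:
    """Path of a person entity file; `when` must be timezone-aware."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PurePosixPath("data/custodian/person/entity") / f"{linkedin_slug}_{stamp}.json"
```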
Validation Checklist
Before committing enrichment data:
- Data written to custodian YAML file
- Provenance metadata included
- Social media links validated (no generic URLs)
- No fabricated/placeholder data
- Existing data preserved (additive enrichment)
- API responses match YAML file content
Related Documentation
| Document | Purpose |
|---|---|
| AGENTS.md | AI agent instructions with all rules |
| .opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md | Rule 22 detailed documentation |
| .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md | Rule 23 validation patterns |
| .opencode/DATA_PRESERVATION_RULES.md | Rule 5 data preservation |
| .opencode/DATA_FABRICATION_PROHIBITION.md | Rule 21 anti-fabrication |
Quick Reference
Rule numbers in this table match those cited in the related documentation, not the Rule 1 to Rule 5 headings in this guide.
| Rule | Summary |
|---|---|
| Rule 5 | Data enrichment is ADDITIVE ONLY |
| Rule 21 | Never fabricate data |
| Rule 22 | Custodian YAML is single source of truth |
| Rule 23 | Validate social media links |