glam/docs/DATA_GOVERNANCE.md
2025-12-14 17:09:55 +01:00

Data Governance Guide

This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.

Single Source of Truth

Custodian Data

The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data.

data/custodian/*.yaml                     <- SINGLE SOURCE OF TRUTH
                          |
                          v
+------------+------------+------------+------------+
|            |            |            |            |
v            v            v            v            v
Ducklake     PostgreSQL   TypeDB       Oxigraph     Qdrant
(analytics)  (geo API)    (graph)      (RDF/SPARQL) (vector)
                          |
                          v
REST API responses                        <- DERIVED (serve from databases)
                          |
                          v
Frontend display                          <- DERIVED (render from API)

ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.

The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| Ducklake | Analytics, aggregations | Import from YAML → Query |
| PostgreSQL | Geographic API, PostGIS | Import from YAML → Serve API |
| TypeDB | Graph queries, relationships | Import from YAML → Graph traversal |
| Oxigraph | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| Qdrant | Vector search, semantic | Import from YAML → Similarity search |
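
One way to keep a backend strictly derived is to rebuild it from the custodian records on every import, so anything that was written to the database by hand disappears on the next run. The sketch below is illustrative only: it uses an in-memory SQLite table as a stand-in for any of the five backends, and the function name `rebuild_derived_table`, the `institutions` table layout, and the already-parsed `custodians` mapping (GHCID to YAML content) are assumptions, not the project's actual import code.

```python
import json
import sqlite3


def rebuild_derived_table(conn: sqlite3.Connection, custodians: dict) -> int:
    """Drop and recreate the derived table from parsed custodian records.

    `custodians` maps a GHCID to the parsed content of its YAML file.
    Parsing the YAML itself is left out of this sketch.
    """
    rows = [(ghcid, json.dumps(record, sort_keys=True))
            for ghcid, record in custodians.items()]
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS institutions")
        conn.execute(
            "CREATE TABLE institutions (ghcid TEXT PRIMARY KEY, record TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO institutions VALUES (?, ?)", rows)
    return len(rows)
```

Because the table is recreated from scratch, a row that exists only in the database cannot survive an import, which matches the "derived, never independent" rule above.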

Schema Definition

The LinkML schema files in schemas/20251121/linkml/ are the SINGLE SOURCE OF TRUTH for the ontology definition.

All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.

Data Quality Rules

Rule 1: All Enrichment Must Be Written to YAML

When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.

Correct workflow:

1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml  <- MANDATORY
4. Import to database (optional)
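
The four steps above can be sketched as a single function that refuses to touch any database before the YAML write has happened. This is a minimal sketch under stated assumptions: `enrich_institution` and its four callables (`fetch`, `validate`, `write_yaml`, `import_to_db`) are hypothetical names introduced for illustration, and the real fetchers and YAML writer are passed in rather than shown.

```python
from typing import Callable, Optional


def enrich_institution(
    ghcid: str,
    fetch: Callable[[str], Optional[dict]],
    validate: Callable[[dict], bool],
    write_yaml: Callable[[str, dict], None],
    import_to_db: Optional[Callable[[str, dict], None]] = None,
) -> bool:
    """Run the workflow in order; return True only if the YAML was updated."""
    data = fetch(ghcid)                      # 1. fetch from external source
    if data is None or not validate(data):   # 2. validate data quality
        return False                         # nothing is written anywhere
    write_yaml(ghcid, data)                  # 3. mandatory YAML write
    if import_to_db is not None:             # 4. optional database import
        import_to_db(ghcid, data)
    return True
```

The database import only ever receives data that has already been handed to `write_yaml`, so the ordering of the rule is enforced by the control flow rather than by convention.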

Never do this:

# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")

Rule 2: Validate Social Media Links

Social media links must point to the specific institution's page, NOT to generic platform pages.

Invalid (must reject):

  • facebook.com/ - Generic homepage
  • facebook.com/facebook - Facebook's own page
  • twitter.com/ - Generic homepage
  • twitter.com/twitter - Twitter's own account

Valid (can store):

  • facebook.com/rijksmuseum/ - Institution's page
  • twitter.com/rijksmuseum - Institution's account

See .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md for validation patterns.
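
As a rough illustration of the invalid and valid examples above, the following sketch flags a link as generic when it has no path or when its first path segment is the platform's own handle. The function name `is_generic_social_link` and the `PLATFORM_OWN_HANDLES` table are assumptions made for this example; the authoritative patterns are the ones in .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md, which may differ.

```python
from urllib.parse import urlparse

# Hypothetical deny-list covering only the examples listed in this guide.
PLATFORM_OWN_HANDLES = {
    "facebook.com": {"facebook"},
    "twitter.com": {"twitter"},
}


def is_generic_social_link(url: str) -> bool:
    """Return True for platform homepages and the platform's own account."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.netloc.lower().removeprefix("www.")
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return True  # bare homepage such as facebook.com/
    return segments[0].lower() in PLATFORM_OWN_HANDLES.get(host, set())
```

For example, `is_generic_social_link("facebook.com/facebook")` is true and the link must be rejected, while `is_generic_social_link("facebook.com/rijksmuseum/")` is false and the link can be stored.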

Rule 3: No Data Fabrication

All data must be real and verifiable. Never create placeholder or fake data.

Forbidden:

  • Inventing names, titles, or descriptions
  • Creating fictional URLs or identifiers
  • Generating fallback data when extraction fails

Allowed:

  • Returning null or empty fields for missing data
  • Skipping records that cannot be extracted
  • Logging extraction failures
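
In code, the allowed behaviours usually come down to returning an empty value and logging, rather than substituting a default. The sketch below shows this for a hypothetical `extract_title` helper that pulls a page title out of HTML with a simple regular expression; the helper, the logger name, and the regex are illustrative assumptions, not part of the project's extractors.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("enrichment")

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None when it cannot be extracted."""
    match = TITLE_RE.search(html)
    if match is None or not match.group(1).strip():
        # Log the failure and leave the field empty; never invent a fallback.
        logger.warning("title extraction failed; leaving field empty")
        return None
    return match.group(1).strip()
```

A caller that receives `None` can skip the record or store a null field, which keeps fabricated values such as "Unknown Museum" out of the YAML.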

Rule 4: Data Enrichment is Additive

Never delete enriched data. Enrichment operations should only add or update, not remove.

Exception: Invalid/garbage data (like generic social media links) can be removed as it was never valid enrichment.
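
A minimal sketch of an additive merge, assuming the existing and incoming enrichment are both plain nested dictionaries as parsed from YAML. The name `merge_enrichment` is hypothetical. Keys absent from the incoming data, or present with a `None` value, never erase what is already stored.

```python
from copy import deepcopy


def merge_enrichment(existing: dict, incoming: dict) -> dict:
    """Return a new record in which incoming values add to or update existing ones."""
    merged = deepcopy(existing)
    for key, value in incoming.items():
        if value is None:
            continue  # a missing value must not remove stored data
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_enrichment(merged[key], value)  # merge nested maps
        else:
            merged[key] = deepcopy(value)  # add a new key or update an old one
    return merged
```

Removing invalid data under the exception above would be a separate, deliberate step; this merge never deletes anything on its own.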

Rule 5: Provenance Tracking

All data must include provenance metadata:

provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
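
A provenance block of this shape can be checked before it is written. The sketch below assumes the block has already been parsed into a dictionary and that every field except `confidence_score` is required, which is an inference from the "Optional" comment above rather than a documented rule. The function name `provenance_errors` is hypothetical. `data_source` is only checked for presence because the list of allowed sources above is open-ended.

```python
from datetime import datetime

DATA_TIERS = {
    "TIER_1_AUTHORITATIVE",
    "TIER_2_VERIFIED",
    "TIER_3_CROWD_SOURCED",
    "TIER_4_INFERRED",
}
# Assumption: all fields except confidence_score are mandatory.
REQUIRED_FIELDS = ("data_source", "data_tier", "extraction_date", "extraction_method")


def provenance_errors(prov: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not prov.get(name)]
    tier = prov.get("data_tier")
    if tier and tier not in DATA_TIERS:
        errors.append(f"unknown data_tier: {tier}")
    date = prov.get("extraction_date")
    if date:
        try:
            # Older Python versions do not accept a trailing "Z" here.
            datetime.fromisoformat(str(date).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"extraction_date is not ISO 8601: {date}")
    score = prov.get("confidence_score")
    if score is not None:
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            errors.append(f"confidence_score out of range: {score}")
    return errors
```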

Ghost Data Detection

"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.

How to Detect

# Query the API
curl -s "http://localhost:8002/institution/{GHCID}" | jq '.field'

# Check the YAML file
grep "field:" data/custodian/{GHCID}.yaml

# If API returns data but YAML doesn't have it = Ghost data!
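
The same comparison can be automated for a whole record. The sketch below assumes the API response and the YAML file have both been loaded into dictionaries and that API field names mirror the YAML keys, which may not hold for every endpoint. The function name `find_ghost_fields` is hypothetical.

```python
def find_ghost_fields(api_record: dict, yaml_record: dict, prefix: str = "") -> list:
    """List dotted paths that carry data in the API but not in the YAML."""
    empty = (None, "", [], {})
    ghosts = []
    for key, value in api_record.items():
        path = f"{prefix}{key}"
        if value in empty:
            continue  # the API returns nothing here, so there is nothing to flag
        if key not in yaml_record or yaml_record[key] in empty:
            ghosts.append(path)  # the API has data that the YAML lacks
        elif isinstance(value, dict) and isinstance(yaml_record[key], dict):
            ghosts.extend(find_ghost_fields(value, yaml_record[key], path + "."))
    return ghosts
```

This only detects fields that are missing from the YAML; it does not flag fields whose values differ, which would need a separate equality check.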

How to Resolve

  1. If data is valid: Add it to the YAML file
  2. If data is invalid: Remove it from the database
  3. Never: Leave ghost data without resolution

File Naming Conventions

Custodian Files

data/custodian/{GHCID}.yaml

Example: data/custodian/NL-NH-AMS-M-RM.yaml

Person Entity Files

data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json

Example: data/custodian/person/entity/john-smith-12345_20251212T000000Z.json
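
Both naming patterns can be built by small helpers so that callers never assemble paths by hand. This is a sketch under assumptions: the helper names are hypothetical, and the timestamp format (compact UTC, `YYYYMMDDTHHMMSSZ`) is inferred from the example above rather than taken from a specification.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

CUSTODIAN_DIR = PurePosixPath("data/custodian")


def custodian_path(ghcid: str) -> PurePosixPath:
    """Path of the custodian YAML file for a GHCID."""
    return CUSTODIAN_DIR / f"{ghcid}.yaml"


def person_entity_path(linkedin_slug: str, when: datetime) -> PurePosixPath:
    """Path of a person entity file; `when` must be timezone-aware."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return CUSTODIAN_DIR / "person" / "entity" / f"{linkedin_slug}_{stamp}.json"
```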

Validation Checklist

Before committing enrichment data:

  • Data written to custodian YAML file
  • Provenance metadata included
  • Social media links validated (no generic URLs)
  • No fabricated/placeholder data
  • Existing data preserved (additive enrichment)
  • API responses match YAML file content

Related Documentation

| Document | Purpose |
|----------|---------|
| AGENTS.md | AI agent instructions with all rules |
| .opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md | Rule 22 detailed documentation |
| .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md | Rule 23 validation patterns |
| .opencode/DATA_PRESERVATION_RULES.md | Rule 5 data preservation |
| .opencode/DATA_FABRICATION_PROHIBITION.md | Rule 21 anti-fabrication |

Quick Reference

| Rule | Summary |
|------|---------|
| Rule 5 | Data enrichment is ADDITIVE ONLY |
| Rule 21 | Never fabricate data |
| Rule 22 | Custodian YAML is single source of truth |
| Rule 23 | Validate social media links |