glam/docs/DATA_GOVERNANCE.md

# Data Governance Guide

This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.

## Single Source of Truth

### Custodian Data

The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data.

```
data/custodian/*.yaml                  <- SINGLE SOURCE OF TRUTH
              |
              v
+-------------+-------------+-------------+-------------+
|             |             |             |             |
v             v             v             v             v
Ducklake      PostgreSQL    TypeDB        Oxigraph      Qdrant
(analytics)   (geo API)     (graph)       (RDF/SPARQL)  (vector)
              |
              v
REST API responses                     <- DERIVED (served from databases)
              |
              v
Frontend display                       <- DERIVED (rendered from API)
```

**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**

### The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |
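
The one-way data flow in the table can be sketched as a rebuild step that feeds the same YAML-derived records to every backend and never reads anything back. This is a minimal sketch, not the project's import tooling: `rebuild_backends` and the per-backend importer callables are hypothetical names, and the records are assumed to be already loaded from `data/custodian/*.yaml` as dictionaries.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
# Hypothetical per-backend loader: receives records, writes them to one database.
Importer = Callable[[List[Record]], None]


def rebuild_backends(records: Iterable[Record], importers: Dict[str, Importer]) -> List[str]:
    """Rebuild every backend from the same YAML-derived records.

    Data flows only from the records into the backends; nothing is read back.
    Returns the names of the rebuilt backends in call order.
    """
    snapshot = list(records)  # every backend is fed from the identical input
    rebuilt = []
    for name, load in importers.items():
        load(list(snapshot))  # fresh list per backend so importers cannot affect each other
        rebuilt.append(name)
    return rebuilt
```

Because each importer only receives records, a backend has no path for adding enrichments of its own in this design.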

### Schema Definition

The LinkML schema files in `schemas/20251121/linkml/` are the **SINGLE SOURCE OF TRUTH** for the ontology definition.

All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.

## Data Quality Rules

### Rule 1: All Enrichment Must Be Written to YAML

When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.

**Correct workflow**:
```
1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml  <- MANDATORY
4. Import to database (optional)
```
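
The four steps above can be sketched as a single function that enforces their order. All names here (`enrich`, `fetch`, `validate`, `write_yaml`, `import_to_db`) are hypothetical placeholders passed in by the caller, not functions from the project; the point is only that the database import cannot run unless the YAML write has happened first.

```python
from typing import Callable, Optional


def enrich(
    ghcid: str,
    fetch: Callable[[str], Optional[dict]],
    validate: Callable[[dict], bool],
    write_yaml: Callable[[str, dict], None],
    import_to_db: Optional[Callable[[str, dict], None]] = None,
) -> bool:
    """Run the enrichment steps in order; return True if data was stored."""
    data = fetch(ghcid)                       # 1. fetch from the external source
    if data is None or not validate(data):    # 2. validate data quality
        return False                          # nothing is written anywhere
    write_yaml(ghcid, data)                   # 3. mandatory: YAML first
    if import_to_db is not None:              # 4. optional: database import
        import_to_db(ghcid, data)
    return True
```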

**Never do this**:
```
# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")
```

### Rule 2: Social Media Link Validation

Social media links must point to the specific institution's page, NOT to generic platform pages.

**Invalid (must reject)**:
- `facebook.com/` - Generic homepage
- `facebook.com/facebook` - Facebook's own page
- `twitter.com/` - Generic homepage
- `twitter.com/twitter` - Twitter's own account

**Valid (can store)**:
- `facebook.com/rijksmuseum/` - Institution's page
- `twitter.com/rijksmuseum` - Institution's account

See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for validation patterns.
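
As an illustration of this rule, the check below rejects the four invalid examples and accepts the two valid ones. It is a minimal sketch built only from the examples listed here; the actual patterns live in `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` and may differ. The name `is_institution_link` and the `GENERIC_HANDLES` table are assumptions, and rejecting unknown platforms is a conservative choice of this sketch.

```python
from urllib.parse import urlparse

# Assumption: only the platforms and generic handles named in this guide.
GENERIC_HANDLES = {
    "facebook.com": {"", "facebook"},
    "twitter.com": {"", "twitter"},
}


def is_institution_link(url: str) -> bool:
    """Return True if the URL points at a specific account, not a generic page."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    handle = parsed.path.strip("/").lower()
    if host not in GENERIC_HANDLES:
        return False  # platform not covered by this sketch
    return handle not in GENERIC_HANDLES[host]
```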

### Rule 3: No Data Fabrication

All data must be real and verifiable. Never create placeholder or fake data.

**Forbidden**:
- Inventing names, titles, or descriptions
- Creating fictional URLs or identifiers
- Generating fallback data when extraction fails

**Allowed**:
- Returning `null` or empty fields for missing data
- Skipping records that cannot be extracted
- Logging extraction failures
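
In code, the allowed behaviour amounts to returning `None` and logging the failure instead of substituting a default. The extractor below is a hypothetical example (the function name, logger name, and the choice of the HTML `<title>` element are assumptions for illustration), not part of the project's scrapers.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("enrichment")  # hypothetical logger name


def extract_page_title(html: str) -> Optional[str]:
    """Return the page title, or None when it cannot be extracted."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    if not title:
        # Log the failure and return None; never invent a fallback value.
        logger.warning("title extraction failed; storing null instead of a fallback")
        return None
    return title
```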

### Rule 4: Data Enrichment is Additive

Never delete enriched data. Enrichment operations should only add or update, not remove.

**Exception**: Invalid data, such as a generic social media link (see Rule 2), may be removed because it was never valid enrichment.
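
One way to keep enrichment additive is to merge new values into the existing record rather than overwrite the record as a whole. The helper below is a sketch under that assumption; `merge_enrichment` is a hypothetical name, and treating `None` in the incoming data as "no information" rather than as a deletion is a design choice of this example.

```python
def merge_enrichment(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with `new` merged in; no key is ever removed."""
    merged = dict(existing)
    for key, value in new.items():
        if value is None:
            continue  # a missing value must not erase existing data
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_enrichment(merged[key], value)  # merge nested mappings
        else:
            merged[key] = value  # add a new field or update an existing one
    return merged
```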

### Rule 5: Provenance Tracking

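All data must include provenance metadata. In the template below, `|` separates the allowed values for a field and `...` marks a list that continues: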

```yaml
provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
```
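
A provenance block can be checked mechanically before a record is written. The sketch below assumes the four fields not marked optional are required and that the four `data_tier` values shown are the complete set; `data_source` is left unchecked because its list is open-ended. The name `provenance_errors` is hypothetical.

```python
from datetime import datetime
from typing import List

REQUIRED_FIELDS = ("data_source", "data_tier", "extraction_date", "extraction_method")
DATA_TIERS = {
    "TIER_1_AUTHORITATIVE",
    "TIER_2_VERIFIED",
    "TIER_3_CROWD_SOURCED",
    "TIER_4_INFERRED",
}


def provenance_errors(provenance: dict) -> List[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not provenance.get(name)]
    tier = provenance.get("data_tier")
    if tier and tier not in DATA_TIERS:
        errors.append(f"unknown data_tier: {tier}")
    date = provenance.get("extraction_date")
    if date:
        try:
            # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalise it.
            datetime.fromisoformat(str(date).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"extraction_date is not ISO 8601: {date}")
    score = provenance.get("confidence_score")  # optional field
    if score is not None and not (isinstance(score, (int, float)) and 0.0 <= score <= 1.0):
        errors.append(f"confidence_score outside 0.0-1.0: {score}")
    return errors
```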

## Ghost Data Detection

"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.

### How to Detect

```bash
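# Example GHCID; substitute the institution you are checking
GHCID="NL-NH-AMS-M-RM"

# Query the API (replace .field with the field under inspection)
curl -s "http://localhost:8002/institution/${GHCID}" | jq '.field'

# Check the YAML file for the same field
grep "field:" "data/custodian/${GHCID}.yaml"

# If the API returns data but the YAML file has no such field, it is ghost data.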
```
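
The same comparison can be automated once both sides are loaded as dictionaries. The function below is a sketch that assumes API field names mirror the YAML keys, which may not hold for every endpoint; `find_ghost_fields` is a hypothetical name, and loading the API response and YAML file is left to the caller.

```python
from typing import List

EMPTY_VALUES = (None, "", [], {})


def find_ghost_fields(api_record: dict, yaml_record: dict, prefix: str = "") -> List[str]:
    """List dotted paths of fields the API serves but the YAML record lacks."""
    ghosts = []
    for key, value in api_record.items():
        path = f"{prefix}{key}"
        if value in EMPTY_VALUES:
            continue  # an empty API value carries no data
        if key not in yaml_record or yaml_record[key] in EMPTY_VALUES:
            ghosts.append(path)  # served by the API, absent from the YAML
        elif isinstance(value, dict) and isinstance(yaml_record[key], dict):
            ghosts.extend(find_ghost_fields(value, yaml_record[key], path + "."))
    return ghosts
```

An empty result means every non-empty API field has a counterpart in the YAML record; it does not check that the values themselves match.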

### How to Resolve

1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **Never**: Leave ghost data without resolution

## File Naming Conventions

### Custodian Files

```
data/custodian/{GHCID}.yaml
```

Example: `data/custodian/NL-NH-AMS-M-RM.yaml`

### Person Entity Files

```
data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json
```

Example: `data/custodian/person/entity/john-smith-12345_20251212T000000Z.json`
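
Both naming patterns can be generated rather than typed by hand. The helpers below are a sketch: `custodian_path` and `person_entity_path` are hypothetical names, and the compact UTC timestamp format is inferred from the example filename above.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

CUSTODIAN_DIR = PurePosixPath("data/custodian")


def custodian_path(ghcid: str) -> PurePosixPath:
    """Path of the custodian YAML file for a GHCID."""
    return CUSTODIAN_DIR / f"{ghcid}.yaml"


def person_entity_path(linkedin_slug: str, when: datetime) -> PurePosixPath:
    """Path of a person entity JSON file, stamped in UTC."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return CUSTODIAN_DIR / "person" / "entity" / f"{linkedin_slug}_{stamp}.json"
```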

## Validation Checklist

Before committing enrichment data:

- [ ] Data written to custodian YAML file
- [ ] Provenance metadata included
- [ ] Social media links validated (no generic URLs)
- [ ] No fabricated/placeholder data
- [ ] Existing data preserved (additive enrichment)
- [ ] API responses match YAML file content

## Related Documentation

| Document | Purpose |
|----------|---------|
| `AGENTS.md` | AI agent instructions with all rules |
| `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` | Rule 22 detailed documentation |
| `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` | Rule 23 validation patterns |
| `.opencode/DATA_PRESERVATION_RULES.md` | Rule 5 data preservation |
| `.opencode/DATA_FABRICATION_PROHIBITION.md` | Rule 21 anti-fabrication |

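## Quick Reference

Rule numbers in this table follow the project-wide numbering used in `AGENTS.md` and the `.opencode/` documents, not the Rule 1 to Rule 5 numbering used earlier in this guide.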

| Rule | Summary |
|------|---------|
| **Rule 5** | Data enrichment is ADDITIVE ONLY |
| **Rule 21** | Never fabricate data |
| **Rule 22** | Custodian YAML is single source of truth |
| **Rule 23** | Validate social media links |