glam/docs/DATA_GOVERNANCE.md
2025-12-14 17:09:55 +01:00

# Data Governance Guide
This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.
## Single Source of Truth
### Custodian Data
The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data.
```
                  data/custodian/*.yaml   <- SINGLE SOURCE OF TRUTH
                            |
                            v
    +-----------+-----------+-----------+-----------+
    |           |           |           |           |
    v           v           v           v           v
Ducklake   PostgreSQL    TypeDB     Oxigraph     Qdrant
(analytics) (geo API)    (graph)  (RDF/SPARQL)  (vector)
                            |
                            v
                   REST API responses     <- DERIVED (serve from databases)
                            |
                            v
                    Frontend display      <- DERIVED (render from API)
```
**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**
### The Five Database Backends
| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |
### Schema Definition
The LinkML schema files in `schemas/20251121/linkml/` are the **SINGLE SOURCE OF TRUTH** for the ontology definition.
All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.
## Data Quality Rules
### Rule 1: All Enrichment Must Be Written to YAML
When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.
**Correct workflow**:
```
1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml <- MANDATORY
4. Import to database (optional)
```
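The ordering of these steps can be sketched as a small driver function. This is a minimal illustration, not the project's actual pipeline: `enrich_custodian` and the four callables it receives are hypothetical names. The optional database step is handed only the GHCID, never the fetched data, so in this sketch a database import can only read what was already written to YAML.

```python
from typing import Callable, Optional


def enrich_custodian(
    ghcid: str,
    fetch: Callable[[str], dict],
    validate: Callable[[dict], bool],
    write_yaml: Callable[[str, dict], None],
    import_to_db: Optional[Callable[[str], None]] = None,
) -> bool:
    """Run the enrichment steps in order; return False if validation rejects the data."""
    data = fetch(ghcid)          # 1. Fetch data from the external source
    if not validate(data):       # 2. Validate data quality
        return False             #    Rejected data reaches neither YAML nor database
    write_yaml(ghcid, data)      # 3. MANDATORY: update data/custodian/{GHCID}.yaml
    if import_to_db is not None:
        import_to_db(ghcid)      # 4. Optional: re-import, reading from the YAML file
    return True
```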
**Never do this**:
```python
# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")
```
### Rule 2: Social Media Link Validation
Social media links must point to the specific institution's page, NOT to generic platform pages.
**Invalid (must reject)**:
- `facebook.com/` - Generic homepage
- `facebook.com/facebook` - Facebook's own page
- `twitter.com/` - Generic homepage
- `twitter.com/twitter` - Twitter's own account
**Valid (can store)**:
- `facebook.com/rijksmuseum/` - Institution's page
- `twitter.com/rijksmuseum` - Institution's account
See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for validation patterns.
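As a rough illustration, the check can be sketched with the standard library. This covers only the examples listed above; the authoritative patterns are in the referenced validation document. The function name `is_institution_link` and the `PLATFORM_OWN_HANDLES` table are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical table: platform host -> handles that belong to the platform itself.
PLATFORM_OWN_HANDLES = {
    "facebook.com": {"facebook"},
    "twitter.com": {"twitter"},
}


def is_institution_link(url: str) -> bool:
    """Return True if the URL names a specific account, not a generic platform page."""
    # urlparse only fills in netloc when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in PLATFORM_OWN_HANDLES:
        return False  # platform not covered by this sketch: reject conservatively
    # The first non-empty path segment is treated as the account handle.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        return False  # generic homepage, e.g. facebook.com/
    return segments[0].lower() not in PLATFORM_OWN_HANDLES[host]
```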
### Rule 3: No Data Fabrication
All data must be real and verifiable. Never create placeholder or fake data.
**Forbidden**:
- Inventing names, titles, or descriptions
- Creating fictional URLs or identifiers
- Generating fallback data when extraction fails
**Allowed**:
- Returning `null` or empty fields for missing data
- Skipping records that cannot be extracted
- Logging extraction failures
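In code, the allowed behaviour amounts to returning `None` and logging instead of inventing a default. A minimal sketch, assuming the source record is a plain dictionary; `extract_field` and the logger name `enrichment` are hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("enrichment")


def extract_field(record: dict, key: str) -> Optional[str]:
    """Return the stripped value for `key`, or None if it is missing or empty."""
    value = record.get(key)
    if not isinstance(value, str) or not value.strip():
        # Log the failure and return None; never substitute a made-up fallback.
        logger.warning("extraction failed: %r missing in source record", key)
        return None
    return value.strip()
```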
### Rule 4: Data Enrichment is Additive
Never delete enriched data. Enrichment operations should only add or update, not remove.
**Exception**: Invalid or garbage data (such as generic social media links) may be removed, because it was never valid enrichment.
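One way to enforce the additive rule is a merge that can add or update keys but has no code path that deletes one. The sketch below assumes custodian records are nested dictionaries after loading; `merge_additive` is a hypothetical helper, and removal of invalid data under the exception would be a separate, explicit step.

```python
def merge_additive(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with `new` merged in; no key is ever dropped."""
    merged = dict(existing)
    for key, value in new.items():
        if value is None:
            continue  # a missing value in the new data must not erase old data
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling keys inside nested mappings are preserved.
            merged[key] = merge_additive(merged[key], value)
        else:
            merged[key] = value  # add a new key or update an existing one
    return merged
```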
### Rule 5: Provenance Tracking
All data must include provenance metadata:
```yaml
provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
```
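A provenance block of this shape can be checked mechanically before it is written. The following is a sketch under stated assumptions: `provenance_errors` is a hypothetical helper, the timestamp format is inferred from the single example above, and `data_source` is only checked for presence because its list of values is not given in full here.

```python
from datetime import datetime

DATA_TIERS = {
    "TIER_1_AUTHORITATIVE",
    "TIER_2_VERIFIED",
    "TIER_3_CROWD_SOURCED",
    "TIER_4_INFERRED",
}
REQUIRED_KEYS = ("data_source", "data_tier", "extraction_date", "extraction_method")


def provenance_errors(prov: dict) -> list:
    """Return a list of human-readable problems; an empty list means the block passes."""
    errors = [f"missing {key}" for key in REQUIRED_KEYS if not prov.get(key)]
    tier = prov.get("data_tier")
    if tier and tier not in DATA_TIERS:
        errors.append(f"unknown data_tier: {tier}")
    date = prov.get("extraction_date")
    if date:
        try:
            # Assumed format, taken from the example "2025-12-12T00:00:00Z".
            datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
        except (TypeError, ValueError):
            errors.append(f"bad extraction_date: {date}")
    score = prov.get("confidence_score")  # optional field
    if score is not None and not (isinstance(score, (int, float)) and 0.0 <= score <= 1.0):
        errors.append(f"confidence_score out of range: {score}")
    return errors
```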
## Ghost Data Detection
"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.
### How to Detect
```bash
# Query the API
curl -s "http://localhost:8002/institution/{GHCID}" | jq '.field'
# Check the YAML file
grep "field:" data/custodian/{GHCID}.yaml
# Any value the API returns that is absent from the YAML file is ghost data.
```
### How to Resolve
1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **Never**: Leave ghost data without resolution
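For whole records rather than single fields, the comparison can be sketched as a recursive diff. This assumes both the API response and the YAML file have already been loaded into dictionaries and that API field names mirror the YAML keys, which may not hold for every endpoint; `find_ghost_fields` is a hypothetical helper.

```python
def find_ghost_fields(api_record: dict, yaml_record: dict, prefix: str = "") -> list:
    """Return dotted paths of values present in the API response but not in the YAML."""
    ghosts = []
    for key, value in api_record.items():
        path = f"{prefix}{key}"
        if value is None:
            continue  # an empty API field carries no data, so it cannot be ghost data
        if key not in yaml_record or yaml_record[key] is None:
            ghosts.append(path)
        elif isinstance(value, dict) and isinstance(yaml_record[key], dict):
            # Descend into nested mappings and report the full path of each ghost.
            ghosts.extend(find_ghost_fields(value, yaml_record[key], path + "."))
    return ghosts
```

Each path returned then goes through the resolution steps above: add the value to the YAML file if it is valid, or remove it from the database if it is not.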
## File Naming Conventions
### Custodian Files
```
data/custodian/{GHCID}.yaml
```
Example: `data/custodian/NL-NH-AMS-M-RM.yaml`
### Person Entity Files
```
data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json
```
Example: `data/custodian/person/entity/john-smith-12345_20251212T000000Z.json`
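Both conventions can be generated rather than typed by hand. A minimal sketch: the helper names are hypothetical, and the compact UTC timestamp format is inferred from the example filename above.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

CUSTODIAN_DIR = PurePosixPath("data/custodian")


def custodian_path(ghcid: str) -> PurePosixPath:
    """Path of the custodian YAML file for a GHCID."""
    return CUSTODIAN_DIR / f"{ghcid}.yaml"


def person_entity_path(linkedin_slug: str, when: datetime) -> PurePosixPath:
    """Path of a person entity file; `when` must be a timezone-aware datetime."""
    # Convert to UTC and format as in the example: 20251212T000000Z (assumed format).
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return CUSTODIAN_DIR / "person" / "entity" / f"{linkedin_slug}_{stamp}.json"
```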
## Validation Checklist
Before committing enrichment data:
- [ ] Data written to custodian YAML file
- [ ] Provenance metadata included
- [ ] Social media links validated (no generic URLs)
- [ ] No fabricated/placeholder data
- [ ] Existing data preserved (additive enrichment)
- [ ] API responses match YAML file content
## Related Documentation
| Document | Purpose |
|----------|---------|
| `AGENTS.md` | AI agent instructions with all rules |
| `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` | Rule 22 detailed documentation |
| `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` | Rule 23 validation patterns |
| `.opencode/DATA_PRESERVATION_RULES.md` | Rule 5 data preservation |
| `.opencode/DATA_FABRICATION_PROHIBITION.md` | Rule 21 anti-fabrication |
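## Quick Reference
The rule numbers in this table are the ones cited under Related Documentation; they differ from the Rule 1 to Rule 5 section numbering used under Data Quality Rules.
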
| Rule | Summary |
|------|---------|
| **Rule 5** | Data enrichment is ADDITIVE ONLY |
| **Rule 21** | Never fabricate data |
| **Rule 22** | Custodian YAML is single source of truth |
| **Rule 23** | Validate social media links |