# Data Governance Guide

This document outlines the data governance principles and rules for the GLAM Heritage Custodian project.

## Single Source of Truth

### Custodian Data

The `data/custodian/*.yaml` files are the **SINGLE SOURCE OF TRUTH** for all heritage institution enrichment data.

```
                      data/custodian/*.yaml   <- SINGLE SOURCE OF TRUTH
                                |
                                v
      +------------+------------+------------+------------+
      |            |            |            |            |
      v            v            v            v            v
  Ducklake    PostgreSQL     TypeDB      Oxigraph      Qdrant
 (analytics)   (geo API)     (graph)   (RDF/SPARQL)   (vector)
                                |
                                v
                       REST API responses     <- DERIVED (serve from databases)
                                |
                                v
                        Frontend display      <- DERIVED (render from API)
```

**ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.**

### The Five Database Backends

| Database | Purpose | Data Flow |
|----------|---------|-----------|
| **Ducklake** | Analytics, aggregations | Import from YAML → Query |
| **PostgreSQL** | Geographic API, PostGIS | Import from YAML → Serve API |
| **TypeDB** | Graph queries, relationships | Import from YAML → Graph traversal |
| **Oxigraph** | RDF/SPARQL, Linked Data | Import from YAML → SPARQL endpoint |
| **Qdrant** | Vector search, semantic | Import from YAML → Similarity search |

### Schema Definition

The LinkML schema files in `schemas/20251121/linkml/` are the **SINGLE SOURCE OF TRUTH** for the ontology definition.

All RDF, TypeDB, and UML files are DERIVED from LinkML schemas.

## Data Quality Rules

### Rule 1: All Enrichment Must Be Written to YAML

When enriching institution data from any source (Google Maps, Wikidata, LinkedIn, web scraping), the data MUST be written to the custodian YAML file.

**Correct workflow**:
```
1. Fetch data from external source
2. Validate data quality
3. Write to data/custodian/{GHCID}.yaml  <- MANDATORY
4. Import to database (optional)
```

**Never do this**:
```
# WRONG - Writing directly to database without updating YAML
cursor.execute("UPDATE institutions SET field = value ...")
```

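
The ordering of the correct workflow can be sketched as a small function. This is a minimal illustration, not the project's actual pipeline: `enrich_custodian` and the four callables it receives (`fetch`, `validate`, `write_yaml`, `import_to_db`) are hypothetical names introduced here as stand-ins for project-specific code.

```python
from typing import Any, Callable, Dict, Optional

Record = Dict[str, Any]


def enrich_custodian(
    ghcid: str,
    fetch: Callable[[str], Optional[Record]],
    validate: Callable[[Record], bool],
    write_yaml: Callable[[str, Record], None],
    import_to_db: Optional[Callable[[str, Record], None]] = None,
) -> bool:
    """Run the four workflow steps in order; return True if YAML was written.

    Every callable is a hypothetical stand-in, not an existing project API.
    """
    record = fetch(ghcid)                    # 1. Fetch data from external source
    if record is None or not validate(record):
        return False                         # 2. Validation failed: write nothing
    write_yaml(ghcid, record)                # 3. MANDATORY: YAML is written first
    if import_to_db is not None:
        import_to_db(ghcid, record)          # 4. Optional: the database is derived
    return True
```

Because the database import can only run after the YAML write has succeeded, a database row without a matching YAML entry cannot be produced through this path.
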
### Rule 2: Social Media Link Validation

Social media links must point to the specific institution's page, NOT to generic platform pages.

**Invalid (must reject)**:
- `facebook.com/` - Generic homepage
- `facebook.com/facebook` - Facebook's own page
- `twitter.com/` - Generic homepage
- `twitter.com/twitter` - Twitter's own account

**Valid (can store)**:
- `facebook.com/rijksmuseum/` - Institution's page
- `twitter.com/rijksmuseum` - Institution's account

See `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for validation patterns.

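
The valid and invalid examples above can be checked with a short function. This is a sketch only and does not reproduce the patterns in `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md`; the names `is_institution_link` and `PLATFORM_OWN_HANDLES` are hypothetical, and the handle table covers just the two platforms listed in this rule.

```python
from urllib.parse import urlparse

# Platform-owned handles taken from the examples above (assumed, not exhaustive).
PLATFORM_OWN_HANDLES = {
    "facebook.com": {"facebook"},
    "twitter.com": {"twitter"},
}


def is_institution_link(url: str) -> bool:
    """Return True if the URL points at a specific page, not a generic one."""
    # urlparse only fills netloc when a scheme or a leading '//' is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        return False  # generic homepage, e.g. facebook.com/
    handle = segments[0].lower()
    # Reject the platform's own page, e.g. twitter.com/twitter.
    return handle not in PLATFORM_OWN_HANDLES.get(host, set())
```

The check only rejects generic links; it does not confirm that the page actually belongs to the institution, which still needs verification against the source.
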
### Rule 3: No Data Fabrication

All data must be real and verifiable. Never create placeholder or fake data.

**Forbidden**:
- Inventing names, titles, or descriptions
- Creating fictional URLs or identifiers
- Generating fallback data when extraction fails

**Allowed**:
- Returning `null` or empty fields for missing data
- Skipping records that cannot be extracted
- Logging extraction failures

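
In code, the allowed behaviour amounts to returning nothing and logging the failure instead of substituting a default. The sketch below uses page-title extraction purely as an illustration; `extract_title` and `TITLE_RE` are hypothetical names, not part of the project.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str, source_url: str) -> Optional[str]:
    """Return the page title, or None when it cannot be extracted."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    if not title:
        # Log the failure and return None; never substitute a made-up value.
        logger.warning("No title extracted from %s", source_url)
        return None
    return title
```

A caller can then store `null` for the field or skip the record, both of which are listed as allowed above.
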
### Rule 4: Data Enrichment is Additive

Never delete enriched data. Enrichment operations should only add or update, not remove.

**Exception**: Invalid/garbage data (like generic social media links) can be removed as it was never valid enrichment.

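
One way to keep enrichment additive is to merge new values into the existing record rather than replacing it. The following is a minimal sketch that assumes records are already loaded as nested dictionaries; `merge_enrichment` is a hypothetical helper, and treating `None` as "no data" is an assumption made for this example.

```python
from typing import Any, Dict


def merge_enrichment(existing: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `existing` with values from `new` added or updated.

    Keys are never removed, and a None in `new` never overwrites stored data.
    """
    merged = dict(existing)
    for key, value in new.items():
        if value is None:
            continue  # missing data must not erase an existing value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_enrichment(merged[key], value)  # merge nested maps
        else:
            merged[key] = value  # add a new key or update an existing one
    return merged
```

Removing invalid data under the exception above would remain a separate, explicit step rather than a side effect of the merge.
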
### Rule 5: Provenance Tracking

All data must include provenance metadata:

```yaml
provenance:
  data_source: CSV_REGISTRY | CONVERSATION_NLP | WIKIDATA | GOOGLE_MAPS | ...
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: "2025-12-12T00:00:00Z"
  extraction_method: "Description of how data was extracted"
  confidence_score: 0.0 - 1.0  # Optional
```

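
A provenance block with this shape could be checked before it is written. The sketch below assumes the block has already been parsed into a dictionary and that every field except `confidence_score` is required, which the template implies but does not state outright. `provenance_errors` is a hypothetical name, and `data_source` is only checked for presence because its list of values is open-ended above.

```python
from datetime import datetime
from typing import Any, Dict, List

REQUIRED_KEYS = ("data_source", "data_tier", "extraction_date", "extraction_method")
DATA_TIERS = {
    "TIER_1_AUTHORITATIVE",
    "TIER_2_VERIFIED",
    "TIER_3_CROWD_SOURCED",
    "TIER_4_INFERRED",
}


def provenance_errors(prov: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    errors = [f"missing {key}" for key in REQUIRED_KEYS if not prov.get(key)]
    if prov.get("data_tier") and prov["data_tier"] not in DATA_TIERS:
        errors.append("unknown data_tier")
    date = prov.get("extraction_date")
    if date:
        try:
            # fromisoformat() accepts a trailing 'Z' only on Python 3.11+.
            datetime.fromisoformat(str(date).replace("Z", "+00:00"))
        except ValueError:
            errors.append("extraction_date is not ISO 8601")
    score = prov.get("confidence_score")  # optional field
    if score is not None and not (
        isinstance(score, (int, float)) and 0.0 <= score <= 1.0
    ):
        errors.append("confidence_score outside 0.0-1.0")
    return errors
```
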
## Ghost Data Detection

"Ghost data" is data that appears in API responses but doesn't exist in custodian YAML files.

### How to Detect

```bash
# Query the API
curl -s "http://localhost:8002/institution/{GHCID}" | jq '.field'

# Check the YAML file
grep "field:" data/custodian/{GHCID}.yaml

# If API returns data but YAML doesn't have it = Ghost data!
```

### How to Resolve

1. **If data is valid**: Add it to the YAML file
2. **If data is invalid**: Remove it from the database
3. **Never**: Leave ghost data without resolution

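
The manual comparison in "How to Detect" can also be expressed as a function over two already-loaded records. This is a sketch under stated assumptions: the API response and the YAML file have both been parsed into dictionaries that use the same top-level field names, and `find_ghost_fields` is a hypothetical helper that compares top-level fields only.

```python
from typing import Any, Dict, List

EMPTY_VALUES = (None, "", [], {})


def find_ghost_fields(api_record: Dict[str, Any], yaml_record: Dict[str, Any]) -> List[str]:
    """Return API field names that hold data but are absent or empty in YAML."""
    ghosts = []
    for field, value in api_record.items():
        if value in EMPTY_VALUES:
            continue  # the API returned nothing for this field
        if yaml_record.get(field) in EMPTY_VALUES:
            ghosts.append(field)  # data in the API, nothing in the YAML
    return sorted(ghosts)
```

Each field the function reports still has to go through the resolution steps above: add it to the YAML file if it is valid, or remove it from the database if it is not.
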
## File Naming Conventions

### Custodian Files

```
data/custodian/{GHCID}.yaml
```

Example: `data/custodian/NL-NH-AMS-M-RM.yaml`

### Person Entity Files

```
data/custodian/person/entity/{linkedin-slug}_{ISO-timestamp}.json
```

Example: `data/custodian/person/entity/john-smith-12345_20251212T000000Z.json`

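
A person entity path matching the example above could be built as follows. The timestamp format (compact ISO 8601 in UTC, `YYYYMMDDTHHMMSSZ`) is inferred from that single example and is an assumption; `person_entity_path` and `PERSON_ENTITY_DIR` are hypothetical names.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

PERSON_ENTITY_DIR = PurePosixPath("data/custodian/person/entity")


def person_entity_path(linkedin_slug: str, extracted_at: datetime) -> PurePosixPath:
    """Build {linkedin-slug}_{ISO-timestamp}.json with a compact UTC timestamp."""
    if extracted_at.tzinfo is None:
        raise ValueError("extracted_at must be timezone-aware")
    # Normalise to UTC so the trailing 'Z' in the file name is accurate.
    stamp = extracted_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PERSON_ENTITY_DIR / f"{linkedin_slug}_{stamp}.json"
```

Rejecting naive datetimes avoids silently stamping a local time as UTC.
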
## Validation Checklist

Before committing enrichment data:

- [ ] Data written to custodian YAML file
- [ ] Provenance metadata included
- [ ] Social media links validated (no generic URLs)
- [ ] No fabricated/placeholder data
- [ ] Existing data preserved (additive enrichment)
- [ ] API responses match YAML file content

## Related Documentation

| Document | Purpose |
|----------|---------|
| `AGENTS.md` | AI agent instructions with all rules |
| `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` | Rule 22 detailed documentation |
| `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` | Rule 23 validation patterns |
| `.opencode/DATA_PRESERVATION_RULES.md` | Rule 5 data preservation |
| `.opencode/DATA_FABRICATION_PROHIBITION.md` | Rule 21 anti-fabrication |

## Quick Reference

Rule numbers in this table are the ones used in the related documentation listed above, not the Rule 1 to Rule 5 section numbers of this guide.

| Rule | Summary |
|------|---------|
| **Rule 5** | Data enrichment is ADDITIVE ONLY |
| **Rule 21** | Never fabricate data |
| **Rule 22** | Custodian YAML is single source of truth |
| **Rule 23** | Validate social media links |