glam/.opencode/DATA_PRESERVATION_RULES.md


# Data Preservation Rules for AI Agents
**Version**: 1.0
**Created**: 2025-11-28
**Status**: MANDATORY
---
## 🚨 CRITICAL RULE: NEVER DELETE ENRICHED DATA
**Data enrichment is ADDITIVE ONLY.**
When AI agents restructure, update, or refactor enriched institution records, they MUST preserve ALL existing enriched content. This rule has NO exceptions.
---
## Protected Data Sources
The following data sources contain enriched content that MUST NEVER be deleted:
### 1. Google Maps Enrichment
**Protected Fields**:
```yaml
google_maps_enrichment:
  place_id: "ChIJ..."             # NEVER DELETE
  rating: 4.5                     # NEVER DELETE
  total_ratings: 127              # NEVER DELETE
  reviews: [...]                  # NEVER DELETE
  photo_count: 45                 # NEVER DELETE
  popular_times: {...}            # NEVER DELETE
  business_status: "OPERATIONAL"  # NEVER DELETE
  address_components: [...]       # NEVER DELETE
  opening_hours: {...}            # NEVER DELETE
  website: "https://..."          # NEVER DELETE
  phone: "+31..."                 # NEVER DELETE
```
**Why This Data Is Valuable**:
- Reviews contain visitor experiences and heritage descriptions
- Ratings indicate public engagement with heritage sites
- Photo counts show visual documentation availability
- Popular times reveal visitor patterns for heritage planning
- Historical snapshots capture changing institutional data
### 2. OpenStreetMap Enrichment
**Protected Fields**:
```yaml
osm_enrichment:
  osm_id: 12345678      # NEVER DELETE
  osm_type: "way"       # NEVER DELETE
  osm_tags:             # NEVER DELETE
    heritage: "2"
    building: "museum"
    amenity: "museum"
    name: "..."
    operator: "..."
  geometry: {...}       # NEVER DELETE
  centroid: [lat, lon]  # NEVER DELETE
```
**Why This Data Is Valuable**:
- OSM tags contain community-verified heritage classifications
- Geometry enables spatial analysis and mapping
- Heritage tags indicate official heritage status
### 3. Wikidata Enrichment
**Protected Fields**:
```yaml
wikidata_enrichment:
  wikidata_id: "Q12345"       # NEVER DELETE
  claims: {...}               # NEVER DELETE
  sitelinks: {...}            # NEVER DELETE
  aliases: [...]              # NEVER DELETE
  descriptions: {...}         # NEVER DELETE
  labels: {...}               # NEVER DELETE
  instance_of: [...]          # NEVER DELETE
  part_of: [...]              # NEVER DELETE
  coordinate_location: {...}  # NEVER DELETE
```
**Why This Data Is Valuable**:
- Wikidata IDs are persistent identifiers in the Linked Open Data ecosystem
- Claims contain structured facts about institutions
- Sitelinks connect to Wikipedia articles in multiple languages
### 4. Website Scrape Data
**Protected Fields**:
```yaml
website_enrichment:
  fetch_timestamp: "..."     # NEVER DELETE
  fetch_status: "SUCCESS"    # NEVER DELETE
  source_url: "https://..."  # NEVER DELETE
  organization_details:      # NEVER DELETE
    full_name: "..."
    short_name: "..."
    legal_form: "..."
    founded: "..."
    kvk_number: "..."
    anbi_status: true
  collections: [...]         # NEVER DELETE
  exhibitions: [...]         # NEVER DELETE
  opening_hours: {...}       # NEVER DELETE
  contact: {...}             # NEVER DELETE
  social_media: {...}        # NEVER DELETE
  accessibility: {...}       # NEVER DELETE
  staff: [...]               # NEVER DELETE
  publications: [...]        # NEVER DELETE
```
**Why This Data Is Valuable**:
- Website scrapes capture institutional self-descriptions
- Collection information enables heritage discovery
- Contact details enable institutional verification
- Historical scrapes show how institutions evolve
### 5. ISIL Registry Data
**Protected Fields**:
```yaml
isil_enrichment:
  isil_code: "NL-..."          # NEVER DELETE
  assigned_date: "2020-01-15"  # NEVER DELETE
  remarks: "..."               # NEVER DELETE
  registry_entry: {...}        # NEVER DELETE
```
---
## Allowed Operations
### ✅ RESTRUCTURING (Preserving All Content)
You MAY reorganize data structure while keeping all values:
```yaml
# BEFORE (flat)
google_rating: 4.5
google_reviews: 127
osm_heritage_tag: "2"

# AFTER (nested) - ALLOWED if ALL values preserved
enrichment:
  google_maps:
    rating: 4.5    # ← Same value
    reviews: 127   # ← Same value
  osm:
    heritage: "2"  # ← Same value
```
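
A restructuring like the one above can be checked mechanically. The following is a minimal sketch, assuming the records have already been loaded into plain Python dicts and lists; `leaf_values` and `missing_values` are hypothetical helper names, not part of any existing project script. It compares values only, so renamed or re-nested keys do not count as loss.

```python
from collections import Counter


def leaf_values(node):
    """Yield every scalar leaf value in a nested dict/list structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def missing_values(original, restructured):
    """Return repr() of leaf values found in `original` but not in `restructured`.

    repr() keeps "2" (string) distinct from 2 (integer), and Counter
    ensures a value that occurred twice must still occur twice.
    """
    before = Counter(map(repr, leaf_values(original)))
    after = Counter(map(repr, leaf_values(restructured)))
    return sorted((before - after).elements())


# Illustrative records mirroring the BEFORE/AFTER example above.
flat = {"google_rating": 4.5, "google_reviews": 127, "osm_heritage_tag": "2"}
nested = {
    "enrichment": {
        "google_maps": {"rating": 4.5, "reviews": 127},
        "osm": {"heritage": "2"},
    }
}

if missing_values(flat, nested):
    raise SystemExit("Restructuring dropped enriched values")
```

An empty result means every original value survived somewhere in the new structure; it does not prove the value sits under a sensible key, so a manual read-through is still needed.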
### ✅ ADDING NEW DATA
You MAY add new fields or enrichment sources:
```yaml
# BEFORE
website_enrichment:
  organization_details: {...}

# AFTER - ALLOWED (added new section)
website_enrichment:
  organization_details: {...}  # ← Original preserved
  new_section:                 # ← New addition
    additional_data: "..."
```
### ✅ FILE RENAMING
You MAY rename files to be more descriptive:
```bash
# ALLOWED
mv 0139_unknown.yaml 0139_de_hollandse_cirkel.yaml
```
### ✅ MERGING SOURCES
You MAY merge data from multiple sources, provided all content from every source is preserved:
```yaml
# Source A
website_enrichment:
  organization_details: {...}

# Source B
google_maps_enrichment:
  rating: 4.5

# MERGED - ALLOWED (both preserved)
enrichment:
  website:
    organization_details: {...}  # ← From Source A
  google_maps:
    rating: 4.5                  # ← From Source B
```
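
One way to keep a merge additive is to refuse any write that would replace existing content. This is a minimal sketch under that assumption; `merge_source` is a hypothetical helper, and the `enrichment` key layout simply follows the example above.

```python
import copy


def merge_source(record, source_name, data):
    """Return a copy of `record` with `data` nested under enrichment[source_name].

    Raises ValueError instead of replacing content that already exists
    under that name, so a merge can only ever add data.
    """
    merged = copy.deepcopy(record)
    enrichment = merged.setdefault("enrichment", {})
    if source_name in enrichment and enrichment[source_name] != data:
        raise ValueError(
            f"enrichment[{source_name!r}] already exists with different content"
        )
    enrichment[source_name] = copy.deepcopy(data)
    return merged


# Illustrative sources mirroring the example above.
source_a = {"organization_details": {"full_name": "..."}}
source_b = {"rating": 4.5}

merged = merge_source({}, "website", source_a)
merged = merge_source(merged, "google_maps", source_b)
```

Working on deep copies keeps the input records untouched, which makes it easy to compare before and after if something looks wrong.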
---
## Forbidden Operations
### ❌ DELETION
**NEVER delete enriched content:**
```yaml
# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}

# AFTER - FORBIDDEN!
enrichment_status: enriched
# Where did rating, reviews, popular_times go?!
```
### ❌ SUMMARIZATION
**NEVER replace detailed data with summaries:**
```yaml
# BEFORE
reviews:
  - text: "Amazing collection of historical documents..."
    rating: 5
    author: "Jan de Vries"
    date: "2024-03-15"
  - text: "Small but well-curated museum..."
    rating: 4
    author: "Marie Jansen"
    date: "2024-02-20"

# AFTER - FORBIDDEN!
reviews_summary: "Generally positive reviews (avg 4.5)"
# Original review text is LOST!
```
### ❌ TRUNCATION
**NEVER truncate long content:**
```yaml
# BEFORE
description: |
  The museum was founded in 1892 by Count Willem van der Berg,
  who donated his extensive collection of medieval manuscripts...
  [500 more words of historical detail]

# AFTER - FORBIDDEN!
description: "The museum was founded in 1892..."
# Historical detail is LOST!
```
### ❌ OVERWRITING
**NEVER overwrite existing enrichment with new data:**
```yaml
# BEFORE (scraped 2024-06-01)
website_enrichment:
  fetch_timestamp: "2024-06-01T..."
  organization_details:
    staff_count: 45

# AFTER (scraped 2024-11-28) - FORBIDDEN!
website_enrichment:
  fetch_timestamp: "2024-11-28T..."
  organization_details:
    staff_count: 52
# June scrape data is LOST!

# CORRECT approach - preserve history:
website_enrichment_history:
  - fetch_timestamp: "2024-06-01T..."
    organization_details:
      staff_count: 45
  - fetch_timestamp: "2024-11-28T..."
    organization_details:
      staff_count: 52
```
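
The history-preserving approach can be sketched as an append-only helper. This assumes records are plain Python dicts and that the history list lives under `website_enrichment_history` as in the example above; `append_snapshot` is a hypothetical name, and the truncated timestamps are kept as literal placeholders.

```python
import copy


def append_snapshot(record, snapshot, history_key="website_enrichment_history"):
    """Return a copy of `record` with `snapshot` appended to its history list.

    Existing snapshots are never modified or removed. A snapshot that is
    already present (exact match) is not appended a second time.
    """
    updated = copy.deepcopy(record)
    history = updated.setdefault(history_key, [])
    if snapshot not in history:
        history.append(copy.deepcopy(snapshot))
    return updated


# Illustrative data mirroring the example above.
record = {
    "website_enrichment_history": [
        {"fetch_timestamp": "2024-06-01T...",
         "organization_details": {"staff_count": 45}},
    ]
}
november = {
    "fetch_timestamp": "2024-11-28T...",
    "organization_details": {"staff_count": 52},
}

updated = append_snapshot(record, november)
```

After the call, `updated` holds both scrapes in chronological order and `record` is unchanged.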
---
## Verification Checklist
Before writing ANY enriched file, agents MUST verify:
- [ ] **Read First**: Did I read the entire file before editing?
- [ ] **Field Count**: Does the new file have at least as many fields as the original?
- [ ] **Data Preservation**: Is every piece of enriched data from the original still present?
- [ ] **No Summaries**: Did I avoid summarizing or truncating any content?
- [ ] **Timestamps Preserved**: Are all fetch/scrape timestamps still present?
- [ ] **Reviews Intact**: If reviews existed, are ALL reviews still present?
- [ ] **Metadata Intact**: Are all IDs (place_id, osm_id, wikidata_id) preserved?
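
Some of these checks can be automated before a file is written. The sketch below is one possible pre-write check, assuming both versions are loaded as plain Python dicts; `checklist_failures` and its helpers are hypothetical names, and it covers only the field-count, timestamp, and ID items. The remaining items still need a manual review.

```python
# Keys whose values must survive any edit, taken from the checklist above.
PROTECTED_KEYS = ("place_id", "osm_id", "wikidata_id", "fetch_timestamp")


def count_leaves(node):
    """Count scalar leaf fields in a nested dict/list structure."""
    if isinstance(node, dict):
        return sum(count_leaves(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_leaves(item) for item in node)
    return 1


def find_key(node, key):
    """Yield every value stored under `key` at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def checklist_failures(original, updated):
    """Return a list of human-readable checklist violations (empty if none)."""
    failures = []
    if count_leaves(updated) < count_leaves(original):
        failures.append("field count decreased")
    for key in PROTECTED_KEYS:
        before = set(map(repr, find_key(original, key)))
        after = set(map(repr, find_key(updated, key)))
        if not before <= after:
            failures.append(f"{key} missing or changed")
    return failures


# Illustrative records; the ID and timestamp strings are placeholders.
original = {
    "google_maps_enrichment": {"place_id": "ChIJ...", "rating": 4.5},
    "website_enrichment": {"fetch_timestamp": "2024-06-01T..."},
}
restructured = {
    "enrichment": {
        "google_maps": {"place_id": "ChIJ...", "rating": 4.5},
        "website": {"fetch_timestamp": "2024-06-01T...", "fetch_status": "SUCCESS"},
    }
}
stripped = {"enrichment_status": "enriched"}
```

Calling `checklist_failures(original, stripped)` reports the dropped fields, while the restructured record passes because every protected value is still present, just under a different parent key.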
---
## Recovery Procedures
If data loss is detected:
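1. **Check Git History**: `git log --oneline -- path/to/file.yaml` to find the last commit in which the data was still intact
2. **Restore Previous Version**: `git checkout HEAD~1 -- path/to/file.yaml` if the loss happened in the most recent commit; otherwise replace `HEAD~1` with the hash of the last good commit from step 1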
3. **Document Incident**: Note which agent/session caused the data loss
4. **Merge Correctly**: Manually merge the lost data with any valid new additions
---
## Rationale
**Why is this rule so strict?**
1. **Cost of Collection**:
- Google Maps API calls cost money and have rate limits
- Website scraping takes time and may be blocked
- Wikidata queries have usage quotas
- Re-collecting data wastes resources
2. **Historical Value**:
- Heritage institutions change over time
- Historical snapshots enable temporal analysis
- Old reviews may reference events no longer on websites
3. **Data Provenance**:
- Linked Data principles require traceability
- Deleting source data breaks the provenance chain
- Academic citations may reference specific versions
4. **Community Trust**:
- Users expect stable, growing datasets
- Data loss undermines confidence in the project
- Heritage professionals depend on comprehensive records
---
## References
- **AGENTS.md**: Rule 5 - NEVER Delete Enriched Data
- **W3C PROV-O**: https://www.w3.org/TR/prov-o/
- **FAIR Principles**: https://www.go-fair.org/fair-principles/
---
**Remember**: When in doubt, PRESERVE everything. It's always possible to add data later, but deleted data may be gone forever.