# Data Preservation Rules for AI Agents

**Version**: 1.0
**Created**: 2025-11-28
**Status**: MANDATORY

---

## 🚨 CRITICAL RULE: NEVER DELETE ENRICHED DATA

**Data enrichment is ADDITIVE ONLY.**

When AI agents restructure, update, or refactor enriched institution records, they MUST preserve ALL existing enriched content. This rule has NO exceptions.

---

## Protected Data Sources

The following data sources contain enriched content that MUST NEVER be deleted:

### 1. Google Maps Enrichment

**Protected Fields**:
```yaml
google_maps_enrichment:
  place_id: "ChIJ..."  # NEVER DELETE
  rating: 4.5  # NEVER DELETE
  total_ratings: 127  # NEVER DELETE
  reviews: [...]  # NEVER DELETE
  photo_count: 45  # NEVER DELETE
  popular_times: {...}  # NEVER DELETE
  business_status: "OPERATIONAL"  # NEVER DELETE
  address_components: [...]  # NEVER DELETE
  opening_hours: {...}  # NEVER DELETE
  website: "https://..."  # NEVER DELETE
  phone: "+31..."  # NEVER DELETE
```

**Why This Data Is Valuable**:
- Reviews contain visitor experiences and heritage descriptions
- Ratings indicate public engagement with heritage sites
- Photo counts show visual documentation availability
- Popular times reveal visitor patterns for heritage planning
- Historical snapshots capture changing institutional data

### 2. OpenStreetMap Enrichment

**Protected Fields**:
```yaml
osm_enrichment:
  osm_id: 12345678  # NEVER DELETE
  osm_type: "way"  # NEVER DELETE
  osm_tags:  # NEVER DELETE
    heritage: "2"
    building: "museum"
    amenity: "museum"
    name: "..."
    operator: "..."
  geometry: {...}  # NEVER DELETE
  centroid: [lat, lon]  # NEVER DELETE
```

**Why This Data Is Valuable**:
- OSM tags contain community-verified heritage classifications
- Geometry enables spatial analysis and mapping
- Heritage tags indicate official heritage status

### 3. Wikidata Enrichment

**Protected Fields**:
```yaml
wikidata_enrichment:
  wikidata_id: "Q12345"  # NEVER DELETE
  claims: {...}  # NEVER DELETE
  sitelinks: {...}  # NEVER DELETE
  aliases: [...]  # NEVER DELETE
  descriptions: {...}  # NEVER DELETE
  labels: {...}  # NEVER DELETE
  instance_of: [...]  # NEVER DELETE
  part_of: [...]  # NEVER DELETE
  coordinate_location: {...}  # NEVER DELETE
```

**Why This Data Is Valuable**:
- Wikidata IDs are persistent identifiers in the Linked Open Data ecosystem
- Claims contain structured facts about institutions
- Sitelinks connect to Wikipedia articles in multiple languages

### 4. Website Scrape Data

**Protected Fields**:
```yaml
website_enrichment:
  fetch_timestamp: "..."  # NEVER DELETE
  fetch_status: "SUCCESS"  # NEVER DELETE
  source_url: "https://..."  # NEVER DELETE

  organization_details:  # NEVER DELETE
    full_name: "..."
    short_name: "..."
    legal_form: "..."
    founded: "..."
    kvk_number: "..."
    anbi_status: true

  collections: [...]  # NEVER DELETE
  exhibitions: [...]  # NEVER DELETE
  opening_hours: {...}  # NEVER DELETE
  contact: {...}  # NEVER DELETE
  social_media: {...}  # NEVER DELETE
  accessibility: {...}  # NEVER DELETE
  staff: [...]  # NEVER DELETE
  publications: [...]  # NEVER DELETE
```

**Why This Data Is Valuable**:
- Website scrapes capture institutional self-descriptions
- Collection information enables heritage discovery
- Contact details enable institutional verification
- Historical scrapes show how institutions evolve

### 5. ISIL Registry Data

**Protected Fields**:
```yaml
isil_enrichment:
  isil_code: "NL-..."  # NEVER DELETE
  assigned_date: "2020-01-15"  # NEVER DELETE
  remarks: "..."  # NEVER DELETE
  registry_entry: {...}  # NEVER DELETE
```

---

## Allowed Operations

### ✅ RESTRUCTURING (Preserving All Content)

You MAY reorganize data structure while keeping all values:

```yaml
# BEFORE (flat)
google_rating: 4.5
google_reviews: 127
osm_heritage_tag: "2"

# AFTER (nested) - ALLOWED if ALL values preserved
enrichment:
  google_maps:
    rating: 4.5  # ← Same value
    reviews: 127  # ← Same value
  osm:
    heritage: "2"  # ← Same value
```

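One way to check a restructuring mechanically is to compare the scalar values of the record before and after, ignoring where they sit in the tree. The sketch below is only an illustration: it assumes the YAML has already been loaded into plain Python dicts and lists, and the helper names `leaf_values` and `missing_after_restructure` are hypothetical, not part of any existing project script.

```python
from collections import Counter


def leaf_values(node):
    """Yield every scalar value found anywhere inside nested dicts and lists."""
    if isinstance(node, dict):
        for child in node.values():
            yield from leaf_values(child)
    elif isinstance(node, list):
        for child in node:
            yield from leaf_values(child)
    else:
        yield node


def missing_after_restructure(before, after):
    """Return the repr of each value that `before` has more often than `after`."""
    # repr() keeps "2" (string) distinct from 2 (integer) and is always hashable.
    lost = Counter(map(repr, leaf_values(before))) - Counter(map(repr, leaf_values(after)))
    return sorted(lost.elements())


# The flat and nested records from the example above.
before = {"google_rating": 4.5, "google_reviews": 127, "osm_heritage_tag": "2"}
after = {
    "enrichment": {
        "google_maps": {"rating": 4.5, "reviews": 127},
        "osm": {"heritage": "2"},
    }
}
# A record that collapsed everything into a status flag.
bad = {"enrichment_status": "enriched"}

print(missing_after_restructure(before, after))  # []
print(missing_after_restructure(before, bad))    # ["'2'", '127', '4.5']
```

An empty result means every value survived the move. The check deliberately ignores key names, since renaming keys is allowed during restructuring, so it cannot tell whether a value ended up under a sensible key.
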
### ✅ ADDING NEW DATA

You MAY add new fields or enrichment sources:

```yaml
# BEFORE
website_enrichment:
  organization_details: {...}

# AFTER - ALLOWED (added new section)
website_enrichment:
  organization_details: {...}  # ← Original preserved
  new_section:  # ← New addition
    additional_data: "..."
```

### ✅ FILE RENAMING

You MAY rename files to be more descriptive:

```bash
# ALLOWED
mv 0139_unknown.yaml 0139_de_hollandse_cirkel.yaml
```

### ✅ MERGING SOURCES

You MAY merge data from multiple sources, provided all of it is preserved:

```yaml
# Source A
website_enrichment:
  organization_details: {...}

# Source B
google_maps_enrichment:
  rating: 4.5

# MERGED - ALLOWED (both preserved)
enrichment:
  website:
    organization_details: {...}  # ← From Source A
  google_maps:
    rating: 4.5  # ← From Source B
```

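A merge that can only add, never replace, might be sketched as follows. This is an assumption-laden illustration rather than the project's merge routine: records are taken to be plain dicts, `merge_preserving` is a hypothetical helper, and the institution name is a made-up placeholder. When an incoming value conflicts with an existing one, the function raises instead of overwriting, so the caller has to decide how to keep both.

```python
import copy


def merge_preserving(target, source, path=""):
    """Merge `source` into `target` in place, raising rather than overwriting."""
    for key, value in source.items():
        where = f"{path}.{key}" if path else key
        if key not in target:
            # New key: copy it in so later edits to `source` cannot alter `target`.
            target[key] = copy.deepcopy(value)
        elif isinstance(target[key], dict) and isinstance(value, dict):
            # Both sides are mappings: descend and merge their children.
            merge_preserving(target[key], value, where)
        elif target[key] != value:
            raise ValueError(f"refusing to overwrite existing value at {where}")
    return target


record = {}
# Source A (placeholder name, for illustration only)
merge_preserving(record, {"website": {"organization_details": {"full_name": "Example Museum"}}})
# Source B
merge_preserving(record, {"google_maps": {"rating": 4.5}})

print(sorted(record))  # ['google_maps', 'website']
```
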
---

## Forbidden Operations

### ❌ DELETION

**NEVER delete enriched content:**

```yaml
# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}

# AFTER - FORBIDDEN!
enrichment_status: enriched
# Where did rating, reviews, popular_times go?!
```

### ❌ SUMMARIZATION

**NEVER replace detailed data with summaries:**

```yaml
# BEFORE
reviews:
  - text: "Amazing collection of historical documents..."
    rating: 5
    author: "Jan de Vries"
    date: "2024-03-15"
  - text: "Small but well-curated museum..."
    rating: 4
    author: "Marie Jansen"
    date: "2024-02-20"

# AFTER - FORBIDDEN!
reviews_summary: "Generally positive reviews (avg 4.5)"
# Original review text is LOST!
```

### ❌ TRUNCATION

**NEVER truncate long content:**

```yaml
# BEFORE
description: |
  The museum was founded in 1892 by Count Willem van der Berg,
  who donated his extensive collection of medieval manuscripts...
  [500 more words of historical detail]

# AFTER - FORBIDDEN!
description: "The museum was founded in 1892..."
# Historical detail is LOST!
```

### ❌ OVERWRITING

**NEVER overwrite existing enrichment with new data:**

```yaml
# BEFORE (scraped 2024-06-01)
website_enrichment:
  fetch_timestamp: "2024-06-01T..."
  organization_details:
    staff_count: 45

# AFTER (scraped 2024-11-28) - FORBIDDEN!
website_enrichment:
  fetch_timestamp: "2024-11-28T..."
  organization_details:
    staff_count: 52
# June scrape data is LOST!

# CORRECT approach - preserve history:
website_enrichment_history:
  - fetch_timestamp: "2024-06-01T..."
    organization_details:
      staff_count: 45
  - fetch_timestamp: "2024-11-28T..."
    organization_details:
      staff_count: 52
```

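The history-preserving approach can be sketched as an append-only helper. This is a minimal illustration, assuming records are plain dicts and that each scrape carries a `fetch_timestamp`; `append_scrape` is a hypothetical name, and the elided timestamps from the example above are used as opaque strings. A scrape whose timestamp is already recorded is skipped, so rerunning the same import does not create duplicates.

```python
import copy


def append_scrape(record, scrape, key="website_enrichment_history"):
    """Append `scrape` to the history list under `key` without touching older entries."""
    history = record.setdefault(key, [])
    known = {entry.get("fetch_timestamp") for entry in history}
    if scrape.get("fetch_timestamp") not in known:
        # Deep-copy so later changes to `scrape` cannot rewrite stored history.
        history.append(copy.deepcopy(scrape))
    return record


record = {}
append_scrape(record, {"fetch_timestamp": "2024-06-01T...",
                       "organization_details": {"staff_count": 45}})
append_scrape(record, {"fetch_timestamp": "2024-11-28T...",
                       "organization_details": {"staff_count": 52}})

counts = [s["organization_details"]["staff_count"]
          for s in record["website_enrichment_history"]]
print(counts)  # [45, 52]
```
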
---

## Verification Checklist

Before writing ANY enriched file, agents MUST verify:

- [ ] **Read First**: Did I read the entire file before editing?
- [ ] **Field Count**: Does the new file have at least as many fields as the original?
- [ ] **Data Preservation**: Is every piece of enriched data from the original still present?
- [ ] **No Summaries**: Did I avoid summarizing or truncating any content?
- [ ] **Timestamps Preserved**: Are all fetch/scrape timestamps still present?
- [ ] **Reviews Intact**: If reviews existed, are ALL reviews still present?
- [ ] **Metadata Intact**: Are all IDs (place_id, osm_id, wikidata_id) preserved?

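Two of the checklist items, field count and ID preservation, lend themselves to an automated pre-write check. The sketch below is a hedged illustration only: it assumes both versions are already loaded as plain dicts, the names `walk` and `checklist_failures` are hypothetical, and `"ChIJ-example"` is a made-up placeholder ID. It does not replace the manual items such as reading the whole file first.

```python
ID_KEYS = ("place_id", "osm_id", "wikidata_id")


def walk(node):
    """Yield (key, value) for every mapping entry at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key, value
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def checklist_failures(original, updated):
    """Return a list of human-readable problems; an empty list means both checks pass."""
    failures = []
    old_pairs, new_pairs = list(walk(original)), list(walk(updated))
    # Field Count: the updated record must not have fewer fields.
    if len(new_pairs) < len(old_pairs):
        failures.append(f"field count dropped from {len(old_pairs)} to {len(new_pairs)}")
    # Metadata Intact: every original ID must still appear with the same value.
    new_ids = [(k, v) for k, v in new_pairs if k in ID_KEYS]
    for key, value in old_pairs:
        if key in ID_KEYS and (key, value) not in new_ids:
            failures.append(f"{key} {value!r} is missing")
    return failures


original = {
    "google_maps_enrichment": {"place_id": "ChIJ-example", "rating": 4.5},
    "wikidata_enrichment": {"wikidata_id": "Q12345"},
}
restructured = {"enrichment": {"google_maps": original["google_maps_enrichment"],
                               "wikidata": original["wikidata_enrichment"]}}
collapsed = {"enrichment_status": "enriched"}

print(checklist_failures(original, restructured))  # []
print(checklist_failures(original, collapsed))
```
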
---

## Recovery Procedures

If data loss is detected:

1. **Check Git History**: `git log --oneline -- path/to/file.yaml`
2. **Restore Previous Version**: `git checkout HEAD~1 -- path/to/file.yaml` (if the loss happened more than one commit ago, replace `HEAD~1` with the last good commit from the log)
3. **Document Incident**: Note which agent/session caused the data loss
4. **Merge Correctly**: Manually merge the lost data with any valid new additions

---

## Rationale

**Why is this rule so strict?**

1. **Cost of Collection**:
   - Google Maps API calls cost money and have rate limits
   - Website scraping takes time and may be blocked
   - Wikidata queries have usage quotas
   - Re-collecting data wastes resources

2. **Historical Value**:
   - Heritage institutions change over time
   - Historical snapshots enable temporal analysis
   - Old reviews may reference events no longer on websites

3. **Data Provenance**:
   - Linked Data principles require traceability
   - Deleting source data breaks the provenance chain
   - Academic citations may reference specific versions

4. **Community Trust**:
   - Users expect stable, growing datasets
   - Data loss undermines confidence in the project
   - Heritage professionals depend on comprehensive records

---

## References

- **AGENTS.md**: Rule 5 - NEVER Delete Enriched Data
- **W3C PROV-O**: https://www.w3.org/TR/prov-o/
- **FAIR Principles**: https://www.go-fair.org/fair-principles/

---

**Remember**: When in doubt, PRESERVE everything. It's always possible to add data later, but deleted data may be gone forever.