Data Preservation Rules for AI Agents
Version: 1.0
Created: 2025-11-28
Status: MANDATORY
🚨 CRITICAL RULE: NEVER DELETE ENRICHED DATA
Data enrichment is ADDITIVE ONLY.
When AI agents restructure, update, or refactor enriched institution records, they MUST preserve ALL existing enriched content. This rule has NO exceptions.
Protected Data Sources
The following data sources contain enriched content that MUST NEVER be deleted:
1. Google Maps Enrichment
Protected Fields:
google_maps_enrichment:
  place_id: "ChIJ..." # NEVER DELETE
  rating: 4.5 # NEVER DELETE
  total_ratings: 127 # NEVER DELETE
  reviews: [...] # NEVER DELETE
  photo_count: 45 # NEVER DELETE
  popular_times: {...} # NEVER DELETE
  business_status: "OPERATIONAL" # NEVER DELETE
  address_components: [...] # NEVER DELETE
  opening_hours: {...} # NEVER DELETE
  website: "https://..." # NEVER DELETE
  phone: "+31..." # NEVER DELETE
Why This Data Is Valuable:
- Reviews contain visitor experiences and heritage descriptions
- Ratings indicate public engagement with heritage sites
- Photo counts show visual documentation availability
- Popular times reveal visitor patterns for heritage planning
- Historical snapshots capture changing institutional data
2. OpenStreetMap Enrichment
Protected Fields:
osm_enrichment:
  osm_id: 12345678 # NEVER DELETE
  osm_type: "way" # NEVER DELETE
  osm_tags: # NEVER DELETE
    heritage: "2"
    building: "museum"
    amenity: "museum"
    name: "..."
    operator: "..."
  geometry: {...} # NEVER DELETE
  centroid: [lat, lon] # NEVER DELETE
Why This Data Is Valuable:
- OSM tags contain community-verified heritage classifications
- Geometry enables spatial analysis and mapping
- Heritage tags indicate official heritage status
3. Wikidata Enrichment
Protected Fields:
wikidata_enrichment:
  wikidata_id: "Q12345" # NEVER DELETE
  claims: {...} # NEVER DELETE
  sitelinks: {...} # NEVER DELETE
  aliases: [...] # NEVER DELETE
  descriptions: {...} # NEVER DELETE
  labels: {...} # NEVER DELETE
  instance_of: [...] # NEVER DELETE
  part_of: [...] # NEVER DELETE
  coordinate_location: {...} # NEVER DELETE
Why This Data Is Valuable:
- Wikidata IDs are persistent identifiers in the Linked Open Data ecosystem
- Claims contain structured facts about institutions
- Sitelinks connect to Wikipedia articles in multiple languages
4. Website Scrape Data
Protected Fields:
website_enrichment:
  fetch_timestamp: "..." # NEVER DELETE
  fetch_status: "SUCCESS" # NEVER DELETE
  source_url: "https://..." # NEVER DELETE
  organization_details: # NEVER DELETE
    full_name: "..."
    short_name: "..."
    legal_form: "..."
    founded: "..."
    kvk_number: "..."
    anbi_status: true
  collections: [...] # NEVER DELETE
  exhibitions: [...] # NEVER DELETE
  opening_hours: {...} # NEVER DELETE
  contact: {...} # NEVER DELETE
  social_media: {...} # NEVER DELETE
  accessibility: {...} # NEVER DELETE
  staff: [...] # NEVER DELETE
  publications: [...] # NEVER DELETE
Why This Data Is Valuable:
- Website scrapes capture institutional self-descriptions
- Collection information enables heritage discovery
- Contact details enable institutional verification
- Historical scrapes show how institutions evolve
5. ISIL Registry Data
Protected Fields:
isil_enrichment:
  isil_code: "NL-..." # NEVER DELETE
  assigned_date: "2020-01-15" # NEVER DELETE
  remarks: "..." # NEVER DELETE
  registry_entry: {...} # NEVER DELETE
Allowed Operations
✅ RESTRUCTURING (Preserving All Content)
You MAY reorganize data structure while keeping all values:
# BEFORE (flat)
google_rating: 4.5
google_reviews: 127
osm_heritage_tag: "2"
# AFTER (nested) - ALLOWED if ALL values preserved
enrichment:
  google_maps:
    rating: 4.5 # ← Same value
    reviews: 127 # ← Same value
  osm:
    heritage: "2" # ← Same value
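One way to check that a restructuring kept every value is to compare the leaf values of the record before and after the change, ignoring where they sit in the structure. The following is a minimal sketch under that assumption; the function names `leaf_values` and `missing_values` are hypothetical, and the records are plain Python dicts standing in for loaded YAML.

```python
from collections import Counter


def leaf_values(node):
    """Yield every scalar value found anywhere inside nested dicts and lists."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def missing_values(before, after):
    """Return leaf values that occur in `before` more often than in `after`."""
    # repr() keeps 4.5 and "4.5" distinct and makes every value hashable.
    lost = Counter(map(repr, leaf_values(before)))
    lost.subtract(Counter(map(repr, leaf_values(after))))
    return sorted(value for value, count in lost.items() if count > 0)


# The flat and nested records from the example above.
before = {"google_rating": 4.5, "google_reviews": 127, "osm_heritage_tag": "2"}
after = {
    "enrichment": {
        "google_maps": {"rating": 4.5, "reviews": 127},
        "osm": {"heritage": "2"},
    }
}

print(missing_values(before, after))  # an empty list means nothing was dropped
```

Because the comparison ignores key names, it tolerates renaming and nesting but still flags any value that disappears or is shortened.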
✅ ADDING NEW DATA
You MAY add new fields or enrichment sources:
# BEFORE
website_enrichment:
  organization_details: {...}
# AFTER - ALLOWED (added new section)
website_enrichment:
  organization_details: {...} # ← Original preserved
  new_section: # ← New addition
    additional_data: "..."
✅ FILE RENAMING
You MAY rename files to be more descriptive:
# ALLOWED
mv 0139_unknown.yaml 0139_de_hollandse_cirkel.yaml
✅ MERGING SOURCES
You MAY merge data from multiple sources, provided all content from every source is preserved:
# Source A
website_enrichment:
  organization_details: {...}
# Source B
google_maps_enrichment:
  rating: 4.5
# MERGED - ALLOWED (both preserved)
enrichment:
  website:
    organization_details: {...} # ← From Source A
  google_maps:
    rating: 4.5 # ← From Source B
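A merge like this can be sketched as a helper that places each source under its own key and refuses to replace a source that is already present. This is an illustrative sketch only: `merge_enrichment` is a hypothetical name, and raising an error on a key collision is an assumed design choice rather than something this document prescribes.

```python
import copy


def merge_enrichment(record, source_name, block):
    """Return a copy of `record` with `block` stored under enrichment[source_name]."""
    merged = copy.deepcopy(record)
    enrichment = merged.setdefault("enrichment", {})
    if source_name in enrichment:
        # Assumed policy: never silently replace a source that already exists.
        raise ValueError(f"enrichment source already present: {source_name}")
    enrichment[source_name] = copy.deepcopy(block)
    return merged


source_a = {"organization_details": {"full_name": "..."}}
source_b = {"rating": 4.5}

record = merge_enrichment({}, "website", source_a)
record = merge_enrichment(record, "google_maps", source_b)
print(record)
```

Working on deep copies keeps the input records untouched, so a failed merge cannot leave a half-updated record behind.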
Forbidden Operations
❌ DELETION
NEVER delete enriched content:
# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}
# AFTER - FORBIDDEN!
enrichment_status: enriched
# Where did rating, reviews, popular_times go?!
❌ SUMMARIZATION
NEVER replace detailed data with summaries:
# BEFORE
reviews:
  - text: "Amazing collection of historical documents..."
    rating: 5
    author: "Jan de Vries"
    date: "2024-03-15"
  - text: "Small but well-curated museum..."
    rating: 4
    author: "Marie Jansen"
    date: "2024-02-20"
# AFTER - FORBIDDEN!
reviews_summary: "Generally positive reviews (avg 4.5)"
# Original review text is LOST!
❌ TRUNCATION
NEVER truncate long content:
# BEFORE
description: |
  The museum was founded in 1892 by Count Willem van der Berg,
  who donated his extensive collection of medieval manuscripts...
  [500 more words of historical detail]
# AFTER - FORBIDDEN!
description: "The museum was founded in 1892..."
# Historical detail is LOST!
❌ OVERWRITING
NEVER overwrite existing enrichment with new data:
# BEFORE (scraped 2024-06-01)
website_enrichment:
  fetch_timestamp: "2024-06-01T..."
  organization_details:
    staff_count: 45
# AFTER (scraped 2024-11-28) - FORBIDDEN!
website_enrichment:
  fetch_timestamp: "2024-11-28T..."
  organization_details:
    staff_count: 52
# June scrape data is LOST!
# CORRECT approach - preserve history:
website_enrichment_history:
  - fetch_timestamp: "2024-06-01T..."
    organization_details:
      staff_count: 45
  - fetch_timestamp: "2024-11-28T..."
    organization_details:
      staff_count: 52
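The history-preserving approach amounts to appending each new scrape to a list instead of replacing the previous one. A minimal sketch, assuming records are plain dicts; `append_snapshot` is a hypothetical helper name, and the elided timestamps are kept exactly as they appear in the example.

```python
import copy


def append_snapshot(record, snapshot, history_key="website_enrichment_history"):
    """Return a copy of `record` with `snapshot` appended to its history list."""
    updated = copy.deepcopy(record)
    # Existing snapshots are never edited or replaced, only extended.
    updated.setdefault(history_key, []).append(copy.deepcopy(snapshot))
    return updated


june = {"fetch_timestamp": "2024-06-01T...", "organization_details": {"staff_count": 45}}
november = {"fetch_timestamp": "2024-11-28T...", "organization_details": {"staff_count": 52}}

record = append_snapshot({}, june)
record = append_snapshot(record, november)
print(len(record["website_enrichment_history"]))
```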
Verification Checklist
Before writing ANY enriched file, agents MUST verify:
- Read First: Did I read the entire file before editing?
- Field Count: Does the new file have at least as many fields as the original?
- Data Preservation: Is every piece of enriched data from the original still present?
- No Summaries: Did I avoid summarizing or truncating any content?
- Timestamps Preserved: Are all fetch/scrape timestamps still present?
- Reviews Intact: If reviews existed, are ALL reviews still present?
- Metadata Intact: Are all IDs (place_id, osm_id, wikidata_id) preserved?
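Several of these checks can be automated before a file is written. The sketch below covers the field count, the ID and timestamp checks, and the review check; it assumes the original and the new record are already loaded as plain dicts, and the names `count_fields`, `find_key`, and `checklist_failures` are hypothetical. It is not a substitute for reading the whole file first.

```python
TRACKED_KEYS = ("place_id", "osm_id", "wikidata_id", "fetch_timestamp")


def count_fields(node):
    """Count every mapping key in a nested structure (list items are traversed)."""
    if isinstance(node, dict):
        return sum(1 + count_fields(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_fields(item) for item in node)
    return 0


def find_key(node, key):
    """Collect every value stored under `key`, at any depth."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


def checklist_failures(before, after):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    failures = []
    if count_fields(after) < count_fields(before):
        failures.append("field count decreased")
    for key in TRACKED_KEYS:
        kept = find_key(after, key)
        lost = [value for value in find_key(before, key) if value not in kept]
        if lost:
            failures.append(f"{key} lost: {lost}")
    # Flatten all review lists and require each original review to survive unchanged.
    old_reviews = [r for v in find_key(before, "reviews") if isinstance(v, list) for r in v]
    new_reviews = [r for v in find_key(after, "reviews") if isinstance(v, list) for r in v]
    if any(review not in new_reviews for review in old_reviews):
        failures.append("reviews lost or modified")
    return failures
```

A caller would run `checklist_failures(original, updated)` and refuse to write the file if the returned list is not empty.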
Recovery Procedures
If data loss is detected:
- Check Git History:
  git log --oneline -- path/to/file.yaml
- Restore Previous Version:
  git checkout HEAD~1 -- path/to/file.yaml
- Document Incident: Note which agent/session caused the data loss
- Merge Correctly: Manually merge the lost data with any valid new additions
Rationale
Why is this rule so strict?
- Cost of Collection:
  - Google Maps API calls cost money and have rate limits
  - Website scraping takes time and may be blocked
  - Wikidata queries have usage quotas
  - Re-collecting data wastes resources
- Historical Value:
  - Heritage institutions change over time
  - Historical snapshots enable temporal analysis
  - Old reviews may reference events no longer on websites
- Data Provenance:
  - Linked Data principles require traceability
  - Deleting source data breaks the provenance chain
  - Academic citations may reference specific versions
- Community Trust:
  - Users expect stable, growing datasets
  - Data loss undermines confidence in the project
  - Heritage professionals depend on comprehensive records
References
- AGENTS.md: Rule 5 - NEVER Delete Enriched Data
- W3C PROV-O: https://www.w3.org/TR/prov-o/
- FAIR Principles: https://www.go-fair.org/fair-principles/
Remember: When in doubt, PRESERVE everything. It's always possible to add data later, but deleted data may be gone forever.