# Data Preservation Rules for AI Agents

**Version**: 1.0
**Created**: 2025-11-28
**Status**: MANDATORY

---

## 🚨 CRITICAL RULE: NEVER DELETE ENRICHED DATA

**Data enrichment is ADDITIVE ONLY.**

When AI agents restructure, update, or refactor enriched institution records, they MUST preserve ALL existing enriched content. This rule has NO exceptions.

---

## Protected Data Sources

The following data sources contain enriched content that MUST NEVER be deleted:

### 1. Google Maps Enrichment

**Protected Fields**:

```yaml
google_maps_enrichment:
  place_id: "ChIJ..."              # NEVER DELETE
  rating: 4.5                      # NEVER DELETE
  total_ratings: 127               # NEVER DELETE
  reviews: [...]                   # NEVER DELETE
  photo_count: 45                  # NEVER DELETE
  popular_times: {...}             # NEVER DELETE
  business_status: "OPERATIONAL"   # NEVER DELETE
  address_components: [...]        # NEVER DELETE
  opening_hours: {...}             # NEVER DELETE
  website: "https://..."           # NEVER DELETE
  phone: "+31..."                  # NEVER DELETE
```

**Why This Data Is Valuable**:

- Reviews contain visitor experiences and heritage descriptions
- Ratings indicate public engagement with heritage sites
- Photo counts show visual documentation availability
- Popular times reveal visitor patterns for heritage planning
- Historical snapshots capture changing institutional data

### 2. OpenStreetMap Enrichment

**Protected Fields**:

```yaml
osm_enrichment:
  osm_id: 12345678                 # NEVER DELETE
  osm_type: "way"                  # NEVER DELETE
  osm_tags:                        # NEVER DELETE
    heritage: "2"
    building: "museum"
    amenity: "museum"
    name: "..."
    operator: "..."
  geometry: {...}                  # NEVER DELETE
  centroid: [lat, lon]             # NEVER DELETE
```

**Why This Data Is Valuable**:

- OSM tags contain community-verified heritage classifications
- Geometry enables spatial analysis and mapping
- Heritage tags indicate official heritage status

### 3. Wikidata Enrichment

**Protected Fields**:

```yaml
wikidata_enrichment:
  wikidata_id: "Q12345"            # NEVER DELETE
  claims: {...}                    # NEVER DELETE
  sitelinks: {...}                 # NEVER DELETE
  aliases: [...]                   # NEVER DELETE
  descriptions: {...}              # NEVER DELETE
  labels: {...}                    # NEVER DELETE
  instance_of: [...]               # NEVER DELETE
  part_of: [...]                   # NEVER DELETE
  coordinate_location: {...}       # NEVER DELETE
```

**Why This Data Is Valuable**:

- Wikidata IDs are persistent identifiers in the Linked Open Data ecosystem
- Claims contain structured facts about institutions
- Sitelinks connect to Wikipedia articles in multiple languages

### 4. Website Scrape Data

**Protected Fields**:

```yaml
website_enrichment:
  fetch_timestamp: "..."           # NEVER DELETE
  fetch_status: "SUCCESS"          # NEVER DELETE
  source_url: "https://..."        # NEVER DELETE
  organization_details:            # NEVER DELETE
    full_name: "..."
    short_name: "..."
    legal_form: "..."
    founded: "..."
    kvk_number: "..."
    anbi_status: true
  collections: [...]               # NEVER DELETE
  exhibitions: [...]               # NEVER DELETE
  opening_hours: {...}             # NEVER DELETE
  contact: {...}                   # NEVER DELETE
  social_media: {...}              # NEVER DELETE
  accessibility: {...}             # NEVER DELETE
  staff: [...]                     # NEVER DELETE
  publications: [...]              # NEVER DELETE
```

**Why This Data Is Valuable**:

- Website scrapes capture institutional self-descriptions
- Collection information enables heritage discovery
- Contact details enable institutional verification
- Historical scrapes show how institutions evolve

### 5. ISIL Registry Data

**Protected Fields**:

```yaml
isil_enrichment:
  isil_code: "NL-..."              # NEVER DELETE
  assigned_date: "2020-01-15"      # NEVER DELETE
  remarks: "..."                   # NEVER DELETE
  registry_entry: {...}            # NEVER DELETE
```

---

## Allowed Operations

### ✅ RESTRUCTURING (Preserving All Content)

You MAY reorganize data structure while keeping all values:

```yaml
# BEFORE (flat)
google_rating: 4.5
google_reviews: 127
osm_heritage_tag: "2"

# AFTER (nested) - ALLOWED if ALL values preserved
enrichment:
  google_maps:
    rating: 4.5      # ← Same value
    reviews: 127     # ← Same value
  osm:
    heritage: "2"    # ← Same value
```

### ✅ ADDING NEW DATA

You MAY add new fields or enrichment sources:

```yaml
# BEFORE
website_enrichment:
  organization_details: {...}

# AFTER - ALLOWED (added new section)
website_enrichment:
  organization_details: {...}   # ← Original preserved
  new_section:                  # ← New addition
    additional_data: "..."
```

### ✅ FILE RENAMING

You MAY rename files to be more descriptive:

```bash
# ALLOWED
mv 0139_unknown.yaml 0139_de_hollandse_cirkel.yaml
```

### ✅ MERGING SOURCES

You MAY merge data from multiple sources while preserving all:

```yaml
# Source A
website_enrichment:
  organization_details: {...}

# Source B
google_maps_enrichment:
  rating: 4.5

# MERGED - ALLOWED (both preserved)
enrichment:
  website:
    organization_details: {...}   # ← From Source A
  google_maps:
    rating: 4.5                   # ← From Source B
```

---

## Forbidden Operations

### ❌ DELETION

**NEVER delete enriched content:**

```yaml
# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}

# AFTER - FORBIDDEN!
enrichment_status: enriched
# Where did rating, reviews, popular_times go?!
```

### ❌ SUMMARIZATION

**NEVER replace detailed data with summaries:**

```yaml
# BEFORE
reviews:
  - text: "Amazing collection of historical documents..."
    rating: 5
    author: "Jan de Vries"
    date: "2024-03-15"
  - text: "Small but well-curated museum..."
    rating: 4
    author: "Marie Jansen"
    date: "2024-02-20"

# AFTER - FORBIDDEN!
reviews_summary: "Generally positive reviews (avg 4.5)"
# Original review text is LOST!
```

### ❌ TRUNCATION

**NEVER truncate long content:**

```yaml
# BEFORE
description: |
  The museum was founded in 1892 by Count Willem van der Berg,
  who donated his extensive collection of medieval manuscripts...
  [500 more words of historical detail]

# AFTER - FORBIDDEN!
description: "The museum was founded in 1892..."
# Historical detail is LOST!
```

### ❌ OVERWRITING

**NEVER overwrite existing enrichment with new data:**

```yaml
# BEFORE (scraped 2024-06-01)
website_enrichment:
  fetch_timestamp: "2024-06-01T..."
  organization_details:
    staff_count: 45

# AFTER (scraped 2024-11-28) - FORBIDDEN!
website_enrichment:
  fetch_timestamp: "2024-11-28T..."
  organization_details:
    staff_count: 52
# June scrape data is LOST!

# CORRECT approach - preserve history:
website_enrichment_history:
  - fetch_timestamp: "2024-06-01T..."
    organization_details:
      staff_count: 45
  - fetch_timestamp: "2024-11-28T..."
    organization_details:
      staff_count: 52
```

---

## Verification Checklist

Before writing ANY enriched file, agents MUST verify:

- [ ] **Read First**: Did I read the entire file before editing?
- [ ] **Field Count**: Does the new file have at least as many fields as the original?
- [ ] **Data Preservation**: Is every piece of enriched data from the original still present?
- [ ] **No Summaries**: Did I avoid summarizing or truncating any content?
- [ ] **Timestamps Preserved**: Are all fetch/scrape timestamps still present?
- [ ] **Reviews Intact**: If reviews existed, are ALL reviews still present?
- [ ] **Metadata Intact**: Are all IDs (place_id, osm_id, wikidata_id) preserved?

---

## Recovery Procedures

If data loss is detected:

1. **Check Git History**: `git log --oneline -- path/to/file.yaml`
2. **Restore Previous Version**: `git checkout HEAD~1 -- path/to/file.yaml` (use `HEAD~1` only if the loss was introduced by the most recent commit; otherwise substitute the hash of the last good commit found in step 1)
3. **Document Incident**: Note which agent/session caused the data loss
4. **Merge Correctly**: Manually merge the lost data with any valid new additions

---

## Rationale

**Why is this rule so strict?**

1. **Cost of Collection**:
   - Google Maps API calls cost money and have rate limits
   - Website scraping takes time and may be blocked
   - Wikidata queries have usage quotas
   - Re-collecting data wastes resources

2. **Historical Value**:
   - Heritage institutions change over time
   - Historical snapshots enable temporal analysis
   - Old reviews may reference events no longer on websites

3. **Data Provenance**:
   - Linked Data principles require traceability
   - Deleting source data breaks the provenance chain
   - Academic citations may reference specific versions

4. **Community Trust**:
   - Users expect stable, growing datasets
   - Data loss undermines confidence in the project
   - Heritage professionals depend on comprehensive records

---

## References

- **AGENTS.md**: Rule 5 - NEVER Delete Enriched Data
- **W3C PROV-O**: https://www.w3.org/TR/prov-o/
- **FAIR Principles**: https://www.go-fair.org/fair-principles/

---

**Remember**: When in doubt, PRESERVE everything. It's always possible to add data later, but deleted data may be gone forever.
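
---

## Appendix: Sketch of an Automated Preservation Check

The "Data Preservation" item in the verification checklist can be partly automated. The following is a minimal sketch, not part of the project's tooling: the function names `leaf_values` and `lost_values` are hypothetical, and it assumes both versions of a record have already been loaded into plain Python dicts and lists (for example with a YAML parser). Because restructuring is allowed, it compares leaf *values* rather than key paths, so moving `google_rating` to `enrichment.google_maps.rating` passes while summarizing or truncating a value is reported.

```python
import json
from collections import Counter


def leaf_values(node):
    """Yield every scalar value found in a nested dict/list structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def lost_values(original, updated):
    """Return JSON-encoded leaf values that occur more often in
    `original` than in `updated` (an empty list means nothing was lost)."""

    def count(record):
        # json.dumps gives a hashable key and keeps 2 distinct from "2".
        return Counter(
            json.dumps(value, ensure_ascii=False, default=str)
            for value in leaf_values(record)
        )

    missing = count(original) - count(updated)
    return sorted(missing.elements())


# Hypothetical records mirroring the examples in this document.
before = {"google_rating": 4.5, "google_reviews": 127, "osm_heritage_tag": "2"}
restructured = {
    "enrichment": {
        "google_maps": {"rating": 4.5, "reviews": 127},
        "osm": {"heritage": "2"},
    }
}
summarized = {"enrichment_status": "enriched"}

print(lost_values(before, restructured))  # restructuring keeps every value
print(lost_values(before, summarized))    # deletion is reported
```

A value-based comparison is deliberately loose: it will not notice a value that was moved under a misleading key, so it complements the manual checklist and does not replace it.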