glam/.opencode/DATA_PRESERVATION_RULES.md
Data Preservation Rules for AI Agents

Version: 1.0
Created: 2025-11-28
Status: MANDATORY


🚨 CRITICAL RULE: NEVER DELETE ENRICHED DATA

Data enrichment is ADDITIVE ONLY.

When AI agents restructure, update, or refactor enriched institution records, they MUST preserve ALL existing enriched content. This rule has NO exceptions.


Protected Data Sources

The following data sources contain enriched content that MUST NEVER be deleted:

1. Google Maps Enrichment

Protected Fields:

google_maps_enrichment:
  place_id: "ChIJ..."              # NEVER DELETE
  rating: 4.5                      # NEVER DELETE
  total_ratings: 127               # NEVER DELETE
  reviews: [...]                   # NEVER DELETE
  photo_count: 45                  # NEVER DELETE
  popular_times: {...}             # NEVER DELETE
  business_status: "OPERATIONAL"   # NEVER DELETE
  address_components: [...]        # NEVER DELETE
  opening_hours: {...}             # NEVER DELETE
  website: "https://..."           # NEVER DELETE
  phone: "+31..."                  # NEVER DELETE

Why This Data Is Valuable:

  • Reviews contain visitor experiences and heritage descriptions
  • Ratings indicate public engagement with heritage sites
  • Photo counts show visual documentation availability
  • Popular times reveal visitor patterns for heritage planning
  • Historical snapshots capture changing institutional data

2. OpenStreetMap Enrichment

Protected Fields:

osm_enrichment:
  osm_id: 12345678                 # NEVER DELETE
  osm_type: "way"                  # NEVER DELETE
  osm_tags:                        # NEVER DELETE
    heritage: "2"
    building: "museum"
    amenity: "museum"
    name: "..."
    operator: "..."
  geometry: {...}                  # NEVER DELETE
  centroid: [lat, lon]             # NEVER DELETE

Why This Data Is Valuable:

  • OSM tags contain community-verified heritage classifications
  • Geometry enables spatial analysis and mapping
  • Heritage tags indicate official heritage status

3. Wikidata Enrichment

Protected Fields:

wikidata_enrichment:
  wikidata_id: "Q12345"            # NEVER DELETE
  claims: {...}                    # NEVER DELETE
  sitelinks: {...}                 # NEVER DELETE
  aliases: [...]                   # NEVER DELETE
  descriptions: {...}              # NEVER DELETE
  labels: {...}                    # NEVER DELETE
  instance_of: [...]               # NEVER DELETE
  part_of: [...]                   # NEVER DELETE
  coordinate_location: {...}       # NEVER DELETE

Why This Data Is Valuable:

  • Wikidata IDs are persistent identifiers in the Linked Open Data ecosystem
  • Claims contain structured facts about institutions
  • Sitelinks connect to Wikipedia articles in multiple languages

4. Website Scrape Data

Protected Fields:

website_enrichment:
  fetch_timestamp: "..."           # NEVER DELETE
  fetch_status: "SUCCESS"          # NEVER DELETE
  source_url: "https://..."        # NEVER DELETE
  
  organization_details:            # NEVER DELETE
    full_name: "..."
    short_name: "..."
    legal_form: "..."
    founded: "..."
    kvk_number: "..."
    anbi_status: true
    
  collections: [...]               # NEVER DELETE
  exhibitions: [...]               # NEVER DELETE
  opening_hours: {...}             # NEVER DELETE
  contact: {...}                   # NEVER DELETE
  social_media: {...}              # NEVER DELETE
  accessibility: {...}             # NEVER DELETE
  staff: [...]                     # NEVER DELETE
  publications: [...]              # NEVER DELETE

Why This Data Is Valuable:

  • Website scrapes capture institutional self-descriptions
  • Collection information enables heritage discovery
  • Contact details enable institutional verification
  • Historical scrapes show how institutions evolve

5. ISIL Registry Data

Protected Fields:

isil_enrichment:
  isil_code: "NL-..."              # NEVER DELETE
  assigned_date: "2020-01-15"      # NEVER DELETE
  remarks: "..."                   # NEVER DELETE
  registry_entry: {...}            # NEVER DELETE

Allowed Operations

RESTRUCTURING (Preserving All Content)

You MAY reorganize the data structure as long as every value is kept:

# BEFORE (flat)
google_rating: 4.5
google_reviews: 127
osm_heritage_tag: "2"

# AFTER (nested) - ALLOWED if ALL values preserved
enrichment:
  google_maps:
    rating: 4.5        # ← Same value
    reviews: 127       # ← Same value
  osm:
    heritage: "2"      # ← Same value
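
One way to check a restructuring like the one above is to compare the leaf values of the record before and after the change, ignoring where they sit in the structure. The sketch below is only an illustration, assuming records are already loaded as plain Python dicts and lists; `leaf_values` and `missing_values` are hypothetical helper names, not part of any existing project script.

```python
from collections import Counter


def leaf_values(node):
    """Yield every scalar value found anywhere inside nested dicts and lists."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def missing_values(before, after):
    """Return the repr of each value that occurs more often in `before` than in `after`."""
    # repr() keeps 4.5 and "4.5" distinct and makes every leaf hashable.
    lost = Counter(map(repr, leaf_values(before)))
    lost.subtract(Counter(map(repr, leaf_values(after))))
    return sorted(value for value, count in lost.items() if count > 0)


before = {"google_rating": 4.5, "google_reviews": 127, "osm_heritage_tag": "2"}
after = {
    "enrichment": {
        "google_maps": {"rating": 4.5, "reviews": 127},
        "osm": {"heritage": "2"},
    }
}
print(missing_values(before, after))
```

An empty result means every original value still exists somewhere in the new structure. The check does not prove that each value ended up under a sensible key, so it complements a manual review rather than replacing it.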

ADDING NEW DATA

You MAY add new fields or enrichment sources:

# BEFORE
website_enrichment:
  organization_details: {...}

# AFTER - ALLOWED (added new section)
website_enrichment:
  organization_details: {...}    # ← Original preserved
  new_section:                   # ← New addition
    additional_data: "..."
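
When a change only adds data and keeps the existing structure, a stricter check is possible: every path in the original record should still exist in the new record with an unchanged value. The following is a minimal sketch under that assumption; `lost_paths` is a hypothetical helper, and lists are compared as whole values for simplicity.

```python
def lost_paths(old, new, path=""):
    """List dotted paths whose value in `old` is missing or changed in `new`."""
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return [path or "<root>"]
        problems = []
        for key, value in old.items():
            child = f"{path}.{key}" if path else str(key)
            if key not in new:
                problems.append(child)
            else:
                problems.extend(lost_paths(value, new[key], child))
        return problems
    # Scalars and lists must be identical; extra keys in `new` are allowed.
    return [] if old == new else [path or "<root>"]


original = {"website_enrichment": {"organization_details": {"full_name": "..."}}}
extended = {
    "website_enrichment": {
        "organization_details": {"full_name": "..."},
        "new_section": {"additional_data": "..."},
    }
}
print(lost_paths(original, extended))
```

An empty list indicates a purely additive change. Any path that is returned points at data that was dropped or altered.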

FILE RENAMING

You MAY rename files to be more descriptive:

# ALLOWED
mv 0139_unknown.yaml 0139_de_hollandse_cirkel.yaml

MERGING SOURCES

You MAY merge data from multiple sources as long as the content of every source is preserved:

# Source A
website_enrichment:
  organization_details: {...}

# Source B  
google_maps_enrichment:
  rating: 4.5

# MERGED - ALLOWED (both preserved)
enrichment:
  website:
    organization_details: {...}  # ← From Source A
  google_maps:
    rating: 4.5                  # ← From Source B
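
A merge that follows this rule can be sketched as a recursive union that refuses to overwrite. The example below is an assumption-laden illustration: `merge_sources` is a hypothetical helper, it keeps the original top-level keys rather than renaming them under `enrichment`, and it raises an error on conflicting values instead of picking a winner.

```python
import copy


def merge_sources(record_a, record_b):
    """Combine two records key by key without dropping or overwriting anything."""
    merged = copy.deepcopy(record_a)
    for key, value in record_b.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_sources(merged[key], value)
        elif merged[key] != value:
            # Conflicts need a human or a history list, never a silent overwrite.
            raise ValueError(f"conflicting values for {key!r}")
    return merged


source_a = {"website_enrichment": {"organization_details": {"full_name": "..."}}}
source_b = {"google_maps_enrichment": {"rating": 4.5}}
merged = merge_sources(source_a, source_b)
```

Raising on conflict is a deliberate choice in this sketch: it forces the caller to decide how both values are kept, for example by storing them as separate dated snapshots.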

Forbidden Operations

DELETION

NEVER delete enriched content:

# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}

# AFTER - FORBIDDEN!
enrichment_status: enriched
# Where did rating, reviews, popular_times go?!

SUMMARIZATION

NEVER replace detailed data with summaries:

# BEFORE
reviews:
  - text: "Amazing collection of historical documents..."
    rating: 5
    author: "Jan de Vries"
    date: "2024-03-15"
  - text: "Small but well-curated museum..."
    rating: 4
    author: "Marie Jansen"
    date: "2024-02-20"

# AFTER - FORBIDDEN!
reviews_summary: "Generally positive reviews (avg 4.5)"
# Original review text is LOST!

TRUNCATION

NEVER truncate long content:

# BEFORE
description: |
  The museum was founded in 1892 by Count Willem van der Berg, 
  who donated his extensive collection of medieval manuscripts...
  [500 more words of historical detail]  

# AFTER - FORBIDDEN!
description: "The museum was founded in 1892..."
# Historical detail is LOST!

OVERWRITING

NEVER overwrite existing enrichment with new data:

# BEFORE (scraped 2024-06-01)
website_enrichment:
  fetch_timestamp: "2024-06-01T..."
  organization_details:
    staff_count: 45

# AFTER (scraped 2024-11-28) - FORBIDDEN!
website_enrichment:
  fetch_timestamp: "2024-11-28T..."
  organization_details:
    staff_count: 52
# June scrape data is LOST!

# CORRECT approach - preserve history:
website_enrichment_history:
  - fetch_timestamp: "2024-06-01T..."
    organization_details:
      staff_count: 45
  - fetch_timestamp: "2024-11-28T..."
    organization_details:
      staff_count: 52
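
The history approach above can be sketched as a small helper that appends each new scrape instead of replacing the old one. This is a minimal illustration, assuming records are plain dicts; `append_snapshot` is a hypothetical name, and the `<key>_history` naming simply mirrors the `website_enrichment_history` example.

```python
import copy


def append_snapshot(record, snapshot, key="website_enrichment"):
    """Return a copy of `record` with `snapshot` appended to `<key>_history`."""
    updated = copy.deepcopy(record)
    history = updated.setdefault(f"{key}_history", [])
    # Move a pre-existing single snapshot into the history list first,
    # so the earlier scrape is kept alongside the new one.
    if key in updated:
        history.append(updated.pop(key))
    history.append(copy.deepcopy(snapshot))
    return updated


june = {
    "website_enrichment": {
        "fetch_timestamp": "2024-06-01T...",
        "organization_details": {"staff_count": 45},
    }
}
november = {
    "fetch_timestamp": "2024-11-28T...",
    "organization_details": {"staff_count": 52},
}
result = append_snapshot(june, november)
```

The helper works on a deep copy, so the input record is left untouched and the caller can compare old and new versions before writing anything to disk.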

Verification Checklist

Before writing ANY enriched file, agents MUST verify:

  • Read First: Did I read the entire file before editing?
  • Field Count: Does the new file have at least as many fields as the original?
  • Data Preservation: Is every piece of enriched data from the original still present?
  • No Summaries: Did I avoid summarizing or truncating any content?
  • Timestamps Preserved: Are all fetch/scrape timestamps still present?
  • Reviews Intact: If reviews existed, are ALL reviews still present?
  • Metadata Intact: Are all IDs (place_id, osm_id, wikidata_id) preserved?
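
Parts of this checklist can be automated. The sketch below covers the field count, identifier, timestamp, and review checks, assuming both versions of a record are loaded as plain Python dicts. The helper names and the `ID_KEYS` and `TIMESTAMP_KEYS` sets are illustrative assumptions drawn from the field names in this document, not an existing validation script.

```python
ID_KEYS = {"place_id", "osm_id", "wikidata_id"}
TIMESTAMP_KEYS = {"fetch_timestamp"}


def walk(node):
    """Yield (key, value) for every mapping entry at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key, value
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def values_for(node, keys):
    """Collect the values stored under any of `keys`, at any depth."""
    return [value for key, value in walk(node) if key in keys]


def all_kept(original, updated, keys):
    """True if every value stored under `keys` in `original` is also in `updated`."""
    kept = values_for(updated, keys)
    return all(value in kept for value in values_for(original, keys))


def checklist_failures(original, updated):
    """Return the names of the checklist items that `updated` fails."""
    failures = []
    if sum(1 for _ in walk(updated)) < sum(1 for _ in walk(original)):
        failures.append("field count decreased")
    if not all_kept(original, updated, ID_KEYS):
        failures.append("identifier missing")
    if not all_kept(original, updated, TIMESTAMP_KEYS):
        failures.append("timestamp missing")
    if not all_kept(original, updated, {"reviews"}):
        failures.append("reviews changed or missing")
    return failures


original = {
    "google_maps_enrichment": {
        "place_id": "ChIJ...",
        "rating": 4.5,
        "reviews": [{"text": "...", "rating": 5}],
    }
}
restructured = {"enrichment": {"google_maps": original["google_maps_enrichment"]}}
print(checklist_failures(original, restructured))
```

A non-empty result should block the write. An empty result does not cover the "Read First" or "No Summaries" items, which still need to be confirmed by the agent itself.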

Recovery Procedures

If data loss is detected:

  1. Check Git History: git log --oneline -- path/to/file.yaml
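  2. Restore Previous Version: git checkout HEAD~1 -- path/to/file.yaml (if the loss happened in an earlier commit, replace HEAD~1 with the last good commit hash from step 1)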
  3. Document Incident: Note which agent/session caused the data loss
  4. Merge Correctly: Manually merge the lost data with any valid new additions

Rationale

Why is this rule so strict?

  1. Cost of Collection:

    • Google Maps API calls cost money and have rate limits
    • Website scraping takes time and may be blocked
    • Wikidata queries have usage quotas
    • Re-collecting data wastes resources
  2. Historical Value:

    • Heritage institutions change over time
    • Historical snapshots enable temporal analysis
    • Old reviews may reference events no longer on websites
  3. Data Provenance:

    • Linked Data principles require traceability
    • Deleting source data breaks the provenance chain
    • Academic citations may reference specific versions
  4. Community Trust:

    • Users expect stable, growing datasets
    • Data loss undermines confidence in the project
    • Heritage professionals depend on comprehensive records


Remember: When in doubt, PRESERVE everything. It's always possible to add data later, but deleted data may be gone forever.