glam/docs/SCHEMA_V0.2.2_CHANGELOG.md
2025-11-19 23:25:22 +01:00

Schema v0.2.2 Changelog

Release Date: 2025-11-10
Previous Version: v0.2.1

Summary

Schema v0.2.2 introduces structured enrichment history tracking to replace unstructured provenance.notes strings. This enhancement provides machine-readable, queryable metadata for data quality activities (Wikidata enrichment, geocoding, identifier verification, etc.) with full ontology alignment.

Changes

New Classes

EnrichmentHistoryEntry (schemas/provenance.yaml)

Tracks individual data enrichment activities with structured metadata:

EnrichmentHistoryEntry:
  slots:
    - enrichment_date       # When enrichment performed (datetime, required)
    - enrichment_method     # Method used (string, required)
    - enrichment_type       # Type of enrichment (EnrichmentTypeEnum, required)
    - match_score           # Fuzzy match confidence 0.0-1.0 (float, optional)
    - verified              # Manual verification status (boolean, required)
    - enrichment_source     # Data source URI (uri, optional)
    - enrichment_notes      # Human-readable details (string, optional)
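
To illustrate how these slots might be handled in application code, the sketch below mirrors them in a Python dataclass. The dataclass is a hand-written, hypothetical stand-in and is not generated from the schema. The range check follows the 0.0-1.0 comment on match_score, and enrichment_type is kept as a plain string for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EnrichmentHistoryEntry:
    """Hypothetical in-memory mirror of the schema class (not generated code)."""

    enrichment_date: datetime                 # required
    enrichment_method: str                    # required
    enrichment_type: str                      # required; an EnrichmentTypeEnum value
    verified: bool                            # required
    match_score: Optional[float] = None       # optional, 0.0-1.0
    enrichment_source: Optional[str] = None   # optional URI
    enrichment_notes: Optional[str] = None    # optional free text

    def __post_init__(self) -> None:
        # Enforce the documented 0.0-1.0 range when a score is present.
        if self.match_score is not None and not 0.0 <= self.match_score <= 1.0:
            raise ValueError(f"match_score out of range: {self.match_score}")


entry = EnrichmentHistoryEntry(
    enrichment_date=datetime(2025, 11, 10, 14, 30),
    enrichment_method="Wikidata SPARQL fuzzy matching",
    enrichment_type="WIKIDATA_IDENTIFIER",
    verified=True,
    match_score=1.0,
)
```

Optional slots default to None, so an entry without a match score (for example a manual identifier verification) can be built from the four required slots alone.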

Ontology Mappings:

  • enrichment_date → prov:atTime (PROV-O)
  • enrichment_method → prov:hadPlan (PROV-O)
  • enrichment_type → rdf:type (RDF)
  • match_score → adms:confidence (ADMS)
  • verified → adms:status (ADMS)
  • enrichment_source → dcterms:source (Dublin Core)
  • enrichment_notes → dcterms:description (Dublin Core)
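
One way an exporter could use this mapping is as a plain lookup table. The sketch below copies the slot-to-predicate pairs from the list above as-is; it does not check that each predicate exists in the published vocabulary. The function name to_predicate_pairs is hypothetical.

```python
from typing import List, Tuple

# Slot-to-predicate table copied from the mapping list in this changelog.
SLOT_TO_PREDICATE = {
    "enrichment_date": "prov:atTime",
    "enrichment_method": "prov:hadPlan",
    "enrichment_type": "rdf:type",
    "match_score": "adms:confidence",
    "verified": "adms:status",
    "enrichment_source": "dcterms:source",
    "enrichment_notes": "dcterms:description",
}


def to_predicate_pairs(entry: dict) -> List[Tuple[str, object]]:
    """Return (predicate CURIE, value) pairs for the mapped slots that are set."""
    return [
        (SLOT_TO_PREDICATE[slot], value)
        for slot, value in entry.items()
        if slot in SLOT_TO_PREDICATE and value is not None
    ]


pairs = to_predicate_pairs({"enrichment_type": "GEOCODING", "match_score": 0.95})
```

Slots that are absent or None are skipped, so optional fields produce no output pair.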

New Enumerations

EnrichmentTypeEnum (schemas/enums.yaml)

15 controlled vocabulary values for enrichment activity types:

  1. WIKIDATA_IDENTIFIER - Wikidata Q-number added
  2. GEOCODING - Lat/lon coordinates added
  3. VIAF_IDENTIFIER - VIAF identifier added
  4. ISIL_CODE - ISIL code assigned
  5. GHCID_GENERATION - GHCID identifier generated
  6. FALSE_POSITIVE_REMOVAL - Incorrect enrichment removed
  7. NAME_NORMALIZATION - Institution name normalized
  8. IDENTIFIER_VERIFICATION - Existing identifier verified
  9. INSTITUTION_TYPE_CLASSIFICATION - Institution type classified
  10. ADDRESS_STANDARDIZATION - Physical address standardized
  11. WEBSITE_URL_VALIDATION - Website URL validated
  12. COLLECTION_METADATA - Collection metadata added
  13. ORGANIZATIONAL_RELATIONSHIP - Org relationships identified
  14. DIGITAL_PLATFORM_DETECTION - Digital platforms identified
  15. OTHER - Other enrichment activity
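
For code that builds entries by hand, the vocabulary can be mirrored as a Python Enum so that misspelled values fail early. This is a hand-written sketch of the 15 values listed above, not code generated from schemas/enums.yaml; parse_enrichment_type is a hypothetical helper.

```python
from enum import Enum


class EnrichmentTypeEnum(str, Enum):
    """Hand-written mirror of the 15 values listed in this changelog."""

    WIKIDATA_IDENTIFIER = "WIKIDATA_IDENTIFIER"
    GEOCODING = "GEOCODING"
    VIAF_IDENTIFIER = "VIAF_IDENTIFIER"
    ISIL_CODE = "ISIL_CODE"
    GHCID_GENERATION = "GHCID_GENERATION"
    FALSE_POSITIVE_REMOVAL = "FALSE_POSITIVE_REMOVAL"
    NAME_NORMALIZATION = "NAME_NORMALIZATION"
    IDENTIFIER_VERIFICATION = "IDENTIFIER_VERIFICATION"
    INSTITUTION_TYPE_CLASSIFICATION = "INSTITUTION_TYPE_CLASSIFICATION"
    ADDRESS_STANDARDIZATION = "ADDRESS_STANDARDIZATION"
    WEBSITE_URL_VALIDATION = "WEBSITE_URL_VALIDATION"
    COLLECTION_METADATA = "COLLECTION_METADATA"
    ORGANIZATIONAL_RELATIONSHIP = "ORGANIZATIONAL_RELATIONSHIP"
    DIGITAL_PLATFORM_DETECTION = "DIGITAL_PLATFORM_DETECTION"
    OTHER = "OTHER"


def parse_enrichment_type(value: str) -> EnrichmentTypeEnum:
    """Return the matching member; raises ValueError for unknown values."""
    return EnrichmentTypeEnum(value)
```

Calling parse_enrichment_type("GEOCODING") returns EnrichmentTypeEnum.GEOCODING, while a value outside the vocabulary raises ValueError instead of silently entering the data.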

Modified Classes

Provenance (schemas/provenance.yaml)

Added new slot:

enrichment_history:
  range: EnrichmentHistoryEntry
  multivalued: true
  inlined_as_list: true
  description: >-
    Chronological log of data enrichment activities performed on this record    

Note: The provenance.notes field remains for backward compatibility but is deprecated. Use enrichment_history for new data.
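
Because the slot is described as a chronological log, code that writes to it may want to keep entries in date order. The following is a minimal sketch, assuming records are plain dicts loaded from YAML and that enrichment_date is an ISO 8601 string. Both helper names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def record_enrichment(provenance: dict, entry: dict) -> None:
    """Append entry to provenance['enrichment_history'], keeping date order."""
    history = provenance.setdefault("enrichment_history", [])
    history.append(entry)
    history.sort(key=lambda e: parse_timestamp(e["enrichment_date"]))


provenance = {"notes": "legacy text"}
record_enrichment(provenance, {
    "enrichment_date": "2025-11-10T14:35:00Z",
    "enrichment_type": "GEOCODING",
})
record_enrichment(provenance, {
    "enrichment_date": "2025-11-10T14:30:00Z",
    "enrichment_type": "WIKIDATA_IDENTIFIER",
})
```

The helper leaves any existing notes value untouched, in line with the deprecated field remaining available.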

New Ontology Prefixes

Added to support enrichment metadata:

  • foaf: - Friend of a Friend (agent/contact information)
  • adms: - Asset Description Metadata Schema (verification/confidence)

Benefits

Before (v0.2.1): Unstructured Notes

provenance:
  notes: "Wikidata enriched 2025-11-10 (Q3330723, match: 100%). Geocoded to (36.806495, 10.181532) via Nominatim."

Problems:

  • Hard to parse programmatically
  • Not queryable (can't filter by type, date, confidence)
  • No ontology alignment
  • Mixed concerns (multiple activities in one string)

After (v0.2.2): Structured History

provenance:
  enrichment_history:
    - enrichment_date: "2025-11-10T14:30:00Z"
      enrichment_method: "Wikidata SPARQL fuzzy matching"
      enrichment_type: WIKIDATA_IDENTIFIER
      match_score: 1.0
      verified: true
      enrichment_source: "https://www.wikidata.org"
      enrichment_notes: "Perfect name match, city verified: Tunis"
    
    - enrichment_date: "2025-11-10T14:35:00Z"
      enrichment_method: "Nominatim geocoding API"
      enrichment_type: GEOCODING
      match_score: 0.95
      verified: false
      enrichment_source: "https://nominatim.openstreetmap.org"
      enrichment_notes: "Geocoded from city name"

Benefits:

  • Machine-readable structured data
  • Queryable (filter by type, confidence, verification status)
  • Ontology-aligned (PROV-O, ADMS, DCTerms, FOAF)
  • Separation of concerns (one entry per activity)
  • Chronological audit log

Query Examples

# Find all unverified enrichments needing manual review
unverified = [
    e for e in institution['provenance']['enrichment_history']
    if not e['verified']
]

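# Find low-confidence enrichments (< 0.85)
# match_score is optional, so use .get() and compare against None explicitly
# (a plain truthiness test would also skip a legitimate score of 0.0)
low_confidence = [
    e for e in institution['provenance']['enrichment_history']
    if e.get('match_score') is not None and e['match_score'] < 0.85
]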

# Count enrichments by type
from collections import Counter
type_counts = Counter(
    e['enrichment_type'] 
    for e in institution['provenance']['enrichment_history']
)

# Timeline of enrichment activities
timeline = sorted(
    institution['provenance']['enrichment_history'],
    key=lambda e: e['enrichment_date']
)

Migration Requirements

Existing instances with provenance.notes strings need migration:

  1. Parse notes patterns:

    • "Wikidata enriched YYYY-MM-DD (Qnumber, match: XX%)"
    • "Geocoded to (lat, lon) via Service"
    • "False Wikidata match Qnumber removed YYYY-MM-DD"
  2. Extract structured data:

    • Date, method, type, match score
    • Convert to EnrichmentHistoryEntry objects
  3. Migration script: scripts/migrate_enrichment_notes_to_history.py (TO BE CREATED)
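
The parsing step above could look roughly like the sketch below. It is not the planned migration script. The regular expressions are written against the three patterns listed above, and several choices are assumptions made for illustration: the function and parameter names, the midnight-UTC placeholder time (the notes carry only a date), verified defaulting to false, and the generic enrichment_method text. The geocoding pattern has no date, so the caller has to supply one.

```python
import re
from typing import Dict, List, Optional

WIKIDATA_RE = re.compile(
    r"Wikidata enriched (\d{4}-\d{2}-\d{2}) \((Q\d+), match: (\d+(?:\.\d+)?)%\)"
)
GEOCODED_RE = re.compile(
    r"Geocoded to \((-?\d+(?:\.\d+)?), (-?\d+(?:\.\d+)?)\) via (\w+)"
)
FALSE_MATCH_RE = re.compile(
    r"False Wikidata match (Q\d+) removed (\d{4}-\d{2}-\d{2})"
)


def _midnight_utc(day: str) -> str:
    # Notes carry only a date; midnight UTC is an assumed placeholder time.
    return f"{day}T00:00:00Z"


def notes_to_history(notes: str, fallback_date: Optional[str] = None) -> List[Dict]:
    """Convert recognised patterns in a legacy notes string to entry dicts."""
    entries: List[Dict] = []
    for m in WIKIDATA_RE.finditer(notes):
        day, qid, pct = m.groups()
        entries.append({
            "enrichment_date": _midnight_utc(day),
            "enrichment_method": "Migrated from provenance.notes",
            "enrichment_type": "WIKIDATA_IDENTIFIER",
            "match_score": float(pct) / 100.0,
            "verified": False,  # assumption: legacy notes record no review
            "enrichment_notes": f"Wikidata {qid}",
        })
    for m in GEOCODED_RE.finditer(notes):
        lat, lon, service = m.groups()
        entries.append({
            # This pattern has no date, so the caller must supply one.
            "enrichment_date": _midnight_utc(fallback_date) if fallback_date else None,
            "enrichment_method": f"{service} geocoding",
            "enrichment_type": "GEOCODING",
            "verified": False,
            "enrichment_notes": f"Geocoded to ({lat}, {lon})",
        })
    for m in FALSE_MATCH_RE.finditer(notes):
        qid, day = m.groups()
        entries.append({
            "enrichment_date": _midnight_utc(day),
            "enrichment_method": "Migrated from provenance.notes",
            "enrichment_type": "FALSE_POSITIVE_REMOVAL",
            "verified": False,
            "enrichment_notes": f"False Wikidata match {qid} removed",
        })
    return entries
```

Entries come out grouped by pattern rather than by their position in the string, so a real migration would likely sort them by enrichment_date afterwards and log any notes text that no pattern matched.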

Files Modified

  1. schemas/provenance.yaml

    • Version: 0.2.1 → 0.2.2
    • Added EnrichmentHistoryEntry class (7 slots)
    • Added enrichment_history slot to Provenance class
    • Added ontology mappings (PROV-O, ADMS, FOAF, DCTerms)
  2. schemas/enums.yaml

    • Version: 0.2.1 → 0.2.2
    • Added EnrichmentTypeEnum (15 values)
  3. schemas/heritage_custodian.yaml

    • Version: 0.2.1 → 0.2.2
    • Version bump to match module versions

Backward Compatibility

  • Fully backward compatible
  • provenance.notes field remains available (deprecated)
  • Existing instances continue to work without changes
  • New instances should use enrichment_history
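
Since both fields can coexist during the transition, tooling may need to tell which records still rely on the deprecated field. A small sketch, assuming provenance blocks are plain dicts; the function name and the four status labels are hypothetical.

```python
def migration_status(provenance: dict) -> str:
    """Classify a provenance block as 'structured', 'legacy', 'mixed', or 'empty'."""
    has_history = bool(provenance.get("enrichment_history"))
    has_notes = bool((provenance.get("notes") or "").strip())
    if has_history and has_notes:
        return "mixed"
    if has_history:
        return "structured"
    if has_notes:
        return "legacy"
    return "empty"
```

Records reported as "legacy" are the ones that still depend on provenance.notes alone.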

Testing

Demonstration script: scripts/demo_enrichment_history.py

python scripts/demo_enrichment_history.py

Shows before/after comparison, query examples, and ontology mappings.

Next Steps

  1. Schema enhancement complete (v0.2.2)
  2. Create migration script for existing instances
  3. Test with Phase 3 (Chile enrichment workflow)
  4. Update data quality reports to query enrichment_history
  5. Update RDF exporter to serialize enrichment metadata with PROV-O/ADMS

Related Documentation

  • Schema Modules: /docs/SCHEMA_MODULES.md
  • Ontology Extensions: /docs/ONTOLOGY_EXTENSIONS.md (to be updated)
  • Phase 2 Completion Report: /data/instances/north_africa/PHASE2_COMPLETION_REPORT.md
  • Agent Instructions: /AGENTS.md (to be updated)

Contributors

  • Schema design: OpenCode AI Agent
  • Ontology alignment: Based on W3C PROV-O, ADMS, Dublin Core
  • Testing: Demonstration script with query examples

Schema Version: v0.2.2
Release: 2025-11-10
Status: Production-ready