# Schema v0.2.2 Changelog
**Release Date**: 2025-11-10
**Previous Version**: v0.2.1
## Summary
Schema v0.2.2 introduces **structured enrichment history tracking** to replace unstructured `provenance.notes` strings. This enhancement provides machine-readable, queryable metadata for data quality activities (Wikidata enrichment, geocoding, identifier verification, etc.) with full ontology alignment.
## Changes
### New Classes
#### `EnrichmentHistoryEntry` (schemas/provenance.yaml)
Tracks individual data enrichment activities with structured metadata:
```yaml
EnrichmentHistoryEntry:
  slots:
    - enrichment_date    # When enrichment performed (datetime, required)
    - enrichment_method  # Method used (string, required)
    - enrichment_type    # Type of enrichment (EnrichmentTypeEnum, required)
    - match_score        # Fuzzy match confidence 0.0-1.0 (float, optional)
    - verified           # Manual verification status (boolean, required)
    - enrichment_source  # Data source URI (uri, optional)
    - enrichment_notes   # Human-readable details (string, optional)
```
**Ontology Mappings**:
- `enrichment_date` → `prov:atTime` (PROV-O)
- `enrichment_method` → `prov:hadPlan` (PROV-O)
- `enrichment_type` → `rdf:type` (RDF)
- `match_score` → `adms:confidence` (ADMS)
- `verified` → `adms:status` (ADMS)
- `enrichment_source` → `dcterms:source` (Dublin Core)
- `enrichment_notes` → `dcterms:description` (Dublin Core)
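
The mappings listed above can be held as a simple lookup table. The following is a minimal sketch, not the project's RDF exporter: `SLOT_TO_TERM` and `to_ontology_keys` are hypothetical names, and the result is a plain dict keyed by prefixed terms rather than serialized RDF.

```python
# Hypothetical lookup table mirroring the slot-to-term mappings listed above.
SLOT_TO_TERM = {
    "enrichment_date": "prov:atTime",
    "enrichment_method": "prov:hadPlan",
    "enrichment_type": "rdf:type",
    "match_score": "adms:confidence",
    "verified": "adms:status",
    "enrichment_source": "dcterms:source",
    "enrichment_notes": "dcterms:description",
}


def to_ontology_keys(entry: dict) -> dict:
    """Re-key one enrichment entry by its mapped ontology term.

    Slots without a mapping are rejected so that typos surface early.
    """
    unknown = set(entry) - set(SLOT_TO_TERM)
    if unknown:
        raise KeyError(f"unmapped slots: {sorted(unknown)}")
    return {SLOT_TO_TERM[slot]: value for slot, value in entry.items()}


# Usage: optional slots may simply be absent from the entry.
example = {
    "enrichment_date": "2025-11-10T14:35:00Z",
    "enrichment_type": "GEOCODING",
    "verified": False,
}
mapped = to_ontology_keys(example)
```

Keeping the table in one place means a report or exporter can translate slot names without hard-coding the prefixes in several modules.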
### New Enumerations
#### `EnrichmentTypeEnum` (schemas/enums.yaml)
15 controlled vocabulary values for enrichment activity types:
1. `WIKIDATA_IDENTIFIER` - Wikidata Q-number added
2. `GEOCODING` - Lat/lon coordinates added
3. `VIAF_IDENTIFIER` - VIAF identifier added
4. `ISIL_CODE` - ISIL code assigned
5. `GHCID_GENERATION` - GHCID identifier generated
6. `FALSE_POSITIVE_REMOVAL` - Incorrect enrichment removed
7. `NAME_NORMALIZATION` - Institution name normalized
8. `IDENTIFIER_VERIFICATION` - Existing identifier verified
9. `INSTITUTION_TYPE_CLASSIFICATION` - Institution type classified
10. `ADDRESS_STANDARDIZATION` - Physical address standardized
11. `WEBSITE_URL_VALIDATION` - Website URL validated
12. `COLLECTION_METADATA` - Collection metadata added
13. `ORGANIZATIONAL_RELATIONSHIP` - Org relationships identified
14. `DIGITAL_PLATFORM_DETECTION` - Digital platforms identified
15. `OTHER` - Other enrichment activity
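
For code that builds entries by hand, the controlled vocabulary can be mirrored in Python so that misspelled values fail fast. This is an illustrative sketch only: the `EnrichmentType` class and `check_enrichment_type` helper are hypothetical and are not generated from `schemas/enums.yaml`.

```python
from enum import Enum


class EnrichmentType(str, Enum):
    """Hand-written mirror of EnrichmentTypeEnum (hypothetical, for illustration)."""

    WIKIDATA_IDENTIFIER = "WIKIDATA_IDENTIFIER"
    GEOCODING = "GEOCODING"
    VIAF_IDENTIFIER = "VIAF_IDENTIFIER"
    ISIL_CODE = "ISIL_CODE"
    GHCID_GENERATION = "GHCID_GENERATION"
    FALSE_POSITIVE_REMOVAL = "FALSE_POSITIVE_REMOVAL"
    NAME_NORMALIZATION = "NAME_NORMALIZATION"
    IDENTIFIER_VERIFICATION = "IDENTIFIER_VERIFICATION"
    INSTITUTION_TYPE_CLASSIFICATION = "INSTITUTION_TYPE_CLASSIFICATION"
    ADDRESS_STANDARDIZATION = "ADDRESS_STANDARDIZATION"
    WEBSITE_URL_VALIDATION = "WEBSITE_URL_VALIDATION"
    COLLECTION_METADATA = "COLLECTION_METADATA"
    ORGANIZATIONAL_RELATIONSHIP = "ORGANIZATIONAL_RELATIONSHIP"
    DIGITAL_PLATFORM_DETECTION = "DIGITAL_PLATFORM_DETECTION"
    OTHER = "OTHER"


def check_enrichment_type(raw: str) -> EnrichmentType:
    """Return the matching member or raise with the list of allowed values."""
    try:
        return EnrichmentType(raw)
    except ValueError:
        allowed = ", ".join(member.value for member in EnrichmentType)
        raise ValueError(f"unknown enrichment_type {raw!r}; allowed: {allowed}") from None
```

Raising on unknown values, rather than silently mapping them to `OTHER`, is a design choice made for this sketch; a pipeline that prefers leniency could fall back to `EnrichmentType.OTHER` instead.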
### Modified Classes
#### `Provenance` (schemas/provenance.yaml)
Added new slot:
```yaml
enrichment_history:
  range: EnrichmentHistoryEntry
  multivalued: true
  inlined_as_list: true
  description: >-
    Chronological log of data enrichment activities performed on this record
```
**Note**: The `provenance.notes` field remains for backward compatibility but is **deprecated**. Use `enrichment_history` for new data.
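
To illustrate how the multivalued, inlined list might be populated from Python, here is a minimal sketch. The dataclass and the `append_entry` helper are hypothetical, not classes generated from the schema, and the sort assumes every date is a timezone-aware UTC `datetime` so that the ISO strings compare chronologically.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EnrichmentHistoryEntry:
    """Hypothetical in-memory mirror of the schema class (four required slots)."""

    enrichment_date: datetime
    enrichment_method: str
    enrichment_type: str
    verified: bool
    match_score: Optional[float] = None
    enrichment_source: Optional[str] = None
    enrichment_notes: Optional[str] = None

    def __post_init__(self) -> None:
        # The schema documents match_score as a confidence between 0.0 and 1.0.
        if self.match_score is not None and not 0.0 <= self.match_score <= 1.0:
            raise ValueError("match_score must be within 0.0-1.0")


def append_entry(provenance: dict, entry: EnrichmentHistoryEntry) -> None:
    """Add an entry to provenance['enrichment_history'], kept in date order."""
    history = provenance.setdefault("enrichment_history", [])
    # Omit unset optional slots and store the date as an ISO 8601 string.
    record = {k: v for k, v in asdict(entry).items() if v is not None}
    record["enrichment_date"] = entry.enrichment_date.isoformat()
    history.append(record)
    history.sort(key=lambda e: e["enrichment_date"])


# Usage: a legacy record keeps its deprecated notes string untouched.
provenance = {"notes": "legacy free-text note"}
append_entry(provenance, EnrichmentHistoryEntry(
    enrichment_date=datetime(2025, 11, 10, 14, 35, tzinfo=timezone.utc),
    enrichment_method="Nominatim geocoding API",
    enrichment_type="GEOCODING",
    verified=False,
    match_score=0.95,
))
append_entry(provenance, EnrichmentHistoryEntry(
    enrichment_date=datetime(2025, 11, 10, 14, 30, tzinfo=timezone.utc),
    enrichment_method="Wikidata SPARQL fuzzy matching",
    enrichment_type="WIKIDATA_IDENTIFIER",
    verified=True,
    match_score=1.0,
))
```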
### New Ontology Prefixes
Added to support enrichment metadata:
- `foaf:` - Friend of a Friend (agent/contact information)
- `adms:` - Asset Description Metadata Schema (verification/confidence)
## Benefits
### Before (v0.2.1): Unstructured Notes
```yaml
provenance:
  notes: "Wikidata enriched 2025-11-10 (Q3330723, match: 100%). Geocoded to (36.806495, 10.181532) via Nominatim."
```
**Problems**:
- ❌ Hard to parse programmatically
- ❌ Not queryable (can't filter by type, date, confidence)
- ❌ No ontology alignment
- ❌ Mixed concerns (multiple activities in one string)
### After (v0.2.2): Structured History
```yaml
provenance:
  enrichment_history:
    - enrichment_date: "2025-11-10T14:30:00Z"
      enrichment_method: "Wikidata SPARQL fuzzy matching"
      enrichment_type: WIKIDATA_IDENTIFIER
      match_score: 1.0
      verified: true
      enrichment_source: "https://www.wikidata.org"
      enrichment_notes: "Perfect name match, city verified: Tunis"
    - enrichment_date: "2025-11-10T14:35:00Z"
      enrichment_method: "Nominatim geocoding API"
      enrichment_type: GEOCODING
      match_score: 0.95
      verified: false
      enrichment_source: "https://nominatim.openstreetmap.org"
      enrichment_notes: "Geocoded from city name"
```
**Benefits**:
- ✅ Machine-readable structured data
- ✅ Queryable (filter by type, confidence, verification status)
- ✅ Ontology-aligned (PROV-O, ADMS, DCTerms, FOAF)
- ✅ Separation of concerns (one entry per activity)
- ✅ Chronological audit log
## Query Examples
```python
from collections import Counter

# `institution` is one loaded record (a dict) with a provenance block
history = institution['provenance']['enrichment_history']

# Find all unverified enrichments needing manual review
unverified = [e for e in history if not e['verified']]

# Find low-confidence enrichments (< 0.85). match_score is optional, so
# skip entries without one, but keep a legitimate score of 0.0
low_confidence = [
    e for e in history
    if e.get('match_score') is not None and e['match_score'] < 0.85
]

# Count enrichments by type
type_counts = Counter(e['enrichment_type'] for e in history)

# Timeline of enrichment activities
timeline = sorted(history, key=lambda e: e['enrichment_date'])
```
## Migration Requirements
Existing instances with `provenance.notes` strings need migration:
1. **Parse notes patterns**:
   - `"Wikidata enriched YYYY-MM-DD (Qnumber, match: XX%)"`
   - `"Geocoded to (lat, lon) via Service"`
   - `"False Wikidata match Qnumber removed YYYY-MM-DD"`
2. **Extract structured data**:
   - Date, method, type, match score
   - Convert to `EnrichmentHistoryEntry` objects
3. **Migration script**: `scripts/migrate_enrichment_notes_to_history.py` (TO BE CREATED)
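
The parsing step could look roughly like the sketch below. It is not the planned migration script: the regular expressions are derived only from the three patterns listed above, and `parse_notes` and its `fallback_date` parameter are hypothetical. Because the geocoding pattern carries no date, the sketch takes one from the caller; it also assumes that migrated entries start as unverified and that the unknown time of day is recorded as midnight UTC.

```python
import re

# Patterns derived from the note formats listed above; real notes may vary.
WIKIDATA_RE = re.compile(
    r"Wikidata enriched (?P<date>\d{4}-\d{2}-\d{2}) "
    r"\((?P<qid>Q\d+), match: (?P<pct>\d+(?:\.\d+)?)%\)"
)
GEOCODE_RE = re.compile(
    r"Geocoded to \((?P<lat>-?\d+(?:\.\d+)?), (?P<lon>-?\d+(?:\.\d+)?)\) "
    r"via (?P<service>\w+)"
)
FALSE_POSITIVE_RE = re.compile(
    r"False Wikidata match (?P<qid>Q\d+) removed (?P<date>\d{4}-\d{2}-\d{2})"
)


def parse_notes(notes: str, fallback_date: str) -> list[dict]:
    """Turn a legacy notes string into enrichment_history entries.

    fallback_date (YYYY-MM-DD) is used where the note itself has no date.
    """
    entries = []
    for m in WIKIDATA_RE.finditer(notes):
        entries.append({
            "enrichment_date": f"{m['date']}T00:00:00Z",
            "enrichment_method": "Wikidata enrichment (migrated from notes)",
            "enrichment_type": "WIKIDATA_IDENTIFIER",
            "match_score": float(m["pct"]) / 100.0,  # percent -> 0.0-1.0
            "verified": False,  # assumption: migrated data needs review
            "enrichment_notes": f"Matched {m['qid']}",
        })
    for m in GEOCODE_RE.finditer(notes):
        entries.append({
            "enrichment_date": f"{fallback_date}T00:00:00Z",
            "enrichment_method": f"{m['service']} geocoding (migrated from notes)",
            "enrichment_type": "GEOCODING",
            "verified": False,
            "enrichment_notes": f"Geocoded to ({m['lat']}, {m['lon']})",
        })
    for m in FALSE_POSITIVE_RE.finditer(notes):
        entries.append({
            "enrichment_date": f"{m['date']}T00:00:00Z",
            "enrichment_method": "False positive removal (migrated from notes)",
            "enrichment_type": "FALSE_POSITIVE_REMOVAL",
            "verified": False,
            "enrichment_notes": f"Removed false match {m['qid']}",
        })
    return entries


# Usage with the v0.2.1 example string from the "Before" section.
legacy = ("Wikidata enriched 2025-11-10 (Q3330723, match: 100%). "
          "Geocoded to (36.806495, 10.181532) via Nominatim.")
migrated = parse_notes(legacy, fallback_date="2025-11-10")
```

Notes that match none of the patterns yield an empty list, so a real migration would likely log them for manual handling rather than discard them.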
## Files Modified
1. **`schemas/provenance.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Added `EnrichmentHistoryEntry` class (7 slots)
   - Added `enrichment_history` slot to `Provenance` class
   - Added ontology mappings (PROV-O, ADMS, FOAF, DCTerms)
2. **`schemas/enums.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Added `EnrichmentTypeEnum` (15 values)
3. **`schemas/heritage_custodian.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Version bump to match module versions
## Backward Compatibility
- **Fully backward compatible**
- `provenance.notes` field remains available (deprecated)
- Existing instances continue to work without changes
- New instances should use `enrichment_history`
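
Consumers that must handle both old and new instances can prefer the structured history and fall back to the deprecated string. A minimal sketch follows; `describe_enrichment` is a hypothetical helper, not part of the schema tooling.

```python
def describe_enrichment(provenance: dict) -> list[str]:
    """Summarize enrichment activity for v0.2.1 and v0.2.2 records alike."""
    history = provenance.get("enrichment_history")
    if history:
        # v0.2.2 record: one summary line per structured entry.
        return [f"{e['enrichment_date']} {e['enrichment_type']}" for e in history]
    # v0.2.1 record: only the deprecated free-text notes are available.
    notes = provenance.get("notes")
    return [notes] if notes else []


old_record = {"notes": "Geocoded to (36.806495, 10.181532) via Nominatim."}
new_record = {
    "enrichment_history": [
        {"enrichment_date": "2025-11-10T14:35:00Z", "enrichment_type": "GEOCODING"}
    ]
}
old_summary = describe_enrichment(old_record)
new_summary = describe_enrichment(new_record)
```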
## Testing
Demonstration script: `scripts/demo_enrichment_history.py`
```bash
python scripts/demo_enrichment_history.py
```
Shows before/after comparison, query examples, and ontology mappings.
## Next Steps
1. **Schema enhancement complete** (v0.2.2)
2. **Create migration script** for existing instances
3. **Test with Phase 3** (Chile enrichment workflow)
4. **Update data quality reports** to query `enrichment_history`
5. **Update RDF exporter** to serialize enrichment metadata with PROV-O/ADMS
## Related Documentation
- **Schema Modules**: `/docs/SCHEMA_MODULES.md`
- **Ontology Extensions**: `/docs/ONTOLOGY_EXTENSIONS.md` (to be updated)
- **Phase 2 Completion Report**: `/data/instances/north_africa/PHASE2_COMPLETION_REPORT.md`
- **Agent Instructions**: `/AGENTS.md` (to be updated)
## Contributors
- Schema design: OpenCode AI Agent
- Ontology alignment: Based on W3C PROV-O, ADMS, Dublin Core
- Testing: Demonstration script with query examples
---
**Schema Version**: v0.2.2
**Release**: 2025-11-10
**Status**: Production-ready