# Schema v0.2.2 Changelog

**Release Date**: 2025-11-10
**Previous Version**: v0.2.1

## Summary

Schema v0.2.2 introduces **structured enrichment history tracking** to replace unstructured `provenance.notes` strings. This enhancement provides machine-readable, queryable metadata for data quality activities (Wikidata enrichment, geocoding, identifier verification, etc.) with full ontology alignment.

## Changes

### New Classes

#### `EnrichmentHistoryEntry` (schemas/provenance.yaml)

Tracks individual data enrichment activities with structured metadata:

```yaml
EnrichmentHistoryEntry:
  slots:
    - enrichment_date      # When enrichment performed (datetime, required)
    - enrichment_method    # Method used (string, required)
    - enrichment_type      # Type of enrichment (EnrichmentTypeEnum, required)
    - match_score          # Fuzzy match confidence 0.0-1.0 (float, optional)
    - verified             # Manual verification status (boolean, required)
    - enrichment_source    # Data source URI (uri, optional)
    - enrichment_notes     # Human-readable details (string, optional)
```

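
For code that builds entries programmatically, the slot list above can be mirrored in a small Python class. This is only a sketch: the dataclass below is hand-written for illustration and is not generated from the schema, and the range check is an assumption based on the documented 0.0-1.0 bounds for `match_score`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EnrichmentHistoryEntry:
    """Illustrative mirror of the schema class (hypothetical, hand-written)."""

    enrichment_date: datetime                 # required
    enrichment_method: str                    # required
    enrichment_type: str                      # required; an EnrichmentTypeEnum value
    verified: bool                            # required
    match_score: Optional[float] = None       # optional, 0.0-1.0
    enrichment_source: Optional[str] = None   # optional URI
    enrichment_notes: Optional[str] = None    # optional free text

    def __post_init__(self):
        # Enforce the documented 0.0-1.0 range only when a score is given.
        if self.match_score is not None and not 0.0 <= self.match_score <= 1.0:
            raise ValueError(f"match_score out of range: {self.match_score}")


entry = EnrichmentHistoryEntry(
    enrichment_date=datetime(2025, 11, 10, 14, 30),
    enrichment_method="Wikidata SPARQL fuzzy matching",
    enrichment_type="WIKIDATA_IDENTIFIER",
    verified=True,
    match_score=1.0,
)
```
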
**Ontology Mappings**:
- `enrichment_date` → `prov:atTime` (PROV-O)
- `enrichment_method` → `prov:hadPlan` (PROV-O)
- `enrichment_type` → `rdf:type` (RDF)
- `match_score` → `adms:confidence` (ADMS)
- `verified` → `adms:status` (ADMS)
- `enrichment_source` → `dcterms:source` (Dublin Core)
- `enrichment_notes` → `dcterms:description` (Dublin Core)

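
To make the mapping concrete, the sketch below re-keys an enrichment entry by its ontology term. The dictionary is copied from the mapping list; the helper name `to_ontology_terms` is hypothetical, and this is not the RDF exporter, which is still listed as future work.

```python
# Slot -> ontology term, copied from the mapping list in this changelog.
SLOT_TO_TERM = {
    "enrichment_date": "prov:atTime",
    "enrichment_method": "prov:hadPlan",
    "enrichment_type": "rdf:type",
    "match_score": "adms:confidence",
    "verified": "adms:status",
    "enrichment_source": "dcterms:source",
    "enrichment_notes": "dcterms:description",
}


def to_ontology_terms(entry: dict) -> dict:
    """Re-key an entry by ontology term, dropping unmapped or unset slots."""
    return {
        SLOT_TO_TERM[slot]: value
        for slot, value in entry.items()
        if slot in SLOT_TO_TERM and value is not None
    }


example = {"enrichment_type": "GEOCODING", "match_score": 0.95, "verified": False}
print(to_ontology_terms(example))
# → {'rdf:type': 'GEOCODING', 'adms:confidence': 0.95, 'adms:status': False}
```
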
### New Enumerations

#### `EnrichmentTypeEnum` (schemas/enums.yaml)

15 controlled vocabulary values for enrichment activity types:

1. `WIKIDATA_IDENTIFIER` - Wikidata Q-number added
2. `GEOCODING` - Lat/lon coordinates added
3. `VIAF_IDENTIFIER` - VIAF identifier added
4. `ISIL_CODE` - ISIL code assigned
5. `GHCID_GENERATION` - GHCID identifier generated
6. `FALSE_POSITIVE_REMOVAL` - Incorrect enrichment removed
7. `NAME_NORMALIZATION` - Institution name normalized
8. `IDENTIFIER_VERIFICATION` - Existing identifier verified
9. `INSTITUTION_TYPE_CLASSIFICATION` - Institution type classified
10. `ADDRESS_STANDARDIZATION` - Physical address standardized
11. `WEBSITE_URL_VALIDATION` - Website URL validated
12. `COLLECTION_METADATA` - Collection metadata added
13. `ORGANIZATIONAL_RELATIONSHIP` - Org relationships identified
14. `DIGITAL_PLATFORM_DETECTION` - Digital platforms identified
15. `OTHER` - Other enrichment activity

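
A controlled vocabulary is most useful when unknown values are rejected early. The following sketch builds a Python `Enum` from the 15 values listed above and checks candidate strings against it; the names `EnrichmentType` and `is_valid_type` are illustrative and not part of the schema tooling.

```python
from enum import Enum

# The 15 values from the changelog, in the same order.
_VALUES = [
    "WIKIDATA_IDENTIFIER", "GEOCODING", "VIAF_IDENTIFIER", "ISIL_CODE",
    "GHCID_GENERATION", "FALSE_POSITIVE_REMOVAL", "NAME_NORMALIZATION",
    "IDENTIFIER_VERIFICATION", "INSTITUTION_TYPE_CLASSIFICATION",
    "ADDRESS_STANDARDIZATION", "WEBSITE_URL_VALIDATION", "COLLECTION_METADATA",
    "ORGANIZATIONAL_RELATIONSHIP", "DIGITAL_PLATFORM_DETECTION", "OTHER",
]

# Functional Enum API: each member's value equals its name.
EnrichmentType = Enum("EnrichmentType", {v: v for v in _VALUES})


def is_valid_type(value: str) -> bool:
    """Return True if value is one of the controlled vocabulary terms."""
    try:
        EnrichmentType(value)  # lookup by value; raises ValueError if unknown
        return True
    except ValueError:
        return False


print(is_valid_type("GEOCODING"))   # → True
print(is_valid_type("geocoding"))   # → False (values are case-sensitive)
```
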
### Modified Classes

#### `Provenance` (schemas/provenance.yaml)

Added new slot:

```yaml
enrichment_history:
  range: EnrichmentHistoryEntry
  multivalued: true
  inlined_as_list: true
  description: >-
    Chronological log of data enrichment activities performed on this record
```

**Note**: `provenance.notes` field remains for backward compatibility but is **deprecated**. Use `enrichment_history` for new data.

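
Consumers that must handle both old and new records could read the new slot first and flag records that still rely on the deprecated field. The helper below is a hypothetical sketch (the function name and the warning text are assumptions), operating on a provenance block loaded as a plain dict.

```python
import warnings


def get_enrichment_history(provenance: dict) -> list:
    """Return the structured history; warn when only deprecated notes exist."""
    history = provenance.get("enrichment_history") or []
    if not history and provenance.get("notes"):
        warnings.warn(
            "provenance.notes is deprecated; migrate to enrichment_history",
            DeprecationWarning,
            stacklevel=2,
        )
    return history


# New-style record: the list is returned unchanged and no warning is raised.
new_style = {"enrichment_history": [{"enrichment_type": "GEOCODING", "verified": False}]}
print(len(get_enrichment_history(new_style)))  # → 1
```
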
### New Ontology Prefixes

Added to support enrichment metadata:
- `foaf:` - Friend of a Friend (agent/contact information)
- `adms:` - Asset Description Metadata Schema (verification/confidence)

## Benefits

### Before (v0.2.1): Unstructured Notes

```yaml
provenance:
  notes: "Wikidata enriched 2025-11-10 (Q3330723, match: 100%). Geocoded to (36.806495, 10.181532) via Nominatim."
```

**Problems**:
- ❌ Hard to parse programmatically
- ❌ Not queryable (can't filter by type, date, confidence)
- ❌ No ontology alignment
- ❌ Mixed concerns (multiple activities in one string)

### After (v0.2.2): Structured History

```yaml
provenance:
  enrichment_history:
    - enrichment_date: "2025-11-10T14:30:00Z"
      enrichment_method: "Wikidata SPARQL fuzzy matching"
      enrichment_type: WIKIDATA_IDENTIFIER
      match_score: 1.0
      verified: true
      enrichment_source: "https://www.wikidata.org"
      enrichment_notes: "Perfect name match, city verified: Tunis"

    - enrichment_date: "2025-11-10T14:35:00Z"
      enrichment_method: "Nominatim geocoding API"
      enrichment_type: GEOCODING
      match_score: 0.95
      verified: false
      enrichment_source: "https://nominatim.openstreetmap.org"
      enrichment_notes: "Geocoded from city name"
```

**Benefits**:
- ✅ Machine-readable structured data
- ✅ Queryable (filter by type, confidence, verification status)
- ✅ Ontology-aligned (PROV-O, ADMS, DCTerms, FOAF)
- ✅ Separation of concerns (one entry per activity)
- ✅ Chronological audit log

## Query Examples

```python
from collections import Counter

# `institution` is one record loaded as a dict (e.g. parsed from YAML).

# Find all unverified enrichments needing manual review
unverified = [
    e for e in institution['provenance']['enrichment_history']
    if not e['verified']
]

# Find low-confidence enrichments (< 0.85).
# match_score is optional, so entries without one are skipped; the explicit
# None check keeps a score of 0.0 from being dropped as falsy.
low_confidence = [
    e for e in institution['provenance']['enrichment_history']
    if e.get('match_score') is not None and e['match_score'] < 0.85
]

# Count enrichments by type
type_counts = Counter(
    e['enrichment_type']
    for e in institution['provenance']['enrichment_history']
)

# Timeline of enrichment activities (string sort is chronological as long as
# every timestamp uses the same ISO 8601 UTC format)
timeline = sorted(
    institution['provenance']['enrichment_history'],
    key=lambda e: e['enrichment_date']
)
```

## Migration Requirements

Existing instances that record enrichment in `provenance.notes` strings remain valid, but should be migrated to `enrichment_history`:

1. **Parse notes patterns**:
   - `"Wikidata enriched YYYY-MM-DD (Qnumber, match: XX%)"`
   - `"Geocoded to (lat, lon) via Service"`
   - `"False Wikidata match Qnumber removed YYYY-MM-DD"`

2. **Extract structured data**:
   - Date, method, type, match score
   - Convert to `EnrichmentHistoryEntry` objects

3. **Migration script**: `scripts/migrate_enrichment_notes_to_history.py` (TO BE CREATED)

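
As a starting point for that script, the parsing step could look roughly like the sketch below. It is not the migration script itself: the regular expressions are guesses derived from the three note patterns listed above, `parse_notes` is a hypothetical name, the method strings are placeholders, and migrated entries are marked `verified: False` on the assumption that free-text notes carry no verification status. Geocoding notes contain no date, so `enrichment_date` is left as `None` there even though the slot is required; a real migration would have to supply one.

```python
import re

# Patterns inferred from the documented note formats (assumptions).
WIKIDATA_RE = re.compile(
    r"Wikidata enriched (?P<date>\d{4}-\d{2}-\d{2}) "
    r"\((?P<qid>Q\d+), match: (?P<pct>\d+(?:\.\d+)?)%\)"
)
GEOCODE_RE = re.compile(
    r"Geocoded to \((?P<lat>-?\d+(?:\.\d+)?), (?P<lon>-?\d+(?:\.\d+)?)\) "
    r"via (?P<service>\w+)"
)
FALSE_MATCH_RE = re.compile(
    r"False Wikidata match (?P<qid>Q\d+) removed (?P<date>\d{4}-\d{2}-\d{2})"
)


def parse_notes(notes: str) -> list:
    """Convert a legacy notes string into enrichment-history entry dicts."""
    entries = []
    for m in WIKIDATA_RE.finditer(notes):
        entries.append({
            "enrichment_date": m["date"],  # date only; notes have no time part
            "enrichment_method": "Wikidata enrichment (migrated from notes)",
            "enrichment_type": "WIKIDATA_IDENTIFIER",
            "match_score": float(m["pct"]) / 100,  # percent -> 0.0-1.0
            "verified": False,
            "enrichment_notes": f"Matched {m['qid']}",
        })
    for m in GEOCODE_RE.finditer(notes):
        entries.append({
            "enrichment_date": None,  # not present in the legacy pattern
            "enrichment_method": f"{m['service']} geocoding",
            "enrichment_type": "GEOCODING",
            "verified": False,
            "enrichment_notes": f"Geocoded to ({m['lat']}, {m['lon']})",
        })
    for m in FALSE_MATCH_RE.finditer(notes):
        entries.append({
            "enrichment_date": m["date"],
            "enrichment_method": "False positive removal (migrated from notes)",
            "enrichment_type": "FALSE_POSITIVE_REMOVAL",
            "verified": False,
            "enrichment_notes": f"Removed false match {m['qid']}",
        })
    return entries


legacy = (
    "Wikidata enriched 2025-11-10 (Q3330723, match: 100%). "
    "Geocoded to (36.806495, 10.181532) via Nominatim."
)
for entry in parse_notes(legacy):
    print(entry["enrichment_type"], entry["enrichment_date"])
```

Notes that match none of the patterns produce no entries, so a real script would likely want to log or preserve them rather than discard them silently.
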
## Files Modified

1. **`schemas/provenance.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Added `EnrichmentHistoryEntry` class (7 slots)
   - Added `enrichment_history` slot to `Provenance` class
   - Added ontology mappings (PROV-O, ADMS, FOAF, DCTerms)

2. **`schemas/enums.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Added `EnrichmentTypeEnum` (15 values)

3. **`schemas/heritage_custodian.yaml`**
   - Version: 0.2.1 → 0.2.2
   - Version bump to match module versions

## Backward Compatibility

- ✅ **Fully backward compatible**
- `provenance.notes` field remains available (deprecated)
- Existing instances continue to work without changes
- New instances should use `enrichment_history`

## Testing

Demonstration script: `scripts/demo_enrichment_history.py`

```bash
python scripts/demo_enrichment_history.py
```

Shows before/after comparison, query examples, and ontology mappings.

## Next Steps

1. ✅ **Schema enhancement complete** (v0.2.2)
2. ⏳ **Create migration script** for existing instances
3. ⏳ **Test with Phase 3** (Chile enrichment workflow)
4. ⏳ **Update data quality reports** to query `enrichment_history`
5. ⏳ **Update RDF exporter** to serialize enrichment metadata with PROV-O/ADMS

## Related Documentation

- **Schema Modules**: `/docs/SCHEMA_MODULES.md`
- **Ontology Extensions**: `/docs/ONTOLOGY_EXTENSIONS.md` (to be updated)
- **Phase 2 Completion Report**: `/data/instances/north_africa/PHASE2_COMPLETION_REPORT.md`
- **Agent Instructions**: `/AGENTS.md` (to be updated)

## Contributors

- Schema design: OpenCode AI Agent
- Ontology alignment: Based on W3C PROV-O, ADMS, Dublin Core
- Testing: Demonstration script with query examples

---

**Schema Version**: v0.2.2
**Release**: 2025-11-10
**Status**: Production-ready