# Libya Wikidata Enrichment - Cleanup Summary

**Date**: 2025-11-11
**File**: `data/instances/libya/libyan_institutions.yaml`

## Overview

All duplicate Wikidata Q-number assignments in the Libyan heritage institutions dataset have been resolved by validating each match against geographic location and institution type.

## Data Quality Improvements

### Before Cleanup
- Total institutions: 52
- With Wikidata IDs: 39 (75.0%)
- **Duplicate Q-numbers: 3 instances** ❌

### After Cleanup
- Total institutions: 52
- With Wikidata IDs: 37 (71.2%)
- **Duplicate Q-numbers: 0** ✅
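
The figures above can be recomputed from the dataset with a small audit. The sketch below is illustrative only: it assumes each record has been loaded into a dict with hypothetical `name` and `wikidata_id` keys, which may not match the actual YAML schema.

```python
from collections import defaultdict


def audit_wikidata_ids(institutions):
    """Summarise Wikidata coverage and list Q-numbers shared by several records.

    `institutions` is a list of dicts with hypothetical keys `name` and
    `wikidata_id` (None or missing when no Q-number is assigned).
    """
    owners = defaultdict(list)
    for inst in institutions:
        qid = inst.get("wikidata_id")
        if qid:
            owners[qid].append(inst["name"])

    total = len(institutions)
    with_ids = sum(len(names) for names in owners.values())
    return {
        "total": total,
        "with_ids": with_ids,
        "coverage_pct": round(100 * with_ids / total, 1) if total else 0.0,
        # Only Q-numbers claimed by more than one institution are reported.
        "duplicates": {q: n for q, n in owners.items() if len(n) > 1},
    }


# Tiny sample built from names in this report (not the real dataset).
sample = [
    {"name": "Misrata War Museum", "wikidata_id": "Q80795728"},
    {"name": "Nalut Qasr Museum", "wikidata_id": "Q80795728"},
    {"name": "Qasr al-Haj", "wikidata_id": "Q2499392"},
    {"name": "Ghat Fortress", "wikidata_id": None},
]
report = audit_wikidata_ids(sample)
```

An empty `duplicates` mapping corresponds to the "Duplicate Q-numbers: 0" line in the After Cleanup list.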

## Issues Resolved

### 1. Q80795728 Triple Assignment (CRITICAL)

**Wikidata Entity**: Q80795728 = "Barce Museum" (متحف برقة) in Al Bayda/Al Marj region

**Problem**: Incorrectly assigned to THREE institutions in wrong geographic locations:

1. **Misrata War Museum** (Line 681)
   - ❌ Location: Misrata (~200 km from Al Bayda)
   - ❌ Type: Civil war memorial (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

2. **Nalut Qasr Museum** (Lines 2986-3034)
   - ❌ Location: Nalut (~600 km from Al Bayda)
   - ❌ Type: Berber granary museum (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

3. **Tobruk War Cemetery and Museum** (Lines 3081-3125)
   - ❌ Location: Tobruk (~300 km from Al Bayda)
   - ❌ Type: WWII war memorial (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

**Root Cause**: Fuzzy string matching scored candidates on generic Arabic label words such as قصر ("castle") and متحف ("museum") without sufficient geographic validation.

### 2. Q2641357 Duplicate Assignment (CORRECTED)

**Wikidata Entity**: Q2641357 = "Qasr Banat" (قصر بنات) in Misrata (coordinates: 31.461833°N, 14.704528°E)

**Problem**: Incorrectly assigned to TWO different qasrs:

1. **Qasr al-Haj** (Lines 1718-1802)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - ✅ **CORRECTED**: Changed to **Q2499392** (Gasr Al-Hajj, coordinates: 32.044561°N, 12.164781°E)
   - Location: Jabal al Gharbi (Western Mountains)
   - Type: 12th-century fortified granary

2. **Qasr Nalut** (Lines 1803-1894)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - ✅ **CORRECTED**: Changed to **Q3818705** (Castle Nalut, coordinates: 31.868°N, 10.9855°E)
   - Location: Nalut (Nafusa Mountains)
   - Type: 8th-century BC fortified granary with 300+ storage chambers

**Root Cause**: Fuzzy matching on "qasr/قصر" (Arabic for "castle/fortress") did not disambiguate between the multiple qasrs in Wikidata.
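
A coordinate check is one way to tell such qasrs apart. The following is a minimal sketch, not the project's actual validation code: it computes great-circle distance with the haversine formula using the coordinates quoted above, and the 50 km cut-off and function names are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_plausible_match(record_coords, candidate_coords, max_km=50.0):
    """Accept a candidate only if it lies within `max_km` of the record.

    `max_km` is a hypothetical threshold, not a value taken from the script.
    """
    return haversine_km(*record_coords, *candidate_coords) <= max_km


# Coordinates as quoted in this report (latitude, longitude).
QASR_BANAT = (31.461833, 14.704528)    # Q2641357
GASR_AL_HAJJ = (32.044561, 12.164781)  # Q2499392
CASTLE_NALUT = (31.868, 10.9855)       # Q3818705

d_haj = haversine_km(*QASR_BANAT, *GASR_AL_HAJJ)
d_nalut = haversine_km(*QASR_BANAT, *CASTLE_NALUT)
```

With these coordinates, Qasr Banat lies more than 200 km from either corrected entity, so a distance threshold of this order would have rejected both original assignments.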

## Enrichment Notes Documentation

All corrected records now include detailed enrichment notes explaining:
- Original incorrect assignment
- Geographic mismatch distances
- Institution type mismatches
- Corrected Q-number with coordinates

### Example Enrichment Note Format

```yaml
enrichment_notes: 'CORRECTED 2025-11-11: Changed from Q2641357 (Qasr Banat in
  Misrata) to Q2499392 (Gasr Al-Hajj). Original enrichment incorrectly matched
  to wrong qasr. Coordinates: 32.044561°N, 12.164781°E.'
```
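
If notes of this shape are generated rather than typed by hand, a small helper keeps the wording consistent. This is a sketch under assumptions: the function name and parameters are hypothetical, and it only covers the "CORRECTED" case for sites in the northern and eastern hemispheres.

```python
from datetime import date


def correction_note(old_qid, old_label, new_qid, new_label, lat, lon, when, reason):
    """Build a CORRECTED enrichment note in the format shown above.

    Assumes positive latitude and longitude (°N / °E), which holds for Libya.
    """
    return (
        f"CORRECTED {when.isoformat()}: Changed from {old_qid} ({old_label}) "
        f"to {new_qid} ({new_label}). {reason} "
        f"Coordinates: {lat}°N, {lon}°E."
    )


note = correction_note(
    old_qid="Q2641357",
    old_label="Qasr Banat in Misrata",
    new_qid="Q2499392",
    new_label="Gasr Al-Hajj",
    lat=32.044561,
    lon=12.164781,
    when=date(2025, 11, 11),
    reason="Original enrichment incorrectly matched to wrong qasr.",
)
```

The resulting string can be assigned to the record's `enrichment_notes` field before the YAML file is written back.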

## Remaining Work

### 15 Institutions Without Wikidata IDs

1. **Misrata War Museum** - Removed incorrect Q80795728, no alternative found
2. **Berenice Archaeological Site** - Ancient Greek/Roman site, may not be in Wikidata
3. **Mirad Masoud Cave** - Rock art site, may not be in Wikidata
4. **BILNAS Archive** - UK-based (London), not a Libyan institution
5. **Temehu - Libya's First Online Museum** - Digital-only platform
6. **Ghadames Manuscript Collections** - Distributed collections, not a single institution
7. **Nafusa Mountain Libraries** - Distributed collections across the region
8. **Ghat Fortress** - Historic site, may not be in Wikidata
9. **Libyan Center for Archives and Historical Studies** - May not be in Wikidata
10. **Nalut University** - Academic institution, may have a Q-number
11. **British Institute for Libyan and Northern African Studies Digital Archive** - UK-based (London)
12. **Endangered Archaeology in the Middle East and North Africa Database** - UK-based (Oxford)
13. **Ministry of Culture and Knowledge Development Libya** - Government agency
14. **Nalut Qasr Museum** - Removed incorrect Q80795728, no alternative found
15. **Tobruk War Cemetery and Museum** - Removed incorrect Q80795728, may exist in Wikidata

### Potential Wikidata Investigations

- **Nalut University**: Search Wikidata for Libyan universities
- **Ministry of Culture**: Search for Libyan government ministries
- **War museums/memorials**: Search for WWII North Africa heritage sites
- **UK-based institutions**: Already documented, not Libyan entities
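
Searches like these can be run against the Wikidata Query Service. The sketch below only builds the SPARQL text and sends nothing over the network. The function name is hypothetical, and the identifiers it relies on (P31 instance of, P279 subclass of, P17 country, P625 coordinate location, Q1016 Libya, Q3918 university) should be confirmed on Wikidata before the query is relied on.

```python
import re

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

LIBYA_QID = "Q1016"       # assumed item for Libya; verify before use
UNIVERSITY_QID = "Q3918"  # assumed item for "university"; verify before use


def build_class_in_libya_query(class_qid, limit=50):
    """Return SPARQL listing items of a class (or its subclasses) located in Libya."""
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return f"""
SELECT ?item ?itemLabel ?coord WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        wdt:P17 wd:{LIBYA_QID} .
  OPTIONAL {{ ?item wdt:P625 ?coord . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,ar" . }}
}}
LIMIT {int(limit)}
""".strip()


university_query = build_class_in_libya_query(UNIVERSITY_QID)
```

The query text can be pasted into the query service web interface or sent to `WDQS_ENDPOINT` with any HTTP client. Returned coordinates can then be compared with the dataset before a Q-number is accepted.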

## Script Improvements Implemented

The enrichment script (`scripts/enrich_libya_wikidata_fuzzy.py`) now includes:

1. **Geographic Validation** (added in previous session):
   - City name fuzzy matching with 50% minimum threshold
   - Rejects candidates with low geographic match scores

2. **Match Score Threshold**:
   - Raised from 80% to 85% minimum for fuzzy name matching

3. **Duplicate Detection**:
   - Prevents assigning Q-numbers already used by other institutions
   - Logs warnings when duplicates are blocked
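
The three safeguards can be pictured as a single acceptance gate. This is a minimal sketch, not the code in `scripts/enrich_libya_wikidata_fuzzy.py`: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matcher the script uses, and the record keys (`name`, `city`, `label`, `qid`) are hypothetical.

```python
import logging
from difflib import SequenceMatcher

log = logging.getLogger(__name__)

NAME_THRESHOLD = 85.0  # minimum fuzzy name score (raised from 80)
CITY_THRESHOLD = 50.0  # minimum fuzzy city score


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def accept_candidate(inst, cand, used_qids):
    """Return (accepted, reason) for a Wikidata candidate.

    `inst` has keys `name` and `city`; `cand` has `qid`, `label` and `city`.
    `used_qids` is the set of Q-numbers already assigned in the dataset.
    """
    name_score = similarity(inst["name"], cand["label"])
    if name_score < NAME_THRESHOLD:
        return False, f"name score {name_score:.0f} below {NAME_THRESHOLD:.0f}"

    city_score = similarity(inst["city"], cand["city"])
    if city_score < CITY_THRESHOLD:
        return False, f"city score {city_score:.0f} below {CITY_THRESHOLD:.0f}"

    if cand["qid"] in used_qids:
        log.warning("Blocked duplicate %s for %s", cand["qid"], inst["name"])
        return False, f"duplicate {cand['qid']}"

    return True, "accepted"
```

A caller would add each accepted `qid` to `used_qids` so that later institutions cannot claim the same entity.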

## Validation Checklist

- [x] No duplicate Q-numbers across all 52 institutions
- [x] All removed Q-numbers documented with explanatory enrichment notes
- [x] Corrected Q-numbers verified against Wikidata coordinates
- [x] Geographic mismatches resolved (200-600 km discrepancies fixed)
- [x] Institution type mismatches resolved (war memorials vs. ancient museums)
- [x] Enrichment history preserved (original timestamps retained)

## Next Steps

1. **Manual Wikidata Search**: Investigate the remaining 15 institutions without Q-numbers
2. **Wikidata Entity Creation**: Consider creating new Wikidata entries for significant institutions not yet represented
3. **Re-run Enrichment**: After Wikidata updates, re-run the enrichment script to capture new matches
4. **Documentation**: Update project documentation with lessons learned from geographic validation

## Files Modified

- `data/instances/libya/libyan_institutions.yaml` (5 institutions corrected)

## References

- **Wikidata Queries**: https://query.wikidata.org/
- **Q2499392**: Gasr Al-Hajj (Qasr al-Haj)
- **Q3818705**: Castle Nalut (Qasr Nalut)
- **Q2641357**: Qasr Banat (Misrata) - NOT the institutions we have
- **Q80795728**: Barce Museum (Al Bayda) - NOT the war museums

---

**Cleanup Completed**: 2025-11-11
**Status**: ✅ All duplicates resolved, data quality improved