glam/LIBYA_WIKIDATA_CLEANUP_SUMMARY.md
2025-11-19 23:25:22 +01:00


# Libya Wikidata Enrichment - Cleanup Summary
**Date**: 2025-11-11
**File**: `data/instances/libya/libyan_institutions.yaml`
## Overview
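Resolved all duplicate Wikidata Q-number assignments in the Libyan heritage institutions dataset by validating each match against geographic location and institution type. Two Q-numbers had been shared across five institutions: three assignments were removed and two were corrected.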
## Data Quality Improvements
### Before Cleanup
- Total institutions: 52
- With Wikidata IDs: 39 (75.0%)
- **Duplicate Q-numbers: 3 instances** ❌
### After Cleanup
- Total institutions: 52
- With Wikidata IDs: 37 (71.2%)
- **Duplicate Q-numbers: 0** ✅
## Issues Resolved
### 1. Q80795728 Triple Assignment (CRITICAL)
**Wikidata Entity**: Q80795728 = "Barce Museum" (متحف برقة) in Al Bayda/Al Marj region
**Problem**: Incorrectly assigned to THREE institutions in wrong geographic locations:
1. **Misrata War Museum** (Line 681)
   - ❌ Location: Misrata (~200 km from Al Bayda)
   - ❌ Type: Civil war memorial (not ancient artifacts)
   - **REMOVED**: Q80795728 identifier removed, enrichment notes updated
2. **Nalut Qasr Museum** (Lines 2986-3034)
   - ❌ Location: Nalut (~600 km from Al Bayda)
   - ❌ Type: Berber granary museum (not ancient artifacts)
   - **REMOVED**: Q80795728 identifier removed, enrichment notes updated
3. **Tobruk War Cemetery and Museum** (Lines 3081-3125)
   - ❌ Location: Tobruk (~300 km from Al Bayda)
   - ❌ Type: WWII war memorial (not ancient artifacts)
   - **REMOVED**: Q80795728 identifier removed, enrichment notes updated
**Root Cause**: Fuzzy string matching accepted candidates on the basis of shared generic Arabic terms (قصر "qasr", متحف "museum") without sufficient geographic validation.
### 2. Q2641357 Duplicate Assignment (CORRECTED)
**Wikidata Entity**: Q2641357 = "Qasr Banat" (قصر بنات) in Misrata (coordinates: 31.461833°N, 14.704528°E)
**Problem**: Incorrectly assigned to TWO different qasrs:
1. **Qasr al-Haj** (Lines 1718-1802)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - **CORRECTED**: Changed to **Q2499392** (Gasr Al-Hajj, coordinates: 32.044561°N, 12.164781°E)
   - Location: Jabal al Gharbi (Western Mountains)
   - Type: 12th-century fortified granary
2. **Qasr Nalut** (Lines 1803-1894)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - **CORRECTED**: Changed to **Q3818705** (Castle Nalut, coordinates: 31.868°N, 10.9855°E)
   - Location: Nalut (Nafusa Mountains)
   - Type: 8th-century BC fortified granary with 300+ storage chambers
**Root Cause**: Fuzzy matching on "qasr/قصر" (Arabic for "castle/fortress") without disambiguating between multiple qasrs in Wikidata.
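
The geographic mismatch behind both issues can be checked mechanically. The following is a minimal sketch, not the project's script: it computes great-circle distances with the haversine formula and picks the closest of the three qasr entities whose coordinates are quoted above. The function names, the 50 km cutoff, and the premise that each institution record carries its own coordinates are assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Coordinates as quoted in this summary.
CANDIDATES = {
    "Q2641357": (31.461833, 14.704528),  # Qasr Banat
    "Q2499392": (32.044561, 12.164781),  # Gasr Al-Hajj
    "Q3818705": (31.868, 10.9855),       # Castle Nalut
}

def nearest_candidate(lat, lon, candidates=CANDIDATES, max_km=50.0):
    """Return (qid, distance_km) for the closest candidate, or None if none
    lies within max_km (hypothetical cutoff)."""
    qid, coords = min(candidates.items(),
                      key=lambda kv: haversine_km(lat, lon, *kv[1]))
    dist = haversine_km(lat, lon, *coords)
    return (qid, dist) if dist <= max_km else None
```

Called with an institution's own coordinates, `nearest_candidate` returns the matching Q-number only when a candidate is plausibly the same site. A label match on "qasr" alone would no longer be enough to accept an entity several hundred kilometres away.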
## Enrichment Notes Documentation
All corrected records now include detailed enrichment notes explaining:
- Original incorrect assignment
- Geographic mismatch distances
- Institution type mismatches
- Corrected Q-number with coordinates
### Example Enrichment Note Format
```yaml
enrichment_notes: 'CORRECTED 2025-11-11: Changed from Q2641357 (Qasr Banat in
  Misrata) to Q2499392 (Gasr Al-Hajj). Original enrichment incorrectly matched
  to wrong qasr. Coordinates: 32.044561°N, 12.164781°E.'
```
## Remaining Work
### 15 Institutions Without Wikidata IDs
1. **Misrata War Museum** - Removed incorrect Q80795728, no alternative found
2. **Berenice Archaeological Site** - Ancient Greek/Roman site, may not be in Wikidata
3. **Mirad Masoud Cave** - Rock art site, may not be in Wikidata
4. **BILNAS Archive** - UK-based (London), not Libyan institution
5. **Temehu - Libya's First Online Museum** - Digital-only platform
6. **Ghadames Manuscript Collections** - Distributed collections, not single institution
7. **Nafusa Mountain Libraries** - Distributed collections across region
8. **Ghat Fortress** - Historic site, may not be in Wikidata
9. **Libyan Center for Archives and Historical Studies** - May not be in Wikidata
10. **Nalut University** - Academic institution, may have Q-number
11. **British Institute for Libyan and Northern African Studies Digital Archive** - UK-based (London)
12. **Endangered Archaeology in the Middle East and North Africa Database** - UK-based (Oxford)
13. **Ministry of Culture and Knowledge Development Libya** - Government agency
14. **Nalut Qasr Museum** - Removed incorrect Q80795728, no alternative found
15. **Tobruk War Cemetery and Museum** - Removed incorrect Q80795728, may exist in Wikidata
### Potential Wikidata Investigations
- **Nalut University**: Search Wikidata for Libyan universities
- **Ministry of Culture**: Search for Libyan government ministries
- **War museums/memorials**: Search for WWII North Africa heritage sites
- **UK-based institutions**: Already documented, not Libyan entities
## Script Improvements Implemented
The enrichment script (`scripts/enrich_libya_wikidata_fuzzy.py`) now includes:
1. **Geographic Validation** (added in previous session):
   - City name fuzzy matching with 50% minimum threshold
   - Rejects candidates with low geographic match scores
2. **Match Score Threshold**:
   - Raised from 80% to 85% minimum for fuzzy name matching
3. **Duplicate Detection**:
   - Prevents assigning Q-numbers already used by other institutions
   - Logs warnings when duplicates are blocked
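
One way the three safeguards could combine is sketched below. This is an illustration under stated assumptions, not the contents of `scripts/enrich_libya_wikidata_fuzzy.py`: the standard-library `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the script actually uses, and the function names and argument layout are hypothetical. Only the 85% and 50% thresholds come from the list above.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # minimum fuzzy name score (raised from 0.80)
CITY_THRESHOLD = 0.50  # minimum fuzzy city score

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (stand-in scorer)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def accept_candidate(inst_name, inst_city, cand_label, cand_city, cand_qid, used_qids):
    """Return (accepted, reason). Records cand_qid in used_qids on acceptance."""
    # Duplicate detection: never hand out a Q-number twice.
    if cand_qid in used_qids:
        return False, f"duplicate: {cand_qid} already assigned"
    # Name score threshold.
    if similarity(inst_name, cand_label) < NAME_THRESHOLD:
        return False, "name score below threshold"
    # Geographic validation on the city name.
    if similarity(inst_city, cand_city) < CITY_THRESHOLD:
        return False, "city score below threshold"
    used_qids.add(cand_qid)
    return True, "accepted"
```

The duplicate check runs first so that a blocked Q-number is reported as a duplicate even when its scores would otherwise pass. A caller would log the returned reason as the warning.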
## Validation Checklist
- [x] No duplicate Q-numbers across all 52 institutions
- [x] All removed Q-numbers documented with explanatory enrichment notes
- [x] Corrected Q-numbers verified against Wikidata coordinates
- [x] Geographic mismatches resolved (200-600 km discrepancies fixed)
- [x] Institution type mismatches resolved (war memorials vs. ancient museums)
- [x] Enrichment history preserved (original timestamps retained)
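
The first checklist item can be re-verified at any time with a short script. The sketch below assumes each institution is available as a dictionary with hypothetical `name` and `wikidata_id` keys. The real YAML schema is not shown in this summary, so the field access would need adapting.

```python
from collections import defaultdict

def find_duplicate_qids(institutions):
    """Map each Q-number used more than once to the institution names sharing it."""
    by_qid = defaultdict(list)
    for inst in institutions:
        qid = inst.get("wikidata_id")  # hypothetical field name
        if qid:  # skip institutions without a Wikidata ID
            by_qid[qid].append(inst["name"])
    return {qid: names for qid, names in by_qid.items() if len(names) > 1}

# Example records mirroring the pre-cleanup state of the two qasr entries.
sample = [
    {"name": "Qasr al-Haj", "wikidata_id": "Q2641357"},
    {"name": "Qasr Nalut", "wikidata_id": "Q2641357"},
    {"name": "Misrata War Museum", "wikidata_id": None},
]
duplicates = find_duplicate_qids(sample)
```

An empty result means no Q-number is shared. With the sample above, `duplicates` contains one entry listing both qasrs under Q2641357.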
## Next Steps
1. **Manual Wikidata Search**: Investigate remaining 15 institutions without Q-numbers
2. **Wikidata Entity Creation**: Consider creating new Wikidata entries for significant institutions not yet represented
3. **Re-run Enrichment**: After Wikidata updates, re-run enrichment script to capture new matches
4. **Documentation**: Update project documentation with lessons learned from geographic validation
## Files Modified
- `data/instances/libya/libyan_institutions.yaml` (5 institutions corrected)
## References
- **Wikidata Queries**: https://query.wikidata.org/
- **Q2499392**: Gasr Al-Hajj (Qasr al-Haj)
- **Q3818705**: Castle Nalut (Qasr Nalut)
- **Q2641357**: Qasr Banat (Misrata) - NOT the institutions we have
- **Q80795728**: Barce Museum (Al Bayda) - NOT the war museums
---
**Cleanup Completed**: 2025-11-11
**Status**: ✅ All duplicates resolved, data quality improved