# Libya Wikidata Enrichment - Cleanup Summary

**Date**: 2025-11-11

**File**: `data/instances/libya/libyan_institutions.yaml`

## Overview

All duplicate Wikidata Q-number assignments in the Libyan heritage institutions dataset were resolved through geographic and institution-type validation.

## Data Quality Improvements

### Before Cleanup

- Total institutions: 52
- With Wikidata IDs: 39 (75.0%)
- **Duplicate Q-numbers: 3 instances** ❌

### After Cleanup

- Total institutions: 52
- With Wikidata IDs: 37 (71.2%)
- **Duplicate Q-numbers: 0** ✅

## Issues Resolved

### 1. Q80795728 Triple Assignment (CRITICAL)

**Wikidata Entity**: Q80795728 = "Barce Museum" (متحف برقة) in Al Bayda/Al Marj region

**Problem**: Incorrectly assigned to THREE institutions in the wrong geographic locations:

1. **Misrata War Museum** (Line 681)
   - ❌ Location: Misrata (~200 km from Al Bayda)
   - ❌ Type: Civil war memorial (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

2. **Nalut Qasr Museum** (Lines 2986-3034)
   - ❌ Location: Nalut (~600 km from Al Bayda)
   - ❌ Type: Berber granary museum (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

3. **Tobruk War Cemetery and Museum** (Lines 3081-3125)
   - ❌ Location: Tobruk (~300 km from Al Bayda)
   - ❌ Type: WWII war memorial (not ancient artifacts)
   - ✅ **REMOVED**: Q80795728 identifier removed, enrichment notes updated

**Root Cause**: Fuzzy string matching accepted matches on Arabic labels (قصر، متحف) without sufficient geographic validation.

### 2. Q2641357 Duplicate Assignment (CORRECTED)

**Wikidata Entity**: Q2641357 = "Qasr Banat" (قصر بنات) in Misrata (coordinates: 31.461833°N, 14.704528°E)

**Problem**: Incorrectly assigned to TWO different qasrs:

1. **Qasr al-Haj** (Lines 1718-1802)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - ✅ **CORRECTED**: Changed to **Q2499392** (Gasr Al-Hajj, coordinates: 32.044561°N, 12.164781°E)
   - Location: Jabal al Gharbi (Western Mountains)
   - Type: 12th-century fortified granary

2. **Qasr Nalut** (Lines 1803-1894)
   - ❌ Wrong Q-number: Q2641357 (Qasr Banat in Misrata)
   - ✅ **CORRECTED**: Changed to **Q3818705** (Castle Nalut, coordinates: 31.868°N, 10.9855°E)
   - Location: Nalut (Nafusa Mountains)
   - Type: 8th-century BC fortified granary with 300+ storage chambers

**Root Cause**: Fuzzy matching on "qasr/قصر" (Arabic for "castle/fortress") without disambiguating between the multiple qasrs in Wikidata.

## Enrichment Notes Documentation

All corrected records now include detailed enrichment notes explaining:

- Original incorrect assignment
- Geographic mismatch distances
- Institution type mismatches
- Corrected Q-number with coordinates

### Example Enrichment Note Format

```yaml
enrichment_notes: 'CORRECTED 2025-11-11: Changed from Q2641357 (Qasr Banat in Misrata) to Q2499392 (Gasr Al-Hajj). Original enrichment incorrectly matched to wrong qasr. Coordinates: 32.044561°N, 12.164781°E.'
```

## Remaining Work

### 15 Institutions Without Wikidata IDs

1. **Misrata War Museum** - Removed incorrect Q80795728, no alternative found
2. **Berenice Archaeological Site** - Ancient Greek/Roman site, may not be in Wikidata
3. **Mirad Masoud Cave** - Rock art site, may not be in Wikidata
4. **BILNAS Archive** - UK-based (London), not a Libyan institution
5. **Temehu - Libya's First Online Museum** - Digital-only platform
6. **Ghadames Manuscript Collections** - Distributed collections, not a single institution
7. **Nafusa Mountain Libraries** - Distributed collections across the region
8. **Ghat Fortress** - Historic site, may not be in Wikidata
9. **Libyan Center for Archives and Historical Studies** - May not be in Wikidata
10. **Nalut University** - Academic institution, may have a Q-number
11. **British Institute for Libyan and Northern African Studies Digital Archive** - UK-based (London)
12. **Endangered Archaeology in the Middle East and North Africa Database** - UK-based (Oxford)
13. **Ministry of Culture and Knowledge Development Libya** - Government agency
14. **Nalut Qasr Museum** - Removed incorrect Q80795728, no alternative found
15. **Tobruk War Cemetery and Museum** - Removed incorrect Q80795728, may exist in Wikidata

### Potential Wikidata Investigations

- **Nalut University**: Search Wikidata for Libyan universities
- **Ministry of Culture**: Search for Libyan government ministries
- **War museums/memorials**: Search for WWII North Africa heritage sites
- **UK-based institutions**: Already documented, not Libyan entities

## Script Improvements Implemented

The enrichment script (`scripts/enrich_libya_wikidata_fuzzy.py`) now includes:

1. **Geographic Validation** (added in previous session):
   - City name fuzzy matching with 50% minimum threshold
   - Rejects candidates with low geographic match scores

2. **Match Score Threshold**:
   - Raised from 80% to 85% minimum for fuzzy name matching

3. **Duplicate Detection**:
   - Prevents assigning Q-numbers already used by other institutions
   - Logs warnings when duplicates are blocked

## Validation Checklist

- [x] No duplicate Q-numbers across all 52 institutions
- [x] All removed Q-numbers documented with explanatory enrichment notes
- [x] Corrected Q-numbers verified against Wikidata coordinates
- [x] Geographic mismatches resolved (200-600 km discrepancies fixed)
- [x] Institution type mismatches resolved (war memorials vs. ancient museums)
- [x] Enrichment history preserved (original timestamps retained)

## Next Steps

1. **Manual Wikidata Search**: Investigate the remaining 15 institutions without Q-numbers
2. **Wikidata Entity Creation**: Consider creating new Wikidata entries for significant institutions not yet represented
3. **Re-run Enrichment**: After Wikidata updates, re-run the enrichment script to capture new matches
4. **Documentation**: Update project documentation with lessons learned from geographic validation

## Files Modified

- `data/instances/libya/libyan_institutions.yaml` (5 institutions corrected)

## References

- **Wikidata Queries**: https://query.wikidata.org/
- **Q2499392**: Gasr Al-Hajj (Qasr al-Haj)
- **Q3818705**: Castle Nalut (Qasr Nalut)
- **Q2641357**: Qasr Banat (Misrata) - NOT the institutions we have
- **Q80795728**: Barce Museum (Al Bayda) - NOT the war museums

---

**Cleanup Completed**: 2025-11-11

**Status**: ✅ All duplicates resolved, data quality improved
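
## Appendix: Validation Sketch

The duplicate check and the coordinate comparison described in this summary could be reproduced with something like the sketch below. It is not the code in `scripts/enrich_libya_wikidata_fuzzy.py`. The record layout (the `name` and `wikidata_id` keys) and both function names are assumptions made for illustration, and the haversine formula stands in for whatever distance measure the script may actually use. The coordinates are the ones quoted above.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def find_duplicate_qids(institutions):
    """Return {qid: [names]} for every Q-number used by more than one record.

    `institutions` is a list of dicts; the "name" and "wikidata_id" keys are
    hypothetical and may not match the real YAML schema.
    """
    names_by_qid = defaultdict(list)
    for inst in institutions:
        qid = inst.get("wikidata_id")
        if qid:  # skip records without a Wikidata ID
            names_by_qid[qid].append(inst["name"])
    return {qid: names for qid, names in names_by_qid.items() if len(names) > 1}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Illustrative records only: the Q2641357 duplicate as it stood before cleanup.
records = [
    {"name": "Qasr al-Haj", "wikidata_id": "Q2641357"},
    {"name": "Qasr Nalut", "wikidata_id": "Q2641357"},
    {"name": "Ghat Fortress", "wikidata_id": None},
]
print(find_duplicate_qids(records))

# Qasr Banat (Q2641357) versus Gasr Al-Hajj (Q2499392).
distance = haversine_km(31.461833, 14.704528, 32.044561, 12.164781)
print(f"Qasr Banat to Gasr Al-Hajj: {distance:.0f} km")
```

With these inputs the two qasrs come out roughly 250 km apart, so a distance cutoff applied to candidate matches would have flagged the original Q2641357 assignment before it was written to the dataset.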