Tunisia Wikidata Enrichment - Summary Report
Date: 2025-11-10
Status: Complete ✅
Coverage: 52/68 institutions (76.5%)
Improvement: +18 institutions (+26.5 percentage points) from initial 50% coverage
Executive Summary
Enriched 52 of 68 Tunisian heritage institutions (extracted from conversation files) with Wikidata identifiers using an alternative name matching strategy. Multilingual search (English primary names plus French alternatives) raised coverage from 50% to 76.5%, adding 18 new validated Wikidata links.
Problem Statement
Initial Wikidata enrichment matched only 34/68 institutions (50%) because of a language mismatch:
- Primary names (conversation): English ("Diocesan Library of Tunis", "Kerkouane Museum")
- Wikidata labels: French ("Bibliothèque Diocésaine de Tunis", "Musée de Kerkouane")
- Result: String matching failed for French-labeled entities
Solution Implemented
Modified scripts/enrich_tunisia_wikidata_validated.py to search both primary names AND alternative names:
# Search primary name first
results = query_wikidata(institution['name'], city, country)
matched_name = institution['name'] if results else None

# If no match, try alternative names in order and stop at the first hit
if not results and institution.get('alternative_names'):
    for alt_name in institution['alternative_names']:
        results = query_wikidata(alt_name, city, country)
        if results:
            matched_name = alt_name  # Track which name worked
            break
Key Features
- Multilingual Search: Try English, French, and Arabic name variants
- Entity Type Validation: Museums must be museums (prevents false positives)
- Geographic Validation: Institutions must be in specified cities
- Conservative Thresholds: 70% minimum fuzzy match score
- Provenance Tracking: Log which alternative name produced match
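The search, threshold, and provenance features listed above can be sketched end to end. This is a minimal illustration, not the project's script: best_match, similarity, and MIN_SCORE are hypothetical names, the candidate labels are passed in as a plain list instead of coming from a live Wikidata query, and difflib.SequenceMatcher stands in for whatever fuzzy scorer the real script uses.

```python
from difflib import SequenceMatcher

MIN_SCORE = 0.70  # the report's 70% minimum similarity, as a ratio


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for the real scorer."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def best_match(institution: dict, candidate_labels: list):
    """Try the primary name first, then each alternative name in order.

    Returns (label, score, matched_name) for the first name variant whose
    best candidate clears MIN_SCORE, or None if no variant does.
    """
    if not candidate_labels:
        return None
    names = [institution["name"], *institution.get("alternative_names", [])]
    for name in names:
        score, label = max((similarity(name, lbl), lbl) for lbl in candidate_labels)
        if score >= MIN_SCORE:
            return label, score, name  # name records which variant matched
    return None


institution = {
    "name": "Diocesan Library of Tunis",
    "alternative_names": ["Bibliothèque Diocésaine de Tunis"],
}
labels = ["Bibliothèque Diocésaine de Tunis", "Musée de Kerkouane"]
result = best_match(institution, labels)
print(result)
```

In this example the English primary name scores well below the threshold against the French labels, so the match comes from the French alternative, and the third element of the result records that variant for provenance logging.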
Results
Overall Statistics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Wikidata Coverage | 34/68 (50.0%) | 52/68 (76.5%) | +18 institutions |
| VIAF IDs | Included | Included | Via Wikidata |
| Match Quality | 70%+ threshold | 70%+ threshold | Maintained |
Coverage by Institution Type
| Type | Enriched | Total | Coverage |
|---|---|---|---|
| UNIVERSITY | 5 | 5 | 100.0% |
| ARCHIVE | 1 | 1 | 100.0% |
| HOLY_SITES | 2 | 2 | 100.0% |
| MIXED | 1 | 1 | 100.0% |
| MUSEUM | 33 | 34 | 97.1% |
| LIBRARY | 4 | 5 | 80.0% |
| EDUCATION_PROVIDER | 2 | 3 | 66.7% |
| RESEARCH_CENTER | 2 | 5 | 40.0% |
| OFFICIAL_INSTITUTION | 2 | 8 | 25.0% |
| PERSONAL_COLLECTION | 0 | 4 | 0.0% |
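As a cross-check, the headline figures can be recomputed from the per-type counts in the table above. The snippet below is only an arithmetic sketch; the variable names are illustrative and the counts are copied from the table.

```python
# (enriched, total) per institution type, copied from the coverage table
coverage = {
    "UNIVERSITY": (5, 5),
    "ARCHIVE": (1, 1),
    "HOLY_SITES": (2, 2),
    "MIXED": (1, 1),
    "MUSEUM": (33, 34),
    "LIBRARY": (4, 5),
    "EDUCATION_PROVIDER": (2, 3),
    "RESEARCH_CENTER": (2, 5),
    "OFFICIAL_INSTITUTION": (2, 8),
    "PERSONAL_COLLECTION": (0, 4),
}

enriched = sum(e for e, _ in coverage.values())
total = sum(t for _, t in coverage.values())
before = 34  # institutions matched before the alternative-name pass

gain = enriched - before
gain_points = gain / total * 100  # improvement in percentage points

print(f"{enriched}/{total} = {enriched / total:.1%}")  # 52/68 = 76.5%
print(f"+{gain} institutions, +{gain_points:.1f} percentage points")

for kind, (e, t) in coverage.items():
    print(f"{kind:<22} {e:>2}/{t:<2} {e / t:6.1%}")
```

The per-type rows sum to 52 enriched out of 68, and the 16 remaining institutions are the ones itemized in the unenriched section.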
Success Examples
High-Confidence Matches (100% similarity):
- Bibliothèque Nationale de Tunisie → Q549445
- Diocesan Library of Tunis (via French alternative) → Q28149782
- National Archives of Tunisia → Q2861080
- Bardo National Museum → Q2260682
- Kerkouane Museum (via "Musée de Kerkouane") → Confirmed
Alternative Name Successes:
- "Diocesan Library of Tunis" → "Bibliothèque Diocésaine de Tunis" → 100% match
- "Chemtou Museum" → "Musée de Chimtou" → 100% match
- Multiple public libraries matched via French alternatives
Unenriched Institutions (16 remaining)
Category Analysis
1. Official/Government Institutions (6)
- BIRUNI Network (academic consortium)
- Centre National Universitaire de Documentation Scientifique et Technique
- British Council Tunisia - Digital Library
- U.S. Embassy Tunisia - Online Resources Library
- Maison de la Culture Ibn-Khaldoun
- Maison de la Culture Ibn Rachiq
Rationale: Often lack Wikidata entries (legitimate gap)
2. Research Centers (3)
- Institut de Recherche sur le Maghreb Contemporain (IRMC)
- Laboratoire national de la conservation et restauration des manuscrits
- Centre des Musiques Arabes et Méditerranéennes
Rationale: Specialized research institutions may not have Wikidata coverage
3. Personal Collections (4)
- El Basi Family Library (Djerba)
- Chahed Family Library (Djerba)
- Mhinni El Barouni Library (Djerba)
- al-Layni Family Library (Djerba)
Rationale: Private family libraries unlikely to have Wikidata records
4. Low Match Quality - Correctly Rejected (3)
- Centre National de la Calligraphie (69% match score)
- Bibliothèque Régionale Ben Arous (69% match score)
- La Rachidia - Institut de Musique Arabe (70% match score)
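Rationale: These scores did not clear the 70% threshold and were rejected to prevent false positives (the 70% shown for La Rachidia is presumably rounded up from a value just under the cutoff)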
Technical Implementation
Modified Functions
scripts/enrich_tunisia_wikidata_validated.py:
- Added alternative_names parameter to query_wikidata_by_name() (lines 124-130)
- Implemented nested loop for name variant search (lines 203-258)
- Enhanced logging to track which alternative produced match (lines 240-245)
SPARQL Query Pattern
SELECT ?item ?itemLabel ?viaf ?isil WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .    # Instance of museum (or subclass)
  ?item wdt:P131* wd:Q3572 .             # Located in Tunis
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(?label), LCASE("musée")))    # French search term
  OPTIONAL { ?item wdt:P214 ?viaf }      # VIAF ID
  OPTIONAL { ?item wdt:P791 ?isil }      # ISIL code
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,ar,en" }
}
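The same pattern can be generated per institution type, city, and search term. The helper below is a hedged sketch: build_query is a hypothetical name, not a function from the project's script, and it only assembles the query text. Sending it to a SPARQL endpoint such as the Wikidata Query Service is left out.

```python
import re

QID_PATTERN = re.compile(r"Q\d+")


def build_query(type_qid: str, city_qid: str, term: str) -> str:
    """Build a SPARQL query for entities of a class, in a city, whose label contains term."""
    for qid in (type_qid, city_qid):
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")
    # Escape backslashes and double quotes so the term stays inside the string literal
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel ?viaf ?isil WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} .
  ?item wdt:P131* wd:{city_qid} .
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(?label), LCASE("{safe_term}")))
  OPTIONAL {{ ?item wdt:P214 ?viaf }}
  OPTIONAL {{ ?item wdt:P791 ?isil }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "fr,ar,en" }}
}}
""".strip()


# Museums (Q33506) located in Tunis (Q3572) whose label contains "musée"
query = build_query("Q33506", "Q3572", "musée")
print(query)
```

Swapping the class QID (for example Q7075 for libraries or Q166118 for archives) reuses the same template for the other institution types.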
Validation Strategy
Three-Tier Validation:
- Entity Type Validation
  - Museums must be museums (P31/P279* wd:Q33506)
  - Libraries must be libraries (P31/P279* wd:Q7075)
  - Archives must be archives (P31/P279* wd:Q166118)
- Geographic Validation
  - Institution must be located in (P131) specified city
  - Prevents matching wrong institutions with similar names
- Fuzzy Match Threshold
  - 70% minimum similarity score
  - 85%+ recommended for high confidence
  - Manual review for 70-85% matches
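The three tiers can be combined into a single decision function. The sketch below is illustrative only: Candidate, validate, and TYPE_QIDS are hypothetical names, the candidate's classes and places are assumed to have been resolved beforehand (via P31/P279* and P131*), and skipping the type check for institution types without a listed class is an assumption of this sketch.

```python
from dataclasses import dataclass

# Wikidata classes named in the entity type validation tier
TYPE_QIDS = {"MUSEUM": "Q33506", "LIBRARY": "Q7075", "ARCHIVE": "Q166118"}

MIN_SCORE = 70   # below this, reject
HIGH_SCORE = 85  # at or above this, accept without manual review


@dataclass
class Candidate:
    """Hypothetical container for one Wikidata search hit."""
    qid: str
    class_qids: set   # classes reached via P31/P279*
    place_qids: set   # places reached via P131*
    score: float      # fuzzy similarity on a 0-100 scale


def validate(candidate: Candidate, institution_type: str, city_qid: str) -> str:
    """Return 'reject', 'review', or 'accept' for a candidate match."""
    expected = TYPE_QIDS.get(institution_type)
    # Tier 1: entity type (skipped when no class is listed for the type)
    if expected is not None and expected not in candidate.class_qids:
        return "reject"
    # Tier 2: geography
    if city_qid not in candidate.place_qids:
        return "reject"
    # Tier 3: fuzzy score bands
    if candidate.score < MIN_SCORE:
        return "reject"
    return "accept" if candidate.score >= HIGH_SCORE else "review"


# Placeholder candidate; the QID "Q0" is not a real record
hit = Candidate(qid="Q0", class_qids={"Q33506"}, place_qids={"Q3572"}, score=78.0)
print(validate(hit, "MUSEUM", "Q3572"))   # review
print(validate(hit, "LIBRARY", "Q3572"))  # reject
```

Ordering the checks from cheapest to most subjective means a candidate of the wrong type or in the wrong city never reaches the score bands, so only plausible matches land in the manual review queue.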
Comparison to Latin American Enrichment
| Region | Institutions | Wikidata Coverage | Strategy |
|---|---|---|---|
| Tunisia | 68 | 76.5% (52/68) | Alternative name search + validation |
| Latin America | 304 | 19.1% (58/304) | Direct name matching only |
Tunisia's higher success rate is due to:
- Alternative name strategy implemented
- Smaller dataset enables manual curation
- French/English bilingual context provided in conversations
- Entity type validation prevents false positives
Key Success Factors
- Alternative names critical for multilingual matching (English ↔ French)
- Entity type validation prevents false positives (banks, stadiums with similar names)
- Geographic validation ensures accuracy (multiple "National Library" entities exist)
- Conservative thresholds maintain quality (70% minimum prevents bad matches)
- Conversation data provides rich context (alternative names mentioned in discussion)
Files Modified
- scripts/enrich_tunisia_wikidata_validated.py:
  - Added alternative name search logic (500+ lines total)
  - Enhanced logging for alternative name matches
  - Maintained validation strategies (type + geographic)
- data/instances/tunisia/tunisian_institutions_enhanced.yaml:
  - Updated from 34 → 52 Wikidata identifiers
  - Preserved existing data (locations, descriptions, collections)
  - Added VIAF IDs where available
- PROGRESS.md:
  - Added Tunisia Wikidata Enrichment section
  - Updated coverage statistics (12,748 → 12,816 institutions)
  - Documented methodology and results
Next Steps
Option A: Accept Current Results (RECOMMENDED)
Rationale: 76.5% coverage is excellent for TIER_4_INFERRED conversation data
Actions:
- ✅ Document enrichment results in PROGRESS.md
- Apply alternative name strategy to other regions (Brazil, Mexico, Chile)
- Move to next country/region enrichment
Option B: Manual Wikidata Creation (Lower Priority)
For high-value institutions without records:
- Centre des Musiques Arabes et Méditerranéennes
- Institut de Recherche sur le Maghreb Contemporain (IRMC)
- Centre National de la Calligraphie
Could create Wikidata entries following proper procedures, then re-run enrichment.
Option C: Apply Strategy to Other Regions
Immediate Opportunity: Apply alternative name strategy to Latin America (304 institutions, currently 19.1% coverage)
Expected improvement:
- Brazil: Many institutions have Portuguese alternatives
- Mexico: Spanish alternatives available
- Chile: Spanish alternatives available
Lessons Learned
- Alternative names are critical for multilingual datasets
- Conversation data provides rich context (alternative names often discussed)
- Entity type validation essential (Wikidata has many entities with similar names)
- Geographic validation ensures accuracy (multiple institutions with same name)
- Conservative thresholds maintain quality (70% minimum prevents false positives)
- Smaller datasets enable manual curation (68 institutions vs. 304 in Latin America)
References
- Script: scripts/enrich_tunisia_wikidata_validated.py
- Test Script: scripts/test_alternative_names.py
- Output: data/instances/tunisia/tunisian_institutions_enhanced.yaml
- Documentation: PROGRESS.md (Tunisia Wikidata Enrichment section)
- Strategy: docs/isil_enrichment_strategy.md (Phase 1: Wikidata enrichment)
Generated: 2025-11-10
Status: Complete ✅
Next Action: Apply alternative name strategy to Latin America