glam/docs/tunisia_enrichment_summary.md
2025-11-19 23:25:22 +01:00

Tunisia Wikidata Enrichment - Summary Report

Date: 2025-11-10
Status: Complete
Coverage: 52/68 institutions (76.5%)
Improvement: +18 institutions (+26.5 percentage points) from initial 50% coverage


Executive Summary

Enriched 52 of 68 Tunisian heritage institutions (extracted from conversation files) with Wikidata identifiers using an alternative-name matching strategy. Searching on the English primary name first and falling back to French alternatives raised coverage from 50% to 76.5%, adding 18 validated Wikidata links.

Problem Statement

The initial Wikidata enrichment matched only 34 of 68 institutions (50%) because of a language mismatch:

  • Primary names (conversation): English ("Diocesan Library of Tunis", "Kerkouane Museum")
  • Wikidata labels: French ("Bibliothèque Diocésaine de Tunis", "Musée de Kerkouane")
  • Result: String matching failed for French-labeled entities

Solution Implemented

Modified scripts/enrich_tunisia_wikidata_validated.py to search both primary names AND alternative names:

# Search primary name first
matched_name = institution['name']
results = query_wikidata(matched_name, city, country)

# If no match, try each alternative name in turn
if not results:
    for alt_name in institution.get('alternative_names') or []:
        results = query_wikidata(alt_name, city, country)
        if results:
            matched_name = alt_name  # Track which name worked
            break
    else:
        matched_name = None  # No name variant matched
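The fallback can also be expressed as a small self-contained helper, which makes it easy to test without network access. This is a minimal sketch and not the script's actual code: `find_first_match` and the injected `search` callable are hypothetical names, with `search` standing in for whatever lookup function the script uses.

```python
from typing import Callable, List, Optional, Tuple


def find_first_match(
    institution: dict,
    search: Callable[[str], list],
) -> Tuple[List, Optional[str]]:
    """Try the primary name, then each alternative; return (results, matched_name)."""
    names = [institution["name"], *(institution.get("alternative_names") or [])]
    for name in names:
        results = search(name)
        if results:
            return results, name  # matched_name records provenance
    return [], None


# Stand-in for a Wikidata lookup: only the French label is "known" here.
fake_index = {"Musée de Kerkouane": ["<candidate>"]}
results, matched = find_first_match(
    {"name": "Kerkouane Museum", "alternative_names": ["Musée de Kerkouane"]},
    lambda name: fake_index.get(name, []),
)
print(matched)  # Musée de Kerkouane
```

Passing the lookup in as an argument keeps the name-ordering logic separate from the query itself, so the same helper could serve other regions with different alternative-name languages.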

Key Features

  1. Multilingual Search: Try English, French, and Arabic name variants
  2. Entity Type Validation: Museums must be museums (prevents false positives)
  3. Geographic Validation: Institutions must be in specified cities
  4. Conservative Thresholds: 70% minimum fuzzy match score
  5. Provenance Tracking: Log which alternative name produced match
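The report does not say which similarity metric produces the fuzzy match score. The sketch below assumes the standard library's `difflib.SequenceMatcher` together with accent and case normalization; the helper names `normalize`, `similarity`, and `best_label` are hypothetical and only illustrate how a 70% minimum could be applied.

```python
import unicodedata
from difflib import SequenceMatcher

MIN_SCORE = 70.0  # minimum fuzzy match score quoted in the report


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity between two names as a percentage (0-100)."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_label(name: str, labels: list):
    """Return (label, score) for the closest label, or None if below MIN_SCORE."""
    scored = [(label, similarity(name, label)) for label in labels]
    if not scored:
        return None
    label, score = max(scored, key=lambda pair: pair[1])
    return (label, score) if score >= MIN_SCORE else None


print(similarity("Musée de Kerkouane", "musee de kerkouane"))  # 100.0
```

A different metric (token-based or edit-distance-based) would give different scores for the same pair of names, so the 70% figure is only meaningful relative to the metric the script actually uses.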

Results

Overall Statistics

| Metric | Before | After | Improvement |
|---|---|---|---|
| Wikidata Coverage | 34/68 (50.0%) | 52/68 (76.5%) | +18 institutions |
| VIAF IDs | Included | Included | Via Wikidata |
| Match Quality | 70%+ threshold | 70%+ threshold | Maintained |

Coverage by Institution Type

| Type | Enriched | Total | Coverage |
|---|---|---|---|
| UNIVERSITY | 5/5 | 5 | 100.0% |
| ARCHIVE | 1/1 | 1 | 100.0% |
| HOLY_SITES | 2/2 | 2 | 100.0% |
| MIXED | 1/1 | 1 | 100.0% |
| MUSEUM | 33/34 | 34 | 97.1% |
| LIBRARY | 4/5 | 5 | 80.0% |
| EDUCATION_PROVIDER | 2/3 | 3 | 66.7% |
| RESEARCH_CENTER | 2/5 | 5 | 40.0% |
| OFFICIAL_INSTITUTION | 2/8 | 8 | 25.0% |
| PERSONAL_COLLECTION | 0/4 | 4 | 0.0% |
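As a quick consistency check, the per-type rows can be summed to reproduce the headline figures. The counts below are copied from the table; the variable and function names are illustrative only.

```python
# (enriched, total) per institution type, copied from the coverage table
coverage_by_type = {
    "UNIVERSITY": (5, 5),
    "ARCHIVE": (1, 1),
    "HOLY_SITES": (2, 2),
    "MIXED": (1, 1),
    "MUSEUM": (33, 34),
    "LIBRARY": (4, 5),
    "EDUCATION_PROVIDER": (2, 3),
    "RESEARCH_CENTER": (2, 5),
    "OFFICIAL_INSTITUTION": (2, 8),
    "PERSONAL_COLLECTION": (0, 4),
}


def pct(enriched: int, total: int) -> float:
    """Coverage as a percentage rounded to one decimal place."""
    return round(100 * enriched / total, 1)


enriched = sum(e for e, _ in coverage_by_type.values())
total = sum(t for _, t in coverage_by_type.values())
gain = round(pct(enriched, total) - pct(34, total), 1)  # 34 = matches before the change

print(f"{enriched}/{total} = {pct(enriched, total)}%")  # 52/68 = 76.5%
print(f"gain: +{gain} percentage points")               # gain: +26.5 percentage points
```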

Success Examples

High-Confidence Matches (100% similarity):

  • Bibliothèque Nationale de Tunisie → Q549445
  • Diocesan Library of Tunis (via French alternative) → Q28149782
  • National Archives of Tunisia → Q2861080
  • Bardo National Museum → Q2260682
  • Kerkouane Museum (via "Musée de Kerkouane") → Confirmed

Alternative Name Successes:

  • "Diocesan Library of Tunis" → "Bibliothèque Diocésaine de Tunis" → 100% match
  • "Chemtou Museum" → "Musée de Chimtou" → 100% match
  • Multiple public libraries matched via French alternatives

Unenriched Institutions (16 remaining)

Category Analysis

1. Official/Government Institutions (6)

  • BIRUNI Network (academic consortium)
  • Centre National Universitaire de Documentation Scientifique et Technique
  • British Council Tunisia - Digital Library
  • U.S. Embassy Tunisia - Online Resources Library
  • Maison de la Culture Ibn-Khaldoun
  • Maison de la Culture Ibn Rachiq

Rationale: Often lack Wikidata entries (legitimate gap)

2. Research Centers (3)

  • Institut de Recherche sur le Maghreb Contemporain (IRMC)
  • Laboratoire national de la conservation et restauration des manuscrits
  • Centre des Musiques Arabes et Méditerranéennes

Rationale: Specialized research institutions may not have Wikidata coverage

3. Personal Collections (4)

  • El Basi Family Library (Djerba)
  • Chahed Family Library (Djerba)
  • Mhinni El Barouni Library (Djerba)
  • al-Layni Family Library (Djerba)

Rationale: Private family libraries unlikely to have Wikidata records

4. Low Match Quality - Correctly Rejected (3)

  • Centre National de la Calligraphie (69% match score)
  • Bibliothèque Régionale Ben Arous (69% match score)
  • La Rachidia - Institut de Musique Arabe (70% match score)

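Rationale: Scored below the 70% threshold and rejected to prevent false positives (the 70% shown for La Rachidia is presumably a rounded value that fell just under the cutoff)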

Technical Implementation

Modified Functions

scripts/enrich_tunisia_wikidata_validated.py:

  • Added alternative_names parameter to query_wikidata_by_name() (lines 124-130)
  • Implemented nested loop for name variant search (lines 203-258)
  • Enhanced logging to track which alternative produced match (lines 240-245)

SPARQL Query Pattern

SELECT ?item ?itemLabel ?viaf ?isil WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .  # Instance of museum (or subclass)
  ?item wdt:P131* wd:Q3572 .           # Located in Tunis
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(?label), LCASE("musée")))  # French search term
  OPTIONAL { ?item wdt:P214 ?viaf }    # VIAF ID
  OPTIONAL { ?item wdt:P791 ?isil }    # ISIL code
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,ar,en" }
}
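To reuse this pattern for other entity types, cities, and search terms, the query can be built from parameters. The sketch below is an assumption about how that might look, not an excerpt from the script: `build_query` and `run_query` are hypothetical names, and the User-Agent string is a placeholder. It targets the public Wikidata Query Service endpoint and uses only the standard library.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(type_qid: str, city_qid: str, term: str) -> str:
    """Fill the SPARQL pattern for one entity type, city, and search term."""
    # Escape backslashes and double quotes so the term stays inside the literal.
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel ?viaf ?isil WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} .
  ?item wdt:P131* wd:{city_qid} .
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(?label), LCASE("{safe_term}")))
  OPTIONAL {{ ?item wdt:P214 ?viaf }}
  OPTIONAL {{ ?item wdt:P791 ?isil }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "fr,ar,en" }}
}}
""".strip()


def run_query(query: str) -> list:
    """Send the query to the public endpoint and return the result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "tunisia-enrichment-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


# Museums (Q33506) located in Tunis (Q3572) whose label contains "musée"
query = build_query("Q33506", "Q3572", "musée")
```

Because `?label` binds once per language label, the same item can appear in several result rows; de-duplicating on `?item` is left to the caller.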

Validation Strategy

Three-Tier Validation:

  1. Entity Type Validation

    • Museums must be museums (P31/P279* wd:Q33506)
    • Libraries must be libraries (P31/P279* wd:Q7075)
    • Archives must be archives (P31/P279* wd:Q166118)
  2. Geographic Validation

    • Institution must be located in (P131) specified city
    • Prevents matching wrong institutions with similar names
  3. Fuzzy Match Threshold

    • 70% minimum similarity score
    • 85%+ recommended for high confidence
    • Manual review for 70-85% matches
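The three tiers can be combined into a single decision function. This is a minimal sketch under stated assumptions: the `Candidate` structure and `classify` function are hypothetical, the type and place sets are assumed to have been resolved already via P31/P279* and P131*, and a score of exactly 85 is treated as high confidence.

```python
from dataclasses import dataclass

MIN_SCORE = 70.0        # below this, reject
HIGH_CONFIDENCE = 85.0  # at or above this, accept without manual review


@dataclass
class Candidate:
    qid: str
    type_qids: frozenset   # classes reached via P31/P279*
    place_qids: frozenset  # places reached via P131*
    score: float           # fuzzy similarity, 0-100


def classify(candidate: Candidate, expected_type: str, expected_city: str) -> str:
    """Return 'reject', 'review', or 'accept' after the three checks."""
    if expected_type not in candidate.type_qids:
        return "reject"  # tier 1: wrong entity type
    if expected_city not in candidate.place_qids:
        return "reject"  # tier 2: wrong location
    if candidate.score < MIN_SCORE:
        return "reject"  # tier 3: name too dissimilar
    return "accept" if candidate.score >= HIGH_CONFIDENCE else "review"


# Illustrative candidate: a museum (Q33506) in Tunis (Q3572) with a 78% name match
example = Candidate("Q_EXAMPLE", frozenset({"Q33506"}), frozenset({"Q3572"}), 78.0)
print(classify(example, "Q33506", "Q3572"))  # review
```

Checking type and location before the score means a perfect name match on the wrong kind of entity is still rejected, which is the false-positive case the validation is meant to prevent.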

Comparison to Latin American Enrichment

| Region | Institutions | Wikidata Coverage | Strategy |
|---|---|---|---|
| Tunisia | 68 | 76.5% (52/68) | Alternative name search + validation |
| Latin America | 304 | 19.1% (58/304) | Direct name matching only |

Tunisia's higher success rate is due to:

  • Alternative name strategy implemented
  • Smaller dataset enables manual curation
  • French/English bilingual context provided in conversations
  • Entity type validation prevents false positives

Key Success Factors

  1. Alternative names critical for multilingual matching (English ↔ French)
  2. Entity type validation prevents false positives (banks, stadiums with similar names)
  3. Geographic validation ensures accuracy (multiple "National Library" entities exist)
  4. Conservative thresholds maintain quality (70% minimum prevents bad matches)
  5. Conversation data provides rich context (alternative names mentioned in discussion)

Files Modified

  1. scripts/enrich_tunisia_wikidata_validated.py:

    • Added alternative name search logic (the script now totals 500+ lines)
    • Enhanced logging for alternative name matches
    • Maintained validation strategies (type + geographic)
  2. data/instances/tunisia/tunisian_institutions_enhanced.yaml:

    • Updated from 34 → 52 Wikidata identifiers
    • Preserved existing data (locations, descriptions, collections)
    • Added VIAF IDs where available
  3. PROGRESS.md:

    • Added Tunisia Wikidata Enrichment section
    • Updated coverage statistics (12,748 → 12,816 institutions)
    • Documented methodology and results

Next Steps

Option A: Accept Current Coverage

Rationale: 76.5% coverage is excellent for TIER_4_INFERRED conversation data

Actions:

  • Document enrichment results in PROGRESS.md
  • Apply alternative name strategy to other regions (Brazil, Mexico, Chile)
  • Move to next country/region enrichment

Option B: Manual Wikidata Creation (Lower Priority)

For high-value institutions without records:

  • Centre des Musiques Arabes et Méditerranéennes
  • Institut de Recherche sur le Maghreb Contemporain (IRMC)
  • Centre National de la Calligraphie

Could create Wikidata entries following proper procedures, then re-run enrichment.

Option C: Apply Strategy to Other Regions

Immediate Opportunity: Apply alternative name strategy to Latin America (304 institutions, currently 19.1% coverage)

Expected improvement:

  • Brazil: Many institutions have Portuguese alternatives
  • Mexico: Spanish alternatives available
  • Chile: Spanish alternatives available

Lessons Learned

  1. Alternative names are critical for multilingual datasets
  2. Conversation data provides rich context (alternative names often discussed)
  3. Entity type validation essential (Wikidata has many entities with similar names)
  4. Geographic validation ensures accuracy (multiple institutions with same name)
  5. Conservative thresholds maintain quality (70% minimum prevents false positives)
  6. Smaller datasets enable manual curation (68 institutions vs. 304 in Latin America)

References

  • Script: scripts/enrich_tunisia_wikidata_validated.py
  • Test Script: scripts/test_alternative_names.py
  • Output: data/instances/tunisia/tunisian_institutions_enhanced.yaml
  • Documentation: PROGRESS.md (Tunisia Wikidata Enrichment section)
  • Strategy: docs/isil_enrichment_strategy.md (Phase 1: Wikidata enrichment)

Generated: 2025-11-10
Status: Complete
Next Action: Apply alternative name strategy to Latin America