Tunisia Heritage Institutions - Enhancement Report

Date: November 10, 2025
Dataset: tunisian_institutions_enhanced.yaml
Schema Version: LinkML v0.2.1


Enhancement Pipeline Results

Phase 1: GHCID Generation

  • Total institutions: 69
  • GHCIDs generated: 69 (100%)
  • UUID v5 identifiers: 69
  • UUID v8 identifiers: 69
  • Numeric identifiers: 69
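The derived identifiers can be illustrated with a minimal sketch. The report does not state which UUID namespace or numeric scheme the pipeline uses, so `GHCID_NAMESPACE` and the SHA-256-based `ghcid_to_numeric` below are assumptions chosen for illustration; only `uuid.uuid5` itself is standard behaviour.

```python
import hashlib
import uuid

# Hypothetical namespace: the pipeline's real namespace is not given in this report.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_to_uuid5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic, name-based UUID (version 5) from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_to_numeric(ghcid: str) -> int:
    """Assumed scheme: the first 8 bytes of the SHA-256 digest as an unsigned integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    print(ghcid_to_uuid5("TN-TUN-TUN-L-BNT"))
    print(ghcid_to_numeric("TN-TUN-TUN-L-BNT"))
```

Because both functions depend only on the GHCID string, rerunning the pipeline would reproduce the same identifiers for an unchanged GHCID.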

Collision Detection:

  • 1 collision detected: a duplicate record for Chemtou Archaeological Museum, not two distinct institutions
  • No genuine GHCID collisions between distinct institutions remain after abbreviation optimization
  • Research institutes were disambiguated using their French acronyms (IRSMC, INRAT)
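The collision check described above can be sketched as a grouping step followed by a simple classification. The record keys `ghcid` and `name` are assumed field names, and the duplicate test (identical names after case and whitespace normalization) is an illustrative heuristic rather than the pipeline's actual rule.

```python
from collections import defaultdict


def find_ghcid_collisions(records):
    """Group institution names by GHCID and keep only GHCIDs used more than once."""
    groups = defaultdict(list)
    for record in records:
        groups[record["ghcid"]].append(record["name"])
    return {ghcid: names for ghcid, names in groups.items() if len(names) > 1}


def classify_collision(names):
    """Return 'duplicate' if all names match after normalization, else 'real'."""
    normalized = {" ".join(name.lower().split()) for name in names}
    return "duplicate" if len(normalized) == 1 else "real"


if __name__ == "__main__":
    sample = [
        {"ghcid": "TN-JEN-CHE-M-CAM", "name": "Chemtou Archaeological Museum"},
        {"ghcid": "TN-JEN-CHE-M-CAM", "name": "Chemtou Archaeological Museum"},
        {"ghcid": "TN-KAI-KAI-M-K", "name": "Kairouan Museum"},
    ]
    for ghcid, names in find_ghcid_collisions(sample).items():
        print(ghcid, classify_collision(names))
```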

GHCID Format: TN-{GOV}-{CITY}-{TYPE}-{ABBREV}

  • Country code: TN (Tunisia)
  • Governorate codes: TUN, SFA, KAI, JEN, etc.
  • City codes: 3-letter codes derived from the GeoNames database (UN/LOCODE style)
  • Institution type: Single-letter code (M=Museum, L=Library, A=Archive, etc.)
  • Abbreviation: Extracted from French institutional names
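As a rough illustration of the format above, the sketch below assembles a GHCID from its components. The `abbreviate` helper is a hypothetical initials-based heuristic that skips common French function words. It reproduces simple cases such as BNT, ANT and INP, but it would not reproduce established acronyms such as IRSMC, which presumably come from the institutions themselves.

```python
# Assumed stopword list for the illustrative heuristic below.
FRENCH_STOPWORDS = {"de", "du", "des", "la", "le", "les", "sur", "et", "en", "l", "d"}


def abbreviate(name: str) -> str:
    """Hypothetical heuristic: initials of the words that are not French stopwords."""
    words = name.replace("'", " ").replace("\u2019", " ").split()
    return "".join(w[0].upper() for w in words if w.lower() not in FRENCH_STOPWORDS)


def build_ghcid(governorate: str, city: str, type_code: str, abbreviation: str) -> str:
    """Join the components as TN-{GOV}-{CITY}-{TYPE}-{ABBREV}, upper-cased."""
    parts = ["TN", governorate, city, type_code, abbreviation]
    return "-".join(part.strip().upper() for part in parts)


if __name__ == "__main__":
    name = "Bibliothèque Nationale de Tunisie"
    print(build_ghcid("TUN", "TUN", "L", abbreviate(name)))
```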

Phase 2: Geocoding

  • Geocoded: 68/69 (98.6%)
  • Failed: 1 (likely an ambiguous or missing city name)
  • API calls: 21 (all other lookups were served from the cache)
  • Cache hits: 56

Method: Nominatim API with SQLite caching
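The caching approach can be sketched as follows. This is not the pipeline's actual code: the `GeocodeCache` class, its table layout and the injected `fetch` callable are assumptions. In a real run `fetch` would wrap a rate-limited Nominatim request; here it is left injectable so the sketch runs without network access.

```python
import sqlite3
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


class GeocodeCache:
    """SQLite-backed cache in front of a geocoding function (assumed design)."""

    def __init__(self, db_path: str, fetch: Callable[[str], Optional[Coordinates]]):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS geocode "
            "(query TEXT PRIMARY KEY, lat REAL, lon REAL)"
        )
        self.fetch = fetch
        self.api_calls = 0
        self.cache_hits = 0

    def lookup(self, query: str) -> Optional[Coordinates]:
        # Normalize case and whitespace so trivially different queries share a row.
        key = " ".join(query.lower().split())
        row = self.conn.execute(
            "SELECT lat, lon FROM geocode WHERE query = ?", (key,)
        ).fetchone()
        if row is not None:
            self.cache_hits += 1
            return None if row[0] is None else (row[0], row[1])
        # Cache miss: call the geocoder once and store the result, including failures.
        self.api_calls += 1
        result = self.fetch(query)
        lat, lon = result if result is not None else (None, None)
        self.conn.execute("INSERT INTO geocode VALUES (?, ?, ?)", (key, lat, lon))
        self.conn.commit()
        return result
```

Storing failed lookups as empty rows means an unresolvable query is not retried against the API on every run, which is one plausible way to keep the call count low.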

Phase 3: Wikidata Enrichment ⏸️

  • Skipped in this run for speed
  • Current Wikidata coverage: 2/69 (2.9%) from original extraction
  • Next step: Run comprehensive Wikidata SPARQL enrichment

Dataset Statistics

Institution Types

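| Type | Count | Share |
|------|-------|-------|
| MUSEUM | 35 | 50.7% |
| OFFICIAL_INSTITUTION | 8 | 11.6% |
| LIBRARY | 5 | 7.2% |
| UNIVERSITY | 5 | 7.2% |
| RESEARCH_CENTER | 5 | 7.2% |
| PERSONAL_COLLECTION | 4 | 5.8% |
| EDUCATION_PROVIDER | 3 | 4.3% |
| HOLY_SITES | 2 | 2.9% |
| ARCHIVE | 1 | 1.4% |
| MIXED | 1 | 1.4% |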
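The shares in this table follow directly from the counts (each count divided by 69, rounded to one decimal). A minimal sketch of that calculation, with `type_distribution` as a hypothetical helper name:

```python
from collections import Counter


def type_distribution(types):
    """Return (type, count, percentage) tuples, most frequent type first."""
    counts = Counter(types)
    total = sum(counts.values())
    return [(t, n, round(100 * n / total, 1)) for t, n in counts.most_common()]


if __name__ == "__main__":
    # Counts taken from the table above; the remaining types are lumped together.
    types = ["MUSEUM"] * 35 + ["OFFICIAL_INSTITUTION"] * 8 + ["OTHER"] * 26
    for row in type_distribution(types):
        print(row)
```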

Geographic Distribution

Top 10 Cities:

  1. Tunis: 26 institutions (37.7%)
  2. Djerba: 6 institutions (8.7%)
  3. Sfax: 4 institutions (5.8%)
  4. Sousse: 3 institutions (4.3%)
  5. Chemtou: 2 institutions (2.9%)
  6. Nabeul: 2 institutions (2.9%)
  7. Bizerte: 2 institutions (2.9%)
  8. Monastir: 2 institutions (2.9%)
  9. El Jem: 1 institution (1.4%)
  10. Carthage: 1 institution (1.4%)

Governorate Coverage: 17+ governorates represented


Sample GHCIDs

TN-TUN-TUN-L-BNT     Bibliothèque Nationale de Tunisie
TN-TUN-TUN-A-ANT     Archives Nationales de Tunisie
TN-TUN-TUN-O-INP     Institut National du Patrimoine
TN-TUN-TUN-M-NMBCT   National Museum of Bardo
TN-SFA-SFA-M-MAT     Sfax Archaeological Museum
TN-KAI-KAI-M-K       Kairouan Museum
TN-JEN-CHE-M-CAM     Chemtou Archaeological Museum
TN-TUN-TUN-R-IRSMC   Institut de Recherche sur le Maghreb Contemporain
TN-TUN-TUN-R-INRAT   Institut National de la Recherche Agronomique
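Identifiers like the samples above can be split back into their components with a regular expression. The pattern below is inferred from the documented format and the samples in this report; it is an assumption, not a published GHCID grammar, and the `Ghcid` tuple is a hypothetical container.

```python
import re
from typing import NamedTuple

# Inferred from TN-{GOV}-{CITY}-{TYPE}-{ABBREV} and the sample identifiers.
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})-(?P<governorate>[A-Z]{3})-(?P<city>[A-Z]{3})"
    r"-(?P<type_code>[A-Z])-(?P<abbreviation>[A-Z0-9]+)$"
)


class Ghcid(NamedTuple):
    country: str
    governorate: str
    city: str
    type_code: str
    abbreviation: str


def parse_ghcid(value: str) -> Ghcid:
    """Split a GHCID into named components, raising ValueError if it is malformed."""
    match = GHCID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Not a valid GHCID: {value!r}")
    return Ghcid(**match.groupdict())


if __name__ == "__main__":
    print(parse_ghcid("TN-JEN-CHE-M-CAM"))
```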

Data Quality

Completeness

| Field | Coverage |
|-------|----------|
| GHCID | 100% |
| Name | 100% |
| Institution Type | 100% |
| Location (city) | 100% |
| Coordinates (lat/lon) | 98.6% |
| Wikidata ID | 2.9% |
| Alternative Names | ~85% |
| Description | ~90% |

Data Tier

Primary tier: TIER_4_INFERRED (conversation NLP extraction)

  • Source: Claude conversation JSON files about Tunisian GLAM institutions
  • Extraction method: AI-powered NER with pattern matching
  • Validation: LinkML schema compliance verified

Known Issues

  1. Duplicate: Chemtou Archaeological Museum appears twice

    • Same GHCID: TN-JEN-CHE-M-CAM
    • Likely extracted from different conversation turns
    • Action required: Manual deduplication
  2. Missing geocoding: 1 institution failed geocoding

    • Investigate ambiguous city name or missing location data
  3. Low Wikidata coverage: Only 2 of 69 institutions (2.9%) have Wikidata Q-numbers

    • Recommended: Run Wikidata SPARQL enrichment pipeline
    • French-language institutions may have good Wikidata coverage
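For the duplicate in issue 1, a merge step along the following lines could prepare the records for manual review. It is a sketch under assumptions: records are plain dictionaries keyed by an assumed `ghcid` field, the first occurrence wins, and later occurrences only fill fields that are empty in the first.

```python
def deduplicate(records):
    """Merge records sharing a GHCID; later records only fill empty fields."""
    merged = {}
    for record in records:
        ghcid = record["ghcid"]
        if ghcid not in merged:
            merged[ghcid] = dict(record)  # copy so the input is left untouched
            continue
        target = merged[ghcid]
        for field, value in record.items():
            if target.get(field) in (None, "", []):
                target[field] = value
    return list(merged.values())


if __name__ == "__main__":
    records = [
        {"ghcid": "TN-JEN-CHE-M-CAM", "name": "Chemtou Archaeological Museum",
         "description": ""},
        {"ghcid": "TN-JEN-CHE-M-CAM", "name": "Chemtou Archaeological Museum",
         "description": "Placeholder description from a second extraction"},
    ]
    print(deduplicate(records))
```

Conflicting non-empty values are left as they were in the first record, so a reviewer would still need to compare the two source records by hand.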

Next Steps

Immediate

  1. GHCID generation (complete)
  2. Geocoding (complete, 98.6%)
  3. Deduplicate the Chemtou Archaeological Museum records
  4. Investigate the one failed geocoding case

Enhancement

  1. Run Wikidata enrichment (rerun the script without the --skip-wikidata flag)

    • Query Wikidata SPARQL for French-language institutions
    • Match by name, location, institution type
    • Add Q-numbers, VIAF IDs, founding dates
  2. Website crawling (crawl4ai) for select institutions

    • Focus on national institutions (BNT, ANT, INP)
    • Extract additional metadata, collection descriptions
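The Wikidata step could start from a query like the one built below. This is a sketch, not the pipeline's query: `build_wikidata_query` is a hypothetical helper, and the identifiers are standard Wikidata ones (P31 instance of, P279 subclass of, P17 country, P214 VIAF ID, P571 inception, Q948 Tunisia, Q33506 museum). The function only builds the query text; sending it to a SPARQL endpoint and matching results by name and location are left out.

```python
def build_wikidata_query(type_qid: str, country_qid: str = "Q948",
                         language: str = "fr") -> str:
    """Build a SPARQL query for institutions of a given type in a given country."""
    return f"""
SELECT ?item ?itemLabel ?viaf ?inception WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} ;
        wdt:P17 wd:{country_qid} .
  OPTIONAL {{ ?item wdt:P214 ?viaf . }}
  OPTIONAL {{ ?item wdt:P571 ?inception . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language},en" . }}
}}
""".strip()


if __name__ == "__main__":
    # Q33506 is the Wikidata item for "museum".
    print(build_wikidata_query("Q33506"))
```

Requesting French labels first, with English as a fallback, fits the report's observation that the institution names are primarily French.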

Export & Integration

  1. Export to multiple formats:

    • RDF/Turtle (Linked Open Data)
    • JSON-LD (Schema.org, CPOV)
    • CSV (spreadsheet analysis)
    • Parquet (data warehousing)
  2. Integrate into global GLAM dataset

    • Merge with the Latin American and European datasets
    • Cross-link related institutions
    • Generate global statistics
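One of the export targets, JSON-LD with Schema.org terms, might look like the sketch below. The input field names (`ghcid`, `name`, `institution_type`, `latitude`, `longitude`, `wikidata_id`) and the type mapping are assumptions for illustration; the Schema.org types and properties used (`Museum`, `Library`, `ArchiveOrganization`, `GeoCoordinates`, `identifier`, `sameAs`) do exist in the vocabulary.

```python
import json

# Assumed mapping from dataset types to Schema.org types; others fall back to Organization.
SCHEMA_ORG_TYPES = {
    "MUSEUM": "Museum",
    "LIBRARY": "Library",
    "ARCHIVE": "ArchiveOrganization",
}


def to_jsonld(record: dict) -> dict:
    """Convert one institution record into a Schema.org JSON-LD document."""
    doc = {
        "@context": "https://schema.org",
        "@type": SCHEMA_ORG_TYPES.get(record["institution_type"], "Organization"),
        "identifier": record["ghcid"],
        "name": record["name"],
    }
    if record.get("latitude") is not None and record.get("longitude") is not None:
        doc["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": record["latitude"],
            "longitude": record["longitude"],
        }
    if record.get("wikidata_id"):
        doc["sameAs"] = f"http://www.wikidata.org/entity/{record['wikidata_id']}"
    return doc


if __name__ == "__main__":
    record = {
        "ghcid": "TN-TUN-TUN-L-BNT",
        "name": "Bibliothèque Nationale de Tunisie",
        "institution_type": "LIBRARY",
    }
    print(json.dumps(to_jsonld(record), ensure_ascii=False, indent=2))
```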

Technical Details

Script: scripts/enhance_tunisia_dataset.py
Input: data/instances/tunisia/tunisian_institutions.yaml
Output: data/instances/tunisia/tunisian_institutions_enhanced.yaml
Cache: data/cache/tunisia_geocoding.db (SQLite)

Performance:

  • GHCID generation: <1 second
  • Geocoding: ~25 seconds (21 API calls @ 1 req/sec)
  • Total runtime: ~30 seconds

Dependencies:

  • glam_extractor (identifiers, geocoding)
  • Nominatim API (geocoding)
  • GeoNames database (city codes)
  • LinkML schema v0.2.1

Report generated: 2025-11-10
Enhancement pipeline version: 1.0
Status: Phase 1-2 complete, Phase 3 pending