Tunisia Heritage Institutions - Enhancement Report
Date: November 10, 2025
Dataset: tunisian_institutions_enhanced.yaml
Schema Version: LinkML v0.2.1
Enhancement Pipeline Results
Phase 1: GHCID Generation ✅
- Total institutions: 69
- GHCIDs generated: 69 (100%)
- UUID v5 identifiers: 69
- UUID v8 identifiers: 69
- Numeric identifiers: 69
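As a rough illustration of how a deterministic UUID v5 can be derived from a GHCID string, the sketch below uses Python's standard `uuid` module. The namespace value is a hypothetical placeholder; this report does not state which namespace the pipeline actually uses.

```python
import uuid

# Hypothetical namespace for illustration only; the pipeline's real
# namespace UUID is not given in this report.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_to_uuid5(ghcid: str) -> uuid.UUID:
    """Return a name-based (SHA-1) UUID for a GHCID string.

    The same GHCID always maps to the same UUID, which makes the
    identifier reproducible across pipeline runs.
    """
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


if __name__ == "__main__":
    print(ghcid_to_uuid5("TN-TUN-TUN-L-BNT"))
```

Because the mapping is deterministic, re-running the generation step on an unchanged GHCID should not produce a new identifier.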
Collision Detection:
- 1 collision detected: a duplicate record for Chemtou Archaeological Museum, not a clash between distinct institutions
- No collisions between distinct institutions after abbreviation optimization
- Research institutes successfully disambiguated using French acronyms (IRSMC, INRAT)
GHCID Format: TN-{GOV}-{CITY}-{TYPE}-{ABBREV}
- Country code: TN (Tunisia)
- Governorate codes: TUN, SFA, KAI, JEN, etc.
- City codes: 3-letter GeoNames LOCODE
- Institution type: Single-letter code (M=Museum, L=Library, A=Archive, etc.)
- Abbreviation: Extracted from French institutional names
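The format above can be sketched as a pair of helpers that assemble and split a GHCID. The function names `build_ghcid` and `parse_ghcid` are hypothetical and are not taken from the `glam_extractor` package; the validation rules are assumptions based only on the component descriptions in this section.

```python
def build_ghcid(gov: str, city: str, inst_type: str, abbrev: str,
                country: str = "TN") -> str:
    """Assemble a GHCID of the form {COUNTRY}-{GOV}-{CITY}-{TYPE}-{ABBREV}."""
    parts = [country, gov, city, inst_type, abbrev]
    if any(not p for p in parts):
        raise ValueError("every GHCID component must be non-empty")
    if len(inst_type) != 1:
        raise ValueError("institution type must be a single-letter code")
    return "-".join(p.upper() for p in parts)


def parse_ghcid(ghcid: str) -> dict:
    """Split a GHCID back into its five named components."""
    # maxsplit=4 keeps any further hyphens inside the abbreviation.
    pieces = ghcid.split("-", 4)
    if len(pieces) != 5:
        raise ValueError(f"not a five-part GHCID: {ghcid!r}")
    keys = ("country", "governorate", "city", "type", "abbreviation")
    return dict(zip(keys, pieces))


if __name__ == "__main__":
    print(build_ghcid("TUN", "TUN", "L", "BNT"))
    print(parse_ghcid("TN-JEN-CHE-M-CAM"))
```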
Phase 2: Geocoding ✅
- Geocoded: 68/69 (98.6%)
- Failed: 1 (likely ambiguous or missing city)
- API calls: 21 (remaining lookups served from cache)
- Cache hits: 56
Method: Nominatim API with SQLite caching
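A minimal sketch of the cache-first lookup pattern, assuming a single SQLite table keyed by the query string. The class name `GeocodeCache`, the table layout, and the injected `fetch` callable are illustrative assumptions, not the pipeline's actual code; in a real run `fetch` would wrap a rate-limited Nominatim request.

```python
import sqlite3
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]


class GeocodeCache:
    """Cache-first geocoder: consult SQLite before calling the remote service."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS geocode "
            "(query TEXT PRIMARY KEY, lat REAL, lon REAL)"
        )
        self.api_calls = 0
        self.cache_hits = 0

    def lookup(self, query: str,
               fetch: Callable[[str], Optional[Coords]]) -> Optional[Coords]:
        row = self.conn.execute(
            "SELECT lat, lon FROM geocode WHERE query = ?", (query,)
        ).fetchone()
        if row is not None:
            # Cached result; failed lookups are stored as NULLs so they
            # are not retried on every run.
            self.cache_hits += 1
            return None if row[0] is None else (row[0], row[1])
        self.api_calls += 1
        result = fetch(query)
        lat, lon = result if result else (None, None)
        self.conn.execute(
            "INSERT INTO geocode (query, lat, lon) VALUES (?, ?, ?)",
            (query, lat, lon),
        )
        self.conn.commit()
        return result
```

Injecting `fetch` keeps the cache logic testable without network access and leaves rate limiting (1 request per second in this run) to the caller.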
Phase 3: Wikidata Enrichment ⏸️
- Skipped in this run for speed
- Current Wikidata coverage: 2/69 (2.9%) from original extraction
- Next step: Run comprehensive Wikidata SPARQL enrichment
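One possible shape for the pending SPARQL step is sketched below. It only builds a query string; how the pipeline would submit it and match results is not described in this report. The helper name `build_institution_query` is hypothetical. The identifiers are standard Wikidata ones: Q948 (Tunisia), Q33506 (museum), P31 (instance of), P279 (subclass of), P17 (country), P625 (coordinate location), P214 (VIAF ID), P571 (inception).

```python
TUNISIA_QID = "Q948"    # Wikidata item for Tunisia
MUSEUM_QID = "Q33506"   # Wikidata item for "museum"


def build_institution_query(country_qid: str = TUNISIA_QID,
                            class_qid: str = MUSEUM_QID,
                            limit: int = 500) -> str:
    """Build a SPARQL query listing institutions of one class in one country."""
    return f"""
SELECT ?item ?itemLabel ?coord ?viaf ?inception WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        wdt:P17 wd:{country_qid} .
  OPTIONAL {{ ?item wdt:P625 ?coord . }}
  OPTIONAL {{ ?item wdt:P214 ?viaf . }}
  OPTIONAL {{ ?item wdt:P571 ?inception . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "fr,en,ar" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_institution_query())
```

Requesting French labels first reflects the French-language naming used for the GHCID abbreviations; candidate matches would still need to be checked by name, location, and institution type.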
Dataset Statistics
Institution Types
| Type | Count | % |
|---|---|---|
| MUSEUM | 35 | 50.7% |
| OFFICIAL_INSTITUTION | 8 | 11.6% |
| LIBRARY | 5 | 7.2% |
| UNIVERSITY | 5 | 7.2% |
| RESEARCH_CENTER | 5 | 7.2% |
| PERSONAL_COLLECTION | 4 | 5.8% |
| EDUCATION_PROVIDER | 3 | 4.3% |
| HOLY_SITES | 2 | 2.9% |
| ARCHIVE | 1 | 1.4% |
| MIXED | 1 | 1.4% |
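The percentages in the table can be reproduced from the counts with a few lines of arithmetic. This is a consistency check only; the helper name `percentages` is illustrative.

```python
TYPE_COUNTS = {
    "MUSEUM": 35,
    "OFFICIAL_INSTITUTION": 8,
    "LIBRARY": 5,
    "UNIVERSITY": 5,
    "RESEARCH_CENTER": 5,
    "PERSONAL_COLLECTION": 4,
    "EDUCATION_PROVIDER": 3,
    "HOLY_SITES": 2,
    "ARCHIVE": 1,
    "MIXED": 1,
}


def percentages(counts: dict) -> dict:
    """Return each count as a share of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in percentages(TYPE_COUNTS).items():
        print(f"{name}: {pct}%")
```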
Geographic Distribution
Top 10 Cities:
- Tunis: 26 institutions (37.7%)
- Djerba: 6 institutions (8.7%)
- Sfax: 4 institutions (5.8%)
- Sousse: 3 institutions (4.3%)
- Chemtou: 2 institutions (2.9%)
- Nabeul: 2 institutions (2.9%)
- Bizerte: 2 institutions (2.9%)
- Monastir: 2 institutions (2.9%)
- El Jem: 1 institution (1.4%)
- Carthage: 1 institution (1.4%)
Governorate Coverage: 17+ governorates represented
Sample GHCIDs
TN-TUN-TUN-L-BNT Bibliothèque Nationale de Tunisie
TN-TUN-TUN-A-ANT Archives Nationales de Tunisie
TN-TUN-TUN-O-INP Institut National du Patrimoine
TN-TUN-TUN-M-NMBCT National Museum of Bardo
TN-SFA-SFA-M-MAT Sfax Archaeological Museum
TN-KAI-KAI-M-K Kairouan Museum
TN-JEN-CHE-M-CAM Chemtou Archaeological Museum
TN-TUN-TUN-R-IRSMC Institut de Recherche sur le Maghreb Contemporain
TN-TUN-TUN-R-INRAT Institut National de la Recherche Agronomique
Data Quality
Completeness
| Field | Coverage |
|---|---|
| GHCID | 100% |
| Name | 100% |
| Institution Type | 100% |
| Location (city) | 100% |
| Coordinates (lat/lon) | 98.6% |
| Wikidata ID | 2.9% |
| Alternative Names | ~85% |
| Description | ~90% |
Data Tier
Primary tier: TIER_4_INFERRED (conversation NLP extraction)
- Source: Claude conversation JSON files about Tunisian GLAM institutions
- Extraction method: AI-powered NER with pattern matching
- Validation: LinkML schema compliance verified
Known Issues
- Duplicate: Chemtou Archaeological Museum appears twice
  - Same GHCID: TN-JEN-CHE-M-CAM
  - Likely extracted from different conversation turns
  - Action required: Manual deduplication
- Missing geocoding: 1 institution failed geocoding
  - Investigate ambiguous city name or missing location data
- Low Wikidata coverage: Only 2.9% have Wikidata Q-numbers
  - Recommended: Run Wikidata SPARQL enrichment pipeline
  - French-language institutions may have good Wikidata coverage
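The duplicate-record issue could be surfaced and resolved along the following lines. This is a sketch under assumptions: records are plain dictionaries with a `ghcid` key, and duplicates are merged by keeping the first record and filling its empty fields from later ones. The report calls for manual deduplication, so an automatic merge like this would at most prepare candidates for review.

```python
from collections import defaultdict


def find_ghcid_collisions(records: list) -> dict:
    """Group records by GHCID and return only the groups with more than one entry."""
    groups = defaultdict(list)
    for record in records:
        groups[record["ghcid"]].append(record)
    return {ghcid: recs for ghcid, recs in groups.items() if len(recs) > 1}


def merge_duplicates(records: list) -> list:
    """Keep the first record per GHCID, filling its empty fields from later duplicates."""
    merged = {}
    for record in records:
        kept = merged.setdefault(record["ghcid"], dict(record))
        for key, value in record.items():
            if kept.get(key) in (None, "", []):
                kept[key] = value
    return list(merged.values())
```

Running `find_ghcid_collisions` first gives a reviewable list of affected GHCIDs before any record is dropped.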
Next Steps
Immediate
- ✅ GHCID generation complete
- ✅ Geocoding complete (98.6%)
- ⏳ Deduplicate Chemtou Archaeological Museum
- ⏳ Investigate failed geocoding case
Enhancement
- ⏳ Run Wikidata enrichment (--skip-wikidata flag removed)
  - Query Wikidata SPARQL for French-language institutions
  - Match by name, location, institution type
  - Add Q-numbers, VIAF IDs, founding dates
- ⏳ Website crawling (crawl4ai) for select institutions
  - Focus on national institutions (BNT, ANT, INP)
  - Extract additional metadata, collection descriptions
Export & Integration
- ⏳ Export to multiple formats:
  - RDF/Turtle (Linked Open Data)
  - JSON-LD (Schema.org, CPOV)
  - CSV (spreadsheet analysis)
  - Parquet (data warehousing)
- ⏳ Integrate into global GLAM dataset
  - Merge with Latin America, European datasets
  - Cross-link related institutions
  - Generate global statistics
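Of the planned export formats, CSV is the simplest to sketch with the standard library alone. The column names below are assumptions chosen to mirror the completeness table, not the actual field names in the LinkML schema, and `export_csv` is a hypothetical helper.

```python
import csv
import io

# Assumed column names; the real schema's field names may differ.
DEFAULT_FIELDS = ("ghcid", "name", "institution_type", "city",
                  "latitude", "longitude")


def export_csv(records: list, fields: tuple = DEFAULT_FIELDS) -> str:
    """Serialize records to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{
        "ghcid": "TN-TUN-TUN-L-BNT",
        "name": "Bibliothèque Nationale de Tunisie",
        "institution_type": "LIBRARY",
        "city": "Tunis",
    }]
    print(export_csv(sample))
```

Leaving missing values blank rather than raising keeps the one institution without coordinates in the export.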
Technical Details
Script: scripts/enhance_tunisia_dataset.py
Input: data/instances/tunisia/tunisian_institutions.yaml
Output: data/instances/tunisia/tunisian_institutions_enhanced.yaml
Cache: data/cache/tunisia_geocoding.db (SQLite)
Performance:
- GHCID generation: <1 second
- Geocoding: ~25 seconds (21 API calls @ 1 req/sec)
- Total runtime: ~30 seconds
Dependencies:
- glam_extractor (identifiers, geocoding)
- Nominatim API (geocoding)
- GeoNames database (city codes)
- LinkML schema v0.2.1
Report generated: 2025-11-10
Enhancement pipeline version: 1.0
Status: Phase 1-2 complete, Phase 3 pending