Global GLAM Dataset Unification - Complete Summary
Completed: 2025-11-11 15:17 UTC
Script: scripts/unify_all_datasets.py
✅ MISSION ACCOMPLISHED
Successfully unified all heritage institution datasets into a single comprehensive global dataset.
📊 Final Statistics
Overall Coverage
- Total Institutions: 13,502 (from 25,963 raw records)
- Countries Covered: 18
- Wikidata Coverage: 7,520/13,502 (55.7%)
- Geocoding Coverage: 8,178/13,502 (60.6%)
- Duplicates Removed: 12,461 (48.0% deduplication rate)
Data Quality Metrics
- Records Needing Enrichment: 13,461 (99.7%)
- Missing Wikidata Q-numbers: 5,982 institutions
- Missing Coordinates: 5,324 institutions
- Missing Website URLs: 2,085 institutions
- Missing/Incomplete Descriptions: 13,036 institutions
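The headline figures above follow from four counts. As a quick sanity check, the arithmetic can be reproduced like this (a minimal sketch; the variable names are illustrative and not taken from scripts/unify_all_datasets.py):

```python
# Counts taken from the statistics above.
RAW_RECORDS = 25_963
UNIQUE_INSTITUTIONS = 13_502
WITH_WIKIDATA = 7_520
WITH_COORDINATES = 8_178


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


duplicates_removed = RAW_RECORDS - UNIQUE_INSTITUTIONS

summary = {
    "duplicates_removed": duplicates_removed,
    # Deduplication rate is measured against the raw record count.
    "dedup_rate": pct(duplicates_removed, RAW_RECORDS),
    # Coverage rates are measured against the unique institution count.
    "wikidata_coverage": pct(WITH_WIKIDATA, UNIQUE_INSTITUTIONS),
    "geocoding_coverage": pct(WITH_COORDINATES, UNIQUE_INSTITUTIONS),
    "missing_wikidata": UNIQUE_INSTITUTIONS - WITH_WIKIDATA,
    "missing_coordinates": UNIQUE_INSTITUTIONS - WITH_COORDINATES,
}

for key, value in summary.items():
    print(f"{key}: {value}")
```

Note the two different denominators: the deduplication rate divides by raw records, while every coverage figure divides by unique institutions.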
🌍 Geographic Distribution
Top 10 Countries by Institution Count
| Rank | Country | Code | Count | Wikidata % | Geocode % | Status |
|---|---|---|---|---|---|---|
| 1 | Japan | JP | 12,065 | 58.8% | 58.8% | ✅ Good |
| 2 | Netherlands | NL | 622 | 31.0% | 99.8% | ⚠️ Needs Wikidata |
| 3 | Mexico | MX | 226 | 15.0% | 73.9% | 🔴 Priority |
| 4 | Brazil | BR | 212 | 13.7% | 45.8% | 🔴 Priority |
| 5 | Chile | CL | 180 | 53.9% | 93.3% | ✅ Good |
| 6 | Tunisia | TN | 69 | 1.4% | 13.0% | 🔴 Critical |
| 7 | Libya | LY | 50 | 16.0% | 0.0% | 🔴 Needs Geocoding |
| 8 | Vietnam | VN | 21 | 38.1% | 0.0% | ⚠️ Needs Geocoding |
| 9 | Algeria | DZ | 19 | 5.3% | 0.0% | 🔴 Critical |
| 10 | Georgia | GE | 14 | 0.0% | 0.0% | 🔴 Critical |
Countries with 0% Wikidata Coverage (HIGHEST PRIORITY)
- Georgia (GE): 14 institutions - no Wikidata, no geocoding
- Great Britain (GB): 4 institutions - no Wikidata, no geocoding
- Belgium (BE): 7 institutions - Geocoded but no Wikidata
- United States (US): 7 institutions - Geocoded but no Wikidata
- Luxembourg (LU): 1 institution - Geocoded but no Wikidata
📁 Files Generated
Output Location: /data/instances/all/
- globalglam-20251111.yaml (24 MB)
  - Complete unified dataset
  - 13,502 unique institutions
  - Provenance tracking for each record
- ENRICHMENT_CANDIDATES.yaml (2.8 MB)
  - 13,461 institutions needing enrichment
  - Sorted by priority (4 = most urgent, 1 = least urgent)
  - Detailed field-level gap analysis
- UNIFICATION_REPORT.md (11 KB)
  - Comprehensive statistics by country and source
  - Top 50 enrichment candidates
  - Duplicate detection results
- DATASET_STATISTICS.yaml (3 KB)
  - Machine-readable metrics
  - Country-by-country breakdown
  - Quality indicators
🔍 Data Sources Merged
| Source | Count | Wikidata % | Geocode % | Notes |
|---|---|---|---|---|
| Global Merged | 13,396 | 55.6% | 100.0% | Base dataset from previous work |
| Japan | 12,065 | 0.0% | 0.0% | Largest single-country dataset |
| Chile | 90 | 78.9% | 86.7% | Best quality - enriched in Batch 19 |
| Brazil | 115 | 6.1% | 0.0% | Batch 6 enriched |
| Mexico | 117 | 0.0% | 49.6% | Geocoded only |
| Libya | 54 | 14.8% | 0.0% | Needs geocoding |
| Tunisia | 69 | 1.4% | 13.0% | Minimal data |
| Vietnam | 21 | 38.1% | 0.0% | Needs geocoding |
| Algeria | 19 | 5.3% | 0.0% | Minimal data |
| Georgia | 14 | 0.0% | 0.0% | Critical - no enrichment |
| Historical | 5 | 100.0% | 100.0% | Validation dataset |
🎯 Enrichment Priority Matrix
Priority 4 (4 Missing Fields): 855 Institutions
ALL need: Wikidata, Coordinates, Website, Description
Geographic Focus:
- Brazil: 108 institutions (Museu dos Povos Acreanos, UFAC Repository, etc.)
- Algeria: 19 institutions (all institutions)
- Georgia: 14 institutions (all institutions)
- Libya: 47 institutions (most institutions)
Action: Batch Wikidata query + Nominatim geocoding + website scraping
Priority 3 (3 Missing Fields): 4,875 Institutions
Typical pattern: Missing Wikidata, Coordinates, Description (have website) OR missing Wikidata, Website, Description (have coordinates)
Geographic Focus:
- Japan: Majority of 12,065 institutions (missing Wikidata)
- Netherlands: 430 institutions (have geocoding, need Wikidata)
- Mexico: 170 institutions (partial data)
Action: Focus on Wikidata enrichment via SPARQL
Priority 2 (2 Missing Fields): 665 Institutions
Typical pattern: Missing Wikidata + one other field
Action: Targeted enrichment for specific gaps
Priority 1 (1 Missing Field): 7,042 Institutions
Typical pattern: Only missing description OR only missing website
Action: Lower priority - can defer
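The priority level in this matrix is simply the number of the four enrichable fields that a record lacks. A minimal sketch of that scoring follows; the field names (`wikidata_id`, `coordinates`, `website`, `description`) are assumptions for illustration and may not match the keys used in ENRICHMENT_CANDIDATES.yaml:

```python
# Hypothetical field names; the real dataset keys may differ.
ENRICHABLE_FIELDS = ("wikidata_id", "coordinates", "website", "description")


def missing_fields(record: dict) -> list[str]:
    """List the enrichable fields that are absent or empty in a record."""
    return [field for field in ENRICHABLE_FIELDS if not record.get(field)]


def priority(record: dict) -> int:
    """Priority 4 = all four fields missing, 0 = nothing to enrich."""
    return len(missing_fields(record))


def enrichment_candidates(records: list[dict]) -> list[dict]:
    """Keep records with at least one gap, most urgent first."""
    candidates = [r for r in records if priority(r) > 0]
    return sorted(candidates, key=priority, reverse=True)


records = [
    {"name": "Example B", "wikidata_id": "Q_EXAMPLE", "coordinates": (1.0, 2.0),
     "website": "https://example.org", "description": "Complete record"},
    {"name": "Example A", "coordinates": (1.0, 2.0), "website": "https://example.org"},
    {"name": "Museu dos Povos Acreanos"},
]

for rec in enrichment_candidates(records):
    print(priority(rec), rec["name"], missing_fields(rec))
```

Because `sorted` is stable, records with the same priority keep their input order, so a secondary ordering (for example by country) can be applied before this sort.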
🚀 Next Steps - Global Enrichment Workflow
Phase 1: Critical Countries (0% Wikidata Coverage)
Target: 33 institutions across 5 countries (GE, GB, BE, US, LU)
Workflow:
- Create enrichment script: scripts/enrich_critical_countries.py
- Query Wikidata SPARQL endpoint by country + institution type
- Fuzzy match institution names (threshold > 0.85)
- Geocode missing coordinates via Nominatim
- Validate and update records
Expected Outcome: Bring all 5 countries to 50%+ Wikidata coverage
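The fuzzy-matching step with a 0.85 threshold could look roughly like the sketch below. The report does not say which similarity measure the scripts use, so `difflib.SequenceMatcher` from the standard library is swapped in here as an assumption; the Q-numbers are placeholders, not real Wikidata identifiers.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.85  # threshold quoted in the workflow above


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not count."""
    return " ".join(name.casefold().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name: str, candidates: dict[str, str],
               threshold: float = MATCH_THRESHOLD):
    """Return (qid, score) for the best candidate above the threshold, else None.

    `candidates` maps a Wikidata Q-number to its label.
    """
    best_qid, best_score = None, 0.0
    for qid, label in candidates.items():
        score = similarity(name, label)
        if score > best_score:
            best_qid, best_score = qid, score
    if best_score > threshold:
        return best_qid, best_score
    return None


# Placeholder Q-numbers for illustration only.
candidates = {
    "Q_EXAMPLE_1": "museu dos povos acreanos",
    "Q_EXAMPLE_2": "Museu Nacional",
}
print(best_match("Museu dos Povos Acreanos", candidates))
print(best_match("Arquivo Municipal", candidates))
```

Matches that score just above the threshold are the ones worth routing to manual validation before a Q-number is written back to the record.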
Phase 2: North Africa (Tunisia, Algeria, Libya)
Target: 112 institutions with at most 16.0% Wikidata coverage
Challenges:
- Limited Wikidata entries for North African institutions
- Multilingual names (Arabic/French/English)
- Missing coordinates
Workflow:
- Wikidata enrichment with Arabic name variants
- Batch geocoding for all institutions
- Cross-reference with UNESCO heritage sites
- Manual validation of fuzzy matches
Expected Outcome: 40%+ Wikidata coverage, 80%+ geocoding
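Batch geocoding through the public Nominatim search API might be sketched as follows. The report only names Nominatim; the endpoint, the `q`, `countrycodes`, `format` and `limit` parameters, the contact string, and the one-request-per-second pause (required by the public instance's usage policy) are additions here, and a self-hosted instance would use a different base URL.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"
# Placeholder identification string; the public instance requires a real one.
USER_AGENT = "glam-enrichment-sketch/0.1 (contact: you@example.org)"


def build_search_url(name: str, country_code: str) -> str:
    """Build a search URL restricted to one ISO 3166-1 alpha-2 country."""
    params = {
        "q": name,
        "countrycodes": country_code.lower(),
        "format": "jsonv2",
        "limit": 1,
    }
    return NOMINATIM_SEARCH + "?" + urllib.parse.urlencode(params)


def geocode(name: str, country_code: str):
    """Return (lat, lon) for the top hit, or None when nothing is found."""
    request = urllib.request.Request(
        build_search_url(name, country_code),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    # Nominatim returns coordinates as strings.
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode_batch(institutions, delay_seconds: float = 1.0) -> dict:
    """Geocode (name, country_code) pairs, pausing between requests."""
    coordinates = {}
    for name, country_code in institutions:
        coordinates[name] = geocode(name, country_code)
        time.sleep(delay_seconds)  # stay within one request per second
    return coordinates
```

For multilingual names, calling `geocode` once per variant (Arabic, French, English) and keeping the first non-empty result is one possible extension.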
Phase 3: Latin America (Brazil, Mexico)
Target: 438 institutions (212 BR + 226 MX)
Current State:
- Brazil: 13.7% Wikidata, 45.8% geocoded
- Mexico: 15.0% Wikidata, 73.9% geocoded
Workflow:
- Reuse Chile enrichment scripts (proven 78.9% success rate)
- Batch SPARQL queries for Brazilian/Mexican institutions
- Enhance geocoding for Brazil (currently 45.8%)
- Website crawling for missing descriptions
Expected Outcome:
- Brazil → 50%+ Wikidata, 80%+ geocoding
- Mexico → 50%+ Wikidata, 90%+ geocoding
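A batch SPARQL query for one country and one institution type could be assembled as in the sketch below. It assumes the standard Wikidata properties P31 (instance of), P279 (subclass of), P17 (country), P625 (coordinate location) and P856 (official website), and uses Q155 (Brazil), Q96 (Mexico) and Q33506 (museum) as examples. The function name and result columns are illustrative, not the ones used by the Chile scripts.

```python
import re


def build_country_query(country_qid: str, type_qid: str,
                        languages=("en",), limit: int = 500) -> str:
    """Build a SPARQL query for institutions of one type in one country."""
    for qid in (country_qid, type_qid):
        # Reject anything that is not a plain Q-number before interpolating it.
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata Q-number: {qid!r}")
    language_list = ",".join(languages)
    return f"""
SELECT ?item ?itemLabel ?coord ?website WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} ;
        wdt:P17 wd:{country_qid} .
  OPTIONAL {{ ?item wdt:P625 ?coord . }}
  OPTIONAL {{ ?item wdt:P856 ?website . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language_list}" . }}
}}
LIMIT {limit}
""".strip()


# Museums in Brazil, with Portuguese labels preferred over English ones.
brazil_museums = build_country_query("Q155", "Q33506", languages=("pt", "en"))
# Museums in Mexico, with Spanish labels preferred.
mexico_museums = build_country_query("Q96", "Q33506", languages=("es", "en"))
print(brazil_museums)
```

The resulting string would be sent to the Wikidata Query Service, and the returned labels fed into name matching against the local records.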
Phase 4: Netherlands Deep Enrichment
Target: 622 institutions, currently 31.0% Wikidata
Advantages:
- Already 99.8% geocoded
- Rich metadata available (ISIL codes, KvK numbers)
- Many institutions have websites
Workflow:
- Cross-reference with Dutch ISIL registry (TIER_1 data)
- Query Wikidata using ISIL codes as identifiers
- Crawl institutional websites for descriptions
- Leverage existing digital platform metadata
Expected Outcome: 70%+ Wikidata coverage (at least 436 of 622 institutions)
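Looking up Wikidata items by ISIL code avoids name matching entirely, since Wikidata stores ISIL identifiers under property P791. The sketch below builds such a lookup in batches; the ISIL codes shown are placeholders, and the batch size and helper names are assumptions.

```python
import re


def chunked(items: list, size: int):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_isil_query(isil_codes: list[str]) -> str:
    """Build a SPARQL query mapping ISIL codes (P791) to Wikidata items."""
    for code in isil_codes:
        # Conservative character whitelist, not a full ISIL validator.
        if not re.fullmatch(r"[A-Za-z0-9:/\-]+", code):
            raise ValueError(f"unexpected characters in ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in isil_codes)
    return f"""
SELECT ?item ?isil WHERE {{
  VALUES ?isil {{ {values} }}
  ?item wdt:P791 ?isil .
}}
""".strip()


# Placeholder codes; real ones would come from the Dutch ISIL registry.
codes = ["NL-EXAMPLE-1", "NL-EXAMPLE-2", "NL-EXAMPLE-3"]
queries = [build_isil_query(batch) for batch in chunked(codes, 2)]
print(queries[0])
```

Because an ISIL match is an exact identifier match, results from this query need far less manual validation than fuzzy name matches.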
Phase 5: Japan Mass Enrichment
Target: 12,065 institutions (89.4% of total dataset!)
Current State: 0% Wikidata from local dataset, but global merge shows 58.8%
Analysis: Japan data appears split between:
- Local Japanese dataset (12,065 records, 0% enriched)
- Global dataset (includes ~7,091 Japanese institutions with Wikidata)
Workflow:
- Investigate duplicate detection logic (why were 12,461 duplicates removed?)
- Verify Japanese institution deduplication by name + coordinates
- Run batch Wikidata enrichment for remaining institutions
- Consider Japanese-language Wikidata queries
Expected Outcome: Hold the current 58.8% Wikidata coverage as a floor, then raise it to 70%+
📋 Enrichment Scripts to Create
1. scripts/enrich_critical_countries_batch.py
Purpose: Enrich Georgia, Great Britain, Belgium, US, Luxembourg institutions
Target: 33 institutions (0% → 50%+ Wikidata)
2. scripts/enrich_north_africa_batch.py
Purpose: Enrich Tunisia, Algeria, Libya institutions
Target: 112 institutions (≤16% → 40%+ Wikidata)
3. scripts/enrich_brazil_comprehensive.py
Purpose: Full Brazil enrichment (Wikidata + geocoding + websites)
Target: 212 institutions (13.7% → 50%+ Wikidata, 45.8% → 80%+ geocoding)
4. scripts/enrich_mexico_comprehensive.py
Purpose: Mexico Wikidata enrichment (geocoding already good)
Target: 226 institutions (15.0% → 50%+ Wikidata)
5. scripts/enrich_netherlands_isil.py
Purpose: Netherlands enrichment using ISIL codes
Target: 622 institutions (31.0% → 70%+ Wikidata)
6. scripts/enrich_japan_mass.py
Purpose: Japan mass enrichment + deduplication analysis
Target: 12,065 institutions (maintain/improve 58.8% coverage)
🏆 Success Criteria
Minimum Viable Dataset (MVP)
- ✅ Total institutions: 13,000+ (ACHIEVED: 13,502)
- ✅ Wikidata coverage: 50%+ (ACHIEVED: 55.7%)
- ✅ Geocoding coverage: 50%+ (ACHIEVED: 60.6%)
- ❌ All countries: 30%+ Wikidata (NOT MET: 5 countries at 0%)
Target Goals (Post-Enrichment)
- Total institutions: 15,000+ (add more countries)
- Wikidata coverage: 70%+ globally
- Geocoding coverage: 80%+ globally
- All countries: 50%+ Wikidata minimum
- Description coverage: 80%+ (currently 3.5%)
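The per-country criteria can be checked mechanically against the Wikidata percentages reported earlier. The sketch below covers only the 14 countries named in this summary (the remaining 4 of the 18 are not listed here), and the function name is illustrative.

```python
# Wikidata coverage (%) per country, copied from the tables above.
WIKIDATA_COVERAGE = {
    "JP": 58.8, "NL": 31.0, "MX": 15.0, "BR": 13.7, "CL": 53.9,
    "LY": 16.0, "TN": 1.4, "VN": 38.1, "DZ": 5.3, "GE": 0.0,
    "GB": 0.0, "BE": 0.0, "US": 0.0, "LU": 0.0,
}


def countries_below(coverage: dict[str, float], minimum: float) -> list[str]:
    """Return country codes whose coverage is below the minimum, sorted."""
    return sorted(code for code, value in coverage.items() if value < minimum)


mvp_failures = countries_below(WIKIDATA_COVERAGE, 30.0)     # MVP: 30%+ everywhere
target_failures = countries_below(WIKIDATA_COVERAGE, 50.0)  # target: 50%+ everywhere

print("Below 30%:", mvp_failures)
print("Below 50%:", target_failures)
```

Among the listed countries, ten fall below the 30% MVP bar, not only the five at 0%, and only Japan and Chile already meet the 50% target.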
🔧 Technical Notes
Deduplication Strategy
- 12,461 duplicates removed (48% duplicate rate!)
- Prioritized records with Wikidata Q-numbers
- Kept most complete version when duplicates found
- ID-based deduplication (exact match on the id field)
Investigation Needed: Why such a high duplicate rate?
- Likely overlap between "global" dataset and country-specific datasets
- Japan institutions may be duplicated between sources
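The strategy described above (exact match on id, prefer records with a Wikidata Q-number, then keep the most complete version) can be sketched as follows. The field names and the tie-breaking rule (first record seen wins) are assumptions, not a description of what scripts/unify_all_datasets.py actually does.

```python
# Hypothetical field names used to measure completeness.
FIELDS = ("wikidata_id", "coordinates", "website", "description")


def rank(record: dict) -> tuple:
    """Rank by Wikidata presence first, then by number of filled fields."""
    has_wikidata = bool(record.get("wikidata_id"))
    completeness = sum(1 for field in FIELDS if record.get(field))
    return (has_wikidata, completeness)


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per id; on ties the first record seen is kept."""
    best: dict[str, dict] = {}
    for record in records:
        current = best.get(record["id"])
        if current is None or rank(record) > rank(current):
            best[record["id"]] = record
    return list(best.values())


# Hypothetical records from two overlapping sources.
records = [
    {"id": "jp-001", "source": "japan", "website": "https://example.org",
     "description": "Local record"},
    {"id": "jp-001", "source": "global", "wikidata_id": "Q_EXAMPLE"},
    {"id": "jp-002", "source": "japan"},
]
unique = deduplicate(records)
print(len(records) - len(unique), "duplicate(s) removed")
```

In this example the global record wins for jp-001 because it has a Q-number, even though the local record fills more fields. Merging fields from the losing record instead of discarding it is a design choice worth considering during the investigation.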
Data Quality Issues
- Description Completeness: 96.5% of records missing/incomplete descriptions
  - Most urgent data quality issue
  - Affects usability and discoverability
  - Can be addressed via website crawling
- Coordinate Coverage: 39.4% missing coordinates
  - Libya: 100% missing (50 institutions)
  - Algeria: 100% missing (19 institutions)
  - Vietnam: 100% missing (21 institutions)
  - Georgia: 100% missing (14 institutions)
  - Brazil: 54.2% missing (115 institutions)
- Website URLs: 15.4% missing
  - Lower priority (institutions may not have websites)
  - Focus on institutional websites vs. social media
📚 Chile Success Story - Benchmark for Quality
Chile Enrichment Results (Completed in Batch 19):
- Total: 90 institutions
- Wikidata: 71/90 (78.9%) - EXCEEDS 70% target by 8.9 points
- Geocoding: 78/90 (86.7%)
- Method: Iterative batch enrichment with fuzzy matching
- Scripts: enrich_chile_batch[1-19].py
Key Success Factors:
- Iterative approach (19 batches, gradual refinement)
- Fuzzy matching threshold optimization (0.85+)
- Manual validation of uncertain matches
- Parent organization fallback (when direct match fails)
Replication Strategy: Use Chile's approach as template for other countries
🎉 Achievements
Data Integration
✅ Unified 11 separate datasets into single comprehensive file
✅ Merged 25,963 raw records → 13,502 unique institutions (48.0% deduplication)
✅ Covered 18 countries across at least 5 continents
✅ Preserved provenance tracking for all records
Quality Metrics
✅ Exceeded 50% Wikidata coverage globally (55.7%)
✅ Exceeded 50% geocoding coverage globally (60.6%)
✅ Generated comprehensive enrichment candidates list (13,461 records)
✅ Automated priority scoring (4-level system)
Documentation
✅ Created detailed unification report (UNIFICATION_REPORT.md)
✅ Machine-readable statistics (DATASET_STATISTICS.yaml)
✅ Enrichment roadmap with a 6-phase plan (Phase 0 through Phase 5)
✅ Country-by-country breakdown with quality indicators
💡 Recommendations
Immediate Priorities (This Week)
- Enrich Critical Countries (GE, GB, BE, US, LU) - 33 institutions, 0% coverage
- Fix North Africa Geocoding (DZ, LY, TN) - 112 institutions; 0% coordinates in DZ and LY, 13.0% in TN
- Boost Brazil Coverage - 212 institutions, only 13.7% Wikidata
Short-term Goals (This Month)
- Netherlands Deep Dive - 622 institutions, leverage ISIL codes
- Mexico Enhancement - 226 institutions, build on existing geocoding
- Japan Deduplication Analysis - Investigate high duplicate rate
Long-term Vision (Next Quarter)
- Add New Countries - Target 25 countries total
- Semantic Web Integration - Generate RDF/Turtle exports
- API Development - Create SPARQL endpoint for querying
- Collection-Level Enrichment - Extract collection metadata from websites
📊 Progress Tracking
Overall Progress:
- ✅ Phase 0: Dataset Unification (COMPLETE)
- ⏳ Phase 1: Critical Countries Enrichment (READY TO START)
- ⏳ Phase 2: North Africa Enrichment (READY TO START)
- 📋 Phase 3: Latin America Enrichment (PLANNED)
- 📋 Phase 4: Netherlands Enrichment (PLANNED)
- 📋 Phase 5: Japan Mass Enrichment (PLANNED)
Files Ready for Use:
- ✅ globalglam-20251111.yaml - Master dataset
- ✅ ENRICHMENT_CANDIDATES.yaml - Prioritized enrichment list
- ✅ UNIFICATION_REPORT.md - Detailed statistics
- ✅ DATASET_STATISTICS.yaml - Machine-readable metrics
Scripts to Create:
- 📝 enrich_critical_countries_batch.py
- 📝 enrich_north_africa_batch.py
- 📝 enrich_brazil_comprehensive.py
- 📝 enrich_mexico_comprehensive.py
- 📝 enrich_netherlands_isil.py
- 📝 enrich_japan_mass.py
Status: ✅ READY FOR GLOBAL ENRICHMENT WORKFLOW
Next Command: Create first enrichment script for critical countries (GE, GB, BE, US, LU)