# Global Heritage Institutions - Unification Summary

**Last Updated**: November 11, 2025

**Quick Reference**: Post-Merge Statistics & File Locations

---

## Current Statistics (Post-Merge)

| Metric | Value |
|--------|-------|
| **Total Institutions** | 13,505 |
| **Countries Covered** | 18 |
| **Wikidata Coverage** | 56.6% (7,650 institutions) |
| **Geocoding Coverage** | 60.7% (8,192 institutions) |
| **Master File Size** | ~45 MB |
| **Schema Version** | LinkML v0.2.1 |

---

## Merge History

### November 11, 2025 - Major Enrichment Merge

| Dataset | Institutions | Wikidata % | Status |
|---------|--------------|------------|--------|
| Tunisia Enhanced | 70 (+27 new) | 74.3% | ✅ Merged |
| Georgia GLAM | 14 (new country) | 85.7% | ✅ Merged |
| Latin America Updated | 586 (updated) | 43.4% avg | ✅ Merged |

**Impact**:
- **+27 institutions** (13,478 → 13,505)
- **+153 Wikidata Q-numbers** (55.6% → 56.6%)
- **+1 country** (Georgia added)

**Notable Achievements**:
- 🎯 **Tunisia**: 2.3% → 74.3% Wikidata coverage (+72 pp!)
- 🎯 **Georgia**: 85.7% Wikidata coverage (new country)
- 🎯 **Chile**: 53.9% → 81.7% Wikidata coverage
- 🎯 **Mexico**: 117 → 192 institutions (+75)
- 🎯 **Six countries at 100%**: Belgium, USA, UK, Russia, Denmark, Luxembourg

---

## File Locations

### Master Dataset

**Primary File**:
```
data/instances/all/unified_global_heritage_institutions.yaml
```

- **Size**: ~45 MB
- **Format**: LinkML-compliant YAML
- **Institutions**: 13,505
- **Last Updated**: November 11, 2025

**Backup Files** (in same directory):
```
unified_global_heritage_institutions_backup_20251111_092645.yaml  ← Pre-merge backup
unified_global_heritage_institutions.yaml.backup                  ← Latest backup
unified_global_heritage_institutions.yaml.backup2                 ← Secondary backup
```

### Documentation Files

| File | Purpose | Size |
|------|---------|------|
| `UNIFIED_OVERVIEW.md` | Complete project documentation | ~22 KB |
| `UNIFICATION_REPORT.md` | Detailed merge analysis | ~10 KB |
| `UNIFICATION_SUMMARY.md` | Quick reference (this file) | ~3 KB |
| `DATASET_STATISTICS.yaml` | Machine-readable metrics | ~3.3 KB |
| `ENRICHMENT_PROGRESS.md` | Historical enrichment tracking | ~11 KB |
| `README.md` | Quick reference index | ~7.7 KB |

---

## Superseded Files (Archived)

The following files were merged into the unified dataset and archived:

### Archived on November 11, 2025

**Location**: `data/instances/archive/2025-11-11-pre-merge/`

| File | Source Location | Reason |
|------|-----------------|--------|
| `tunisian_institutions_enhanced.yaml` | `tunisia/` | Merged into unified dataset |
| `georgia_glam_institutions_enriched.yaml` | Root directory | Merged into unified dataset |
| `latin_american_institutions_AUTHORITATIVE.yaml` | Root directory | Updated and merged |

**Archive Documentation**: See `data/instances/archive/2025-11-11-pre-merge/SUPERSEDED.md`

---

## Countries with 100% Wikidata Coverage

As of November 11, 2025, the following countries have achieved complete Wikidata coverage:

| Country | Institutions | Wikidata Q-numbers | Geocoded |
|---------|--------------|-------------------|----------|
| Belgium | 7 | 7 (100%) | 7 (100%) |
| USA | 7 | 7 (100%) | 7 (100%) |
| UK | 4 | 4 (100%) | 3 (75%) |
| Russia | 1 | 1 (100%) | 1 (100%) |
| Denmark | 1 | 1 (100%) | 1 (100%) |
| Luxembourg | 1 | 1 (100%) | 1 (100%) |

---

## Top 10 Countries by Institution Count

| Rank | Country | Institutions | Wikidata % | Geocoded % |
|------|---------|--------------|------------|------------|
| 1 | Japan | 12,065 | 58.8% | 58.8% |
| 2 | Netherlands | 622 | 31.0% | 99.8% |
| 3 | Brazil | 212 | 24.5% | 40.1% |
| 4 | Mexico | 192 | 32.8% | 78.1% |
| 5 | Chile | 180 | 81.7% | 89.4% |
| 6 | Tunisia | 70 | 74.3% | 84.3% |
| 7 | Libya | 50 | 16.0% | 0.0% |
| 8 | Vietnam | 21 | 38.1% | 0.0% |
| 9 | Algeria | 19 | 5.3% | 0.0% |
| 10 | Georgia | 14 | 85.7% | 0.0% |

---

## Enrichment Needs Summary

**Total Institutions Needing Enrichment**: 5,855 (43.4%)

### By Priority Level

**🔴 High Priority** (Low Wikidata coverage, significant size):
- Japan: 4,974 institutions need Q-numbers (58.8% → target 70%)
- Netherlands: 429 institutions need Q-numbers (31.0% → target 50%)
- Brazil: 160 institutions need Q-numbers (24.5% → target 50%)

**🟡 Medium Priority** (Moderate coverage, room for improvement):
- Mexico: 129 institutions need Q-numbers (32.8% → target 50%)
- Libya: 42 institutions need Q-numbers (16.0% → target 50%)
- Chile: 33 institutions need Q-numbers (81.7% → target 90%)

**🟢 Low Priority** (High coverage, minor gaps):
- Tunisia: 18 institutions need Q-numbers (74.3% → target 90%)
- Vietnam: 13 institutions need Q-numbers (38.1% → target 75%)
- Georgia: 2 institutions need Q-numbers (85.7% → target 100%)

---

## Recent Major Improvements

### Tunisia - Regional Leader
- **Before**: 43 institutions, 1 Wikidata Q-number (2.3%)
- **After**: 70 institutions, 52 Wikidata Q-numbers (74.3%)
- **Improvement**: +27 institutions, +51 Q-numbers, +72 percentage points

### Georgia - New Country Addition
- **Before**: Not in dataset
- **After**: 14 institutions, 12 Wikidata Q-numbers (85.7%)
- **Impact**: Second-highest Wikidata coverage globally, behind only the six countries at 100%

### Chile - Major Enhancement
- **Before**: 180 institutions, 97 Wikidata Q-numbers (53.9%)
- **After**: 180 institutions, 147 Wikidata Q-numbers (81.7%)
- **Improvement**: +50 Q-numbers, +27.8 percentage points

### Brazil - Steady Growth
- **Before**: 212 institutions, 29 Wikidata Q-numbers (13.7%)
- **After**: 212 institutions, 52 Wikidata Q-numbers (24.5%)
- **Improvement**: +23 Q-numbers, +10.8 percentage points

### Mexico - Expansion & Enrichment
- **Before**: 117 institutions, 0 Wikidata Q-numbers (0%)
- **After**: 192 institutions, 63 Wikidata Q-numbers (32.8%)
- **Improvement**: +75 institutions, +63 Q-numbers, +32.8 percentage points

---

## Next Steps

### Immediate (This Week)
1. ✅ Complete documentation updates (this file)
2. 📦 Archive superseded files to `archive/2025-11-11-pre-merge/`
3. 🌍 Start geocoding for Libya, Vietnam, Algeria, Georgia

### Short-Term (This Month)
4. 🇯🇵 Japan Batch 1 enrichment: Target 50 major institutions
5. 🇳🇱 Netherlands Batch 1 enrichment: Target 30 provincial museums
6. 🇧🇷 Brazil Batch 8 enrichment: Target 25 university collections
7. 🇲🇽 Mexico Batch 3 enrichment: Target 20 federal institutions

### Long-Term (Q1 2026)
8. 🌏 Expand to Southeast Asia (Vietnam, Thailand, Indonesia)
9. 🌍 Expand to Middle East (UAE, Saudi Arabia, Jordan)
10. 🌍 Expand to Africa (Egypt, Kenya, South Africa)

---

## Data Quality Tiers

| Tier | Description | Count | % |
|------|-------------|-------|---|
| **TIER_1** | Authoritative (registries) | 622 | 4.6% |
| **TIER_2** | Verified (websites) | 4,891 | 36.2% |
| **TIER_3** | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| **TIER_4** | Inferred (NLP extraction) | 342 | 2.5% |

**Goal**: Reduce TIER_4 to <1% through verification and website crawling.

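
The arithmetic behind the TIER_4 goal can be checked with a short sketch. It assumes the tier counts from the table above and that the dataset total stays at 13,505 while records are promoted to a higher tier; the function names are illustrative and not part of the project's tooling.

```python
# Tier counts copied from the Data Quality Tiers table.
TIER_COUNTS = {"TIER_1": 622, "TIER_2": 4891, "TIER_3": 7650, "TIER_4": 342}


def tier_shares(counts):
    """Return each tier's share of the total as a percentage, rounded to 0.1."""
    total = sum(counts.values())
    return {tier: round(100 * n / total, 1) for tier, n in counts.items()}


def records_to_promote(counts, tier="TIER_4", max_percent=1):
    """Smallest number of records that must leave `tier` so that its share
    drops strictly below `max_percent`, assuming the total is unchanged."""
    total = sum(counts.values())
    # Largest count n satisfying n * 100 < max_percent * total (integer math).
    allowed = (max_percent * total - 1) // 100
    return max(0, counts[tier] - allowed)


if __name__ == "__main__":
    print(tier_shares(TIER_COUNTS))
    print(records_to_promote(TIER_COUNTS))  # 207 under the stated assumptions
```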

---

## Technical References

### Schema Architecture
- **Version**: LinkML v0.2.1 (modular)
- **Modules**: 6 specialized modules (core, enums, provenance, collections, dutch, main)
- **Documentation**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`

### Persistent Identifiers (GHCID)
- **Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]`
- **Collision Suffix**: Native language institution name in snake_case (NOT Wikidata Q-numbers)
- **UUID Strategies**: v5 (SHA-1 primary), v8 (SHA-256 secondary), v7 (database only)
- **Documentation**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **Collision Resolution**: `/Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md`

### Ontology Alignment
- **TOOI**: Dutch government organizations
- **CPOV**: EU Core Public Organisation Vocabulary
- **Schema.org**: Web semantics
- **CIDOC-CRM**: Cultural heritage domain

---

## For More Information

| Document | Purpose |
|----------|---------|
| **UNIFIED_OVERVIEW.md** | Comprehensive project documentation (read this for details) |
| **UNIFICATION_REPORT.md** | Detailed merge analysis with country-by-country breakdowns |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics (parse this for automation) |
| **ENRICHMENT_PROGRESS.md** | Historical enrichment tracking (batch-by-batch progress) |
| **README.md** | Quick reference index (start here for navigation) |

---

**Document Version**: 1.0

**Created**: November 11, 2025

**Maintained By**: GLAM Data Extraction Project

**Master Dataset Location**: `data/instances/all/unified_global_heritage_institutions.yaml`
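
As a sketch of the GHCID format listed under Persistent Identifiers, the snippet below joins the five components with hyphens and appends the snake_case collision suffix only when a native name is supplied. The function names, the example component codes (`NL`, `NH`, `AMS`, `M`, `RM`), and the exact snake_case rule are assumptions made for illustration; the project's actual rules are documented in `PERSISTENT_IDENTIFIERS.md`.

```python
import re


def snake_case(name):
    """Lowercase the name and join its runs of letters/digits with underscores.

    Assumption: punctuation and whitespace act as separators and are dropped.
    """
    # [^\W_]+ matches runs of letters and digits, excluding the underscore.
    words = re.findall(r"[^\W_]+", name.lower())
    return "_".join(words)


def build_ghcid(country, region, city, inst_type, abbrev, native_name=None):
    """Assemble {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]."""
    ghcid = "-".join([country, region, city, inst_type, abbrev])
    if native_name:
        # Optional collision suffix: the native-language name, not a Q-number.
        ghcid += "-" + snake_case(native_name)
    return ghcid


if __name__ == "__main__":
    # Hypothetical component codes, used only to show the shape of the result.
    print(build_ghcid("NL", "NH", "AMS", "M", "RM"))
    print(build_ghcid("NL", "NH", "AMS", "M", "RM", "Rijksmuseum Amsterdam"))
```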