Global Heritage Institutions - Unification Summary
Last Updated: November 11, 2025
Quick Reference: Post-Merge Statistics & File Locations
Current Statistics (Post-Merge)
| Metric | Value |
|---|---|
| Total Institutions | 13,505 |
| Countries Covered | 18 |
| Wikidata Coverage | 56.6% (7,650 institutions) |
| Geocoding Coverage | 60.7% (8,192 institutions) |
| Master File Size | ~45 MB |
| Schema Version | LinkML v0.2.1 |
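The coverage percentages in the table can be reproduced from the raw counts. The following is a minimal sketch, assuming percentages are rounded to one decimal place; the function name `coverage_pct` is illustrative and not part of the project's tooling.

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken from the statistics table above.
TOTAL_INSTITUTIONS = 13_505
print(coverage_pct(7_650, TOTAL_INSTITUTIONS))  # Wikidata coverage  -> 56.6
print(coverage_pct(8_192, TOTAL_INSTITUTIONS))  # Geocoding coverage -> 60.7
```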
Merge History
November 11, 2025 - Major Enrichment Merge
| Dataset | Institutions | Wikidata % | Status |
|---|---|---|---|
| Tunisia Enhanced | 70 (+27 new) | 74.3% | ✅ Merged |
| Georgia GLAM | 14 (new country) | 85.7% | ✅ Merged |
| Latin America Updated | 586 (updated) | 43.4% avg | ✅ Merged |
Impact:
- +27 institutions (13,478 → 13,505)
- +153 Wikidata Q-numbers (55.6% → 56.6%)
- +1 country (Georgia added)
Notable Achievements:
- 🎯 Tunisia: 2.3% → 74.3% Wikidata coverage (+72 pp!)
- 🎯 Georgia: 85.7% Wikidata coverage (new country)
- 🎯 Chile: 53.9% → 81.7% Wikidata coverage
- 🎯 Mexico: 117 → 192 institutions (+75)
- 🎯 Six countries at 100% Wikidata coverage: Belgium, USA, UK, Russia, Denmark, Luxembourg
File Locations
Master Dataset
Primary File:
data/instances/all/unified_global_heritage_institutions.yaml
- Size: ~45 MB
- Format: LinkML-compliant YAML
- Institutions: 13,505
- Last Updated: November 11, 2025
Backup Files (in same directory):
unified_global_heritage_institutions_backup_20251111_092645.yaml ← Pre-merge backup
unified_global_heritage_institutions.yaml.backup ← Latest backup
unified_global_heritage_institutions.yaml.backup2 ← Secondary backup
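The pre-merge backup name appears to follow a `<stem>_backup_<YYYYMMDD>_<HHMMSS>.yaml` pattern. The sketch below builds such a name with the standard library; the timestamp interpretation is inferred from the file name above, and `backup_name` is a hypothetical helper, not an existing project script.

```python
from datetime import datetime
from pathlib import Path


def backup_name(master: Path, when: datetime) -> Path:
    """Build a timestamped backup path next to the master file.

    Assumes the pattern <stem>_backup_<YYYYMMDD>_<HHMMSS><suffix>,
    inferred from the pre-merge backup file name.
    """
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return master.with_name(f"{master.stem}_backup_{stamp}{master.suffix}")


master = Path("data/instances/all/unified_global_heritage_institutions.yaml")
print(backup_name(master, datetime(2025, 11, 11, 9, 26, 45)).name)
# -> unified_global_heritage_institutions_backup_20251111_092645.yaml
```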
Documentation Files
| File | Purpose | Size |
|---|---|---|
| UNIFIED_OVERVIEW.md | Complete project documentation | ~22 KB |
| UNIFICATION_REPORT.md | Detailed merge analysis | ~10 KB |
| UNIFICATION_SUMMARY.md | Quick reference (this file) | ~3 KB |
| DATASET_STATISTICS.yaml | Machine-readable metrics | ~3.3 KB |
| ENRICHMENT_PROGRESS.md | Historical enrichment tracking | ~11 KB |
| README.md | Quick reference index | ~7.7 KB |
Superseded Files (Archived)
The following files were merged into the unified dataset and archived:
Archived on November 11, 2025
Location: data/instances/archive/2025-11-11-pre-merge/
| File | Source Location | Reason |
|---|---|---|
| tunisian_institutions_enhanced.yaml | tunisia/ | Merged into unified dataset |
| georgia_glam_institutions_enriched.yaml | Root directory | Merged into unified dataset |
| latin_american_institutions_AUTHORITATIVE.yaml | Root directory | Updated and merged |
Archive Documentation: See data/instances/archive/2025-11-11-pre-merge/SUPERSEDED.md
Countries with 100% Wikidata Coverage
As of November 11, 2025, the following countries have achieved complete Wikidata coverage:
| Country | Institutions | Wikidata Q-numbers | Geocoded |
|---|---|---|---|
| Belgium | 7 | 7 (100%) | 7 (100%) |
| USA | 7 | 7 (100%) | 7 (100%) |
| UK | 4 | 4 (100%) | 3 (75%) |
| Russia | 1 | 1 (100%) | 1 (100%) |
| Denmark | 1 | 1 (100%) | 1 (100%) |
| Luxembourg | 1 | 1 (100%) | 1 (100%) |
Top 10 Countries by Institution Count
| Rank | Country | Institutions | Wikidata % | Geocoded % |
|---|---|---|---|---|
| 1 | Japan | 12,065 | 58.8% | 58.8% |
| 2 | Netherlands | 622 | 31.0% | 99.8% |
| 3 | Brazil | 212 | 24.5% | 40.1% |
| 4 | Mexico | 192 | 32.8% | 78.1% |
| 5 | Chile | 180 | 81.7% | 89.4% |
| 6 | Tunisia | 70 | 74.3% | 84.3% |
| 7 | Libya | 50 | 16.0% | 0.0% |
| 8 | Vietnam | 21 | 38.1% | 0.0% |
| 9 | Algeria | 19 | 5.3% | 0.0% |
| 10 | Georgia | 14 | 85.7% | 0.0% |
Enrichment Needs Summary
Total Institutions Needing Enrichment: 5,855 (43.4%)
By Priority Level
🔴 High Priority (Low Wikidata coverage, significant size):
- Japan: 4,974 institutions need Q-numbers (58.8% → target 70%)
- Netherlands: 429 institutions need Q-numbers (31.0% → target 50%)
- Brazil: 160 institutions need Q-numbers (24.5% → target 50%)
🟡 Medium Priority (Moderate coverage, room for improvement):
- Mexico: 129 institutions need Q-numbers (32.8% → target 50%)
- Libya: 42 institutions need Q-numbers (16.0% → target 50%)
- Chile: 33 institutions need Q-numbers (81.7% → target 90%)
🟢 Low Priority (High coverage or small dataset, minor gaps):
- Tunisia: 18 institutions need Q-numbers (74.3% → target 90%)
- Vietnam: 13 institutions need Q-numbers (38.1% → target 75%)
- Georgia: 2 institutions need Q-numbers (85.7% → target 100%)
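The counts listed above are all institutions that still lack a Q-number, which is more than the number required to reach each stated target. The sketch below estimates the smaller figure, assuming the institution count stays fixed while Q-numbers are added; `qnumbers_to_target` is an illustrative name, and Japan's current count (7,091) is derived here as 12,065 minus 4,974.

```python
def qnumbers_to_target(total: int, with_qnumber: int, target_pct: int) -> int:
    """Additional Q-numbers needed for coverage to reach at least target_pct.

    Assumes the total number of institutions does not change.
    """
    # Integer ceiling of total * target_pct / 100, avoiding float rounding.
    required = (total * target_pct + 99) // 100
    return max(0, required - with_qnumber)


# (total, currently with Q-number, target %) taken or derived from this summary.
print(qnumbers_to_target(12_065, 7_091, 70))  # Japan   -> 1355
print(qnumbers_to_target(180, 147, 90))       # Chile   -> 15
print(qnumbers_to_target(14, 12, 100))        # Georgia -> 2
```

Under these assumptions, Japan would reach its 70% target after roughly 1,355 new Q-numbers rather than all 4,974.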
Recent Major Improvements
Tunisia - Regional Leader
- Before: 43 institutions, 1 Wikidata Q-number (2.3%)
- After: 70 institutions, 52 Wikidata Q-numbers (74.3%)
- Improvement: +27 institutions, +51 Q-numbers, +72 percentage points
Georgia - New Country Addition
- Before: Not in dataset
- After: 14 institutions, 12 Wikidata Q-numbers (85.7%)
- Impact: Highest Wikidata coverage among the top 10 countries by institution count, behind the six countries already at 100%
Chile - Major Enhancement
- Before: 180 institutions, 97 Wikidata Q-numbers (53.9%)
- After: 180 institutions, 147 Wikidata Q-numbers (81.7%)
- Improvement: +50 Q-numbers, +27.8 percentage points
Brazil - Steady Growth
- Before: 212 institutions, 29 Wikidata Q-numbers (13.7%)
- After: 212 institutions, 52 Wikidata Q-numbers (24.5%)
- Improvement: +23 Q-numbers, +10.8 percentage points
Mexico - Expansion & Enrichment
- Before: 117 institutions, 0 Wikidata Q-numbers (0%)
- After: 192 institutions, 63 Wikidata Q-numbers (32.8%)
- Improvement: +75 institutions, +63 Q-numbers, +32.8 percentage points
Next Steps
Immediate (This Week)
- ✅ Complete documentation updates (this file)
- 📦 Archive superseded files to archive/2025-11-11-pre-merge/
- 🌍 Start geocoding for Libya, Vietnam, Algeria, Georgia
Short-Term (This Month)
- 🇯🇵 Japan Batch 1 enrichment: Target 50 major institutions
- 🇳🇱 Netherlands Batch 1 enrichment: Target 30 provincial museums
- 🇧🇷 Brazil Batch 8 enrichment: Target 25 university collections
- 🇲🇽 Mexico Batch 3 enrichment: Target 20 federal institutions
Long-Term (Q1 2026)
- 🌏 Expand to Southeast Asia (Vietnam, Thailand, Indonesia)
- 🌍 Expand to Middle East (UAE, Saudi Arabia, Jordan)
- 🌍 Expand to Africa (Egypt, Kenya, South Africa)
Data Quality Tiers
| Tier | Description | Count | % |
|---|---|---|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |
Goal: Reduce TIER_4 to <1% through verification and website crawling.
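To make the goal concrete, the sketch below computes how many TIER_4 records would have to be upgraded for the tier to fall strictly below 1%. It assumes the total institution count stays at 13,505 and that upgraded records simply move to another tier; `tier4_upgrades_needed` is an illustrative name.

```python
def tier4_upgrades_needed(total: int, tier4: int) -> int:
    """Number of TIER_4 records to upgrade so that TIER_4 is strictly below 1%.

    Assumes total stays constant (records change tier, none are removed).
    """
    # Largest count n with n / total < 1%, i.e. 100 * n < total.
    max_allowed = (total - 1) // 100
    return max(0, tier4 - max_allowed)


print(tier4_upgrades_needed(13_505, 342))  # -> 207 (leaves 135 records, about 0.9996%)
```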
Technical References
Schema Architecture
- Version: LinkML v0.2.1 (modular)
- Modules: 6 specialized modules (core, enums, provenance, collections, dutch, main)
- Documentation: /Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md
Persistent Identifiers (GHCID)
- Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
- Collision Suffix: Native language institution name in snake_case (NOT Wikidata Q-numbers)
- UUID Strategies: v5 (SHA-1 primary), v8 (SHA-256 secondary), v7 (database only)
- Documentation: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
- Collision Resolution: /Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md
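The GHCID format and the v5 UUID strategy can be illustrated with a short sketch. Everything specific here is an assumption: the component values are made up, the snake_case rule is a simple approximation, and `GHCID_NAMESPACE` is a placeholder because the project's actual namespace UUID is not given in this summary. The authoritative rules are in the documents referenced above.

```python
import re
import uuid
from typing import Optional

# Placeholder namespace (assumption); the project's real namespace is not stated here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def to_snake_case(name: str) -> str:
    """Lowercase the name and collapse non-word runs into underscores (assumed rule)."""
    return re.sub(r"[^\w]+", "_", name.strip().lower()).strip("_")


def build_ghcid(country: str, region: str, city: str, inst_type: str,
                abbrev: str, native_name: Optional[str] = None) -> str:
    """Join the five components; append the native-name suffix only on collision."""
    ghcid = "-".join(part.upper() for part in (country, region, city, inst_type, abbrev))
    if native_name:
        ghcid += "-" + to_snake_case(native_name)
    return ghcid


def ghcid_uuid5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic SHA-1 based (version 5) UUID from the GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


# Hypothetical component values, for illustration only.
base = build_ghcid("nl", "nh", "ams", "m", "rm")
print(base)  # -> NL-NH-AMS-M-RM
print(build_ghcid("nl", "nh", "ams", "m", "rm", native_name="Rijksmuseum Amsterdam"))
# -> NL-NH-AMS-M-RM-rijksmuseum_amsterdam
print(ghcid_uuid5(base).version)  # -> 5
```

Because a v5 UUID is a hash of the namespace and the name, the same GHCID always maps to the same UUID, which fits its use as the primary identifier.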
Ontology Alignment
- TOOI: Dutch government organizations
- CPOV: EU Core Public Organisation Vocabulary
- Schema.org: Web semantics
- CIDOC-CRM: Cultural heritage domain
For More Information
| Document | Purpose |
|---|---|
| UNIFIED_OVERVIEW.md | Comprehensive project documentation (read this for details) |
| UNIFICATION_REPORT.md | Detailed merge analysis with country-by-country breakdowns |
| DATASET_STATISTICS.yaml | Machine-readable metrics (parse this for automation) |
| ENRICHMENT_PROGRESS.md | Historical enrichment tracking (batch-by-batch progress) |
| README.md | Quick reference index (start here for navigation) |
Document Version: 1.0
Created: November 11, 2025
Maintained By: GLAM Data Extraction Project
Master Dataset Location: data/instances/all/unified_global_heritage_institutions.yaml