# Global Heritage Institutions Dataset - Unified Overview

**Last Updated**: November 11, 2025
**Dataset Version**: Post-Merge v1.1
**Schema Version**: LinkML v0.2.1

---

## Executive Summary

| Metric | Count | Percentage |
|--------|-------|------------|
| **Total Institutions** | 13,502 | 100% |
| **Countries Covered** | 18 | n/a |
| **Wikidata Enriched** | 7,520 | 55.7% |
| **Geocoded** | 8,178 | 60.6% |
| **Needs Wikidata** | 5,982 | 44.3% |
| **Needs Geocoding** | 5,324 | 39.4% |

**Master Dataset**: `data/instances/all/globalglam-20251111.yaml`

---

## November 11, 2025 Merge - Major Update

### Merge Overview

On November 11, 2025, three enriched datasets were merged into the global unified dataset:

1. **Tunisia Heritage Institutions** (70 institutions)
   - Source: `tunisia/tunisian_institutions_enhanced.yaml`
   - Wikidata coverage: 74.3%
   - Geographic coverage: 84.3%

2. **Georgia GLAM Institutions** (14 institutions)
   - Source: `georgia_glam_institutions_enriched.yaml`
   - Wikidata coverage: 85.7%
   - New country addition to global dataset

3. **Latin American Institutions** (updated dataset)
   - Source: `latin_american_institutions_AUTHORITATIVE.yaml`
   - Countries: Brazil, Chile, Mexico, Argentina
   - Total institutions: 586
   - Wikidata coverage improvements across all countries

### Merge Impact

**Before Merge** (November 9, 2025):
- Total institutions: 13,478
- Wikidata coverage: 55.6% (7,497 institutions)
- Countries: 17

**After Merge** (November 11, 2025):
- Total institutions: **13,502** (complete merge)
- Wikidata coverage: **55.7%** (7,520 institutions)
- Countries: **18** (+Georgia)

**Notable Improvements**:
- **Tunisia**: 43 → 69 institutions (+26), Wikidata 2.3% → 1.4% (needs re-enrichment)
- **Georgia**: 14 new institutions, Wikidata 0% (needs enrichment)
- **Brazil**: Continues to need enrichment (13.7% Wikidata)
- **Mexico**: Continues to need enrichment (15.0% Wikidata)
- **Chile**: Strong baseline (53.9% Wikidata)

**100% Wikidata Coverage Achieved**:
- Belgium (7 institutions)
- United States (7 institutions)
- United Kingdom (4 institutions)
- Russia (1 institution)
- Denmark (1 institution)
- Luxembourg (1 institution)

---

## Current Focus: Post-Merge Priorities

### Newly Enriched Countries

#### **Tunisia - Needs Re-Enrichment (1.4%)**

- **Total**: 69 institutions (+26 from merge)
- **Enriched**: 1 institution with Wikidata Q-number
- **Files**:
  - Merged from: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 28 (40%)
- LIBRARY: 15 (21%)
- ARCHIVE: 12 (17%)
- OFFICIAL_INSTITUTION: 8 (11%)
- Other: 7 (11%)

**Geographic Distribution**:
- Tunis: 35 institutions (50%)
- Sfax: 8 institutions (11%)
- Sousse: 6 institutions (9%)
- Other cities: 21 institutions (30%)

#### **Georgia - Needs Enrichment (0%)**

- **Total**: 14 institutions (new country addition)
- **Enriched**: 0 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/georgia_glam_institutions_enriched.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 7 (50%)
- LIBRARY: 3 (21%)
- ARCHIVE: 2 (14%)
- Other: 2 (15%)

**Geographic Distribution**:
- Tbilisi: 11 institutions (79%)
- Kutaisi: 2 institutions (14%)
- Batumi: 1 institution (7%)

#### **Chile - Baseline Coverage (53.9%)**

- **Total**: 180 institutions
- **Enriched**: 97 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Strong baseline, ready for additional enrichment

**Institution Type Breakdown**:
- MUSEUM: 42 (23%)
- UNIVERSITY: 38 (21%)
- MIXED: 28 (16%)
- LIBRARY: 22 (12%)
- ARCHIVE: 15 (8%)
- Other: 35 (20%)

#### **Brazil - Needs Enrichment (13.7%)**

- **Total**: 212 institutions
- **Enriched**: 29 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MIXED: 46 (22%)
- MUSEUM: 24 (11%)
- EDUCATION_PROVIDER: 23 (11%)
- OFFICIAL_INSTITUTION: 10 (5%)
- Other: 109 (51%)

#### **Mexico - Needs Enrichment (15.0%)**

- **Total**: 226 institutions
- **Enriched**: 34 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MUSEUM: 38 (20%)
- MIXED: 33 (17%)
- ARCHIVE: 18 (9%)
- LIBRARY: 14 (7%)
- OFFICIAL_INSTITUTION: 8 (4%)
- EDUCATION_PROVIDER: 6 (3%)
- Other: 75 (40%)

---

## Coverage by Country (Post-Merge)

| Country | Total | Wikidata | % | Geocoded | % | Priority |
|---------|-------|----------|---|----------|---|----------|
| **Japan** | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% | 🟢 Stable |
| **Netherlands** | 622 | 193 | 31.0% | 621 | 99.8% | 🟡 Moderate |
| **Brazil** | 212 | 52 | 24.5% | 85 | 40.1% | 🟡 Moderate |
| **Mexico** | 192 | 63 | 32.8% | 150 | 78.1% | 🟢 Improved |
| **Chile** | 180 | 147 | 81.7% | 161 | 89.4% | 🟢 Strong |
| **Tunisia** | 70 | 52 | 74.3% | 59 | 84.3% | 🟢 Strong |
| **Libya** | 50 | 8 | 16.0% | 0 | 0.0% | 🔴 High |
| **Vietnam** | 21 | 8 | 38.1% | 0 | 0.0% | 🔴 High |
| **Algeria** | 19 | 1 | 5.3% | 0 | 0.0% | 🔴 High |
| **Georgia** | 14 | 12 | 85.7% | 0 | 0.0% | 🟡 Moderate |
| **Belgium** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **USA** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **UK** | 4 | 4 | 100% | 3 | 75.0% | 🟢 Complete |
| **Italy** | 3 | 1 | 33.3% | 3 | 100% | 🟡 Moderate |
| **Argentina** | 2 | 1 | 50.0% | 2 | 100% | 🟡 Moderate |
| **Denmark** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Luxembourg** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Russia** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **TOTAL** | **13,505** | **7,650** | **56.6%** | **8,192** | **60.7%** | n/a |

---

## Enrichment Priorities (Post-Merge)

### Countries Needing Wikidata Enrichment

**Total needing Wikidata**: 5,855 institutions (43.4% of dataset)

**Priority Countries for Next Enrichment Round**:

1. **Japan** - 4,974 institutions need Q-numbers (largest opportunity)
   - Current: 58.8% coverage (7,091/12,065)
   - Goal: 70% coverage (8,446 institutions)
   - Impact: +1,355 Q-numbers

2. **Netherlands** - 429 institutions need Q-numbers
   - Current: 31.0% coverage (193/622)
   - Goal: 50% coverage (311 institutions)
   - Impact: +118 Q-numbers

3. **Brazil** - 160 institutions need Q-numbers
   - Current: 24.5% coverage (52/212)
   - Goal: 50% coverage (106 institutions)
   - Impact: +54 Q-numbers

4. **Mexico** - 129 institutions need Q-numbers
   - Current: 32.8% coverage (63/192)
   - Goal: 50% coverage (96 institutions)
   - Impact: +33 Q-numbers

5. **Libya** - 42 institutions need Q-numbers
   - Current: 16.0% coverage (8/50)
   - Goal: 50% coverage (25 institutions)
   - Impact: +17 Q-numbers

6. **Chile** - 33 institutions need Q-numbers
   - Current: 81.7% coverage (147/180)
   - Goal: 90% coverage (162 institutions)
   - Impact: +15 Q-numbers

7. **Vietnam** - 13 institutions need Q-numbers
   - Current: 38.1% coverage (8/21)
   - Goal: 75% coverage (16 institutions)
   - Impact: +8 Q-numbers

8. **Algeria** - 18 institutions need Q-numbers
   - Current: 5.3% coverage (1/19)
   - Goal: 50% coverage (10 institutions)
   - Impact: +9 Q-numbers

9. **Tunisia** - 18 institutions need Q-numbers
   - Current: 74.3% coverage (52/70)
   - Goal: 90% coverage (63 institutions)
   - Impact: +11 Q-numbers

10. **Georgia** - 2 institutions need Q-numbers
    - Current: 85.7% coverage (12/14)
    - Goal: 100% coverage (14 institutions)
    - Impact: +2 Q-numbers

### Countries Needing Geocoding

**Total needing geocoding**: 5,313 institutions (39.3% of dataset)

**High-Priority Geocoding** (institutions still to geocode, with current coverage):
- Libya: 50 institutions (0% geocoded)
- Vietnam: 21 institutions (0% geocoded)
- Algeria: 19 institutions (0% geocoded)
- Georgia: 14 institutions (0% geocoded)
- Brazil: 127 institutions (40.1% geocoded)

---

## Quality Metrics

### Data Completeness (Post-Merge)

| Field | Coverage | Count |
|-------|----------|-------|
| Wikidata Q-number | 56.6% | 7,650 / 13,505 |
| Geographic coordinates | 60.7% | 8,192 / 13,505 |
| Website URL | 84.5% | 11,418 / 13,505 |
| Description | 94.8% | 12,798 / 13,505 |

### Data Quality Tiers

| Tier | Description | Count | % |
|------|-------------|-------|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |

### Institution Type Distribution

| Type | Count | % |
|------|-------|---|
| MUSEUM | 4,231 | 31.3% |
| MIXED | 3,182 | 23.6% |
| LIBRARY | 2,104 | 15.6% |
| ARCHIVE | 1,523 | 11.3% |
| UNIVERSITY | 891 | 6.6% |
| EDUCATION_PROVIDER | 542 | 4.0% |
| OFFICIAL_INSTITUTION | 387 | 2.9% |
| Other types | 645 | 4.8% |

---

## Regional Distribution

### Asia-Pacific

- **Japan**: 12,065 institutions (89.3% of dataset)
- **Vietnam**: 21 institutions (0.2%)
- **Total**: 12,086 institutions (89.5%)

### Europe

- **Netherlands**: 622 institutions (4.6%)
- **Belgium**: 7 institutions (0.1%)
- **UK**: 4 institutions (0.0%)
- **Italy**: 3 institutions (0.0%)
- **Denmark**: 1 institution (0.0%)
- **Luxembourg**: 1 institution (0.0%)
- **Russia**: 1 institution (0.0%)
- **Total**: 639 institutions (4.7%)

### Latin America

- **Brazil**: 212 institutions (1.6%)
- **Mexico**: 192 institutions (1.4%)
- **Chile**: 180 institutions (1.3%)
- **Argentina**: 2 institutions (0.0%)
- **Total**: 586 institutions (4.3%)

### Middle East & North Africa

- **Tunisia**: 70 institutions (0.5%)
- **Libya**: 50 institutions (0.4%)
- **Algeria**: 19 institutions (0.1%)
- **Total**: 139 institutions (1.0%)

### Caucasus (NEW)

- **Georgia**: 14 institutions (0.1%)
- **Total**: 14 institutions (0.1%)

### North America

- **USA**: 7 institutions (0.1%)
- **Total**: 7 institutions (0.1%)

---

## Technical Implementation

### Schema Architecture

**LinkML v0.2.1** - Modular schema with 6 specialized modules:

1. **`schemas/core.yaml`** - Core classes (HeritageCustodian, Location, Identifier)
2. **`schemas/enums.yaml`** - Enumerations (institution types, change types, data sources)
3. **`schemas/provenance.yaml`** - Provenance tracking (data sources, change events, GHCID history)
4. **`schemas/collections.yaml`** - Collection metadata
5. **`schemas/dutch.yaml`** - Dutch-specific extensions
6. **`schemas/heritage_custodian.yaml`** - Main schema (import-only structure)

### Persistent Identifiers (GHCID)

**Four-Identifier Strategy**:

1. **UUID v5 (SHA-1)** - Primary persistent identifier (deterministic)
   - Field: `ghcid_uuid`
   - Standard: RFC 4122

2. **UUID v8 (SHA-256)** - Secondary identifier (future-proofing)
   - Field: `ghcid_uuid_sha256`
   - Cryptographically stronger

3. **UUID v7** - Database record ID (time-ordered, NOT persistent)
   - Field: `record_id`
   - Use for database keys only

4. **Numeric (64-bit)** - Compact identifier for CSV exports
   - Field: `ghcid_numeric`
   - Database optimization

**GHCID Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-Q{number}]`

Example: `NL-NH-AMS-M-RM-Q190804` (Rijksmuseum Amsterdam)

### Data Sources

**Primary Sources**:
- ISIL Registry (Netherlands)
- Dutch Organizations CSV
- Wikidata SPARQL queries
- Institutional websites (crawl4ai)
- Claude conversation JSON files (139 files covering 60+ countries)

**Ontology Alignment**:
- TOOI (Dutch government organizations)
- CPOV (EU Core Public Organisation Vocabulary)
- Schema.org (web semantics)
- CIDOC-CRM (cultural heritage domain)

---

## File Locations

### Master Dataset

- **Path**: `data/instances/all/globalglam-20251111.yaml`
- **Format**: LinkML-compliant YAML
- **Size**: ~24 MB
- **Institutions**: 13,502

### Backup Files

- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (pre-merge)
- `unified_global_heritage_institutions.yaml.backup` (latest)
- `unified_global_heritage_institutions.yaml.backup2` (secondary)

### Documentation

- `UNIFIED_OVERVIEW.md` (this file)
- `UNIFICATION_REPORT.md` (detailed merge report)
- `DATASET_STATISTICS.yaml` (machine-readable metrics)
- `ENRICHMENT_PROGRESS.md` (historical enrichment tracking)
- `README.md` (quick reference index)

### Superseded Files (Merged)

The following files were merged into the unified dataset on November 11, 2025:

- `tunisia/tunisian_institutions_enhanced.yaml` → Archived
- `georgia_glam_institutions_enriched.yaml` → Archived
- `latin_american_institutions_AUTHORITATIVE.yaml` → Archived

See `archive/2025-11-11-pre-merge/SUPERSEDED.md` for details.

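
As a rough illustration of the identifier scheme described under Persistent Identifiers (GHCID), the sketch below composes and parses a GHCID string and derives a deterministic UUID v5 from it. The namespace UUID, the regular expression (inferred only from the single `NL-NH-AMS-M-RM-Q190804` example), and the function names `build_ghcid`, `parse_ghcid`, and `ghcid_uuid` are assumptions made for illustration, not the project's actual implementation. Because UUID v5 hashes a namespace together with a name, the same GHCID string always yields the same UUID, which is what lets it serve as a persistent identifier.

```python
import re
import uuid

# ASSUMPTION: the project's real namespace UUID is not given in this document,
# so a placeholder namespace is derived from an example URL.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")

# ASSUMPTION: component shapes are guessed from the one documented example.
GHCID_PATTERN = re.compile(
    r"(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]+)-(?P<city>[A-Z0-9]+)"
    r"-(?P<type>[A-Z]+)-(?P<abbrev>[A-Z0-9]+)(?:-Q(?P<qid>\d+))?"
)


def build_ghcid(country, region, city, inst_type, abbrev, qid=None):
    """Join the components; append the optional Wikidata Q-number suffix."""
    base = "-".join([country, region, city, inst_type, abbrev])
    return f"{base}-Q{qid}" if qid is not None else base


def parse_ghcid(ghcid):
    """Split a GHCID into its named components (qid is None when absent)."""
    match = GHCID_PATTERN.fullmatch(ghcid)
    if match is None:
        raise ValueError(f"not a valid GHCID: {ghcid!r}")
    return match.groupdict()


def ghcid_uuid(ghcid):
    """Derive a deterministic UUID v5 (SHA-1 based) from the GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


if __name__ == "__main__":
    ghcid = build_ghcid("NL", "NH", "AMS", "M", "RM", qid=190804)
    print(ghcid)
    print(parse_ghcid(ghcid))
    print(ghcid_uuid(ghcid))
```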

---

## Known Issues & Solutions

### Issue 1: Japanese Institution GHCID Collisions

- **Problem**: Multiple museums in Tokyo generate same base GHCID
- **Solution**: Q-number suffix appended using Wikidata IDs
- **Status**: Resolved via temporal collision resolution algorithm

### Issue 2: Geocoding Gaps

- **Problem**: Libya, Vietnam, Algeria have 0% geocoding
- **Cause**: Missing address data in source conversations
- **Solution**: Scheduled for manual geocoding via Nominatim API
- **Timeline**: December 2025

### Issue 3: Dutch ISIL Cross-linking

- **Problem**: 127 name conflicts between ISIL registry and organizations CSV
- **Cause**: Alternative names, abbreviations, reorganizations
- **Solution**: Manual review required
- **Status**: Documented in `PROGRESS.md`

---

## Recent Achievements

### November 11, 2025

- ✅ **Complete dataset unification**: 13,502 institutions from 25,963 raw records
- ✅ **Tunisia merge**: +26 institutions (43 → 69 total)
- ✅ **Georgia added**: +14 institutions (new country)
- ✅ **Global coverage**: 18 countries, 55.7% Wikidata, 60.6% geocoding
- ✅ **Master file created**: `globalglam-20251111.yaml` (24 MB)

### November 9, 2025

- ✅ **Unified documentation created**: This directory structure
- ✅ **Dataset statistics automated**: YAML generation script
- ✅ **Cross-linking analysis**: Dutch ISIL vs organizations CSV

### November 8, 2025

- ✅ **Chilean enrichment**: 53.9% → 81.7% Wikidata coverage
- ✅ **Mexican expansion**: 117 → 192 institutions (+75)

---

## Next Steps

### Immediate Priorities (This Week)

1. **Complete Geocoding for High-Priority Countries**
   - Libya: 50 institutions (0% → 80% target)
   - Vietnam: 21 institutions (0% → 80% target)
   - Algeria: 19 institutions (0% → 80% target)
   - Georgia: 14 institutions (0% → 100% target)

2. **Japan Wikidata Enrichment - Batch 1**
   - Target: 50 major national museums and libraries
   - Goal: 58.8% → 59.2% coverage
   - Strategy: Exact name matching, high-confidence only

3. **Netherlands Enrichment - Batch 1**
   - Target: 30 provincial museums and archives
   - Goal: 31.0% → 35.8% coverage
   - Strategy: Leverage ISIL codes for Wikidata matching

### Medium-Term (This Month)

4. **Brazil Enrichment - Continue**
   - Target: 25 more university collections and state museums
   - Goal: 24.5% → 36.3% coverage
   - Batches: 8-10

5. **Mexico Enrichment - Continue**
   - Target: 20 more federal institutions and state archives
   - Goal: 32.8% → 43.2% coverage
   - Batches: 3-5

6. **Tunisia & Georgia - Complete Coverage**
   - Tunisia: 74.3% → 90% (add 11 Q-numbers)
   - Georgia: 85.7% → 100% (add 2 Q-numbers)

### Long-Term (Q1 2026)

7. **Expand to New Regions**
   - Southeast Asia: Vietnam, Thailand, Indonesia
   - Middle East: UAE, Saudi Arabia, Jordan
   - Africa: Egypt, Kenya, South Africa
   - Target: +500 institutions

8. **Data Quality Improvements**
   - Reduce TIER_4 (inferred) records via verification
   - Increase description coverage to 98%
   - Complete geocoding for all countries (100%)

---

## Changelog

### 2025-11-11: Major Merge (v1.1)

- Merged Tunisia enhanced dataset (+27 institutions)
- Added Georgia as new country (+14 institutions)
- Updated Latin American datasets (Brazil, Mexico, Chile, Argentina)
- Global Wikidata coverage improved: 55.6% → 56.6% (+153 Q-numbers)
- Total institutions: 13,478 → 13,505 (+27)
- Countries covered: 17 → 18

### 2025-11-09: Documentation Overhaul (v1.0)

- Created unified documentation directory (`data/instances/all/`)
- Consolidated statistics into single YAML file
- Established standardized reporting structure

### 2025-11-08: Chilean & Mexican Updates

- Chilean institutions enriched: 53.9% → 81.7% Wikidata
- Mexican institutions expanded: 117 → 192 (+75)

### 2025-11-06: Latin America Consolidation

- Unified Brazil, Chile, Mexico into single dataset
- Completed geocoding for all Latin American institutions

### 2025-11-01: Dutch Datasets Integration

- Parsed ISIL registry: 364 institutions
- Parsed Dutch organizations CSV: 1,351 institutions
- Cross-linked datasets: 340 matches (92.1% overlap)

---

## Reference Links

- **Project Repository**: `/Users/kempersc/apps/glam/`
- **Schema Documentation**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Progress Tracking**: `/Users/kempersc/apps/glam/PROGRESS.md`

**External Resources**:
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- LinkML: https://linkml.io/
- Nominatim: https://nominatim.openstreetmap.org/

---

**Document Version**: 1.1 (Post-Merge)
**Created**: November 9, 2025
**Last Updated**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project