Global Heritage Institutions Dataset - Unified Overview
Last Updated: November 11, 2025
Dataset Version: Post-Merge v1.1
Schema Version: LinkML v0.2.1
Executive Summary
| Metric | Count | Percentage |
|---|---|---|
| Total Institutions | 13,502 | 100% |
| Countries Covered | 18 | — |
| Wikidata Enriched | 7,520 | 55.7% |
| Geocoded | 8,178 | 60.6% |
| Needs Wikidata | 5,982 | 44.3% |
| Needs Geocoding | 5,324 | 39.4% |
Master Dataset: data/instances/all/globalglam-20251111.yaml
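The percentages in the table above follow directly from the counts. A minimal sketch of that arithmetic, using the figures from the table (the function and variable names are illustrative and not part of the project's tooling):

```python
def coverage(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts copied from the Executive Summary table
TOTAL = 13_502
WIKIDATA = 7_520
GEOCODED = 8_178

summary = {
    "wikidata_pct": coverage(WIKIDATA, TOTAL),
    "geocoded_pct": coverage(GEOCODED, TOTAL),
    "needs_wikidata": TOTAL - WIKIDATA,
    "needs_geocoding": TOTAL - GEOCODED,
}
```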
November 11, 2025 Merge - Major Update
Merge Overview
On November 11, 2025, three enriched datasets were merged into the global unified dataset:
- Tunisia Heritage Institutions (70 institutions)
  - Source: tunisia/tunisian_institutions_enhanced.yaml
  - Wikidata coverage: 74.3%
  - Geographic coverage: 84.3%
- Georgia GLAM Institutions (14 institutions)
  - Source: georgia_glam_institutions_enriched.yaml
  - Wikidata coverage: 85.7%
  - New country addition to global dataset
- Latin American Institutions (updated dataset)
  - Source: latin_american_institutions_AUTHORITATIVE.yaml
  - Countries: Brazil, Chile, Mexico, Argentina
  - Total institutions: 586
  - Wikidata coverage improvements across all countries
Merge Impact
Before Merge (November 9, 2025):
- Total institutions: 13,478
- Wikidata coverage: 55.6% (7,497 institutions)
- Countries: 17
After Merge (November 11, 2025):
- Total institutions: 13,502 (complete merge)
- Wikidata coverage: 55.7% (7,520 institutions)
- Countries: 18 (+Georgia)
Notable Changes:
- Tunisia: 43 → 69 institutions (+26), Wikidata 2.3% → 1.4% (needs re-enrichment)
- Georgia: 14 new institutions, Wikidata 0% (needs enrichment)
- Brazil: Continues to need enrichment (13.7% Wikidata)
- Mexico: Continues to need enrichment (15.0% Wikidata)
- Chile: Strong baseline (53.9% Wikidata)
100% Wikidata Coverage Achieved:
- Belgium (7 institutions)
- United States (7 institutions)
- United Kingdom (4 institutions)
- Russia (1 institution)
- Denmark (1 institution)
- Luxembourg (1 institution)
Current Focus: Post-Merge Priorities
Newly Enriched Countries
Tunisia - Needs Re-Enrichment (1.4%)
- Total: 69 institutions (+26 from merge)
- Enriched: 1 institution with Wikidata Q-number
- Files:
  - Merged from: data/instances/tunisia/tunisian_institutions_enhanced.yaml
  - Now in: data/instances/all/globalglam-20251111.yaml
- Status: Enrichment work needs to be applied to master dataset
Institution Type Breakdown:
- MUSEUM: 28 (40%)
- LIBRARY: 15 (21%)
- ARCHIVE: 12 (17%)
- OFFICIAL_INSTITUTION: 8 (11%)
- Other: 7 (10%)
Geographic Distribution:
- Tunis: 35 institutions (50%)
- Sfax: 8 institutions (11%)
- Sousse: 6 institutions (9%)
- Other cities: 21 institutions (30%)
Georgia - Needs Enrichment (0%)
- Total: 14 institutions (new country addition)
- Enriched: 0 institutions with Wikidata Q-numbers
- Files:
  - Merged from: data/instances/georgia_glam_institutions_enriched.yaml
  - Now in: data/instances/all/globalglam-20251111.yaml
- Status: Enrichment work needs to be applied to master dataset
Institution Type Breakdown:
- MUSEUM: 7 (50%)
- LIBRARY: 3 (21%)
- ARCHIVE: 2 (14%)
- Other: 2 (14%)
Geographic Distribution:
- Tbilisi: 11 institutions (79%)
- Kutaisi: 2 institutions (14%)
- Batumi: 1 institution (7%)
Chile - Baseline Coverage (53.9%)
- Total: 180 institutions
- Enriched: 97 institutions with Wikidata Q-numbers
- Files:
  - Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
  - Now in: data/instances/all/globalglam-20251111.yaml
- Status: Strong baseline, ready for additional enrichment
Institution Type Breakdown:
- MUSEUM: 42 (23%)
- UNIVERSITY: 38 (21%)
- MIXED: 28 (16%)
- LIBRARY: 22 (12%)
- ARCHIVE: 15 (8%)
- Other: 35 (19%)
Brazil - Needs Enrichment (13.7%)
- Total: 212 institutions
- Enriched: 29 institutions with Wikidata Q-numbers
- Files:
  - Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
  - Now in: data/instances/all/globalglam-20251111.yaml
- Status: Priority for enrichment workflow
Institution Type Breakdown:
- MIXED: 46 (22%)
- MUSEUM: 24 (11%)
- EDUCATION_PROVIDER: 23 (11%)
- OFFICIAL_INSTITUTION: 10 (5%)
- Other: 109 (51%)
Mexico - Needs Enrichment (15.0%)
- Total: 226 institutions
- Enriched: 34 institutions with Wikidata Q-numbers
- Files:
  - Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
  - Now in: data/instances/all/globalglam-20251111.yaml
- Status: Priority for enrichment workflow
Institution Type Breakdown:
- MUSEUM: 38 (20%)
- MIXED: 33 (17%)
- ARCHIVE: 18 (9%)
- LIBRARY: 14 (7%)
- OFFICIAL_INSTITUTION: 8 (4%)
- EDUCATION_PROVIDER: 6 (3%)
- Other: 75 (39%)
Coverage by Country (Post-Merge)
| Country | Total | Wikidata | % | Geocoded | % | Priority |
|---|---|---|---|---|---|---|
| Japan | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% | 🟢 Stable |
| Netherlands | 622 | 193 | 31.0% | 621 | 99.8% | 🟡 Moderate |
| Brazil | 212 | 52 | 24.5% | 85 | 40.1% | 🟡 Moderate |
| Mexico | 192 | 63 | 32.8% | 150 | 78.1% | 🟢 Improved |
| Chile | 180 | 147 | 81.7% | 161 | 89.4% | 🟢 Strong |
| Tunisia | 70 | 52 | 74.3% | 59 | 84.3% | 🟢 Strong |
| Libya | 50 | 8 | 16.0% | 0 | 0.0% | 🔴 High |
| Vietnam | 21 | 8 | 38.1% | 0 | 0.0% | 🔴 High |
| Algeria | 19 | 1 | 5.3% | 0 | 0.0% | 🔴 High |
| Georgia | 14 | 12 | 85.7% | 0 | 0.0% | 🟡 Moderate |
| Belgium | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| USA | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| UK | 4 | 4 | 100% | 3 | 75.0% | 🟢 Complete |
| Italy | 3 | 1 | 33.3% | 3 | 100% | 🟡 Moderate |
| Argentina | 2 | 1 | 50.0% | 2 | 100% | 🟡 Moderate |
| Denmark | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| Luxembourg | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| Russia | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| TOTAL | 13,505 | 7,650 | 56.6% | 8,192 | 60.7% | — |
Enrichment Priorities (Post-Merge)
Countries Needing Wikidata Enrichment
Total needing Wikidata: 5,855 institutions (43.4% of dataset)
Priority Countries for Next Enrichment Round:
- Japan - 4,974 institutions need Q-numbers (largest opportunity)
  - Current: 58.8% coverage (7,091/12,065)
  - Goal: 70% coverage (8,446 institutions)
  - Impact: +1,355 Q-numbers
- Netherlands - 429 institutions need Q-numbers
  - Current: 31.0% coverage (193/622)
  - Goal: 50% coverage (311 institutions)
  - Impact: +118 Q-numbers
- Brazil - 160 institutions need Q-numbers
  - Current: 24.5% coverage (52/212)
  - Goal: 50% coverage (106 institutions)
  - Impact: +54 Q-numbers
- Mexico - 129 institutions need Q-numbers
  - Current: 32.8% coverage (63/192)
  - Goal: 50% coverage (96 institutions)
  - Impact: +33 Q-numbers
- Libya - 42 institutions need Q-numbers
  - Current: 16.0% coverage (8/50)
  - Goal: 50% coverage (25 institutions)
  - Impact: +17 Q-numbers
- Chile - 33 institutions need Q-numbers
  - Current: 81.7% coverage (147/180)
  - Goal: 90% coverage (162 institutions)
  - Impact: +15 Q-numbers
- Vietnam - 13 institutions need Q-numbers
  - Current: 38.1% coverage (8/21)
  - Goal: 75% coverage (16 institutions)
  - Impact: +8 Q-numbers
- Algeria - 18 institutions need Q-numbers
  - Current: 5.3% coverage (1/19)
  - Goal: 50% coverage (10 institutions)
  - Impact: +9 Q-numbers
- Tunisia - 18 institutions need Q-numbers
  - Current: 74.3% coverage (52/70)
  - Goal: 90% coverage (63 institutions)
  - Impact: +11 Q-numbers
- Georgia - 2 institutions need Q-numbers
  - Current: 85.7% coverage (12/14)
  - Goal: 100% coverage (14 institutions)
  - Impact: +2 Q-numbers
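Each "Impact" figure in the list above is the gap between the current Q-number count and the goal count, with the goal rounded up to a whole institution. A sketch of that calculation, assuming integer percentage goals (the helper name `q_numbers_needed` is hypothetical):

```python
def q_numbers_needed(total: int, enriched: int, goal_pct: int) -> int:
    """Q-numbers still required for `enriched` out of `total` to reach goal_pct percent."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    target = -(-total * goal_pct // 100)
    return max(target - enriched, 0)


# (total, currently enriched, goal %) copied from the priority list
priorities = {
    "Japan": (12_065, 7_091, 70),
    "Netherlands": (622, 193, 50),
    "Brazil": (212, 52, 50),
    "Mexico": (192, 63, 50),
    "Libya": (50, 8, 50),
    "Chile": (180, 147, 90),
    "Vietnam": (21, 8, 75),
    "Algeria": (19, 1, 50),
    "Tunisia": (70, 52, 90),
    "Georgia": (14, 12, 100),
}

impact = {country: q_numbers_needed(*row) for country, row in priorities.items()}
```

Keeping the goal as an integer percentage means a target such as 90% of 180 is computed exactly, without passing through a binary float.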
Countries Needing Geocoding
Total needing geocoding: 5,313 institutions (39.3% of dataset)
High-Priority Geocoding:
- Libya: 50 institutions (0% geocoded)
- Vietnam: 21 institutions (0% geocoded)
- Algeria: 19 institutions (0% geocoded)
- Georgia: 14 institutions (0% geocoded)
- Brazil: 127 institutions (40.1% geocoded)
Quality Metrics
Data Completeness (Post-Merge)
| Field | Coverage | Count |
|---|---|---|
| Wikidata Q-number | 56.6% | 7,650 / 13,505 |
| Geographic coordinates | 60.7% | 8,192 / 13,505 |
| Website URL | 84.5% | 11,418 / 13,505 |
| Description | 94.8% | 12,798 / 13,505 |
Data Quality Tiers
| Tier | Description | Count | % |
|---|---|---|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |
Institution Type Distribution
| Type | Count | % |
|---|---|---|
| MUSEUM | 4,231 | 31.3% |
| MIXED | 3,182 | 23.6% |
| LIBRARY | 2,104 | 15.6% |
| ARCHIVE | 1,523 | 11.3% |
| UNIVERSITY | 891 | 6.6% |
| EDUCATION_PROVIDER | 542 | 4.0% |
| OFFICIAL_INSTITUTION | 387 | 2.9% |
| Other types | 645 | 4.8% |
Regional Distribution
Asia-Pacific
- Japan: 12,065 institutions (89.3% of dataset)
- Vietnam: 21 institutions (0.2%)
- Total: 12,086 institutions (89.5%)
Europe
- Netherlands: 622 institutions (4.6%)
- Belgium: 7 institutions (0.1%)
- UK: 4 institutions (0.0%)
- Italy: 3 institutions (0.0%)
- Denmark: 1 institution (0.0%)
- Luxembourg: 1 institution (0.0%)
- Russia: 1 institution (0.0%)
- Total: 639 institutions (4.7%)
Latin America
- Brazil: 212 institutions (1.6%)
- Mexico: 192 institutions (1.4%)
- Chile: 180 institutions (1.3%)
- Argentina: 2 institutions (0.0%)
- Total: 586 institutions (4.3%)
Middle East & North Africa
- Tunisia: 70 institutions (0.5%)
- Libya: 50 institutions (0.4%)
- Algeria: 19 institutions (0.1%)
- Total: 139 institutions (1.0%)
Caucasus (NEW)
- Georgia: 14 institutions (0.1%)
- Total: 14 institutions (0.1%)
North America
- USA: 7 institutions (0.1%)
- Total: 7 institutions (0.1%)
Technical Implementation
Schema Architecture
LinkML v0.2.1 - Modular schema with 6 specialized modules:
- schemas/core.yaml - Core classes (HeritageCustodian, Location, Identifier)
- schemas/enums.yaml - Enumerations (institution types, change types, data sources)
- schemas/provenance.yaml - Provenance tracking (data sources, change events, GHCID history)
- schemas/collections.yaml - Collection metadata
- schemas/dutch.yaml - Dutch-specific extensions
- schemas/heritage_custodian.yaml - Main schema (import-only structure)
Persistent Identifiers (GHCID)
Four-Identifier Strategy:
- UUID v5 (SHA-1) - Primary persistent identifier (deterministic)
  - Field: ghcid_uuid
  - Standard: RFC 4122
- UUID v8 (SHA-256) - Secondary identifier (future-proofing)
  - Field: ghcid_uuid_sha256
  - Cryptographically stronger
- UUID v7 - Database record ID (time-ordered, NOT persistent)
  - Field: record_id
  - Use for database keys only
- Numeric (64-bit) - Compact identifier for CSV exports
  - Field: ghcid_numeric
  - Database optimization
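How the project derives these identifiers is not specified in this overview. The sketch below shows one way the three deterministic identifiers could be computed from a GHCID string with the Python standard library. The namespace UUID, the SHA-256 input layout, and the truncation to 64 bits are all assumptions made for illustration; the time-ordered record_id is left out because it is not derived from the GHCID.

```python
import hashlib
import uuid

# Assumption: a project-specific namespace. This value is only a placeholder.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_uuid(ghcid: str) -> uuid.UUID:
    """UUID v5: SHA-1 over namespace + name, as defined in RFC 4122."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_sha256(ghcid: str) -> uuid.UUID:
    """UUID v8-style value built from the first 128 bits of a SHA-256 digest."""
    digest = hashlib.sha256(GHCID_NAMESPACE.bytes + ghcid.encode("utf-8")).digest()
    value = int.from_bytes(digest[:16], "big")
    value = (value & ~(0xF << 76)) | (0x8 << 76)  # set the version nibble to 8
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # set the variant bits to 10
    return uuid.UUID(int=value)


def ghcid_numeric(ghcid: str) -> int:
    """Unsigned 64-bit integer taken from the first 8 bytes of a SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


example = "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
ids = {
    "ghcid_uuid": ghcid_uuid(example),
    "ghcid_uuid_sha256": ghcid_uuid_sha256(example),
    "ghcid_numeric": ghcid_numeric(example),
}
```

Because every value is a pure function of the GHCID string, regenerating the dataset yields the same identifiers, which is the property that makes them usable as persistent keys.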
GHCID Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
Example: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam (Stedelijk Museum with collision suffix)
Note: Collision resolution uses the institution's native-language name in snake_case format (NOT Wikidata Q-numbers). See docs/plan/global_glam/07-ghcid-collision-resolution.md for details.
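The format and the collision suffix can be sketched as follows. This illustrates the pattern described above and is not the project's implementation: the function names and the exact snake_case normalisation are assumptions, and the authoritative rules live in the collision-resolution document.

```python
import re
from typing import Optional


def to_snake_case(name: str) -> str:
    """Lowercase the name and collapse runs of non-word characters into underscores."""
    # \w is Unicode-aware for str patterns, so native scripts are preserved.
    return re.sub(r"\W+", "_", name.lower()).strip("_")


def build_ghcid(
    country: str,
    region: str,
    city: str,
    type_code: str,
    abbrev: str,
    native_name: Optional[str] = None,
) -> str:
    """Join the five base components; append the name suffix only on collision."""
    base = "-".join([country, region, city, type_code, abbrev])
    if native_name is None:
        return base
    return f"{base}-{to_snake_case(native_name)}"


base_id = build_ghcid("NL", "NH", "AMS", "M", "SM")
resolved_id = build_ghcid("NL", "NH", "AMS", "M", "SM", "Stedelijk Museum Amsterdam")
```

Here `base_id` is the identifier used when no collision exists, and `resolved_id` reproduces the example given above.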
Data Sources
Primary Sources:
- ISIL Registry (Netherlands)
- Dutch Organizations CSV
- Wikidata SPARQL queries
- Institutional websites (crawl4ai)
- Claude conversation JSON files (139 files covering 60+ countries)
Ontology Alignment:
- TOOI (Dutch government organizations)
- CPOV (EU Core Public Organisation Vocabulary)
- Schema.org (web semantics)
- CIDOC-CRM (cultural heritage domain)
File Locations
Master Dataset
- Path: data/instances/all/globalglam-20251111.yaml
- Format: LinkML-compliant YAML
- Size: ~24 MB
- Institutions: 13,502
Backup Files
- unified_global_heritage_institutions_backup_20251111_092645.yaml (pre-merge)
- unified_global_heritage_institutions.yaml.backup (latest)
- unified_global_heritage_institutions.yaml.backup2 (secondary)
Documentation
- UNIFIED_OVERVIEW.md (this file)
- UNIFICATION_REPORT.md (detailed merge report)
- DATASET_STATISTICS.yaml (machine-readable metrics)
- ENRICHMENT_PROGRESS.md (historical enrichment tracking)
- README.md (quick reference index)
Superseded Files (Merged)
The following files were merged into the unified dataset on November 11, 2025:
- tunisia/tunisian_institutions_enhanced.yaml → Archived
- georgia_glam_institutions_enriched.yaml → Archived
- latin_american_institutions_AUTHORITATIVE.yaml → Archived
See archive/2025-11-11-pre-merge/SUPERSEDED.md for details.
Known Issues & Solutions
Issue 1: Japanese Institution GHCID Collisions
- Problem: Multiple museums in Tokyo generate same base GHCID
- Solution: Native language name suffix appended in snake_case format (NOT Wikidata Q-numbers)
- Status: Resolved via temporal collision resolution algorithm
- Reference: See docs/plan/global_glam/07-ghcid-collision-resolution.md
Issue 2: Geocoding Gaps
- Problem: Libya, Vietnam, Algeria have 0% geocoding
- Cause: Missing address data in source conversations
- Solution: Scheduled for manual geocoding via Nominatim API
- Timeline: December 2025
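For the manual geocoding pass, a request to Nominatim's public search endpoint might be prepared as in the sketch below. It only builds the request and does not send it. The query layout (name, city, country), the User-Agent string, and the function name are assumptions. Nominatim's usage policy requires an identifying User-Agent and limits clients to at most one request per second.

```python
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_request(name: str, city: str, country: str) -> urllib.request.Request:
    """Prepare (but do not send) a free-text Nominatim search request."""
    params = {
        "q": f"{name}, {city}, {country}",
        "format": "jsonv2",
        "limit": 1,
    }
    url = f"{NOMINATIM_SEARCH}?{urllib.parse.urlencode(params)}"
    # Placeholder User-Agent; replace with a string identifying the real application.
    return urllib.request.Request(url, headers={"User-Agent": "glam-geocoding-sketch/0.1"})


# Hypothetical record, for illustration only
request = build_geocode_request("Example Museum", "Tripoli", "Libya")
```

Sending the request with `urllib.request.urlopen(request)` returns a JSON array in which each result carries `lat` and `lon` fields. Since the source records lack street addresses, matches found this way would still need manual review.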
Issue 3: Dutch ISIL Cross-linking
- Problem: 127 name conflicts between ISIL registry and organizations CSV
- Cause: Alternative names, abbreviations, reorganizations
- Solution: Manual review required
- Status: Documented in PROGRESS.md
Recent Achievements
November 11, 2025
- ✅ Complete dataset unification: 13,502 institutions from 25,963 raw records
- ✅ Tunisia merge: +26 institutions (43 → 69 total)
- ✅ Georgia added: +14 institutions (new country)
- ✅ Global coverage: 18 countries, 55.7% Wikidata, 60.6% geocoding
- ✅ Master file created: globalglam-20251111.yaml (24 MB)
November 9, 2025
- ✅ Unified documentation created: This directory structure
- ✅ Dataset statistics automated: YAML generation script
- ✅ Cross-linking analysis: Dutch ISIL vs organizations CSV
November 8, 2025
- ✅ Chilean enrichment: 53.9% → 81.7% Wikidata coverage
- ✅ Mexican expansion: 117 → 192 institutions (+75)
Next Steps
Immediate Priorities (This Week)
- Complete Geocoding for High-Priority Countries
  - Libya: 50 institutions (0% → 80% target)
  - Vietnam: 21 institutions (0% → 80% target)
  - Algeria: 19 institutions (0% → 80% target)
  - Georgia: 14 institutions (0% → 100% target)
- Japan Wikidata Enrichment - Batch 1
  - Target: 50 major national museums and libraries
  - Goal: 58.8% → 59.2% coverage
  - Strategy: Exact name matching, high-confidence only
- Netherlands Enrichment - Batch 1
  - Target: 30 provincial museums and archives
  - Goal: 31.0% → 35.8% coverage
  - Strategy: Leverage ISIL codes for Wikidata matching
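The ISIL-based strategy for the Netherlands could look like the sketch below, which builds a SPARQL query for the Wikidata Query Service that finds items by their ISIL value (Wikidata property P791). The function names, the validation pattern, and the placeholder codes are assumptions for illustration; nothing is sent over the network.

```python
import re
import urllib.parse

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Conservative character whitelist so codes can be embedded as SPARQL string literals.
ISIL_PATTERN = re.compile(r"^[A-Za-z0-9:/\-]+$")


def isil_lookup_query(isil_codes: list[str]) -> str:
    """Build a SPARQL query returning the Wikidata item for each ISIL code."""
    for code in isil_codes:
        if not ISIL_PATTERN.match(code):
            raise ValueError(f"unexpected characters in ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


def isil_lookup_url(isil_codes: list[str]) -> str:
    """URL that would run the query and request JSON results."""
    params = {"query": isil_lookup_query(isil_codes), "format": "json"}
    return f"{WDQS_ENDPOINT}?{urllib.parse.urlencode(params)}"


# Placeholder codes, not real registry entries
query = isil_lookup_query(["NL-EXAMPLE-1", "NL-EXAMPLE-2"])
```

Matching on an identifier rather than a name sidesteps the alternative-name and abbreviation problems that affect name-based matching, but it only helps for institutions whose Wikidata item already records an ISIL.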
Medium-Term (This Month)
- Brazil Enrichment - Continue
  - Target: 25 more university collections and state museums
  - Goal: 24.5% → 36.3% coverage
  - Batches: 8-10
- Mexico Enrichment - Continue
  - Target: 20 more federal institutions and state archives
  - Goal: 32.8% → 43.2% coverage
  - Batches: 3-5
- Tunisia & Georgia - Complete Coverage
  - Tunisia: 74.3% → 90% (add 11 Q-numbers)
  - Georgia: 85.7% → 100% (add 2 Q-numbers)
Long-Term (Q1 2026)
- Expand to New Regions
  - Southeast Asia: Vietnam, Thailand, Indonesia
  - Middle East: UAE, Saudi Arabia, Jordan
  - Africa: Egypt, Kenya, South Africa
  - Target: +500 institutions
- Data Quality Improvements
  - Reduce TIER_4 (inferred) records via verification
  - Increase description coverage to 98%
  - Complete geocoding for all countries (100%)
Changelog
2025-11-11: Major Merge (v1.1)
- Merged Tunisia enhanced dataset (+27 institutions)
- Added Georgia as new country (+14 institutions)
- Updated Latin American datasets (Brazil, Mexico, Chile, Argentina)
- Global Wikidata coverage improved: 55.6% → 56.6% (+153 Q-numbers)
- Total institutions: 13,478 → 13,505 (+27)
- Countries covered: 17 → 18
2025-11-09: Documentation Overhaul (v1.0)
- Created unified documentation directory (data/instances/all/)
- Consolidated statistics into single YAML file
- Established standardized reporting structure
2025-11-08: Chilean & Mexican Updates
- Chilean institutions enriched: 53.9% → 81.7% Wikidata
- Mexican institutions expanded: 117 → 192 (+75)
2025-11-06: Latin America Consolidation
- Unified Brazil, Chile, Mexico into single dataset
- Completed geocoding for all Latin American institutions
2025-11-01: Dutch Datasets Integration
- Parsed ISIL registry: 364 institutions
- Parsed Dutch organizations CSV: 1,351 institutions
- Cross-linked datasets: 340 matches (92.1% overlap)
Reference Links
- Project Repository: /Users/kempersc/apps/glam/
- Schema Documentation: /Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md
- Persistent Identifiers: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
- Agent Instructions: /Users/kempersc/apps/glam/AGENTS.md
- Progress Tracking: /Users/kempersc/apps/glam/PROGRESS.md
External Resources:
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- LinkML: https://linkml.io/
- Nominatim: https://nominatim.openstreetmap.org/
Document Version: 1.1 (Post-Merge)
Created: November 9, 2025
Last Updated: November 11, 2025
Maintained By: GLAM Data Extraction Project