Global Heritage Institutions - Unification Summary

Last Updated: November 11, 2025
Quick Reference: Post-Merge Statistics & File Locations


Current Statistics (Post-Merge)

| Metric | Value |
|---|---|
| Total Institutions | 13,505 |
| Countries Covered | 18 |
| Wikidata Coverage | 56.6% (7,650 institutions) |
| Geocoding Coverage | 60.7% (8,192 institutions) |
| Master File Size | ~45 MB |
| Schema Version | LinkML v0.2.1 |
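
The coverage percentages above are plain ratios of enriched institutions to the total. A minimal sketch of that arithmetic, using the counts from the table; the helper name `coverage_pct` is illustrative and not part of the project's tooling:

```python
def coverage_pct(covered: int, total: int) -> float:
    """Return covered/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * covered / total, 1)


TOTAL_INSTITUTIONS = 13_505  # "Total Institutions" from the statistics table

wikidata = coverage_pct(7_650, TOTAL_INSTITUTIONS)
geocoded = coverage_pct(8_192, TOTAL_INSTITUTIONS)
print(f"Wikidata: {wikidata}%  Geocoding: {geocoded}%")
```

The same helper can be pointed at any per-country row to cross-check a published percentage.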

Merge History

November 11, 2025 - Major Enrichment Merge

| Dataset | Institutions | Wikidata % | Status |
|---|---|---|---|
| Tunisia Enhanced | 70 (+27 new) | 74.3% | Merged |
| Georgia GLAM | 14 (new country) | 85.7% | Merged |
| Latin America Updated | 586 (updated) | 43.4% avg | Merged |

Impact:

  • +27 institutions (13,478 → 13,505)
  • +153 Wikidata Q-numbers (55.6% → 56.6%)
  • +1 country (Georgia added)

Notable Achievements:

  • 🎯 Tunisia: 2.3% → 74.3% Wikidata coverage (+72 pp!)
  • 🎯 Georgia: 85.7% Wikidata coverage (new country)
  • 🎯 Chile: 53.9% → 81.7% Wikidata coverage
  • 🎯 Mexico: 117 → 192 institutions (+75)
  • 🎯 Six countries at 100% Wikidata coverage: Belgium, USA, UK, Russia, Denmark, Luxembourg

File Locations

Master Dataset

Primary File:

data/instances/all/unified_global_heritage_institutions.yaml
  • Size: ~45 MB
  • Format: LinkML-compliant YAML
  • Institutions: 13,505
  • Last Updated: November 11, 2025

Backup Files (in same directory):

unified_global_heritage_institutions_backup_20251111_092645.yaml  ← Pre-merge backup
unified_global_heritage_institutions.yaml.backup                  ← Latest backup
unified_global_heritage_institutions.yaml.backup2                 ← Secondary backup
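
The pre-merge backup name follows a `_backup_YYYYMMDD_HHMMSS` pattern. As a rough sketch, a copy with that naming could be produced with Python's standard library before a merge. The function names here are hypothetical and this is not the project's actual backup script:

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_path(master: Path, when: datetime) -> Path:
    """Build '<stem>_backup_YYYYMMDD_HHMMSS<suffix>' next to the master file."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return master.with_name(f"{master.stem}_backup_{stamp}{master.suffix}")


def make_backup(master: Path, when: Optional[datetime] = None) -> Path:
    """Copy the master file to a timestamped backup and return the new path."""
    target = backup_path(master, when or datetime.now())
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    return target


master = Path("data/instances/all/unified_global_heritage_institutions.yaml")
# Only the name is computed here; make_backup(master) would perform the copy.
print(backup_path(master, datetime(2025, 11, 11, 9, 26, 45)).name)
```

Keeping the name construction separate from the copy makes it easy to check the naming without touching the ~45 MB file.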

Documentation Files

| File | Purpose | Size |
|---|---|---|
| UNIFIED_OVERVIEW.md | Complete project documentation | ~22 KB |
| UNIFICATION_REPORT.md | Detailed merge analysis | ~10 KB |
| UNIFICATION_SUMMARY.md | Quick reference (this file) | ~3 KB |
| DATASET_STATISTICS.yaml | Machine-readable metrics | ~3.3 KB |
| ENRICHMENT_PROGRESS.md | Historical enrichment tracking | ~11 KB |
| README.md | Quick reference index | ~7.7 KB |

Superseded Files (Archived)

The following files were merged into the unified dataset and archived:

Archived on November 11, 2025

Location: data/instances/archive/2025-11-11-pre-merge/

| File | Source Location | Reason |
|---|---|---|
| tunisian_institutions_enhanced.yaml | tunisia/ | Merged into unified dataset |
| georgia_glam_institutions_enriched.yaml | Root directory | Merged into unified dataset |
| latin_american_institutions_AUTHORITATIVE.yaml | Root directory | Updated and merged |

Archive Documentation: See data/instances/archive/2025-11-11-pre-merge/SUPERSEDED.md


Countries with 100% Wikidata Coverage

As of November 11, 2025, the following countries have achieved complete Wikidata coverage:

| Country | Institutions | Wikidata Q-numbers | Geocoded |
|---|---|---|---|
| Belgium | 7 | 7 (100%) | 7 (100%) |
| USA | 7 | 7 (100%) | 7 (100%) |
| UK | 4 | 4 (100%) | 3 (75%) |
| Russia | 1 | 1 (100%) | 1 (100%) |
| Denmark | 1 | 1 (100%) | 1 (100%) |
| Luxembourg | 1 | 1 (100%) | 1 (100%) |

Top 10 Countries by Institution Count

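| Rank | Country | Institutions | Wikidata % | Geocoded % |
|---|---|---|---|---|
| 1 | Japan | 12,065 | 58.8% | 58.8% |
| 2 | Netherlands | 622 | 31.0% | 99.8% |
| 3 | Brazil | 212 | 24.5% | 40.1% |
| 4 | Mexico | 192 | 32.8% | 78.1% |
| 5 | Chile | 180 | 81.7% | 89.4% |
| 6 | Tunisia | 70 | 74.3% | 84.3% |
| 7 | Libya | 50 | 16.0% | 0.0% |
| 8 | Vietnam | 21 | 38.1% | 0.0% |
| 9 | Algeria | 19 | 5.3% | 0.0% |
| 10 | Georgia | 14 | 85.7% | 0.0% |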

Enrichment Needs Summary

Total Institutions Needing Enrichment: 5,855 (43.4%)

By Priority Level

🔴 High Priority (Low Wikidata coverage, significant size):

  • Japan: 4,974 institutions need Q-numbers (58.8% → target 70%)
  • Netherlands: 429 institutions need Q-numbers (31.0% → target 50%)
  • Brazil: 160 institutions need Q-numbers (24.5% → target 50%)

🟡 Medium Priority (Mid-sized gaps in absolute terms):

  • Mexico: 129 institutions need Q-numbers (32.8% → target 50%)
  • Libya: 42 institutions need Q-numbers (16.0% → target 50%)
  • Chile: 33 institutions need Q-numbers (81.7% → target 90%)

🟢 Low Priority (Small gaps in absolute terms):

  • Tunisia: 18 institutions need Q-numbers (74.3% → target 90%)
  • Vietnam: 13 institutions need Q-numbers (38.1% → target 75%)
  • Georgia: 2 institutions need Q-numbers (85.7% → target 100%)
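
Each entry above pairs two different quantities: the number of institutions still lacking a Q-number, and a coverage target. The sketch below separates the two and shows how many additional Q-numbers a target actually requires. The Q-number counts are the "After" figures reported elsewhere in this document, and the function names are illustrative:

```python
import math


def missing_qnumbers(total: int, with_q: int) -> int:
    """Institutions that still lack a Wikidata Q-number."""
    return total - with_q


def needed_for_target(total: int, with_q: int, target_pct: float) -> int:
    """Additional Q-numbers required to reach target_pct coverage (0 if already met)."""
    required = math.ceil(total * target_pct / 100)
    return max(0, required - with_q)


# (total institutions, institutions with Q-numbers, target %)
countries = {
    "Brazil": (212, 52, 50),
    "Chile": (180, 147, 90),
    "Georgia": (14, 12, 100),
}

for name, (total, with_q, target) in countries.items():
    gap = missing_qnumbers(total, with_q)
    todo = needed_for_target(total, with_q, target)
    print(f"{name}: {gap} without Q-numbers, {todo} more needed for {target}%")
```

For a country with a 100% target the two numbers coincide; for lower targets the work needed to hit the target is smaller than the full gap.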

Recent Major Improvements

Tunisia - Regional Leader

  • Before: 43 institutions, 1 Wikidata Q-number (2.3%)
  • After: 70 institutions, 52 Wikidata Q-numbers (74.3%)
  • Improvement: +27 institutions, +51 Q-numbers, +72 percentage points

Georgia - New Country Addition

  • Before: Not in dataset
  • After: 14 institutions, 12 Wikidata Q-numbers (85.7%)
  • Impact: Highest Wikidata coverage among the ten largest countries by institution count

Chile - Major Enhancement

  • Before: 180 institutions, 97 Wikidata Q-numbers (53.9%)
  • After: 180 institutions, 147 Wikidata Q-numbers (81.7%)
  • Improvement: +50 Q-numbers, +27.8 percentage points

Brazil - Steady Growth

  • Before: 212 institutions, 29 Wikidata Q-numbers (13.7%)
  • After: 212 institutions, 52 Wikidata Q-numbers (24.5%)
  • Improvement: +23 Q-numbers, +10.8 percentage points

Mexico - Expansion & Enrichment

  • Before: 117 institutions, 0 Wikidata Q-numbers (0%)
  • After: 192 institutions, 63 Wikidata Q-numbers (32.8%)
  • Improvement: +75 institutions, +63 Q-numbers, +32.8 percentage points
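
The before/after figures in this section can be reproduced from the raw counts alone. A small sketch, assuming only the institution and Q-number counts listed above; the `Snapshot` class and `improvement` function are hypothetical names used for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    institutions: int
    q_numbers: int

    @property
    def coverage(self) -> float:
        """Wikidata coverage as a percentage (0.0 for an empty snapshot)."""
        return 100 * self.q_numbers / self.institutions if self.institutions else 0.0


def improvement(before: Snapshot, after: Snapshot) -> tuple[int, int, float]:
    """Return (new institutions, new Q-numbers, percentage-point change)."""
    # Round each coverage figure first so the delta matches the published percentages.
    pp = round(round(after.coverage, 1) - round(before.coverage, 1), 1)
    return (
        after.institutions - before.institutions,
        after.q_numbers - before.q_numbers,
        pp,
    )


tunisia = improvement(Snapshot(43, 1), Snapshot(70, 52))
chile = improvement(Snapshot(180, 97), Snapshot(180, 147))
print("Tunisia:", tunisia)
print("Chile:", chile)
```

The change is expressed in percentage points (a difference of two percentages), not as a relative percentage increase.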

Next Steps

Immediate (This Week)

  1. Complete documentation updates (this file)
  2. 📦 Archive superseded files to archive/2025-11-11-pre-merge/
  3. 🌍 Start geocoding for Libya, Vietnam, Algeria, Georgia

Short-Term (This Month)

  1. 🇯🇵 Japan Batch 1 enrichment: Target 50 major institutions
  2. 🇳🇱 Netherlands Batch 1 enrichment: Target 30 provincial museums
  3. 🇧🇷 Brazil Batch 8 enrichment: Target 25 university collections
  4. 🇲🇽 Mexico Batch 3 enrichment: Target 20 federal institutions

Long-Term (Q1 2026)

  1. 🌏 Expand to Southeast Asia (Vietnam, Thailand, Indonesia)
  2. 🌍 Expand to Middle East (UAE, Saudi Arabia, Jordan)
  3. 🌍 Expand to Africa (Egypt, Kenya, South Africa)

Data Quality Tiers

| Tier | Description | Count | % |
|---|---|---|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |

Goal: Reduce TIER_4 to <1% through verification and website crawling.
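
To make the goal concrete, the sketch below works out how many TIER_4 records would have to move to a higher tier. It assumes the dataset size stays constant, meaning records are re-tiered rather than removed or added:

```python
tier_counts = {"TIER_1": 622, "TIER_2": 4_891, "TIER_3": 7_650, "TIER_4": 342}

total = sum(tier_counts.values())

# Largest TIER_4 count that is still strictly below 1% of the dataset.
# Integer arithmetic avoids float rounding:
#   count * 100 < total  <=>  count <= (total - 1) // 100
max_allowed = (total - 1) // 100
to_upgrade = max(0, tier_counts["TIER_4"] - max_allowed)

print(f"total={total}; TIER_4 must drop to <= {max_allowed}; re-tier at least {to_upgrade}")
```

If new institutions are merged in the meantime, the threshold shifts with the total, so the calculation is best rerun against current counts.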


Technical References

Schema Architecture

  • Version: LinkML v0.2.1 (modular)
  • Modules: 6 specialized modules (core, enums, provenance, collections, dutch, main)
  • Documentation: docs/SCHEMA_MODULES.md (relative to the repository root)

Persistent Identifiers (GHCID)

  • Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
  • Collision Suffix: Native language institution name in snake_case (NOT Wikidata Q-numbers)
  • UUID Strategies: v5 (SHA-1 primary), v8 (SHA-256 secondary), v7 (database only)
  • Documentation: docs/PERSISTENT_IDENTIFIERS.md (relative to the repository root)
  • Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md (relative to the repository root)
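
As a rough illustration of the format above, the sketch below assembles a GHCID string and derives a UUID v5 from it. Everything beyond the format itself is an assumption: the namespace UUID is a placeholder, the component values in the example are made up, and the snake_case rules (lowercasing, keeping non-ASCII letters) may differ from the project's actual normalization, which is defined in the referenced documentation.

```python
import re
import unicodedata
import uuid
from typing import Optional

# Placeholder namespace for illustration only; the project's real namespace
# UUID is not given in this summary.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def snake_case(name: str) -> str:
    """Lowercase a native-language name and join its word runs with underscores."""
    normalized = unicodedata.normalize("NFC", name).lower()
    # [^\W_]+ matches runs of Unicode letters and digits, so non-Latin text survives.
    return "_".join(re.findall(r"[^\W_]+", normalized))


def build_ghcid(country: str, region: str, city: str, inst_type: str,
                abbrev: str, native_name: Optional[str] = None) -> str:
    """Join the five mandatory parts, plus the collision suffix when supplied."""
    parts = [country, region, city, inst_type, abbrev]
    if native_name:  # assumed to be added only when identifiers would collide
        parts.append(snake_case(native_name))
    return "-".join(parts)


def ghcid_uuid5(ghcid: str) -> uuid.UUID:
    """Deterministic, SHA-1 based UUID v5 for a GHCID string."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, ghcid)


# Made-up component codes, used only to show the shape of the identifier.
ghcid = build_ghcid("NL", "NH", "AMS", "M", "SM", "Stedelijk Museum Amsterdam")
print(ghcid)
print(ghcid_uuid5(ghcid))
```

Because UUID v5 is name-based, the same GHCID string always maps to the same UUID under a given namespace, which is what makes it usable as a stable primary identifier.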

Ontology Alignment

  • TOOI: Dutch government organizations
  • CPOV: EU Core Public Organisation Vocabulary
  • Schema.org: Web semantics
  • CIDOC-CRM: Cultural heritage domain

For More Information

| Document | Purpose |
|---|---|
| UNIFIED_OVERVIEW.md | Comprehensive project documentation (read this for details) |
| UNIFICATION_REPORT.md | Detailed merge analysis with country-by-country breakdowns |
| DATASET_STATISTICS.yaml | Machine-readable metrics (parse this for automation) |
| ENRICHMENT_PROGRESS.md | Historical enrichment tracking (batch-by-batch progress) |
| README.md | Quick reference index (start here for navigation) |

Document Version: 1.0
Created: November 11, 2025
Maintained By: GLAM Data Extraction Project

Master Dataset Location: data/instances/all/unified_global_heritage_institutions.yaml