Global Heritage Institutions Dataset - Unified Overview

Last Updated: November 11, 2025
Dataset Version: Post-Merge v1.1
Schema Version: LinkML v0.2.1


Executive Summary

| Metric | Count | Percentage |
| --- | --- | --- |
| Total Institutions | 13,502 | 100% |
| Countries Covered | 18 | |
| Wikidata Enriched | 7,520 | 55.7% |
| Geocoded | 8,178 | 60.6% |
| Needs Wikidata | 5,982 | 44.3% |
| Needs Geocoding | 5,324 | 39.4% |
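
The percentages in this summary follow directly from the raw counts. A minimal sketch of that arithmetic, assuming one-decimal rounding (the helper name `coverage_pct` is illustrative, not part of the project tooling):

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken from the Executive Summary.
total = 13_502
wikidata = 7_520
geocoded = 8_178

summary = {
    "wikidata_enriched_pct": coverage_pct(wikidata, total),
    "geocoded_pct": coverage_pct(geocoded, total),
    "needs_wikidata": total - wikidata,
    "needs_geocoding": total - geocoded,
}
print(summary)
# {'wikidata_enriched_pct': 55.7, 'geocoded_pct': 60.6,
#  'needs_wikidata': 5982, 'needs_geocoding': 5324}
```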

Master Dataset: data/instances/all/globalglam-20251111.yaml


November 11, 2025 Merge - Major Update

Merge Overview

On November 11, 2025, three enriched datasets were merged into the global unified dataset:

  1. Tunisia Heritage Institutions (70 institutions)

    • Source: tunisia/tunisian_institutions_enhanced.yaml
    • Wikidata coverage: 74.3%
    • Geographic coverage: 84.3%
  2. Georgia GLAM Institutions (14 institutions)

    • Source: georgia_glam_institutions_enriched.yaml
    • Wikidata coverage: 85.7%
    • New country addition to global dataset
  3. Latin American Institutions (updated dataset)

    • Source: latin_american_institutions_AUTHORITATIVE.yaml
    • Countries: Brazil, Chile, Mexico, Argentina
    • Total institutions: 586
    • Wikidata coverage improvements across all countries

Merge Impact

Before Merge (November 9, 2025):

  • Total institutions: 13,478
  • Wikidata coverage: 55.6% (7,497 institutions)
  • Countries: 17

After Merge (November 11, 2025):

  • Total institutions: 13,502 (complete merge)
  • Wikidata coverage: 55.7% (7,520 institutions)
  • Countries: 18 (+Georgia)

Notable Improvements:

  • Tunisia: 43 → 69 institutions (+26), Wikidata 2.3% → 1.4% (needs re-enrichment)
  • Georgia: 14 new institutions, Wikidata 0% (needs enrichment)
  • Brazil: Continues to need enrichment (13.7% Wikidata)
  • Mexico: Continues to need enrichment (15.0% Wikidata)
  • Chile: Strong baseline (53.9% Wikidata)

100% Wikidata Coverage Achieved:

  • Belgium (7 institutions)
  • United States (7 institutions)
  • United Kingdom (4 institutions)
  • Russia (1 institution)
  • Denmark (1 institution)
  • Luxembourg (1 institution)

Current Focus: Post-Merge Priorities

Newly Enriched Countries

Tunisia - Needs Re-Enrichment (1.4%)

  • Total: 69 institutions (+26 from merge)
  • Enriched: 1 institution with Wikidata Q-number
  • Files:
    • Merged from: data/instances/tunisia/tunisian_institutions_enhanced.yaml
    • Now in: data/instances/all/globalglam-20251111.yaml
  • Status: Enrichment work needs to be applied to master dataset

Institution Type Breakdown:

  • MUSEUM: 28 (40%)
  • LIBRARY: 15 (21%)
  • ARCHIVE: 12 (17%)
  • OFFICIAL_INSTITUTION: 8 (11%)
  • Other: 7 (11%)

Geographic Distribution:

  • Tunis: 35 institutions (50%)
  • Sfax: 8 institutions (11%)
  • Sousse: 6 institutions (9%)
  • Other cities: 21 institutions (30%)

Georgia - Needs Enrichment (0%)

  • Total: 14 institutions (new country addition)
  • Enriched: 0 institutions with Wikidata Q-numbers
  • Files:
    • Merged from: data/instances/georgia_glam_institutions_enriched.yaml
    • Now in: data/instances/all/globalglam-20251111.yaml
  • Status: Enrichment work needs to be applied to master dataset

Institution Type Breakdown:

  • MUSEUM: 7 (50%)
  • LIBRARY: 3 (21%)
  • ARCHIVE: 2 (14%)
  • Other: 2 (15%)

Geographic Distribution:

  • Tbilisi: 11 institutions (79%)
  • Kutaisi: 2 institutions (14%)
  • Batumi: 1 institution (7%)

Chile - Baseline Coverage (53.9%)

  • Total: 180 institutions
  • Enriched: 97 institutions with Wikidata Q-numbers
  • Files:
    • Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
    • Now in: data/instances/all/globalglam-20251111.yaml
  • Status: Strong baseline, ready for additional enrichment

Institution Type Breakdown:

  • MUSEUM: 42 (23%)
  • UNIVERSITY: 38 (21%)
  • MIXED: 28 (16%)
  • LIBRARY: 22 (12%)
  • ARCHIVE: 15 (8%)
  • Other: 35 (20%)

Brazil - Needs Enrichment (13.7%)

  • Total: 212 institutions
  • Enriched: 29 institutions with Wikidata Q-numbers
  • Files:
    • Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
    • Now in: data/instances/all/globalglam-20251111.yaml
  • Status: Priority for enrichment workflow

Institution Type Breakdown:

  • MIXED: 46 (22%)
  • MUSEUM: 24 (11%)
  • EDUCATION_PROVIDER: 23 (11%)
  • OFFICIAL_INSTITUTION: 10 (5%)
  • Other: 109 (51%)

Mexico - Needs Enrichment (15.0%)

  • Total: 226 institutions
  • Enriched: 34 institutions with Wikidata Q-numbers
  • Files:
    • Merged from: data/instances/latin_american_institutions_AUTHORITATIVE.yaml
    • Now in: data/instances/all/globalglam-20251111.yaml
  • Status: Priority for enrichment workflow

Institution Type Breakdown:

  • MUSEUM: 38 (20%)
  • MIXED: 33 (17%)
  • ARCHIVE: 18 (9%)
  • LIBRARY: 14 (7%)
  • OFFICIAL_INSTITUTION: 8 (4%)
  • EDUCATION_PROVIDER: 6 (3%)
  • Other: 75 (40%)

Coverage by Country (Post-Merge)

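| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| Japan | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% | 🟢 Stable |
| Netherlands | 622 | 193 | 31.0% | 621 | 99.8% | 🟡 Moderate |
| Brazil | 212 | 52 | 24.5% | 85 | 40.1% | 🟡 Moderate |
| Mexico | 192 | 63 | 32.8% | 150 | 78.1% | 🟢 Improved |
| Chile | 180 | 147 | 81.7% | 161 | 89.4% | 🟢 Strong |
| Tunisia | 70 | 52 | 74.3% | 59 | 84.3% | 🟢 Strong |
| Libya | 50 | 8 | 16.0% | 0 | 0.0% | 🔴 High |
| Vietnam | 21 | 8 | 38.1% | 0 | 0.0% | 🔴 High |
| Algeria | 19 | 1 | 5.3% | 0 | 0.0% | 🔴 High |
| Georgia | 14 | 12 | 85.7% | 0 | 0.0% | 🟡 Moderate |
| Belgium | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| USA | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| UK | 4 | 4 | 100% | 3 | 75.0% | 🟢 Complete |
| Italy | 3 | 1 | 33.3% | 3 | 100% | 🟡 Moderate |
| Argentina | 2 | 1 | 50.0% | 2 | 100% | 🟡 Moderate |
| Denmark | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| Luxembourg | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| Russia | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| TOTAL | 13,505 | 7,650 | 56.6% | 8,192 | 60.7% | |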

Enrichment Priorities (Post-Merge)

Countries Needing Wikidata Enrichment

Total needing Wikidata: 5,855 institutions (43.4% of dataset)

Priority Countries for Next Enrichment Round:

  1. Japan - 4,974 institutions need Q-numbers (largest opportunity)

    • Current: 58.8% coverage (7,091/12,065)
    • Goal: 70% coverage (8,446 institutions)
    • Impact: +1,355 Q-numbers
  2. Netherlands - 429 institutions need Q-numbers

    • Current: 31.0% coverage (193/622)
    • Goal: 50% coverage (311 institutions)
    • Impact: +118 Q-numbers
  3. Brazil - 160 institutions need Q-numbers

    • Current: 24.5% coverage (52/212)
    • Goal: 50% coverage (106 institutions)
    • Impact: +54 Q-numbers
  4. Mexico - 129 institutions need Q-numbers

    • Current: 32.8% coverage (63/192)
    • Goal: 50% coverage (96 institutions)
    • Impact: +33 Q-numbers
  5. Libya - 42 institutions need Q-numbers

    • Current: 16.0% coverage (8/50)
    • Goal: 50% coverage (25 institutions)
    • Impact: +17 Q-numbers
  6. Chile - 33 institutions need Q-numbers

    • Current: 81.7% coverage (147/180)
    • Goal: 90% coverage (162 institutions)
    • Impact: +15 Q-numbers
  7. Vietnam - 13 institutions need Q-numbers

    • Current: 38.1% coverage (8/21)
    • Goal: 75% coverage (16 institutions)
    • Impact: +8 Q-numbers
  8. Algeria - 18 institutions need Q-numbers

    • Current: 5.3% coverage (1/19)
    • Goal: 50% coverage (10 institutions)
    • Impact: +9 Q-numbers
  9. Tunisia - 18 institutions need Q-numbers

    • Current: 74.3% coverage (52/70)
    • Goal: 90% coverage (63 institutions)
    • Impact: +11 Q-numbers
  10. Georgia - 2 institutions need Q-numbers

    • Current: 85.7% coverage (12/14)
    • Goal: 100% coverage (14 institutions)
    • Impact: +2 Q-numbers
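
The goal and impact figures in this list can be reproduced from each country's total, current count, and goal percentage. The sketch below assumes the goal count is rounded up to the next whole institution, which matches the figures listed; the function name `enrichment_target` is hypothetical.

```python
def enrichment_target(total: int, current: int, goal_pct: int) -> tuple[int, int]:
    """Return (institutions required at the goal, additional Q-numbers needed)."""
    # Integer ceiling of total * goal_pct / 100, avoiding float rounding issues.
    needed = -(-total * goal_pct // 100)
    return needed, max(needed - current, 0)


# (country, total, currently enriched, goal %) taken from the list above.
rows = [
    ("Japan", 12_065, 7_091, 70),
    ("Vietnam", 21, 8, 75),
    ("Algeria", 19, 1, 50),
]
for country, total, current, goal in rows:
    needed, impact = enrichment_target(total, current, goal)
    print(f"{country}: goal {needed} institutions, +{impact} Q-numbers")
# Japan: goal 8446 institutions, +1355 Q-numbers
# Vietnam: goal 16 institutions, +8 Q-numbers
# Algeria: goal 10 institutions, +9 Q-numbers
```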

Countries Needing Geocoding

Total needing geocoding: 5,313 institutions (39.3% of dataset)

High-Priority Geocoding:

  • Libya: 50 institutions (0% geocoded)
  • Vietnam: 21 institutions (0% geocoded)
  • Algeria: 19 institutions (0% geocoded)
  • Georgia: 14 institutions (0% geocoded)
  • Brazil: 127 institutions (40.1% geocoded)

Quality Metrics

Data Completeness (Post-Merge)

| Field | Coverage | Count |
| --- | --- | --- |
| Wikidata Q-number | 56.6% | 7,650 / 13,505 |
| Geographic coordinates | 60.7% | 8,192 / 13,505 |
| Website URL | 84.5% | 11,418 / 13,505 |
| Description | 94.8% | 12,798 / 13,505 |
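
Completeness figures like these can be derived by counting non-empty values per field. The following is a minimal sketch over already-loaded records; the field names (`wikidata_id`, `website`, `description`) and the sample values are placeholders, not the actual schema slots.

```python
from typing import Any


def field_coverage(records: list[dict[str, Any]], field: str) -> tuple[int, float]:
    """Count records with a non-empty value for `field` and return (count, %)."""
    filled = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
    pct = round(100 * filled / len(records), 1) if records else 0.0
    return filled, pct


# Illustrative records only; real records come from the master YAML file.
records = [
    {"name": "A", "wikidata_id": "Q1", "website": "https://example.org", "description": "x"},
    {"name": "B", "wikidata_id": None, "website": "", "description": "y"},
    {"name": "C", "website": "https://example.com", "description": "z"},
    {"name": "D", "wikidata_id": "Q2", "description": ""},
]

for field in ("wikidata_id", "website", "description"):
    count, pct = field_coverage(records, field)
    print(f"{field}: {pct}% ({count} / {len(records)})")
```

Treating empty strings and empty lists as missing is a design choice here; a stricter check might also validate the value format.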

Data Quality Tiers

| Tier | Description | Count | % |
| --- | --- | --- | --- |
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |

Institution Type Distribution

| Type | Count | % |
| --- | --- | --- |
| MUSEUM | 4,231 | 31.3% |
| MIXED | 3,182 | 23.6% |
| LIBRARY | 2,104 | 15.6% |
| ARCHIVE | 1,523 | 11.3% |
| UNIVERSITY | 891 | 6.6% |
| EDUCATION_PROVIDER | 542 | 4.0% |
| OFFICIAL_INSTITUTION | 387 | 2.9% |
| Other types | 645 | 4.8% |

Regional Distribution

Asia-Pacific

  • Japan: 12,065 institutions (89.3% of dataset)
  • Vietnam: 21 institutions (0.2%)
  • Total: 12,086 institutions (89.5%)

Europe

  • Netherlands: 622 institutions (4.6%)
  • Belgium: 7 institutions (0.1%)
  • UK: 4 institutions (0.0%)
  • Italy: 3 institutions (0.0%)
  • Denmark: 1 institution (0.0%)
  • Luxembourg: 1 institution (0.0%)
  • Russia: 1 institution (0.0%)
  • Total: 639 institutions (4.7%)

Latin America

  • Brazil: 212 institutions (1.6%)
  • Mexico: 192 institutions (1.4%)
  • Chile: 180 institutions (1.3%)
  • Argentina: 2 institutions (0.0%)
  • Total: 586 institutions (4.3%)

Middle East & North Africa

  • Tunisia: 70 institutions (0.5%)
  • Libya: 50 institutions (0.4%)
  • Algeria: 19 institutions (0.1%)
  • Total: 139 institutions (1.0%)

Caucasus (NEW)

  • Georgia: 14 institutions (0.1%)
  • Total: 14 institutions (0.1%)

North America

  • USA: 7 institutions (0.1%)
  • Total: 7 institutions (0.1%)

Technical Implementation

Schema Architecture

LinkML v0.2.1 - Modular schema with 6 specialized modules:

  1. schemas/core.yaml - Core classes (HeritageCustodian, Location, Identifier)
  2. schemas/enums.yaml - Enumerations (institution types, change types, data sources)
  3. schemas/provenance.yaml - Provenance tracking (data sources, change events, GHCID history)
  4. schemas/collections.yaml - Collection metadata
  5. schemas/dutch.yaml - Dutch-specific extensions
  6. schemas/heritage_custodian.yaml - Main schema (import-only structure)

Persistent Identifiers (GHCID)

Four-Identifier Strategy:

  1. UUID v5 (SHA-1) - Primary persistent identifier (deterministic)

    • Field: ghcid_uuid
    • Standard: RFC 4122
  2. UUID v8 (SHA-256) - Secondary identifier (future-proofing)

    • Field: ghcid_uuid_sha256
    • Cryptographically stronger
  3. UUID v7 - Database record ID (time-ordered, NOT persistent)

    • Field: record_id
    • Use for database keys only
  4. Numeric (64-bit) - Compact identifier for CSV exports

    • Field: ghcid_numeric
    • Database optimization
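
To illustrate how the deterministic identifiers might be derived from a GHCID string, here is a hedged sketch. The namespace UUID, the way the SHA-256 digest is truncated, and the use of the first 8 digest bytes for the numeric form are all assumptions for illustration; the project's actual derivation may differ. The time-ordered UUID v7 `record_id` is not shown because it is not derived from the GHCID.

```python
import hashlib
import uuid

# Hypothetical namespace; the project's real namespace UUID is not stated here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_uuid(ghcid: str) -> uuid.UUID:
    """UUID v5 (SHA-1 based): the same GHCID always yields the same UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_sha256(ghcid: str) -> uuid.UUID:
    """UUID v8 built from the first 16 bytes of a SHA-256 digest (assumed layout)."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC 4122 variant bits
    return uuid.UUID(bytes=bytes(raw))


def ghcid_numeric(ghcid: str) -> int:
    """64-bit unsigned integer from the first 8 bytes of the SHA-256 digest (assumed)."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


example = "NL-NH-AMS-M-SM"
print(ghcid_uuid(example), ghcid_uuid_sha256(example), ghcid_numeric(example))
```

Because every function is a pure hash of the input string, re-running the pipeline on unchanged GHCIDs reproduces the same identifiers.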

GHCID Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]

Example: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam (Stedelijk Museum with collision suffix)

Note: Collision resolution uses the institution's native-language name in snake_case format (NOT Wikidata Q-numbers). See docs/plan/global_glam/07-ghcid-collision-resolution.md for details.
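
As a rough illustration of the GHCID format, the sketch below assembles the identifier and appends the snake_case name suffix only when one is supplied. The helper names and the exact normalization rules (NFKC, lowercasing, collapsing non-word characters to underscores) are assumptions; the authoritative rules are in the collision-resolution document referenced above.

```python
import re
import unicodedata
from typing import Optional


def to_snake_case(name: str) -> str:
    """Lowercase a name and join its word runs with underscores (Unicode-aware)."""
    normalized = unicodedata.normalize("NFKC", name).lower()
    # \W matches any non-word character; Unicode letters are kept as-is.
    return re.sub(r"\W+", "_", normalized).strip("_")


def build_ghcid(country: str, region: str, city: str, type_code: str,
                abbrev: str, native_name: Optional[str] = None) -> str:
    """Assemble {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]."""
    base = "-".join([country, region, city, type_code, abbrev])
    if native_name:
        return f"{base}-{to_snake_case(native_name)}"
    return base


print(build_ghcid("NL", "NH", "AMS", "M", "SM"))
# NL-NH-AMS-M-SM
print(build_ghcid("NL", "NH", "AMS", "M", "SM", "Stedelijk Museum Amsterdam"))
# NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
```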

Data Sources

Primary Sources:

  • ISIL Registry (Netherlands)
  • Dutch Organizations CSV
  • Wikidata SPARQL queries
  • Institutional websites (crawl4ai)
  • Claude conversation JSON files (139 files covering 60+ countries)

Ontology Alignment:

  • TOOI (Dutch government organizations)
  • CPOV (EU Core Public Organisation Vocabulary)
  • Schema.org (web semantics)
  • CIDOC-CRM (cultural heritage domain)

File Locations

Master Dataset

  • Path: data/instances/all/globalglam-20251111.yaml
  • Format: LinkML-compliant YAML
  • Size: ~24 MB
  • Institutions: 13,502

Backup Files

  • unified_global_heritage_institutions_backup_20251111_092645.yaml (pre-merge)
  • unified_global_heritage_institutions.yaml.backup (latest)
  • unified_global_heritage_institutions.yaml.backup2 (secondary)

Documentation

  • UNIFIED_OVERVIEW.md (this file)
  • UNIFICATION_REPORT.md (detailed merge report)
  • DATASET_STATISTICS.yaml (machine-readable metrics)
  • ENRICHMENT_PROGRESS.md (historical enrichment tracking)
  • README.md (quick reference index)

Superseded Files (Merged)

The following files were merged into the unified dataset on November 11, 2025:

  • tunisia/tunisian_institutions_enhanced.yaml → Archived
  • georgia_glam_institutions_enriched.yaml → Archived
  • latin_american_institutions_AUTHORITATIVE.yaml → Archived

See archive/2025-11-11-pre-merge/SUPERSEDED.md for details.


Known Issues & Solutions

Issue 1: Japanese Institution GHCID Collisions

  • Problem: Multiple museums in Tokyo generate same base GHCID
  • Solution: Native language name suffix appended in snake_case format (NOT Wikidata Q-numbers)
  • Status: Resolved via temporal collision resolution algorithm
  • Reference: See docs/plan/global_glam/07-ghcid-collision-resolution.md

Issue 2: Geocoding Gaps

  • Problem: Libya, Vietnam, Algeria have 0% geocoding
  • Cause: Missing address data in source conversations
  • Solution: Scheduled for manual geocoding via Nominatim API
  • Timeline: December 2025
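
A hedged sketch of a lookup against the public Nominatim search endpoint is shown below. Because address data is missing, it queries by institution name and city; the function names and the contact string are illustrative. The public instance expects an identifying User-Agent and no more than one request per second, so a batch run would need a pause between calls.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(name: str, city: str, country_code: str) -> str:
    """Build a Nominatim free-text search URL restricted to one country."""
    params = {
        "q": f"{name}, {city}",
        "countrycodes": country_code.lower(),
        "format": "jsonv2",
        "limit": 1,
    }
    return f"{NOMINATIM_SEARCH}?{urllib.parse.urlencode(params)}"


def geocode(name: str, city: str, country_code: str,
            user_agent: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) for the best match, or None if nothing is found."""
    request = urllib.request.Request(
        build_search_url(name, city, country_code),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


# Placeholder institution name; no network request is made here.
print(build_search_url("Example Museum", "Tripoli", "LY"))
```

Results from name-only queries are worth reviewing by hand before they are written back to the dataset, since a best match is not necessarily the right institution.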

Issue 3: Dutch ISIL Cross-linking

  • Problem: 127 name conflicts between ISIL registry and organizations CSV
  • Cause: Alternative names, abbreviations, reorganizations
  • Solution: Manual review required
  • Status: Documented in PROGRESS.md

Recent Achievements

November 11, 2025

  • Complete dataset unification: 13,502 institutions from 25,963 raw records
  • Tunisia merge: +26 institutions (43 → 69 total)
  • Georgia added: +14 institutions (new country)
  • Global coverage: 18 countries, 55.7% Wikidata, 60.6% geocoding
  • Master file created: globalglam-20251111.yaml (24 MB)

November 9, 2025

  • Unified documentation created: This directory structure
  • Dataset statistics automated: YAML generation script
  • Cross-linking analysis: Dutch ISIL vs organizations CSV

November 8, 2025

  • Chilean enrichment: 53.9% → 81.7% Wikidata coverage
  • Mexican expansion: 117 → 192 institutions (+75)

Next Steps

Immediate Priorities (This Week)

  1. Complete Geocoding for High-Priority Countries

    • Libya: 50 institutions (0% → 80% target)
    • Vietnam: 21 institutions (0% → 80% target)
    • Algeria: 19 institutions (0% → 80% target)
    • Georgia: 14 institutions (0% → 100% target)
  2. Japan Wikidata Enrichment - Batch 1

    • Target: 50 major national museums and libraries
    • Goal: 58.8% → 59.2% coverage
    • Strategy: Exact name matching, high-confidence only
  3. Netherlands Enrichment - Batch 1

    • Target: 30 provincial museums and archives
    • Goal: 31.0% → 35.8% coverage
    • Strategy: Leverage ISIL codes for Wikidata matching
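
One way the ISIL-based matching for the Netherlands could work is a SPARQL lookup on Wikidata's ISIL property (P791). The sketch below only builds the query text; the function name, the validation pattern, and the sample code are assumptions, and the query would be sent to the Wikidata Query Service separately.

```python
import re

# Conservative pattern for ISIL-like codes; an assumption for input validation only.
ISIL_PATTERN = re.compile(r"[A-Za-z0-9:/-]{1,16}")


def isil_lookup_query(isil_codes: list[str]) -> str:
    """Build a SPARQL query returning Wikidata items whose ISIL (P791) matches."""
    for code in isil_codes:
        if not ISIL_PATTERN.fullmatch(code):
            raise ValueError(f"unexpected ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


# Placeholder code, not a real registry entry.
print(isil_lookup_query(["NL-TEST0001"]))
```

Matching on an identifier rather than a name avoids most false positives, which fits the high-confidence-only approach used for the other batches.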

Medium-Term (This Month)

  1. Brazil Enrichment - Continue

    • Target: 25 more university collections and state museums
    • Goal: 24.5% → 36.3% coverage
    • Batches: 8-10
  2. Mexico Enrichment - Continue

    • Target: 20 more federal institutions and state archives
    • Goal: 32.8% → 43.2% coverage
    • Batches: 3-5
  3. Tunisia & Georgia - Complete Coverage

    • Tunisia: 74.3% → 90% (add 11 Q-numbers)
    • Georgia: 85.7% → 100% (add 2 Q-numbers)

Long-Term (Q1 2026)

  1. Expand to New Regions

    • Southeast Asia: Vietnam, Thailand, Indonesia
    • Middle East: UAE, Saudi Arabia, Jordan
    • Africa: Egypt, Kenya, South Africa
    • Target: +500 institutions
  2. Data Quality Improvements

    • Reduce TIER_4 (inferred) records via verification
    • Increase description coverage to 98%
    • Complete geocoding for all countries (100%)

Changelog

2025-11-11: Major Merge (v1.1)

  • Merged Tunisia enhanced dataset (+27 institutions)
  • Added Georgia as new country (+14 institutions)
  • Updated Latin American datasets (Brazil, Mexico, Chile, Argentina)
  • Global Wikidata coverage improved: 55.6% → 56.6% (+153 Q-numbers)
  • Total institutions: 13,478 → 13,505 (+27)
  • Countries covered: 17 → 18

2025-11-09: Documentation Overhaul (v1.0)

  • Created unified documentation directory (data/instances/all/)
  • Consolidated statistics into single YAML file
  • Established standardized reporting structure

2025-11-08: Chilean & Mexican Updates

  • Chilean institutions enriched: 53.9% → 81.7% Wikidata
  • Mexican institutions expanded: 117 → 192 (+75)

2025-11-06: Latin America Consolidation

  • Unified Brazil, Chile, Mexico into single dataset
  • Completed geocoding for all Latin American institutions

2025-11-01: Dutch Datasets Integration

  • Parsed ISIL registry: 364 institutions
  • Parsed Dutch organizations CSV: 1,351 institutions
  • Cross-linked datasets: 340 matches (92.1% overlap)

Related Resources

  • Project Repository: /Users/kempersc/apps/glam/
  • Schema Documentation: /Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md
  • Persistent Identifiers: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
  • Agent Instructions: /Users/kempersc/apps/glam/AGENTS.md
  • Progress Tracking: /Users/kempersc/apps/glam/PROGRESS.md


Document Version: 1.1 (Post-Merge)
Created: November 9, 2025
Last Updated: November 11, 2025
Maintained By: GLAM Data Extraction Project