glam/docs/sessions/global_merge_2025-11-07.md
2025-11-30 23:30:29 +01:00

Global Heritage Institutions Dataset Merge - Session Report

⚠️ HISTORICAL DOCUMENT - SUPERSEDED COLLISION RESOLUTION POLICY

This session report documents the original GHCID collision resolution approach, which used Wikidata Q-numbers. As of November 2025, collision resolution instead uses native-language institution names in snake_case format.

Current policy: See docs/plan/global_glam/07-ghcid-collision-resolution.md

Examples of the NEW format:

  • OLD: NL-NH-AMS-M-SM-Q924335
  • NEW: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

Date: 2025-11-07
Session: Global Dataset Merge & GHCID Collision Analysis


Executive Summary

Successfully merged four regional ISIL datasets into a unified global heritage custodian database. However, the merge exposed a critical GHCID collision issue: 28.4% of Japanese institutions share a GHCID with at least one other institution, and 21.2% were dropped during deduplication. Disambiguation is required to recover them.

Note: The collision resolution approach documented below (Wikidata Q-numbers) has been superseded by native language name suffixes. See deprecation notice above.

Final Count: 10,838 institutions (after deduplication from 13,411 source records)


Datasets Merged

| Dataset | Source File | Records | Coverage |
|---|---|---|---|
| Japan ISIL | data/instances/japan/jp_institutions.yaml | 12,065 | Japan (47 prefectures) |
| Netherlands ISIL | data/dutch_institutions_with_ghcids.yaml | 1,032 | Netherlands (all provinces) |
| EU Institutions | data/instances/eu_institutions.yaml | 10 | Belgium, Italy, Luxembourg |
| Latin America | data/instances/latin_american_institutions_AUTHORITATIVE.yaml | 304 | Brazil, Chile, Mexico, Argentina |
| TOTAL | - | 13,411 → 10,838 | 10 countries |

Data Loss: 2,573 institutions (19.2%) removed during deduplication due to GHCID collisions


Global Dataset Statistics

Geographic Distribution

| Country | Count | Percentage |
|---|---|---|
| 🇯🇵 Japan | 9,507 | 87.7% |
| 🇳🇱 Netherlands | 1,017 | 9.4% |
| 🇲🇽 Mexico | 109 | 1.0% |
| 🇧🇷 Brazil | 97 | 0.9% |
| 🇨🇱 Chile | 90 | 0.8% |
| 🇧🇪 Belgium | 7 | 0.1% |
| 🇺🇸 United States | 7 | 0.1% |
| 🇮🇹 Italy | 2 | 0.0% |
| 🇱🇺 Luxembourg | 1 | 0.0% |
| 🇦🇷 Argentina | 1 | 0.0% |

Institution Types

| Type | Count | Percentage |
|---|---|---|
| LIBRARY | 5,132 | 47.4% |
| MUSEUM | 4,680 | 43.2% |
| MIXED | 543 | 5.0% |
| ARCHIVE | 304 | 2.8% |
| COLLECTING_SOCIETY | 66 | 0.6% |
| EDUCATION_PROVIDER | 38 | 0.4% |
| OFFICIAL_INSTITUTION | 37 | 0.3% |
| RESEARCH_CENTER | 32 | 0.3% |
| BOTANICAL_ZOO | 4 | 0.0% |
| UNDEFINED | 2 | 0.0% |

Data Quality Metrics

| Metric | Count | Percentage |
|---|---|---|
| GHCID Coverage | 10,838 | 100.0% |
| Has Identifiers | 10,535 | 97.2% |
| Has Website | 8,666 | 80.0% ⚠️ |
| Geocoded (lat/lon) | 187 | 1.7% |

Major Gap: Only 187 institutions (1.7%) have coordinates - urgent geocoding needed


CRITICAL ISSUE: GHCID Collisions

Problem Summary

2,558 Japanese institutions lost during merge due to GHCID collisions:

  • Colliding GHCIDs: 868 cases
  • Institutions Affected: 3,426 (28.4% of Japan dataset)
  • Data Loss: 2,558 institutions (21.2% of Japan dataset)
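The three figures are linked: deduplication keeps one record per colliding GHCID, so data loss equals institutions affected minus colliding GHCIDs (3,426 - 868 = 2,558). The sketch below shows how such counts could be derived from a plain list of GHCID strings. `collision_stats` is a hypothetical helper, not the code in scripts/analyze_ghcid_collisions.py, and the sample GHCID `JP-HO-SAP-L-SCL` is invented for illustration.

```python
from collections import Counter

def collision_stats(ghcids):
    """Summarise GHCID collisions in a list of GHCID strings."""
    counts = Counter(ghcids)
    colliding = {ghcid: n for ghcid, n in counts.items() if n > 1}
    affected = sum(colliding.values())
    return {
        "colliding_ghcids": len(colliding),
        "institutions_affected": affected,
        # One record per colliding GHCID survives deduplication.
        "data_loss": affected - len(colliding),
    }

# Toy input: three records share one GHCID, two share another, one is unique.
sample = ["JP-AI-TOY-L-T"] * 3 + ["JP-NA-NAG-L-N"] * 2 + ["JP-HO-SAP-L-SCL"]
print(collision_stats(sample))
```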

Root Cause

Japanese municipal library networks have many branches in the same city with similar names, all abbreviating to the same GHCID:

Example: Toyohashi City (TOYOHASHI SHI)

  • 102 community center libraries (e.g., "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU")
  • All abbreviate to: JP-AI-TOY-L-T
  • ISIL codes: JP-1006390 through JP-1006491 (all unique!)
  • 101 institutions lost in merge

Example: Nagasaki City (NAGASAKI SHI)

  • 51 community center libraries (e.g., "NAGASAKISHIHIGASHIKOMINKANTOSHOSHITSU")
  • All abbreviate to: JP-NA-NAG-L-N
  • ISIL codes: JP-1006756 through JP-1006806 (all unique!)
  • 50 institutions lost in merge

Collision Size Distribution

| Collision Size | Cases | Data Loss |
|---|---|---|
| 102 institutions | 1 | 101 lost |
| 51 institutions | 1 | 50 lost |
| 34 institutions | 1 | 33 lost |
| 30 institutions | 1 | 29 lost |
| 29 institutions | 2 | 56 lost |
| 20+ institutions | 9 | ~180 lost |
| 10-19 institutions | 31 | ~400 lost |
| 5-9 institutions | 101 | ~600 lost |
| 2-4 institutions | 690 | ~1,100 lost |

Top 10 Collision Hotspots

| City | Institutions Lost |
|---|---|
| TOYOHASHI SHI (Aichi) | 109 |
| NAGASAKI SHI (Nagasaki) | 55 |
| SHIMOTSUKE SHI (Tochigi) | 33 |
| IMABARI SHI (Ehime) | 31 |
| SASEBO SHI (Nagasaki) | 30 |
| MIYAGI GUN MATSUSHIMA MACHI | 28 |
| TOTTORI SHI (Tottori) | 28 |
| TOYAMA SHI (Toyama) | 27 |
| CHIYODA KU (Tokyo) | 26 |
| TAKAOKA SHI (Toyama) | 25 |

Why This Matters

ISIL codes are unique (all 12,065 Japan institutions have unique codes), but GHCIDs collide because:

  1. City abbreviations work (SAP = Sapporo, TOY = Toyohashi)
  2. Institution type works (L = Library)
  3. Name abbreviations fail for similar branch names:
    • "Sapporo Shinkotoni Library" → "SSL"
    • "Sapporo Sumikawa Library" → "SSL" (collision!)
    • "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU" → "T"
    • "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU" → "T" (collision!)
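The failure is easy to reproduce. The sketch below assumes the abbreviation rule is "first letter of each whitespace-separated word", as described later in this report. `abbreviate` is an illustrative stand-in, not the parser's actual function. Because the romanized Japanese names contain no spaces, each one collapses to a single letter.

```python
def abbreviate(name):
    """First letter of each whitespace-separated word, upper-cased."""
    return "".join(word[0].upper() for word in name.split())

names = [
    "Sapporo Shinkotoni Library",
    "Sapporo Sumikawa Library",
    "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU",
    "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU",
]
for name in names:
    print(f"{name} -> {abbreviate(name)}")
```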

Solution: Wikidata Q-Number Enrichment

GHCID Specification Requirement

Per docs/PERSISTENT_IDENTIFIERS.md and docs/GHCID_PID_SCHEME.md:

GHCID format: {country}-{region}-{city}-{type}-{name_abbrev}[-Q{wikidata_id}]

Collision Resolution Rule:

When multiple institutions generate the same base GHCID, append Wikidata Q-number:

  • JP-AI-TOY-L-T-Q12345 (Toyohashi Futagawa Library)
  • JP-AI-TOY-L-T-Q67890 (Toyohashi Toyoka Library)
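A minimal sketch of parsing a GHCID and appending a Q-number under this rule. The per-field character patterns in the regular expression are assumptions inferred from the examples in this report, not taken from the specification, and `parse_ghcid` and `with_qid` are hypothetical helper names.

```python
import re

# Field patterns are guesses based on examples such as NL-NH-AMS-M-SM-Q924335.
GHCID_RE = re.compile(
    r"^(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]+)-(?P<city>[A-Z]+)"
    r"-(?P<type>[A-Z])-(?P<name_abbrev>[A-Z0-9]+)(?:-(?P<qid>Q\d+))?$"
)

def parse_ghcid(ghcid):
    """Split a GHCID into its components; qid is None when absent."""
    match = GHCID_RE.match(ghcid)
    if match is None:
        raise ValueError(f"not a valid GHCID: {ghcid!r}")
    return match.groupdict()

def with_qid(base_ghcid, qid):
    """Append a Wikidata Q-number to a base GHCID that has none yet."""
    if parse_ghcid(base_ghcid)["qid"] is not None:
        raise ValueError(f"{base_ghcid!r} already carries a Q-number")
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Q-number: {qid!r}")
    return f"{base_ghcid}-{qid}"

print(with_qid("JP-AI-TOY-L-T", "Q12345"))
print(parse_ghcid("NL-NH-AMS-M-SM-Q924335"))
```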

Implementation Strategy

Step 1: Wikidata Lookup (Use Wikidata API)

# For each colliding institution:
# 1. Search Wikidata by ISIL code
# 2. Fallback: Search by name + location
# 3. If found: Extract Q-number
# 4. Append to GHCID: JP-AI-TOY-L-T-Q12345
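The first lookup step could be sketched as follows, assuming the public Wikidata SPARQL endpoint and Wikidata's ISIL property (P791). The function names are hypothetical, the name-plus-location fallback is not shown, and an ambiguous result (more than one match) is treated here as "not found", which is a design assumption rather than something this session decided.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def build_isil_query(isil_code):
    """SPARQL query for items whose ISIL (property P791) equals isil_code."""
    escaped = isil_code.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P791 "{escaped}" }} LIMIT 2'

def extract_qids(response):
    """Pull Q-numbers out of a SPARQL JSON result document."""
    bindings = response.get("results", {}).get("bindings", [])
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]

def lookup_qid_by_isil(isil_code, user_agent):
    """Return the Q-number for an ISIL code, or None if absent or ambiguous."""
    params = {"query": build_isil_query(isil_code), "format": "json"}
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as reply:
        qids = extract_qids(json.load(reply))
    return qids[0] if len(qids) == 1 else None
```

Only `lookup_qid_by_isil` touches the network; the query builder and the result parser are pure functions and can be tested offline.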

Step 2: Synthetic Q-Number Generation (Fallback)

For institutions without Wikidata entries:

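# Generate a synthetic Q-number from the GHCID numeric hash
ghcid_numeric = 789012345678  # example value; normally read from the record's ghcid_numeric field
synthetic_q = f"Q{ghcid_numeric % 100_000_000}"  # keep at most the last 8 digits
ghcid = f"JP-AI-TOY-L-T-{synthetic_q}"
# Result: JP-AI-TOY-L-T-Q12345678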

Step 3: GHCID History Tracking

ghcid_history:
  - ghcid: JP-AI-TOY-L-T-Q12345  # Current (with Q-number)
    ghcid_numeric: 789012345678
    valid_from: "2025-11-07T09:15:47Z"
    valid_to: null
    reason: "Q-number added to resolve collision with 101 other Toyohashi libraries"
  
  - ghcid: JP-AI-TOY-L-T  # Original (without Q-number)
    ghcid_numeric: 123456789012
    valid_from: "2025-11-07T08:00:00Z"
    valid_to: "2025-11-07T09:15:47Z"
    reason: "Base GHCID from ISIL parser (pre-collision resolution)"
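A sketch of how such a history could be maintained in code, assuming the history is a list of dictionaries with the fields shown above, ordered newest first. `record_ghcid_change` is a hypothetical helper, and `when` is expected to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone

def record_ghcid_change(history, new_ghcid, new_numeric, reason, when=None):
    """Close the currently valid entry and prepend the new GHCID."""
    moment = when or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    for entry in history:
        if entry["valid_to"] is None:
            entry["valid_to"] = stamp  # the old GHCID stops being current
    history.insert(0, {
        "ghcid": new_ghcid,
        "ghcid_numeric": new_numeric,
        "valid_from": stamp,
        "valid_to": None,
        "reason": reason,
    })
    return history

history = [{
    "ghcid": "JP-AI-TOY-L-T",
    "ghcid_numeric": 123456789012,
    "valid_from": "2025-11-07T08:00:00Z",
    "valid_to": None,
    "reason": "Base GHCID from ISIL parser (pre-collision resolution)",
}]
record_ghcid_change(
    history, "JP-AI-TOY-L-T-Q12345", 789012345678,
    "Q-number added to resolve collision",
    when=datetime(2025, 11, 7, 9, 15, 47, tzinfo=timezone.utc),
)
```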

Output Files

Main Dataset

  • File: data/instances/global/global_heritage_institutions.yaml
  • Size: 10,838 institutions
  • Format: LinkML-compliant YAML

Analysis Reports

  1. Merge Report: data/instances/global/merge_report.md
  2. Merge Statistics: data/instances/global/merge_statistics.yaml
  3. GHCID Collision Analysis: data/instances/japan/ghcid_collision_analysis.yaml

Scripts Created

  1. scripts/merge_global_datasets.py - Global merge with deduplication
  2. scripts/analyze_ghcid_collisions.py - Collision pattern analysis

Next Steps (Priority Order)

🔴 CRITICAL: Fix GHCID Collisions

Priority: CRITICAL
Impact: Recover 2,558 lost institutions (21.2% of Japan dataset)

Tasks:

  1. Implement Wikidata API integration for Q-number lookup
  2. Enrich Japanese institutions with Q-numbers
  3. Regenerate GHCIDs with collision resolution
  4. Re-merge datasets with updated GHCIDs
  5. Validate: All 12,065 Japan institutions present in final dataset

Expected Outcome: Global dataset grows from 10,838 → 13,396 institutions


🟡 HIGH PRIORITY: Geocoding

Priority: HIGH
Impact: Enable geographic visualization and spatial analysis

Current Coverage: 187/10,838 (1.7%) have coordinates

Geocoding Targets:

  • Japan: 12,045 addresses (99.83% have street addresses)
  • Netherlands: 1,032 institutions (addresses available in ISIL registry)
  • EU: 10 institutions
  • Latin America: Partially geocoded (187 of 304 = 61.5%)

Tools:

  • Nominatim API (OpenStreetMap)
  • Batch geocoding with caching
  • Quality validation (coordinate boundaries check)
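A minimal sketch of batch geocoding with a cache and a coordinate boundary check. The geocoder is passed in as a callable so the logic stays independent of any particular client. The bounding boxes are rough illustrative values rather than authoritative borders, the record fields (`id`, `country`, `address`) are assumed names, and the default one-second delay reflects the public Nominatim usage policy of at most one request per second.

```python
import time

# Rough bounding boxes (min_lat, max_lat, min_lon, max_lon); illustrative only.
COUNTRY_BOUNDS = {
    "JP": (20.0, 46.0, 122.0, 154.0),
    "NL": (50.7, 53.6, 3.3, 7.3),
}

def in_bounds(country, lat, lon):
    """True if the coordinate falls inside the country's bounding box."""
    min_lat, max_lat, min_lon, max_lon = COUNTRY_BOUNDS[country]
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def geocode_batch(records, geocode, cache=None, delay=1.0):
    """Geocode records, reusing cached answers and rejecting implausible ones.

    `geocode` is any callable mapping an address to (lat, lon) or None.
    """
    cache = {} if cache is None else cache
    results = {}
    for record in records:
        address = record["address"]
        if address not in cache:
            cache[address] = geocode(address)
            if delay > 0:
                time.sleep(delay)  # throttle only real lookups, not cache hits
        coords = cache[address]
        if coords is not None and in_bounds(record["country"], *coords):
            results[record["id"]] = coords
        else:
            results[record["id"]] = None  # missing or outside the country
    return results
```

Passing the same `cache` dictionary across runs (or persisting it to disk) avoids repeating lookups for addresses shared by several branches.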

Expected Outcome: ~13,000 geocoded institutions (95%+ coverage)


🟢 MEDIUM PRIORITY: Switzerland ISIL Parser

Priority: MEDIUM
Impact: Add 2,377 institutions to global dataset

Tasks:

  1. Implement Swiss ISIL CSV parser
  2. Generate GHCIDs with canton-level geography
  3. Cross-link with Wikidata for multilingual names
  4. Integrate into global merge

Expected Outcome: Global dataset grows to ~15,800 institutions


🟢 MEDIUM PRIORITY: Data Enrichment

Priority: MEDIUM
Impact: Improve data quality and linked data coverage

Enrichment Sources:

  1. Wikidata: Q-numbers, multilingual names, coordinates, websites
  2. VIAF: Authority control identifiers for cultural institutions
  3. GeoNames: Geographic coordinates and place name variants
  4. OpenStreetMap: Address normalization and POI data

Technical Observations

Deduplication Algorithm Performance

The merge script successfully handles duplicates using a tier-based priority system:

  1. TIER_1_AUTHORITATIVE (CSV registries) > TIER_2 > TIER_3 > TIER_4
  2. Completeness score (more filled fields = better)
  3. Extraction date (more recent = better)

Example: When Netherlands ISIL and Netherlands organizations CSV both contain the same institution, TIER_1 data wins.
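The three criteria can be expressed as a single sort key. This is a sketch under assumptions, not the code in scripts/merge_global_datasets.py: the field names `data_tier` and `extraction_date` and the definition of completeness (number of non-empty fields) are illustrative choices.

```python
from datetime import date

TIER_RANK = {"TIER_1_AUTHORITATIVE": 1, "TIER_2": 2, "TIER_3": 3, "TIER_4": 4}

def completeness(record):
    """Count fields that carry a value."""
    return sum(1 for value in record.values() if value not in (None, "", [], {}))

def pick_best(duplicates):
    """Choose the record to keep: best tier, then most complete, then newest."""
    return min(
        duplicates,
        key=lambda r: (
            TIER_RANK[r["data_tier"]],
            -completeness(r),
            -r["extraction_date"].toordinal(),
        ),
    )

nlp_record = {"name": "X", "data_tier": "TIER_4",
              "extraction_date": date(2025, 11, 7), "website": "https://example.org"}
csv_record = {"name": "X", "data_tier": "TIER_1_AUTHORITATIVE",
              "extraction_date": date(2025, 11, 1), "website": None}
print(pick_best([nlp_record, csv_record])["data_tier"])
```

In this toy case the TIER_1 record is kept even though the TIER_4 record is more complete and more recent, because tier is compared first.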

GHCID Coverage Achievement

Every source record received a GHCID, so all 10,838 institutions in the merged dataset have one (100% coverage):

  • Japan: All 12,065 source records have GHCIDs (9,507 remain after collision dedup)
  • Netherlands: All 1,032 source records have GHCIDs (1,017 remain after collision dedup)
  • EU: All 10 have GHCIDs
  • Latin America: All 304 have GHCIDs

Data Tier Distribution

| Tier | Description | Count (Estimated) |
|---|---|---|
| TIER_1 | Authoritative CSV registries | ~13,100 (97.7%) |
| TIER_2 | Website crawls (future) | 0 |
| TIER_3 | Wikidata/crowd-sourced | 0 |
| TIER_4 | NLP-extracted (Latin America) | ~300 (2.2%) |

Recommendations

Immediate Actions

  1. Fix GHCID collisions BEFORE further analysis

    • Current global dataset is incomplete (missing 21% of Japan data)
    • Collision resolution is required by GHCID spec
    • Wikidata enrichment provides additional value (multilingual names, links)
  2. Update validation scripts

    • Add collision detection to schema validators
    • Warn if base GHCID (without Q-number) appears multiple times
    • Enforce Q-number requirement for collisions
  3. Document collision resolution algorithm

    • Create step-by-step guide for future parsers
    • Example code for Wikidata Q-number lookup
    • Fallback strategy for institutions without Wikidata entries
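The validator check in item 2 might look like the sketch below. It assumes the Q-number is always the final `-Q<digits>` segment of a GHCID. `find_collision_violations` is a hypothetical function name, not an existing validator.

```python
import re
from collections import defaultdict

Q_SUFFIX = re.compile(r"-Q\d+$")

def find_collision_violations(ghcids):
    """Return warnings for base GHCIDs that are shared by several records."""
    groups = defaultdict(list)
    for ghcid in ghcids:
        groups[Q_SUFFIX.sub("", ghcid)].append(ghcid)

    warnings = []
    for base, members in sorted(groups.items()):
        if len(members) < 2:
            continue  # unique base GHCID, no Q-number required
        missing = [g for g in members if not Q_SUFFIX.search(g)]
        duplicates = len(members) - len(set(members))
        if missing:
            warnings.append(
                f"{base}: {len(missing)} of {len(members)} colliding records lack a Q-number"
            )
        if duplicates:
            warnings.append(f"{base}: {duplicates} duplicate full GHCID(s)")
    return warnings

print(find_collision_violations(["JP-AI-TOY-L-T", "JP-AI-TOY-L-T-Q67890"]))
```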

Long-term Strategy

  1. Prevent future collisions

    • Generate GHCIDs with Q-numbers during initial parsing
    • Don't wait for merge phase to discover collisions
    • Japan parser should detect branch library patterns
  2. Improve abbreviation algorithm

    • Current algorithm: First letter of each word
    • Better approach: Distinctive substring extraction
    • Example: "Futagawa" → "FUT" (not "F"), "Toyoka" → "TOY" (not "T")
    • But this is complex - Q-numbers are simpler and more reliable
  3. Integrate Wikidata from start

    • ISIL CSV files often lack Wikidata Q-numbers
    • Enrich during parsing, not after merge
    • Cache Q-number lookups (rate limit: 1 req/sec)
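For comparison, the "distinctive substring" idea from item 2 can be sketched by stripping whatever prefix and suffix a group of colliding names share and abbreviating the remainder. This is one possible reading of that idea, not an algorithm the project adopted. The function names are hypothetical, and it only makes sense for two or more distinct names.

```python
import os

def distinctive_parts(names):
    """Strip the prefix and suffix shared by all names; return what differs."""
    prefix = os.path.commonprefix(names)  # character-wise common prefix
    stripped = [name[len(prefix):] for name in names]
    suffix = os.path.commonprefix([s[::-1] for s in stripped])[::-1]
    return [s[: len(s) - len(suffix)] for s in stripped]

def distinctive_abbrevs(names, length=3):
    """Abbreviate each name by the first letters of its distinctive part."""
    return [part[:length].upper() for part in distinctive_parts(names)]

colliding = [
    "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU",
    "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU",
]
print(distinctive_abbrevs(colliding))
```

For the two Toyohashi names this yields "FUT" and "TOY". The result depends on which names are in the group, so adding a branch later could change existing abbreviations, which is one reason the report prefers Q-numbers.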

References

Documentation

  • GHCID Specification: docs/PERSISTENT_IDENTIFIERS.md
  • GHCID PID Scheme: docs/GHCID_PID_SCHEME.md
  • Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
  • Agent Instructions: AGENTS.md (updated with collision handling guidance)

Schema

  • Core Classes: schemas/core.yaml (HeritageCustodian, Location, Identifier)
  • Provenance: schemas/provenance.yaml (GHCIDHistoryEntry, ChangeEvent)
  • Enumerations: schemas/enums.yaml (InstitutionTypeEnum, DataSource, DataTier)

Data Sources

  • Japan ISIL: NDL ISIL registry (12,065 institutions)
  • Netherlands ISIL: KB ISIL registry (364 codes) + Dutch orgs CSV (1,351 orgs)
  • EU: Manual ISIL lookups (Belgium, Italy, Luxembourg)
  • Latin America: NLP-extracted from Claude conversations (Brazil, Chile, Mexico)

Session Artifacts

Files Created

data/instances/global/
├── global_heritage_institutions.yaml    # 10,838 institutions
├── merge_report.md                      # Human-readable summary
└── merge_statistics.yaml                # Detailed statistics

data/instances/japan/
└── ghcid_collision_analysis.yaml       # Collision investigation

scripts/
├── merge_global_datasets.py            # Merge + deduplication
└── analyze_ghcid_collisions.py         # Collision analysis

docs/sessions/
└── global_merge_2025-11-07.md          # This document

Commands Executed

# Global merge
python scripts/merge_global_datasets.py

# Collision analysis
python scripts/analyze_ghcid_collisions.py

Conclusion

Status: Global merge technically successful, but GHCID collision issue is blocking further work.

Key Achievement: Unified 4 regional datasets into single global database with 100% GHCID coverage.

Critical Finding: 21.2% of Japan dataset lost to GHCID collisions - Wikidata Q-number enrichment is mandatory before proceeding with geocoding, exports, or public release.

Next Session: Implement Wikidata Q-number enrichment and GHCID regeneration for Japanese institutions.


Session Date: 2025-11-07
Duration: ~90 minutes
Total Institutions: 10,838 (with 2,558 pending recovery)
Schema Version: v0.2.0 (modular LinkML)
Status: ⚠️ Collision resolution required