Global Heritage Institutions Dataset Merge - Session Report
⚠️ HISTORICAL DOCUMENT - SUPERSEDED COLLISION RESOLUTION POLICY
This session report documents the original GHCID collision resolution approach using Wikidata Q-numbers. As of November 2025, collision resolution now uses native language institution names in snake_case format.
Current policy: See docs/plan/global_glam/07-ghcid-collision-resolution.md
Examples of the NEW format:
- OLD: NL-NH-AMS-M-SM-Q924335
- NEW: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
Date: 2025-11-07
Session: Global Dataset Merge & GHCID Collision Analysis
Executive Summary
Successfully merged four regional ISIL datasets into a unified global heritage custodian database. However, the merge exposed a critical GHCID collision issue: 21.2% of Japanese institutions were dropped during deduplication, and recovering them requires disambiguated identifiers.
Note: The collision resolution approach documented below (Wikidata Q-numbers) has been superseded by native language name suffixes. See deprecation notice above.
Final Count: 10,838 institutions (after deduplication from 13,411 source records)
Datasets Merged
| Dataset | Source File | Records | Coverage |
|---|---|---|---|
| Japan ISIL | data/instances/japan/jp_institutions.yaml | 12,065 | Japan (47 prefectures) |
| Netherlands ISIL | data/dutch_institutions_with_ghcids.yaml | 1,032 | Netherlands (all provinces) |
| EU Institutions | data/instances/eu_institutions.yaml | 10 | Belgium, Italy, Luxembourg |
| Latin America | data/instances/latin_american_institutions_AUTHORITATIVE.yaml | 304 | Brazil, Chile, Mexico, Argentina |
| TOTAL | - | 13,411 → 10,838 | 10 countries |
Data Loss: 2,573 institutions (19.2%) removed during deduplication due to GHCID collisions
Global Dataset Statistics
Geographic Distribution
| Country | Count | Percentage |
|---|---|---|
| 🇯🇵 Japan | 9,507 | 87.7% |
| 🇳🇱 Netherlands | 1,017 | 9.4% |
| 🇲🇽 Mexico | 109 | 1.0% |
| 🇧🇷 Brazil | 97 | 0.9% |
| 🇨🇱 Chile | 90 | 0.8% |
| 🇧🇪 Belgium | 7 | 0.1% |
| 🇺🇸 United States | 7 | 0.1% |
| 🇮🇹 Italy | 2 | 0.0% |
| 🇱🇺 Luxembourg | 1 | 0.0% |
| 🇦🇷 Argentina | 1 | 0.0% |
Institution Types
| Type | Count | Percentage |
|---|---|---|
| LIBRARY | 5,132 | 47.4% |
| MUSEUM | 4,680 | 43.2% |
| MIXED | 543 | 5.0% |
| ARCHIVE | 304 | 2.8% |
| COLLECTING_SOCIETY | 66 | 0.6% |
| EDUCATION_PROVIDER | 38 | 0.4% |
| OFFICIAL_INSTITUTION | 37 | 0.3% |
| RESEARCH_CENTER | 32 | 0.3% |
| BOTANICAL_ZOO | 4 | 0.0% |
| UNDEFINED | 2 | 0.0% |
Data Quality Metrics
| Metric | Count | Percentage |
|---|---|---|
| GHCID Coverage | 10,838 | 100.0% ✅ |
| Has Identifiers | 10,535 | 97.2% ✅ |
| Has Website | 8,666 | 80.0% ⚠️ |
| Geocoded (lat/lon) | 187 | 1.7% ❌ |
Major Gap: Only 187 institutions (1.7%) have coordinates - urgent geocoding needed
CRITICAL ISSUE: GHCID Collisions
Problem Summary
2,558 Japanese institutions lost during merge due to GHCID collisions:
- Colliding GHCIDs: 868 cases
- Institutions Affected: 3,426 (28.4% of Japan dataset)
- Data Loss: 2,558 institutions (21.2% of Japan dataset)
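The relationship between these three figures can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code in scripts/analyze_ghcid_collisions.py; the function name and the sample identifiers are assumptions. It relies on the merge keeping exactly one record per GHCID, so each colliding group loses all but one member (3,426 affected minus 868 survivors gives 2,558 lost).

```python
from collections import Counter

def collision_stats(ghcids):
    """Summarise collisions in a list of GHCIDs (one entry per institution)."""
    counts = Counter(ghcids)
    colliding = {g: n for g, n in counts.items() if n > 1}
    affected = sum(colliding.values())
    # A GHCID-keyed merge keeps one record per identifier, so every
    # colliding group loses all but one of its members.
    lost = affected - len(colliding)
    return {"colliding_ghcids": len(colliding), "affected": affected, "lost": lost}

# Hypothetical sample: two colliding groups and one unique identifier.
sample = ["JP-AI-TOY-L-T"] * 3 + ["JP-NA-NAG-L-N"] * 2 + ["JP-XX-AAA-L-A"]
stats = collision_stats(sample)
print(stats)
```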
Root Cause
Japanese municipal library networks have many branches in the same city with similar names, all abbreviating to the same GHCID:
Example: Toyohashi City (TOYOHASHI SHI)
- 102 community center libraries (e.g., "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU")
- All abbreviate to: JP-AI-TOY-L-T
- ISIL codes: JP-1006390 through JP-1006491 (all unique!)
- 101 institutions lost in merge
Example: Nagasaki City (NAGASAKI SHI)
- 51 community center libraries (e.g., "NAGASAKISHIHIGASHIKOMINKANTOSHOSHITSU")
- All abbreviate to: JP-NA-NAG-L-N
- ISIL codes: JP-1006756 through JP-1006806 (all unique!)
- 50 institutions lost in merge
Collision Size Distribution
| Collision Size | Cases | Data Loss |
|---|---|---|
| 102 institutions | 1 | 101 lost |
| 51 institutions | 1 | 50 lost |
| 34 institutions | 1 | 33 lost |
| 30 institutions | 1 | 29 lost |
| 29 institutions | 2 | 56 lost |
| 20+ institutions | 9 | ~180 lost |
| 10-19 institutions | 31 | ~400 lost |
| 5-9 institutions | 101 | ~600 lost |
| 2-4 institutions | 690 | ~1,100 lost |
Top 10 Collision Hotspots
| City | Institutions Lost |
|---|---|
| TOYOHASHI SHI (Aichi) | 109 |
| NAGASAKI SHI (Nagasaki) | 55 |
| SHIMOTSUKE SHI (Tochigi) | 33 |
| IMABARI SHI (Ehime) | 31 |
| SASEBO SHI (Nagasaki) | 30 |
| MIYAGI GUN MATSUSHIMA MACHI | 28 |
| TOTTORI SHI (Tottori) | 28 |
| TOYAMA SHI (Toyama) | 27 |
| CHIYODA KU (Tokyo) | 26 |
| TAKAOKA SHI (Toyama) | 25 |
Why This Matters
ISIL codes are unique (all 12,065 Japan institutions have unique codes), but GHCIDs collide because:
- ✅ City abbreviations work (SAP = Sapporo, TOY = Toyohashi)
- ✅ Institution type works (L = Library)
- ❌ Name abbreviations fail for similar branch names:
  - "Sapporo Shinkotoni Library" → "SSL"
  - "Sapporo Sumikawa Library" → "SSL" (collision!)
  - "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU" → "T"
  - "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU" → "T" (collision!)
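The failure mode is easy to reproduce. The sketch below assumes the abbreviation step simply takes the first letter of each whitespace-separated word (the report describes the algorithm as "first letter of each word"); the function name is hypothetical and the real parser may differ in detail.

```python
def abbreviate(name: str) -> str:
    """First letter of each whitespace-separated word, upper-cased."""
    return "".join(word[0].upper() for word in name.split())

names = [
    "Sapporo Shinkotoni Library",
    "Sapporo Sumikawa Library",
    "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU",
    "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU",
]
for name in names:
    print(f"{name} -> {abbreviate(name)}")
```

Because the romanized Japanese names arrive as a single unbroken token, every branch in a city collapses to one letter, which is why the Toyohashi group is so large.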
Solution: Wikidata Q-Number Enrichment
GHCID Specification Requirement
Per docs/PERSISTENT_IDENTIFIERS.md and docs/GHCID_PID_SCHEME.md:
GHCID format: {country}-{region}-{city}-{type}-{name_abbrev}[-Q{wikidata_id}]
Collision Resolution Rule:
When multiple institutions generate the same base GHCID, append Wikidata Q-number:
- JP-AI-TOY-L-T-Q12345 (Toyohashi Futagawa Library)
- JP-AI-TOY-L-T-Q67890 (Toyohashi Toyoka Library)
Implementation Strategy
Step 1: Wikidata Lookup (Use Wikidata API)
# For each colliding institution:
# 1. Search Wikidata by ISIL code
# 2. Fallback: Search by name + location
# 3. If found: Extract Q-number
# 4. Append to GHCID: JP-AI-TOY-L-T-Q12345
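A minimal sketch of the ISIL-based lookup (item 1 in the outline above), assuming the Wikidata Query Service SPARQL endpoint and the ISIL property P791. The helper names are hypothetical, and the sample response is hand-written in the standard SPARQL JSON results shape with the placeholder Q-number used elsewhere in this report. No network call is made here; sending the request, rate limiting, and the name + location fallback are left out.

```python
import json
import re
import urllib.parse

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_isil_query(isil_code: str) -> str:
    """SPARQL query for the item whose ISIL (property P791) matches the code."""
    escaped = isil_code.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P791 "{escaped}" . }} LIMIT 1'

def build_request_url(isil_code: str) -> str:
    """Full GET URL for the query, asking for JSON results."""
    params = urllib.parse.urlencode(
        {"query": build_isil_query(isil_code), "format": "json"}
    )
    return f"{WDQS_ENDPOINT}?{params}"

def extract_q_number(response_text: str):
    """Return the first Q-number in a SPARQL JSON response, or None."""
    data = json.loads(response_text)
    for binding in data["results"]["bindings"]:
        match = re.search(r"/entity/(Q\d+)$", binding["item"]["value"])
        if match:
            return match.group(1)
    return None

# Hand-written sample response; Q12345 is a placeholder, not a real match.
sample_response = json.dumps({
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q12345"}}
    ]},
})
print(build_request_url("JP-1006390"))
print(extract_q_number(sample_response))
```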
Step 2: Synthetic Q-Number Generation (Fallback)
For institutions without Wikidata entries:
# Generate synthetic Q-number from GHCID numeric hash
synthetic_q = f"Q{ghcid_numeric % 100000000}"
# Result: JP-AI-TOY-L-T-Q17339437
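Steps 1 and 2 can be combined into a single pass over the records. The following is a sketch under assumptions: the record field names (ghcid, ghcid_numeric, wikidata_id) are hypothetical, and the sample numeric value is chosen only so that the fallback reproduces the Q17339437 result shown above.

```python
from collections import defaultdict

def synthetic_q(ghcid_numeric: int) -> str:
    """Fallback Q-number derived from the numeric GHCID hash."""
    return f"Q{ghcid_numeric % 100_000_000}"

def resolve_collisions(records):
    """Append a Q-number suffix to every record whose base GHCID is shared."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["ghcid"]].append(rec)

    resolved = []
    for base, members in groups.items():
        for rec in members:
            out = dict(rec)  # leave the input records untouched
            if len(members) > 1:
                # Prefer a real Wikidata ID; otherwise use the synthetic one.
                q = rec.get("wikidata_id") or synthetic_q(rec["ghcid_numeric"])
                out["ghcid"] = f"{base}-{q}"
            resolved.append(out)
    return resolved

# Hypothetical records: two colliding Toyohashi branches and one unique GHCID.
records = [
    {"ghcid": "JP-AI-TOY-L-T", "ghcid_numeric": 111, "wikidata_id": "Q12345"},
    {"ghcid": "JP-AI-TOY-L-T", "ghcid_numeric": 917339437, "wikidata_id": None},
    {"ghcid": "JP-NA-NAG-L-N", "ghcid_numeric": 222, "wikidata_id": None},
]
for rec in resolve_collisions(records):
    print(rec["ghcid"])
```

One design caveat: a synthetic value in this range can coincide with a real Wikidata Q-number, so downstream consumers cannot tell real and synthetic suffixes apart without extra metadata.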
Step 3: GHCID History Tracking
ghcid_history:
  - ghcid: JP-AI-TOY-L-T-Q12345  # Current (with Q-number)
    ghcid_numeric: 789012345678
    valid_from: "2025-11-07T09:15:47Z"
    valid_to: null
    reason: "Q-number added to resolve collision with 101 other Toyohashi libraries"
  - ghcid: JP-AI-TOY-L-T  # Original (without Q-number)
    ghcid_numeric: 123456789012
    valid_from: "2025-11-07T08:00:00Z"
    valid_to: "2025-11-07T09:15:47Z"
    reason: "Base GHCID from ISIL parser (pre-collision resolution)"
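A history update of this kind can be sketched as follows. The dataclass mirrors the fields in the YAML above and borrows the GHCIDHistoryEntry name from schemas/provenance.yaml, but it is a stand-in written for this example, not the generated LinkML class; the record_ghcid_change helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GHCIDHistoryEntry:
    ghcid: str
    ghcid_numeric: int
    valid_from: str
    valid_to: Optional[str]
    reason: str

def record_ghcid_change(history: List[GHCIDHistoryEntry], new_ghcid: str,
                        new_numeric: int, changed_at: str, reason: str):
    """Close the current entry and prepend the new one (newest first)."""
    if history and history[0].valid_to is None:
        history[0].valid_to = changed_at
    history.insert(0, GHCIDHistoryEntry(new_ghcid, new_numeric, changed_at, None, reason))
    return history

history = [GHCIDHistoryEntry(
    "JP-AI-TOY-L-T", 123456789012, "2025-11-07T08:00:00Z", None,
    "Base GHCID from ISIL parser (pre-collision resolution)",
)]
record_ghcid_change(
    history, "JP-AI-TOY-L-T-Q12345", 789012345678, "2025-11-07T09:15:47Z",
    "Q-number added to resolve collision with 101 other Toyohashi libraries",
)
for entry in history:
    print(entry.ghcid, entry.valid_from, entry.valid_to)
```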
Output Files
Main Dataset
- File: data/instances/global/global_heritage_institutions.yaml
- Size: 10,838 institutions
- Format: LinkML-compliant YAML
Analysis Reports
- Merge Report: data/instances/global/merge_report.md
- Merge Statistics: data/instances/global/merge_statistics.yaml
- GHCID Collision Analysis: data/instances/japan/ghcid_collision_analysis.yaml
Scripts Created
- scripts/merge_global_datasets.py - Global merge with deduplication
- scripts/analyze_ghcid_collisions.py - Collision pattern analysis
Next Steps (Priority Order)
🔴 CRITICAL: Fix GHCID Collisions
Priority: CRITICAL
Impact: Recover 2,558 lost institutions (21.2% of Japan dataset)
Tasks:
- Implement Wikidata API integration for Q-number lookup
- Enrich Japanese institutions with Q-numbers
- Regenerate GHCIDs with collision resolution
- Re-merge datasets with updated GHCIDs
- Validate: All 12,065 Japan institutions present in final dataset
Expected Outcome: Global dataset grows from 10,838 → 13,396 institutions
🟡 HIGH PRIORITY: Geocoding
Priority: HIGH
Impact: Enable geographic visualization and spatial analysis
Current Coverage: 187/10,838 (1.7%) have coordinates
Geocoding Targets:
- Japan: 12,045 addresses (99.83% have street addresses)
- Netherlands: 1,032 institutions (addresses available in ISIL registry)
- EU: 10 institutions
- Latin America: Partially geocoded already (187/304 = 61.5%)
Tools:
- Nominatim API (OpenStreetMap)
- Batch geocoding with caching
- Quality validation (coordinate boundaries check)
Expected Outcome: ~13,000 geocoded institutions (95%+ coverage)
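The caching and boundary-check ideas listed under Tools can be sketched without committing to a particular geocoding client. In the example below the lookup function is injected, so a Nominatim call (or any other backend) can be plugged in later. The class name, the bounding boxes, and the sample addresses are all assumptions made for illustration; the boxes are rough and cover only Japan and the European Netherlands.

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)

# Rough bounding boxes: (min_lat, max_lat, min_lon, max_lon). Assumed values.
COUNTRY_BOUNDS = {
    "JP": (20.0, 46.0, 122.0, 154.0),
    "NL": (50.5, 53.8, 3.0, 7.5),
}

def in_bounds(country: str, coord: Coord) -> bool:
    """True if coord lies in the country's box (or no box is defined)."""
    box = COUNTRY_BOUNDS.get(country)
    if box is None:
        return True
    min_lat, max_lat, min_lon, max_lon = box
    lat, lon = coord
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

class CachedGeocoder:
    """Wraps a lookup function with an in-memory cache and a bounds check."""

    def __init__(self, lookup: Callable[[str], Optional[Coord]]):
        self._lookup = lookup
        self._cache: Dict[str, Optional[Coord]] = {}
        self.calls = 0  # number of real lookups performed

    def geocode(self, address: str, country: str) -> Optional[Coord]:
        key = address.strip().lower()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._lookup(address)
        coord = self._cache[key]
        if coord is None or not in_bounds(country, coord):
            return None  # missing or implausible result
        return coord

# Stand-in backend with made-up addresses; a real one would call a geocoding API.
FAKE_RESULTS = {"Example Street 1, Amsterdam": (52.37, 4.89), "Null Island": (0.0, 0.0)}
geocoder = CachedGeocoder(FAKE_RESULTS.get)
print(geocoder.geocode("Example Street 1, Amsterdam", "NL"))
print(geocoder.geocode("Example Street 1, Amsterdam", "NL"), geocoder.calls)
print(geocoder.geocode("Null Island", "NL"))
```

A persistent cache (for example a file on disk) would be needed for a batch of this size; the in-memory dictionary only shows where the cache sits.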
🟢 MEDIUM PRIORITY: Switzerland ISIL Parser
Priority: MEDIUM
Impact: Add 2,377 institutions to global dataset
Tasks:
- Implement Swiss ISIL CSV parser
- Generate GHCIDs with canton-level geography
- Cross-link with Wikidata for multilingual names
- Integrate into global merge
Expected Outcome: Global dataset grows to ~15,800 institutions
🟢 MEDIUM PRIORITY: Data Enrichment
Priority: MEDIUM
Impact: Improve data quality and linked data coverage
Enrichment Sources:
- Wikidata: Q-numbers, multilingual names, coordinates, websites
- VIAF: Authority control identifiers for cultural institutions
- GeoNames: Geographic coordinates and place name variants
- OpenStreetMap: Address normalization and POI data
Technical Observations
Deduplication Algorithm Performance
The merge script successfully handles duplicates using a tier-based priority system:
- TIER_1_AUTHORITATIVE (CSV registries) > TIER_2 > TIER_3 > TIER_4
- Completeness score (more filled fields = better)
- Extraction date (more recent = better)
Example: When Netherlands ISIL and Netherlands organizations CSV both contain the same institution, TIER_1 data wins.
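The priority rules above map naturally onto a sort key. This sketch assumes the three criteria are applied in the listed order and that records are plain dictionaries; the field names (data_tier, extraction_date) and the sample records are hypothetical, and the actual logic in scripts/merge_global_datasets.py may differ.

```python
from datetime import date

TIER_RANK = {"TIER_1_AUTHORITATIVE": 1, "TIER_2": 2, "TIER_3": 3, "TIER_4": 4}

def completeness(record: dict) -> int:
    """Number of fields that carry a non-empty value."""
    return sum(1 for v in record.values() if v not in (None, "", [], {}))

def priority_key(record: dict):
    """Lower tuples win: better tier, then more fields, then newer extraction."""
    return (
        TIER_RANK[record["data_tier"]],
        -completeness(record),
        -record["extraction_date"].toordinal(),
    )

def pick_winner(duplicates):
    """Choose the record to keep among duplicates of one institution."""
    return min(duplicates, key=priority_key)

# Hypothetical duplicates of a single institution.
registry = {"name": "Example Archive", "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": date(2025, 11, 1), "website": None}
extracted = {"name": "Example Archive", "data_tier": "TIER_4",
             "extraction_date": date(2025, 11, 7), "website": "https://example.org"}
print(pick_winner([extracted, registry])["data_tier"])
```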
GHCID Coverage Achievement
100% GHCID coverage across all 10,838 institutions shows that the GHCID generation pipeline works:
- ✅ Japan: all 12,065 source records have GHCIDs (9,507 remain after collision dedup)
- ✅ Netherlands: all 1,032 source records have GHCIDs (1,017 remain after collision dedup)
- ✅ EU: All 10 have GHCIDs
- ✅ Latin America: All 304 have GHCIDs
Data Tier Distribution
| Tier | Description | Count (Estimated) |
|---|---|---|
| TIER_1 | Authoritative CSV registries | ~13,100 (97.7%) |
| TIER_2 | Website crawls (future) | 0 |
| TIER_3 | Wikidata/crowd-sourced | 0 |
| TIER_4 | NLP-extracted (Latin America) | ~300 (2.2%) |
Recommendations
Immediate Actions
- Fix GHCID collisions BEFORE further analysis
  - Current global dataset is incomplete (missing 21% of Japan data)
  - Collision resolution is required by GHCID spec
  - Wikidata enrichment provides additional value (multilingual names, links)
- Update validation scripts
  - Add collision detection to schema validators
  - Warn if base GHCID (without Q-number) appears multiple times
  - Enforce Q-number requirement for collisions
- Document collision resolution algorithm
  - Create step-by-step guide for future parsers
  - Example code for Wikidata Q-number lookup
  - Fallback strategy for institutions without Wikidata entries
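The validation checks recommended above could look roughly like this. It is a sketch with hypothetical names, and it only recognises the Q-number suffix format described in this report (not the later name-based suffixes). It flags exact duplicates and any un-suffixed GHCID whose base is shared with another record.

```python
import re
from collections import Counter

Q_SUFFIX = re.compile(r"-Q\d+$")

def base_ghcid(ghcid: str) -> str:
    """Strip a trailing -Q<digits> suffix, if present."""
    return Q_SUFFIX.sub("", ghcid)

def find_collision_violations(ghcids):
    """Return human-readable problems for a list of GHCIDs."""
    full_counts = Counter(ghcids)
    base_counts = Counter(base_ghcid(g) for g in ghcids)
    problems = []
    for g in sorted(full_counts):
        if full_counts[g] > 1:
            problems.append(f"duplicate GHCID: {g} ({full_counts[g]} records)")
        elif base_counts[base_ghcid(g)] > 1 and not Q_SUFFIX.search(g):
            problems.append(f"missing Q-number on colliding base: {g}")
    return problems

print(find_collision_violations(["JP-AI-TOY-L-T", "JP-AI-TOY-L-T-Q12345"]))
```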
Long-term Strategy
- Prevent future collisions
  - Generate GHCIDs with Q-numbers during initial parsing
  - Don't wait for merge phase to discover collisions
  - Japan parser should detect branch library patterns
- Improve abbreviation algorithm
  - Current algorithm: First letter of each word
  - Better approach: Distinctive substring extraction
  - Example: "Futagawa" → "FUT" (not "F"), "Toyoka" → "TOY" (not "T")
  - But this is complex - Q-numbers are simpler and more reliable
- Integrate Wikidata from start
  - ISIL CSV files often lack Wikidata Q-numbers
  - Enrich during parsing, not after merge
  - Cache Q-number lookups (rate limit: 1 req/sec)
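To make the "distinctive substring extraction" idea concrete, here is one possible approach, offered as an assumption and not as the project's algorithm: strip the prefix and suffix shared by all names in a colliding group, then take the first three letters of what remains. The function name is hypothetical.

```python
import os

def distinctive_tokens(names, length=3):
    """Short tokens taken from the part of each name that differs within the group."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(names)
    rest = [n[len(prefix):] for n in names]
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    cores = [r[:len(r) - len(suffix)] for r in rest]
    return [c[:length].upper() for c in cores]

group = [
    "TOYOHASHISHIFUTAGAWASHOGAIGAKUSHUSENTATOSHOSHITSU",
    "TOYOHASHISHITOYOKASHOGAIGAKUSHUSENTATOSHOSHITSU",
]
print(distinctive_tokens(group))  # ['FUT', 'TOY']
```

With a group of 102 names the shared prefix and suffix shrink and three-letter tokens can still collide, which supports the report's point that this route is harder to get right than an external identifier.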
References
Documentation
- GHCID Specification: docs/PERSISTENT_IDENTIFIERS.md
- GHCID PID Scheme: docs/GHCID_PID_SCHEME.md
- Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
- Agent Instructions: AGENTS.md (updated with collision handling guidance)
Schema
- Core Classes: schemas/core.yaml (HeritageCustodian, Location, Identifier)
- Provenance: schemas/provenance.yaml (GHCIDHistoryEntry, ChangeEvent)
- Enumerations: schemas/enums.yaml (InstitutionTypeEnum, DataSource, DataTier)
Data Sources
- Japan ISIL: NDL ISIL registry (12,065 institutions)
- Netherlands ISIL: KB ISIL registry (364 codes) + Dutch orgs CSV (1,351 orgs)
- EU: Manual ISIL lookups (Belgium, Italy, Luxembourg)
- Latin America: NLP-extracted from Claude conversations (Brazil, Chile, Mexico)
Session Artifacts
Files Created
data/instances/global/
├── global_heritage_institutions.yaml # 10,838 institutions
├── merge_report.md # Human-readable summary
└── merge_statistics.yaml # Detailed statistics
data/instances/japan/
└── ghcid_collision_analysis.yaml # Collision investigation
scripts/
├── merge_global_datasets.py # Merge + deduplication
└── analyze_ghcid_collisions.py # Collision analysis
docs/sessions/
└── global_merge_2025-11-07.md # This document
Commands Executed
# Global merge
python scripts/merge_global_datasets.py
# Collision analysis
python scripts/analyze_ghcid_collisions.py
Conclusion
Status: Global merge technically successful, but GHCID collision issue is blocking further work.
Key Achievement: Unified 4 regional datasets into single global database with 100% GHCID coverage.
Critical Finding: 21.2% of Japan dataset lost to GHCID collisions - Wikidata Q-number enrichment is mandatory before proceeding with geocoding, exports, or public release.
Next Session: Implement Wikidata Q-number enrichment and GHCID regeneration for Japanese institutions.
Session Date: 2025-11-07
Duration: ~90 minutes
Total Institutions: 10,838 (with 2,558 pending recovery)
Schema Version: v0.2.0 (modular LinkML)
Status: ⚠️ Collision resolution required