GHCID Collision Resolution Strategy
Version: 1.0
Date: 2025-11-05
Status: Implemented
Problem Statement
When two heritage institutions share:
- Same geographic location (city)
- Same institution type (e.g., both museums)
- Same name abbreviation
...they will generate identical GHCID identifiers, causing collisions.
Example Collision Scenario
Institution 1: Stedelijk Museum Amsterdam
Institution 2: Science Museum Amsterdam (hypothetical)
Both would generate:
NL-NH-AMS-M-SM
Solution: Wikidata Q-Number Suffix
When a collision is detected, append the institution's Wikidata Q-number to the GHCID.
Format
Base GHCID:
{Country}-{Region}-{City}-{Type}-{Abbreviation}
GHCID with Collision Resolver:
{Country}-{Region}-{City}-{Type}-{Abbreviation}-Q{WikidataID}
Examples
| Institution | Base GHCID | With Q-Number | Notes |
|---|---|---|---|
| Rijksmuseum Amsterdam | NL-NH-AMS-M-RM | N/A | No collision, no Q-number needed |
| Stedelijk Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-Q924335 | Collision detected |
| Science Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-Q123456 | Collision detected |
| Van Gogh Museum | NL-NH-AMS-M-VGM | N/A | No collision |
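The format rule above can be sketched as a small helper. This is only an illustration: the function name `build_ghcid` and its parameters are assumptions, not part of the project's API.

```python
from typing import Optional


def build_ghcid(country: str, region: str, city: str, inst_type: str,
                abbreviation: str, wikidata_qid: Optional[str] = None) -> str:
    """Join the five base components; append -Q<digits> only when a QID is given."""
    base = "-".join([country, region, city, inst_type, abbreviation])
    if wikidata_qid is None:
        return base  # no collision: plain base GHCID
    # Accept "Q924335", "q924335" or "924335" and keep the digits only.
    digits = wikidata_qid.strip().upper()
    if digits.startswith("Q"):
        digits = digits[1:]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"Not a Wikidata Q-number: {wikidata_qid!r}")
    return f"{base}-Q{digits}"


print(build_ghcid("NL", "NH", "AMS", "M", "RM"))
print(build_ghcid("NL", "NH", "AMS", "M", "SM", "Q924335"))
```

The suffix is opt-in, which matches the rule that most institutions never carry a Q-number.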
Temporal Dimension in Collision Resolution
The Critical Distinction: First Batch vs. Historical Addition
Collision resolution behavior differs based on when the collision is detected:
Scenario A: First Batch Collision (Contemporaneous Discovery)
When: Multiple institutions discovered simultaneously during initial GHCID generation (e.g., batch import from CSV).
Rule: ALL colliding institutions receive Q-number suffixes.
Why: No institution has temporal priority; all are being created at the same time. Fair treatment requires all to be disambiguated equally.
Example:
# Discovered on 2025-11-01 from Dutch ISIL registry batch import
Institution 1: Stedelijk Museum Amsterdam
→ ghcid: NL-NH-AMS-M-SM-Q924335
Institution 2: Science Museum Amsterdam
→ ghcid: NL-NH-AMS-M-SM-Q7654321
# Both get Q-numbers because both discovered simultaneously
Scenario B: Historical Addition (Post-Publication Collision)
When: A newly added historical institution collides with an already published GHCID.
Rule: ONLY the new institution receives a Q-number suffix. The existing GHCID remains unchanged.
Why: PID Stability Principle - Published persistent identifiers may already be cited in research papers, integrated into third-party datasets, or embedded in API responses. Changing existing PIDs breaks citations and external references.
Example:
# Published 2025-11-01
Institution 1: Hermitage Museum Amsterdam
→ ghcid: NL-NH-AMS-M-HM # Unchanged forever
# Historical institution added 2025-11-15
Institution 2: Amsterdam Historical Museum (historical records 1926-2001)
→ ghcid: NL-NH-AMS-M-HM-Q17339437 # New institution gets Q-number
# Existing GHCID preserved; only new addition disambiguated
Decision Matrix
| Discovery Context | Existing Institution | New Institution | Resolution Strategy |
|---|---|---|---|
| First batch (both new) | None (being created) | None (being created) | Both get Q-numbers |
| Historical addition | Already published | Being added now | Only new gets Q-number |
| Simultaneous historical additions | Already published | Multiple being added | All new get Q-numbers; existing unchanged |
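The decision matrix reduces to one rule: published GHCIDs never change, and a new institution gets a suffix whenever anything else shares its base GHCID. A minimal sketch, assuming all names passed in share one base GHCID (the function name `suffix_decisions` is hypothetical):

```python
from typing import Dict, List


def suffix_decisions(existing_published: List[str],
                     new_arrivals: List[str]) -> Dict[str, bool]:
    """Map institution name -> True if it should receive a Q-number suffix."""
    # Published identifiers are never modified (PID stability).
    decisions = {name: False for name in existing_published}
    total = len(existing_published) + len(new_arrivals)
    for name in new_arrivals:
        # A new institution needs a suffix as soon as it is not alone.
        decisions[name] = total > 1
    return decisions


# First batch: both are new, both get a suffix.
print(suffix_decisions([], ["Stedelijk Museum Amsterdam", "Science Museum Amsterdam"]))
# Historical addition: only the newcomer gets a suffix.
print(suffix_decisions(["Hermitage Museum Amsterdam"], ["Amsterdam Historical Museum"]))
```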
Timeline Example: Demonstrating the Temporal Principle
2025-11-01 (First Batch Import)
├─ Stedelijk Museum Amsterdam added
│  └─ ghcid: NL-NH-AMS-M-SM
└─ Science Museum Amsterdam discovered (collision!)
   └─ BOTH institutions updated:
      ├─ Stedelijk: NL-NH-AMS-M-SM-Q924335
      └─ Science: NL-NH-AMS-M-SM-Q7654321
2025-11-15 (Historical Research Addition)
├─ Hermitage Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-HM (published, immutable)
└─ Amsterdam Historical Museum added (historical institution, 1926-2001)
   └─ Collision detected!
      ├─ Hermitage GHCID: UNCHANGED (NL-NH-AMS-M-HM)
      └─ Historical Museum: NL-NH-AMS-M-HM-Q17339437 (gets Q-number)
2025-12-01 (Another Historical Addition)
├─ Maritime Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-MM (published, immutable)
└─ Two historical naval museums discovered in archive:
   ├─ Dutch Navy Museum (1906-1955) → collides!
   ├─ Amsterdam Naval Archive (1820-1901) → collides!
   ├─ BOTH new institutions get Q-numbers:
   │  ├─ Dutch Navy: NL-NH-AMS-M-MM-Q23456789
   │  └─ Naval Archive: NL-NH-AMS-M-MM-Q98765432
   └─ Existing Maritime Museum: UNCHANGED (NL-NH-AMS-M-MM)
Implementation Guidance
Collision Detection with Temporal Context
def resolve_collision(new_institution, existing_ghcids_registry):
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_institution: Institution being added
        existing_ghcids_registry: Dict mapping GHCID → {name, publication_date, institution_data}

    Returns:
        Resolution strategy and updated GHCIDs
    """
    new_ghcid_base = generate_base_ghcid(new_institution)

    # No entry under this base GHCID: nothing to resolve
    if new_ghcid_base not in existing_ghcids_registry:
        return {
            'strategy': 'NO_COLLISION',
            'new_ghcid': new_ghcid_base,
            'reason': 'Base GHCID is not in use'
        }

    existing_entry = existing_ghcids_registry[new_ghcid_base]

    # Historical addition case: the existing GHCID is already published
    if existing_entry['publication_date'] is not None:
        print(f"Collision with published GHCID {new_ghcid_base}")
        print(f" → Existing: {existing_entry['name']} (published {existing_entry['publication_date']})")
        print(f" → New: {new_institution.name} (being added now)")
        print(" → Strategy: Only new institution gets Q-number")
        # Existing GHCID remains unchanged
        existing_ghcid = new_ghcid_base
        # New institution gets Q-number
        new_ghcid = f"{new_ghcid_base}-Q{new_institution.wikidata_qid}"
        return {
            'strategy': 'HISTORICAL_ADDITION',
            'existing_ghcid': existing_ghcid,  # Unchanged
            'new_ghcid': new_ghcid,            # With Q-number
            'reason': "Historical addition collision: preserve existing published PID"
        }

    # First batch collision case (existing entry not yet published).
    # This should be handled during batch processing.
    return {
        'strategy': 'FIRST_BATCH',
        'reason': 'All colliding institutions in batch receive Q-numbers'
    }
def resolve_batch_collisions(new_institutions_batch):
    """
    Resolve collisions within a batch of new institutions.
    When multiple institutions in the same batch collide, ALL get Q-numbers.
    """
    ghcid_map = {}  # base_ghcid → list of institutions

    # Group by base GHCID
    for inst in new_institutions_batch:
        base_ghcid = generate_base_ghcid(inst)
        if base_ghcid not in ghcid_map:
            ghcid_map[base_ghcid] = []
        ghcid_map[base_ghcid].append(inst)

    # Resolve collisions
    for base_ghcid, institutions in ghcid_map.items():
        if len(institutions) > 1:
            print(f"First batch collision detected: {base_ghcid}")
            print(f" → {len(institutions)} institutions collide")
            print(f" → Strategy: All {len(institutions)} get Q-numbers")
            # All institutions in collision get Q-numbers
            for inst in institutions:
                inst.ghcid = f"{base_ghcid}-Q{inst.wikidata_qid}"
                inst.provenance.notes += (
                    f" | First batch collision: {len(institutions)} institutions "
                    f"share base GHCID {base_ghcid}"
                )
Edge Cases
1. Multiple Historical Additions Simultaneously
Scenario: Two historical institutions discovered at the same time, both colliding with existing GHCID.
Resolution: Both new institutions get Q-numbers; existing unchanged.
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-MM
# Both added 2025-12-01
new_inst_1.ghcid = "NL-NH-AMS-M-MM-Q23456789"
new_inst_2.ghcid = "NL-NH-AMS-M-MM-Q98765432"
2. Historical Institution Without Wikidata ID
Scenario: New historical institution collides but has no Wikidata entry.
Fallback Strategy:
- VIAF ID: NL-NH-AMS-M-HM-V12345678
- ISIL code: NL-NH-AMS-M-HM-I-AsdHM
- Sequential: NL-NH-AMS-M-HM-002 (increment from existing)
def get_collision_suffix(institution, base_ghcid):
    """Get collision resolution suffix in priority order."""
    if institution.wikidata_qid:
        return f"Q{institution.wikidata_qid}"
    elif institution.viaf_id:
        return f"V{institution.viaf_id}"
    elif institution.isil_code:
        # Extract suffix from ISIL (e.g., NL-AsdHM → I-AsdHM)
        isil_suffix = institution.isil_code.split('-', 1)[1]
        return f"I-{isil_suffix}"
    else:
        # Last resort: sequential numbering within this base GHCID
        return f"{get_next_sequential_number(base_ghcid):03d}"
3. Retroactive Discovery of First Batch Collision
Scenario: Two institutions were created in first batch, but collision wasn't detected until later.
Resolution: Treat as first batch collision (both get Q-numbers), even though discovered late.
Justification: Intent was to create both simultaneously; detection timing doesn't change temporal relationship.
# If both have the same extraction_date in provenance metadata
if inst1.provenance.extraction_date == inst2.provenance.extraction_date:
    strategy = 'FIRST_BATCH'  # Both get Q-numbers
else:
    strategy = 'HISTORICAL_ADDITION'  # Only later one gets Q-number
Provenance Tracking for Collision Events
Record collision resolution in provenance metadata:
# First batch collision
provenance:
  extraction_date: "2025-11-01T10:00:00Z"
  collision_resolution:
    strategy: FIRST_BATCH
    collision_group: [Q924335, Q7654321]
    resolved_date: "2025-11-01T10:00:00Z"
    reason: "Contemporaneous discovery: both institutions added in ISIL registry batch import"

# Historical addition collision
provenance:
  extraction_date: "2025-11-15T14:30:00Z"
  collision_resolution:
    strategy: HISTORICAL_ADDITION
    collides_with: NL-NH-AMS-M-HM
    existing_publication_date: "2025-11-01T10:00:00Z"
    resolved_date: "2025-11-15T14:30:00Z"
    reason: "Historical addition: existing GHCID preserved per PID stability principle"
Implementation Rules
1. Collision Detection
A collision occurs when a newly generated GHCID already exists in the registry:
# Pseudocode with temporal awareness
existing_ghcids = database.get_all_ghcids()  # Include publication dates
new_ghcid = generate_ghcid(institution)
if new_ghcid in existing_ghcids:
    existing_record = existing_ghcids[new_ghcid]
    # Check temporal context
    if existing_record['publication_date'] is not None:
        # Historical addition: only new institution gets Q-number
        new_ghcid_resolved = f"{new_ghcid}-Q{institution.wikidata_qid}"
        existing_ghcid_resolved = new_ghcid  # Unchanged
    else:
        # First batch: both being created simultaneously
        # Fetch Wikidata Q-numbers for BOTH institutions
        # Regenerate both GHCIDs with Q-number suffixes
        pass
2. When to Add Q-Number
DO add Q-number suffix:
- ✅ First batch: When collision detected during simultaneous creation → ALL colliding institutions
- ✅ Historical addition: When new institution collides with published GHCID → ONLY new institution
- ✅ When generating GHCID for institution with known collision
- ✅ When institution is added to collision registry
DO NOT add Q-number suffix:
- ❌ When no collision exists (most institutions)
- ❌ "Just in case" for future collisions
- ❌ For institutions without Wikidata entries
- ❌ NEVER for existing published GHCIDs when historical collision occurs (preserve PID stability)
3. Q-Number Normalization
Input formats accepted:
- Q924335 (preferred)
- 924335 (digits only)
- q924335 (lowercase, will be normalized)
Normalized output:
- Q-prefix is stripped during processing
- Stored as: "924335"
- Displayed as: "Q924335"
Code example:
# In GHCIDComponents.__post_init__()
if self.wikidata_qid:
    # Strip Q prefix, keep only digits
    self.wikidata_qid = self.wikidata_qid.upper().replace("Q", "")

# In GHCIDComponents.to_string()
if self.wikidata_qid:
    return f"{base}-Q{self.wikidata_qid}"  # Re-add Q prefix for display
4. Persistent Numeric ID
The SHA256 hash is computed from the full GHCID string including Q-number:
# Without Q-number
ghcid = "NL-NH-AMS-M-RM"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 12345678901234567890
# With Q-number (different hash!)
ghcid = "NL-NH-AMS-M-SM-Q924335"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 9876543210987654321
Important: Two institutions with identical base GHCIDs still receive different numeric IDs, because the Q-number suffix is part of the hashed string.
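The hashing step might look like the sketch below. It assumes that `[:8]` means the first eight bytes of the SHA-256 digest, read as an unsigned big-endian integer, and that the GHCID is encoded as UTF-8; none of these details are confirmed by this document, and the function name `ghcid_numeric` is borrowed from the field name for illustration only.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """Derive a 64-bit unsigned integer from the full GHCID string."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Assumption: first 8 bytes of the digest, big-endian, unsigned.
    return int.from_bytes(digest[:8], byteorder="big")


base_id = ghcid_numeric("NL-NH-AMS-M-SM")
suffixed_id = ghcid_numeric("NL-NH-AMS-M-SM-Q924335")
print(base_id != suffixed_id)  # the suffix changes the hashed string
```

Under these assumptions every numeric ID fits in 64 bits, so it can be stored in an unsigned 64-bit integer column.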
Collision Registry
Maintain a collision registry to track known conflicts with temporal metadata:
{
"collisions": [
{
"base_ghcid": "NL-NH-AMS-M-SM",
"collision_type": "FIRST_BATCH",
"institutions": [
{
"ghcid": "NL-NH-AMS-M-SM-Q924335",
"name": "Stedelijk Museum Amsterdam",
"wikidata_qid": "Q924335",
"publication_date": "2025-11-01T10:00:00Z"
},
{
"ghcid": "NL-NH-AMS-M-SM-Q123456",
"name": "Science Museum Amsterdam",
"wikidata_qid": "Q123456",
"publication_date": "2025-11-01T10:00:00Z"
}
],
"detected_date": "2025-11-01T10:00:00Z",
"resolution_strategy": "Both institutions receive Q-numbers (contemporaneous discovery)"
},
{
"base_ghcid": "NL-NH-AMS-M-HM",
"collision_type": "HISTORICAL_ADDITION",
"institutions": [
{
"ghcid": "NL-NH-AMS-M-HM",
"name": "Hermitage Museum Amsterdam",
"wikidata_qid": "Q1542668",
"publication_date": "2025-11-01T10:00:00Z",
"note": "Original GHCID preserved (published identifier)"
},
{
"ghcid": "NL-NH-AMS-M-HM-Q17339437",
"name": "Amsterdam Historical Museum",
"wikidata_qid": "Q17339437",
"publication_date": "2025-11-15T14:30:00Z",
"note": "Historical addition: only new institution receives Q-number"
}
],
"detected_date": "2025-11-15T14:30:00Z",
"resolution_strategy": "Preserve existing published GHCID; only new institution gets Q-number"
}
]
}
Registry Usage
- Before generating new GHCID:
  - Check if base GHCID exists in collision registry
  - Check publication date of existing record
  - If published: new institution gets Q-number (historical addition)
  - If unpublished: both get Q-numbers (first batch)
- When collision detected:
  - Record temporal context (first batch vs. historical)
  - Apply appropriate resolution strategy
  - Add entries to collision registry with publication dates
  - Update ghcid_history with change reason including temporal context
- Periodic audit:
  - Scan all GHCIDs for duplicates
  - Check publication dates to determine resolution strategy
  - Resolve retroactively if found
  - Update collision registry with temporal metadata
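The duplicate scan in the periodic audit could be sketched as follows. The input shape (pairs of GHCID and institution name) and the function name `find_duplicate_ghcids` are assumptions made for the example; a real audit would read from the project's own storage.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ghcids(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (ghcid, institution_name) pairs and keep GHCIDs used more than once."""
    by_ghcid: Dict[str, List[str]] = defaultdict(list)
    for ghcid, name in records:
        by_ghcid[ghcid].append(name)
    return {ghcid: names for ghcid, names in by_ghcid.items() if len(names) > 1}


records = [
    ("NL-NH-AMS-M-RM", "Rijksmuseum Amsterdam"),
    ("NL-NH-AMS-M-SM", "Stedelijk Museum Amsterdam"),
    ("NL-NH-AMS-M-SM", "Science Museum Amsterdam"),
]
print(find_duplicate_ghcids(records))
```

Each duplicate group found this way would then go through the publication-date check to pick the resolution strategy.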
What If Institution Lacks Wikidata Entry?
Priority Order for Collision Resolution
- Wikidata Q-number (preferred)
- VIAF ID (backup)
- ISIL code suffix (Dutch institutions)
- Sequential number (last resort)
Examples
# Preferred: Wikidata
NL-NH-AMS-M-SM-Q924335
# Backup: VIAF
NL-NH-AMS-M-SM-V12345678
# Backup: ISIL
NL-NH-AMS-M-SM-I-AsdSM
# Last resort: Sequential
NL-NH-AMS-M-SM-001
NL-NH-AMS-M-SM-002
Implementation note: Start with Wikidata-only. Expand to other identifiers if needed based on real-world collision frequency.
Validation Rules
GHCID Pattern Regex
Without Q-number:
^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}$
With Q-number (updated):
^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
Breakdown:
- [A-Z]{2} - Country code (2 letters)
- [A-Z0-9]{1,3} - Region code (1-3 chars)
- [A-Z]{3} - City LOCODE (3 letters)
- [A-Z] - Type code (1 letter)
- [A-Z0-9]{1,10} - Abbreviation (1-10 chars)
- (-Q[0-9]+)? - Optional Wikidata Q-number suffix
Validation Tests
# Valid GHCIDs
assert validate("NL-NH-AMS-M-RM") # Without Q-number
assert validate("NL-NH-AMS-M-SM-Q924335") # With Q-number
assert validate("US-NY-NYC-M-MOMA") # International
assert validate("NL-NH-AMS-M-SM-Q1") # Short Q-number
# Invalid GHCIDs
assert not validate("NL-NH-AMS-M-SM-924335") # Missing Q prefix
assert not validate("NL-NH-AMS-M-SM-QAB123") # Q-number must be numeric
assert not validate("NL-NH-AMS-M-SM-Q") # Q-number empty
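The tests above call a `validate` function that is not shown. A minimal sketch built directly on the updated regex could look like this; the constant name `GHCID_PATTERN` is an assumption, and the project's actual validator may differ.

```python
import re

# Pattern copied from the "With Q-number (updated)" rule above.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
)


def validate(ghcid: str) -> bool:
    """Return True if the string is a well-formed GHCID, with or without Q-suffix."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return GHCID_PATTERN.fullmatch(ghcid) is not None


print(validate("NL-NH-AMS-M-SM-Q924335"))
print(validate("NL-NH-AMS-M-SM-924335"))
```

Note that this pattern accepts only the Wikidata suffix; the VIAF, ISIL, and sequential fallbacks would need their own alternatives in the regex if they are adopted.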
Migration Strategy
For Existing Institutions Without Q-Number
When collision is detected on existing institutions without Q-numbers, temporal context determines strategy:
First Batch Collision (Retroactive Detection)
If both institutions were created simultaneously but collision not detected until later:
- Verify creation dates (check provenance.extraction_date)
- If same date: Treat as first batch → both get Q-numbers
- Fetch Wikidata Q-numbers for both institutions
- Update both GHCIDs with Q-number suffix
- Update both ghcid_numeric (hash changes!)
- Add history entries for both:
# Institution 1
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-SM",
    ghcid_new="NL-NH-AMS-M-SM-Q924335",
    valid_from="2025-11-05T12:00:00Z",
    valid_to=None,
    reason="First batch collision (retroactive): Added Wikidata Q-number to disambiguate from Science Museum Amsterdam (both created 2025-11-01)"
)
# Institution 2
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-SM",
    ghcid_new="NL-NH-AMS-M-SM-Q123456",
    valid_from="2025-11-05T12:00:00Z",
    valid_to=None,
    reason="First batch collision (retroactive): Added Wikidata Q-number to disambiguate from Stedelijk Museum Amsterdam (both created 2025-11-01)"
)
Historical Addition Collision
If new institution collides with already published existing GHCID:
- Verify publication date of existing record
- Existing GHCID: NO CHANGE (preserve published PID)
- New institution: Gets Q-number suffix
- Update only new ghcid_numeric
- Add history entry for new institution only:
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-HM",  # What it would have been
    ghcid_new="NL-NH-AMS-M-HM-Q17339437",  # What it becomes
    valid_from="2025-11-15T14:30:00Z",
    valid_to=None,
    reason="Historical addition collision: Base GHCID NL-NH-AMS-M-HM already published for Hermitage Museum Amsterdam (2025-11-01). Added Q-number to preserve existing PID stability."
)
- No history entry for existing institution (GHCID unchanged)
Decision Flow
def determine_collision_strategy(existing_inst, new_inst):
    """Determine whether to treat as first batch or historical addition."""
    # Check if existing record has publication date
    if existing_inst.provenance.publication_date is not None:
        # Published identifier → Historical addition
        return "HISTORICAL_ADDITION"

    # Check if both created on same date
    if (existing_inst.provenance.extraction_date ==
            new_inst.provenance.extraction_date):
        # Same creation date → First batch (retroactive)
        return "FIRST_BATCH_RETROACTIVE"

    # Different timestamps less than one day apart, no publication
    # → First batch (simultaneous processing).
    # Assumes extraction_date values are datetime objects.
    if abs((existing_inst.provenance.extraction_date -
            new_inst.provenance.extraction_date).days) < 1:
        return "FIRST_BATCH"

    # Default: Historical addition (new is significantly later)
    return "HISTORICAL_ADDITION"
Backward Compatibility
Q: What happens to old numeric IDs when Q-number is added?
A: The numeric ID changes because it's a hash of the full GHCID string. This is intentional:
- Old numeric ID: SHA256("NL-NH-AMS-M-SM")[:8]
- New numeric ID: SHA256("NL-NH-AMS-M-SM-Q924335")[:8]
Migration plan depends on temporal context:
First Batch Collision (Both Updated)
- Keep ghcid_original unchanged for both (preserves old GHCID)
- Update ghcid_current with Q-number for both
- Generate new ghcid_numeric from updated GHCID for both
- Add comprehensive history entries explaining change
- Maintain mapping table: old_numeric_id → new_numeric_id
Historical Addition (Only New Updated)
- Existing institution: NO CHANGES
  - ghcid_original: unchanged
  - ghcid_current: unchanged
  - ghcid_numeric: unchanged
  - No history entry added
- New institution: Gets Q-number from start
  - ghcid_original: NL-NH-AMS-M-HM-Q17339437 (includes Q-number)
  - ghcid_current: NL-NH-AMS-M-HM-Q17339437
  - ghcid_numeric: Hash of full GHCID with Q-number
  - History entry documents collision with existing published PID
Testing Strategy
Unit Tests
def test_ghcid_without_collision():
    """Most institutions should NOT have Q-number"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="RM"
    )
    assert components.to_string() == "NL-NH-AMS-M-RM"
    assert components.wikidata_qid is None


def test_first_batch_collision():
    """First batch: BOTH institutions get Q-numbers"""
    inst1 = create_institution(
        name="Stedelijk Museum Amsterdam",
        wikidata_qid="Q924335",
        extraction_date="2025-11-01T10:00:00Z"
    )
    inst2 = create_institution(
        name="Science Museum Amsterdam",
        wikidata_qid="Q123456",
        extraction_date="2025-11-01T10:00:00Z"
    )
    # Process batch
    resolve_batch_collisions([inst1, inst2])
    # Both should have Q-numbers
    assert inst1.ghcid == "NL-NH-AMS-M-SM-Q924335"
    assert inst2.ghcid == "NL-NH-AMS-M-SM-Q123456"
def test_historical_addition_collision():
    """Historical addition: Only NEW institution gets Q-number"""
    # Existing published institution
    existing = create_institution(
        name="Hermitage Museum Amsterdam",
        wikidata_qid="Q1542668",
        extraction_date="2025-11-01T10:00:00Z",
        publication_date="2025-11-01T10:00:00Z"
    )
    existing.ghcid = "NL-NH-AMS-M-HM"
    # New historical institution
    new = create_institution(
        name="Amsterdam Historical Museum",
        wikidata_qid="Q17339437",
        extraction_date="2025-11-15T14:30:00Z"
    )
    # Resolve collision; registry entries are dicts keyed by GHCID
    registry = {
        existing.ghcid: {
            'name': "Hermitage Museum Amsterdam",
            'publication_date': "2025-11-01T10:00:00Z",
        }
    }
    result = resolve_collision(new, registry)
    # Existing GHCID unchanged
    assert existing.ghcid == "NL-NH-AMS-M-HM"
    assert result['existing_ghcid'] == "NL-NH-AMS-M-HM"
    # New institution gets Q-number (resolve_collision returns it rather than setting it)
    assert result['new_ghcid'] == "NL-NH-AMS-M-HM-Q17339437"
def test_qid_normalization():
    """Q-prefix should be stripped and re-added"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="SM",
        wikidata_qid="Q924335"
    )
    assert components.wikidata_qid == "924335"  # Stored without Q
    assert "Q924335" in components.to_string()  # Displayed with Q
def test_collision_strategy_determination():
    """Test temporal context determines resolution strategy"""
    # Same date, existing unpublished → First batch (retroactive)
    strategy = determine_collision_strategy(
        Institution(extraction_date="2025-11-01T10:00:00Z"),
        Institution(extraction_date="2025-11-01T10:00:00Z")
    )
    assert strategy == "FIRST_BATCH_RETROACTIVE"
    # Existing published → Historical addition
    strategy = determine_collision_strategy(
        Institution(
            extraction_date="2025-11-01T10:00:00Z",
            publication_date="2025-11-01T10:00:00Z"
        ),
        Institution(extraction_date="2025-11-15T14:30:00Z")
    )
    assert strategy == "HISTORICAL_ADDITION"
def test_simultaneous_historical_additions():
    """Multiple historical additions: all new get Q-numbers"""
    existing = Institution(
        name="Maritime Museum Amsterdam",
        ghcid="NL-NH-AMS-M-MM",
        publication_date="2025-11-01T10:00:00Z"
    )
    new1 = Institution(
        name="Dutch Navy Museum",
        wikidata_qid="Q23456789",
        extraction_date="2025-12-01T09:00:00Z"
    )
    new2 = Institution(
        name="Amsterdam Naval Archive",
        wikidata_qid="Q98765432",
        extraction_date="2025-12-01T09:00:00Z"
    )
    # resolve_batch_collisions takes only the new batch; the published record is never passed in
    resolve_batch_collisions([new1, new2])
    # Existing unchanged
    assert existing.ghcid == "NL-NH-AMS-M-MM"
    # Both new get Q-numbers
    assert new1.ghcid == "NL-NH-AMS-M-MM-Q23456789"
    assert new2.ghcid == "NL-NH-AMS-M-MM-Q98765432"
Integration Tests
def test_real_dutch_museums_no_collisions():
    """Verify no actual collisions in Dutch ISIL registry"""
    parser = ISILRegistryParser()
    records = parser.parse_file("data/ISIL-codes_2025-08-01.csv")
    ghcids = {}
    collisions = []
    for record in records:
        ghcid_base = generate_base_ghcid(record)
        if ghcid_base in ghcids:
            collisions.append((ghcid_base, ghcids[ghcid_base], record))
        ghcids[ghcid_base] = record
    # Report any collisions found
    print(f"Found {len(collisions)} collisions")
    for base, record1, record2 in collisions:
        print(f" {base}: {record1.instelling} vs {record2.instelling}")
Future Enhancements
- Collision probability calculator
  - Analyze existing institutions
  - Predict collision likelihood
  - Suggest alternative abbreviations
- Auto-fetch Wikidata Q-numbers
  - Use Wikidata API/SPARQL
  - Match by name + location
  - Confidence scoring
- Collision warning system
  - Alert when generating GHCID similar to existing
  - Suggest checking for duplicates
  - Proactive collision prevention
- Alternative disambiguation
  - Use founding year: NL-NH-AMS-M-SM-Y1895
  - Use street address: NL-NH-AMS-M-SM-PAULUS
  - Use parent org: NL-NH-AMS-M-SM-AMS (City of Amsterdam)
References
- GHCID Spec: docs/plan/global_glam/06-global-identifier-system.md
- Wikidata: https://www.wikidata.org
- Implementation: src/glam_extractor/identifiers/ghcid.py
- Schema: schemas/heritage_custodian.yaml (ghcid_current, ghcid_original slots)
Last Updated: 2025-11-05
Authors: GLAM Data Extraction Project Team