GHCID Collision Resolution Strategy
Version: 2.0
Date: 2025-11-30
Status: Implemented
Problem Statement
When two heritage institutions share:
- Same geographic location (city)
- Same institution type (e.g., both museums)
- Same name abbreviation
...they generate identical GHCIDs, which is a collision.
Example Collision Scenario
Institution 1: Stedelijk Museum Amsterdam
Institution 2: Science Museum Amsterdam (hypothetical)
Both would generate:
NL-NH-AMS-M-SM
Solution: Native Language Name Suffix
When a collision is detected, append the institution's full legal name in its native language, converted to snake_case, to the GHCID.
Format
Base GHCID:
{Country}-{Region}-{City}-{Type}-{Abbreviation}
GHCID with Collision Resolver:
{Country}-{Region}-{City}-{Type}-{Abbreviation}-{native_name_in_snake_case}
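The two formats above can be sketched as a single helper. This is a minimal illustration, not the project's implementation: the function name `format_ghcid` and its parameter names are assumptions made for this example.

```python
from typing import Optional


def format_ghcid(country: str, region: str, city: str, inst_type: str,
                 abbreviation: str, name_suffix: Optional[str] = None) -> str:
    """Join GHCID components with hyphens; append the name suffix only if given.

    Hypothetical helper for illustration; the real component handling may differ.
    """
    base = "-".join([country, region, city, inst_type, abbreviation])
    # The suffix is optional: most institutions never need one.
    return f"{base}-{name_suffix}" if name_suffix else base


print(format_ghcid("NL", "NH", "AMS", "M", "SM"))
print(format_ghcid("NL", "NH", "AMS", "M", "SM", "stedelijk_museum_amsterdam"))
```

Keeping the suffix as an optional final argument mirrors the rule that the base GHCID is the default and the suffix is added only on collision.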
Name Suffix Generation
Converting institution names to snake_case suffixes:
import re
import unicodedata


def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Convert to lowercase
    lowercase = ascii_name.lower()
    # Remove apostrophes, commas, and other punctuation
    no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase)
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    return final
Name suffix rules:
- Use the institution's full official name in its native language
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)
- Remove all diacritics (é → e, ö → o, ñ → n)
- Remove punctuation (apostrophes, commas, periods)
- Replace spaces with underscores
- All lowercase
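One gap worth noting: NFD normalization only strips combining marks, so Latin letters without a canonical decomposition (for example ß, ø, æ, ł) pass through the diacritics step unchanged and are then dropped by the final non-alphanumeric cleanup. A possible pre-processing step is sketched below. The function name `fold_special_letters` and the specific replacements are assumptions for illustration, since the rules above do not define them, and non-Latin scripts would still need a separate transliteration step.

```python
# Hypothetical mapping: these replacements are illustrative assumptions,
# not rules taken from the GHCID specification.
SPECIAL_LETTERS = str.maketrans({
    "ß": "ss",
    "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
})


def fold_special_letters(name: str) -> str:
    """Replace Latin letters that NFD cannot decompose with ASCII equivalents."""
    return name.translate(SPECIAL_LETTERS)


print(fold_special_letters("Københavns Museum"))
print(fold_special_letters("Straße"))
```

Applied before the suffix generation, this would keep such letters in the suffix instead of silently deleting them.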
Examples
| Institution | Base GHCID | With Name Suffix | Notes |
|---|---|---|---|
| Rijksmuseum Amsterdam | NL-NH-AMS-M-RM | N/A | No collision, no suffix needed |
| Stedelijk Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-stedelijk_museum_amsterdam | Collision detected |
| Science Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-science_museum_amsterdam | Collision detected |
| Van Gogh Museum | NL-NH-AMS-M-VGM | N/A | No collision |
Temporal Dimension in Collision Resolution
The Critical Distinction: First Batch vs. Historical Addition
Collision resolution behavior differs based on when the collision is detected:
Scenario A: First Batch Collision (Contemporaneous Discovery)
When: Multiple institutions discovered simultaneously during initial GHCID generation (e.g., batch import from CSV).
Rule: ALL colliding institutions receive name suffixes.
Why: No institution has temporal priority; all are being created at the same time. Fair treatment requires all to be disambiguated equally.
Example:
# Discovered on 2025-11-01 from Dutch ISIL registry batch import
Institution 1: Stedelijk Museum Amsterdam
→ ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
Institution 2: Science Museum Amsterdam
→ ghcid: NL-NH-AMS-M-SM-science_museum_amsterdam
# Both get name suffixes because both discovered simultaneously
Scenario B: Historical Addition (Post-Publication Collision)
When: A newly added historical institution collides with an already published GHCID.
Rule: ONLY the new institution receives a name suffix. The existing GHCID remains unchanged.
Why: PID Stability Principle - Published persistent identifiers may already be cited in research papers, integrated into third-party datasets, or embedded in API responses. Changing existing PIDs breaks citations and external references.
Example:
# Published 2025-11-01
Institution 1: Hermitage Museum Amsterdam
→ ghcid: NL-NH-AMS-M-HM # Unchanged forever
# Historical institution added 2025-11-15
Institution 2: Amsterdam Historical Museum (historical records 1926-2001)
→ ghcid: NL-NH-AMS-M-HM-amsterdam_historical_museum # New institution gets name suffix
# Existing GHCID preserved; only new addition disambiguated
Decision Matrix
| Discovery Context | Existing Institution | New Institution | Resolution Strategy |
|---|---|---|---|
| First batch (both new) | None (being created) | None (being created) | Both get name suffixes |
| Historical addition | Already published | Being added now | Only new gets name suffix |
| Simultaneous historical additions | Already published | Multiple being added | All new get name suffixes; existing unchanged |
Timeline Example: Demonstrating the Temporal Principle
2025-11-01 (First Batch Import)
├─ Stedelijk Museum Amsterdam added
│ └─ ghcid: NL-NH-AMS-M-SM
├─ Science Museum Amsterdam discovered (collision!)
│ └─ BOTH institutions updated:
│ ├─ Stedelijk: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
│ └─ Science: NL-NH-AMS-M-SM-science_museum_amsterdam
2025-11-15 (Historical Research Addition)
├─ Hermitage Museum Amsterdam already exists
│ └─ ghcid: NL-NH-AMS-M-HM (published, immutable)
├─ Amsterdam Historical Museum added (historical institution, 1926-2001)
│ └─ Collision detected!
│ ├─ Hermitage GHCID: UNCHANGED (NL-NH-AMS-M-HM)
│ └─ Historical Museum: NL-NH-AMS-M-HM-amsterdam_historical_museum (gets name suffix)
2025-12-01 (Another Historical Addition)
├─ Maritime Museum Amsterdam already exists
│ └─ ghcid: NL-NH-AMS-M-MM (published, immutable)
├─ Two historical naval museums discovered in archive:
│ ├─ Dutch Navy Museum (1906-1955) → collides!
│ └─ Amsterdam Naval Archive (1820-1901) → collides!
│ └─ BOTH new institutions get name suffixes:
│ ├─ Dutch Navy: NL-NH-AMS-M-MM-dutch_navy_museum
│ └─ Naval Archive: NL-NH-AMS-M-MM-amsterdam_naval_archive
│ ├─ Existing Maritime Museum: UNCHANGED (NL-NH-AMS-M-MM)
Implementation Guidance
Collision Detection with Temporal Context
def resolve_collision(new_institution, existing_ghcids_registry):
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_institution: Institution being added
        existing_ghcids_registry: Dict mapping GHCID → {name, publication_date, institution_data}

    Returns:
        Dict with the resolution strategy and the resulting GHCIDs
    """
    new_ghcid_base = generate_base_ghcid(new_institution)

    # No collision: the base GHCID is free, so no suffix is needed
    if new_ghcid_base not in existing_ghcids_registry:
        return {'strategy': 'NO_COLLISION', 'new_ghcid': new_ghcid_base}

    existing_entry = existing_ghcids_registry[new_ghcid_base]

    # Historical addition case: the existing GHCID is already published
    if existing_entry['publication_date'] is not None:
        print(f"Collision with published GHCID {new_ghcid_base}")
        print(f" → Existing: {existing_entry['name']} (published {existing_entry['publication_date']})")
        print(f" → New: {new_institution.name} (being added now)")
        print(" → Strategy: Only new institution gets name suffix")

        # Existing GHCID remains unchanged
        existing_ghcid = new_ghcid_base

        # New institution gets name suffix
        name_suffix = generate_name_suffix(new_institution.name)
        new_ghcid = f"{new_ghcid_base}-{name_suffix}"

        return {
            'strategy': 'HISTORICAL_ADDITION',
            'existing_ghcid': existing_ghcid,  # Unchanged
            'new_ghcid': new_ghcid,  # With name suffix
            'reason': "Historical addition collision: preserve existing published PID"
        }

    # First batch collision case (both being created)
    # This should be handled during batch processing
    return {
        'strategy': 'FIRST_BATCH',
        'reason': 'All colliding institutions in batch receive name suffixes'
    }
def resolve_batch_collisions(new_institutions_batch):
    """
    Resolve collisions within a batch of new institutions.

    When multiple institutions in the same batch collide, ALL get name suffixes.
    """
    ghcid_map = {}  # base_ghcid → list of institutions

    # Group by base GHCID
    for inst in new_institutions_batch:
        base_ghcid = generate_base_ghcid(inst)
        if base_ghcid not in ghcid_map:
            ghcid_map[base_ghcid] = []
        ghcid_map[base_ghcid].append(inst)

    # Resolve collisions
    for base_ghcid, institutions in ghcid_map.items():
        if len(institutions) > 1:
            print(f"First batch collision detected: {base_ghcid}")
            print(f" → {len(institutions)} institutions collide")
            print(f" → Strategy: All {len(institutions)} get name suffixes")

            # All institutions in collision get name suffixes
            for inst in institutions:
                name_suffix = generate_name_suffix(inst.name)
                inst.ghcid = f"{base_ghcid}-{name_suffix}"
                inst.provenance.notes += (
                    f" | First batch collision: {len(institutions)} institutions "
                    f"share base GHCID {base_ghcid}"
                )
Edge Cases
1. Multiple Historical Additions Simultaneously
Scenario: Two historical institutions discovered at the same time, both colliding with existing GHCID.
Resolution: Both new institutions get name suffixes; existing unchanged.
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-MM
# Both added 2025-12-01
new_inst_1.ghcid = "NL-NH-AMS-M-MM-dutch_navy_museum"
new_inst_2.ghcid = "NL-NH-AMS-M-MM-amsterdam_naval_archive"
2. Historical Institution Without Unique Name
Scenario: New historical institution collides and has a generic name that may not be unique.
Fallback Strategy (if name suffix still causes collision):
- Name + founding year: NL-NH-AMS-M-HM-historical_museum_1926
- Name + city qualifier: NL-NH-AMS-M-HM-historical_museum_amsterdam
- Sequential: NL-NH-AMS-M-HM-002 (increment from existing)
def get_collision_suffix(institution, existing_suffixes=None):
    """Get collision resolution suffix, handling duplicates."""
    # Primary: Native language name in snake_case
    name_suffix = generate_name_suffix(institution.name)

    # Check if this suffix already exists
    if existing_suffixes and name_suffix in existing_suffixes:
        # Add founding year if available
        if institution.founding_year:
            name_suffix = f"{name_suffix}_{institution.founding_year}"

        # Still collision? Add sequential number (zero-padded, e.g. _002)
        if name_suffix in existing_suffixes:
            counter = 2
            base_suffix = name_suffix
            while name_suffix in existing_suffixes:
                name_suffix = f"{base_suffix}_{counter:03d}"
                counter += 1

    return name_suffix
3. Retroactive Discovery of First Batch Collision
Scenario: Two institutions were created in first batch, but collision wasn't detected until later.
Resolution: Treat as first batch collision (both get name suffixes), even though discovered late.
Justification: Intent was to create both simultaneously; detection timing doesn't change temporal relationship.
# If both have the same extraction_date in provenance metadata
if inst1.provenance.extraction_date == inst2.provenance.extraction_date:
    strategy = 'FIRST_BATCH'  # Both get name suffixes
else:
    strategy = 'HISTORICAL_ADDITION'  # Only later one gets name suffix
Provenance Tracking for Collision Events
Record collision resolution in provenance metadata:
# First batch collision
provenance:
  extraction_date: "2025-11-01T10:00:00Z"
  collision_resolution:
    strategy: FIRST_BATCH
    collision_group: [stedelijk_museum_amsterdam, science_museum_amsterdam]
    resolved_date: "2025-11-01T10:00:00Z"
    reason: "Contemporaneous discovery: both institutions added in ISIL registry batch import"

# Historical addition collision
provenance:
  extraction_date: "2025-11-15T14:30:00Z"
  collision_resolution:
    strategy: HISTORICAL_ADDITION
    collides_with: NL-NH-AMS-M-HM
    existing_publication_date: "2025-11-01T10:00:00Z"
    resolved_date: "2025-11-15T14:30:00Z"
    reason: "Historical addition: existing GHCID preserved per PID stability principle"
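A record like the ones above could be assembled by a small helper. The sketch below is only an illustration: the function name `build_collision_record` and its signature are assumptions, and it covers just the fields shown in the two examples.

```python
from datetime import datetime, timezone
from typing import List, Optional


def build_collision_record(strategy: str, reason: str, resolved_at: datetime,
                           collides_with: Optional[str] = None,
                           collision_group: Optional[List[str]] = None) -> dict:
    """Build a collision_resolution mapping (hypothetical helper).

    `resolved_at` is expected to be timezone-aware; it is rendered in UTC.
    """
    record = {
        "strategy": strategy,
        "resolved_date": resolved_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": reason,
    }
    # Only include the field that applies to the chosen strategy.
    if collides_with is not None:
        record["collides_with"] = collides_with
    if collision_group is not None:
        record["collision_group"] = list(collision_group)
    return record


record = build_collision_record(
    "HISTORICAL_ADDITION",
    "Historical addition: existing GHCID preserved per PID stability principle",
    datetime(2025, 11, 15, 14, 30, tzinfo=timezone.utc),
    collides_with="NL-NH-AMS-M-HM",
)
print(record)
```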
Implementation Rules
1. Collision Detection
A collision occurs when a newly generated GHCID already exists in the registry:
# Pseudocode with temporal awareness
existing_ghcids = database.get_all_ghcids()  # Include publication dates
new_ghcid = generate_ghcid(institution)

if new_ghcid in existing_ghcids:
    existing_record = existing_ghcids[new_ghcid]

    # Check temporal context
    if existing_record['publication_date'] is not None:
        # Historical addition: only new institution gets name suffix
        name_suffix = generate_name_suffix(institution.name)
        new_ghcid_resolved = f"{new_ghcid}-{name_suffix}"
        existing_ghcid_resolved = new_ghcid  # Unchanged
    else:
        # First batch: both being created simultaneously
        # Generate name suffixes for BOTH institutions
        # Regenerate both GHCIDs with name suffixes
        pass  # Handled by resolve_batch_collisions()
2. When to Add Name Suffix
DO add name suffix:
- ✅ First batch: When collision detected during simultaneous creation → ALL colliding institutions
- ✅ Historical addition: When new institution collides with published GHCID → ONLY new institution
- ✅ When generating GHCID for institution with known collision
- ✅ When institution is added to collision registry
DO NOT add name suffix:
- ❌ When no collision exists (most institutions)
- ❌ "Just in case" for future collisions
- ❌ To an existing published GHCID when a historical collision occurs (preserve PID stability)
3. Name Suffix Normalization
Input formats accepted:
- Native language institution name (any script)
- Will be normalized to ASCII snake_case
Normalized output:
- All lowercase
- Spaces/hyphens → underscores
- Diacritics removed (é → e, ö → o)
- Punctuation removed
- Non-Latin scripts transliterated
Code example:
# Name suffix generation (requires: import re, unicodedata)
def generate_name_suffix(native_name: str) -> str:
    """Convert native language name to snake_case suffix."""
    # Normalize unicode and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Convert to lowercase
    lowercase = ascii_name.lower()
    # Remove punctuation
    no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase)
    # Replace spaces/hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    # Remove remaining non-alphanumeric (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    # Collapse multiple underscores
    return re.sub(r'_+', '_', clean).strip('_')
4. Persistent Numeric ID
The SHA256 hash is computed from the full GHCID string, including the name suffix, and the first 8 bytes of the digest are read as an integer:
# Without name suffix
ghcid = "NL-NH-AMS-M-RM"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 12345678901234567890 (illustrative value)
# With name suffix (different hash!)
ghcid = "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 9876543210987654321 (illustrative value)
Important: Even if two institutions have identical base GHCIDs, their numeric IDs will differ because the name suffix is included in the hashed string. The only exception would be a collision in the truncated 64-bit hash, which is extremely unlikely.
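The pseudocode above can be made concrete with the standard library. This is a sketch under stated assumptions: the GHCID is encoded as UTF-8, the first 8 bytes of the raw digest are used, and they are read as a big-endian unsigned integer. The function name `ghcid_numeric_id` is hypothetical, and the byte order in particular is a guess that the actual implementation may not share.

```python
import hashlib


def ghcid_numeric_id(ghcid: str) -> int:
    """Derive a 64-bit numeric ID from the full GHCID string (illustrative sketch)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Assumption: first 8 digest bytes, interpreted as a big-endian unsigned integer.
    return int.from_bytes(digest[:8], byteorder="big")


base_id = ghcid_numeric_id("NL-NH-AMS-M-SM")
suffixed_id = ghcid_numeric_id("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam")
print(base_id, suffixed_id)
```

Because the result comes from 8 bytes, it always lies in the range 0 to 2^64 - 1, so any stored ID outside that range indicates a bug.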
Collision Registry
Maintain a collision registry to track known conflicts with temporal metadata:
{
  "collisions": [
    {
      "base_ghcid": "NL-NH-AMS-M-SM",
      "collision_type": "FIRST_BATCH",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam",
          "name": "Stedelijk Museum Amsterdam",
          "name_suffix": "stedelijk_museum_amsterdam",
          "publication_date": "2025-11-01T10:00:00Z"
        },
        {
          "ghcid": "NL-NH-AMS-M-SM-science_museum_amsterdam",
          "name": "Science Museum Amsterdam",
          "name_suffix": "science_museum_amsterdam",
          "publication_date": "2025-11-01T10:00:00Z"
        }
      ],
      "detected_date": "2025-11-01T10:00:00Z",
      "resolution_strategy": "Both institutions receive name suffixes (contemporaneous discovery)"
    },
    {
      "base_ghcid": "NL-NH-AMS-M-HM",
      "collision_type": "HISTORICAL_ADDITION",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-HM",
          "name": "Hermitage Museum Amsterdam",
          "name_suffix": null,
          "publication_date": "2025-11-01T10:00:00Z",
          "note": "Original GHCID preserved (published identifier)"
        },
        {
          "ghcid": "NL-NH-AMS-M-HM-amsterdam_historical_museum",
          "name": "Amsterdam Historical Museum",
          "name_suffix": "amsterdam_historical_museum",
          "publication_date": "2025-11-15T14:30:00Z",
          "note": "Historical addition: only new institution receives name suffix"
        }
      ],
      "detected_date": "2025-11-15T14:30:00Z",
      "resolution_strategy": "Preserve existing published GHCID; only new institution gets name suffix"
    }
  ]
}
Registry Usage
- Before generating new GHCID:
  - Check if base GHCID exists in collision registry
  - Check publication date of existing record
  - If published: new institution gets name suffix (historical addition)
  - If unpublished: both get name suffixes (first batch)
- When collision detected:
  - Record temporal context (first batch vs. historical)
  - Apply appropriate resolution strategy
  - Add entries to collision registry with publication dates
  - Update ghcid_history with change reason including temporal context
- Periodic audit:
  - Scan all GHCIDs for duplicates
  - Check publication dates to determine resolution strategy
  - Resolve retroactively if found
  - Update collision registry with temporal metadata
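The duplicate scan in the periodic audit could look roughly like the sketch below. The record shape (dicts with "ghcid" and "name" keys) and the function name `find_duplicate_ghcids` are assumptions for illustration. Choosing the resolution strategy for each group is left to the temporal rules described in this document.

```python
from collections import defaultdict


def find_duplicate_ghcids(records):
    """Group records by GHCID and return only the groups with more than one record.

    `records` is assumed to be an iterable of dicts with a "ghcid" key
    (a hypothetical shape chosen for this sketch).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["ghcid"]].append(record)
    return {ghcid: group for ghcid, group in groups.items() if len(group) > 1}


records = [
    {"ghcid": "NL-NH-AMS-M-SM", "name": "Stedelijk Museum Amsterdam"},
    {"ghcid": "NL-NH-AMS-M-SM", "name": "Science Museum Amsterdam"},
    {"ghcid": "NL-NH-AMS-M-RM", "name": "Rijksmuseum Amsterdam"},
]
for ghcid, group in find_duplicate_ghcids(records).items():
    print(ghcid, [r["name"] for r in group])
```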
What If Name Suffix Still Causes Collision?
Priority Order for Collision Resolution
- Native language name (snake_case) - Primary
- Name + founding year: institution_name_1895
- Name + city qualifier: institution_name_amsterdam
- Sequential number (last resort): institution_name_002
Examples
# Primary: Native language name
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
# If still collision: Add founding year
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_1895
# If still collision: Sequential
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_002
Implementation note: Start with name suffix only. The vast majority of collisions should be resolved by native language names, since two institutions of the same type in the same city rarely share a full official name.
Validation Rules
GHCID Pattern Regex
Without name suffix:
^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}$
With name suffix (updated):
^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$
Breakdown:
- [A-Z]{2} - Country code (2 letters)
- [A-Z0-9]{1,3} - Region code (1-3 chars)
- [A-Z]{3} - City LOCODE (3 letters)
- [A-Z] - Type code (1 letter)
- [A-Z0-9]{1,10} - Abbreviation (1-10 chars)
- (-[a-z0-9_]+)? - Optional name suffix in snake_case
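A minimal validator built on the updated pattern might look like this. The name `validate` matches the calls used in the validation tests, but the body is only a sketch of how the regex could be applied, not the project's implementation.

```python
import re

# Pattern copied from the "With name suffix (updated)" regex above.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$"
)


def validate(ghcid: str) -> bool:
    """Return True if the string is a well-formed GHCID, with or without name suffix."""
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    return GHCID_PATTERN.fullmatch(ghcid) is not None


print(validate("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"))
print(validate("NL-NH-AMS-M-SM-"))
```

Using `fullmatch` rather than `match` is a deliberate choice here: in Python, `$` also matches just before a trailing newline, so `match` would accept an identifier followed by "\n".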
Validation Tests
# Valid GHCIDs
assert validate("NL-NH-AMS-M-RM") # Without name suffix
assert validate("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam") # With name suffix
assert validate("US-NY-NYC-M-MOMA") # International
assert validate("NL-NH-AMS-M-SM-museum") # Short name suffix
assert validate("FR-75-PAR-M-DO-musee_dorsay") # French with diacritics removed
# Invalid GHCIDs
assert not validate("NL-NH-AMS-M-SM-Stedelijk_Museum") # Uppercase in name suffix
assert not validate("NL-NH-AMS-M-SM-musée_d'orsay") # Diacritics/apostrophe not allowed
assert not validate("NL-NH-AMS-M-SM-") # Empty name suffix
Migration Strategy
For Existing Institutions Without Name Suffix
When collision is detected on existing institutions without name suffixes, temporal context determines strategy:
First Batch Collision (Retroactive Detection)
If both institutions were created simultaneously but collision not detected until later:
- Verify creation dates (check provenance.extraction_date)
- If same date: Treat as first batch → both get name suffixes
- Generate name suffixes for both institutions from native language names
- Update both GHCIDs with name suffixes
- Update both ghcid_numeric (hash changes!)
- Add history entries for both:
# Institution 1
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-SM",
    ghcid_new="NL-NH-AMS-M-SM-stedelijk_museum_amsterdam",
    valid_from="2025-11-05T12:00:00Z",
    valid_to=None,
    reason="First batch collision (retroactive): Added name suffix to disambiguate from Science Museum Amsterdam (both created 2025-11-01)"
)

# Institution 2
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-SM",
    ghcid_new="NL-NH-AMS-M-SM-science_museum_amsterdam",
    valid_from="2025-11-05T12:00:00Z",
    valid_to=None,
    reason="First batch collision (retroactive): Added name suffix to disambiguate from Stedelijk Museum Amsterdam (both created 2025-11-01)"
)
Historical Addition Collision
If new institution collides with already published existing GHCID:
- Verify publication date of existing record
- Existing GHCID: NO CHANGE (preserve published PID)
- New institution: Gets name suffix
- Update only new ghcid_numeric
- Add history entry for new institution only:
GHCIDHistoryEntry(
    ghcid_old="NL-NH-AMS-M-HM",  # What it would have been
    ghcid_new="NL-NH-AMS-M-HM-amsterdam_historical_museum",  # What it becomes
    valid_from="2025-11-15T14:30:00Z",
    valid_to=None,
    reason="Historical addition collision: Base GHCID NL-NH-AMS-M-HM already published for Hermitage Museum Amsterdam (2025-11-01). Added name suffix to preserve existing PID stability."
)
- No history entry for existing institution (GHCID unchanged)
Decision Flow
def determine_collision_strategy(existing_inst, new_inst):
    """Determine whether to treat as first batch or historical addition."""
    # Check if existing record has publication date
    if existing_inst.provenance.publication_date is not None:
        # Published identifier → Historical addition
        return "HISTORICAL_ADDITION"

    # Check if both created on same date
    if (existing_inst.provenance.extraction_date ==
            new_inst.provenance.extraction_date):
        # Same creation date → First batch (retroactive)
        return "FIRST_BATCH_RETROACTIVE"

    # Different dates, no publication → First batch (simultaneous processing)
    if abs((existing_inst.provenance.extraction_date -
            new_inst.provenance.extraction_date).days) < 1:
        return "FIRST_BATCH"

    # Default: Historical addition (new is significantly later)
    return "HISTORICAL_ADDITION"
Backward Compatibility
Q: What happens to old numeric IDs when name suffix is added?
A: The numeric ID changes because it's a hash of the full GHCID string. This is intentional:
- Old numeric ID: SHA256("NL-NH-AMS-M-SM")[:8]
- New numeric ID: SHA256("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam")[:8]
Migration plan depends on temporal context:
First Batch Collision (Both Updated)
- Keep ghcid_original unchanged for both (preserves old GHCID)
- Update ghcid_current with name suffix for both
- Generate new ghcid_numeric from updated GHCID for both
- Add comprehensive history entries explaining change
- Maintain mapping table: old_numeric_id → new_numeric_id
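One detail of the mapping table deserves attention: in a first batch collision both institutions start from the same base GHCID, so they share one old numeric ID that maps to two new ones. A sketch of a mapping that allows for this is shown below. The helper names and the hashing details (UTF-8, first 8 digest bytes, big-endian) are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def _numeric_id(ghcid: str) -> int:
    """Assumed numeric ID derivation: first 8 SHA-256 digest bytes, big-endian."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


def build_numeric_id_mapping(renames):
    """Map each old numeric ID to the list of new numeric IDs that replace it.

    `renames` is an iterable of (old_ghcid, new_ghcid) pairs (hypothetical input shape).
    """
    mapping = defaultdict(list)
    for old_ghcid, new_ghcid in renames:
        mapping[_numeric_id(old_ghcid)].append(_numeric_id(new_ghcid))
    return dict(mapping)


mapping = build_numeric_id_mapping([
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"),
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-science_museum_amsterdam"),
])
print(mapping)
```

A consumer holding only the old numeric ID cannot tell which of the two institutions it referred to, so a one-to-many table (or an extra disambiguating column) seems safer than a plain one-to-one lookup.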
Historical Addition (Only New Updated)
- Existing institution: NO CHANGES
  - ghcid_original: unchanged
  - ghcid_current: unchanged
  - ghcid_numeric: unchanged
  - No history entry added
- New institution: Gets name suffix from start
  - ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum (includes name suffix)
  - ghcid_current: NL-NH-AMS-M-HM-amsterdam_historical_museum
  - ghcid_numeric: Hash of full GHCID with name suffix
  - History entry documents collision with existing published PID
Testing Strategy
Unit Tests
def test_ghcid_without_collision():
    """Most institutions should NOT have name suffix"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="RM"
    )
    assert components.to_string() == "NL-NH-AMS-M-RM"
    assert components.name_suffix is None
def test_first_batch_collision():
    """First batch: BOTH institutions get name suffixes"""
    inst1 = create_institution(
        name="Stedelijk Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z"
    )
    inst2 = create_institution(
        name="Science Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z"
    )
    # Process batch
    resolve_batch_collisions([inst1, inst2])
    # Both should have name suffixes
    assert inst1.ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert inst2.ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"
def test_historical_addition_collision():
    """Historical addition: Only NEW institution gets name suffix"""
    # Existing published institution
    existing = create_institution(
        name="Hermitage Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z",
        publication_date="2025-11-01T10:00:00Z"
    )
    existing.ghcid = "NL-NH-AMS-M-HM"
    # New historical institution
    new = create_institution(
        name="Amsterdam Historical Museum",
        extraction_date="2025-11-15T14:30:00Z"
    )
    # Registry entry in the shape resolve_collision() reads
    registry = {
        existing.ghcid: {
            'name': existing.name,
            'publication_date': "2025-11-01T10:00:00Z",
            'institution_data': existing,
        }
    }
    # Resolve collision (resolve_collision returns a dict; it does not mutate its inputs)
    result = resolve_collision(new, registry)
    assert result['strategy'] == 'HISTORICAL_ADDITION'
    # Existing GHCID unchanged
    assert result['existing_ghcid'] == "NL-NH-AMS-M-HM"
    assert existing.ghcid == "NL-NH-AMS-M-HM"
    # New institution gets name suffix
    assert result['new_ghcid'] == "NL-NH-AMS-M-HM-amsterdam_historical_museum"
def test_name_suffix_normalization():
    """Name suffix should be normalized to snake_case"""
    # Test diacritics removal
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    # Test German umlauts
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    # Test spaces and hyphens
    assert generate_name_suffix("Van Gogh Museum") == "van_gogh_museum"
    # Test punctuation removal
    assert generate_name_suffix("St. Peter's Church Archive") == "st_peters_church_archive"
def test_collision_strategy_determination():
    """Test temporal context determines resolution strategy"""
    # Same date, nothing published → First batch (retroactive)
    strategy = determine_collision_strategy(
        create_institution(extraction_date="2025-11-01T10:00:00Z"),
        create_institution(extraction_date="2025-11-01T10:00:00Z")
    )
    assert strategy == "FIRST_BATCH_RETROACTIVE"
    # Existing published → Historical addition
    strategy = determine_collision_strategy(
        create_institution(
            extraction_date="2025-11-01T10:00:00Z",
            publication_date="2025-11-01T10:00:00Z"
        ),
        create_institution(extraction_date="2025-11-15T14:30:00Z")
    )
    assert strategy == "HISTORICAL_ADDITION"
def test_simultaneous_historical_additions():
    """Multiple historical additions: all new get name suffixes"""
    existing = Institution(
        name="Maritime Museum Amsterdam",
        ghcid="NL-NH-AMS-M-MM",
        publication_date="2025-11-01T10:00:00Z"
    )
    new1 = Institution(
        name="Dutch Navy Museum",
        extraction_date="2025-12-01T09:00:00Z"
    )
    new2 = Institution(
        name="Amsterdam Naval Archive",
        extraction_date="2025-12-01T09:00:00Z"
    )
    # resolve_batch_collisions() takes only the batch; the existing record is not passed in
    resolve_batch_collisions([new1, new2])
    # Existing unchanged
    assert existing.ghcid == "NL-NH-AMS-M-MM"
    # Both new get name suffixes
    assert new1.ghcid == "NL-NH-AMS-M-MM-dutch_navy_museum"
    assert new2.ghcid == "NL-NH-AMS-M-MM-amsterdam_naval_archive"
Integration Tests
def test_real_dutch_museums_no_collisions():
    """Verify no actual collisions in Dutch ISIL registry"""
    parser = ISILRegistryParser()
    records = parser.parse_file("data/ISIL-codes_2025-08-01.csv")
    ghcids = {}
    collisions = []
    for record in records:
        ghcid_base = generate_base_ghcid(record)
        if ghcid_base in ghcids:
            collisions.append((ghcid_base, ghcids[ghcid_base], record))
        ghcids[ghcid_base] = record
    # Report any collisions found
    print(f"Found {len(collisions)} collisions")
    for base, record1, record2 in collisions:
        print(f" {base}: {record1.instelling} vs {record2.instelling}")
Future Enhancements
- Collision probability calculator
  - Analyze existing institutions
  - Predict collision likelihood
  - Suggest alternative abbreviations
- Name suffix validation
  - Verify name suffix matches institution's official name
  - Flag inconsistencies for review
  - Support multilingual name variants
- Collision warning system
  - Alert when generating GHCID similar to existing
  - Suggest checking for duplicates
  - Proactive collision prevention
- Alternative disambiguation (fallback for name suffix collisions)
  - Use founding year: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_1895
  - Use street address: NL-NH-AMS-M-SM-museum_paulus_potterstraat
  - Use parent org: NL-NH-AMS-M-SM-museum_gemeente_amsterdam
References
- GHCID Spec: docs/plan/global_glam/06-global-identifier-system.md
- Name Suffix Generation: generate_name_suffix() function in this document
- Implementation: src/glam_extractor/identifiers/ghcid.py
- Schema: schemas/heritage_custodian.yaml (ghcid_current, ghcid_original slots)
Last Updated: 2025-11-30
Authors: GLAM Data Extraction Project Team