GHCID Collision Resolution Strategy

Version: 2.0
Date: 2025-11-30
Status: Implemented

Problem Statement

When two heritage institutions share:

  • Same geographic location (city)
  • Same institution type (e.g., both museums)
  • Same name abbreviation

...they will generate identical GHCID identifiers, causing collisions.

Example Collision Scenario

Institution 1: Stedelijk Museum Amsterdam
Institution 2: Science Museum Amsterdam (hypothetical)

Both would generate:

NL-NH-AMS-M-SM

Solution: Native Language Name Suffix

When a collision is detected, append the institution's full legal name, in its native language and converted to snake_case, to the GHCID.

Format

Base GHCID:

{Country}-{Region}-{City}-{Type}-{Abbreviation}

GHCID with Collision Resolver:

{Country}-{Region}-{City}-{Type}-{Abbreviation}-{native_name_in_snake_case}
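The two formats above differ only in the optional trailing component. A minimal sketch of composing them follows; `format_ghcid` and its parameter names are hypothetical and are not taken from the project's code.

```python
from typing import Optional


def format_ghcid(country: str, region: str, city: str, inst_type: str,
                 abbreviation: str, name_suffix: Optional[str] = None) -> str:
    """Join the GHCID components; append the name suffix only when one is given."""
    base = "-".join([country, region, city, inst_type, abbreviation])
    return f"{base}-{name_suffix}" if name_suffix else base


# Base GHCID (no collision)
print(format_ghcid("NL", "NH", "AMS", "M", "RM"))
# GHCID with collision resolver
print(format_ghcid("NL", "NH", "AMS", "M", "SM", "stedelijk_museum_amsterdam"))
```

Keeping the suffix optional means that the base form stays the default and the longer form is produced only when a collision has actually been detected.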

Name Suffix Generation

Converting institution names to snake_case suffixes:

import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Non-Latin scripts must be transliterated before calling this function:
    any character outside a-z, 0-9 and underscore is dropped in the cleanup step.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics (combining marks)
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

    # Remove apostrophes (straight and typographic), commas, and other punctuation
    no_punct = re.sub(r"['\u2018\u2019`\",.:;!?()\[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)

    # Collapse multiple underscores and trim leading/trailing ones
    final = re.sub(r'_+', '_', clean).strip('_')

    return final

Name suffix rules:

  • Use the institution's full official name in its native language
  • Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)
  • Remove all diacritics (é → e, ö → o, ñ → n)
  • Remove punctuation (apostrophes, commas, periods)
  • Replace spaces with underscores
  • All lowercase
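One caveat to the diacritics rule: NFD decomposition only helps for letters that decompose into a base letter plus a combining mark. Letters such as ß, ø, æ or ł have no such decomposition, so the final cleanup step of `generate_name_suffix()` would silently drop them. The sketch below shows one possible pre-folding step; `FOLD_MAP` and `fold_to_ascii` are hypothetical names, and the mapping table is an illustrative assumption rather than part of the GHCID specification.

```python
import unicodedata

# Hypothetical fallback table for letters without a canonical decomposition.
# Extend it for the languages present in the data.
FOLD_MAP = str.maketrans({
    "ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L", "đ": "d", "Đ": "D",
})


def fold_to_ascii(name: str) -> str:
    """Fold special letters first, then strip combining marks via NFD."""
    folded = name.translate(FOLD_MAP)
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(fold_to_ascii("Straße"))
print(fold_to_ascii("Københavns Museum"))
```

The folded string could then be passed to `generate_name_suffix()` unchanged, since folding only touches characters that the function would otherwise discard.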

Examples

| Institution | Base GHCID | With Name Suffix | Notes |
|---|---|---|---|
| Rijksmuseum Amsterdam | NL-NH-AMS-M-RM | N/A | No collision, no suffix needed |
| Stedelijk Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-stedelijk_museum_amsterdam | Collision detected |
| Science Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-science_museum_amsterdam | Collision detected |
| Van Gogh Museum | NL-NH-AMS-M-VGM | N/A | No collision |

Temporal Dimension in Collision Resolution

The Critical Distinction: First Batch vs. Historical Addition

Collision resolution behavior differs based on when the collision is detected:

Scenario A: First Batch Collision (Contemporaneous Discovery)

When: Multiple institutions discovered simultaneously during initial GHCID generation (e.g., batch import from CSV).

Rule: ALL colliding institutions receive name suffixes.

Why: No institution has temporal priority; all are being created at the same time. Fair treatment requires all to be disambiguated equally.

Example:

# Discovered on 2025-11-01 from Dutch ISIL registry batch import
Institution 1: Stedelijk Museum Amsterdam
  → ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

Institution 2: Science Museum Amsterdam  
  → ghcid: NL-NH-AMS-M-SM-science_museum_amsterdam

# Both get name suffixes because both discovered simultaneously

Scenario B: Historical Addition (Post-Publication Collision)

When: A newly added historical institution collides with an already published GHCID.

Rule: ONLY the new institution receives a name suffix. The existing GHCID remains unchanged.

Why: PID Stability Principle - Published persistent identifiers may already be cited in research papers, integrated into third-party datasets, or embedded in API responses. Changing existing PIDs breaks citations and external references.

Example:

# Published 2025-11-01
Institution 1: Hermitage Museum Amsterdam
  → ghcid: NL-NH-AMS-M-HM  # Unchanged forever

# Historical institution added 2025-11-15
Institution 2: Amsterdam Historical Museum (historical records 1926-2001)
  → ghcid: NL-NH-AMS-M-HM-amsterdam_historical_museum  # New institution gets name suffix

# Existing GHCID preserved; only new addition disambiguated

Decision Matrix

| Discovery Context | Existing Institution | New Institution | Resolution Strategy |
|---|---|---|---|
| First batch (both new) | None (being created) | None (being created) | Both get name suffixes |
| Historical addition | Already published | Being added now | Only new gets name suffix |
| Simultaneous historical additions | Already published | Multiple being added | All new get name suffixes; existing unchanged |

Timeline Example: Demonstrating the Temporal Principle

2025-11-01 (First Batch Import)
├─ Stedelijk Museum Amsterdam added
│  └─ ghcid: NL-NH-AMS-M-SM
├─ Science Museum Amsterdam discovered (collision!)
│  └─ BOTH institutions updated:
│     ├─ Stedelijk: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
│     └─ Science:   NL-NH-AMS-M-SM-science_museum_amsterdam

2025-11-15 (Historical Research Addition)
├─ Hermitage Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-HM (published, immutable)
├─ Amsterdam Historical Museum added (historical institution, 1926-2001)
│  └─ Collision detected!
│     ├─ Hermitage GHCID: UNCHANGED (NL-NH-AMS-M-HM)
│     └─ Historical Museum: NL-NH-AMS-M-HM-amsterdam_historical_museum (gets name suffix)

2025-12-01 (Another Historical Addition)
├─ Maritime Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-MM (published, immutable)
├─ Two historical naval museums discovered in archive:
│  ├─ Dutch Navy Museum (1906-1955) → collides!
│  └─ Amsterdam Naval Archive (1820-1901) → collides!
└─ Resolution:
   ├─ Existing Maritime Museum: UNCHANGED (NL-NH-AMS-M-MM)
   └─ BOTH new institutions get name suffixes:
      ├─ Dutch Navy: NL-NH-AMS-M-MM-dutch_navy_museum
      └─ Naval Archive: NL-NH-AMS-M-MM-amsterdam_naval_archive

Implementation Guidance

Collision Detection with Temporal Context

def resolve_collision(new_institution, existing_ghcids_registry):
    """
    Resolve a GHCID collision between a new institution and the registry.

    Args:
        new_institution: Institution being added
        existing_ghcids_registry: Dict mapping GHCID → {name, publication_date, ...}

    Returns:
        Dict with the resolution strategy and the resulting GHCIDs
    """
    new_ghcid_base = generate_base_ghcid(new_institution)
    existing_entry = existing_ghcids_registry.get(new_ghcid_base)

    # No collision: the base GHCID is not in the registry
    if existing_entry is None:
        return {
            'strategy': 'NO_COLLISION',
            'new_ghcid': new_ghcid_base,
            'reason': 'Base GHCID not present in registry'
        }

    # Historical addition case: the existing GHCID is already published
    if existing_entry['publication_date'] is not None:
        print(f"Collision with published GHCID {new_ghcid_base}")
        print(f"  → Existing: {existing_entry['name']} (published {existing_entry['publication_date']})")
        print(f"  → New: {new_institution.name} (being added now)")
        print("  → Strategy: Only new institution gets name suffix")

        # Existing GHCID remains unchanged
        existing_ghcid = new_ghcid_base

        # New institution gets name suffix
        name_suffix = generate_name_suffix(new_institution.name)
        new_ghcid = f"{new_ghcid_base}-{name_suffix}"

        return {
            'strategy': 'HISTORICAL_ADDITION',
            'existing_ghcid': existing_ghcid,  # Unchanged
            'new_ghcid': new_ghcid,            # With name suffix
            'reason': "Historical addition collision: preserve existing published PID"
        }

    # First batch collision case (existing entry not yet published).
    # Both institutions need name suffixes; this is handled during batch processing.
    return {
        'strategy': 'FIRST_BATCH',
        'reason': 'All colliding institutions in batch receive name suffixes'
    }


def resolve_batch_collisions(new_institutions_batch):
    """
    Resolve collisions within a batch of new institutions.
    
    When multiple institutions in the same batch collide, ALL get name suffixes.
    """
    ghcid_map = {}  # base_ghcid → list of institutions
    
    # Group by base GHCID
    for inst in new_institutions_batch:
        base_ghcid = generate_base_ghcid(inst)
        if base_ghcid not in ghcid_map:
            ghcid_map[base_ghcid] = []
        ghcid_map[base_ghcid].append(inst)
    
    # Resolve collisions
    for base_ghcid, institutions in ghcid_map.items():
        if len(institutions) > 1:
            print(f"First batch collision detected: {base_ghcid}")
            print(f"  → {len(institutions)} institutions collide")
            print(f"  → Strategy: All {len(institutions)} get name suffixes")
            
            # All institutions in collision get name suffixes
            for inst in institutions:
                name_suffix = generate_name_suffix(inst.name)
                inst.ghcid = f"{base_ghcid}-{name_suffix}"
                inst.provenance.notes += (
                    f" | First batch collision: {len(institutions)} institutions "
                    f"share base GHCID {base_ghcid}"
                )

Edge Cases

1. Multiple Historical Additions Simultaneously

Scenario: Two historical institutions discovered at the same time, both colliding with existing GHCID.

Resolution: Both new institutions get name suffixes; existing unchanged.

# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-MM

# Both added 2025-12-01
new_inst_1.ghcid = "NL-NH-AMS-M-MM-dutch_navy_museum"
new_inst_2.ghcid = "NL-NH-AMS-M-MM-amsterdam_naval_archive"
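This edge case combines both rules: the published GHCID is left alone, and every new institution that shares its base GHCID gets a suffix. A minimal, self-contained sketch follows. The `NewInstitution` dataclass and `resolve_against_published` are hypothetical, and the base GHCID and name suffix are passed in precomputed so that the example does not depend on `generate_base_ghcid()` or `generate_name_suffix()`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NewInstitution:
    base_ghcid: str               # assumed output of generate_base_ghcid()
    name_suffix: str              # assumed output of generate_name_suffix()
    ghcid: Optional[str] = None   # filled in by the resolver


def resolve_against_published(batch: List[NewInstitution],
                              published: Dict[str, str]) -> None:
    """Assign GHCIDs to a batch. `published` maps GHCID -> name and is never modified."""
    # Count how often each base GHCID occurs inside the batch itself
    counts: Dict[str, int] = {}
    for inst in batch:
        counts[inst.base_ghcid] = counts.get(inst.base_ghcid, 0) + 1

    for inst in batch:
        # A suffix is needed if the base is already published or shared within the batch
        collides = inst.base_ghcid in published or counts[inst.base_ghcid] > 1
        inst.ghcid = f"{inst.base_ghcid}-{inst.name_suffix}" if collides else inst.base_ghcid


published = {"NL-NH-AMS-M-MM": "Maritime Museum Amsterdam"}
batch = [
    NewInstitution("NL-NH-AMS-M-MM", "dutch_navy_museum"),
    NewInstitution("NL-NH-AMS-M-MM", "amsterdam_naval_archive"),
]
resolve_against_published(batch, published)
print([inst.ghcid for inst in batch])
```

Because the published mapping is only read, the existing identifier cannot change as a side effect of adding the batch.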

2. Historical Institution Without Unique Name

Scenario: New historical institution collides and has a generic name that may not be unique.

Fallback Strategy (if name suffix still causes collision):

  1. Name + founding year: NL-NH-AMS-M-HM-historical_museum_1926
  2. Name + city qualifier: NL-NH-AMS-M-HM-historical_museum_amsterdam
  3. Sequential: NL-NH-AMS-M-HM-historical_museum_002 (increment from existing)

def get_collision_suffix(institution, existing_suffixes=None):
    """Get collision resolution suffix, handling duplicates."""
    existing_suffixes = existing_suffixes or set()

    # Primary: Native language name in snake_case
    name_suffix = generate_name_suffix(institution.name)

    # Check if this suffix already exists
    if name_suffix in existing_suffixes:
        # Fallback 1: add founding year if available
        if institution.founding_year:
            name_suffix = f"{name_suffix}_{institution.founding_year}"

        # Fallback 2 (city qualifier) is not implemented in this helper

        # Still collision? Add zero-padded sequential number, starting at 002
        if name_suffix in existing_suffixes:
            counter = 2
            base_suffix = name_suffix
            while name_suffix in existing_suffixes:
                name_suffix = f"{base_suffix}_{counter:03d}"
                counter += 1

    return name_suffix

3. Retroactive Discovery of First Batch Collision

Scenario: Two institutions were created in first batch, but collision wasn't detected until later.

Resolution: Treat as first batch collision (both get name suffixes), even though discovered late.

Justification: Intent was to create both simultaneously; detection timing doesn't change temporal relationship.

# If both have the same extraction_date in provenance metadata
if inst1.provenance.extraction_date == inst2.provenance.extraction_date:
    strategy = 'FIRST_BATCH'  # Both get name suffixes
else:
    strategy = 'HISTORICAL_ADDITION'  # Only later one gets name suffix

Provenance Tracking for Collision Events

Record collision resolution in provenance metadata:

# First batch collision
provenance:
  extraction_date: "2025-11-01T10:00:00Z"
  collision_resolution:
    strategy: FIRST_BATCH
    collision_group: [stedelijk_museum_amsterdam, science_museum_amsterdam]
    resolved_date: "2025-11-01T10:00:00Z"
    reason: "Contemporaneous discovery: both institutions added in ISIL registry batch import"

# Historical addition collision
provenance:
  extraction_date: "2025-11-15T14:30:00Z"
  collision_resolution:
    strategy: HISTORICAL_ADDITION
    collides_with: NL-NH-AMS-M-HM
    existing_publication_date: "2025-11-01T10:00:00Z"
    resolved_date: "2025-11-15T14:30:00Z"
    reason: "Historical addition: existing GHCID preserved per PID stability principle"

Implementation Rules

1. Collision Detection

A collision occurs when the newly generated base GHCID already exists in the registry:

# Pseudocode with temporal awareness
existing_ghcids = database.get_all_ghcids()  # Dict: GHCID → record, including publication dates
new_ghcid = generate_ghcid(institution)

if new_ghcid in existing_ghcids:
    existing_record = existing_ghcids[new_ghcid]

    # The new institution gets a name suffix in both cases
    name_suffix = generate_name_suffix(institution.name)
    new_ghcid_resolved = f"{new_ghcid}-{name_suffix}"

    # Check temporal context
    if existing_record['publication_date'] is not None:
        # Historical addition: only new institution gets name suffix
        existing_ghcid_resolved = new_ghcid  # Unchanged
    else:
        # First batch: both being created simultaneously,
        # so the existing institution gets a name suffix as well
        existing_suffix = generate_name_suffix(existing_record['name'])
        existing_ghcid_resolved = f"{new_ghcid}-{existing_suffix}"

2. When to Add Name Suffix

DO add name suffix:

  • First batch: When collision detected during simultaneous creation → ALL colliding institutions
  • Historical addition: When new institution collides with published GHCID → ONLY new institution
  • When generating GHCID for institution with known collision
  • When institution is added to collision registry

DO NOT add name suffix:

  • When no collision exists (most institutions)
  • "Just in case", to guard against possible future collisions
  • To an existing published GHCID when a historical collision occurs (preserve PID stability)

3. Name Suffix Normalization

Input formats accepted:

  • Native language institution name (any script)
  • Will be normalized to ASCII snake_case

Normalized output:

  • All lowercase
  • Spaces/hyphens → underscores
  • Diacritics removed (é → e, ö → o)
  • Punctuation removed
  • Non-Latin scripts transliterated

Code example:

# Name suffix generation
def generate_name_suffix(native_name: str) -> str:
    """Convert native language name to snake_case suffix."""
    # Normalize unicode and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    
    # Convert to lowercase
    lowercase = ascii_name.lower()
    
    # Remove punctuation
    no_punct = re.sub(r"['\u2018\u2019`\",.:;!?()\[\]{}]", '', lowercase)
    
    # Replace spaces/hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    
    # Remove remaining non-alphanumeric (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    
    # Collapse multiple underscores
    return re.sub(r'_+', '_', clean).strip('_')

4. Persistent Numeric ID

The SHA256 hash is computed from the full GHCID string including name suffix:

# Without name suffix
ghcid = "NL-NH-AMS-M-RM"
numeric_id = SHA256(ghcid)[:8] → int  # first 8 bytes of the digest as an unsigned 64-bit integer

# With name suffix (different hash!)
ghcid = "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
numeric_id = SHA256(ghcid)[:8] → int  # a different 64-bit integer

Important: Even if two institutions have identical base GHCIDs, their numeric IDs will be different because the name suffix is included in the hash.
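The hashing step can be sketched with the standard library as follows. The function name `ghcid_numeric`, the UTF-8 encoding, and the big-endian byte order are assumptions made for illustration; the GHCID specification referenced at the end of this document is the authority on the exact derivation.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """First 8 bytes of the SHA-256 digest, read as an unsigned integer.

    Assumptions: UTF-8 input encoding and big-endian byte order.
    """
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


base_id = ghcid_numeric("NL-NH-AMS-M-SM")
suffixed_id = ghcid_numeric("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam")
print(base_id != suffixed_id)
```

Eight bytes give a value in the range 0 to 2^64 - 1, so the result fits an unsigned 64-bit column.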

Collision Registry

Maintain a collision registry to track known conflicts with temporal metadata:

{
  "collisions": [
    {
      "base_ghcid": "NL-NH-AMS-M-SM",
      "collision_type": "FIRST_BATCH",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam",
          "name": "Stedelijk Museum Amsterdam",
          "name_suffix": "stedelijk_museum_amsterdam",
          "publication_date": "2025-11-01T10:00:00Z"
        },
        {
          "ghcid": "NL-NH-AMS-M-SM-science_museum_amsterdam",
          "name": "Science Museum Amsterdam",
          "name_suffix": "science_museum_amsterdam",
          "publication_date": "2025-11-01T10:00:00Z"
        }
      ],
      "detected_date": "2025-11-01T10:00:00Z",
      "resolution_strategy": "Both institutions receive name suffixes (contemporaneous discovery)"
    },
    {
      "base_ghcid": "NL-NH-AMS-M-HM",
      "collision_type": "HISTORICAL_ADDITION",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-HM",
          "name": "Hermitage Museum Amsterdam",
          "name_suffix": null,
          "publication_date": "2025-11-01T10:00:00Z",
          "note": "Original GHCID preserved (published identifier)"
        },
        {
          "ghcid": "NL-NH-AMS-M-HM-amsterdam_historical_museum",
          "name": "Amsterdam Historical Museum",
          "name_suffix": "amsterdam_historical_museum",
          "publication_date": "2025-11-15T14:30:00Z",
          "note": "Historical addition: only new institution receives name suffix"
        }
      ],
      "detected_date": "2025-11-15T14:30:00Z",
      "resolution_strategy": "Preserve existing published GHCID; only new institution gets name suffix"
    }
  ]
}

Registry Usage

  1. Before generating new GHCID:

    • Check if base GHCID exists in collision registry
    • Check publication date of existing record
    • If published: new institution gets name suffix (historical addition)
    • If unpublished: both get name suffixes (first batch)
  2. When collision detected:

    • Record temporal context (first batch vs. historical)
    • Apply appropriate resolution strategy
    • Add entries to collision registry with publication dates
    • Update ghcid_history with change reason including temporal context
  3. Periodic audit:

    • Scan all GHCIDs for duplicates
    • Check publication dates to determine resolution strategy
    • Resolve retroactively if found
    • Update collision registry with temporal metadata
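The first usage step (look up the base GHCID, then branch on publication status) might be sketched as below. The registry layout follows the JSON example above; the function name `strategy_for_new_institution` and the rule "any published entry in the group means historical addition" are assumptions for illustration.

```python
from typing import Optional


def strategy_for_new_institution(registry: dict, base_ghcid: str) -> Optional[str]:
    """Return the strategy for a new institution, or None if no collision is recorded."""
    for collision in registry.get("collisions", []):
        if collision.get("base_ghcid") != base_ghcid:
            continue
        # If any institution in the collision group is published,
        # existing identifiers must stay untouched.
        published = any(
            entry.get("publication_date")
            for entry in collision.get("institutions", [])
        )
        return "HISTORICAL_ADDITION" if published else "FIRST_BATCH"
    return None


registry = {
    "collisions": [
        {
            "base_ghcid": "NL-NH-AMS-M-HM",
            "institutions": [
                {"ghcid": "NL-NH-AMS-M-HM", "publication_date": "2025-11-01T10:00:00Z"},
            ],
        },
        {
            "base_ghcid": "NL-NH-AMS-M-XX",
            "institutions": [
                {"ghcid": "NL-NH-AMS-M-XX", "publication_date": None},
            ],
        },
    ]
}
print(strategy_for_new_institution(registry, "NL-NH-AMS-M-HM"))
```

A base GHCID that is absent from the collision registry still has to be checked against the full GHCID registry, since a first collision is by definition not yet recorded here.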

What If Name Suffix Still Causes Collision?

Priority Order for Collision Resolution

  1. Native language name (snake_case) - Primary
  2. Name + founding year: institution_name_1895
  3. Name + city qualifier: institution_name_amsterdam
  4. Sequential number (last resort): institution_name_002

Examples

# Primary: Native language name
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# If still collision: Add founding year
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_1895

# If still collision: Sequential
NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_002

Implementation note: Start with the name suffix only. Most collisions are resolved by the native language name alone, because two institutions of the same type in the same city rarely share a full official name.
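The full priority order, including the city qualifier, can be sketched as a list of candidates that is tried in sequence. `pick_suffix` and its parameters are hypothetical names; the zero-padded numbering starting at 002 follows the examples in this section.

```python
from typing import Optional, Set


def pick_suffix(name_suffix: str, existing: Set[str],
                founding_year: Optional[int] = None,
                city: Optional[str] = None) -> str:
    """Return the first unused suffix in priority order."""
    candidates = [name_suffix]                              # 1. native language name
    if founding_year is not None:
        candidates.append(f"{name_suffix}_{founding_year}")  # 2. name + founding year
    if city and not name_suffix.endswith(f"_{city}"):
        candidates.append(f"{name_suffix}_{city}")           # 3. name + city qualifier
    for candidate in candidates:
        if candidate not in existing:
            return candidate

    # 4. Last resort: sequential number, zero-padded, starting at 002
    counter = 2
    while f"{name_suffix}_{counter:03d}" in existing:
        counter += 1
    return f"{name_suffix}_{counter:03d}"


taken = {"historical_museum", "historical_museum_1926"}
print(pick_suffix("historical_museum", taken, founding_year=1926, city="amsterdam"))
```

The city qualifier is skipped when the name already ends with the city, which avoids suffixes such as `stedelijk_museum_amsterdam_amsterdam`.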

Validation Rules

GHCID Pattern Regex

Without name suffix:

^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}$

With name suffix (updated):

^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$

Breakdown:

  • [A-Z]{2} - Country code (2 letters)
  • [A-Z0-9]{1,3} - Region code (1-3 chars)
  • [A-Z]{3} - City LOCODE (3 letters)
  • [A-Z] - Type code (1 letter)
  • [A-Z0-9]{1,10} - Abbreviation (1-10 chars)
  • (-[a-z0-9_]+)? - Optional name suffix in snake_case

Validation Tests

# Valid GHCIDs
assert validate("NL-NH-AMS-M-RM")                              # Without name suffix
assert validate("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam")   # With name suffix
assert validate("US-NY-NYC-M-MOMA")                            # International
assert validate("NL-NH-AMS-M-SM-museum")                       # Short name suffix
assert validate("FR-75-PAR-M-DO-musee_dorsay")                 # French with diacritics removed

# Invalid GHCIDs
assert not validate("NL-NH-AMS-M-SM-Stedelijk_Museum")  # Uppercase in name suffix
assert not validate("NL-NH-AMS-M-SM-musée_d'orsay")     # Diacritics/apostrophe not allowed
assert not validate("NL-NH-AMS-M-SM-")                  # Empty name suffix
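The tests above call a `validate` function that is not shown in this document. A minimal sketch built directly on the "with name suffix" pattern could look like this; the project's actual implementation in `src/glam_extractor/identifiers/ghcid.py` may differ.

```python
import re

# Pattern copied from the "With name suffix" regex above
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$"
)


def validate(ghcid: str) -> bool:
    """Return True if the string is a well-formed GHCID, with or without name suffix."""
    return GHCID_PATTERN.fullmatch(ghcid) is not None


print(validate("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"))
print(validate("NL-NH-AMS-M-SM-"))
```

Using `fullmatch` rather than `match` also rejects a trailing newline, which `$` alone would tolerate.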

Migration Strategy

For Existing Institutions Without Name Suffix

When collision is detected on existing institutions without name suffixes, temporal context determines strategy:

First Batch Collision (Retroactive Detection)

If both institutions were created simultaneously but the collision was not detected until later:

  1. Verify creation dates (check provenance.extraction_date)
  2. If same date: Treat as first batch → both get name suffixes
  3. Generate name suffixes for both institutions from native language names
  4. Update both GHCIDs with name suffixes
  5. Update both ghcid_numeric (hash changes!)
  6. Add history entries for both:
    # Institution 1
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-SM",
        ghcid_new="NL-NH-AMS-M-SM-stedelijk_museum_amsterdam",
        valid_from="2025-11-05T12:00:00Z",
        valid_to=None,
        reason="First batch collision (retroactive): Added name suffix to disambiguate from Science Museum Amsterdam (both created 2025-11-01)"
    )
    
    # Institution 2
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-SM",
        ghcid_new="NL-NH-AMS-M-SM-science_museum_amsterdam",
        valid_from="2025-11-05T12:00:00Z",
        valid_to=None,
        reason="First batch collision (retroactive): Added name suffix to disambiguate from Stedelijk Museum Amsterdam (both created 2025-11-01)"
    )
    

Historical Addition Collision

If a new institution collides with an already published GHCID:

  1. Verify publication date of existing record
  2. Existing GHCID: NO CHANGE (preserve published PID)
  3. New institution: Gets name suffix
  4. Update only new ghcid_numeric
  5. Add history entry for new institution only:
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-HM",  # What it would have been
        ghcid_new="NL-NH-AMS-M-HM-amsterdam_historical_museum",  # What it becomes
        valid_from="2025-11-15T14:30:00Z",
        valid_to=None,
        reason="Historical addition collision: Base GHCID NL-NH-AMS-M-HM already published for Hermitage Museum Amsterdam (2025-11-01). Added name suffix to preserve existing PID stability."
    )
    
  6. No history entry for existing institution (GHCID unchanged)
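For reference, the history entry used in these steps can be modelled as a small immutable record. The dataclass below is an illustrative stand-in with the five fields that appear in the examples; the project's real `GHCIDHistoryEntry` class and the helper name `history_for_historical_addition` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GHCIDHistoryEntry:
    """Illustrative stand-in; the project's real class may differ."""
    ghcid_old: str
    ghcid_new: str
    valid_from: str            # ISO 8601 timestamp
    valid_to: Optional[str]    # None while the entry is current
    reason: str


def history_for_historical_addition(base_ghcid: str, name_suffix: str,
                                    existing_name: str, added_at: str) -> GHCIDHistoryEntry:
    """Build the single history entry created for the NEW institution only."""
    return GHCIDHistoryEntry(
        ghcid_old=base_ghcid,
        ghcid_new=f"{base_ghcid}-{name_suffix}",
        valid_from=added_at,
        valid_to=None,
        reason=(
            f"Historical addition collision: Base GHCID {base_ghcid} already "
            f"published for {existing_name}. Added name suffix to preserve "
            f"existing PID stability."
        ),
    )


entry = history_for_historical_addition(
    "NL-NH-AMS-M-HM", "amsterdam_historical_museum",
    "Hermitage Museum Amsterdam", "2025-11-15T14:30:00Z",
)
print(entry.ghcid_new)
```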

Decision Flow

from datetime import timedelta

def determine_collision_strategy(existing_inst, new_inst):
    """Determine whether to treat as first batch or historical addition.

    Assumes extraction_date values are datetime objects.
    """
    # Published identifier → Historical addition
    if existing_inst.provenance.publication_date is not None:
        return "HISTORICAL_ADDITION"

    existing_date = existing_inst.provenance.extraction_date
    new_date = new_inst.provenance.extraction_date

    # Same creation date → First batch (retroactive)
    if existing_date == new_date:
        return "FIRST_BATCH_RETROACTIVE"

    # Different timestamps less than a day apart, no publication → First batch.
    # Compare timedeltas directly: timedelta.days is -1 for small negative
    # differences, so abs(delta.days) < 1 would misclassify them.
    if abs(existing_date - new_date) < timedelta(days=1):
        return "FIRST_BATCH"

    # Default: Historical addition (new is significantly later)
    return "HISTORICAL_ADDITION"

Backward Compatibility

Q: What happens to old numeric IDs when name suffix is added?

A: The numeric ID changes because it's a hash of the full GHCID string. This is intentional:

  • Old numeric ID: SHA256("NL-NH-AMS-M-SM")[:8]
  • New numeric ID: SHA256("NL-NH-AMS-M-SM-stedelijk_museum_amsterdam")[:8]

Migration plan depends on temporal context:

First Batch Collision (Both Updated)

  1. Keep ghcid_original unchanged for both (preserves old GHCID)
  2. Update ghcid_current with name suffix for both
  3. Generate new ghcid_numeric from updated GHCID for both
  4. Add comprehensive history entries explaining change
  5. Maintain mapping table: old_numeric_id → new_numeric_id
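One detail of the mapping table in step 5 is worth spelling out: in a first batch collision both institutions start from the same old GHCID, so they share one old numeric ID that maps to two new ones. A minimal sketch under the same assumptions as before (UTF-8 input, big-endian byte order, hypothetical function names):

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ghcid_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256 as an unsigned integer (byte order is an assumption)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_numeric_id_mapping(renames: Iterable[Tuple[str, str]]) -> Dict[int, List[int]]:
    """Map each old numeric ID to every new numeric ID derived from it."""
    mapping: Dict[int, List[int]] = defaultdict(list)
    for old_ghcid, new_ghcid in renames:
        mapping[ghcid_numeric(old_ghcid)].append(ghcid_numeric(new_ghcid))
    return dict(mapping)


renames = [
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"),
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-science_museum_amsterdam"),
]
mapping = build_numeric_id_mapping(renames)
print(len(mapping), [len(new_ids) for new_ids in mapping.values()])
```

The table is therefore one-to-many, and a consumer holding only the old numeric ID needs additional information (such as the institution name) to pick the right successor.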

Historical Addition (Only New Updated)

  1. Existing institution: NO CHANGES
    • ghcid_original: unchanged
    • ghcid_current: unchanged
    • ghcid_numeric: unchanged
    • No history entry added
  2. New institution: Gets name suffix from start
    • ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum (includes name suffix)
    • ghcid_current: NL-NH-AMS-M-HM-amsterdam_historical_museum
    • ghcid_numeric: Hash of full GHCID with name suffix
    • History entry documents collision with existing published PID

Testing Strategy

Unit Tests

def test_ghcid_without_collision():
    """Most institutions should NOT have name suffix"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="RM"
    )
    assert components.to_string() == "NL-NH-AMS-M-RM"
    assert components.name_suffix is None

def test_first_batch_collision():
    """First batch: BOTH institutions get name suffixes"""
    inst1 = create_institution(
        name="Stedelijk Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z"
    )
    inst2 = create_institution(
        name="Science Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z"
    )
    
    # Process batch
    resolve_batch_collisions([inst1, inst2])
    
    # Both should have name suffixes
    assert inst1.ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert inst2.ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"

def test_historical_addition_collision():
    """Historical addition: Only NEW institution gets name suffix"""
    # Existing published institution
    existing = create_institution(
        name="Hermitage Museum Amsterdam",
        extraction_date="2025-11-01T10:00:00Z",
        publication_date="2025-11-01T10:00:00Z"
    )
    existing.ghcid = "NL-NH-AMS-M-HM"
    
    # New historical institution
    new = create_institution(
        name="Amsterdam Historical Museum",
        extraction_date="2025-11-15T14:30:00Z"
    )
    
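    # Resolve collision against the published registry
    registry = {
        existing.ghcid: {
            'name': existing.name,
            'publication_date': "2025-11-01T10:00:00Z",
        }
    }
    result = resolve_collision(new, registry)

    # Existing GHCID unchanged
    assert result['existing_ghcid'] == "NL-NH-AMS-M-HM"
    assert existing.ghcid == "NL-NH-AMS-M-HM"

    # New institution gets name suffix
    assert result['new_ghcid'] == "NL-NH-AMS-M-HM-amsterdam_historical_museum"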

def test_name_suffix_normalization():
    """Name suffix should be normalized to snake_case"""
    # Test diacritics removal
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    
    # Test German umlauts
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    
    # Test spaces and hyphens
    assert generate_name_suffix("Van Gogh Museum") == "van_gogh_museum"
    
    # Test punctuation removal
    assert generate_name_suffix("St. Peter's Church Archive") == "st_peters_church_archive"

def test_collision_strategy_determination():
    """Test temporal context determines resolution strategy"""
    # Same extraction date, unpublished → First batch (retroactive)
    strategy = determine_collision_strategy(
        create_institution(
            name="Stedelijk Museum Amsterdam",
            extraction_date="2025-11-01T10:00:00Z"
        ),
        create_institution(
            name="Science Museum Amsterdam",
            extraction_date="2025-11-01T10:00:00Z"
        )
    )
    assert strategy == "FIRST_BATCH_RETROACTIVE"

    # Existing published → Historical addition
    strategy = determine_collision_strategy(
        create_institution(
            name="Hermitage Museum Amsterdam",
            extraction_date="2025-11-01T10:00:00Z",
            publication_date="2025-11-01T10:00:00Z"
        ),
        create_institution(
            name="Amsterdam Historical Museum",
            extraction_date="2025-11-15T14:30:00Z"
        )
    )
    assert strategy == "HISTORICAL_ADDITION"

def test_simultaneous_historical_additions():
    """Multiple historical additions: all new get name suffixes"""
    existing = Institution(
        name="Maritime Museum Amsterdam",
        ghcid="NL-NH-AMS-M-MM",
        publication_date="2025-11-01T10:00:00Z"
    )
    
    new1 = Institution(
        name="Dutch Navy Museum",
        extraction_date="2025-12-01T09:00:00Z"
    )
    
    new2 = Institution(
        name="Amsterdam Naval Archive",
        extraction_date="2025-12-01T09:00:00Z"
    )
    
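    # Each new institution is resolved against the published registry
    registry = {
        existing.ghcid: {
            'name': existing.name,
            'publication_date': "2025-11-01T10:00:00Z",
        }
    }
    result1 = resolve_collision(new1, registry)
    result2 = resolve_collision(new2, registry)

    # Existing unchanged
    assert existing.ghcid == "NL-NH-AMS-M-MM"

    # Both new get name suffixes
    assert result1['new_ghcid'] == "NL-NH-AMS-M-MM-dutch_navy_museum"
    assert result2['new_ghcid'] == "NL-NH-AMS-M-MM-amsterdam_naval_archive"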

Integration Tests

def test_real_dutch_museums_no_collisions():
    """Verify no actual collisions in Dutch ISIL registry"""
    parser = ISILRegistryParser()
    records = parser.parse_file("data/ISIL-codes_2025-08-01.csv")
    
    ghcids = {}
    collisions = []
    
    for record in records:
        ghcid_base = generate_base_ghcid(record)
        if ghcid_base in ghcids:
            collisions.append((ghcid_base, ghcids[ghcid_base], record))
        ghcids[ghcid_base] = record
    
    # Report any collisions found
    print(f"Found {len(collisions)} collisions")
    for base, record1, record2 in collisions:
        print(f"  {base}: {record1.instelling} vs {record2.instelling}")

Future Enhancements

  1. Collision probability calculator

    • Analyze existing institutions
    • Predict collision likelihood
    • Suggest alternative abbreviations
  2. Name suffix validation

    • Verify name suffix matches institution's official name
    • Flag inconsistencies for review
    • Support multilingual name variants
  3. Collision warning system

    • Alert when generating GHCID similar to existing
    • Suggest checking for duplicates
    • Proactive collision prevention
  4. Alternative disambiguation (fallback for name suffix collisions)

    • Use founding year: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam_1895
    • Use street address: NL-NH-AMS-M-SM-museum_paulus_potterstraat
    • Use parent org: NL-NH-AMS-M-SM-museum_gemeente_amsterdam

References

  • GHCID Spec: docs/plan/global_glam/06-global-identifier-system.md
  • Name Suffix Generation: generate_name_suffix() function in this document
  • Implementation: src/glam_extractor/identifiers/ghcid.py
  • Schema: schemas/heritage_custodian.yaml (ghcid_current, ghcid_original slots)

Last Updated: 2025-11-30
Authors: GLAM Data Extraction Project Team