GHCID Collision Resolution Strategy

Version: 1.0
Date: 2025-11-05
Status: Implemented

Problem Statement

When two heritage institutions share:

  • Same geographic location (city)
  • Same institution type (e.g., both museums)
  • Same name abbreviation

...they will generate identical GHCID identifiers, causing collisions.

Example Collision Scenario

Institution 1: Stedelijk Museum Amsterdam
Institution 2: Science Museum Amsterdam (hypothetical)

Both would generate:

NL-NH-AMS-M-SM

Solution: Wikidata Q-Number Suffix

When a collision is detected, append the institution's Wikidata Q-number to the GHCID.

Format

Base GHCID:

{Country}-{Region}-{City}-{Type}-{Abbreviation}

GHCID with Collision Resolver:

{Country}-{Region}-{City}-{Type}-{Abbreviation}-Q{WikidataID}

Examples

| Institution | Base GHCID | With Q-Number | Notes |
|---|---|---|---|
| Rijksmuseum Amsterdam | NL-NH-AMS-M-RM | N/A | No collision, no Q-number needed |
| Stedelijk Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-Q924335 | Collision detected |
| Science Museum Amsterdam | NL-NH-AMS-M-SM | NL-NH-AMS-M-SM-Q123456 | Collision detected |
| Van Gogh Museum | NL-NH-AMS-M-VGM | N/A | No collision |
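
As a minimal sketch of the format above, the GHCID string can be assembled from its five base components, with the Q-number suffix appended only when one is supplied. The function name `build_ghcid` and its parameters are illustrative assumptions, not part of the actual implementation.

```python
from typing import Optional


def build_ghcid(country: str, region: str, city: str, inst_type: str,
                abbreviation: str, wikidata_qid: Optional[str] = None) -> str:
    """Join the five base components; append -Q<digits> only when a QID is given."""
    base = "-".join([country, region, city, inst_type, abbreviation])
    if wikidata_qid is None:
        return base
    # Accept "Q924335", "q924335" or "924335" and emit a single Q prefix.
    digits = wikidata_qid.upper().lstrip("Q")
    return f"{base}-Q{digits}"


print(build_ghcid("NL", "NH", "AMS", "M", "RM"))             # NL-NH-AMS-M-RM
print(build_ghcid("NL", "NH", "AMS", "M", "SM", "Q924335"))  # NL-NH-AMS-M-SM-Q924335
```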

Temporal Dimension in Collision Resolution

The Critical Distinction: First Batch vs. Historical Addition

Collision resolution behavior differs based on when the collision is detected:

Scenario A: First Batch Collision (Contemporaneous Discovery)

When: Multiple institutions discovered simultaneously during initial GHCID generation (e.g., batch import from CSV).

Rule: ALL colliding institutions receive Q-number suffixes.

Why: No institution has temporal priority; all are being created at the same time. Fair treatment requires all to be disambiguated equally.

Example:

# Discovered on 2025-11-01 from Dutch ISIL registry batch import
Institution 1: Stedelijk Museum Amsterdam
  → ghcid: NL-NH-AMS-M-SM-Q924335

Institution 2: Science Museum Amsterdam  
  → ghcid: NL-NH-AMS-M-SM-Q123456

# Both get Q-numbers because both discovered simultaneously

Scenario B: Historical Addition (Post-Publication Collision)

When: A newly added historical institution collides with an already published GHCID.

Rule: ONLY the new institution receives a Q-number suffix. The existing GHCID remains unchanged.

Why: PID Stability Principle - Published persistent identifiers may already be cited in research papers, integrated into third-party datasets, or embedded in API responses. Changing existing PIDs breaks citations and external references.

Example:

# Published 2025-11-01
Institution 1: Hermitage Museum Amsterdam
  → ghcid: NL-NH-AMS-M-HM  # Unchanged forever

# Historical institution added 2025-11-15
Institution 2: Amsterdam Historical Museum (historical records 1926-2001)
  → ghcid: NL-NH-AMS-M-HM-Q17339437  # New institution gets Q-number

# Existing GHCID preserved; only new addition disambiguated

Decision Matrix

| Discovery Context | Existing Institution | New Institution | Resolution Strategy |
|---|---|---|---|
| First batch (both new) | None (being created) | None (being created) | Both get Q-numbers |
| Historical addition | Already published | Being added now | Only new gets Q-number |
| Simultaneous historical additions | Already published | Multiple being added | All new get Q-numbers; existing unchanged |

Timeline Example: Demonstrating the Temporal Principle

2025-11-01 (First Batch Import)
├─ Stedelijk Museum Amsterdam added
│  └─ ghcid: NL-NH-AMS-M-SM
├─ Science Museum Amsterdam discovered (collision!)
│  └─ BOTH institutions updated:
│     ├─ Stedelijk: NL-NH-AMS-M-SM-Q924335
│     └─ Science:   NL-NH-AMS-M-SM-Q123456

2025-11-15 (Historical Research Addition)
├─ Hermitage Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-HM (published, immutable)
├─ Amsterdam Historical Museum added (historical institution, 1926-2001)
│  └─ Collision detected!
│     ├─ Hermitage GHCID: UNCHANGED (NL-NH-AMS-M-HM)
│     └─ Historical Museum: NL-NH-AMS-M-HM-Q17339437 (gets Q-number)

2025-12-01 (Another Historical Addition)
├─ Maritime Museum Amsterdam already exists
│  └─ ghcid: NL-NH-AMS-M-MM (published, immutable)
├─ Two historical naval museums discovered in archive:
│  ├─ Dutch Navy Museum (1906-1955) → collides!
│  └─ Amsterdam Naval Archive (1820-1901) → collides!
└─ BOTH new institutions get Q-numbers; existing unchanged:
   ├─ Dutch Navy: NL-NH-AMS-M-MM-Q23456789
   ├─ Naval Archive: NL-NH-AMS-M-MM-Q98765432
   └─ Existing Maritime Museum: UNCHANGED (NL-NH-AMS-M-MM)

Implementation Guidance

Collision Detection with Temporal Context

def resolve_collision(new_institution, existing_ghcids_registry):
    """
    Resolve GHCID collision based on temporal context.
    
    Args:
        new_institution: Institution being added
        existing_ghcids_registry: Dict mapping GHCID → {publication_date, institution_data}
    
    Returns:
        Resolution strategy and updated GHCIDs
    """
    new_ghcid_base = generate_base_ghcid(new_institution)
    
    # Check if base GHCID exists in published registry
    if new_ghcid_base in existing_ghcids_registry:
        existing_entry = existing_ghcids_registry[new_ghcid_base]
        
        # Historical addition case
        if existing_entry['publication_date'] is not None:
            print(f"Collision with published GHCID {new_ghcid_base}")
            print(f"  → Existing: {existing_entry['name']} (published {existing_entry['publication_date']})")
            print(f"  → New: {new_institution.name} (being added now)")
            print(f"  → Strategy: Only new institution gets Q-number")
            
            # Existing GHCID remains unchanged
            existing_ghcid = new_ghcid_base
            
            # New institution gets Q-number
            new_ghcid = f"{new_ghcid_base}-Q{new_institution.wikidata_qid}"
            
            return {
                'strategy': 'HISTORICAL_ADDITION',
                'existing_ghcid': existing_ghcid,  # Unchanged
                'new_ghcid': new_ghcid,            # With Q-number
                'reason': f"Historical addition collision: preserve existing published PID"
            }
    
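        # Unpublished match: first batch collision case (both being created).
        # This should be handled during batch processing
        return {
            'strategy': 'FIRST_BATCH',
            'reason': 'All colliding institutions in batch receive Q-numbers'
        }

    # Base GHCID not in the registry: no collision, so no suffix is needed
    return {
        'strategy': 'NO_COLLISION',
        'new_ghcid': new_ghcid_base,
        'reason': 'Base GHCID is unique; no Q-number required'
    }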


def resolve_batch_collisions(new_institutions_batch, existing_ghcids=None):
    """
    Resolve collisions within a batch of new institutions.
    
    When multiple institutions in the same batch collide, ALL get Q-numbers.
    New institutions that collide with an already published GHCID (passed in
    existing_ghcids) also get Q-numbers; the published GHCID is never changed.
    """
    existing_ghcids = existing_ghcids or {}
    ghcid_map = {}  # base_ghcid → list of institutions
    
    # Group by base GHCID
    for inst in new_institutions_batch:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_map.setdefault(base_ghcid, []).append(inst)
    
    # Resolve collisions
    for base_ghcid, institutions in ghcid_map.items():
        collides_in_batch = len(institutions) > 1
        collides_with_existing = base_ghcid in existing_ghcids
        
        if not (collides_in_batch or collides_with_existing):
            # No collision: keep the base GHCID without a suffix
            institutions[0].ghcid = base_ghcid
            continue
        
        print(f"Collision detected: {base_ghcid}")
        print(f"  → {len(institutions)} new institutions share this base GHCID")
        print(f"  → Strategy: All {len(institutions)} new institutions get Q-numbers")
        
        # All new institutions in the collision get Q-numbers
        for inst in institutions:
            inst.ghcid = f"{base_ghcid}-Q{inst.wikidata_qid}"
            inst.provenance.notes += (
                f" | Collision: {len(institutions)} new institutions "
                f"share base GHCID {base_ghcid}"
            )

Edge Cases

1. Multiple Historical Additions Simultaneously

Scenario: Two historical institutions discovered at the same time, both colliding with existing GHCID.

Resolution: Both new institutions get Q-numbers; existing unchanged.

# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-MM

# Both added 2025-12-01
new_inst_1.ghcid = "NL-NH-AMS-M-MM-Q23456789"
new_inst_2.ghcid = "NL-NH-AMS-M-MM-Q98765432"

2. Historical Institution Without Wikidata ID

Scenario: New historical institution collides but has no Wikidata entry.

Fallback Strategy:

  1. VIAF ID: NL-NH-AMS-M-HM-V12345678
  2. ISIL code: NL-NH-AMS-M-HM-I-AsdHM
  3. Sequential: NL-NH-AMS-M-HM-002 (increment from existing)

def get_collision_suffix(institution, base_ghcid):
    """Get collision resolution suffix in priority order."""
    if institution.wikidata_qid:
        return f"Q{institution.wikidata_qid}"
    elif institution.viaf_id:
        return f"V{institution.viaf_id}"
    elif institution.isil_code:
        # Extract suffix from ISIL (e.g., NL-AsdHM → I-AsdHM)
        isil_suffix = institution.isil_code.split('-', 1)[1]
        return f"I-{isil_suffix}"
    else:
        # Last resort: sequential numbering, scoped to the colliding base GHCID
        return f"{get_next_sequential_number(base_ghcid):03d}"

3. Retroactive Discovery of First Batch Collision

Scenario: Two institutions were created in first batch, but collision wasn't detected until later.

Resolution: Treat as first batch collision (both get Q-numbers), even though discovered late.

Justification: Intent was to create both simultaneously; detection timing doesn't change temporal relationship.

# If both have the same extraction_date in provenance metadata
if inst1.provenance.extraction_date == inst2.provenance.extraction_date:
    strategy = 'FIRST_BATCH'  # Both get Q-numbers
else:
    strategy = 'HISTORICAL_ADDITION'  # Only later one gets Q-number

Provenance Tracking for Collision Events

Record collision resolution in provenance metadata:

# First batch collision
provenance:
  extraction_date: "2025-11-01T10:00:00Z"
  collision_resolution:
    strategy: FIRST_BATCH
    collision_group: [Q924335, Q123456]
    resolved_date: "2025-11-01T10:00:00Z"
    reason: "Contemporaneous discovery: both institutions added in ISIL registry batch import"

# Historical addition collision
provenance:
  extraction_date: "2025-11-15T14:30:00Z"
  collision_resolution:
    strategy: HISTORICAL_ADDITION
    collides_with: NL-NH-AMS-M-HM
    existing_publication_date: "2025-11-01T10:00:00Z"
    resolved_date: "2025-11-15T14:30:00Z"
    reason: "Historical addition: existing GHCID preserved per PID stability principle"

Implementation Rules

1. Collision Detection

A collision occurs when a newly generated base GHCID already exists in the database. Detection must also take the existing record's publication date into account:

# Pseudocode with temporal awareness
existing_ghcids = database.get_all_ghcids()  # Include publication dates
new_ghcid = generate_ghcid(institution)

if new_ghcid in existing_ghcids:
    existing_record = existing_ghcids[new_ghcid]
    
    # Check temporal context
    if existing_record['publication_date'] is not None:
        # Historical addition: only new institution gets Q-number
        new_ghcid_resolved = f"{new_ghcid}-Q{institution.wikidata_qid}"
        existing_ghcid_resolved = new_ghcid  # Unchanged
    else:
        # First batch: both being created simultaneously, so BOTH
        # GHCIDs are regenerated with their Wikidata Q-number suffixes
        new_ghcid_resolved = f"{new_ghcid}-Q{institution.wikidata_qid}"
        existing_ghcid_resolved = f"{new_ghcid}-Q{existing_record['wikidata_qid']}"

2. When to Add Q-Number

DO add Q-number suffix:

  • First batch: When collision detected during simultaneous creation → ALL colliding institutions
  • Historical addition: When new institution collides with published GHCID → ONLY new institution
  • When generating GHCID for institution with known collision
  • When institution is added to collision registry

DO NOT add Q-number suffix:

  • When no collision exists (most institutions)
  • "Just in case" for future collisions
  • For institutions without Wikidata entries
  • For existing published GHCIDs when a historical collision occurs (preserve PID stability)

3. Q-Number Normalization

Input formats accepted:

  • Q924335 (preferred)
  • 924335 (digits only)
  • q924335 (lowercase, will be normalized)

Normalized output:

  • Q-prefix is stripped during processing
  • Stored as: "924335"
  • Displayed as: "Q924335"

Code example:

# In GHCIDComponents.__post_init__()
if self.wikidata_qid:
    # Strip Q prefix, keep only digits
    self.wikidata_qid = self.wikidata_qid.upper().replace("Q", "")

# In GHCIDComponents.to_string()
if self.wikidata_qid:
    return f"{base}-Q{self.wikidata_qid}"  # Re-add Q prefix for display
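
The snippet above relies on `replace("Q", "")`, which does not reject malformed input. As a hedged alternative, the sketch below normalizes the three accepted input formats and raises on anything else. The helper names `normalize_qid` and `display_qid` are assumptions for illustration and do not exist in the referenced implementation.

```python
import re

# Optional Q/q prefix followed by digits only, mirroring the accepted formats.
_QID_PATTERN = re.compile(r"[Qq]?([0-9]+)")


def normalize_qid(raw: str) -> str:
    """Return the digits-only stored form, e.g. 'q924335' -> '924335'."""
    match = _QID_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"Not a valid Wikidata Q-number: {raw!r}")
    return match.group(1)


def display_qid(stored: str) -> str:
    """Re-add the Q prefix for display."""
    return f"Q{stored}"


for raw in ("Q924335", "924335", "q924335"):
    print(display_qid(normalize_qid(raw)))  # Q924335 each time
```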

4. Persistent Numeric ID

The SHA256 hash is computed from the full GHCID string including Q-number:

# Without Q-number
ghcid = "NL-NH-AMS-M-RM"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 12345678901234567890

# With Q-number (different hash!)
ghcid = "NL-NH-AMS-M-SM-Q924335"
numeric_id = SHA256(ghcid)[:8] → int  # e.g., 9876543210987654321

Important: Even if two institutions have identical base GHCIDs, their numeric IDs will be different because the Q-number is included in the hash.
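
A minimal runnable sketch of this derivation follows. It assumes that `[:8]` means the first 8 bytes of the digest, that the GHCID is encoded as UTF-8, and that the bytes are read as an unsigned big-endian integer; the source states none of these details, so treat them as placeholders for whatever `ghcid.py` actually does.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256(ghcid), read as an unsigned big-endian integer.

    Encoding and byte order are assumptions, not confirmed by the spec.
    """
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


base_id = ghcid_numeric("NL-NH-AMS-M-SM")
suffixed_id = ghcid_numeric("NL-NH-AMS-M-SM-Q924335")
print(base_id, suffixed_id)
```

Under these assumptions every numeric ID fits in an unsigned 64-bit integer, and the same GHCID string always yields the same ID.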

Collision Registry

Maintain a collision registry to track known conflicts with temporal metadata:

{
  "collisions": [
    {
      "base_ghcid": "NL-NH-AMS-M-SM",
      "collision_type": "FIRST_BATCH",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-SM-Q924335",
          "name": "Stedelijk Museum Amsterdam",
          "wikidata_qid": "Q924335",
          "publication_date": "2025-11-01T10:00:00Z"
        },
        {
          "ghcid": "NL-NH-AMS-M-SM-Q123456",
          "name": "Science Museum Amsterdam",
          "wikidata_qid": "Q123456",
          "publication_date": "2025-11-01T10:00:00Z"
        }
      ],
      "detected_date": "2025-11-01T10:00:00Z",
      "resolution_strategy": "Both institutions receive Q-numbers (contemporaneous discovery)"
    },
    {
      "base_ghcid": "NL-NH-AMS-M-HM",
      "collision_type": "HISTORICAL_ADDITION",
      "institutions": [
        {
          "ghcid": "NL-NH-AMS-M-HM",
          "name": "Hermitage Museum Amsterdam",
          "wikidata_qid": "Q1542668",
          "publication_date": "2025-11-01T10:00:00Z",
          "note": "Original GHCID preserved (published identifier)"
        },
        {
          "ghcid": "NL-NH-AMS-M-HM-Q17339437",
          "name": "Amsterdam Historical Museum",
          "wikidata_qid": "Q17339437",
          "publication_date": "2025-11-15T14:30:00Z",
          "note": "Historical addition: only new institution receives Q-number"
        }
      ],
      "detected_date": "2025-11-15T14:30:00Z",
      "resolution_strategy": "Preserve existing published GHCID; only new institution gets Q-number"
    }
  ]
}

Registry Usage

  1. Before generating new GHCID:

    • Check if base GHCID exists in collision registry
    • Check publication date of existing record
    • If published: new institution gets Q-number (historical addition)
    • If unpublished: both get Q-numbers (first batch)
  2. When collision detected:

    • Record temporal context (first batch vs. historical)
    • Apply appropriate resolution strategy
    • Add entries to collision registry with publication dates
    • Update ghcid_history with change reason including temporal context
  3. Periodic audit:

    • Scan all GHCIDs for duplicates
    • Check publication dates to determine resolution strategy
    • Resolve retroactively if found
    • Update collision registry with temporal metadata
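
The periodic audit step can be sketched as a simple grouping pass. The function name `find_duplicate_ghcids` and the `(ghcid, name)` input shape are assumptions made for illustration; a real audit would read from the database and also consult publication dates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ghcids(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group institution names by GHCID and keep only GHCIDs used more than once.

    `records` is an iterable of (ghcid, institution_name) pairs.
    """
    by_ghcid: Dict[str, List[str]] = defaultdict(list)
    for ghcid, name in records:
        by_ghcid[ghcid].append(name)
    return {g: names for g, names in by_ghcid.items() if len(names) > 1}


records = [
    ("NL-NH-AMS-M-RM", "Rijksmuseum Amsterdam"),
    ("NL-NH-AMS-M-SM", "Stedelijk Museum Amsterdam"),
    ("NL-NH-AMS-M-SM", "Science Museum Amsterdam"),
]
print(find_duplicate_ghcids(records))
# {'NL-NH-AMS-M-SM': ['Stedelijk Museum Amsterdam', 'Science Museum Amsterdam']}
```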

What If Institution Lacks Wikidata Entry?

Priority Order for Collision Resolution

  1. Wikidata Q-number (preferred)
  2. VIAF ID (backup)
  3. ISIL code suffix (Dutch institutions)
  4. Sequential number (last resort)

Examples

# Preferred: Wikidata
NL-NH-AMS-M-SM-Q924335

# Backup: VIAF
NL-NH-AMS-M-SM-V12345678

# Backup: ISIL
NL-NH-AMS-M-SM-I-AsdSM

# Last resort: Sequential
NL-NH-AMS-M-SM-001
NL-NH-AMS-M-SM-002

Implementation note: Start with Wikidata-only. Expand to other identifiers if needed based on real-world collision frequency.

Validation Rules

GHCID Pattern Regex

Without Q-number:

^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}$

With Q-number (updated):

^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$

Breakdown:

  • [A-Z]{2} - Country code (2 letters)
  • [A-Z0-9]{1,3} - Region code (1-3 chars)
  • [A-Z]{3} - City LOCODE (3 letters)
  • [A-Z] - Type code (1 letter)
  • [A-Z0-9]{1,10} - Abbreviation (1-10 chars)
  • (-Q[0-9]+)? - Optional Wikidata Q-number suffix

Validation Tests

# Valid GHCIDs
assert validate("NL-NH-AMS-M-RM")           # Without Q-number
assert validate("NL-NH-AMS-M-SM-Q924335")   # With Q-number
assert validate("US-NY-NYC-M-MOMA")         # International
assert validate("NL-NH-AMS-M-SM-Q1")        # Short Q-number

# Invalid GHCIDs
assert not validate("NL-NH-AMS-M-SM-924335")  # Missing Q prefix
assert not validate("NL-NH-AMS-M-SM-QAB123")  # Q-number must be numeric
assert not validate("NL-NH-AMS-M-SM-Q")       # Q-number empty
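
The assertions above assume a `validate` helper. One possible sketch, built directly on the updated pattern from this section, is shown below; the name `GHCID_PATTERN` is an assumption.

```python
import re

# Updated pattern from this section: base GHCID plus optional -Q<digits> suffix.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
)


def validate(ghcid: str) -> bool:
    """Return True when the whole string matches the GHCID pattern."""
    return GHCID_PATTERN.fullmatch(ghcid) is not None


print(validate("NL-NH-AMS-M-SM-Q924335"))  # True
print(validate("NL-NH-AMS-M-SM-924335"))   # False
```

Note that this pattern covers only the Wikidata suffix. The VIAF, ISIL, and sequential fallbacks described earlier would not validate until the pattern is extended.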

Migration Strategy

For Existing Institutions Without Q-Number

When collision is detected on existing institutions without Q-numbers, temporal context determines strategy:

First Batch Collision (Retroactive Detection)

If both institutions were created simultaneously but the collision was not detected until later:

  1. Verify creation dates (check provenance.extraction_date)
  2. If same date: Treat as first batch → both get Q-numbers
  3. Fetch Wikidata Q-numbers for both institutions
  4. Update both GHCIDs with Q-number suffix
  5. Update both ghcid_numeric (hash changes!)
  6. Add history entries for both:
    # Institution 1
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-SM",
        ghcid_new="NL-NH-AMS-M-SM-Q924335",
        valid_from="2025-11-05T12:00:00Z",
        valid_to=None,
        reason="First batch collision (retroactive): Added Wikidata Q-number to disambiguate from Science Museum Amsterdam (both created 2025-11-01)"
    )
    
    # Institution 2
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-SM",
        ghcid_new="NL-NH-AMS-M-SM-Q123456",
        valid_from="2025-11-05T12:00:00Z",
        valid_to=None,
        reason="First batch collision (retroactive): Added Wikidata Q-number to disambiguate from Stedelijk Museum Amsterdam (both created 2025-11-01)"
    )
    

Historical Addition Collision

If a new institution collides with an existing, already published GHCID:

  1. Verify publication date of existing record
  2. Existing GHCID: NO CHANGE (preserve published PID)
  3. New institution: Gets Q-number suffix
  4. Update only new ghcid_numeric
  5. Add history entry for new institution only:
    GHCIDHistoryEntry(
        ghcid_old="NL-NH-AMS-M-HM",  # What it would have been
        ghcid_new="NL-NH-AMS-M-HM-Q17339437",  # What it becomes
        valid_from="2025-11-15T14:30:00Z",
        valid_to=None,
        reason="Historical addition collision: Base GHCID NL-NH-AMS-M-HM already published for Hermitage Museum Amsterdam (2025-11-01). Added Q-number to preserve existing PID stability."
    )
    
  6. No history entry for existing institution (GHCID unchanged)

Decision Flow

def determine_collision_strategy(existing_inst, new_inst):
    """Determine whether to treat as first batch or historical addition."""
    
    # Check if existing record has publication date
    if existing_inst.provenance.publication_date is not None:
        # Published identifier → Historical addition
        return "HISTORICAL_ADDITION"
    
    # Check if both created on same date
    if (existing_inst.provenance.extraction_date == 
        new_inst.provenance.extraction_date):
        # Same creation date → First batch (retroactive)
        return "FIRST_BATCH_RETROACTIVE"
    
    # Timestamps differ by less than one day, no publication → First batch
    # (simultaneous processing); requires extraction_date to be a datetime
    if abs((existing_inst.provenance.extraction_date - 
            new_inst.provenance.extraction_date).days) < 1:
        return "FIRST_BATCH"
    
    # Default: Historical addition (new is significantly later)
    return "HISTORICAL_ADDITION"
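
The date arithmetic in the decision flow only works on `datetime` objects, while the examples in this document store timestamps as ISO 8601 strings such as "2025-11-01T10:00:00Z". A small sketch of the conversion and the one-day comparison follows; `parse_timestamp` and `within_one_day` are hypothetical helper names.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2025-11-01T10:00:00Z'."""
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 onward,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def within_one_day(first: str, second: str) -> bool:
    """True when the two timestamps are less than 24 hours apart."""
    delta = abs(parse_timestamp(first) - parse_timestamp(second))
    return delta.days < 1


print(within_one_day("2025-11-01T10:00:00Z", "2025-11-01T18:00:00Z"))  # True
print(within_one_day("2025-11-01T10:00:00Z", "2025-11-15T14:30:00Z"))  # False
```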

Backward Compatibility

Q: What happens to old numeric IDs when Q-number is added?

A: The numeric ID changes because it's a hash of the full GHCID string. This is intentional:

  • Old numeric ID: SHA256("NL-NH-AMS-M-SM")[:8]
  • New numeric ID: SHA256("NL-NH-AMS-M-SM-Q924335")[:8]

Migration plan depends on temporal context:

First Batch Collision (Both Updated)

  1. Keep ghcid_original unchanged for both (preserves old GHCID)
  2. Update ghcid_current with Q-number for both
  3. Generate new ghcid_numeric from updated GHCID for both
  4. Add comprehensive history entries explaining change
  5. Maintain mapping table: old_numeric_id → new_numeric_id
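
One detail of the mapping table in step 5 is worth illustrating: in a first batch collision both institutions start from the same old GHCID, so a single old numeric ID maps to several new ones. The sketch below assumes the numeric ID is the first 8 bytes of SHA-256 read big-endian, and `build_numeric_id_mapping` is a hypothetical helper, not the project's actual table.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ghcid_numeric(ghcid: str) -> int:
    """Assumed numeric ID: first 8 bytes of SHA-256, big-endian."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


def build_numeric_id_mapping(
    renames: Iterable[Tuple[str, str]],
) -> Dict[int, List[int]]:
    """Map each old numeric ID to the new numeric IDs that replace it."""
    mapping: Dict[int, List[int]] = defaultdict(list)
    for old_ghcid, new_ghcid in renames:
        mapping[ghcid_numeric(old_ghcid)].append(ghcid_numeric(new_ghcid))
    return dict(mapping)


renames = [
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-Q924335"),
    ("NL-NH-AMS-M-SM", "NL-NH-AMS-M-SM-Q123456"),
]
mapping = build_numeric_id_mapping(renames)
print(len(mapping))  # 1: both institutions shared the same old numeric ID
```

A consumer resolving an old numeric ID therefore needs extra context, such as the institution name, to pick the right successor.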

Historical Addition (Only New Updated)

  1. Existing institution: NO CHANGES
    • ghcid_original: unchanged
    • ghcid_current: unchanged
    • ghcid_numeric: unchanged
    • No history entry added
  2. New institution: Gets Q-number from start
    • ghcid_original: NL-NH-AMS-M-HM-Q17339437 (includes Q-number)
    • ghcid_current: NL-NH-AMS-M-HM-Q17339437
    • ghcid_numeric: Hash of full GHCID with Q-number
    • History entry documents collision with existing published PID

Testing Strategy

Unit Tests

def test_ghcid_without_collision():
    """Most institutions should NOT have Q-number"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="RM"
    )
    assert components.to_string() == "NL-NH-AMS-M-RM"
    assert components.wikidata_qid is None

def test_first_batch_collision():
    """First batch: BOTH institutions get Q-numbers"""
    inst1 = create_institution(
        name="Stedelijk Museum Amsterdam",
        wikidata_qid="Q924335",
        extraction_date="2025-11-01T10:00:00Z"
    )
    inst2 = create_institution(
        name="Science Museum Amsterdam",
        wikidata_qid="Q123456",
        extraction_date="2025-11-01T10:00:00Z"
    )
    
    # Process batch
    resolve_batch_collisions([inst1, inst2])
    
    # Both should have Q-numbers
    assert inst1.ghcid == "NL-NH-AMS-M-SM-Q924335"
    assert inst2.ghcid == "NL-NH-AMS-M-SM-Q123456"

def test_historical_addition_collision():
    """Historical addition: Only NEW institution gets Q-number"""
    # Existing published institution
    existing = create_institution(
        name="Hermitage Museum Amsterdam",
        wikidata_qid="Q1542668",
        extraction_date="2025-11-01T10:00:00Z",
        publication_date="2025-11-01T10:00:00Z"
    )
    existing.ghcid = "NL-NH-AMS-M-HM"
    
    # New historical institution
    new = create_institution(
        name="Amsterdam Historical Museum",
        wikidata_qid="Q17339437",
        extraction_date="2025-11-15T14:30:00Z"
    )
    
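    # Resolve collision (registry maps GHCID → {name, publication_date})
    registry = {
        existing.ghcid: {
            'name': existing.name,
            'publication_date': existing.publication_date,
        }
    }
    result = resolve_collision(new, registry)
    new.ghcid = result['new_ghcid']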
    
    # Existing GHCID unchanged
    assert existing.ghcid == "NL-NH-AMS-M-HM"
    
    # New institution gets Q-number
    assert new.ghcid == "NL-NH-AMS-M-HM-Q17339437"

def test_qid_normalization():
    """Q-prefix should be stripped and re-added"""
    components = GHCIDComponents(
        country_code="NL",
        region_code="NH",
        city_locode="AMS",
        institution_type="M",
        abbreviation="SM",
        wikidata_qid="Q924335"
    )
    assert components.wikidata_qid == "924335"  # Stored without Q
    assert "Q924335" in components.to_string()  # Displayed with Q

def test_collision_strategy_determination():
    """Test temporal context determines resolution strategy"""
    # Same date, existing unpublished → First batch (retroactive)
    strategy = determine_collision_strategy(
        existing_inst=Institution(extraction_date="2025-11-01T10:00:00Z"),
        new_inst=Institution(extraction_date="2025-11-01T10:00:00Z")
    )
    assert strategy == "FIRST_BATCH_RETROACTIVE"
    
    # Existing published → Historical addition
    strategy = determine_collision_strategy(
        existing_inst=Institution(
            extraction_date="2025-11-01T10:00:00Z",
            publication_date="2025-11-01T10:00:00Z"
        ),
        new_inst=Institution(extraction_date="2025-11-15T14:30:00Z")
    )
    assert strategy == "HISTORICAL_ADDITION"

def test_simultaneous_historical_additions():
    """Multiple historical additions: all new get Q-numbers"""
    existing = Institution(
        name="Maritime Museum Amsterdam",
        ghcid="NL-NH-AMS-M-MM",
        publication_date="2025-11-01T10:00:00Z"
    )
    
    new1 = Institution(
        name="Dutch Navy Museum",
        wikidata_qid="Q23456789",
        extraction_date="2025-12-01T09:00:00Z"
    )
    
    new2 = Institution(
        name="Amsterdam Naval Archive",
        wikidata_qid="Q98765432",
        extraction_date="2025-12-01T09:00:00Z"
    )
    
    resolve_batch_collisions([new1, new2], existing_ghcids={"NL-NH-AMS-M-MM": existing})
    
    # Existing unchanged
    assert existing.ghcid == "NL-NH-AMS-M-MM"
    
    # Both new get Q-numbers
    assert new1.ghcid == "NL-NH-AMS-M-MM-Q23456789"
    assert new2.ghcid == "NL-NH-AMS-M-MM-Q98765432"

Integration Tests

def test_real_dutch_museums_no_collisions():
    """Verify no actual collisions in Dutch ISIL registry"""
    parser = ISILRegistryParser()
    records = parser.parse_file("data/ISIL-codes_2025-08-01.csv")
    
    ghcids = {}
    collisions = []
    
    for record in records:
        ghcid_base = generate_base_ghcid(record)
        if ghcid_base in ghcids:
            collisions.append((ghcid_base, ghcids[ghcid_base], record))
        ghcids[ghcid_base] = record
    
    # Report any collisions found
    print(f"Found {len(collisions)} collisions")
    for base, record1, record2 in collisions:
        print(f"  {base}: {record1.instelling} vs {record2.instelling}")

Future Enhancements

  1. Collision probability calculator

    • Analyze existing institutions
    • Predict collision likelihood
    • Suggest alternative abbreviations
  2. Auto-fetch Wikidata Q-numbers

    • Use Wikidata API/SPARQL
    • Match by name + location
    • Confidence scoring
  3. Collision warning system

    • Alert when generating GHCID similar to existing
    • Suggest checking for duplicates
    • Proactive collision prevention
  4. Alternative disambiguation

    • Use founding year: NL-NH-AMS-M-SM-Y1895
    • Use street address: NL-NH-AMS-M-SM-PAULUS
    • Use parent org: NL-NH-AMS-M-SM-AMS (City of Amsterdam)

References

  • GHCID Spec: docs/plan/global_glam/06-global-identifier-system.md
  • Wikidata: https://www.wikidata.org
  • Implementation: src/glam_extractor/identifiers/ghcid.py
  • Schema: schemas/heritage_custodian.yaml (ghcid_current, ghcid_original slots)

Last Updated: 2025-11-05
Authors: GLAM Data Extraction Project Team