glam/data/collision_edge_case_analysis.md
2025-11-19 23:25:22 +01:00

Collision Edge Case Analysis - Manual Review

Date: 2025-11-07
Purpose: Investigate 3 ambiguous collision cases requiring human decision
Analyst: AI-assisted manual review


Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 279):

  • Institution: Het Nieuwe Instituut
  • ISIL Code: NL-RtHNI
  • Assigned: 2013-06-24
  • Remark: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)

Dutch Organizations CSV:

  • Institution: Nieuwe Instituut
  • ISIL Code: NL-RtHNI (SAME as ISIL registry)
  • City: Rotterdam

Analysis

This is a NAME CHANGE, not two different institutions:

  1. Same ISIL code (NL-RtHNI) in both datasets
  2. ISIL remark explicitly mentions name change ("naamswijzinging")
  3. Timeline:
    • 2013: ISIL assigned to "Het Nieuwe Instituut"
    • 2024-08-06: Name changed (remark added to ISIL registry)
    • Dutch orgs CSV shows new name "Nieuwe Instituut"

Decision: DUPLICATE - MERGE

Action:

  • These are the SAME institution at different points in time
  • Use current name: Nieuwe Instituut
  • Store historical name in alternative_names: Het Nieuwe Instituut
  • Add ChangeEvent:
    • change_type: NAME_CHANGE
    • event_date: "2024-08-06"
    • event_description: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
    • source_documentation: ISIL registry remark
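
The record described in this action could look roughly like the sketch below. The `ChangeEvent` dataclass and `ChangeTypeEnum` shown here are simplified stand-ins written for illustration, not the project's actual schema classes, and `build_name_change_event` is a hypothetical helper. Only the four field names and their values come from the list above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class ChangeTypeEnum(Enum):
    # Stand-in enum; the real schema may define more change types.
    NAME_CHANGE = "NAME_CHANGE"


@dataclass(frozen=True)
class ChangeEvent:
    # Simplified stand-in for the project's ChangeEvent class.
    change_type: ChangeTypeEnum
    event_date: date
    event_description: str
    source_documentation: str


def build_name_change_event(
    old_name: str, new_name: str, changed_on: date, source: str
) -> ChangeEvent:
    """Hypothetical helper: build a NAME_CHANGE event for a renamed institution."""
    return ChangeEvent(
        change_type=ChangeTypeEnum.NAME_CHANGE,
        event_date=changed_on,
        event_description=f"Name changed from '{old_name}' to '{new_name}'",
        source_documentation=source,
    )


rotterdam_event = build_name_change_event(
    "Het Nieuwe Instituut", "Nieuwe Instituut", date(2024, 8, 6), "ISIL registry remark"
)
print(rotterdam_event.event_description)
```

Keeping the event immutable (`frozen=True`) is a design choice of this sketch: a change history is easier to audit when recorded events cannot be edited in place.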

GHCID Impact:

  • Current GHCID: NL-XX-ROTT-M-NI
  • No Q-number needed (this is a duplicate, not a collision)
  • Historical GHCID: Same (name change doesn't affect GHCID since abbreviation is same)

Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 339):

  • Institution: Historische Kring Wederen (with trailing space)
  • ISIL Code: NL-WdnHKW
  • City: Wierden
  • Assigned: 2015-02-27

Dutch Organizations CSV:

  • Entry 1: Historische Kring Wederen (Wierden) - ISIL: NL-WdnHKW, Platform: Mijn Stad Mijn Dorp
  • Entry 2: Historische Kring Wierden (Wierden) - No ISIL, Platform: ZCBS

Analysis

This appears to be a TYPO/DATA ERROR:

  1. Geographic context: City is "Wierden" (not "Wederen")
  2. ISIL registry has "Wederen" but city is Wierden → likely typo in institution name
  3. Dutch orgs CSV has BOTH variants:
    • "Wederen" with ISIL code (matches ISIL registry)
    • "Wierden" without ISIL code (correct spelling based on city)
  4. Different platforms (Mijn Stad Mijn Dorp vs ZCBS) suggest they might be tracked separately

Possible Scenarios

Scenario A: Same institution, typo in ISIL registry

  • ISIL registry has typo: "Wederen" → should be "Wierden"
  • Dutch orgs CSV correctly lists "Historische Kring Wierden"
  • Entry with ISIL is outdated/typo

Scenario B: Two different organizations

  • One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
  • One is "Historische Kring Wierden" (covering Wierden municipality)
  • Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this

Most Likely: Scenario A (typo)

Decision: PROBABLE DUPLICATE - MERGE WITH CAVEAT

Recommended Action:

  • Treat as same institution (typo correction)
  • Use correct name: Historische Kring Wierden
  • Store variant in alternative_names: Historische Kring Wederen (marked as "typo in ISIL registry")
  • Flag for verification: Recommend contacting institution to confirm

GHCID Impact:

  • Current base GHCID: NL-XX-WIER-M-HKW
  • No Q-number needed if merged
  • Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"

Alternative Action (if Scenario B is confirmed):

  • Keep as separate institutions
  • Add Q-numbers (each institution gets its own Wikidata ID as suffix):
    • NL-XX-WIER-M-HKW-Q<wikidata> for "Wederen"
    • NL-XX-WIER-M-HKW-Q<wikidata> for "Wierden"

Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 226):

  • Institution: de Historische Kring Losser (with trailing space and lowercase "de")
  • ISIL Code: NL-LsHKL
  • City: Losser
  • Assigned: 2015-12-03

Dutch Organizations CSV:

  • Entry 1: de Historische Kring Losser (Losser) - ISIL: NL-LsHKL
  • Entry 2: Historische Kring Losser (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp

Analysis

This is a DEFINITE DUPLICATE - same institution with/without article "de":

  1. Same city, same type (historical society)
  2. Name differs only by article "de" (common in Dutch)
  3. ISIL registry includes "de" in official name
  4. Dutch orgs CSV has both variants:
    • With "de" + ISIL (matches ISIL registry)
    • Without "de" + platform (tracking different aspect)

Dutch language context:

  • "de" = "the" in English
  • Common for organizations to be referenced with/without article
  • Both are valid references to same organization
  • ISIL registry uses full official name with "de"

Decision: DUPLICATE - MERGE

Action:

  • These are the SAME institution
  • Use official ISIL name: de Historische Kring Losser
  • Store variant in alternative_names: Historische Kring Losser
  • Normalize matching to handle articles (de, het, 't) in Dutch names

GHCID Impact:

  • Current base GHCID: NL-XX-LOSS-M-HKL
  • No Q-number needed (duplicate, not collision)
  • Name normalization should strip articles for matching purposes

Summary of Decisions

| Case | City | Status | Action | Q-Number Needed? |
| --- | --- | --- | --- | --- |
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | No |

Implementation Steps

1. Rotterdam - Name Change Handling

File to modify: Deduplicator logic

Add special case for name change detection:

  • If ISIL codes match → always treat as same institution
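  • Check ISIL registry remarks for "naamswijzinging" (the spelling used in the registry remark; the standard Dutch spelling is "naamswijziging", so match both) / "name change"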
  • Generate ChangeEvent with NAME_CHANGE type

Code location: src/glam_extractor/parsers/deduplicator.py

from typing import Optional

def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    Assumes record1 carries the older name and record2 the current one.
    """
    # Check if ISIL codes match
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    # Strip whitespace so a trailing space alone is not reported as a rename
    old_name = record1.name.strip()
    new_name = record2.name.strip()

    if isil1 and isil2 and isil1 == isil2:
        if old_name != new_name:
            # Same ISIL, different name → name change
            return ChangeEvent(
                change_type=ChangeTypeEnum.NAME_CHANGE,
                event_description=f"Name changed from '{old_name}' to '{new_name}'",
                source_documentation="Detected from ISIL code match with different names"
            )
    return None

2. Wierden - Typo Handling

File to modify: Deduplicator logic

Add fuzzy matching for similar names in same city:

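  • Levenshtein distance ≤ 3 for names in same city → flag as potential typo ("Wederen" and "Wierden" are 3 edits apart, so a cutoff of 2 would miss this case)
  • Add provenance note: "Name variant detected, may be typo"

Code location: src/glam_extractor/parsers/deduplicator.py

from rapidfuzz.distance import Levenshtein

def detect_potential_typo(name1: str, name2: str, city: str, max_distance: int = 3) -> bool:
    """
    Detect potential typos in institution names.

    The caller passes two names already known to share the same city
    (city is kept for logging/provenance only). If the names are within
    max_distance edits of each other, they are likely a typo or variant
    spelling. The default of 3 covers "Wederen" vs "Wierden".
    """
    # Ignore case and stray whitespace (the ISIL registry entry has a trailing space)
    a = name1.strip().lower()
    b = name2.strip().lower()
    return Levenshtein.distance(a, b) <= max_distance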
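
Before settling on a cutoff, it is worth measuring the edit distance of the pairs that motivated the rule. The sketch below uses a plain dynamic-programming Levenshtein function written for illustration (a dependency-free stand-in, not the rapidfuzz implementation) and applies it to the Wierden and Losser name pairs from the case descriptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


wierden_distance = levenshtein("Historische Kring Wederen", "Historische Kring Wierden")
losser_distance = levenshtein("de Historische Kring Losser", "Historische Kring Losser")
print(wierden_distance, losser_distance)  # → 3 3
```

Both pairs are 3 edits apart, so any cutoff has to admit a distance of 3 to flag the Wierden pair. A looser cutoff also increases the chance of pairing genuinely different institutions with short names, which is one reason to keep flagged pairs in manual review rather than merging them automatically.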

3. Losser - Article Normalization

File to modify: Deduplicator match key generation

Improve Dutch article handling:

  • Strip leading articles ("de", "het", "'t", in any capitalization) before generating the match key

Code location: src/glam_extractor/parsers/deduplicator.py (lines 94-127)

import re

# Lowercase only: the name is lowercased before the prefix check
DUTCH_ARTICLES = ('de ', 'het ', "'t ")

def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Strip one leading Dutch article
    - Remove punctuation
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip Dutch articles (before punctuation removal, so "'t" still matches)
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break

    # Remove punctuation
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized

Testing Strategy

Test 1: Rotterdam Name Change

def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names
    
    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE 
        for event in result[0].change_history
    )

Test 2: Wierden Typo Detection

def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge (treating as typo)
    assert len(result) == 1
    
    # Should flag in provenance
    assert "variant" in result[0].provenance.notes.lower() or \
           "typo" in result[0].provenance.notes.lower()

Test 3: Losser Article Normalization

def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names

Expected Impact on Collision Report

After implementing these fixes:

Current state: 15 collision groups, 30 institutions

Expected after fixes:

  • Rotterdam: -1 collision group (merged)
  • Wierden: -1 collision group (merged with caveat)
  • Losser: -1 collision group (merged)

New state: 12 collision groups, 24 institutions

Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)


Recommendations

Immediate Actions (High Priority)

  1. Implement article normalization (Losser case) - Low complexity, high impact
  2. Add ISIL-based name change detection (Rotterdam case) - Critical for data quality
  3. ⚠️ Add fuzzy matching for typos (Wierden case) - Requires manual verification

Medium-Term Actions

  1. Contact institutions for verification:

    • Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
    • Confirm Rotterdam name change timeline
  2. Improve ISIL registry integration:

    • Parse "Opmerking" field for change events
    • Extract name change dates from remarks
    • Auto-generate ChangeEvent records
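
Extracting a change date from a remark could be sketched as follows. The function name `extract_name_change_date` and the keyword list are assumptions made for illustration: the first keyword is the spelling quoted from the registry remark in Case 1, the other two are assumed variants. The sketch also assumes the date appears in ISO format (YYYY-MM-DD), as it does in that remark, and that the date in the remark can be read as the date of the change.

```python
import re
from datetime import date
from typing import Optional

# First spelling is copied from the registry remark; the rest are assumed variants.
NAME_CHANGE_KEYWORDS = ("naamswijzinging", "naamswijziging", "naamsverandering")
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_name_change_date(remark: str) -> Optional[date]:
    """Return the first ISO date in a remark that mentions a name change,
    or None if the remark has no keyword or no valid date."""
    lowered = remark.lower()
    if not any(keyword in lowered for keyword in NAME_CHANGE_KEYWORDS):
        return None
    match = ISO_DATE.search(remark)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Looks like a date but is not a real calendar day (e.g. 2024-13-40)
        return None


remark = "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
print(extract_name_change_date(remark))  # → 2024-08-06
```

Remarks without a recognized keyword return `None`, so unrelated notes in the "Opmerking" field would not generate spurious change events.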

Long-Term Actions

  1. Build article database for multilingual support:

    • Dutch: de, het, 't
    • English: the, a, an
    • German: der, die, das, ein, eine
    • French: le, la, les, un, une
  2. Implement fuzzy matching pipeline:

    • Stage 1: Exact match (current)
    • Stage 2: Normalized match (articles stripped)
    • Stage 3: Fuzzy match (typo detection)
    • Stage 4: Manual review queue
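
The article table and the staged pipeline above can be sketched together. Everything named here (`ARTICLES`, `strip_article`, `classify_pair`, the stage labels, and the 0.9 threshold) is hypothetical and chosen for illustration. The fuzzy stage uses `difflib.SequenceMatcher` from the standard library as a stand-in for a Levenshtein-based check, and the sketch assumes both names are already known to be in the same city.

```python
import re
from difflib import SequenceMatcher

# Leading articles per language, from the list above; illustrative, not exhaustive.
ARTICLES = {
    "nl": ("de", "het", "'t"),
    "en": ("the", "a", "an"),
    "de": ("der", "die", "das", "ein", "eine"),
    "fr": ("le", "la", "les", "un", "une"),
}


def strip_article(name: str, lang: str = "nl") -> str:
    """Collapse whitespace, lowercase, and drop one leading article."""
    cleaned = re.sub(r"\s+", " ", name).strip().lower()
    for article in ARTICLES.get(lang, ()):
        if cleaned.startswith(article + " "):
            return cleaned[len(article) + 1:]
    return cleaned


def classify_pair(name_a: str, name_b: str, lang: str = "nl",
                  fuzzy_threshold: float = 0.9) -> str:
    """Return the first stage at which two same-city names match."""
    if name_a == name_b:
        return "exact"                      # Stage 1
    key_a, key_b = strip_article(name_a, lang), strip_article(name_b, lang)
    if key_a == key_b:
        return "normalized"                 # Stage 2
    if SequenceMatcher(None, key_a, key_b).ratio() >= fuzzy_threshold:
        return "fuzzy_review"               # Stage 3 hit, queued for stage 4
    return "distinct"


print(classify_pair("de Historische Kring Losser ", "Historische Kring Losser"))
print(classify_pair("Historische Kring Wederen", "Historische Kring Wierden"))
```

In this sketch only fuzzy hits go to the manual review queue, while exact and normalized matches are merged directly. That split mirrors the three cases: the Losser pair resolves at the normalized stage, whereas the Wierden pair is only similar and still needs a human decision.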

Files to Modify

  1. src/glam_extractor/parsers/deduplicator.py - Core logic
  2. tests/parsers/test_deduplicator.py - Add 3 new tests
  3. scripts/apply_collision_resolution_dutch_datasets.py - Re-run after fixes

Conclusion

All three edge cases are duplicates (two confirmed, one probable pending verification), not true collisions:

  1. Rotterdam: Name change (same ISIL code)
  2. Wierden: Probable typo (Wederen → Wierden)
  3. Losser: Article variance (de Historische Kring vs Historische Kring)

Implementing the recommended fixes will:

  • Reduce collisions from 15 to 12 groups (-20%)
  • Improve name normalization for Dutch institutions
  • Detect name changes automatically from ISIL data
  • Flag potential typos for manual review

Next step: Implement fixes in deduplicator code and re-run collision resolution.