kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

Collision Edge Case Analysis - Manual Review

Date: 2025-11-07
Purpose: Investigate 3 ambiguous collision cases requiring human decision
Analyst: AI-assisted manual review

Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 279):

Institution: Het Nieuwe Instituut
ISIL Code: NL-RtHNI
Assigned: 2013-06-24
Remark: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)

Dutch Organizations CSV:

Institution: Nieuwe Instituut
ISIL Code: NL-RtHNI (SAME as ISIL registry)
City: Rotterdam

Analysis

This is a NAME CHANGE, not two different institutions:

Same ISIL code (NL-RtHNI) in both datasets
ISIL remark explicitly mentions name change ("naamswijzinging")
Timeline:
- 2013: ISIL assigned to "Het Nieuwe Instituut"
- 2024-08-06: Name changed (remark added to ISIL registry)
- Dutch orgs CSV shows new name "Nieuwe Instituut"

Decision: DUPLICATE - MERGE

Action:

These are the SAME institution at different points in time
Use current name: Nieuwe Instituut
Store historical name in alternative_names: Het Nieuwe Instituut
Add ChangeEvent:
- change_type: NAME_CHANGE
- event_date: "2024-08-06"
- event_description: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
- source_documentation: ISIL registry remark
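
The merge described above can be sketched as follows. This is a minimal illustration, assuming plain dataclasses: the class layout and the `merge_name_change` helper are hypothetical, and only the field names (`alternative_names`, `change_type`, `event_date`, `event_description`, `source_documentation`) come from this report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeEvent:
    # Field names follow the report; the class itself is a hypothetical stand-in.
    change_type: str
    event_date: str
    event_description: str
    source_documentation: str


@dataclass
class MergedRecord:
    name: str
    alternative_names: List[str] = field(default_factory=list)
    change_history: List[ChangeEvent] = field(default_factory=list)


def merge_name_change(old_name: str, new_name: str, event_date: str, source: str) -> MergedRecord:
    """Build one record that keeps the current name and records the old one."""
    event = ChangeEvent(
        change_type="NAME_CHANGE",
        event_date=event_date,
        event_description=f"Name changed from '{old_name}' to '{new_name}'",
        source_documentation=source,
    )
    return MergedRecord(name=new_name, alternative_names=[old_name], change_history=[event])


merged = merge_name_change(
    "Het Nieuwe Instituut", "Nieuwe Instituut", "2024-08-06", "ISIL registry remark"
)
```

Keeping the old name in `alternative_names` means a later lookup by either name can still resolve to the same record.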

GHCID Impact:

Current GHCID: NL-XX-ROTT-M-NI
No Q-number needed (this is a duplicate, not a collision)
Historical GHCID: Same (the name change does not affect the GHCID because the abbreviation is unchanged)

Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 339):

Institution: Historische Kring Wederen (with trailing space)
ISIL Code: NL-WdnHKW
City: Wierden
Assigned: 2015-02-27

Dutch Organizations CSV:

Entry 1: Historische Kring Wederen (Wierden) - ISIL: NL-WdnHKW, Platform: Mijn Stad Mijn Dorp
Entry 2: Historische Kring Wierden (Wierden) - No ISIL, Platform: ZCBS

Analysis

This appears to be a TYPO/DATA ERROR:

Geographic context: City is "Wierden" (not "Wederen")
ISIL registry has "Wederen" but city is Wierden → likely typo in institution name
Dutch orgs CSV has BOTH variants:
- "Wederen" with ISIL code (matches ISIL registry)
- "Wierden" without ISIL code (correct spelling based on city)
Different platforms (Mijn Stad Mijn Dorp vs ZCBS) suggest they might be tracked separately

Possible Scenarios

Scenario A: Same institution, typo in ISIL registry

ISIL registry has typo: "Wederen" → should be "Wierden"
Dutch orgs CSV correctly lists "Historische Kring Wierden"
Entry with ISIL is outdated/typo

Scenario B: Two different organizations

One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
One is "Historische Kring Wierden" (covering Wierden municipality)
Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this

Most Likely: Scenario A (typo)

Decision: PROBABLE DUPLICATE - MERGE WITH CAVEAT

Recommended Action:

Treat as same institution (typo correction)
Use correct name: Historische Kring Wierden
Store variant in alternative_names: Historische Kring Wederen (marked as "typo in ISIL registry")
Flag for verification: Recommend contacting institution to confirm

GHCID Impact:

Current base GHCID: NL-XX-WIER-M-HKW
No Q-number needed if merged
Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"

Alternative Action (if Scenario B is confirmed):

Keep as separate institutions
Add Q-numbers:
- NL-XX-WIER-M-HKW-Q<wikidata> for "Wederen"
- NL-XX-WIER-M-HKW-Q<wikidata> for "Wierden"
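
If Scenario B were confirmed, each record would keep the shared base GHCID and receive its own Wikidata Q-number as a suffix. A minimal sketch of that step, assuming a hypothetical `with_q_number` helper (not part of the existing code) and that the Q-number has already been looked up elsewhere:

```python
import re

# Wikidata item identifiers are "Q" followed by a positive integer.
_QID = re.compile(r"Q[1-9][0-9]*")


def with_q_number(base_ghcid: str, qid: str) -> str:
    """Append a Wikidata Q-number to a base GHCID to disambiguate a collision."""
    if not _QID.fullmatch(qid):
        raise ValueError(f"Not a Wikidata Q-number: {qid!r}")
    return f"{base_ghcid}-{qid}"
```

Both records must receive different Q-numbers, otherwise the suffixed GHCIDs collide again.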

Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"

Evidence

ISIL Registry (ISIL-codes_2025-08-01.csv, line 226):

Institution: de Historische Kring Losser (with trailing space and lowercase "de")
ISIL Code: NL-LsHKL
City: Losser
Assigned: 2015-12-03

Dutch Organizations CSV:

Entry 1: de Historische Kring Losser (Losser) - ISIL: NL-LsHKL
Entry 2: Historische Kring Losser (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp

Analysis

This is a DEFINITE DUPLICATE - same institution with/without article "de":

Same city, same type (historical society)
Name differs only by article "de" (common in Dutch)
ISIL registry includes "de" in official name
Dutch orgs CSV has both variants:
- With "de" + ISIL (matches ISIL registry)
- Without "de" + platform (tracking different aspect)

Dutch language context:

"de" = "the" in English
Common for organizations to be referenced with/without article
Both are valid references to same organization
ISIL registry uses full official name with "de"

Decision: DUPLICATE - MERGE

Action:

These are the SAME institution
Use official ISIL name: de Historische Kring Losser
Store variant in alternative_names: Historische Kring Losser
Normalize matching to handle articles (de, het, 't) in Dutch names

GHCID Impact:

Current base GHCID: NL-XX-LOSS-M-HKL
No Q-number needed (duplicate, not collision)
Name normalization should strip articles for matching purposes

Summary of Decisions

Case	City	Status	Action	Q-Number Needed?
Het Nieuwe Instituut / Nieuwe Instituut	Rotterdam	NAME CHANGE	Merge, add ChangeEvent	❌ No
Historische Kring Wederen / Wierden	Wierden	PROBABLE TYPO	Merge with caveat, flag for verification	❌ No (if merged)
de Historische Kring Losser / Historische Kring Losser	Losser	DUPLICATE (article)	Merge, improve normalization	❌ No

Implementation Steps

1. Rotterdam - Name Change Handling

File to modify: Deduplicator logic

Add special case for name change detection:

If ISIL codes match → always treat as same institution
Check ISIL registry remarks for "naamswijzinging" / "name change"
Generate ChangeEvent with NAME_CHANGE type

Code location: src/glam_extractor/parsers/deduplicator.py

from typing import Optional

def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    record1 is assumed to carry the older name and record2 the newer one.
    """
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    # Both records need an ISIL code, and the codes must be identical
    if not (isil1 and isil2 and isil1 == isil2):
        return None

    # Ignore surrounding whitespace (the registry names have trailing spaces)
    old_name, new_name = record1.name.strip(), record2.name.strip()
    if old_name == new_name:
        return None

    # Same ISIL, different name → name change
    return ChangeEvent(
        change_type=ChangeTypeEnum.NAME_CHANGE,
        event_description=f"Name changed from '{old_name}' to '{new_name}'",
        source_documentation="Detected from ISIL code match with different names"
    )

2. Wierden - Typo Handling

File to modify: Deduplicator logic

Add fuzzy matching for similar names in same city:

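Levenshtein distance ≤ 3 for names in same city → flag as potential typo ("Wederen" and "Wierden" are 3 edits apart, so a threshold of 2 would miss this case)
Add provenance note: "Name variant detected, may be typo"

Code location: src/glam_extractor/parsers/deduplicator.py

from rapidfuzz.distance import Levenshtein

# "Wederen" vs "Wierden" needs 3 edits, so the threshold must be at least 3
MAX_TYPO_DISTANCE = 3

def detect_potential_typo(name1: str, name2: str, city: str) -> bool:
    """
    Detect potential typos in institution names.

    If names are very similar (Levenshtein distance <= MAX_TYPO_DISTANCE)
    and in the same city, likely a typo or variant spelling. The caller is
    expected to pass only names that share `city`.
    """
    # rapidfuzz.fuzz only provides similarity ratios; the edit distance
    # lives in rapidfuzz.distance.Levenshtein
    distance = Levenshtein.distance(name1.strip().lower(), name2.strip().lower())
    return distance <= MAX_TYPO_DISTANCE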
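
As a sanity check on the threshold, the edit distance for the Wierden case can be computed with a small dependency-free implementation. This is a sketch: the `levenshtein` helper below is illustrative and not part of the deduplicator.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("Historische Kring Wederen", "Historische Kring Wierden")
print(distance)  # 3
```

The two names are 3 edits apart, so any cutoff below 3 would not flag this pair.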

3. Losser - Article Normalization

File to modify: Deduplicator match key generation

Improve Dutch article handling:

Strip leading articles ("de", "het", "'t", in any capitalization) before generating the match key

Code location: src/glam_extractor/parsers/deduplicator.py (lines 94-127)

import re

# Lowercase only: the name is lowercased before the articles are compared
DUTCH_ARTICLES = ['de ', 'het ', "'t "]

def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Strip one leading Dutch article
    - Remove punctuation
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip one leading article; each entry ends in a space, so "de "
    # only matches a whole word and not a name like "Dekker"
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break

    # Remove punctuation (hyphens are kept)
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized

Testing Strategy

Test 1: Rotterdam Name Change

def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names
    
    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE 
        for event in result[0].change_history
    )

Test 2: Wierden Typo Detection

def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge (treating as typo)
    assert len(result) == 1
    
    # Should flag in provenance
    assert "variant" in result[0].provenance.notes.lower() or \
           "typo" in result[0].provenance.notes.lower()

Test 3: Losser Article Normalization

def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )
    
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    
    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names

Expected Impact on Collision Report

After implementing these fixes:

Current state: 15 collision groups, 30 institutions

Expected after fixes:

Rotterdam: -1 collision group (merged)
Wierden: -1 collision group (merged with caveat)
Losser: -1 collision group (merged)

New state: 12 collision groups, 24 institutions

Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)

Recommendations

Immediate Actions (High Priority)

✅ Implement article normalization (Losser case) - Low complexity, high impact
✅ Add ISIL-based name change detection (Rotterdam case) - Critical for data quality
⚠️ Add fuzzy matching for typos (Wierden case) - Requires manual verification

Medium-Term Actions

Contact institutions for verification:
- Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
- Confirm Rotterdam name change timeline
Improve ISIL registry integration:
- Parse "Opmerking" field for change events
- Extract name change dates from remarks
- Auto-generate ChangeEvent records
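
Parsing the remark field could look roughly like this. It is a sketch under assumptions: the `parse_name_change_date` helper is hypothetical, the remark is assumed to contain an ISO date as in the Rotterdam example, and the date found may be when the remark was added rather than when the name actually changed.

```python
import re
from datetime import date
from typing import Optional

# An ISO date (YYYY-MM-DD) anywhere in the remark.
_ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# Accept the registry's spelling "naamswijzinging" and the standard "naamswijziging".
_NAME_CHANGE = re.compile(r"naamswijzi(?:n)?ging", re.IGNORECASE)


def parse_name_change_date(remark: str) -> Optional[date]:
    """Return the date in a remark that mentions a name change, else None."""
    if not _NAME_CHANGE.search(remark):
        return None
    match = _ISO_DATE.search(remark)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Looks like a date but is not a valid calendar day
        return None


remark = "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
event_date = parse_name_change_date(remark)
```

Remarks that mention a name change without a parseable date return `None` and would still need manual handling.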

Long-Term Actions

Build article database for multilingual support:
- Dutch: de, het, 't
- English: the, a, an
- German: der, die, das, ein, eine
- French: le, la, les, un, une
Implement fuzzy matching pipeline:
- Stage 1: Exact match (current)
- Stage 2: Normalized match (articles stripped)
- Stage 3: Fuzzy match (typo detection)
- Stage 4: Manual review queue
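
The staged pipeline and the per-language article list can be sketched together. This is an illustration under assumptions: `classify_pair`, the language codes, and the default cutoff of 3 edits are hypothetical choices, and the pair passed in is assumed to come from the same collision group (same city and type).

```python
import re
from typing import Dict, Tuple

# Leading articles per language, taken from the list above.
ARTICLES: Dict[str, Tuple[str, ...]] = {
    "nl": ("de", "het", "'t"),
    "en": ("the", "a", "an"),
    "de": ("der", "die", "das", "ein", "eine"),
    "fr": ("le", "la", "les", "un", "une"),
}


def normalize(name: str, lang: str = "nl") -> str:
    """Lowercase, drop one leading article, strip punctuation, collapse whitespace."""
    words = name.lower().split()
    if len(words) > 1 and words[0] in ARTICLES.get(lang, ()):
        words = words[1:]
    text = re.sub(r"[^\w\s-]", "", " ".join(words))
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def classify_pair(name1: str, name2: str, lang: str = "nl", max_typo_distance: int = 3) -> str:
    """Return the first stage at which two names from one collision group match."""
    if name1 == name2:
        return "exact"              # Stage 1
    n1, n2 = normalize(name1, lang), normalize(name2, lang)
    if n1 == n2:
        return "normalized"         # Stage 2
    if levenshtein(n1, n2) <= max_typo_distance:
        return "fuzzy"              # Stage 3: still worth flagging for verification
    return "manual_review"          # Stage 4


losser = classify_pair("de Historische Kring Losser ", "Historische Kring Losser")
wierden = classify_pair("Historische Kring Wederen", "Historische Kring Wierden")
```

Running the cheap stages first keeps fuzzy matching, which is the most likely to produce false positives, as a last automated step before human review.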

Files to Modify

src/glam_extractor/parsers/deduplicator.py - Core logic
tests/parsers/test_deduplicator.py - Add 3 new tests
scripts/apply_collision_resolution_dutch_datasets.py - Re-run after fixes

Conclusion

All three edge cases are duplicates, not true collisions:

Rotterdam: Name change (same ISIL code)
Wierden: Probable typo (Wederen → Wierden)
Losser: Article variance (de Historische Kring vs Historische Kring)

Implementing the recommended fixes will:

✅ Reduce collisions from 15 to 12 groups (-20%)
✅ Improve name normalization for Dutch institutions
✅ Detect name changes automatically from ISIL data
✅ Flag potential typos for manual review

Next step: Implement fixes in deduplicator code and re-run collision resolution.
