# Collision Edge Case Analysis - Manual Review

**Date**: 2025-11-07
**Purpose**: Investigate 3 ambiguous collision cases requiring human decision
**Analyst**: AI-assisted manual review

---

## Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 279):
- Institution: `Het Nieuwe Instituut`
- ISIL Code: `NL-RtHNI`
- Assigned: 2013-06-24
- **Remark**: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)

**Dutch Organizations CSV**:
- Institution: `Nieuwe Instituut`
- ISIL Code: `NL-RtHNI` (SAME as ISIL registry)
- City: Rotterdam

### Analysis

This is a **NAME CHANGE**, not two different institutions:

1. **Same ISIL code** (`NL-RtHNI`) in both datasets
2. **ISIL remark explicitly mentions name change** ("naamswijzinging")
3. Timeline:
   - 2013: ISIL assigned to "Het Nieuwe Instituut"
   - 2024-08-06: Name changed (remark added to ISIL registry)
   - Dutch orgs CSV shows new name "Nieuwe Instituut"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution at different points in time
- Use current name: `Nieuwe Instituut`
- Store historical name in `alternative_names`: `Het Nieuwe Instituut`
- Add `ChangeEvent`:
  - `change_type`: NAME_CHANGE
  - `event_date`: "2024-08-06"
  - `event_description`: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
  - `source_documentation`: ISIL registry remark

**GHCID Impact**:
- Current GHCID: `NL-XX-ROTT-M-NI`
- No Q-number needed (this is a duplicate, not a collision)
- Historical GHCID: Same (the name change doesn't affect the GHCID since the abbreviation is the same)

---

## Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 339):
- Institution: `Historische Kring Wederen ` (with trailing space)
- ISIL Code: `NL-WdnHKW`
- City: Wierden
- Assigned: 2015-02-27

**Dutch Organizations CSV**:
- Entry 1: `Historische Kring Wederen` (Wierden) - ISIL: `NL-WdnHKW`, Platform: Mijn Stad Mijn Dorp
- Entry 2: `Historische Kring Wierden` (Wierden) - No ISIL, Platform: ZCBS

### Analysis

This appears to be a **TYPO/DATA ERROR**:

1. **Geographic context**: City is "Wierden" (not "Wederen")
2. **ISIL registry has "Wederen"** but city is Wierden → likely typo in institution name
3. **Dutch orgs CSV has BOTH variants**:
   - "Wederen" with ISIL code (matches ISIL registry)
   - "Wierden" without ISIL code (correct spelling based on city)
4. **Different platforms** (Mijn Stad Mijn Dorp vs ZCBS) suggest they might be tracked separately

### Possible Scenarios

**Scenario A: Same institution, typo in ISIL registry**
- ISIL registry has typo: "Wederen" → should be "Wierden"
- Dutch orgs CSV correctly lists "Historische Kring Wierden"
- The entry with the ISIL code carries the typo or is outdated

**Scenario B: Two different organizations**
- One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
- One is "Historische Kring Wierden" (covering Wierden municipality)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this

**Most Likely**: Scenario A (typo)

### Decision: **PROBABLE DUPLICATE - MERGE WITH CAVEAT**

**Recommended Action**:
- Treat as same institution (typo correction)
- Use correct name: `Historische Kring Wierden`
- Store variant in `alternative_names`: `Historische Kring Wederen` (marked as "typo in ISIL registry")
- **Flag for verification**: Recommend contacting institution to confirm

**GHCID Impact**:
- Current base GHCID: `NL-XX-WIER-M-HKW`
- No Q-number needed if merged
- Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"

**Alternative Action** (if Scenario B is confirmed):
- Keep as separate institutions
- Add Q-numbers:
  - `NL-XX-WIER-M-HKW-Q` for "Wederen"
  - `NL-XX-WIER-M-HKW-Q` for "Wierden"

---

## Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 226):
- Institution: `de Historische Kring Losser ` (with trailing space and lowercase "de")
- ISIL Code: `NL-LsHKL`
- City: Losser
- Assigned: 2015-12-03

**Dutch Organizations CSV**:
- Entry 1: `de Historische Kring Losser` (Losser) - ISIL: `NL-LsHKL`
- Entry 2: `Historische Kring Losser` (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp

### Analysis

This is a **DEFINITE DUPLICATE** - same institution with/without article "de":

1. **Same city, same type** (historical society)
2. **Name differs only by article "de"** (common in Dutch)
3. **ISIL registry includes "de"** in official name
4. **Dutch orgs CSV has both variants**:
   - With "de" + ISIL (matches ISIL registry)
   - Without "de" + platform (tracking different aspect)

**Dutch language context**:
- "de" = "the" in English
- Organizations are commonly referenced both with and without the article
- Both are valid references to the same organization
- ISIL registry uses full official name with "de"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution
- Use official ISIL name: `de Historische Kring Losser`
- Store variant in `alternative_names`: `Historische Kring Losser`
- Normalize matching to handle articles (de, het, 't) in Dutch names

**GHCID Impact**:
- Current base GHCID: `NL-XX-LOSS-M-HKL`
- No Q-number needed (duplicate, not collision)
- Name normalization should strip articles for matching purposes

---

## Summary of Decisions

| Case | City | Status | Action | Q-Number Needed? |
|------|------|--------|--------|------------------|
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | ❌ No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | ❌ No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | ❌ No |

---

## Implementation Steps

### 1. Rotterdam - Name Change Handling

**File to modify**: Deduplicator logic

Add special case for name change detection:
- If ISIL codes match → always treat as same institution
- Check ISIL registry remarks for "naamswijzinging" (as spelled in the remark quoted above; the standard Dutch spelling is "naamswijziging") / "name change"
- Generate `ChangeEvent` with NAME_CHANGE type

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from typing import Optional


def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    The description assumes record1 carries the older name and
    record2 the current one.
    """
    # Check if ISIL codes match
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    if isil1 and isil2 and isil1 == isil2:
        if record1.name != record2.name:
            # Same ISIL, different name → name change
            return ChangeEvent(
                change_type=ChangeTypeEnum.NAME_CHANGE,
                event_description=f"Name changed from '{record1.name}' to '{record2.name}'",
                source_documentation="Detected from ISIL code match with different names"
            )

    return None
```

### 2. Wierden - Typo Handling

**File to modify**: Deduplicator logic

Add fuzzy matching for similar names in same city:
- Levenshtein distance ≤ 3 for names in same city → flag as potential typo ("Wederen" and "Wierden" are 3 edits apart, so a cutoff of 2 would miss this case)
- Add provenance note: "Name variant detected, may be typo"

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from rapidfuzz.distance import Levenshtein


def detect_potential_typo(name1: str, name2: str, city: str, max_distance: int = 3) -> bool:
    """
    Detect potential typos in institution names.

    If names are very similar (Levenshtein distance <= max_distance) and
    in the same city, likely a typo or variant spelling.
    """
    # The caller is expected to pass two names already known to share `city`;
    # the city itself is not compared here.
    # Strip surrounding whitespace first: the ISIL registry names carry trailing spaces.
    distance = Levenshtein.distance(name1.strip(), name2.strip())
    return distance <= max_distance
```

### 3. Losser - Article Normalization

**File to modify**: Deduplicator match key generation

Improve Dutch article handling:
- Strip leading articles: "de", "het", "'t", "De", "Het"
- Do this before generating the match key

**Code location**: `src/glam_extractor/parsers/deduplicator.py` (lines 94-127)

```python
import re

DUTCH_ARTICLES = ['de ', 'het ', "'t ", 'De ', 'Het ']


def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Remove punctuation
    - Strip Dutch articles
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip Dutch articles (at most one leading article is removed)
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article.lower()):
            normalized = normalized[len(article):]
            break

    # Remove punctuation
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized
```

---

## Testing Strategy

### Test 1: Rotterdam Name Change

```python
def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )

    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names

    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE
        for event in result[0].change_history
    )
```

### Test 2: Wierden Typo Detection

```python
def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )

    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (treating as typo)
    assert len(result) == 1

    # Should flag in provenance
    assert "variant" in result[0].provenance.notes.lower() or \
           "typo" in result[0].provenance.notes.lower()
```

### Test 3: Losser Article Normalization

```python
def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )

    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names
```

---

## Expected Impact on Collision Report

After implementing these fixes:

**Current state**: 15 collision groups, 30 institutions

**Expected after fixes**:
- **Rotterdam**: -1 collision group (merged)
- **Wierden**: -1 collision group (merged with caveat)
- **Losser**: -1 collision group (merged)

**New state**: **12 collision groups, 24 institutions**

Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)

---

## Recommendations

### Immediate Actions (High Priority)

1. ✅ **Implement article normalization** (Losser case) - Low complexity, high impact
2. ✅ **Add ISIL-based name change detection** (Rotterdam case) - Critical for data quality
3. ⚠️ **Add fuzzy matching for typos** (Wierden case) - Requires manual verification

### Medium-Term Actions

4. **Contact institutions for verification**:
   - Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
   - Confirm Rotterdam name change timeline
5. **Improve ISIL registry integration**:
   - Parse "Opmerking" field for change events
   - Extract name change dates from remarks
   - Auto-generate ChangeEvent records

### Long-Term Actions

6. **Build article database** for multilingual support:
   - Dutch: de, het, 't
   - English: the, a, an
   - German: der, die, das, ein, eine
   - French: le, la, les, un, une
7. **Implement fuzzy matching pipeline**:
   - Stage 1: Exact match (current)
   - Stage 2: Normalized match (articles stripped)
   - Stage 3: Fuzzy match (typo detection)
   - Stage 4: Manual review queue

---

## Files to Modify

1. `src/glam_extractor/parsers/deduplicator.py` - Core logic
2. `tests/parsers/test_deduplicator.py` - Add 3 new tests
3. `scripts/apply_collision_resolution_dutch_datasets.py` - Re-run after fixes

---

## Conclusion

All three edge cases are treated as **duplicates, not true collisions** (the Wierden case pending verification):

1. **Rotterdam**: Name change (same ISIL code)
2. **Wierden**: Probable typo (Wederen → Wierden)
3. **Losser**: Article variance (de Historische Kring vs Historische Kring)

Implementing the recommended fixes will:
- ✅ Reduce collisions from 15 to 12 groups (-20%)
- ✅ Improve name normalization for Dutch institutions
- ✅ Detect name changes automatically from ISIL data
- ✅ Flag potential typos for manual review

**Next step**: Implement fixes in deduplicator code and re-run collision resolution.
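
### Appendix: Sketch of ISIL Remark Parsing

Recommendation 5 (parse the "Opmerking" field and extract name change dates) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `RemarkChange` and `parse_isil_remark` are hypothetical names, the remark is assumed to carry an ISO `YYYY-MM-DD` date as in the Rotterdam example, and the real pipeline would presumably build a `ChangeEvent` instead of this stand-in dataclass.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemarkChange:
    """Hypothetical stand-in for the fields a ChangeEvent would receive."""
    change_type: str
    event_date: Optional[str]
    remark: str


# Accept the standard spelling "naamswijziging", the "naamswijzinging" variant
# quoted from the registry remark, and the English phrase "name change".
NAME_CHANGE_PATTERN = re.compile(r"naamswijzin?ging|name change", re.IGNORECASE)

# Assumption: dates in remarks are written as ISO YYYY-MM-DD.
ISO_DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def parse_isil_remark(remark: str) -> Optional[RemarkChange]:
    """Return a NAME_CHANGE record if the remark mentions a name change, else None."""
    if not remark or not NAME_CHANGE_PATTERN.search(remark):
        return None

    # Take the first ISO date in the remark, if any; leave it unset otherwise.
    date_match = ISO_DATE_PATTERN.search(remark)
    return RemarkChange(
        change_type="NAME_CHANGE",
        event_date=date_match.group(1) if date_match else None,
        remark=remark.strip(),
    )


if __name__ == "__main__":
    example = "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
    print(parse_isil_remark(example))
```

Applied to the Rotterdam remark, this yields a record with `event_date` set to `2024-08-06`, which could fill the `event_date` that the ISIL-based `detect_name_change` helper leaves unset. Remarks without a recognizable date still produce a record, with `event_date` left as `None` for manual follow-up.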