Collision Edge Case Analysis - Manual Review
Date: 2025-11-07
Purpose: Investigate 3 ambiguous collision cases requiring human decision
Analyst: AI-assisted manual review
Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"
Evidence
ISIL Registry (ISIL-codes_2025-08-01.csv, line 279):
- Institution: `Het Nieuwe Instituut`
- ISIL Code: `NL-RtHNI`
- Assigned: 2013-06-24
- Remark: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)
Dutch Organizations CSV:
- Institution: `Nieuwe Instituut`
- ISIL Code: `NL-RtHNI` (SAME as ISIL registry)
- City: Rotterdam
Analysis
This is a NAME CHANGE, not two different institutions:
- Same ISIL code (`NL-RtHNI`) in both datasets
- ISIL remark explicitly mentions name change ("naamswijzinging")
- Timeline:
  - 2013: ISIL assigned to "Het Nieuwe Instituut"
  - 2024-08-06: Name changed (remark added to ISIL registry)
  - Dutch orgs CSV shows new name "Nieuwe Instituut"
Decision: DUPLICATE - MERGE
Action:
- These are the SAME institution at different points in time
- Use current name: `Nieuwe Instituut`
- Store historical name in `alternative_names`: `Het Nieuwe Instituut`
- Add `ChangeEvent`:
  - `change_type: NAME_CHANGE`
  - `event_date: "2024-08-06"`
  - `event_description: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"`
  - `source_documentation`: ISIL registry remark
GHCID Impact:
- Current GHCID: `NL-XX-ROTT-M-NI`
- No Q-number needed (this is a duplicate, not a collision)
- Historical GHCID: Same (name change doesn't affect GHCID since abbreviation is same)
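The merge described above can be sketched with plain dataclasses. The field names mirror the ones used in this document (`alternative_names`, `change_type`, `event_date`, `event_description`, `source_documentation`), but the `Custodian` class and the `merge_name_change` helper are assumptions made for illustration, not the project's actual model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeEvent:
    # Field names follow this document; the class itself is illustrative.
    change_type: str
    event_date: str
    event_description: str
    source_documentation: str


@dataclass
class Custodian:
    # Hypothetical stand-in for the project's institution record.
    name: str
    isil: str
    alternative_names: List[str] = field(default_factory=list)
    change_history: List[ChangeEvent] = field(default_factory=list)


def merge_name_change(old: Custodian, new: Custodian,
                      event_date: str, source: str) -> Custodian:
    """Merge two records of one institution, keeping the current name."""
    if old.isil != new.isil:
        raise ValueError("records do not share an ISIL code")
    merged = Custodian(name=new.name, isil=new.isil)
    # Carry over known variants, then record the historical name.
    for variant in old.alternative_names + new.alternative_names + [old.name]:
        if variant != merged.name and variant not in merged.alternative_names:
            merged.alternative_names.append(variant)
    merged.change_history = old.change_history + new.change_history
    merged.change_history.append(
        ChangeEvent(
            change_type="NAME_CHANGE",
            event_date=event_date,
            event_description=f"Name changed from '{old.name}' to '{new.name}'",
            source_documentation=source,
        )
    )
    return merged


merged = merge_name_change(
    Custodian(name="Het Nieuwe Instituut", isil="NL-RtHNI"),
    Custodian(name="Nieuwe Instituut", isil="NL-RtHNI"),
    event_date="2024-08-06",
    source="ISIL registry remark",
)
```

Here `merged.name` is `"Nieuwe Instituut"`, the old name ends up in `merged.alternative_names`, and one `NAME_CHANGE` event is appended to the history.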
Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"
Evidence
ISIL Registry (ISIL-codes_2025-08-01.csv, line 339):
- Institution: `Historische Kring Wederen` (with trailing space)
- ISIL Code: `NL-WdnHKW`
- City: Wierden
- Assigned: 2015-02-27
Dutch Organizations CSV:
- Entry 1: `Historische Kring Wederen` (Wierden) - ISIL: `NL-WdnHKW`, Platform: Mijn Stad Mijn Dorp
- Entry 2: `Historische Kring Wierden` (Wierden) - No ISIL, Platform: ZCBS
Analysis
This appears to be a TYPO/DATA ERROR:
- Geographic context: City is "Wierden" (not "Wederen")
- ISIL registry has "Wederen" but city is Wierden → likely typo in institution name
- Dutch orgs CSV has BOTH variants:
  - "Wederen" with ISIL code (matches ISIL registry)
  - "Wierden" without ISIL code (correct spelling based on city)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) suggest the two entries might be tracked separately
Possible Scenarios
Scenario A: Same institution, typo in ISIL registry
- ISIL registry has typo: "Wederen" → should be "Wierden"
- Dutch orgs CSV correctly lists "Historische Kring Wierden"
- Entry with ISIL is outdated/typo
Scenario B: Two different organizations
- One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
- One is "Historische Kring Wierden" (covering Wierden municipality)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this
Most Likely: Scenario A (typo)
Decision: PROBABLE DUPLICATE - MERGE WITH CAVEAT
Recommended Action:
- Treat as same institution (typo correction)
- Use correct name: `Historische Kring Wierden`
- Store variant in `alternative_names`: `Historische Kring Wederen` (marked as "typo in ISIL registry")
- Flag for verification: Recommend contacting institution to confirm
GHCID Impact:
- Current base GHCID: `NL-XX-WIER-M-HKW`
- No Q-number needed if merged
- Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"
Alternative Action (if Scenario B is confirmed):
- Keep as separate institutions
- Add Q-numbers:
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wederen"
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wierden"
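If Scenario B were confirmed, the Q-number suffix could be applied along these lines. This is a minimal sketch: the `with_q_suffix` helper is hypothetical, and the Q-numbers in the usage lines are made-up placeholders, since no Wikidata IDs for these organizations are given here.

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer.
_QID = re.compile(r"^Q[1-9]\d*$")


def with_q_suffix(base_ghcid: str, wikidata_id: str) -> str:
    """Append a Wikidata Q-number to a base GHCID to disambiguate a collision."""
    if not _QID.match(wikidata_id):
        raise ValueError(f"not a Wikidata Q-number: {wikidata_id!r}")
    return f"{base_ghcid}-{wikidata_id}"


# Placeholder IDs for illustration only; real IDs must come from Wikidata.
wederen_ghcid = with_q_suffix("NL-XX-WIER-M-HKW", "Q111")
wierden_ghcid = with_q_suffix("NL-XX-WIER-M-HKW", "Q222")
```

The two records then share the base GHCID but remain distinguishable by suffix.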
Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"
Evidence
ISIL Registry (ISIL-codes_2025-08-01.csv, line 226):
- Institution: `de Historische Kring Losser` (with trailing space and lowercase "de")
- ISIL Code: `NL-LsHKL`
- City: Losser
- Assigned: 2015-12-03
Dutch Organizations CSV:
- Entry 1: `de Historische Kring Losser` (Losser) - ISIL: `NL-LsHKL`
- Entry 2: `Historische Kring Losser` (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp
Analysis
This is a DEFINITE DUPLICATE - same institution with/without article "de":
- Same city, same type (historical society)
- Name differs only by article "de" (common in Dutch)
- ISIL registry includes "de" in official name
- Dutch orgs CSV has both variants:
  - With "de" + ISIL (matches ISIL registry)
  - Without "de" + platform (tracking different aspect)
Dutch language context:
- "de" = "the" in English
- Common for organizations to be referenced with/without article
- Both are valid references to same organization
- ISIL registry uses full official name with "de"
Decision: DUPLICATE - MERGE
Action:
- These are the SAME institution
- Use official ISIL name: `de Historische Kring Losser`
- Store variant in `alternative_names`: `Historische Kring Losser`
- Normalize matching to handle articles (de, het, 't) in Dutch names
GHCID Impact:
- Current base GHCID: `NL-XX-LOSS-M-HKL`
- No Q-number needed (duplicate, not collision)
- Name normalization should strip articles for matching purposes
Summary of Decisions
| Case | City | Status | Action | Q-Number Needed? |
|---|---|---|---|---|
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | ❌ No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | ❌ No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | ❌ No |
Implementation Steps
1. Rotterdam - Name Change Handling
File to modify: Deduplicator logic
Add special case for name change detection:
- If ISIL codes match → always treat as same institution
- Check ISIL registry remarks for "naamswijzinging" / "name change"
- Generate `ChangeEvent` with NAME_CHANGE type
Code location: src/glam_extractor/parsers/deduplicator.py
```python
from typing import Optional


def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.
    If ISIL codes match but names differ → likely a name change.
    """
    # Both records must carry an ISIL code, and the codes must be identical
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)
    if isil1 and isil2 and isil1 == isil2:
        # Strip first: the ISIL registry contains names with trailing spaces,
        # which should not count as a different name
        if record1.name.strip() != record2.name.strip():
            # Same ISIL, different name → name change
            return ChangeEvent(
                change_type=ChangeTypeEnum.NAME_CHANGE,
                event_description=f"Name changed from '{record1.name}' to '{record2.name}'",
                source_documentation="Detected from ISIL code match with different names"
            )
    return None
```
2. Wierden - Typo Handling
File to modify: Deduplicator logic
Add fuzzy matching for similar names in same city:
- Levenshtein distance ≤ 3 for names in same city → flag as potential typo ("Wederen" → "Wierden" is 3 edits, so a cutoff of 2 would miss this case)
- Add provenance note: "Name variant detected, may be typo"
Code location: src/glam_extractor/parsers/deduplicator.py
```python
from rapidfuzz.distance import Levenshtein


def detect_potential_typo(name1: str, name2: str, city: str) -> bool:
    """
    Detect potential typos in institution names.
    If names are very similar (Levenshtein distance <= 3) and in same city,
    likely a typo or variant spelling.
    """
    # `city` is not used in the comparison itself; callers are expected to
    # pass only pairs of names already known to share this city.
    distance = Levenshtein.distance(name1, name2)
    return distance <= 3
```
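As a cross-check on the threshold, the edit distance for the Wierden pair can be computed directly. The sketch below is a plain dynamic-programming Levenshtein implementation with unit costs, independent of any library; the `levenshtein` helper is written for illustration and is not part of the deduplicator.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


wierden = levenshtein("Historische Kring Wederen", "Historische Kring Wierden")
losser = levenshtein("de Historische Kring Losser", "Historische Kring Losser")
```

Both pairs come out at 3 edits: for Wierden, substitute e→i, delete d, insert d; for Losser, delete the three characters of the leading "de ". A cutoff of 2 would therefore flag neither pair, which is worth keeping in mind when tuning the threshold.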
3. Losser - Article Normalization
File to modify: Deduplicator match key generation
Improve Dutch article handling:
- Strip leading articles: "de", "het", "'t", "De", "Het"
- Before generating match key
Code location: src/glam_extractor/parsers/deduplicator.py (lines 94-127)
```python
import re

# Compared against the lowercased name, so lowercase forms are sufficient
DUTCH_ARTICLES = ['de ', 'het ', "'t "]


def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.
    - Lowercase
    - Remove punctuation
    - Strip Dutch articles
    - Remove extra whitespace
    """
    normalized = name.lower().strip()
    # Strip at most one leading Dutch article
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break
    # Remove punctuation (keeps word characters, whitespace, and hyphens)
    normalized = re.sub(r'[^\w\s-]', '', normalized)
    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()
    return normalized
```
Testing Strategy
Test 1: Rotterdam Name Change
```python
def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names
    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE
        for event in result[0].change_history
    )
```
Test 2: Wierden Typo Detection
```python
def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    # Should merge (treating as typo)
    assert len(result) == 1
    # Should flag in provenance
    assert "variant" in result[0].provenance.notes.lower() or \
        "typo" in result[0].provenance.notes.lower()
```
Test 3: Losser Article Normalization
```python
def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )
    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])
    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names
```
Expected Impact on Collision Report
After implementing these fixes:
Current state: 15 collision groups, 30 institutions
Expected after fixes:
- Rotterdam: -1 collision group (merged)
- Wierden: -1 collision group (merged with caveat)
- Losser: -1 collision group (merged)
New state: 12 collision groups, 24 institutions
Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)
Recommendations
Immediate Actions (High Priority)
- ✅ Implement article normalization (Losser case) - Low complexity, high impact
- ✅ Add ISIL-based name change detection (Rotterdam case) - Critical for data quality
- ⚠️ Add fuzzy matching for typos (Wierden case) - Requires manual verification
Medium-Term Actions
- Contact institutions for verification:
  - Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
  - Confirm Rotterdam name change timeline
- Improve ISIL registry integration:
  - Parse "Opmerking" field for change events
  - Extract name change dates from remarks
  - Auto-generate ChangeEvent records
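Parsing the remark field could start from something like the following. This is a sketch under assumptions: the `parse_name_change_remark` helper and the marker list are hypothetical, remarks are assumed to carry ISO-style dates as in the Rotterdam example, and both the spelling quoted from the registry ("naamswijzinging") and the standard Dutch spelling ("naamswijziging") are matched.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Spelling quoted in this document, the standard Dutch spelling, and English.
_NAME_CHANGE_MARKERS = ("naamswijzinging", "naamswijziging", "name change")
_ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def parse_name_change_remark(remark: str) -> Optional[Tuple[Optional[date], str]]:
    """Return (event_date, cleaned remark) if a name change is mentioned, else None."""
    lowered = remark.lower()
    if not any(marker in lowered for marker in _NAME_CHANGE_MARKERS):
        return None
    event_date = None
    match = _ISO_DATE.search(remark)
    if match:
        try:
            event_date = date(*map(int, match.groups()))
        except ValueError:
            # Looks like a date but is not a valid calendar day; leave unset.
            event_date = None
    return event_date, remark.strip()


parsed = parse_name_change_remark(
    "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
)
```

For the Rotterdam remark this yields `date(2024, 8, 6)` together with the remark text, which is enough to fill `event_date` and `source_documentation` on a generated ChangeEvent. Remarks without a recognizable marker return `None` and would be left for manual review.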
Long-Term Actions
- Build article database for multilingual support:
  - Dutch: de, het, 't
  - English: the, a, an
  - German: der, die, das, ein, eine
  - French: le, la, les, un, une
- Implement fuzzy matching pipeline:
  - Stage 1: Exact match (current)
  - Stage 2: Normalized match (articles stripped)
  - Stage 3: Fuzzy match (typo detection)
  - Stage 4: Manual review queue
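The staged pipeline and the article table could fit together roughly as follows. This is a sketch, not the deduplicator's design: the function names, the `max_edits=3` default, and the convention that a `"fuzzy"` result is routed to the manual review queue are all assumptions.

```python
from typing import Optional

# Article lists taken from the bullets above, keyed by language code.
ARTICLES = {
    "nl": ("de", "het", "'t"),
    "en": ("the", "a", "an"),
    "de": ("der", "die", "das", "ein", "eine"),
    "fr": ("le", "la", "les", "un", "une"),
}


def strip_article(name: str, lang: str = "nl") -> str:
    """Lowercase, collapse whitespace, and drop one leading article."""
    words = name.lower().split()
    if len(words) > 1 and words[0] in ARTICLES.get(lang, ()):
        words = words[1:]
    return " ".join(words)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_stage(name1: str, name2: str, lang: str = "nl",
                max_edits: int = 3) -> Optional[str]:
    """Return the first stage at which two names match, or None."""
    if name1.strip() == name2.strip():
        return "exact"        # Stage 1
    key1, key2 = strip_article(name1, lang), strip_article(name2, lang)
    if key1 == key2:
        return "normalized"   # Stage 2
    if edit_distance(key1, key2) <= max_edits:
        return "fuzzy"        # Stage 3: candidate for the review queue (Stage 4)
    return None


rotterdam = match_stage("Het Nieuwe Instituut", "Nieuwe Instituut")
wierden = match_stage("Historische Kring Wederen", "Historische Kring Wierden")
losser = match_stage("de Historische Kring Losser", "Historische Kring Losser")
```

With these inputs, Rotterdam and Losser match at the normalized stage and Wierden only at the fuzzy stage, which lines up with the "merge" versus "merge with caveat" decisions above. A fixed edit budget is generous for short names, so fuzzy hits are better treated as review candidates than as automatic merges.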
Files to Modify
- `src/glam_extractor/parsers/deduplicator.py` - Core logic
- `tests/parsers/test_deduplicator.py` - Add 3 new tests
- `scripts/apply_collision_resolution_dutch_datasets.py` - Re-run after fixes
Conclusion
All three edge cases are duplicates, not true collisions:
- Rotterdam: Name change (same ISIL code)
- Wierden: Probable typo (Wederen → Wierden)
- Losser: Article variance (de Historische Kring vs Historische Kring)
Implementing the recommended fixes will:
- ✅ Reduce collisions from 15 to 12 groups (-20%)
- ✅ Improve name normalization for Dutch institutions
- ✅ Detect name changes automatically from ISIL data
- ✅ Flag potential typos for manual review
Next step: Implement fixes in deduplicator code and re-run collision resolution.