# Collision Edge Case Analysis - Manual Review

**Date**: 2025-11-07
**Purpose**: Investigate 3 ambiguous collision cases requiring human decision
**Analyst**: AI-assisted manual review

---

## Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 279):
- Institution: `Het Nieuwe Instituut`
- ISIL Code: `NL-RtHNI`
- Assigned: 2013-06-24
- **Remark**: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)

**Dutch Organizations CSV**:
- Institution: `Nieuwe Instituut`
- ISIL Code: `NL-RtHNI` (SAME as ISIL registry)
- City: Rotterdam

### Analysis

This is a **NAME CHANGE**, not two different institutions:

1. **Same ISIL code** (`NL-RtHNI`) in both datasets
2. **ISIL remark explicitly mentions name change** ("naamswijzinging")
3. Timeline:
   - 2013: ISIL assigned to "Het Nieuwe Instituut"
   - 2024-08-06: Name change recorded (remark added to ISIL registry)
   - Dutch orgs CSV shows new name "Nieuwe Instituut"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution at different points in time
- Use current name: `Nieuwe Instituut`
- Store historical name in `alternative_names`: `Het Nieuwe Instituut`
- Add `ChangeEvent`:
  - `change_type`: NAME_CHANGE
  - `event_date`: "2024-08-06"
  - `event_description`: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
  - `source_documentation`: ISIL registry remark
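
The `ChangeEvent` fields listed above could be populated roughly as follows. This is a minimal sketch: the `ChangeEvent` dataclass and `ChangeTypeEnum` here are hypothetical stand-ins for the project's own models, with field names taken from the list and the field types assumed.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeTypeEnum(Enum):
    # Hypothetical stand-in; only the value used in this case is listed.
    NAME_CHANGE = "NAME_CHANGE"


@dataclass(frozen=True)
class ChangeEvent:
    # Field names follow the Action list; the types are assumptions.
    change_type: ChangeTypeEnum
    event_date: str  # ISO 8601 date string
    event_description: str
    source_documentation: str


old_name = "Het Nieuwe Instituut"
new_name = "Nieuwe Instituut"

rotterdam_name_change = ChangeEvent(
    change_type=ChangeTypeEnum.NAME_CHANGE,
    event_date="2024-08-06",  # date carried by the ISIL registry remark
    event_description=f"Name changed from '{old_name}' to '{new_name}'",
    source_documentation="ISIL registry remark",
)
```
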

**GHCID Impact**:
- Current GHCID: `NL-XX-ROTT-M-NI`
- No Q-number needed (this is a duplicate, not a collision)
- Historical GHCID: Same (the name change does not affect the GHCID because the abbreviation `NI` is unchanged)

---

## Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 339):
- Institution: `Historische Kring Wederen` (with trailing space)
- ISIL Code: `NL-WdnHKW`
- City: Wierden
- Assigned: 2015-02-27

**Dutch Organizations CSV**:
- Entry 1: `Historische Kring Wederen` (Wierden) - ISIL: `NL-WdnHKW`, Platform: Mijn Stad Mijn Dorp
- Entry 2: `Historische Kring Wierden` (Wierden) - No ISIL, Platform: ZCBS

### Analysis

This appears to be a **TYPO/DATA ERROR**:

1. **Geographic context**: City is "Wierden" (not "Wederen")
2. **ISIL registry has "Wederen"** but city is Wierden → likely typo in institution name
3. **Dutch orgs CSV has BOTH variants**:
   - "Wederen" with ISIL code (matches ISIL registry)
   - "Wierden" without ISIL code (correct spelling based on city)
4. **Different platforms** (Mijn Stad Mijn Dorp vs ZCBS) suggest the two entries might be tracked separately

### Possible Scenarios

**Scenario A: Same institution, typo in ISIL registry**
- ISIL registry has typo: "Wederen" → should be "Wierden"
- Dutch orgs CSV correctly lists "Historische Kring Wierden"
- The CSV entry with the ISIL code copies the registry typo

**Scenario B: Two different organizations**
- One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
- One is "Historische Kring Wierden" (covering Wierden municipality)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this

**Most Likely**: Scenario A (typo)

### Decision: **PROBABLE DUPLICATE - MERGE WITH CAVEAT**

**Recommended Action**:
- Treat as same institution (typo correction)
- Use correct name: `Historische Kring Wierden`
- Store variant in `alternative_names`: `Historische Kring Wederen` (marked as "typo in ISIL registry")
- **Flag for verification**: Recommend contacting institution to confirm

**GHCID Impact**:
- Current base GHCID: `NL-XX-WIER-M-HKW`
- No Q-number needed if merged
- Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"

**Alternative Action** (if Scenario B is confirmed):
- Keep as separate institutions
- Add Q-numbers, each institution getting its own Wikidata ID as suffix:
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wederen"
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wierden"
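
If Scenario B applies, the Q-number suffix could be appended with a small helper along these lines. This is only a sketch: `append_q_number` is a hypothetical name, and the check that a Wikidata item ID is "Q" followed by digits is an assumption about the expected input, not part of the GHCID rules stated here.

```python
import re

# Assumed shape of a Wikidata item ID: "Q" followed by a positive integer.
_QID_PATTERN = re.compile(r"Q[1-9]\d*")


def append_q_number(base_ghcid: str, wikidata_id: str) -> str:
    """Return the base GHCID with a Wikidata Q-number suffix (hypothetical helper)."""
    if not _QID_PATTERN.fullmatch(wikidata_id):
        raise ValueError(f"Not a Wikidata item ID: {wikidata_id!r}")
    # Two institutions sharing a base GHCID stay distinct through their suffixes.
    return f"{base_ghcid}-{wikidata_id}"
```
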

---

## Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 226):
- Institution: `de Historische Kring Losser` (with trailing space and lowercase "de")
- ISIL Code: `NL-LsHKL`
- City: Losser
- Assigned: 2015-12-03

**Dutch Organizations CSV**:
- Entry 1: `de Historische Kring Losser` (Losser) - ISIL: `NL-LsHKL`
- Entry 2: `Historische Kring Losser` (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp

### Analysis

This is a **DEFINITE DUPLICATE** - same institution with/without article "de":

1. **Same city, same type** (historical society)
2. **Name differs only by article "de"** (common in Dutch)
3. **ISIL registry includes "de"** in official name
4. **Dutch orgs CSV has both variants**:
   - With "de" + ISIL (matches ISIL registry)
   - Without "de", no ISIL, platform Mijn Stad Mijn Dorp (tracks a different aspect of the same institution)

**Dutch language context**:
- "de" = "the" in English
- Common for organizations to be referenced with/without article
- Both are valid references to same organization
- ISIL registry uses full official name with "de"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution
- Use official ISIL name: `de Historische Kring Losser`
- Store variant in `alternative_names`: `Historische Kring Losser`
- Normalize matching to handle articles (de, het, 't) in Dutch names

**GHCID Impact**:
- Current base GHCID: `NL-XX-LOSS-M-HKL`
- No Q-number needed (duplicate, not collision)
- Name normalization should strip articles for matching purposes

---

## Summary of Decisions

| Case | City | Status | Action | Q-Number Needed? |
|------|------|--------|--------|------------------|
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | ❌ No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | ❌ No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | ❌ No |

---

## Implementation Steps

### 1. Rotterdam - Name Change Handling

**File to modify**: Deduplicator logic

Add special case for name change detection:
- If ISIL codes match → always treat as same institution
- Check ISIL registry remarks for "naamswijzinging" (as spelled in the remark above; the standard Dutch spelling is "naamswijziging") / "name change"
- Generate `ChangeEvent` with NAME_CHANGE type

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from typing import Optional


def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    record1 is assumed to carry the older name, record2 the newer one.
    """
    # Check if ISIL codes match
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    if isil1 and isil2 and isil1 == isil2:
        # Strip whitespace so a trailing space in the registry name
        # is not mistaken for a name change
        old_name = record1.name.strip()
        new_name = record2.name.strip()
        if old_name != new_name:
            # Same ISIL, different name → name change
            return ChangeEvent(
                change_type=ChangeTypeEnum.NAME_CHANGE,
                event_description=f"Name changed from '{old_name}' to '{new_name}'",
                source_documentation="Detected from ISIL code match with different names"
            )
    return None
```

### 2. Wierden - Typo Handling

**File to modify**: Deduplicator logic

Add fuzzy matching for similar names in same city:
- Levenshtein distance ≤ 3 for names in same city → flag as potential typo ("Wederen" and "Wierden" are 3 edits apart, so a lower threshold would miss this case)
- Add provenance note: "Name variant detected, may be typo"

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from rapidfuzz.distance import Levenshtein

# "Wederen" vs "Wierden" is 3 edits apart, so the threshold must be at least 3.
MAX_TYPO_DISTANCE = 3


def detect_potential_typo(name1: str, name2: str, city: str) -> bool:
    """
    Detect potential typos in institution names.

    If names are very similar (Levenshtein distance <= MAX_TYPO_DISTANCE)
    and in same city, likely a typo or variant spelling. Callers pass only
    name pairs that share `city`.
    """
    distance = Levenshtein.distance(name1, name2)
    return distance <= MAX_TYPO_DISTANCE
```
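
Before settling on a threshold, it helps to compute the distance for the pair that motivated this step. The sketch below uses a plain dynamic-programming Levenshtein implementation from the standard library only, so it runs without `rapidfuzz`; the function name `levenshtein` is illustrative and not part of the deduplicator.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("Historische Kring Wederen", "Historische Kring Wierden")
print(distance)  # 3
```

One shortest edit path is "Wederen" → "Wderen" → "Wieren" → "Wierden" (delete, substitute, insert), so a typo threshold has to allow at least 3 edits to flag this pair.
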

### 3. Losser - Article Normalization

**File to modify**: Deduplicator match key generation

Improve Dutch article handling:
- Strip leading articles ("de", "het", "'t", "De", "Het") before generating the match key

**Code location**: `src/glam_extractor/parsers/deduplicator.py` (lines 94-127)

```python
import re

# Lowercase only: the name is lowercased before articles are compared,
# so "De" and "Het" are covered as well.
DUTCH_ARTICLES = ['de ', 'het ', "'t "]


def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Strip Dutch articles
    - Remove punctuation
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip a single leading Dutch article
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break

    # Remove punctuation (keep word characters, whitespace, and hyphens)
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized
```

---

## Testing Strategy

### Test 1: Rotterdam Name Change

```python
def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names

    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE
        for event in result[0].change_history
    )
```

### Test 2: Wierden Typo Detection

```python
def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (treating as typo)
    assert len(result) == 1

    # Should flag in provenance
    notes = result[0].provenance.notes.lower()
    assert "variant" in notes or "typo" in notes
```

### Test 3: Losser Article Normalization

```python
def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names
```

---

## Expected Impact on Collision Report

After implementing these fixes:

**Current state**: 15 collision groups, 30 institutions

**Expected after fixes**:
- **Rotterdam**: -1 collision group (merged)
- **Wierden**: -1 collision group (merged with caveat)
- **Losser**: -1 collision group (merged)

**New state**: **12 collision groups, 24 institutions**

Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)
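
The counts above can be re-derived with a few lines of arithmetic. The sketch assumes that each of the three resolved groups contains exactly two institution records, which is what the drop from 30 to 24 implies; the variable names are illustrative.

```python
groups_before, institutions_before = 15, 30
merged_groups = 3           # Rotterdam, Wierden, Losser
institutions_per_group = 2  # assumption: each resolved group is a pair

groups_after = groups_before - merged_groups
institutions_after = institutions_before - merged_groups * institutions_per_group
reduction_pct = 100 * merged_groups / groups_before

print(groups_after, institutions_after, reduction_pct)  # 12 24 20.0
```
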

---

## Recommendations

### Immediate Actions (High Priority)

1. ✅ **Implement article normalization** (Losser case) - Low complexity, high impact
2. ✅ **Add ISIL-based name change detection** (Rotterdam case) - Critical for data quality
3. ⚠️ **Add fuzzy matching for typos** (Wierden case) - Requires manual verification

### Medium-Term Actions

4. **Contact institutions for verification**:
   - Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
   - Confirm Rotterdam name change timeline

5. **Improve ISIL registry integration**:
   - Parse "Opmerking" field for change events
   - Extract name change dates from remarks
   - Auto-generate ChangeEvent records
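
Extracting a date from a remark could look roughly like the sketch below. It assumes remarks carry an ISO-style `YYYY-MM-DD` date, as the Rotterdam remark does, and that a small keyword list is enough to recognize name changes; `extract_name_change_date` and `NAME_CHANGE_KEYWORDS` are hypothetical names, and other remark formats in the registry are not covered.

```python
import re
from datetime import date
from typing import Optional

# Both the standard spelling and the spelling seen in the Rotterdam remark.
NAME_CHANGE_KEYWORDS = ("naamswijziging", "naamswijzinging", "name change")
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_name_change_date(remark: str) -> Optional[date]:
    """Return the first ISO date in a remark that mentions a name change."""
    lowered = remark.lower()
    if not any(keyword in lowered for keyword in NAME_CHANGE_KEYWORDS):
        return None
    match = ISO_DATE.search(remark)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but are not a valid calendar day
        return None


remark = "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
print(extract_name_change_date(remark))  # 2024-08-06
```

The returned date is the date written in the remark, which may differ from the day the institution actually changed its name, so it is worth keeping the verification step listed above.
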

### Long-Term Actions

6. **Build article database** for multilingual support:
   - Dutch: de, het, 't
   - English: the, a, an
   - German: der, die, das, ein, eine
   - French: le, la, les, un, une

7. **Implement fuzzy matching pipeline**:
   - Stage 1: Exact match (current)
   - Stage 2: Normalized match (articles stripped)
   - Stage 3: Fuzzy match (typo detection)
   - Stage 4: Manual review queue
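
The article table and the four stages could fit together as in the following sketch. Everything here is illustrative: `ARTICLES`, `normalize`, `levenshtein`, and `match_stage` are hypothetical names, the language codes are an assumed keying scheme, and the sketch assumes both names already belong to the same collision group (same city and type).

```python
from typing import Dict, Tuple

# Leading articles per language, taken from the list above.
ARTICLES: Dict[str, Tuple[str, ...]] = {
    "nl": ("de", "het", "'t"),
    "en": ("the", "a", "an"),
    "de": ("der", "die", "das", "ein", "eine"),
    "fr": ("le", "la", "les", "un", "une"),
}


def normalize(name: str, language: str = "nl") -> str:
    """Lowercase, collapse whitespace, and drop one leading article."""
    words = name.lower().split()
    if len(words) > 1 and words[0] in ARTICLES.get(language, ()):
        words = words[1:]
    return " ".join(words)


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_stage(name_a: str, name_b: str, language: str = "nl",
                max_distance: int = 3) -> str:
    """Return the first stage at which the two names are considered a match."""
    if name_a == name_b:
        return "exact"            # Stage 1
    norm_a, norm_b = normalize(name_a, language), normalize(name_b, language)
    if norm_a == norm_b:
        return "normalized"       # Stage 2
    if levenshtein(norm_a, norm_b) <= max_distance:
        return "fuzzy"            # Stage 3
    return "manual_review"        # Stage 4


print(match_stage("de Historische Kring Losser", "Historische Kring Losser"))   # normalized
print(match_stage("Historische Kring Wederen", "Historische Kring Wierden"))    # fuzzy
```
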

---

## Files to Modify

1. `src/glam_extractor/parsers/deduplicator.py` - Core logic
2. `tests/parsers/test_deduplicator.py` - Add 3 new tests
3. `scripts/apply_collision_resolution_dutch_datasets.py` - Re-run after fixes

---

## Conclusion

All three edge cases are **duplicates, not true collisions** (the Wierden case pending verification with the institution):

1. **Rotterdam**: Name change (same ISIL code)
2. **Wierden**: Probable typo (Wederen → Wierden)
3. **Losser**: Article variation (de Historische Kring vs Historische Kring)

Implementing the recommended fixes will:
- ✅ Reduce collisions from 15 to 12 groups (-20%)
- ✅ Improve name normalization for Dutch institutions
- ✅ Detect name changes automatically from ISIL data
- ✅ Flag potential typos for manual review

**Next step**: Implement fixes in deduplicator code and re-run collision resolution.