glam/data/collision_edge_case_analysis.md
2025-11-19 23:25:22 +01:00

# Collision Edge Case Analysis - Manual Review
**Date**: 2025-11-07
**Purpose**: Investigate 3 ambiguous collision cases requiring human decision
**Analyst**: AI-assisted manual review
---
## Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"
### Evidence
**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 279):
- Institution: `Het Nieuwe Instituut`
- ISIL Code: `NL-RtHNI`
- Assigned: 2013-06-24
- **Remark**: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)
**Dutch Organizations CSV**:
- Institution: `Nieuwe Instituut`
- ISIL Code: `NL-RtHNI` (SAME as ISIL registry)
- City: Rotterdam
### Analysis
This is a **NAME CHANGE**, not two different institutions:
1. **Same ISIL code** (`NL-RtHNI`) in both datasets
2. **ISIL remark explicitly mentions name change** ("naamswijzinging")
3. Timeline:
   - 2013: ISIL assigned to "Het Nieuwe Instituut"
   - 2024-08-06: Name changed (remark added to ISIL registry)
   - Dutch orgs CSV shows new name "Nieuwe Instituut"
### Decision: **DUPLICATE - MERGE**
**Action**:
- These are the SAME institution at different points in time
- Use current name: `Nieuwe Instituut`
- Store historical name in `alternative_names`: `Het Nieuwe Instituut`
- Add `ChangeEvent`:
  - `change_type`: NAME_CHANGE
  - `event_date`: "2024-08-06"
  - `event_description`: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
  - `source_documentation`: ISIL registry remark
**GHCID Impact**:
- Current GHCID: `NL-XX-ROTT-M-NI`
- No Q-number needed (this is a duplicate, not a collision)
- Historical GHCID: Same (name change doesn't affect GHCID since abbreviation is same)
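The merge described under **Action** could be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Custodian` and `ChangeEvent` dataclasses and the `apply_name_change` helper are hypothetical stand-ins, with field names borrowed from the action list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeEvent:
    # Hypothetical stand-in; fields mirror the action list above
    change_type: str
    event_date: str
    event_description: str
    source_documentation: str


@dataclass
class Custodian:
    # Hypothetical stand-in for the project's institution record
    name: str
    alternative_names: List[str] = field(default_factory=list)
    change_history: List[ChangeEvent] = field(default_factory=list)


def apply_name_change(record: Custodian, new_name: str, event_date: str, source: str) -> Custodian:
    """Rename a record in place, keeping the old name and logging the change."""
    old_name = record.name
    if old_name == new_name:
        return record  # nothing to do
    if old_name not in record.alternative_names:
        record.alternative_names.append(old_name)
    record.name = new_name
    record.change_history.append(
        ChangeEvent(
            change_type="NAME_CHANGE",
            event_date=event_date,
            event_description=f"Name changed from '{old_name}' to '{new_name}'",
            source_documentation=source,
        )
    )
    return record


record = apply_name_change(
    Custodian(name="Het Nieuwe Instituut"),
    new_name="Nieuwe Instituut",
    event_date="2024-08-06",
    source="ISIL registry remark",
)
```

Keeping the old name in `alternative_names` means later lookups by either name still resolve to the same record.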
---
## Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"
### Evidence
**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 339):
- Institution: `Historische Kring Wederen` (with trailing space)
- ISIL Code: `NL-WdnHKW`
- City: Wierden
- Assigned: 2015-02-27
**Dutch Organizations CSV**:
- Entry 1: `Historische Kring Wederen` (Wierden) - ISIL: `NL-WdnHKW`, Platform: Mijn Stad Mijn Dorp
- Entry 2: `Historische Kring Wierden` (Wierden) - No ISIL, Platform: ZCBS
### Analysis
This appears to be a **TYPO/DATA ERROR**:
1. **Geographic context**: City is "Wierden" (not "Wederen")
2. **ISIL registry has "Wederen"** but city is Wierden → likely typo in institution name
3. **Dutch orgs CSV has BOTH variants**:
   - "Wederen" with ISIL code (matches ISIL registry)
   - "Wierden" without ISIL code (correct spelling based on city)
4. **Different platforms** (Mijn Stad Mijn Dorp vs ZCBS) suggests they might be tracked separately
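To put a number on how close the two spellings are, the edit distance can be computed directly. The sketch below uses a plain dynamic-programming Levenshtein implementation; the `levenshtein` helper is written here for illustration and is not part of the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute all cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


registry_name = "Historische Kring Wederen"
csv_name = "Historische Kring Wierden"
distance = levenshtein(registry_name, csv_name)
print(distance)  # 3: substitute e→i, delete d, insert d
```

Three edits across 25 characters is a small difference, which is consistent with a typo, but any automatic cutoff has to be at least 3 to flag this pair.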
### Possible Scenarios
**Scenario A: Same institution, typo in ISIL registry**
- ISIL registry has typo: "Wederen" → should be "Wierden"
- Dutch orgs CSV correctly lists "Historische Kring Wierden"
- Entry with ISIL is outdated/typo
**Scenario B: Two different organizations**
- One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
- One is "Historische Kring Wierden" (covering Wierden municipality)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this
**Most Likely**: Scenario A (typo)
### Decision: **PROBABLE DUPLICATE - MERGE WITH CAVEAT**
**Recommended Action**:
- Treat as same institution (typo correction)
- Use correct name: `Historische Kring Wierden`
- Store variant in `alternative_names`: `Historische Kring Wederen` (marked as "typo in ISIL registry")
- **Flag for verification**: Recommend contacting institution to confirm
**GHCID Impact**:
- Current base GHCID: `NL-XX-WIER-M-HKW`
- No Q-number needed if merged
- Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"
**Alternative Action** (if Scenario B is confirmed):
- Keep as separate institutions
- Add Q-numbers (each institution gets its own Wikidata Q-number as suffix):
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wederen"
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wierden"
---
## Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"
### Evidence
**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 226):
- Institution: `de Historische Kring Losser` (with trailing space and lowercase "de")
- ISIL Code: `NL-LsHKL`
- City: Losser
- Assigned: 2015-12-03
**Dutch Organizations CSV**:
- Entry 1: `de Historische Kring Losser` (Losser) - ISIL: `NL-LsHKL`
- Entry 2: `Historische Kring Losser` (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp
### Analysis
This is a **DEFINITE DUPLICATE** - same institution with/without article "de":
1. **Same city, same type** (historical society)
2. **Name differs only by article "de"** (common in Dutch)
3. **ISIL registry includes "de"** in official name
4. **Dutch orgs CSV has both variants**:
   - With "de" + ISIL (matches ISIL registry)
   - Without "de" + platform Mijn Stad Mijn Dorp (no ISIL)
**Dutch language context**:
- "de" = "the" in English
- Common for organizations to be referenced with/without article
- Both are valid references to same organization
- ISIL registry uses full official name with "de"
### Decision: **DUPLICATE - MERGE**
**Action**:
- These are the SAME institution
- Use official ISIL name: `de Historische Kring Losser`
- Store variant in `alternative_names`: `Historische Kring Losser`
- Normalize matching to handle articles (de, het, 't) in Dutch names
**GHCID Impact**:
- Current base GHCID: `NL-XX-LOSS-M-HKL`
- No Q-number needed (duplicate, not collision)
- Name normalization should strip articles for matching purposes
---
## Summary of Decisions
| Case | City | Status | Action | Q-Number Needed? |
|------|------|--------|--------|------------------|
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | ❌ No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | ❌ No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | ❌ No |
---
## Implementation Steps
### 1. Rotterdam - Name Change Handling
**File to modify**: Deduplicator logic
Add special case for name change detection:
- If ISIL codes match → always treat as same institution
- Check ISIL registry remarks for "naamswijziging" (the Rotterdam remark spells it "naamswijzinging") / "name change"
- Generate `ChangeEvent` with NAME_CHANGE type
**Code location**: `src/glam_extractor/parsers/deduplicator.py`
```python
from typing import Optional


def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    Assumes record1 carries the older name and record2 the newer one.
    """
    # Check if ISIL codes match
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    # Compare stripped names: the ISIL registry contains trailing spaces
    old_name = record1.name.strip()
    new_name = record2.name.strip()

    if isil1 and isil2 and isil1 == isil2:
        if old_name != new_name:
            # Same ISIL, different name → name change
            return ChangeEvent(
                change_type=ChangeTypeEnum.NAME_CHANGE,
                event_description=f"Name changed from '{old_name}' to '{new_name}'",
                source_documentation="Detected from ISIL code match with different names"
            )

    return None
```
### 2. Wierden - Typo Handling
**File to modify**: Deduplicator logic
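Add fuzzy matching for similar names in same city:
- Levenshtein distance ≤ 3 for names in the same city → flag as potential typo ("Wederen" and "Wierden" are 3 edits apart, so a cutoff of 2 would miss this case)
- Add provenance note: "Name variant detected, may be typo"
**Code location**: `src/glam_extractor/parsers/deduplicator.py`
```python
from rapidfuzz.distance import Levenshtein


def detect_potential_typo(name1: str, name2: str, city: str, max_distance: int = 3) -> bool:
    """
    Detect potential typos in institution names.

    If names are very similar (Levenshtein distance <= max_distance) and in
    same city, likely a typo or variant spelling.

    `city` is the city shared by both records; the caller is expected to
    compare only names from the same city.
    """
    # Ignore case and stray whitespace (the ISIL registry has trailing spaces)
    distance = Levenshtein.distance(name1.strip().lower(), name2.strip().lower())
    # "Wederen" vs "Wierden" is 3 edits apart, hence the default of 3
    return distance <= max_distance
```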
### 3. Losser - Article Normalization
**File to modify**: Deduplicator match key generation
Improve Dutch article handling:
- Strip leading articles ("de", "het", "'t", in any casing) before generating the match key
**Code location**: `src/glam_extractor/parsers/deduplicator.py` (lines 94-127)
```python
import re

# Compared against the lowercased name, so one lowercase entry per article suffices
DUTCH_ARTICLES = ('de ', 'het ', "'t ")


def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Strip Dutch articles
    - Remove punctuation
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip Dutch articles (before punctuation removal, so "'t" is still intact)
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break

    # Remove punctuation
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized
```
---
## Testing Strategy
### Test 1: Rotterdam Name Change
```python
def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names

    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE
        for event in result[0].change_history
    )
```
### Test 2: Wierden Typo Detection
```python
def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (treating as typo)
    assert len(result) == 1

    # Should flag in provenance
    assert "variant" in result[0].provenance.notes.lower() or \
        "typo" in result[0].provenance.notes.lower()
```
### Test 3: Losser Article Normalization
```python
def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names
```
---
## Expected Impact on Collision Report
After implementing these fixes:
**Current state**: 15 collision groups, 30 institutions
**Expected after fixes**:
- **Rotterdam**: -1 collision group (merged)
- **Wierden**: -1 collision group (merged with caveat)
- **Losser**: -1 collision group (merged)
**New state**: **12 collision groups, 24 institutions**
Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)
---
## Recommendations
### Immediate Actions (High Priority)
1. **Implement article normalization** (Losser case) - Low complexity, high impact
2. **Add ISIL-based name change detection** (Rotterdam case) - Critical for data quality
3. **Add fuzzy matching for typos** (Wierden case) - Requires manual verification
### Medium-Term Actions
4. **Contact institutions for verification**:
   - Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
   - Confirm Rotterdam name change timeline
5. **Improve ISIL registry integration**:
   - Parse "Opmerking" field for change events
   - Extract name change dates from remarks
   - Auto-generate ChangeEvent records
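As a rough sketch of the remark parsing in item 5, the function below scans a remark for a name-change keyword and an ISO date. The function name, the returned field layout, and the assumption that remarks carry dates as `YYYY-MM-DD` are illustrative only; the single sample remark in this analysis is the only evidence for that format.

```python
import re
from typing import Dict, Optional

# Matches "naamswijziging", the "naamswijzinging" spelling seen in the
# Rotterdam remark, and the English "name change"
NAME_CHANGE_PATTERN = re.compile(r"naamswijzin?ging|name change", re.IGNORECASE)
ISO_DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def parse_name_change_remark(remark: str) -> Optional[Dict[str, str]]:
    """Return change-event fields if the remark mentions a name change, else None."""
    if not remark or not NAME_CHANGE_PATTERN.search(remark):
        return None
    date_match = ISO_DATE_PATTERN.search(remark)
    return {
        "change_type": "NAME_CHANGE",
        # Empty string when the remark has no recognizable date
        "event_date": date_match.group(1) if date_match else "",
        "source_documentation": f"ISIL registry remark: {remark.strip()}",
    }


event = parse_name_change_remark(
    "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
)
```

Note that the date found in a remark is the date the remark was written, which may differ from the date the name actually changed; that is one reason to confirm the Rotterdam timeline with the institution.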
### Long-Term Actions
6. **Build article database** for multilingual support:
   - Dutch: de, het, 't
   - English: the, a, an
   - German: der, die, das, ein, eine
   - French: le, la, les, un, une
7. **Implement fuzzy matching pipeline**:
   - Stage 1: Exact match (current)
   - Stage 2: Normalized match (articles stripped)
   - Stage 3: Fuzzy match (typo detection)
   - Stage 4: Manual review queue
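Items 6 and 7 could fit together along the following lines. This is only a sketch under stated assumptions: `ARTICLES`, `normalize`, `match_stage`, and `review_queue` are hypothetical names, the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the project adopts, and the 0.9 similarity threshold is an untuned guess.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

# Leading articles per language, taken from the list above
ARTICLES = {
    "nl": ("de", "het", "'t"),
    "en": ("the", "a", "an"),
    "de": ("der", "die", "das", "ein", "eine"),
    "fr": ("le", "la", "les", "un", "une"),
}

# Stage 4: fuzzy pairs waiting for a human decision
review_queue: List[Tuple[str, str]] = []


def normalize(name: str, lang: str = "nl") -> str:
    """Lowercase, drop one leading article, strip punctuation and extra spaces."""
    text = name.lower().strip()
    for article in ARTICLES.get(lang, ()):
        if text.startswith(article + " "):
            text = text[len(article) + 1:]
            break
    text = re.sub(r"[^\w\s-]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def match_stage(name_a: str, name_b: str, lang: str = "nl", threshold: float = 0.9) -> str:
    """Return the first stage at which two names match, or 'none'."""
    if name_a == name_b:
        return "exact"                          # Stage 1
    norm_a, norm_b = normalize(name_a, lang), normalize(name_b, lang)
    if norm_a == norm_b:
        return "normalized"                     # Stage 2
    if SequenceMatcher(None, norm_a, norm_b).ratio() >= threshold:
        review_queue.append((name_a, name_b))   # Stage 4: queue for review
        return "fuzzy"                          # Stage 3
    return "none"
```

With these settings the Losser pair resolves at the normalized stage, while the Wederen/Wierden pair only matches at the fuzzy stage and lands in the review queue, which mirrors the "merge with caveat" decision above.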
---
## Files to Modify
1. `src/glam_extractor/parsers/deduplicator.py` - Core logic
2. `tests/parsers/test_deduplicator.py` - Add 3 new tests
3. `scripts/apply_collision_resolution_dutch_datasets.py` - Re-run after fixes
---
## Conclusion
All three edge cases are **duplicates, not true collisions**:
1. **Rotterdam**: Name change (same ISIL code)
2. **Wierden**: Probable typo (Wederen / Wierden)
3. **Losser**: Article variance (de Historische Kring vs Historische Kring)
Implementing the recommended fixes will:
- Reduce collisions from 15 to 12 groups (-20%)
- Improve name normalization for Dutch institutions
- Detect name changes automatically from ISIL data
- Flag potential typos for manual review
**Next step**: Implement fixes in deduplicator code and re-run collision resolution.