glam/data/collision_edge_case_analysis.md

# Collision Edge Case Analysis - Manual Review

**Date**: 2025-11-07
**Purpose**: Investigate 3 ambiguous collision cases requiring human decision
**Analyst**: AI-assisted manual review

---

## Case 1: Rotterdam - "Het Nieuwe Instituut" vs "Nieuwe Instituut"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 279):
- Institution: `Het Nieuwe Instituut`
- ISIL Code: `NL-RtHNI`
- Assigned: 2013-06-24
- **Remark**: "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging." (New abbreviation due to name change)

**Dutch Organizations CSV**:
- Institution: `Nieuwe Instituut`
- ISIL Code: `NL-RtHNI` (SAME as ISIL registry)
- City: Rotterdam

### Analysis

This is a **NAME CHANGE**, not two different institutions:

1. **Same ISIL code** (`NL-RtHNI`) in both datasets
2. **ISIL remark explicitly mentions name change** ("naamswijzinging")
3. Timeline:
   - 2013: ISIL assigned to "Het Nieuwe Instituut"
   - 2024-08-06: Name changed (remark added to ISIL registry)
   - Dutch orgs CSV shows new name "Nieuwe Instituut"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution at different points in time
- Use current name: `Nieuwe Instituut`
- Store historical name in `alternative_names`: `Het Nieuwe Instituut`
- Add `ChangeEvent`:
  - `change_type`: NAME_CHANGE
  - `event_date`: "2024-08-06"
  - `event_description`: "Name changed from 'Het Nieuwe Instituut' to 'Nieuwe Instituut'"
  - `source_documentation`: ISIL registry remark
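
The fields listed above map onto a small record. The sketch below uses stand-in `ChangeTypeEnum` and `ChangeEvent` definitions so that it runs on its own; the project's real classes may differ in field types or required arguments, and the helper `name_change_event` is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class ChangeTypeEnum(Enum):
    """Stand-in for the project's enum; only the member used here."""
    NAME_CHANGE = "NAME_CHANGE"


@dataclass
class ChangeEvent:
    """Stand-in mirroring the four fields listed in the action above."""
    change_type: ChangeTypeEnum
    event_date: date
    event_description: str
    source_documentation: str


def name_change_event(old_name: str, new_name: str,
                      event_date: date, source: str) -> ChangeEvent:
    """Build a NAME_CHANGE event (hypothetical helper, not project code)."""
    return ChangeEvent(
        change_type=ChangeTypeEnum.NAME_CHANGE,
        event_date=event_date,
        event_description=f"Name changed from '{old_name}' to '{new_name}'",
        source_documentation=source,
    )


# Values taken from the Rotterdam evidence above
rotterdam_event = name_change_event(
    "Het Nieuwe Instituut", "Nieuwe Instituut",
    date(2024, 8, 6), "ISIL registry remark",
)
```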

**GHCID Impact**:
- Current GHCID: `NL-XX-ROTT-M-NI`
- No Q-number needed (this is a duplicate, not a collision)
- Historical GHCID: Same (the name change does not affect the GHCID because the abbreviation stays the same)

---

## Case 2: Wierden - "Historische Kring Wederen" vs "Historische Kring Wierden"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 339):
- Institution: `Historische Kring Wederen` (with trailing space)
- ISIL Code: `NL-WdnHKW`
- City: Wierden
- Assigned: 2015-02-27

**Dutch Organizations CSV**:
- Entry 1: `Historische Kring Wederen` (Wierden) - ISIL: `NL-WdnHKW`, Platform: Mijn Stad Mijn Dorp
- Entry 2: `Historische Kring Wierden` (Wierden) - No ISIL, Platform: ZCBS

### Analysis

This appears to be a **TYPO/DATA ERROR**:

1. **Geographic context**: City is "Wierden" (not "Wederen")
2. **ISIL registry has "Wederen"** but city is Wierden → likely typo in institution name
3. **Dutch orgs CSV has BOTH variants**:
   - "Wederen" with ISIL code (matches ISIL registry)
   - "Wierden" without ISIL code (correct spelling based on city)
4. **Different platforms** (Mijn Stad Mijn Dorp vs ZCBS) suggest the two entries might be tracked separately

### Possible Scenarios

**Scenario A: Same institution, typo in ISIL registry**
- ISIL registry has typo: "Wederen" → should be "Wierden"
- Dutch orgs CSV correctly lists "Historische Kring Wierden"
- The CSV entry with the ISIL code is outdated or repeats the registry's typo

**Scenario B: Two different organizations**
- One is "Historische Kring Wederen" (possibly covering nearby Weerdinge/Weere area?)
- One is "Historische Kring Wierden" (covering Wierden municipality)
- Different platforms (Mijn Stad Mijn Dorp vs ZCBS) support this

**Most Likely**: Scenario A (typo)

### Decision: **PROBABLE DUPLICATE - MERGE WITH CAVEAT**

**Recommended Action**:
- Treat as same institution (typo correction)
- Use correct name: `Historische Kring Wierden`
- Store variant in `alternative_names`: `Historische Kring Wederen` (marked as "typo in ISIL registry")
- **Flag for verification**: Recommend contacting institution to confirm

**GHCID Impact**:
- Current base GHCID: `NL-XX-WIER-M-HKW`
- No Q-number needed if merged
- Add provenance note: "Name variant 'Wederen' appears in ISIL registry, likely typo for 'Wierden'"

**Alternative Action** (if Scenario B is confirmed):
- Keep as separate institutions
- Add Q-numbers:
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wederen"
  - `NL-XX-WIER-M-HKW-Q<wikidata>` for "Wierden"
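
If Scenario B is confirmed, the suffixing step is mechanical. The following is a minimal sketch, assuming the Q-number is appended to the base GHCID with a hyphen as in the pattern above; `with_q_suffix` is a hypothetical helper, and the Q-numbers used are placeholders for illustration, not the institutions' actual Wikidata identifiers.

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer
WIKIDATA_ID = re.compile(r"Q[1-9]\d*")


def with_q_suffix(base_ghcid: str, wikidata_id: str) -> str:
    """Append a Wikidata Q-number to a base GHCID to disambiguate a collision."""
    if not WIKIDATA_ID.fullmatch(wikidata_id):
        raise ValueError(f"Not a Wikidata Q-number: {wikidata_id!r}")
    return f"{base_ghcid}-{wikidata_id}"


# Placeholder Q-numbers (NOT the real IDs of these institutions)
wederen_ghcid = with_q_suffix("NL-XX-WIER-M-HKW", "Q111")
wierden_ghcid = with_q_suffix("NL-XX-WIER-M-HKW", "Q222")
```

The two resulting identifiers share the base GHCID and differ only in the suffix, which is what keeps the records distinct.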

---

## Case 3: Losser - "de Historische Kring Losser" vs "Historische Kring Losser"

### Evidence

**ISIL Registry** (`ISIL-codes_2025-08-01.csv`, line 226):
- Institution: `de Historische Kring Losser` (with trailing space and lowercase "de")
- ISIL Code: `NL-LsHKL`
- City: Losser
- Assigned: 2015-12-03

**Dutch Organizations CSV**:
- Entry 1: `de Historische Kring Losser` (Losser) - ISIL: `NL-LsHKL`
- Entry 2: `Historische Kring Losser` (Losser) - No ISIL, Platform: Mijn Stad Mijn Dorp

### Analysis

This is a **DEFINITE DUPLICATE** - same institution with/without article "de":

1. **Same city, same type** (historical society)
2. **Name differs only by article "de"** (common in Dutch)
3. **ISIL registry includes "de"** in official name
4. **Dutch orgs CSV has both variants**:
   - With "de" + ISIL (matches ISIL registry)
   - Without "de" + platform listing (Mijn Stad Mijn Dorp), no ISIL

**Dutch language context**:
- "de" = "the" in English
- Common for organizations to be referenced with/without article
- Both are valid references to same organization
- ISIL registry uses full official name with "de"

### Decision: **DUPLICATE - MERGE**

**Action**:
- These are the SAME institution
- Use official ISIL name: `de Historische Kring Losser`
- Store variant in `alternative_names`: `Historische Kring Losser`
- Normalize matching to handle articles (de, het, 't) in Dutch names

**GHCID Impact**:
- Current base GHCID: `NL-XX-LOSS-M-HKL`
- No Q-number needed (duplicate, not collision)
- Name normalization should strip articles for matching purposes

---

## Summary of Decisions

| Case | City | Status | Action | Q-Number Needed? |
|------|------|--------|--------|------------------|
| Het Nieuwe Instituut / Nieuwe Instituut | Rotterdam | NAME CHANGE | Merge, add ChangeEvent | ❌ No |
| Historische Kring Wederen / Wierden | Wierden | PROBABLE TYPO | Merge with caveat, flag for verification | ❌ No (if merged) |
| de Historische Kring Losser / Historische Kring Losser | Losser | DUPLICATE (article) | Merge, improve normalization | ❌ No |

---

## Implementation Steps

### 1. Rotterdam - Name Change Handling

**File to modify**: Deduplicator logic

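Add a special case for name-change detection:
- If ISIL codes match → always treat as the same institution
- Check ISIL registry remarks for "naamswijziging" (standard Dutch spelling; also accept the variant "naamswijzinging" quoted in the Rotterdam remark) or "name change"
- Generate a `ChangeEvent` with the NAME_CHANGE type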

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from typing import Optional


def detect_name_change(record1, record2) -> Optional[ChangeEvent]:
    """
    Detect name changes by comparing ISIL codes + institution names.

    If ISIL codes match but names differ → likely a name change.
    Assumes record1 carries the older name and record2 the current one.
    """
    isil1 = get_isil_code(record1)
    isil2 = get_isil_code(record2)

    # Without a shared ISIL code there is no evidence of a rename
    if not (isil1 and isil2 and isil1 == isil2):
        return None

    # Ignore stray whitespace (ISIL registry names can carry trailing spaces)
    old_name, new_name = record1.name.strip(), record2.name.strip()
    if old_name == new_name:
        return None

    # Same ISIL, different name → name change
    return ChangeEvent(
        change_type=ChangeTypeEnum.NAME_CHANGE,
        event_description=f"Name changed from '{old_name}' to '{new_name}'",
        source_documentation="Detected from ISIL code match with different names",
    )
```

### 2. Wierden - Typo Handling

**File to modify**: Deduplicator logic

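Add fuzzy matching for similar names in the same city:
- Levenshtein distance ≤ 3 for names in the same city → flag as potential typo (the "Wederen"/"Wierden" pair is 3 edits apart, so a cutoff below 3 would miss it)
- Add provenance note: "Name variant detected, may be typo"

**Code location**: `src/glam_extractor/parsers/deduplicator.py`

```python
from rapidfuzz.distance import Levenshtein

# "Wederen" vs "Wierden" is 3 edits apart, so the cutoff must be at least 3
MAX_TYPO_DISTANCE = 3


def detect_potential_typo(name1: str, name2: str, city: str) -> bool:
    """
    Detect potential typos in institution names.

    If names are very similar (Levenshtein distance <= MAX_TYPO_DISTANCE)
    and in the same city, likely a typo or variant spelling. Callers are
    expected to pass only names recorded for the same `city`.
    """
    # Ignore case and stray whitespace (ISIL registry names can carry trailing spaces)
    a, b = name1.strip().lower(), name2.strip().lower()
    return Levenshtein.distance(a, b) <= MAX_TYPO_DISTANCE
```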
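
The cutoff can be sanity-checked against the Wierden pair without any third-party dependency. The helper below is a textbook dynamic-programming edit distance written for illustration (it is not code from the deduplicator), and the variable names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


wierden_distance = levenshtein("Historische Kring Wederen",
                               "Historische Kring Wierden")
losser_distance = levenshtein("de Historische Kring Losser",
                              "Historische Kring Losser")
print(wierden_distance, losser_distance)  # prints: 3 3
```

Both pairs are three edits apart, so any cutoff below 3 leaves the Wierden variant unflagged. A cutoff this loose can also match unrelated short names, which is one reason to restrict the comparison to names in the same city and to flag rather than auto-merge.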

### 3. Losser - Article Normalization

**File to modify**: Deduplicator match key generation

Improve Dutch article handling:
- Strip leading articles ("de", "het", "'t", in any capitalization) before generating the match key

**Code location**: `src/glam_extractor/parsers/deduplicator.py` (lines 94-127)

```python
import re

# Lowercase only: names are lowercased before the article check
DUTCH_ARTICLES = ('de ', 'het ', "'t ")


def normalize_name_for_matching(name: str) -> str:
    """
    Normalize institution name for duplicate detection.

    - Lowercase
    - Strip a leading Dutch article
    - Remove punctuation
    - Remove extra whitespace
    """
    normalized = name.lower().strip()

    # Strip a leading article (before punctuation removal, so "'t" still matches)
    for article in DUTCH_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break

    # Remove punctuation (keeps word characters, whitespace, and hyphens)
    normalized = re.sub(r'[^\w\s-]', '', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized
```

---

## Testing Strategy

### Test 1: Rotterdam Name Change

```python
def test_name_change_detection_rotterdam():
    """Test that Het Nieuwe Instituut / Nieuwe Instituut are merged."""
    record1 = HeritageCustodian(
        name="Het Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )
    record2 = HeritageCustodian(
        name="Nieuwe Instituut",
        institution_type=InstitutionType.MUSEUM,
        locations=[Location(city="Rotterdam")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-RtHNI")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge to one record
    assert len(result) == 1
    assert result[0].name == "Nieuwe Instituut"
    assert "Het Nieuwe Instituut" in result[0].alternative_names

    # Should have NAME_CHANGE event
    assert any(
        event.change_type == ChangeTypeEnum.NAME_CHANGE
        for event in result[0].change_history
    )
```

### Test 2: Wierden Typo Detection

```python
def test_typo_detection_wierden():
    """Test that Wederen/Wierden are flagged as potential typo."""
    record1 = HeritageCustodian(
        name="Historische Kring Wederen",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-WdnHKW")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Wierden",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Wierden")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (treating as typo)
    assert len(result) == 1

    # Should flag in provenance
    notes = (result[0].provenance.notes or "").lower()
    assert "variant" in notes or "typo" in notes
```

### Test 3: Losser Article Normalization

```python
def test_article_normalization_losser():
    """Test that Dutch articles are stripped for matching."""
    record1 = HeritageCustodian(
        name="de Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")],
        identifiers=[Identifier(identifier_scheme="ISIL", identifier_value="NL-LsHKL")]
    )
    record2 = HeritageCustodian(
        name="Historische Kring Losser",
        institution_type=InstitutionType.COLLECTING_SOCIETY,
        locations=[Location(city="Losser")]
    )

    deduplicator = Deduplicator()
    result = deduplicator.deduplicate([record1, record2])

    # Should merge (articles ignored)
    assert len(result) == 1
    assert result[0].name == "de Historische Kring Losser"  # ISIL version wins
    assert "Historische Kring Losser" in result[0].alternative_names
```

---

## Expected Impact on Collision Report

After implementing these fixes:

**Current state**: 15 collision groups, 30 institutions

**Expected after fixes**:
- **Rotterdam**: -1 collision group (merged)
- **Wierden**: -1 collision group (merged with caveat)
- **Losser**: -1 collision group (merged)

**New state**: **12 collision groups, 24 institutions**

Remaining collisions: 11 municipality/archive pairs + 1 legitimate museum collision (Den Haag museums)

---

## Recommendations

### Immediate Actions (High Priority)

1. ✅ **Implement article normalization** (Losser case) - Low complexity, high impact
2. ✅ **Add ISIL-based name change detection** (Rotterdam case) - Critical for data quality
3. ⚠️ **Add fuzzy matching for typos** (Wierden case) - Requires manual verification

### Medium-Term Actions

4. **Contact institutions for verification**:
   - Email Historische Kring Wierden: "Is 'Wederen' a typo in ISIL registry?"
   - Confirm Rotterdam name change timeline

5. **Improve ISIL registry integration**:
   - Parse "Opmerking" field for change events
   - Extract name change dates from remarks
   - Auto-generate ChangeEvent records
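
The remark-parsing step could start from something like the sketch below. It assumes remarks carry an ISO-formatted date, as the Rotterdam remark quoted earlier does, and it treats that date as the change date, which is itself an assumption to verify. `parse_name_change_remark` and both patterns are hypothetical; other remark formats in the registry would need their own handling.

```python
import re
from datetime import date
from typing import Optional

# Assumption: dates in remarks look like 2024-08-06
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# Accepts the standard spelling "naamswijziging" and the variant
# "naamswijzinging" quoted in the Rotterdam remark
NAME_CHANGE_HINT = re.compile(r"naamswijzin?ging", re.IGNORECASE)


def parse_name_change_remark(remark: str) -> Optional[date]:
    """Return the date in a remark that mentions a name change, else None."""
    if not remark or not NAME_CHANGE_HINT.search(remark):
        return None
    match = ISO_DATE.search(remark)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a valid calendar date
        return None


remark = "n.b. 2024-08-06 Nieuwe afkorting a.g.v. naamswijzinging."
change_date = parse_name_change_remark(remark)
```

A returned date can then feed the `event_date` of a generated `ChangeEvent`; a `None` result means the remark should be left for manual inspection.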

### Long-Term Actions

6. **Build article database** for multilingual support:
   - Dutch: de, het, 't
   - English: the, a, an
   - German: der, die, das, ein, eine
   - French: le, la, les, un, une

7. **Implement fuzzy matching pipeline**:
   - Stage 1: Exact match (current)
   - Stage 2: Normalized match (articles stripped)
   - Stage 3: Fuzzy match (typo detection)
   - Stage 4: Manual review queue
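
The four stages can be sketched as a single classification function. This is a minimal illustration under stated assumptions, not the deduplicator's design: it is meant for pairs that already share a collision group, it handles Dutch articles only, and the names `normalize`, `edit_distance`, and `match_stage` as well as the default cutoff of 3 are hypothetical.

```python
import re

DUTCH_ARTICLES = ("de ", "het ", "'t ")


def normalize(name: str) -> str:
    """Lowercase, drop one leading Dutch article, collapse whitespace."""
    lowered = name.lower().strip()
    for article in DUTCH_ARTICLES:
        if lowered.startswith(article):
            lowered = lowered[len(article):]
            break
    return re.sub(r"\s+", " ", lowered).strip()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def match_stage(name_a: str, name_b: str, max_typo_distance: int = 3) -> str:
    """Return the first stage at which two names in a collision group match."""
    if name_a == name_b:
        return "exact"                      # Stage 1
    norm_a, norm_b = normalize(name_a), normalize(name_b)
    if norm_a == norm_b:
        return "normalized"                 # Stage 2
    if edit_distance(norm_a, norm_b) <= max_typo_distance:
        return "fuzzy"                      # Stage 3
    return "manual_review"                  # Stage 4
```

Under these assumptions the Rotterdam and Losser pairs resolve at the normalized stage and the Wederen/Wierden pair at the fuzzy stage. Ordering the stages from strict to loose means the more error-prone fuzzy comparison only runs on pairs the cheaper checks could not settle.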

---

## Files to Modify

1. `src/glam_extractor/parsers/deduplicator.py` - Core logic
2. `tests/parsers/test_deduplicator.py` - Add 3 new tests
3. `scripts/apply_collision_resolution_dutch_datasets.py` - Re-run after fixes

---

## Conclusion

All three edge cases are **duplicates (one of them probable), not true collisions**:

1. **Rotterdam**: Name change (same ISIL code)
2. **Wierden**: Probable typo (Wederen → Wierden)
3. **Losser**: Article variance (de Historische Kring vs Historische Kring)

Implementing the recommended fixes will:
- ✅ Reduce collisions from 15 to 12 groups (-20%)
- ✅ Improve name normalization for Dutch institutions
- ✅ Detect name changes automatically from ISIL data
- ✅ Flag potential typos for manual review

**Next step**: Implement fixes in deduplicator code and re-run collision resolution.