glam/.opencode/GHCID_COLLISION_DUPLICATE_DETECTION.md
2025-12-23 13:27:35 +01:00

# GHCID Collision Duplicate Detection Rule
**Rule ID**: Rule 33

**Status**: ACTIVE

**Created**: 2025-01-13

**Applies To**: GHCID collision resolution, entity deduplication

---
## Core Principle
**Duplicate detection MUST be a standard step in GHCID collision resolution. However, entities with similar or identical names MUST NOT be treated as duplicates without rigorous verification of ALL available details.**

Heritage institutions often have similar or identical names, especially:
- Libraries named after the same historical figure in different cities
- Regional branches of national institutions
- Institutions that split, merged, or were renamed over time
- Common naming patterns (e.g., "Biblioteca Popular [Name]" in Argentina)
---
## Duplicate Verification Requirements
### ALL of the following details, where available, MUST match before entities are marked as duplicates:
| Field Category | Fields to Compare | Weight |
|----------------|-------------------|--------|
| **Geographic** | Country, Region/State, City/Settlement, Coordinates | CRITICAL |
| **Temporal** | Founding date, Dissolution date (if any) | HIGH |
| **Identity** | Official name, Alternative names, Abbreviations | HIGH |
| **Legal** | Legal status, Registration numbers, Parent organization | HIGH |
| **Digital** | Website URL, Social media handles | MEDIUM |
| **Identifiers** | ISIL code, Wikidata ID, VIAF, other external IDs | HIGH |
| **Contact** | Address, Phone, Email | MEDIUM |
### Decision Matrix
| Scenario | Action |
|----------|--------|
| ALL details match exactly | **DUPLICATE** - Keep the earliest file, delete the later one (see File Deletion Protocol) |
| Same name, DIFFERENT city | **NOT DUPLICATE** - Keep both, add name suffix to resolve collision |
| Same name, same city, different addresses | **INVESTIGATE** - May be branches, historical relocations, or genuine duplicates |
| Same name, same city, different founding dates | **INVESTIGATE** - May be successor institutions, not duplicates |
| Same name, same city, different Wikidata IDs | **NOT DUPLICATE** - Wikidata confirms separate entities |
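
The decision matrix can be read as a small classification function. The following is a minimal sketch, not the project's implementation: the `Entity` record, its field names, and the `Verdict` enum are hypothetical, and it assumes that a Wikidata ID present on only one side is a reason to investigate rather than to decide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NOT_DUPLICATE = "not_duplicate"
    INVESTIGATE = "investigate"


@dataclass(frozen=True)
class Entity:
    # Hypothetical flat record; real custodian files carry more fields.
    name: str
    city: str
    address: Optional[str] = None
    founded: Optional[int] = None
    wikidata_id: Optional[str] = None


def classify(a: Entity, b: Entity) -> Verdict:
    """Apply the decision matrix, most decisive rows first."""
    # Different city: separate institutions, even with identical names.
    if a.city != b.city:
        return Verdict.NOT_DUPLICATE
    # Two different Wikidata IDs confirm separate entities.
    if a.wikidata_id and b.wikidata_id and a.wikidata_id != b.wikidata_id:
        return Verdict.NOT_DUPLICATE
    # Every compared field is identical.
    if a == b:
        return Verdict.DUPLICATE
    # Same city but some detail differs (address, founding date, name form).
    return Verdict.INVESTIGATE
```

Anything short of a full match in the same city falls through to `INVESTIGATE`, which keeps the function on the conservative side of the rule.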
---
## Collision Resolution Workflow (Updated)
```
1. Generate base GHCID for new entity
2. Check for existing files with same base GHCID
   NO COLLISION → Use base GHCID, done
   COLLISION DETECTED → Continue to step 3
3. **NEW: DUPLICATE DETECTION CHECK**
   Compare ALL available fields (see table above)
   ALL MATCH → DUPLICATE detected
     ├─ Keep file with earlier creation timestamp
     ├─ Delete duplicate file
     └─ Document in deletion log
   ANY MISMATCH → NOT DUPLICATE
     ├─ Add name suffix to NEW entity
     └─ Keep both files
4. If NOT duplicate, apply temporal priority rule
5. Generate name suffix from native language name
6. Update GHCID history
```
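
Steps 2, 3, and 5 of the workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: `name_suffix` and `resolve_ghcid` are hypothetical names, the suffix format (lowercase, underscores) is inferred from the example filenames in this document, and stripping diacritics is an assumption. Names in non-Latin scripts would need a transliteration step that is not shown here. The temporal priority rule (step 4) and the history update (step 6) are out of scope.

```python
import re
import unicodedata
from typing import Optional, Set


def name_suffix(native_name: str) -> str:
    """Turn a native-language name into a snake_case suffix."""
    # Assumption: accents are dropped (e.g. "ó" becomes "o").
    decomposed = unicodedata.normalize("NFKD", native_name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "_", stripped.lower()).strip("_")


def resolve_ghcid(
    base_ghcid: str,
    native_name: str,
    existing_ids: Set[str],
    is_duplicate: bool,
) -> Optional[str]:
    """Return the GHCID for a new entity, or None if it is a duplicate."""
    # Step 2: no collision, the base GHCID is free to use.
    if base_ghcid not in existing_ids:
        return base_ghcid
    # Step 3: a confirmed duplicate gets no new identifier.
    if is_duplicate:
        return None
    # Step 5: disambiguate the NEW entity with a name suffix.
    return f"{base_ghcid}-{name_suffix(native_name)}"
```

For example, `resolve_ghcid("AR-A-CAS-L-BPCC", "Biblioteca Popular Carlos Casado", {"AR-A-CAS-L-BPCC"}, False)` yields an identifier with the suffix `biblioteca_popular_carlos_casado` appended.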
---
## Real-World Examples
### Example 1: TRUE DUPLICATE (Same Institution)
**File 1**: `AR-A-CAS-L-BPCC.yaml` (created 23:34:13)
```yaml
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
```
**File 2**: `AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml` (created 23:36:13)
```yaml
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
```
**Verdict**: DUPLICATE
- Same name, city, region, country
- Same Wikidata ID confirms identity
- **Action**: Delete File 2 (later timestamp), keep File 1
### Example 2: NOT DUPLICATE (Different Cities)
**File 1**: `AR-B-SJU-L-BPBS.yaml`
```yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: San Juan
  region: San Juan
  country: AR
wikidata_id: Q64338123
```
**File 2**: `AR-C-CBA-L-BPBS.yaml`
```yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: Córdoba
  region: Córdoba
  country: AR
wikidata_id: Q64339456
```
**Verdict**: NOT DUPLICATE
- Same name but DIFFERENT cities (San Juan vs Córdoba)
- Different Wikidata IDs confirm separate entities
- **Action**: Keep both, use name suffixes if GHCID collision occurs
### Example 3: REQUIRES INVESTIGATION (Same City, Different Details)
**File 1**: `NL-NH-AMS-M-SM.yaml`
```yaml
name: Stedelijk Museum
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
```
**File 2**: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam.yaml`
```yaml
name: Stedelijk Museum Amsterdam
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
```
**Verdict**: DUPLICATE (after investigation)
- Same city, same founding date, same website
- The name difference is only a formal vs. informal variant of the same name
- **Action**: Keep earlier file, delete duplicate
---
## Implementation Checklist
Before declaring two entities as duplicates:
- [ ] **Country matches exactly**
- [ ] **Region/State matches exactly**
- [ ] **City/Settlement matches exactly** (check for neighborhood vs city confusion)
- [ ] **Founding dates are consistent** (or both unknown)
- [ ] **No conflicting external identifiers** (different Wikidata IDs = NOT duplicate)
- [ ] **Websites are same or one is null**
- [ ] **Legal status is consistent**
- [ ] **No evidence of split/merger events** that would explain similar names
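
The mechanical items of this checklist can be automated, while legal status and split/merger evidence still need human judgment. The sketch below assumes flat dictionaries with hypothetical keys (`country`, `region`, `city`, `founded`, `wikidata_id`, `website`) and treats a founding date known on only one side as inconsistent, which is a conservative reading of the checklist rather than a documented rule.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def failed_checks(a: Record, b: Record) -> List[str]:
    """Return the checklist items that block a duplicate verdict."""
    failures: List[str] = []
    # Geographic fields must match exactly.
    for key in ("country", "region", "city"):
        if a.get(key) != b.get(key):
            failures.append(f"{key} differs")
    # Founding dates must be equal, or unknown on both sides.
    if a.get("founded") != b.get("founded"):
        failures.append("founding dates inconsistent")
    # Identifiers conflict only when both are present and different.
    wa, wb = a.get("wikidata_id"), b.get("wikidata_id")
    if wa and wb and wa != wb:
        failures.append("conflicting Wikidata IDs")
    # Websites must be the same, or missing on at least one side.
    sa, sb = a.get("website"), b.get("website")
    if sa and sb and sa.rstrip("/") != sb.rstrip("/"):
        failures.append("websites differ")
    return failures
```

An empty list means only that the automated checks passed. The remaining checklist items must still be reviewed before a file is removed.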
---
## When in Doubt
**If any field shows a mismatch, err on the side of keeping both files.**
It is better to have two files for the same institution (can be merged later) than to accidentally delete a unique institution's record.
**For ambiguous cases:**
1. Add a `duplicate_candidate` flag to both files
2. Document the uncertainty in `provenance.notes`
3. Flag for manual review
```yaml
provenance:
  notes: |
    DUPLICATE_CANDIDATE: May be duplicate of AR-A-CAS-L-BPCC.yaml
    Matching fields: name, city, region
    Differing fields: founding_date (1923 vs 1925)
    Manual review required.
```
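
A note in this shape can be generated rather than typed by hand. The helper below is a hypothetical sketch: it assumes the record is already loaded as a dictionary and that `duplicate_candidate` is a top-level boolean, which this rule does not specify. Reading and writing the YAML file is left out.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def flag_duplicate_candidate(
    record: Record,
    other_file: str,
    matching: List[str],
    differing: List[str],
) -> Record:
    """Return a copy of the record flagged for manual review."""
    note = (
        f"DUPLICATE_CANDIDATE: May be duplicate of {other_file}\n"
        f"Matching fields: {', '.join(matching)}\n"
        f"Differing fields: {', '.join(differing)}\n"
        "Manual review required.\n"
    )
    updated = dict(record)
    # Assumption: the flag lives at the top level of the record.
    updated["duplicate_candidate"] = True
    provenance = dict(updated.get("provenance") or {})
    existing = provenance.get("notes") or ""
    # Keep earlier notes and make sure the new note starts on its own line.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    provenance["notes"] = existing + note
    updated["provenance"] = provenance
    return updated
```

Returning a copy instead of editing in place keeps the original record available for comparison until the flagged version is written back.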
---
## File Deletion Protocol
When a duplicate is confirmed:
1. **Verify which file was created first** (use creation timestamp or `extraction_date`)
2. **Keep the earlier file** (preserves PID stability)
3. **Move duplicate to archive** (do not permanently delete immediately):
```
data/custodian/archive/duplicates/AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml
```
4. **Document deletion** in a log file or provenance notes
5. **After 30 days**, permanently delete archived duplicates (allows for recovery if mistake)
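
Steps 3 and 5 of this protocol could be scripted roughly as follows. This is a sketch under assumptions, not the project's tooling: the function names are hypothetical, and the archived file's modification time is reset at archive time so that it can serve as the start of the 30-day window. A deletion log entry (step 4) would be the more robust record.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

ARCHIVE_DIR = Path("data/custodian/archive/duplicates")
RETENTION_SECONDS = 30 * 24 * 60 * 60  # 30 days


def archive_duplicate(duplicate: Path, archive_dir: Path = ARCHIVE_DIR) -> Path:
    """Move a confirmed duplicate into the archive instead of deleting it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / duplicate.name
    if target.exists():
        raise FileExistsError(f"already archived: {target}")
    shutil.move(str(duplicate), str(target))
    # Assumption: mtime is reset so it marks the moment of archiving.
    target.touch()
    return target


def purge_expired(
    archive_dir: Path = ARCHIVE_DIR, now: Optional[float] = None
) -> List[Path]:
    """Permanently delete archived duplicates older than the retention window."""
    now = time.time() if now is None else now
    removed: List[Path] = []
    for path in sorted(archive_dir.glob("*.yaml")):
        if now - path.stat().st_mtime > RETENTION_SECONDS:
            path.unlink()
            removed.append(path)
    return removed
```

The `now` parameter exists so the purge can be dry-tested against a future timestamp before it is scheduled.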
---
## See Also
- `AGENTS.md` - Rule 33: GHCID Collision Duplicate Detection
- `docs/GHCID_PID_SCHEME.md` - GHCID format and collision resolution
- `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Temporal priority rules