# GHCID Collision Duplicate Detection Rule

**Rule ID**: Rule 33
**Status**: ACTIVE
**Created**: 2025-01-13
**Applies To**: GHCID collision resolution, entity deduplication

---

## Core Principle

**Duplicate detection MUST be a standard step in GHCID collision resolution. However, entities with similar names should NOT be treated as duplicates without rigorous verification of ALL available details.**

Heritage institutions often have similar or identical names. Common cases include:

- Libraries named after the same historical figure in different cities
- Regional branches of national institutions
- Institutions that split, merged, or were renamed over time
- Common naming patterns (e.g., "Biblioteca Popular [Name]" in Argentina)

---

## Duplicate Verification Requirements

### ALL of the following details MUST match before marking entities as duplicates:

| Field Category | Fields to Compare | Weight |
|----------------|-------------------|--------|
| **Geographic** | Country, Region/State, City/Settlement, Coordinates | CRITICAL |
| **Temporal** | Founding date, Dissolution date (if any) | HIGH |
| **Identity** | Official name, Alternative names, Abbreviations | HIGH |
| **Legal** | Legal status, Registration numbers, Parent organization | HIGH |
| **Digital** | Website URL, Social media handles | MEDIUM |
| **Identifiers** | ISIL code, Wikidata ID, VIAF, other external IDs | HIGH |
| **Contact** | Address, Phone, Email | MEDIUM |

### Decision Matrix

| Scenario | Action |
|----------|--------|
| ALL details match exactly | **DUPLICATE** - Keep the earliest file, remove the later one |
| Same name, DIFFERENT city | **NOT DUPLICATE** - Keep both, add name suffix to resolve collision |
| Same name, same city, different addresses | **INVESTIGATE** - May be branches, historical relocations, or genuine duplicates |
| Same name, same city, different founding dates | **INVESTIGATE** - May be successor institutions, not duplicates |
| Same name, same city, different Wikidata IDs | **NOT DUPLICATE** - Wikidata confirms separate entities |

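
The decision matrix can be read as an ordered set of checks. The following is a minimal Python sketch, not the project's actual implementation. It assumes each record has been flattened into a dictionary with hypothetical keys such as `name`, `city`, `wikidata_id`, `address`, and `founded`, and it treats a field as conflicting only when both records supply a value. `Verdict` and `classify` are illustrative names. Any mismatch the matrix does not list (for example a name variant) falls through to `INVESTIGATE`, which is an assumption in keeping with the manual-review guidance of this rule.

```python
from enum import Enum


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NOT_DUPLICATE = "not_duplicate"
    INVESTIGATE = "investigate"


def classify(a: dict, b: dict) -> Verdict:
    """Apply the decision matrix to two flattened records (hypothetical schema)."""

    def conflicts(key: str) -> bool:
        # Only a value present on both sides can conflict; a missing value cannot.
        return a.get(key) is not None and b.get(key) is not None and a[key] != b[key]

    # Different city or different Wikidata ID: separate entities.
    if conflicts("city") or conflicts("wikidata_id"):
        return Verdict.NOT_DUPLICATE
    # Same city but different address or founding date: needs a human decision.
    if conflicts("address") or conflicts("founded"):
        return Verdict.INVESTIGATE
    # Every field present on either record must be equal for a duplicate.
    if all(a.get(key) == b.get(key) for key in set(a) | set(b)):
        return Verdict.DUPLICATE
    # Assumption: anything else (e.g. a name variant) goes to manual review.
    return Verdict.INVESTIGATE


casado = {
    "name": "Biblioteca Popular Carlos Casado",
    "city": "Casilda",
    "region": "Santa Fe",
    "country": "AR",
    "wikidata_id": "Q64337167",
}
print(classify(casado, dict(casado)).name)  # DUPLICATE
```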
---

## Collision Resolution Workflow (Updated)

```
1. Generate base GHCID for new entity
   ↓
2. Check for existing files with same base GHCID
   ↓
   NO COLLISION → Use base GHCID, done
   ↓
   COLLISION DETECTED → Continue to step 3
   ↓
3. **NEW: DUPLICATE DETECTION CHECK**
   ↓
   Compare ALL available fields (see table above)
   ↓
   ALL MATCH → DUPLICATE detected
     ├─ Keep file with earlier creation timestamp
     ├─ Move duplicate file to archive (see File Deletion Protocol)
     └─ Document in deletion log
   ↓
   ANY MISMATCH → NOT DUPLICATE (or INVESTIGATE, per the Decision Matrix)
     ├─ Add name suffix to NEW entity
     └─ Keep both files
   ↓
4. If NOT duplicate, apply temporal priority rule
   ↓
5. Generate name suffix from native language name
   ↓
6. Update GHCID history
```

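Steps 2, 3, and 5 of the workflow can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's code: `name_suffix` and `resolve_collision` are hypothetical helpers, existing records are modelled as an in-memory dictionary keyed by GHCID instead of files on disk, and the snake_case normalization is inferred from file names such as `AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml` (the authoritative rule lives in `docs/GHCID_PID_SCHEME.md`). Names written entirely in non-Latin scripts would produce an empty suffix here and would need a transliteration step that is not shown. The temporal priority rule (step 4) and the history update (step 6) are left out.

```python
import re
import unicodedata
from typing import Callable, Dict, Tuple


def name_suffix(native_name: str) -> str:
    """Build a snake_case suffix from a name (assumed normalization rule)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", native_name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def resolve_collision(
    base_ghcid: str,
    new_record: dict,
    existing: Dict[str, dict],
    is_duplicate: Callable[[dict, dict], bool],
) -> Tuple[str, str]:
    """Return (ghcid, action) for a new entity.

    `existing` maps GHCIDs to stored records; `is_duplicate` is whatever
    field-by-field comparison the caller trusts (both are assumptions).
    """
    if base_ghcid not in existing:
        return base_ghcid, "use_base"  # no collision
    if is_duplicate(existing[base_ghcid], new_record):
        return base_ghcid, "duplicate"  # keep the earlier file
    # Not a duplicate: the NEW entity receives the name suffix.
    return f"{base_ghcid}-{name_suffix(new_record['name'])}", "suffixed"


existing = {"NL-NH-AMS-M-SM": {"name": "Stedelijk Museum", "city": "Amsterdam"}}
new = {"name": "Stedelijk Museum Amsterdam", "city": "Amsterdam"}
print(resolve_collision("NL-NH-AMS-M-SM", new, existing, lambda a, b: a == b))
```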
---

## Real-World Examples

### Example 1: TRUE DUPLICATE (Same Institution)

**File 1**: `AR-A-CAS-L-BPCC.yaml` (created 23:34:13)
```yaml
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
```

**File 2**: `AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml` (created 23:36:13)
```yaml
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
```

**Verdict**: DUPLICATE
- Same name, city, region, country
- Same Wikidata ID confirms identity
- **Action**: Delete File 2 (later timestamp), keep File 1

### Example 2: NOT DUPLICATE (Different Cities)

**File 1**: `AR-B-SJU-L-BPBS.yaml`
```yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: San Juan
  region: San Juan
  country: AR
wikidata_id: Q64338123
```

**File 2**: `AR-C-CBA-L-BPBS.yaml`
```yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: Córdoba
  region: Córdoba
  country: AR
wikidata_id: Q64339456
```

**Verdict**: NOT DUPLICATE
- Same name but DIFFERENT cities (San Juan vs Córdoba)
- Different Wikidata IDs confirm separate entities
- **Action**: Keep both, use name suffixes if GHCID collision occurs

### Example 3: REQUIRES INVESTIGATION (Same City, Different Details)

**File 1**: `NL-NH-AMS-M-SM.yaml`
```yaml
name: Stedelijk Museum
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
```

**File 2**: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam.yaml`
```yaml
name: Stedelijk Museum Amsterdam
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
```

**Verdict**: DUPLICATE (after investigation)
- Same city, same founding date, same website
- The only difference is the name, which is a short vs. full form of the same institution's name
- **Action**: Keep earlier file, delete duplicate

---

## Implementation Checklist

Before declaring two entities as duplicates:

- [ ] **Country matches exactly**
- [ ] **Region/State matches exactly**
- [ ] **City/Settlement matches exactly** (check for neighborhood vs city confusion)
- [ ] **Founding dates are consistent** (or both unknown)
- [ ] **No conflicting external identifiers** (different Wikidata IDs = NOT duplicate)
- [ ] **Websites are same or one is null**
- [ ] **Legal status is consistent**
- [ ] **No evidence of split/merger events** that would explain similar names

---

## When in Doubt

**If any field shows a mismatch, err on the side of keeping both files.**

It is better to have two files for the same institution (they can be merged later) than to accidentally delete a unique institution's record.

**For ambiguous cases:**
1. Add a `duplicate_candidate` flag to both files
2. Document the uncertainty in `provenance.notes`
3. Flag for manual review

```yaml
provenance:
  notes: |
    DUPLICATE_CANDIDATE: May be duplicate of AR-A-CAS-L-BPCC.yaml
    Matching fields: name, city, region
    Differing fields: founding_date (1923 vs 1925)
    Manual review required.
```

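The flagging steps for ambiguous cases can be sketched in Python on an already-parsed record. This is a hedged illustration: `flag_duplicate_candidate` is a hypothetical helper, the record is a plain dictionary (loading and saving the YAML file is left to whatever library the project uses), and placing `duplicate_candidate` at the top level of the record is an assumption, since the rule does not say where the flag lives. The note text follows the format of the YAML example above.

```python
from typing import Dict, List, Tuple


def flag_duplicate_candidate(
    record: dict,
    other_file: str,
    matching: List[str],
    differing: Dict[str, Tuple[object, object]],
) -> dict:
    """Return a flagged copy of `record`; the input record is not modified."""
    differing_text = ", ".join(
        f"{field} ({mine} vs {theirs})" for field, (mine, theirs) in differing.items()
    )
    note = "\n".join([
        f"DUPLICATE_CANDIDATE: May be duplicate of {other_file}",
        f"Matching fields: {', '.join(matching)}",
        f"Differing fields: {differing_text}",
        "Manual review required.",
    ])
    flagged = dict(record)
    flagged["duplicate_candidate"] = True  # assumed top-level location
    provenance = dict(flagged.get("provenance") or {})
    previous = provenance.get("notes")
    # Append to existing notes instead of overwriting them.
    provenance["notes"] = f"{previous.rstrip()}\n{note}" if previous else note
    flagged["provenance"] = provenance
    return flagged


flagged = flag_duplicate_candidate(
    {"name": "Biblioteca Popular Carlos Casado"},
    "AR-A-CAS-L-BPCC.yaml",
    ["name", "city", "region"],
    {"founding_date": (1923, 1925)},
)
print(flagged["provenance"]["notes"])
```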
---

## File Deletion Protocol

When a duplicate is confirmed:

1. **Verify which file was created first** (use creation timestamp or `extraction_date`)
2. **Keep the earlier file** (preserves PID stability)
3. **Move duplicate to archive** (do not permanently delete immediately):
   ```
   data/custodian/archive/duplicates/AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml
   ```
4. **Document deletion** in a log file or provenance notes
5. **After 30 days**, permanently delete archived duplicates (the delay allows recovery if the deletion was a mistake)

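
Steps 3 to 5 of the protocol can be sketched with the Python standard library. Everything named here is an assumption made for illustration: `archive_duplicate` and `purge_expired` are hypothetical helpers, and the deletion log is modelled as a JSON Lines file whose `archived_at` field (seconds since the epoch) drives the 30-day retention check. The log is used instead of file modification times because moving a file does not change its modification time. The archive directory would be a path such as `data/custodian/archive/duplicates/`.

```python
import json
import shutil
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 30 * 24 * 60 * 60  # the 30-day window from step 5


def archive_duplicate(duplicate: Path, kept: Path, archive_dir: Path,
                      log_file: Path, now: Optional[float] = None) -> Path:
    """Move `duplicate` into the archive and append an entry to the log."""
    now = time.time() if now is None else now
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / duplicate.name
    if target.exists():
        # Never overwrite an earlier archived copy.
        raise FileExistsError(f"{target} is already archived")
    shutil.move(str(duplicate), str(target))
    entry = {"archived_at": now, "file": target.name, "kept": kept.name}
    with log_file.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return target


def purge_expired(archive_dir: Path, log_file: Path,
                  now: Optional[float] = None) -> List[str]:
    """Permanently delete archived files whose log entry is 30+ days old."""
    now = time.time() if now is None else now
    removed = []
    for line in log_file.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        path = archive_dir / entry["file"]
        if now - entry["archived_at"] >= RETENTION_SECONDS and path.exists():
            path.unlink()
            removed.append(path.name)
    return removed
```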
---

## See Also

- `AGENTS.md` - Rule 33: GHCID Collision Duplicate Detection
- `docs/GHCID_PID_SCHEME.md` - GHCID format and collision resolution
- `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Temporal priority rules