GHCID Collision Duplicate Detection Rule
Rule ID: Rule 33
Status: ACTIVE
Created: 2025-01-13
Applies To: GHCID collision resolution, entity deduplication
Core Principle
Duplicate detection MUST be a standard step in GHCID collision resolution. However, entities with similar or identical names MUST NOT be treated as duplicates until ALL available details have been rigorously verified.
Heritage institutions often have similar or identical names, especially:
- Libraries named after the same historical figure in different cities
- Regional branches of national institutions
- Institutions that split, merged, or were renamed over time
- Common naming patterns (e.g., "Biblioteca Popular [Name]" in Argentina)
Duplicate Verification Requirements
ALL of the following details that are available for both entities MUST match before they are marked as duplicates:
| Field Category | Fields to Compare | Weight |
|---|---|---|
| Geographic | Country, Region/State, City/Settlement, Coordinates | CRITICAL |
| Temporal | Founding date, Dissolution date (if any) | HIGH |
| Identity | Official name, Alternative names, Abbreviations | HIGH |
| Legal | Legal status, Registration numbers, Parent organization | HIGH |
| Digital | Website URL, Social media handles | MEDIUM |
| Identifiers | ISIL code, Wikidata ID, VIAF, other external IDs | HIGH |
| Contact | Address, Phone, Email | MEDIUM |
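The field-by-field comparison implied by the table can be sketched as follows. This is a minimal illustration, not the project's implementation: the flat field names (`founding_date`, `legal_status`, `isil`, and so on) and the `compare_fields` helper are hypothetical, and stripping surrounding whitespace before comparing is an assumption. A field counts as a mismatch only when both records supply a value for it.

```python
from typing import Any, Dict

# Hypothetical flat field names, weighted as in the table above.
FIELD_WEIGHTS: Dict[str, str] = {
    "country": "CRITICAL", "region": "CRITICAL", "city": "CRITICAL",
    "founding_date": "HIGH", "name": "HIGH", "legal_status": "HIGH",
    "wikidata_id": "HIGH", "isil": "HIGH",
    "website": "MEDIUM", "address": "MEDIUM",
}


def _clean(value: Any) -> Any:
    """Strip surrounding whitespace from strings (assumption); leave other values as-is."""
    return value.strip() if isinstance(value, str) else value


def compare_fields(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, str]:
    """Return {field: weight} for every field present in both records that differs."""
    mismatches: Dict[str, str] = {}
    for field, weight in FIELD_WEIGHTS.items():
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            continue  # not available on both sides, so it cannot be compared
        if _clean(left) != _clean(right):
            mismatches[field] = weight
    return mismatches
```

An empty result means no available field disagrees; any `CRITICAL` or `HIGH` entry is enough to rule out an automatic duplicate verdict.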
Decision Matrix
| Scenario | Action |
|---|---|
| ALL details match exactly | DUPLICATE - Keep the earliest file, archive the later one (see File Deletion Protocol) |
| Same name, DIFFERENT city | NOT DUPLICATE - Keep both, add name suffix to resolve collision |
| Same name, same city, different addresses | INVESTIGATE - May be branches, historical relocations, or genuine duplicates |
| Same name, same city, different founding dates | INVESTIGATE - May be successor institutions, not duplicates |
| Same name, same city, different Wikidata IDs | NOT DUPLICATE - Wikidata confirms separate entities |
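One way to encode the decision matrix is shown below. It is a sketch under stated assumptions: the `Verdict` enum, the `classify` function, and the flat field names are hypothetical, the records are assumed to share a name already, and the order in which the rows are checked is a choice made here rather than something the rule specifies. Cases that fall outside the matrix are routed to investigation.

```python
from enum import Enum
from typing import Any, Dict


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NOT_DUPLICATE = "not_duplicate"
    INVESTIGATE = "investigate"


def classify(a: Dict[str, Any], b: Dict[str, Any]) -> Verdict:
    """Apply the decision matrix rows to two records that share a name."""

    def differs(field: str) -> bool:
        # A field only counts as differing when both records supply a value.
        left, right = a.get(field), b.get(field)
        return left is not None and right is not None and left != right

    if differs("city"):
        return Verdict.NOT_DUPLICATE      # same name, different city
    if differs("wikidata_id"):
        return Verdict.NOT_DUPLICATE      # Wikidata confirms separate entities
    if differs("address") or differs("founding_date"):
        return Verdict.INVESTIGATE        # branches, relocations, successors
    if all(a.get(f) == b.get(f) for f in set(a) | set(b)):
        return Verdict.DUPLICATE          # every detail matches exactly
    return Verdict.INVESTIGATE            # not covered by the matrix: stay cautious
```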
Collision Resolution Workflow (Updated)
1. Generate base GHCID for new entity
↓
2. Check for existing files with same base GHCID
↓
NO COLLISION → Use base GHCID, done
↓
COLLISION DETECTED → Continue to step 3
↓
3. **NEW: DUPLICATE DETECTION CHECK**
↓
Compare ALL available fields (see table above)
↓
ALL MATCH → DUPLICATE detected
├─ Keep file with earlier creation timestamp
├─ Delete duplicate file
└─ Document in deletion log
↓
ANY MISMATCH → NOT DUPLICATE
├─ Add name suffix to NEW entity
└─ Keep both files
↓
4. If NOT duplicate, apply temporal priority rule
↓
5. Generate name suffix from native language name
↓
6. Update GHCID history
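Steps 2, 3, and 5 of the workflow can be sketched in a few lines. The function names, the in-memory `existing` mapping, and the `is_duplicate` callback are hypothetical. The suffix format (lowercase, underscores, joined to the base GHCID with a hyphen) follows the file names used in the examples in this document, while dropping diacritics is an assumption. Step 4 (temporal priority) and step 6 (history update) are defined elsewhere and are left out.

```python
import re
import unicodedata
from typing import Callable, Dict, Tuple


def name_suffix(name: str) -> str:
    """Build a snake_case suffix from the native-language name."""
    # Assumption: diacritics are dropped before the name is lowercased.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def resolve_collision(
    base_ghcid: str,
    new: dict,
    existing: Dict[str, dict],
    is_duplicate: Callable[[dict, dict], bool],
) -> Tuple[str, str]:
    """Return (action, ghcid) for a new record."""
    if base_ghcid not in existing:
        return ("use_base", base_ghcid)           # step 2: no collision
    if is_duplicate(existing[base_ghcid], new):   # step 3: duplicate detection check
        return ("duplicate", base_ghcid)
    suffix = name_suffix(new["name"])             # step 5: name suffix
    return ("add_suffix", f"{base_ghcid}-{suffix}")
```

Passing the duplicate check in as a callback keeps the collision logic separate from whichever field-comparison policy is in force.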
Real-World Examples
Example 1: TRUE DUPLICATE (Same Institution)
File 1: AR-A-CAS-L-BPCC.yaml (created 23:34:13)
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
File 2: AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml (created 23:36:13)
name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167
Verdict: DUPLICATE
- Same name, city, region, country
- Same Wikidata ID confirms identity
- Action: Delete File 2 (later timestamp), keep File 1
Example 2: NOT DUPLICATE (Different Cities)
File 1: AR-B-SJU-L-BPBS.yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: San Juan
  region: San Juan
  country: AR
wikidata_id: Q64338123
File 2: AR-C-CBA-L-BPBS.yaml
name: Biblioteca Popular Bernardino Rivadavia
location:
  city: Córdoba
  region: Córdoba
  country: AR
wikidata_id: Q64339456
Verdict: NOT DUPLICATE
- Same name but DIFFERENT cities (San Juan vs Córdoba)
- Different Wikidata IDs confirm separate entities
- Action: Keep both, use name suffixes if GHCID collision occurs
Example 3: REQUIRES INVESTIGATION (Same City, Different Details)
File 1: NL-NH-AMS-M-SM.yaml
name: Stedelijk Museum
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
File 2: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam.yaml
name: Stedelijk Museum Amsterdam
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl
Verdict: DUPLICATE (after investigation)
- Same city, same founding date, same website
- Name variation is just formal vs informal
- Action: Keep earlier file, delete duplicate
Implementation Checklist
Before declaring two entities as duplicates:
- Country matches exactly
- Region/State matches exactly
- City/Settlement matches exactly (check for neighborhood vs city confusion)
- Founding dates are consistent (or both unknown)
- No conflicting external identifiers (different Wikidata IDs = NOT duplicate)
- Websites are same or one is null
- Legal status is consistent
- No evidence of split/merger events that would explain similar names
When in Doubt
If any field shows a mismatch, err on the side of keeping both files.
It is better to have two files for the same institution (can be merged later) than to accidentally delete a unique institution's record.
For ambiguous cases:
- Add a `duplicate_candidate` flag to both files
- Document the uncertainty in `provenance.notes`
- Flag for manual review
provenance:
  notes: |
    DUPLICATE_CANDIDATE: May be duplicate of AR-A-CAS-L-BPCC.yaml
    Matching fields: name, city, region
    Differing fields: founding_date (1923 vs 1925)
    Manual review required.
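A note in this shape could be generated from the two records instead of being written by hand. The helper below is a hypothetical sketch: `duplicate_candidate_note` and its parameters are illustrative names, and it assumes flat records keyed by field name.

```python
from typing import Any, Dict, List


def duplicate_candidate_note(
    other_file: str, a: Dict[str, Any], b: Dict[str, Any], fields: List[str]
) -> str:
    """Build the text for provenance.notes from the compared fields."""
    matching = [f for f in fields if a.get(f) == b.get(f)]
    differing = [
        f"{f} ({a.get(f)} vs {b.get(f)})" for f in fields if a.get(f) != b.get(f)
    ]
    return "\n".join([
        f"DUPLICATE_CANDIDATE: May be duplicate of {other_file}",
        f"Matching fields: {', '.join(matching)}",
        f"Differing fields: {', '.join(differing)}",
        "Manual review required.",
    ])
```

Called with two records that differ only in `founding_date` (1923 vs 1925), it reproduces the four-line note shown above.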
File Deletion Protocol
When a duplicate is confirmed:
- Verify which file was created first (use the creation timestamp or `extraction_date`)
- Keep the earlier file (preserves PID stability)
- Move the duplicate to the archive (do not permanently delete it immediately): `data/custodian/archive/duplicates/AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml`
- Document the deletion in a log file or in provenance notes
- After 30 days, permanently delete archived duplicates (the delay allows recovery if the deletion was a mistake)
See Also
- `AGENTS.md` - Rule 33: GHCID Collision Duplicate Detection
- `docs/GHCID_PID_SCHEME.md` - GHCID format and collision resolution
- `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Temporal priority rules