glam/.opencode/GHCID_COLLISION_DUPLICATE_DETECTION.md
2025-12-23 13:27:35 +01:00

GHCID Collision Duplicate Detection Rule

Rule ID: Rule 33
Status: ACTIVE
Created: 2025-01-13
Applies To: GHCID collision resolution, entity deduplication


Core Principle

Duplicate detection MUST be a standard step in GHCID collision resolution. However, entities with similar or identical names MUST NOT be treated as duplicates until ALL available details have been rigorously verified.

Heritage institutions often have similar or identical names, especially:

  • Libraries named after the same historical figure in different cities
  • Regional branches of national institutions
  • Institutions that split, merged, or were renamed over time
  • Common naming patterns (e.g., "Biblioteca Popular [Name]" in Argentina)

Duplicate Verification Requirements

ALL of the following details that are available for both records MUST match before entities are marked as duplicates:

| Field Category | Fields to Compare | Weight |
|----------------|-------------------|--------|
| Geographic | Country, Region/State, City/Settlement, Coordinates | CRITICAL |
| Temporal | Founding date, Dissolution date (if any) | HIGH |
| Identity | Official name, Alternative names, Abbreviations | HIGH |
| Legal | Legal status, Registration numbers, Parent organization | HIGH |
| Digital | Website URL, Social media handles | MEDIUM |
| Identifiers | ISIL code, Wikidata ID, VIAF, other external IDs | HIGH |
| Contact | Address, Phone, Email | MEDIUM |

Decision Matrix

| Scenario | Action |
|----------|--------|
| ALL details match exactly | DUPLICATE - Keep earliest file, delete later |
| Same name, DIFFERENT city | NOT DUPLICATE - Keep both, add name suffix to resolve collision |
| Same name, same city, different addresses | INVESTIGATE - May be branches, historical relocations, or genuine duplicates |
| Same name, same city, different founding dates | INVESTIGATE - May be successor institutions, not duplicates |
| Same name, same city, different Wikidata IDs | NOT DUPLICATE - Wikidata confirms separate entities |
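
The matrix above could be encoded roughly as follows. This is a minimal sketch, not the project's implementation: the `Record` class, its field names, and the `classify` function are hypothetical, and the sketch assumes the two records already share a name. Requiring a shared Wikidata ID before returning `DUPLICATE` is an extra conservative assumption for records with missing fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical minimal view of a custodian record (not the real schema)."""
    name: str
    city: str
    address: Optional[str] = None
    founded: Optional[int] = None
    wikidata_id: Optional[str] = None


def differs(a, b) -> bool:
    """True only when both values are known and unequal."""
    return a is not None and b is not None and a != b


def classify(a: Record, b: Record) -> str:
    """Apply the decision matrix to two records that share a name."""
    if a.city != b.city:
        return "NOT_DUPLICATE"  # same name, different city
    if differs(a.wikidata_id, b.wikidata_id):
        return "NOT_DUPLICATE"  # Wikidata confirms separate entities
    if differs(a.address, b.address) or differs(a.founded, b.founded):
        return "INVESTIGATE"    # branches, relocations, or successors
    if a.wikidata_id is not None and a.wikidata_id == b.wikidata_id:
        return "DUPLICATE"      # no conflicts and a shared external ID
    return "INVESTIGATE"        # too little evidence to decide (assumption)
```

Checking the conflicting rows before the `DUPLICATE` row means a single known mismatch always wins, which matches the rule's preference for keeping both files when in doubt.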

Collision Resolution Workflow (Updated)

1. Generate base GHCID for new entity
   ↓
2. Check for existing files with same base GHCID
   ↓
   NO COLLISION → Use base GHCID, done
   ↓
   COLLISION DETECTED → Continue to step 3
   ↓
3. **NEW: DUPLICATE DETECTION CHECK**
   ↓
   Compare ALL available fields (see table above)
   ↓
   ALL MATCH → DUPLICATE detected
      ├─ Keep file with earlier creation timestamp
      ├─ Delete duplicate file
      └─ Document in deletion log
   ↓
   ANY MISMATCH → NOT DUPLICATE
      ├─ Add name suffix to NEW entity
      └─ Keep both files
   ↓
4. If NOT duplicate, apply temporal priority rule
   ↓
5. Generate name suffix from native language name
   ↓
6. Update GHCID history
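
Steps 1 to 5 of the workflow above can be sketched as a single function. Everything here is illustrative: `resolve_ghcid`, `snake_suffix`, the returned action labels, and the in-memory `existing` mapping are hypothetical names, and the duplicate test is passed in as a callable so any comparison logic can be plugged in. The suffix format (lowercase, underscore-separated, diacritics folded to ASCII) is inferred from the file names in this document and is an assumption. Step 6 (updating GHCID history) is not modeled.

```python
import re
import unicodedata
from typing import Callable, Dict, Tuple


def snake_suffix(name: str) -> str:
    """Turn a native-language name into a snake_case suffix (assumed format)."""
    folded = unicodedata.normalize("NFKD", name)
    ascii_name = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def resolve_ghcid(
    base_ghcid: str,
    new_entity: dict,
    existing: Dict[str, dict],
    is_duplicate: Callable[[dict, dict], bool],
) -> Tuple[str, str]:
    """Return (action, ghcid) for a new entity."""
    # Step 2: no existing record with this base GHCID, so no collision.
    if base_ghcid not in existing:
        return ("use_base", base_ghcid)
    # Step 3: collision detected, run the duplicate detection check.
    if is_duplicate(existing[base_ghcid], new_entity):
        return ("duplicate", base_ghcid)
    # Steps 4-5: not a duplicate; the NEW entity receives the name suffix.
    return ("suffix", f"{base_ghcid}-{snake_suffix(new_entity['name'])}")
```

A caller would branch on the returned action: write the file for `use_base` or `suffix`, and follow the deletion protocol for `duplicate`.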

Real-World Examples

Example 1: TRUE DUPLICATE (Same Institution)

File 1: AR-A-CAS-L-BPCC.yaml (created 23:34:13)

name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167

File 2: AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml (created 23:36:13)

name: Biblioteca Popular Carlos Casado
location:
  city: Casilda
  region: Santa Fe
  country: AR
wikidata_id: Q64337167

Verdict: DUPLICATE

  • Same name, city, region, country
  • Same Wikidata ID confirms identity
  • Action: Delete File 2 (later timestamp), keep File 1

Example 2: NOT DUPLICATE (Different Cities)

File 1: AR-B-SJU-L-BPBS.yaml

name: Biblioteca Popular Bernardino Rivadavia
location:
  city: San Juan
  region: San Juan
  country: AR
wikidata_id: Q64338123

File 2: AR-C-CBA-L-BPBS.yaml

name: Biblioteca Popular Bernardino Rivadavia
location:
  city: Córdoba
  region: Córdoba
  country: AR
wikidata_id: Q64339456

Verdict: NOT DUPLICATE

  • Same name but DIFFERENT cities (San Juan vs Córdoba)
  • Different Wikidata IDs confirm separate entities
  • Action: Keep both, use name suffixes if GHCID collision occurs

Example 3: REQUIRES INVESTIGATION (Same City, Different Details)

File 1: NL-NH-AMS-M-SM.yaml

name: Stedelijk Museum
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl

File 2: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam.yaml

name: Stedelijk Museum Amsterdam
location:
  city: Amsterdam
founded: 1895
website: https://www.stedelijk.nl

Verdict: DUPLICATE (after investigation)

  • Same city, same founding date, same website
  • The name variation is only formal vs. informal usage
  • Action: Keep earlier file, delete duplicate

Implementation Checklist

Before declaring two entities as duplicates:

  • Country matches exactly
  • Region/State matches exactly
  • City/Settlement matches exactly (check for neighborhood vs city confusion)
  • Founding dates are consistent (or both unknown)
  • No conflicting external identifiers (different Wikidata IDs = NOT duplicate)
  • Websites are same or one is null
  • Legal status is consistent
  • No evidence of split/merger events that would explain similar names
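
The mechanical items of this checklist could be automated along the following lines. This is a sketch under assumptions: `failed_checks` is a hypothetical helper, records are plain dicts shaped like the YAML examples in this document, and founding dates are compared literally (equal, or both missing). Legal status and split/merger evidence are left out because they need human judgment.

```python
from typing import List, Optional


def failed_checks(a: dict, b: dict) -> List[str]:
    """Return the names of checklist items that block a duplicate verdict."""

    def loc(record: dict, key: str) -> Optional[str]:
        return record.get("location", {}).get(key)

    def conflict(x, y) -> bool:
        # A conflict needs both values to be known and different.
        return x is not None and y is not None and x != y

    checks = {
        "country": loc(a, "country") == loc(b, "country"),
        "region": loc(a, "region") == loc(b, "region"),
        "city": loc(a, "city") == loc(b, "city"),
        # Consistent founding dates: equal, or both unknown.
        "founded": a.get("founded") == b.get("founded"),
        # Different Wikidata IDs mean NOT duplicate.
        "wikidata_id": not conflict(a.get("wikidata_id"), b.get("wikidata_id")),
        # Websites are the same, or one of them is null.
        "website": not conflict(a.get("website"), b.get("website")),
    }
    return [name for name, passed in checks.items() if not passed]
```

An empty list means none of the automated items objects; the remaining checklist items still apply before any file is removed.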

When in Doubt

If any field shows a mismatch, err on the side of keeping both files.

It is better to have two files for the same institution (they can be merged later) than to accidentally delete a unique institution's record.

For ambiguous cases:

  1. Add a duplicate_candidate flag to both files
  2. Document the uncertainty in provenance.notes
  3. Flag for manual review

provenance:
  notes: |
    DUPLICATE_CANDIDATE: May be duplicate of AR-A-CAS-L-BPCC.yaml
    Matching fields: name, city, region
    Differing fields: founding_date (1923 vs 1925)
    Manual review required.
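
A note in this format could be generated rather than typed by hand. The helper below is a hypothetical sketch: `candidate_note` and its arguments are illustrative names, and it assumes both records are flat dicts holding the fields to compare.

```python
def candidate_note(other_file: str, a: dict, b: dict) -> str:
    """Build the text for provenance.notes from two flat records."""
    shared = [key for key in a if key in b]  # keep the order of record a
    matching = [key for key in shared if a[key] == b[key]]
    differing = [
        f"{key} ({a[key]} vs {b[key]})" for key in shared if a[key] != b[key]
    ]
    return "\n".join([
        f"DUPLICATE_CANDIDATE: May be duplicate of {other_file}",
        "Matching fields: " + ", ".join(matching),
        "Differing fields: " + ", ".join(differing),
        "Manual review required.",
    ])
```

Because the note lists the differing values side by side, a reviewer can see why the records were flagged without opening both files.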

File Deletion Protocol

When a duplicate is confirmed:

  1. Verify which file was created first (use the creation timestamp or extraction_date)
  2. Keep the earlier file (preserves PID stability)
  3. Move the duplicate to the archive (do not permanently delete it immediately):
     data/custodian/archive/duplicates/AR-A-CAS-L-BPCC-biblioteca_popular_carlos_casado.yaml
  4. Document the deletion in a log file or in provenance notes
  5. After 30 days, permanently delete archived duplicates (the delay allows recovery if the deletion was a mistake)
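
The archive-then-purge steps could be scripted as in the sketch below. It is an illustration only: the function names, the `deletion_log.tsv` file, and its tab-separated layout are assumptions, while the archive directory and the 30-day retention come from the protocol above. The archive time is read from the log rather than from file modification times, since moving a file does not reliably reset them.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

ARCHIVE_DIR = Path("data/custodian/archive/duplicates")  # from the protocol
LOG_NAME = "deletion_log.tsv"  # hypothetical log file name
RETENTION_SECONDS = 30 * 24 * 60 * 60  # 30 days


def archive_duplicate(duplicate: Path, kept: Path,
                      archive_dir: Path = ARCHIVE_DIR,
                      now: Optional[float] = None) -> Path:
    """Move the later file into the archive and record the move in the log."""
    now = time.time() if now is None else now
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / duplicate.name
    shutil.move(str(duplicate), str(target))
    with (archive_dir / LOG_NAME).open("a", encoding="utf-8") as log:
        log.write(f"{int(now)}\t{duplicate.name}\t{kept.name}\n")
    return target


def purge_expired(archive_dir: Path = ARCHIVE_DIR,
                  now: Optional[float] = None) -> List[str]:
    """Permanently delete files archived at least 30 days ago."""
    now = time.time() if now is None else now
    log_path = archive_dir / LOG_NAME
    removed: List[str] = []
    if not log_path.exists():
        return removed
    for line in log_path.read_text(encoding="utf-8").splitlines():
        archived_at, name, _kept = line.split("\t")
        path = archive_dir / name
        if now - int(archived_at) >= RETENTION_SECONDS and path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Until `purge_expired` removes a file, a mistaken archive can be undone by moving the file back to its original location.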

See Also

  • AGENTS.md - Rule 33: GHCID Collision Duplicate Detection
  • docs/GHCID_PID_SCHEME.md - GHCID format and collision resolution
  • docs/plan/global_glam/07-ghcid-collision-resolution.md - Temporal priority rules