glam/docs/SYNTHETIC_QNUMBER_REMEDIATION.md

Synthetic Q-Number Remediation Plan

Status: 🚨 CRITICAL DATA QUALITY ISSUE
Created: 2025-11-09
Priority: HIGH

Problem Statement

The global heritage institutions dataset contains 2,607 institutions with synthetic Q-numbers (Q90000000 and above). These are algorithmically generated identifiers that:

  • Do NOT correspond to real Wikidata entities
  • Break Linked Open Data integrity (RDF triples with fake Q-numbers)
  • Violate W3C persistent identifier best practices
  • Create citation errors (Q-numbers don't resolve to Wikidata pages)
  • Undermine trust in the dataset

Policy Update: As of 2025-11-09, synthetic Q-numbers are strictly prohibited in this project. See AGENTS.md section "Persistent Identifiers (GHCID)" for detailed policy.

Current Dataset Status

Total institutions:           13,396
├─ Real Wikidata Q-numbers:    7,330 (54.7%) ✅
├─ Synthetic Q-numbers:        2,607 (19.5%) ❌ NEEDS FIXING
└─ No Wikidata ID:             3,459 (25.8%) ⚠️  ACCEPTABLE (will enrich later)

Impact Assessment

GHCIDs Affected

Institutions with synthetic Q-numbers have GHCIDs in the format:

  • {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}-Q9000XXXX

Example: NL-NH-AMS-M-RJ-Q90052341

These GHCIDs are structurally valid but embed fake Wikidata identifiers.

Data Tiers

Synthetic Q-numbers impact data tier classification:

  • Current: TIER_3_CROWD_SOURCED (incorrect, since the Wikidata identifier is fake)
  • Should be: TIER_4_INFERRED (until a real Q-number is obtained)

Remediation Strategy

Phase 1: Immediate - Remove Synthetic Q-Numbers from GHCIDs

Objective: Strip synthetic Q-numbers from GHCIDs, revert to base GHCID

Actions:

  1. Identify all institutions with Q-numbers >= Q90000000
  2. Remove Q-suffix from GHCID (e.g., NL-NH-AMS-M-RJ-Q90052341 → NL-NH-AMS-M-RJ)
  3. Remove fake Wikidata identifier from identifiers array
  4. Add needs_wikidata_enrichment: true flag
  5. Record change in ghcid_history
  6. Update provenance to reflect data tier correction

Script: scripts/remove_synthetic_q_numbers.py
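A minimal sketch of the core transformation that script could perform. Field names such as needs_wikidata_enrichment and ghcid_history follow the steps above; the exact record schema is an assumption.

```python
import re

SYNTHETIC_MIN = 90_000_000  # Q90000000 and above are synthetic

def strip_synthetic_q(record: dict) -> dict:
    """Remove a synthetic Q-suffix from a GHCID and flag the record (steps 1-5)."""
    ghcid = record["ghcid"]
    m = re.search(r"-Q(\d+)$", ghcid)
    if not m or int(m.group(1)) < SYNTHETIC_MIN:
        return record  # real Q-number or no suffix: leave untouched
    base = ghcid[: m.start()]
    # Step 5: record the change in ghcid_history
    record.setdefault("ghcid_history", []).append(
        {"old": ghcid, "new": base, "reason": "synthetic_q_number_removed"}
    )
    record["ghcid"] = base
    # Step 3: drop the fake Wikidata identifier
    record["identifiers"] = [
        i for i in record.get("identifiers", [])
        if not (i.get("type") == "wikidata" and i.get("value") == "Q" + m.group(1))
    ]
    # Step 4: flag for later enrichment
    record["needs_wikidata_enrichment"] = True
    return record
```

Records with real Q-numbers (below Q90000000) pass through unchanged, so the script can safely run over the full dataset.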

Estimated Time: 15-20 minutes

Expected Outcome:

Real Wikidata Q-numbers:    7,330 (54.7%) ✅
Synthetic Q-numbers:            0 (0.0%)  ✅ FIXED
No Wikidata ID:             6,066 (45.3%) ⚠️  Flagged for enrichment

Phase 2: Wikidata Enrichment - Obtain Real Q-Numbers

Objective: Query Wikidata API to find real Q-numbers for 6,066 institutions

Priority Order:

  1. Dutch institutions (1,351 total)

    • High data quality (TIER_1 CSV sources)
    • Many already have ISIL codes
    • Expected match rate: 70-80%
  2. Latin America institutions (Brazil, Chile, Mexico)

    • Mexico: 21.1% → 31.2% coverage (enriched Nov 8)
    • Chile: 28.9% coverage (good name quality)
    • Brazil: 1.0% coverage (poor name quality, needs web scraping)
  3. European institutions (Belgium, Italy, Denmark, Austria, etc.)

    • ~500 institutions
    • Expected match rate: 60-70%
  4. Asian institutions (Japan, Vietnam, Thailand, Taiwan, etc.)

    • ~800 institutions
    • Expected match rate: 40-50% (language barriers)
  5. African/Middle Eastern institutions

    • ~200 institutions
    • Expected match rate: 30-40% (fewer Wikidata entries)

Enrichment Methods:

  1. SPARQL Query (primary):

    SELECT ?item ?itemLabel ?viaf ?isil WHERE {
      ?item wdt:P31/wdt:P279* wd:Q33506 .  # Museum
      ?item wdt:P131* wd:Q727 .             # Located in Amsterdam
      OPTIONAL { ?item wdt:P214 ?viaf }
      OPTIONAL { ?item wdt:P791 ?isil }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl" }
    }
    
  2. Fuzzy Name Matching (threshold > 0.85, i.e. 85 on rapidfuzz's 0-100 scale):

    from rapidfuzz import fuzz
    score = fuzz.ratio(institution_name.lower(), wikidata_label.lower())
    
  3. ISIL/VIAF Cross-Reference (high confidence):

    • If institution has ISIL code, query Wikidata for matching ISIL
    • If institution has VIAF ID, query Wikidata for matching VIAF
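The three methods chain naturally: fetch candidates with SPARQL, then score them by name, letting ISIL/VIAF hits bypass the threshold. The sketch below builds a Wikidata Query Service request URL and picks a best candidate; it uses difflib from the standard library as a stand-in for rapidfuzz (scores on a 0-1 scale), with the 0.85 threshold from method 2.

```python
import urllib.parse
from difflib import SequenceMatcher

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_sparql_url(query: str) -> str:
    """Build the URL for a WDQS GET request returning JSON results."""
    return WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )

def best_match(name: str, candidates: list[tuple[str, str]],
               threshold: float = 0.85):
    """Return (qid, score) for the best-scoring (qid, label) pair, or None.

    difflib stands in for rapidfuzz here; scores are on a 0-1 scale.
    """
    scored = [
        (qid, SequenceMatcher(None, name.lower(), label.lower()).ratio())
        for qid, label in candidates
    ]
    if not scored:
        return None
    qid, score = max(scored, key=lambda t: t[1])
    return (qid, score) if score >= threshold else None
```

In practice the candidate list would be parsed from the JSON response of build_sparql_url(...), and requests to WDQS should carry a descriptive User-Agent header per Wikimedia policy.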

Scripts:

  • scripts/enrich_dutch_institutions_wikidata.py (priority 1)
  • scripts/enrich_latam_institutions_fuzzy.py (exists, used for Mexico)
  • scripts/enrich_global_with_wikidata.py (create for global batch)

Estimated Time: 3-5 hours total (can be parallelized)

Phase 3: Manual Review - Edge Cases

Objective: Human review of institutions that cannot be automatically matched

Cases Requiring Manual Review:

  1. Low fuzzy match scores (70-85%)
  2. Multiple Wikidata candidates (disambiguation needed)
  3. Institutions with non-Latin script names
  4. Very small/local institutions not in Wikidata

Estimated Count: ~500-800 institutions

Process:

  1. Export CSV with institution details + Wikidata candidates
  2. Manual review in spreadsheet
  3. Import verified Q-numbers
  4. Update GHCIDs and provenance

Phase 4: Web Scraping - No Wikidata Match

Objective: For institutions without Wikidata entries, verify existence via website

Actions:

  1. Use crawl4ai to scrape institutional websites
  2. Extract formal names, addresses, founding dates
  3. If institution exists but not in Wikidata:
    • Keep base GHCID (no Q-suffix)
    • Mark as TIER_2_VERIFIED (website confirmation)
    • Flag for Wikidata community contribution
  4. If institution no longer exists (closed):
    • Add ChangeEvent with change_type: CLOSURE
    • Keep record for historical reference
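The closure branch in step 4 could be recorded along these lines. The ChangeEvent field names here are assumptions modeled on schemas/provenance.yaml, not copied from it.

```python
from datetime import date

def record_closure(record: dict, evidence_url: str) -> dict:
    """Append a CLOSURE ChangeEvent; the record itself is kept for history."""
    record.setdefault("change_events", []).append({
        "change_type": "CLOSURE",            # per step 4 above
        "date_recorded": date.today().isoformat(),
        "evidence": evidence_url,            # hypothetical field name
    })
    return record
```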

Success Metrics

Phase 1 Success Criteria

  • Zero synthetic Q-numbers in dataset
  • All affected institutions flagged with needs_wikidata_enrichment
  • GHCID history entries created for all changes
  • Provenance updated to reflect data tier correction

Phase 2 Success Criteria

  • Dutch institutions: 70%+ real Wikidata coverage
  • Latin America: 40%+ real Wikidata coverage
  • Global: 60%+ institutions with real Q-numbers
  • All Q-numbers verified resolvable on Wikidata

Phase 3 Success Criteria

  • Manual review completed for all ambiguous cases
  • Disambiguation documented in provenance notes

Phase 4 Success Criteria

  • Website verification for remaining institutions
  • TIER_2_VERIFIED status assigned where applicable
  • List of candidates for Wikidata community contribution

Timeline

Phase Duration Start Date Completion Target
Phase 1: Remove Synthetic Q-Numbers 15-20 min 2025-11-09 2025-11-09
Phase 2: Wikidata Enrichment 3-5 hours 2025-11-10 2025-11-11
Phase 3: Manual Review 2-3 days 2025-11-12 2025-11-15
Phase 4: Web Scraping 1 week 2025-11-16 2025-11-23

Total Project Duration: ~2 weeks

Next Steps

Immediate Actions (within 24 hours):

  1. Update AGENTS.md with synthetic Q-number prohibition policy (DONE)
  2. Create scripts/remove_synthetic_q_numbers.py (Phase 1 script)
  3. Run Phase 1 remediation - Remove all synthetic Q-numbers
  4. Validate dataset - Confirm zero synthetic Q-numbers remain
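The validation in step 4 reduces to a single scan over the dataset; a sketch, assuming institutions are dicts with an identifiers list as elsewhere in this plan:

```python
def count_synthetic_q(institutions: list[dict]) -> int:
    """Count records still carrying a Wikidata Q-number of Q90000000 or above."""
    count = 0
    for inst in institutions:
        for ident in inst.get("identifiers", []):
            value = str(ident.get("value", ""))
            if (ident.get("type") == "wikidata"
                    and value.startswith("Q")
                    and value[1:].isdigit()
                    and int(value[1:]) >= 90_000_000):
                count += 1
                break
    return count

# Phase 1 succeeds when this returns 0 for the full dataset.
```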

Short-term Actions (within 1 week):

  1. Create scripts/enrich_dutch_institutions_wikidata.py (highest ROI)
  2. Run Dutch Wikidata enrichment - Target 70%+ coverage
  3. Run Chile Wikidata enrichment - Lower threshold to 0.80
  4. Create global enrichment script - Batch process remaining countries

Medium-term Actions (within 2 weeks):

  1. Manual review CSV export - Edge cases and ambiguous matches
  2. Web scraping for Brazilian institutions - Poor name quality issue
  3. Final validation - Verify 60%+ global Wikidata coverage
  4. Update documentation - Reflect new data quality standards

References

  • Policy: AGENTS.md - "Persistent Identifiers (GHCID)" section (prohibition statement)
  • Schema: schemas/core.yaml - Identifier class (Wikidata identifier structure)
  • Provenance: schemas/provenance.yaml - GHCIDHistoryEntry (tracking GHCID changes)
  • Existing Scripts: scripts/enrich_latam_institutions_fuzzy.py (Mexico enrichment example)
  • Session Context: SESSION_SUMMARY_2025-11-08_LATAM.md (Latin America enrichment results)

Document Status: ACTIVE REMEDIATION PLAN
Owner: GLAM Data Extraction Project
Last Updated: 2025-11-09
Next Review: After Phase 1 completion (2025-11-09)