Synthetic Q-Number Remediation Plan
Status: 🚨 CRITICAL DATA QUALITY ISSUE
Created: 2025-11-09
Priority: HIGH
Problem Statement
The global heritage institutions dataset contains 2,607 institutions with synthetic Q-numbers (Q90000000 and above). These are algorithmically generated identifiers that:
- ❌ Do NOT correspond to real Wikidata entities
- ❌ Break Linked Open Data integrity (RDF triples with fake Q-numbers)
- ❌ Violate W3C persistent identifier best practices
- ❌ Create citation errors (Q-numbers don't resolve to Wikidata pages)
- ❌ Undermine trust in the dataset
Policy Update: As of 2025-11-09, synthetic Q-numbers are strictly prohibited in this project. See AGENTS.md section "Persistent Identifiers (GHCID)" for detailed policy.
Current Dataset Status
Total institutions: 13,396
├─ Real Wikidata Q-numbers: 7,330 (54.7%) ✅
├─ Synthetic Q-numbers: 2,607 (19.5%) ❌ NEEDS FIXING
└─ No Wikidata ID: 3,459 (25.8%) ⚠️ ACCEPTABLE (will enrich later)
Impact Assessment
GHCIDs Affected
Institutions with synthetic Q-numbers have GHCIDs in the format:
{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}-Q9000XXXX
Example: NL-NH-AMS-M-RJ-Q90052341
These GHCIDs are valid structurally but use fake Wikidata identifiers.
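The format above can be checked mechanically. The following is a minimal sketch, assuming the Q-number is always the final hyphen-separated segment of the GHCID; the helper names `split_ghcid` and `is_synthetic_qid` are hypothetical and do not come from the project code.

```python
import re
from typing import Optional, Tuple

SYNTHETIC_MIN = 90_000_000  # Q90000000 and above are synthetic, per this plan

# Assumption: the Q-suffix, when present, is the last "-Q<digits>" segment.
_Q_SUFFIX = re.compile(r"^(?P<base>.+)-(?P<qid>Q[0-9]+)$")


def split_ghcid(ghcid: str) -> Tuple[str, Optional[str]]:
    """Split a GHCID into (base GHCID, Q-number or None)."""
    match = _Q_SUFFIX.match(ghcid)
    if match is None:
        return ghcid, None
    return match.group("base"), match.group("qid")


def is_synthetic_qid(qid: str) -> bool:
    """True if the Q-number falls in the synthetic range."""
    if re.fullmatch(r"Q[0-9]+", qid) is None:
        raise ValueError(f"not a Q-number: {qid!r}")
    return int(qid[1:]) >= SYNTHETIC_MIN


base, qid = split_ghcid("NL-NH-AMS-M-RJ-Q90052341")
print(base, qid, is_synthetic_qid(qid))  # NL-NH-AMS-M-RJ Q90052341 True
```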
Data Tiers
Synthetic Q-numbers impact data tier classification:
- Current: TIER_3_CROWD_SOURCED (incorrect - fake Wikidata)
- Should be: TIER_4_INFERRED (until real Q-number obtained)
Remediation Strategy
Phase 1: Immediate - Remove Synthetic Q-Numbers from GHCIDs
Objective: Strip synthetic Q-numbers from GHCIDs, revert to base GHCID
Actions:
- Identify all institutions with Q-numbers >= Q90000000
- Remove Q-suffix from GHCID (e.g., NL-NH-AMS-M-RJ-Q90052341 → NL-NH-AMS-M-RJ)
- Remove fake Wikidata identifier from identifiers array
- Add needs_wikidata_enrichment: true flag
- Record change in ghcid_history
- Update provenance to reflect data tier correction
Script: scripts/remove_synthetic_q_numbers.py
Estimated Time: 15-20 minutes
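A rough sketch of the per-record Phase 1 actions follows. The record shape is an assumption made for illustration: `identifiers` is taken to be a list of dicts with `scheme` and `value` keys, and the `ghcid_history` entry fields (`old_ghcid`, `new_ghcid`, `date`, `reason`) and the `provenance.data_tier` key are hypothetical. The real structures are defined in the project schemas and may differ.

```python
import copy
import re

SYNTHETIC_MIN = 90_000_000  # Q-numbers at or above this are synthetic (per this plan)


def is_synthetic(qid: str) -> bool:
    """True for strings like 'Q90052341' whose number is >= SYNTHETIC_MIN."""
    return re.fullmatch(r"Q[0-9]+", qid) is not None and int(qid[1:]) >= SYNTHETIC_MIN


def remove_synthetic_q_number(record: dict, today: str) -> dict:
    """Return a cleaned deep copy of `record`; the input is left untouched."""
    rec = copy.deepcopy(record)
    base, sep, suffix = rec["ghcid"].rpartition("-")
    has_suffix = bool(sep) and is_synthetic(suffix)

    identifiers = rec.get("identifiers", [])
    kept = [
        ident for ident in identifiers
        if not (ident.get("scheme") == "Wikidata"
                and is_synthetic(ident.get("value", "")))
    ]
    if not has_suffix and len(kept) == len(identifiers):
        return rec  # nothing synthetic in this record

    if has_suffix:
        # Record the GHCID change before overwriting it.
        rec.setdefault("ghcid_history", []).append({
            "old_ghcid": rec["ghcid"],
            "new_ghcid": base,
            "date": today,
            "reason": "synthetic Q-number removed",
        })
        rec["ghcid"] = base
    rec["identifiers"] = kept
    rec["needs_wikidata_enrichment"] = True
    rec.setdefault("provenance", {})["data_tier"] = "TIER_4_INFERRED"
    return rec


example = {
    "ghcid": "NL-NH-AMS-M-RJ-Q90052341",
    "identifiers": [{"scheme": "Wikidata", "value": "Q90052341"}],
    "provenance": {"data_tier": "TIER_3_CROWD_SOURCED"},
}
cleaned = remove_synthetic_q_number(example, today="2025-11-09")
print(cleaned["ghcid"])  # NL-NH-AMS-M-RJ
```

Working on a copy keeps the original record available for a before/after diff, which may help when validating the batch run.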
Expected Outcome:
Real Wikidata Q-numbers: 7,330 (54.7%) ✅
Synthetic Q-numbers: 0 (0.0%) ✅ FIXED
No Wikidata ID: 6,066 (45.3%) ⚠️ Flagged for enrichment
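The expected counts follow directly from the current dataset status. A quick arithmetic check, using only the figures stated in this plan:

```python
total = 13_396
real, synthetic, no_id = 7_330, 2_607, 3_459

# The three categories should account for every institution.
assert real + synthetic + no_id == total

# After Phase 1, formerly synthetic records join the "No Wikidata ID" group.
flagged = synthetic + no_id

print(f"Real Wikidata Q-numbers: {real:,} ({real / total:.1%})")
print(f"No Wikidata ID: {flagged:,} ({flagged / total:.1%})")
```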
Phase 2: Wikidata Enrichment - Obtain Real Q-Numbers
Objective: Query Wikidata API to find real Q-numbers for 6,066 institutions
Priority Order:
- Dutch institutions (1,351 total)
  - High data quality (TIER_1 CSV sources)
  - Many already have ISIL codes
  - Expected match rate: 70-80%
- Latin America institutions (Brazil, Chile, Mexico)
  - Mexico: 21.1% → 31.2% coverage (✅ enriched Nov 8)
  - Chile: 28.9% coverage (good name quality)
  - Brazil: 1.0% coverage (poor name quality, needs web scraping)
- European institutions (Belgium, Italy, Denmark, Austria, etc.)
  - ~500 institutions
  - Expected match rate: 60-70%
- Asian institutions (Japan, Vietnam, Thailand, Taiwan, etc.)
  - ~800 institutions
  - Expected match rate: 40-50% (language barriers)
- African/Middle Eastern institutions
  - ~200 institutions
  - Expected match rate: 30-40% (fewer Wikidata entries)
Enrichment Methods:
- SPARQL Query (primary):

      SELECT ?item ?itemLabel ?viaf ?isil WHERE {
        ?item wdt:P31/wdt:P279* wd:Q33506 .   # Museum
        ?item wdt:P131* wd:Q727 .             # Located in Amsterdam
        OPTIONAL { ?item wdt:P214 ?viaf }
        OPTIONAL { ?item wdt:P791 ?isil }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl" }
      }

- Fuzzy Name Matching (threshold > 0.85). rapidfuzz returns scores on a 0-100 scale, so the score is divided by 100 before it is compared with the threshold:

      from rapidfuzz import fuzz

      # fuzz.ratio returns a similarity between 0 and 100; rescale to 0-1
      score = fuzz.ratio(institution_name.lower(), wikidata_label.lower()) / 100

- ISIL/VIAF Cross-Reference (high confidence):
  - If institution has ISIL code, query Wikidata for matching ISIL (P791)
  - If institution has VIAF ID, query Wikidata for matching VIAF (P214)
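The ISIL/VIAF cross-reference can be sketched as a small query builder. This is an illustration only: `build_identifier_query` is a hypothetical helper, the identifier value in the example is a placeholder, and the sketch only builds the SPARQL text without sending it to any endpoint. The property IDs P791 (ISIL) and P214 (VIAF) are the ones used in the SPARQL query above.

```python
# Wikidata properties for the two cross-reference schemes named in this plan.
WIKIDATA_PROPS = {"isil": "P791", "viaf": "P214"}


def build_identifier_query(scheme: str, value: str) -> str:
    """Build a SPARQL query for items whose ISIL or VIAF equals `value`."""
    prop = WIKIDATA_PROPS[scheme.lower()]
    # Escape backslashes and double quotes so the value stays a valid literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:{prop} "{escaped}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl" }\n'
        "}"
    )


print(build_identifier_query("isil", "XX-PLACEHOLDER"))
```

An exact identifier match needs no fuzzy threshold, which is why this method is treated as high confidence.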
Scripts:
- scripts/enrich_dutch_institutions_wikidata.py (priority 1)
- scripts/enrich_latam_institutions_fuzzy.py (exists, used for Mexico)
- scripts/enrich_global_with_wikidata.py (create for global batch)
Estimated Time: 3-5 hours total (can be parallelized)
Phase 3: Manual Review - Edge Cases
Objective: Human review of institutions that cannot be automatically matched
Cases Requiring Manual Review:
- Low fuzzy match scores (70-85%)
- Multiple Wikidata candidates (disambiguation needed)
- Institutions with non-Latin script names
- Very small/local institutions not in Wikidata
Estimated Count: ~500-800 institutions
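One way to route matches between automatic acceptance and manual review is sketched below. It assumes scores are already normalized to the 0-1 range and uses the 0.85 and 0.70 boundaries from this plan; the function name `triage` and the outcome labels are hypothetical.

```python
from typing import List, Optional, Tuple

AUTO_ACCEPT = 0.85  # scores above this are accepted automatically
REVIEW_MIN = 0.70   # scores from here up to AUTO_ACCEPT go to manual review


def triage(candidates: List[Tuple[str, float]]) -> Tuple[str, Optional[str]]:
    """Classify Wikidata candidates given as (qid, score in 0..1) pairs."""
    strong = [qid for qid, score in candidates if score > AUTO_ACCEPT]
    if len(strong) == 1:
        return "accept", strong[0]
    if len(strong) > 1:
        return "review", None  # several strong candidates: disambiguation needed
    if any(REVIEW_MIN <= score <= AUTO_ACCEPT for _, score in candidates):
        return "review", None  # low fuzzy match score
    return "no_match", None


# Q1 and Q2 are placeholder identifiers, not real matches.
print(triage([("Q1", 0.93)]))                # ('accept', 'Q1')
print(triage([("Q1", 0.93), ("Q2", 0.90)]))  # ('review', None)
print(triage([("Q1", 0.78)]))                # ('review', None)
print(triage([("Q1", 0.40)]))                # ('no_match', None)
```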
Process:
- Export CSV with institution details + Wikidata candidates
- Manual review in spreadsheet
- Import verified Q-numbers
- Update GHCIDs and provenance
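The export step might look like the sketch below, which writes one row per institution/candidate pair and leaves a blank column for the reviewer. The column names and the input structure (`name`, `candidates`, `label`, `score`) are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical review columns; "verified_qid" is filled in by the reviewer.
FIELDS = ["ghcid", "name", "candidate_qid", "candidate_label", "score", "verified_qid"]


def export_review_rows(institutions, out) -> int:
    """Write one CSV row per (institution, candidate) pair; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for inst in institutions:
        for cand in inst["candidates"]:
            writer.writerow({
                "ghcid": inst["ghcid"],
                "name": inst["name"],
                "candidate_qid": cand["qid"],
                "candidate_label": cand["label"],
                "score": f"{cand['score']:.2f}",
                "verified_qid": "",
            })
            count += 1
    return count


# Placeholder data: the name, labels, and Q-numbers below are not real matches.
sample = [{
    "ghcid": "NL-NH-AMS-M-RJ",
    "name": "Example Museum",
    "candidates": [
        {"qid": "Q1", "label": "Example Museum", "score": 0.82},
        {"qid": "Q2", "label": "Example Museum Foundation", "score": 0.74},
    ],
}]
buffer = io.StringIO(newline="")
rows_written = export_review_rows(sample, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`.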
Phase 4: Web Scraping - No Wikidata Match
Objective: For institutions without Wikidata entries, verify existence via website
Actions:
- Use crawl4ai to scrape institutional websites
- Extract formal names, addresses, founding dates
- If institution exists but not in Wikidata:
  - Keep base GHCID (no Q-suffix)
  - Mark as TIER_2_VERIFIED (website confirmation)
  - Flag for Wikidata community contribution
- If institution no longer exists (closed):
  - Add ChangeEvent with change_type: CLOSURE
  - Keep record for historical reference
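The two outcomes of the website check can be sketched as a record update. The scraping itself is out of scope here. The field names `wikidata_contribution_candidate`, `change_history`, and `recorded_on`, and the outcome labels `"active"` and `"closed"`, are hypothetical. Only TIER_2_VERIFIED and change_type: CLOSURE come from this plan.

```python
import copy


def apply_website_check(record: dict, outcome: str, checked_on: str) -> dict:
    """Return an updated copy of `record` for a website-check outcome."""
    rec = copy.deepcopy(record)
    if outcome == "active":
        # Institution exists but has no Wikidata entry: keep the base GHCID.
        rec.setdefault("provenance", {})["data_tier"] = "TIER_2_VERIFIED"
        rec["wikidata_contribution_candidate"] = True
    elif outcome == "closed":
        # Keep the record for historical reference and log the closure.
        rec.setdefault("change_history", []).append(
            {"change_type": "CLOSURE", "recorded_on": checked_on}
        )
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return rec


record = {"ghcid": "NL-NH-AMS-M-RJ", "provenance": {"data_tier": "TIER_4_INFERRED"}}
verified = apply_website_check(record, "active", "2025-11-16")
closed = apply_website_check(record, "closed", "2025-11-16")
print(verified["provenance"]["data_tier"])         # TIER_2_VERIFIED
print(closed["change_history"][0]["change_type"])  # CLOSURE
```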
Success Metrics
Phase 1 Success Criteria
- ✅ Zero synthetic Q-numbers in dataset
- ✅ All affected institutions flagged with needs_wikidata_enrichment
- ✅ GHCID history entries created for all changes
- ✅ Provenance updated to reflect data tier correction
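A validation pass for the first criterion could look like the sketch below. It inspects only the current GHCID and the Wikidata identifiers, because ghcid_history entries are expected to retain the old synthetic value. The record shape (`identifiers` as a list of dicts with `scheme` and `value` keys) is an assumption.

```python
import re

SYNTHETIC_MIN = 90_000_000
_QID = re.compile(r"\bQ([0-9]+)\b")


def records_with_synthetic(records):
    """Return the GHCIDs of records that still carry a synthetic Q-number."""
    offenders = []
    for rec in records:
        # Check the live GHCID and Wikidata identifiers, not the history.
        values = [rec.get("ghcid", "")]
        values += [
            ident.get("value", "")
            for ident in rec.get("identifiers", [])
            if ident.get("scheme") == "Wikidata"
        ]
        numbers = [int(n) for value in values for n in _QID.findall(value)]
        if any(n >= SYNTHETIC_MIN for n in numbers):
            offenders.append(rec.get("ghcid"))
    return offenders


dataset = [
    {"ghcid": "NL-NH-AMS-M-RJ",
     "ghcid_history": [{"old_ghcid": "NL-NH-AMS-M-RJ-Q90052341"}]},
    {"ghcid": "NL-NH-AMS-M-XX-Q90000001", "identifiers": []},
]
print(records_with_synthetic(dataset))  # ['NL-NH-AMS-M-XX-Q90000001']
```

An empty result would indicate that the "zero synthetic Q-numbers" criterion holds for the records checked.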
Phase 2 Success Criteria
- ✅ Dutch institutions: 70%+ real Wikidata coverage
- ✅ Latin America: 40%+ real Wikidata coverage
- ✅ Global: 60%+ institutions with real Q-numbers
- ✅ All Q-numbers verified resolvable on Wikidata
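Coverage targets like these can be tracked per country with a short tally. This sketch assumes the country code is the first GHCID segment and that a record counts as covered when it has any identifier with `scheme` equal to `"Wikidata"`. Both are illustrative assumptions, and the Q-numbers in the sample are placeholders.

```python
from collections import Counter


def wikidata_coverage(records) -> dict:
    """Return {country code: share of records with a Wikidata identifier}."""
    total, covered = Counter(), Counter()
    for rec in records:
        country = rec["ghcid"].split("-", 1)[0]
        total[country] += 1
        if any(i.get("scheme") == "Wikidata" for i in rec.get("identifiers", [])):
            covered[country] += 1
    return {country: covered[country] / total[country] for country in total}


sample = [
    {"ghcid": "NL-NH-AMS-M-AA", "identifiers": [{"scheme": "Wikidata", "value": "Q1"}]},
    {"ghcid": "NL-NH-AMS-M-BB", "identifiers": []},
    {"ghcid": "BR-SP-SAO-M-CC", "identifiers": []},
]
for country, share in wikidata_coverage(sample).items():
    print(f"{country}: {share:.0%}")
```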
Phase 3 Success Criteria
- ✅ Manual review completed for all ambiguous cases
- ✅ Disambiguation documented in provenance notes
Phase 4 Success Criteria
- ✅ Website verification for remaining institutions
- ✅ TIER_2_VERIFIED status assigned where applicable
- ✅ List of candidates for Wikidata community contribution
Timeline
| Phase | Duration | Start Date | Completion Target |
|---|---|---|---|
| Phase 1: Remove Synthetic Q-Numbers | 15-20 min | 2025-11-09 | 2025-11-09 |
| Phase 2: Wikidata Enrichment | 3-5 hours | 2025-11-10 | 2025-11-11 |
| Phase 3: Manual Review | 2-3 days | 2025-11-12 | 2025-11-15 |
| Phase 4: Web Scraping | 1 week | 2025-11-16 | 2025-11-23 |
Total Project Duration: ~2 weeks
Next Steps
Immediate Actions (within 24 hours):
- ✅ Update AGENTS.md with synthetic Q-number prohibition policy (DONE)
- ⏳ Create scripts/remove_synthetic_q_numbers.py (Phase 1 script)
- ⏳ Run Phase 1 remediation - Remove all synthetic Q-numbers
- ⏳ Validate dataset - Confirm zero synthetic Q-numbers remain
Short-term Actions (within 1 week):
- ⏳ Create scripts/enrich_dutch_institutions_wikidata.py (highest ROI)
- ⏳ Run Dutch Wikidata enrichment - Target 70%+ coverage
- ⏳ Run Chile Wikidata enrichment - Lower threshold to 0.80
- ⏳ Create global enrichment script - Batch process remaining countries
Medium-term Actions (within 2 weeks):
- ⏳ Manual review CSV export - Edge cases and ambiguous matches
- ⏳ Web scraping for Brazilian institutions - Poor name quality issue
- ⏳ Final validation - Verify 60%+ global Wikidata coverage
- ⏳ Update documentation - Reflect new data quality standards
References
- Policy: AGENTS.md - "Persistent Identifiers (GHCID)" section (prohibition statement)
- Schema: schemas/core.yaml - Identifier class (Wikidata identifier structure)
- Provenance: schemas/provenance.yaml - GHCIDHistoryEntry (tracking GHCID changes)
- Existing Scripts: scripts/enrich_latam_institutions_fuzzy.py (Mexico enrichment example)
- Session Context: SESSION_SUMMARY_2025-11-08_LATAM.md (Latin America enrichment results)
Document Status: ACTIVE REMEDIATION PLAN
Owner: GLAM Data Extraction Project
Last Updated: 2025-11-09
Next Review: After Phase 1 completion (2025-11-09)