# Synthetic Q-Number Remediation Plan

**Status**: 🚨 **CRITICAL DATA QUALITY ISSUE**
**Created**: 2025-11-09
**Priority**: HIGH

## Problem Statement

The global heritage institutions dataset contains **2,607 institutions with synthetic Q-numbers** (Q90000000 and above). These are algorithmically generated identifiers that:

- ❌ Do NOT correspond to real Wikidata entities
- ❌ Break Linked Open Data integrity (RDF triples with fake Q-numbers)
- ❌ Violate W3C persistent-identifier best practices
- ❌ Create citation errors (the Q-numbers do not resolve to Wikidata pages)
- ❌ Undermine trust in the dataset

**Policy Update**: As of 2025-11-09, synthetic Q-numbers are **strictly prohibited** in this project. See the "Persistent Identifiers (GHCID)" section of `AGENTS.md` for the detailed policy.

## Current Dataset Status

```
Total institutions: 13,396
├─ Real Wikidata Q-numbers: 7,330 (54.7%) ✅
├─ Synthetic Q-numbers:     2,607 (19.5%) ❌ NEEDS FIXING
└─ No Wikidata ID:          3,459 (25.8%) ⚠️ ACCEPTABLE (will enrich later)
```

## Impact Assessment

### GHCIDs Affected

Institutions with synthetic Q-numbers have GHCIDs in the format:

- `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}-Q90XXXXXX`

Example: `NL-NH-AMS-M-RJ-Q90052341`

These GHCIDs are **structurally valid** but embed **fake Wikidata identifiers**.

### Data Tiers

Synthetic Q-numbers distort data tier classification:

- Current: TIER_3_CROWD_SOURCED (incorrect: based on a fake Wikidata ID)
- Should be: TIER_4_INFERRED (until a real Q-number is obtained)

## Remediation Strategy

### Phase 1: Immediate - Remove Synthetic Q-Numbers from GHCIDs

**Objective**: Strip synthetic Q-numbers from GHCIDs and revert to the base GHCID.

**Actions**:

1. Identify all institutions with Q-numbers >= Q90000000
2. Remove the Q-suffix from the GHCID (e.g., `NL-NH-AMS-M-RJ-Q90052341` → `NL-NH-AMS-M-RJ`)
3. Remove the fake Wikidata identifier from the `identifiers` array
4. Add a `needs_wikidata_enrichment: true` flag
5. Record the change in `ghcid_history`
6. Update `provenance` to reflect the data tier correction
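The Phase 1 removal steps could be sketched roughly as below. This is a minimal sketch, assuming institution records are dicts with `ghcid` and `identifiers` keys as implied by the schema references; the real `ghcid_history` entries are `GHCIDHistoryEntry` objects per `schemas/provenance.yaml`, simplified here to plain strings.

```python
import re

SYNTHETIC_MIN = 90_000_000  # Q90000000 and above are synthetic (per policy)

def is_synthetic(qid: str) -> bool:
    """True for algorithmically generated Q-numbers (Q90000000+)."""
    m = re.fullmatch(r"Q(\d+)", qid)
    return bool(m) and int(m.group(1)) >= SYNTHETIC_MIN

def strip_synthetic_q(record: dict) -> dict:
    """Strip a synthetic Q-suffix from the GHCID and flag for enrichment.

    Field names (`identifiers`, `scheme`, `value`) are illustrative; the
    actual shapes live in schemas/core.yaml.
    """
    ghcid = record["ghcid"]
    base, _, suffix = ghcid.rpartition("-")
    if base and is_synthetic(suffix):
        # Record the old GHCID before rewriting it (simplified history entry).
        record["ghcid_history"] = record.get("ghcid_history", []) + [ghcid]
        record["ghcid"] = base
        # Drop the fake Wikidata identifier from the identifiers array.
        record["identifiers"] = [
            i for i in record.get("identifiers", [])
            if not (i.get("scheme") == "wikidata" and is_synthetic(i.get("value", "")))
        ]
        record["needs_wikidata_enrichment"] = True
    return record
```

Real Q-numbers below Q90000000 pass through untouched, so the script is safe to run over the whole dataset.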
**Script**: `scripts/remove_synthetic_q_numbers.py`
**Estimated Time**: 15-20 minutes

**Expected Outcome**:

```
Real Wikidata Q-numbers: 7,330 (54.7%) ✅
Synthetic Q-numbers:         0  (0.0%) ✅ FIXED
No Wikidata ID:          6,066 (45.3%) ⚠️ Flagged for enrichment
```

### Phase 2: Wikidata Enrichment - Obtain Real Q-Numbers

**Objective**: Query the Wikidata API to find real Q-numbers for the 6,066 unmatched institutions.

**Priority Order**:

1. **Dutch institutions** (1,351 total)
   - High data quality (TIER_1 CSV sources)
   - Many already have ISIL codes
   - Expected match rate: 70-80%
2. **Latin American institutions** (Brazil, Chile, Mexico)
   - Mexico: 21.1% → 31.2% coverage (✅ enriched Nov 8)
   - Chile: 28.9% coverage (good name quality)
   - Brazil: 1.0% coverage (poor name quality; needs web scraping)
3. **European institutions** (Belgium, Italy, Denmark, Austria, etc.)
   - ~500 institutions
   - Expected match rate: 60-70%
4. **Asian institutions** (Japan, Vietnam, Thailand, Taiwan, etc.)
   - ~800 institutions
   - Expected match rate: 40-50% (language barriers)
5. **African and Middle Eastern institutions**
   - ~200 institutions
   - Expected match rate: 30-40% (fewer Wikidata entries)

**Enrichment Methods**:

1. **SPARQL query** (primary):

   ```sparql
   SELECT ?item ?itemLabel ?viaf ?isil WHERE {
     ?item wdt:P31/wdt:P279* wd:Q33506 .   # instance of museum (or subclass)
     ?item wdt:P131* wd:Q727 .             # located in Amsterdam
     OPTIONAL { ?item wdt:P214 ?viaf }     # VIAF ID
     OPTIONAL { ?item wdt:P791 ?isil }     # ISIL code
     SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl" }
   }
   ```

2. **Fuzzy name matching** (threshold > 0.85):

   ```python
   from rapidfuzz import fuzz

   score = fuzz.ratio(institution_name.lower(), wikidata_label.lower())
   ```

3. **ISIL/VIAF cross-reference** (high confidence):
   - If an institution has an ISIL code, query Wikidata for a matching ISIL
   - If an institution has a VIAF ID, query Wikidata for a matching VIAF
**Scripts**:

- `scripts/enrich_dutch_institutions_wikidata.py` (priority 1)
- `scripts/enrich_latam_institutions_fuzzy.py` (exists; used for Mexico)
- `scripts/enrich_global_with_wikidata.py` (to be created for the global batch)

**Estimated Time**: 3-5 hours total (can be parallelized)

### Phase 3: Manual Review - Edge Cases

**Objective**: Human review of institutions that cannot be matched automatically.

**Cases Requiring Manual Review**:

1. Low fuzzy match scores (70-85%)
2. Multiple Wikidata candidates (disambiguation needed)
3. Institutions with non-Latin-script names
4. Very small or local institutions not in Wikidata

**Estimated Count**: ~500-800 institutions

**Process**:

1. Export a CSV with institution details plus Wikidata candidates
2. Review manually in a spreadsheet
3. Import verified Q-numbers
4. Update GHCIDs and provenance

### Phase 4: Web Scraping - No Wikidata Match

**Objective**: For institutions without Wikidata entries, verify existence via their websites.

**Actions**:

1. Use `crawl4ai` to scrape institutional websites
2. Extract formal names, addresses, and founding dates
3. If an institution exists but is not in Wikidata:
   - Keep the base GHCID (no Q-suffix)
   - Mark as TIER_2_VERIFIED (website confirmation)
   - Flag for Wikidata community contribution
4. If an institution no longer exists (closed):
   - Add a `ChangeEvent` with `change_type: CLOSURE`
   - Keep the record for historical reference
## Success Metrics

### Phase 1 Success Criteria

- ✅ Zero synthetic Q-numbers in the dataset
- ✅ All affected institutions flagged with `needs_wikidata_enrichment`
- ✅ GHCID history entries created for all changes
- ✅ Provenance updated to reflect the data tier correction

### Phase 2 Success Criteria

- ✅ Dutch institutions: 70%+ real Wikidata coverage
- ✅ Latin America: 40%+ real Wikidata coverage
- ✅ Global: 60%+ of institutions with real Q-numbers
- ✅ All Q-numbers verified as resolvable on Wikidata

### Phase 3 Success Criteria

- ✅ Manual review completed for all ambiguous cases
- ✅ Disambiguation decisions documented in provenance notes

### Phase 4 Success Criteria

- ✅ Website verification for the remaining institutions
- ✅ TIER_2_VERIFIED status assigned where applicable
- ✅ List of candidates for Wikidata community contribution

## Timeline

| Phase | Duration | Start Date | Completion Target |
|-------|----------|------------|-------------------|
| **Phase 1: Remove Synthetic Q-Numbers** | 15-20 min | 2025-11-09 | 2025-11-09 |
| **Phase 2: Wikidata Enrichment** | 3-5 hours | 2025-11-10 | 2025-11-11 |
| **Phase 3: Manual Review** | 2-3 days | 2025-11-12 | 2025-11-15 |
| **Phase 4: Web Scraping** | 1 week | 2025-11-16 | 2025-11-23 |

**Total Project Duration**: ~2 weeks

## Next Steps

**Immediate Actions** (within 24 hours):

1. ✅ **Update `AGENTS.md`** with the synthetic Q-number prohibition policy (DONE)
2. ⏳ **Create `scripts/remove_synthetic_q_numbers.py`** (Phase 1 script)
3. ⏳ **Run Phase 1 remediation** - remove all synthetic Q-numbers
4. ⏳ **Validate the dataset** - confirm zero synthetic Q-numbers remain

**Short-term Actions** (within 1 week):

5. ⏳ **Create `scripts/enrich_dutch_institutions_wikidata.py`** (highest ROI)
6. ⏳ **Run Dutch Wikidata enrichment** - target 70%+ coverage
7. ⏳ **Run Chile Wikidata enrichment** - lower the threshold to 0.80
8. ⏳ **Create a global enrichment script** - batch-process the remaining countries
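The validation steps (confirming zero synthetic Q-numbers remain, and the Phase 2 criterion that every Q-number resolves on Wikidata) can lean on Wikidata's public `Special:EntityData` endpoint, which returns HTTP 404 for nonexistent Q-numbers. A sketch, not the project's actual validation script:

```python
import json
import urllib.error
import urllib.request

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

def entity_data_url(qid: str) -> str:
    """Public Wikidata endpoint for one entity's JSON dump."""
    return ENTITY_DATA.format(qid=qid)

def wikidata_resolves(qid: str, timeout: float = 10.0) -> bool:
    """True if the Q-number resolves to a real Wikidata entity (network call)."""
    try:
        with urllib.request.urlopen(entity_data_url(qid), timeout=timeout) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError:
        return False  # 404: the entity does not exist
    # Redirected Q-numbers return the target entity; any entity payload counts.
    return bool(payload.get("entities"))
```

For bulk validation, batch the checks with throttling (or a SPARQL `VALUES` query) rather than one request per institution, to stay within Wikimedia rate limits.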
**Medium-term Actions** (within 2 weeks):

9. ⏳ **Export a manual-review CSV** - edge cases and ambiguous matches
10. ⏳ **Web-scrape Brazilian institutions** - works around the poor name quality
11. ⏳ **Final validation** - verify 60%+ global Wikidata coverage
12. ⏳ **Update documentation** - reflect the new data quality standards

## References

- **Policy**: `AGENTS.md`, "Persistent Identifiers (GHCID)" section (prohibition statement)
- **Schema**: `schemas/core.yaml`, `Identifier` class (Wikidata identifier structure)
- **Provenance**: `schemas/provenance.yaml`, `GHCIDHistoryEntry` (tracking GHCID changes)
- **Existing Scripts**: `scripts/enrich_latam_institutions_fuzzy.py` (Mexico enrichment example)
- **Session Context**: `SESSION_SUMMARY_2025-11-08_LATAM.md` (Latin America enrichment results)

---

**Document Status**: ACTIVE REMEDIATION PLAN
**Owner**: GLAM Data Extraction Project
**Last Updated**: 2025-11-09
**Next Review**: After Phase 1 completion (2025-11-09)