# Synthetic Q-Number Remediation Plan
**Status**: 🚨 **CRITICAL DATA QUALITY ISSUE**

**Created**: 2025-11-09

**Priority**: HIGH
## Problem Statement

The global heritage institutions dataset contains **2,607 institutions with synthetic Q-numbers** (Q90000000 and above). These are algorithmically generated identifiers that:

- ❌ Do NOT correspond to real Wikidata entities
- ❌ Break Linked Open Data integrity (RDF triples with fake Q-numbers)
- ❌ Violate W3C persistent identifier best practices
- ❌ Create citation errors (Q-numbers don't resolve to Wikidata pages)
- ❌ Undermine trust in the dataset

**Policy Update**: As of 2025-11-09, synthetic Q-numbers are **strictly prohibited** in this project. See `AGENTS.md` section "Persistent Identifiers (GHCID)" for detailed policy.

## Current Dataset Status

```
Total institutions: 13,396
├─ Real Wikidata Q-numbers: 7,330 (54.7%) ✅
├─ Synthetic Q-numbers: 2,607 (19.5%) ❌ NEEDS FIXING
└─ No Wikidata ID: 3,459 (25.8%) ⚠️ ACCEPTABLE (will enrich later)
```

## Impact Assessment

### GHCIDs Affected

Institutions with synthetic Q-numbers have GHCIDs in the format:

- `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}-Q9XXXXXXX` (Q90000000 or above)

Example: `NL-NH-AMS-M-RJ-Q90052341`

These GHCIDs are **structurally valid** but embed **fake Wikidata identifiers**.

### Data Tiers

Synthetic Q-numbers also distort data tier classification:

- Current: `TIER_3_CROWD_SOURCED` (incorrect; the Wikidata link is fake)
- Should be: `TIER_4_INFERRED` (until a real Q-number is obtained)

## Remediation Strategy

### Phase 1: Immediate - Remove Synthetic Q-Numbers from GHCIDs

**Objective**: Strip synthetic Q-numbers from GHCIDs and revert to the base GHCID.

**Actions**:

1. Identify all institutions with Q-numbers >= Q90000000
2. Remove the Q-suffix from the GHCID (e.g., `NL-NH-AMS-M-RJ-Q90052341` → `NL-NH-AMS-M-RJ`)
3. Remove the fake Wikidata identifier from the `identifiers` array
4. Add a `needs_wikidata_enrichment: true` flag
5. Record the change in `ghcid_history`
6. Update `provenance` to reflect the data tier correction
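
The six steps above can be sketched as follows. The record layout (`ghcid`, `identifiers`, `needs_wikidata_enrichment`, `ghcid_history`) follows the field names used in this plan, but the exact structure is an assumption; the real logic belongs in `scripts/remove_synthetic_q_numbers.py`.

```python
from datetime import date

SYNTHETIC_THRESHOLD = 90_000_000  # per policy: Q90000000 and above are synthetic

def remediate(record: dict) -> dict:
    """Strip a synthetic Q-suffix from one institution record (sketch, assumed schema)."""
    ghcid = record["ghcid"]
    base, _, suffix = ghcid.rpartition("-")
    if suffix.startswith("Q") and suffix[1:].isdigit() and int(suffix[1:]) >= SYNTHETIC_THRESHOLD:
        record["ghcid"] = base
        # Drop the fake Wikidata identifier and flag the record for enrichment
        record["identifiers"] = [
            i for i in record.get("identifiers", []) if i.get("value") != suffix
        ]
        record["needs_wikidata_enrichment"] = True
        record.setdefault("ghcid_history", []).append({
            "old_ghcid": ghcid,
            "new_ghcid": base,
            "reason": "synthetic_q_number_removed",
            "date": date.today().isoformat(),
        })
    return record

record = remediate({"ghcid": "NL-NH-AMS-M-RJ-Q90052341",
                    "identifiers": [{"scheme": "wikidata", "value": "Q90052341"}]})
print(record["ghcid"])  # → NL-NH-AMS-M-RJ
```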

**Script**: `scripts/remove_synthetic_q_numbers.py`

**Estimated Time**: 15-20 minutes

**Expected Outcome**:

```
Real Wikidata Q-numbers: 7,330 (54.7%) ✅
Synthetic Q-numbers: 0 (0.0%) ✅ FIXED
No Wikidata ID: 6,066 (45.3%) ⚠️ Flagged for enrichment
```

### Phase 2: Wikidata Enrichment - Obtain Real Q-Numbers

**Objective**: Query the Wikidata API to find real Q-numbers for the 6,066 institutions without one.

**Priority Order**:

1. **Dutch institutions** (1,351 total)
   - High data quality (TIER_1 CSV sources)
   - Many already have ISIL codes
   - Expected match rate: 70-80%

2. **Latin American institutions** (Brazil, Chile, Mexico)
   - Mexico: 21.1% → 31.2% coverage (✅ enriched Nov 8)
   - Chile: 28.9% coverage (good name quality)
   - Brazil: 1.0% coverage (poor name quality, needs web scraping)

3. **European institutions** (Belgium, Italy, Denmark, Austria, etc.)
   - ~500 institutions
   - Expected match rate: 60-70%

4. **Asian institutions** (Japan, Vietnam, Thailand, Taiwan, etc.)
   - ~800 institutions
   - Expected match rate: 40-50% (language barriers)

5. **African and Middle Eastern institutions**
   - ~200 institutions
   - Expected match rate: 30-40% (fewer Wikidata entries)

**Enrichment Methods**:

1. **SPARQL Query** (primary):

   ```sparql
   SELECT ?item ?itemLabel ?viaf ?isil WHERE {
     ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or a subclass)
     ?item wdt:P131* wd:Q727 .            # located in Amsterdam
     OPTIONAL { ?item wdt:P214 ?viaf }    # VIAF ID
     OPTIONAL { ?item wdt:P791 ?isil }    # ISIL code
     SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl" }
   }
   ```

2. **Fuzzy Name Matching** (threshold > 85 on RapidFuzz's 0-100 scale):

   ```python
   from rapidfuzz import fuzz

   # fuzz.ratio returns a similarity score between 0 and 100
   score = fuzz.ratio(institution_name.lower(), wikidata_label.lower())
   is_candidate = score > 85
   ```

3. **ISIL/VIAF Cross-Reference** (high confidence):
   - If the institution has an ISIL code, query Wikidata for a matching ISIL (P791)
   - If the institution has a VIAF ID, query Wikidata for a matching VIAF (P214)
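
The cross-reference lookups (and the SPARQL query above) can be sent to the public Wikidata Query Service. In the sketch below, P791/P214 are the real Wikidata properties for ISIL/VIAF and the endpoint URL is the real WDQS address; the helper names and the example ISIL code are illustrative.

```python
from urllib.parse import urlencode

# Real Wikidata vocabulary: P791 = ISIL, P214 = VIAF.
LOOKUP_PROPS = {"isil": "P791", "viaf": "P214"}
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public query service

def cross_reference_query(scheme: str, code: str) -> str:
    """SPARQL that finds the item carrying a known external identifier."""
    prop = LOOKUP_PROPS[scheme]
    return f'SELECT ?item WHERE {{ ?item wdt:{prop} "{code}" }} LIMIT 5'

def wdqs_request_url(query: str) -> str:
    """Build a GET URL that returns SPARQL results as JSON."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

url = wdqs_request_url(cross_reference_query("isil", "NL-000000"))  # hypothetical ISIL
print(url.startswith("https://query.wikidata.org/sparql?"))  # → True
```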

**Scripts**:

- `scripts/enrich_dutch_institutions_wikidata.py` (priority 1)
- `scripts/enrich_latam_institutions_fuzzy.py` (exists; used for Mexico)
- `scripts/enrich_global_with_wikidata.py` (to be created for the global batch)

**Estimated Time**: 3-5 hours total (can be parallelized)

### Phase 3: Manual Review - Edge Cases

**Objective**: Human review of institutions that cannot be matched automatically.

**Cases Requiring Manual Review**:

1. Low fuzzy match scores (70-85)
2. Multiple Wikidata candidates (disambiguation needed)
3. Institutions with non-Latin-script names
4. Very small or local institutions not in Wikidata

**Estimated Count**: ~500-800 institutions

**Process**:

1. Export a CSV with institution details and Wikidata candidates
2. Review manually in a spreadsheet
3. Import verified Q-numbers
4. Update GHCIDs and provenance
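
Step 1's export might be sketched with the standard library. The column names and the sample row are illustrative assumptions, not the project schema, and the candidate QID shown is a placeholder.

```python
import csv
import io

# Hypothetical review rows; scores on RapidFuzz's 0-100 scale.
rows = [
    {"name": "Museu Historico Regional",       # hypothetical institution name
     "candidate_qid": "Q12345",                # placeholder QID, not a verified match
     "candidate_label": "Museu Histórico Regional",
     "score": 78.4},
]

buf = io.StringIO()  # swap for open("manual_review.csv", "w", newline="") in practice
writer = csv.DictWriter(buf, fieldnames=["name", "candidate_qid", "candidate_label", "score"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # → name,candidate_qid,candidate_label,score
```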

### Phase 4: Web Scraping - No Wikidata Match

**Objective**: For institutions without Wikidata entries, verify existence via their websites.

**Actions**:

1. Use `crawl4ai` to scrape institutional websites
2. Extract formal names, addresses, and founding dates
3. If the institution exists but is not in Wikidata:
   - Keep the base GHCID (no Q-suffix)
   - Mark as `TIER_2_VERIFIED` (website confirmation)
   - Flag for Wikidata community contribution
4. If the institution no longer exists (closed):
   - Add a `ChangeEvent` with `change_type: CLOSURE`
   - Keep the record for historical reference
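
A sketch of the closure case, assuming a `ChangeEvent` is stored as a dict using the field names above; the authoritative shape is defined in the project schemas, and the date and note are illustrative.

```python
from datetime import date

# Assumed ChangeEvent shape; "change_type: CLOSURE" comes from this plan.
closure_event = {
    "change_type": "CLOSURE",
    "date": date(2025, 11, 16).isoformat(),
    "source": "institution website (crawl4ai verification)",       # hypothetical
    "note": "Website confirms permanent closure; record retained.", # hypothetical
}

institution = {"ghcid": "NL-NH-AMS-M-RJ", "change_events": []}
institution["change_events"].append(closure_event)  # record kept, not deleted
print(institution["change_events"][0]["change_type"])  # → CLOSURE
```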
## Success Metrics

### Phase 1 Success Criteria

- ✅ Zero synthetic Q-numbers in the dataset
- ✅ All affected institutions flagged with `needs_wikidata_enrichment`
- ✅ GHCID history entries created for all changes
- ✅ Provenance updated to reflect the data tier correction

### Phase 2 Success Criteria

- ✅ Dutch institutions: 70%+ real Wikidata coverage
- ✅ Latin America: 40%+ real Wikidata coverage
- ✅ Global: 60%+ of institutions with real Q-numbers
- ✅ All Q-numbers verified as resolvable on Wikidata

### Phase 3 Success Criteria

- ✅ Manual review completed for all ambiguous cases
- ✅ Disambiguation documented in provenance notes

### Phase 4 Success Criteria

- ✅ Website verification for remaining institutions
- ✅ `TIER_2_VERIFIED` status assigned where applicable
- ✅ List of candidates for Wikidata community contribution

## Timeline

| Phase | Duration | Start Date | Completion Target |
|-------|----------|------------|-------------------|
| **Phase 1: Remove Synthetic Q-Numbers** | 15-20 min | 2025-11-09 | 2025-11-09 |
| **Phase 2: Wikidata Enrichment** | 3-5 hours | 2025-11-10 | 2025-11-11 |
| **Phase 3: Manual Review** | 2-3 days | 2025-11-12 | 2025-11-15 |
| **Phase 4: Web Scraping** | 1 week | 2025-11-16 | 2025-11-23 |

**Total Project Duration**: ~2 weeks

## Next Steps

**Immediate Actions** (within 24 hours):

1. ✅ **Update AGENTS.md** with synthetic Q-number prohibition policy (DONE)
2. ⏳ **Create `scripts/remove_synthetic_q_numbers.py`** (Phase 1 script)
3. ⏳ **Run Phase 1 remediation** - Remove all synthetic Q-numbers
4. ⏳ **Validate dataset** - Confirm zero synthetic Q-numbers remain

**Short-term Actions** (within 1 week):

5. ⏳ **Create `scripts/enrich_dutch_institutions_wikidata.py`** (highest ROI)
6. ⏳ **Run Dutch Wikidata enrichment** - Target 70%+ coverage
7. ⏳ **Run Chile Wikidata enrichment** - Lower threshold to 0.80
8. ⏳ **Create global enrichment script** - Batch process remaining countries

**Medium-term Actions** (within 2 weeks):

9. ⏳ **Manual review CSV export** - Edge cases and ambiguous matches
10. ⏳ **Web scraping for Brazilian institutions** - Poor name quality issue
11. ⏳ **Final validation** - Verify 60%+ global Wikidata coverage
12. ⏳ **Update documentation** - Reflect new data quality standards

## References

- **Policy**: `AGENTS.md` - "Persistent Identifiers (GHCID)" section (prohibition statement)
- **Schema**: `schemas/core.yaml` - `Identifier` class (Wikidata identifier structure)
- **Provenance**: `schemas/provenance.yaml` - `GHCIDHistoryEntry` (tracking GHCID changes)
- **Existing Scripts**: `scripts/enrich_latam_institutions_fuzzy.py` (Mexico enrichment example)
- **Session Context**: `SESSION_SUMMARY_2025-11-08_LATAM.md` (Latin America enrichment results)

---

**Document Status**: ACTIVE REMEDIATION PLAN

**Owner**: GLAM Data Extraction Project

**Last Updated**: 2025-11-09

**Next Review**: After Phase 1 completion (2025-11-09)