glam/docs/WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md
2025-11-19 23:25:22 +01:00

# Wikidata Fuzzy Match Review - Danish Dataset
## Executive Summary
**Dataset**: Danish GLAM institutions (2,348 total)
**Wikidata Coverage**: 769 institutions (32.8%)
**Fuzzy Matches Requiring Review**: 185 institutions (24.1% of the 769 linked institutions)
**Match Score Range**: 85-99% confidence
**Review Status**: 🟡 **PENDING MANUAL REVIEW**

---
## Review Scope
### Match Distribution by Priority
| Priority | Score Range | Count | % of Fuzzy | Description |
|----------|-------------|-------|------------|-------------|
| **1** | 85-87% | 58 | 31.4% | Very uncertain - **REVIEW FIRST** |
| **2** | 87-90% | 62 | 33.5% | Uncertain - needs verification |
| **3** | 90-93% | 44 | 23.8% | Moderate confidence |
| **4** | 93-96% | 14 | 7.6% | Fairly confident |
| **5** | 96-99% | 7 | 3.8% | Mostly confident |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9% of fuzzy matches)
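
The priority buckets above can be expressed as a small lookup. This is a minimal sketch, not the logic of `generate_wikidata_review_report.py`: the function name `priority_for_score` is hypothetical, and it assumes that each range includes its lower bound (consistent with the sample record in this document that lists a score of 87.0 as Priority 2) and that fuzzy scores stay below 100.

```python
def priority_for_score(score: float) -> int:
    """Map a fuzzy match score (in percent) to a review priority from 1 to 5.

    Assumption: each bucket includes its lower bound, so 87.0 is Priority 2.
    """
    if not 85.0 <= score < 100.0:
        raise ValueError(f"score {score} is outside the fuzzy-match range")
    # (lower bound, priority), checked from the most confident bucket down.
    buckets = [(96.0, 5), (93.0, 4), (90.0, 3), (87.0, 2), (85.0, 1)]
    for lower_bound, priority in buckets:
        if score >= lower_bound:
            return priority
    raise AssertionError("unreachable: score was range-checked above")


print(priority_for_score(85.0))  # 1
print(priority_for_score(87.0))  # 2
print(priority_for_score(92.5))  # 3
```

If the report script treats boundaries differently, only the `buckets` list needs to change.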
### Institution Type Breakdown
| Type | Count | % of Fuzzy |
|------|-------|------------|
| **LIBRARY** | 152 | 82.2% |
| **ARCHIVE** | 33 | 17.8% |

**Observation**: Libraries dominate fuzzy matches (likely due to branch naming variations)

---
## Generated Files
### 1. Review Report (CSV)
**File**: `data/review/denmark_wikidata_fuzzy_matches.csv`
**Rows**: 185 data rows (plus 1 header row)
**Columns**: 13
**Column Reference**:
| Column | Description | Action |
|--------|-------------|--------|
| `priority` | 1-5 (1=most uncertain) | Sort by this to prioritize |
| `match_score` | 85.0-99.x% | Fuzzy match confidence |
| `institution_name` | Our dataset name | Compare with wikidata_label |
| `wikidata_label` | Wikidata entity label | Compare with institution_name |
| `city` | Institution location | Cross-check with Wikidata |
| `institution_type` | LIBRARY or ARCHIVE | Verify on Wikidata (P31) |
| `isil_code` | ISIL identifier (if any) | Strong validation signal |
| `ghcid` | Our persistent ID | Reference only |
| `wikidata_qid` | Q-number (e.g. Q12345) | Link target |
| `wikidata_url` | Direct Wikidata link | **CLICK TO VERIFY** |
| `validation_status` | **FILL IN**: CORRECT \| INCORRECT \| UNCERTAIN | Your decision |
| `validation_notes` | **FILL IN**: Explanation | Document reasoning |
| `institution_id` | W3ID URI | For script processing |
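
Before handing a reviewed file to the apply script, it can help to confirm that the column structure survived the spreadsheet round trip. The sketch below is an illustration only: `check_review_csv` is a hypothetical helper, the column names are taken from the table above (their order in the real file is not assumed), and an empty `validation_status` is treated as "not yet reviewed".

```python
import csv

EXPECTED_COLUMNS = [
    "priority", "match_score", "institution_name", "wikidata_label", "city",
    "institution_type", "isil_code", "ghcid", "wikidata_qid", "wikidata_url",
    "validation_status", "validation_notes", "institution_id",
]
ALLOWED_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def check_review_csv(handle) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in fieldnames]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for row_no, row in enumerate(reader, start=1):
        status = (row["validation_status"] or "").strip()
        # Empty means the row has not been reviewed yet, which is allowed.
        if status and status not in ALLOWED_STATUSES:
            problems.append(f"data row {row_no}: unknown validation_status {status!r}")
    return problems


# Usage (path taken from this document):
# with open("data/review/denmark_wikidata_fuzzy_matches.csv", newline="", encoding="utf-8") as f:
#     print(check_review_csv(f))
```

Checking names rather than positions keeps the sketch tolerant of reordered columns, which spreadsheets sometimes introduce.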
### 2. Validation Checklist
**File**: `docs/WIKIDATA_VALIDATION_CHECKLIST.md`
**Purpose**: Step-by-step guide for manual reviewers
**Contents**:
- Validation workflow (4 steps per row)
- 5 common validation scenarios
- Quality assurance checklist
- Research sources (Danish registries)
- Batch validation tips
- Example validation session
### 3. Processing Scripts
#### Generate Report Script
**File**: `scripts/generate_wikidata_review_report.py`
**Purpose**: Extract fuzzy matches from enriched dataset
**Status**: ✅ Already executed
**Output**: CSV report
#### Apply Validation Script
**File**: `scripts/apply_wikidata_validation.py`
**Purpose**: Update dataset based on manual review
**Status**: ⏳ Ready to run after manual review
**Input**: CSV with filled `validation_status` column
**Output**: `denmark_complete_validated.json`

---
## Quick Start Guide
### For Reviewers (Immediate Action)
1. **Open CSV in spreadsheet software**:
```bash
# Option A: default spreadsheet application, e.g. Excel (`open` is the macOS command)
open data/review/denmark_wikidata_fuzzy_matches.csv
# Option B: Google Sheets
# Upload data/review/denmark_wikidata_fuzzy_matches.csv
# Option C: LibreOffice Calc
libreoffice --calc data/review/denmark_wikidata_fuzzy_matches.csv
```
2. **Sort by the `priority` column, ascending**, so that Priority 1 (most uncertain) comes first
3. **For each row**:
- Compare `institution_name` vs `wikidata_label`
- Click `wikidata_url` to verify match
- Check `city`, `institution_type`, `isil_code`
- Fill `validation_status`: CORRECT | INCORRECT | UNCERTAIN
- Add `validation_notes` (recommended)
4. **Save CSV** (preserve column structure)
5. **Run update script**:
```bash
python scripts/apply_wikidata_validation.py
```
### For Project Managers (Progress Tracking)
**Estimated Time**:

- Priority 1-2 (120 matches): ~4-6 hours (2-3 minutes per match)
- Priority 3-5 (65 matches): ~1-2 hours (1-2 minutes per match)
- **Total**: ~5-8 hours

**Milestones**:

- [ ] Priority 1 complete (58 matches)
- [ ] Priority 2 complete (62 matches)
- [ ] Priority 3 complete (44 matches)
- [ ] Priority 4-5 complete (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported with corrections
---
## Sample Review Records
### Example 1: Priority 1 - Likely Incorrect
```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek,Vejle,LIBRARY,DK-861510,INCORRECT,"Wikidata is main library, ours is campus branch"
```
**Analysis**:
- The names are similar, but ours has the ", Biblioteket" suffix
- Likely a branch matched to the main library
- Verify on Wikidata (check P361 "part of")
### Example 2: Priority 2 - Needs Research
```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
2,87.0,Fur Lokalhistoriske Arkiv,Randers Lokalhistoriske Arkiv,Skive,ARCHIVE,,INCORRECT,"City mismatch: Fur vs Randers, different local archives"
```
**Analysis**:
- City mismatch (Skive vs Randers)
- Both are local historical archives
- Likely wrong match due to similar names
- No ISIL to cross-check
### Example 3: Priority 3 - Likely Correct
```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library,Aalborg,LIBRARY,DK-820010,CORRECT,"ISIL match, Danish/English variant, same entity"
```
**Analysis**:
- ISIL code match (DK-820010) = high confidence
- Danish vs English name
- City match
- Type match
- → Almost certainly CORRECT
---
## Validation Expectations
### Predicted Outcomes (Based on Match Scores)
| Status | Expected % | Expected Count | Description |
|--------|------------|----------------|-------------|
| **CORRECT** | 85-90% | 157-167 | Keep Wikidata link, update provenance |
| **INCORRECT** | 5-10% | 9-19 | Remove Wikidata link, document reason |
| **UNCERTAIN** | 5% | 9 | Flag for expert review, keep tentatively |
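
The expected counts follow from applying each percentage to the 185 fuzzy matches and rounding to the nearest whole match. A short check of that arithmetic, assuming round-half-up (the helper name `expected_count` is illustrative):

```python
import math

FUZZY_MATCHES = 185  # total fuzzy matches in this report


def expected_count(percent: float, total: int = FUZZY_MATCHES) -> int:
    """Apply a percentage to the total and round half up."""
    return math.floor(total * percent / 100 + 0.5)


for status, low, high in [("CORRECT", 85, 90), ("INCORRECT", 5, 10), ("UNCERTAIN", 5, 5)]:
    print(status, expected_count(low), expected_count(high))
# CORRECT 157 167
# INCORRECT 9 19
# UNCERTAIN 9 9
```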
### Quality Thresholds
**Acceptable Quality**:

- ≥80% CORRECT
- ≤15% INCORRECT
- ≤10% UNCERTAIN

**High Quality**:

- ≥90% CORRECT
- ≤5% INCORRECT
- ≤5% UNCERTAIN

**Red Flags** (indicate algorithm issues):

- Below 70% CORRECT
- Above 20% INCORRECT
- Above 15% UNCERTAIN
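
The thresholds above can be turned into a quick classification of review results. This is a sketch under stated assumptions, not part of the project scripts: `quality_band` is a hypothetical name, any single red-flag condition is treated as sufficient, all three conditions of a band must hold for it to apply, and results that fall between the bands are labeled "borderline" (a label this document does not define).

```python
def quality_band(correct: int, incorrect: int, uncertain: int) -> str:
    """Classify review counts against the quality thresholds listed above."""
    total = correct + incorrect + uncertain
    if total == 0:
        raise ValueError("no reviewed matches to classify")
    pct_correct = 100 * correct / total
    pct_incorrect = 100 * incorrect / total
    pct_uncertain = 100 * uncertain / total

    # Assumption: one red-flag condition is enough to raise the flag.
    if pct_correct < 70 or pct_incorrect > 20 or pct_uncertain > 15:
        return "red flag"
    if pct_correct >= 90 and pct_incorrect <= 5 and pct_uncertain <= 5:
        return "high"
    if pct_correct >= 80 and pct_incorrect <= 15 and pct_uncertain <= 10:
        return "acceptable"
    return "borderline"  # hypothetical label for the gap between bands


print(quality_band(170, 8, 7))     # high
print(quality_band(150, 20, 15))   # acceptable
print(quality_band(120, 40, 25))   # red flag
```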
---
## Known Issues to Watch For
### Issue 1: Branch vs Main Library
**Pattern**: Institution name ends with ", Biblioteket" (the library)

**Example**:

- Ours: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"

**Likely Outcome**: INCORRECT (branch matched to main library)

**Fix**: Check Wikidata for "part of" (P361) relationship

---
### Issue 2: Gymnasium Libraries
**Pattern**: Institution name has the form "[School Name] Gymnasium, Biblioteket"

**Example**:

- Ours: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek"

**Likely Outcome**: INCORRECT (school library matched to public library)

**Fix**: Verify institution type on Wikidata (P31)

---
### Issue 3: Location Mismatch
**Pattern**: City name differs between dataset and Wikidata

**Example**:

- Ours: "Fur Lokalhistoriske Arkiv" (Skive)
- Wikidata: "Randers Lokalhistoriske Arkiv" (Randers)

**Likely Outcome**: INCORRECT (similar names, different cities)

**Fix**: Google "[institution name] Denmark" to confirm location

---
### Issue 4: Historical Name Changes
**Pattern**: Institution renamed, Wikidata has old or new name

**Example**:

- Ours: "Statsbiblioteket" (historical)
- Wikidata: "Royal Danish Library" (current, post-merger)

**Likely Outcome**: UNCERTAIN (need to check merger date)

**Fix**: Check Wikidata history, look for "replaced by" (P1366) or "end time" (P582)

---
### Issue 5: Multilingual Variants
**Pattern**: Danish name vs English Wikidata label

**Example**:

- Ours: "Rigsarkivet"
- Wikidata: "Danish National Archives"

**Likely Outcome**: CORRECT (same entity, different language)

**Fix**: Check Wikidata for Danish label/alias

---
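
Some of the name patterns described under Known Issues can be pre-flagged automatically so reviewers see them first. The following is a rough heuristic sketch only: `known_issue_flags` and its flag names are hypothetical, the suffix checks mirror Issues 1 and 2, and the "different leading word" rule is an assumed stand-in for Issue 3, since the review CSV as described here carries no Wikidata city to compare against. A flag is a hint for the reviewer, not a verdict.

```python
def known_issue_flags(institution_name: str, wikidata_label: str) -> list[str]:
    """Return hint flags for name patterns that often indicate a wrong match."""
    flags = []
    name = institution_name.strip()
    label = wikidata_label.strip()

    if name.endswith("Gymnasium, Biblioteket"):
        flags.append("gymnasium_library")   # Issue 2: school library
    elif name.endswith(", Biblioteket"):
        flags.append("branch_library")      # Issue 1: branch vs main library

    # Assumed proxy for Issue 3: same trailing words, different first word,
    # e.g. "Fur Lokalhistoriske Arkiv" vs "Randers Lokalhistoriske Arkiv".
    ours, theirs = name.split(), label.split()
    same_tail = [w.casefold() for w in ours[1:]] == [w.casefold() for w in theirs[1:]]
    if len(ours) == len(theirs) >= 2 and same_tail and ours[0].casefold() != theirs[0].casefold():
        flags.append("different_leading_word")

    return flags


print(known_issue_flags("Campus Vejle, Biblioteket", "Vejle Bibliotek"))
# ['branch_library']
print(known_issue_flags("Fur Lokalhistoriske Arkiv", "Randers Lokalhistoriske Arkiv"))
# ['different_leading_word']
```

Multilingual pairs such as "Rigsarkivet" and "Danish National Archives" produce no flag, which matches the expectation that they are usually correct.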
## Danish Language Resources
### Useful Terms
| Danish | English | Context |
|--------|---------|---------|
| Bibliotek | Library | General library |
| Bibliotekerne | The libraries | Library system |
| Hovedbiblioteket | Main library | Central/flagship branch |
| Kombi-bibliotek | Combined library | Library + community center |
| Lokalhistoriske Arkiv | Local history archive | Municipal archive |
| Rigsarkivet | National Archives | Denmark's national archive |
| Statsbiblioteket | State Library | Historical name (merged) |
| Universitetsbibliotek | University library | Academic library |
| Centralbibliotek | Central library | Main branch |
| Filial | Branch | Library branch |
| Gymnasium | High school | Upper secondary school |
### Danish ISIL Prefixes
- **DK-8xxxxx**: Libraries (6-digit codes)
- **DK-01x**: National institutions (Rigsarkivet = DK-011)
- **DK-xxx**: Archives and special collections
---
## Research Resources
### Primary Validation Sources
1. **Danish ISIL Registry**: https://isil.dk
- Authoritative source for ISIL codes
- Search by institution name or code
- Official library/archive registry
2. **Wikidata Query Service**: https://query.wikidata.org
- SPARQL endpoint for bulk queries
- Check P791 (ISIL), P17 (country), P31 (type)
3. **Danish Library Portal**: https://bibliotek.dk
- Public library directory
- Search by city or name
4. **Danish National Archives**: https://www.sa.dk
- Archive directory
- Member institution list
5. **Danish Agency for Culture and Palaces**: https://slks.dk
- Government heritage institutions
- Official museum/gallery registers
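
For the Wikidata Query Service, an ISIL lookup can be written as a short SPARQL query using the properties listed above. The sketch below only builds the query text; `isil_lookup_query` is a hypothetical helper, `wd:Q35` is assumed to be the Wikidata item for Denmark (worth confirming on Wikidata before use), and the example code DK-820010 comes from the sample records in this document.

```python
import re


def isil_lookup_query(isil_code: str) -> str:
    """Build a SPARQL query that finds Danish items carrying the given ISIL (P791)."""
    # ISIL codes consist of letters, digits, and the marks "-", ":" and "/".
    if not re.fullmatch(r"[A-Za-z0-9:/-]+", isil_code):
        raise ValueError(f"unexpected characters in ISIL code: {isil_code!r}")
    return (
        "SELECT ?item ?itemLabel ?type ?typeLabel WHERE {\n"
        f'  ?item wdt:P791 "{isil_code}" ;\n'   # P791 = ISIL
        "        wdt:P17 wd:Q35 .\n"            # P17 = country; Q35 assumed to be Denmark
        "  OPTIONAL { ?item wdt:P31 ?type . }\n"  # P31 = instance of
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en". }\n'
        "}"
    )


# Paste the printed query into https://query.wikidata.org to run it.
print(isil_lookup_query("DK-820010"))
```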
---
## Post-Validation Workflow
### After Manual Review Completed
```bash
# Step 1: Apply validation results
python scripts/apply_wikidata_validation.py
# Expected output:
# - data/instances/denmark_complete_validated.json
# - Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN
# Step 2: Re-export RDF with corrections
python scripts/export_denmark_rdf.py \
--input data/instances/denmark_complete_validated.json \
--output data/rdf/denmark_validated
# Expected output:
# - denmark_validated.ttl
# - denmark_validated.rdf
# - denmark_validated.jsonld
# - denmark_validated.nt
# Step 3: Update documentation
# - PROGRESS.md: Add validation statistics
# - SESSION_SUMMARY: Document findings
# - data/rdf/README.md: Note validated version
# Step 4: Commit changes
git add data/review/denmark_wikidata_fuzzy_matches.csv
git add data/instances/denmark_complete_validated.json
git add data/rdf/denmark_validated.*
git commit -m "feat: Apply manual Wikidata validation to Danish dataset (185 fuzzy matches reviewed)"
```
---
## Validation Metrics Tracking
### Template for Progress Updates
```markdown
## Wikidata Validation Progress
**Date**: YYYY-MM-DD
**Reviewer**: [Name]
### Review Status
- [x] Report generated (185 matches)
- [ ] Priority 1 reviewed (58 matches)
- [ ] Priority 2 reviewed (62 matches)
- [ ] Priority 3 reviewed (44 matches)
- [ ] Priority 4-5 reviewed (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported
### Preliminary Results (after X matches reviewed)
| Status | Count | % |
|--------|-------|---|
| CORRECT | X | X% |
| INCORRECT | X | X% |
| UNCERTAIN | X | X% |
| Not Reviewed | X | X% |
### Common Issues Found
1. [Issue description]
2. [Issue description]
### Time Spent
- Priority 1: X hours
- Priority 2: X hours
- Total: X hours
### Next Steps
- [ ] [Action item]
- [ ] [Action item]
```
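
The "Preliminary Results" table in the template can be filled in from the `validation_status` column rather than counted by hand. A minimal sketch, assuming the status values are read into a list first; `progress_table` is a hypothetical helper, and any empty or unrecognized value is counted as "Not Reviewed".

```python
from collections import Counter

REVIEWED = ("CORRECT", "INCORRECT", "UNCERTAIN")


def progress_table(statuses) -> str:
    """Render the results table from an iterable of validation_status values."""
    counts = Counter()
    for raw in statuses:
        value = (raw or "").strip().upper()
        # Assumption: anything that is not a known status counts as not reviewed.
        counts[value if value in REVIEWED else "Not Reviewed"] += 1
    total = sum(counts.values())
    lines = ["| Status | Count | % |", "|--------|-------|---|"]
    for label in (*REVIEWED, "Not Reviewed"):
        pct = 100 * counts[label] / total if total else 0.0
        lines.append(f"| {label} | {counts[label]} | {pct:.1f}% |")
    return "\n".join(lines)


print(progress_table(["CORRECT", "CORRECT", "INCORRECT", ""]))
# | Status | Count | % |
# |--------|-------|---|
# | CORRECT | 2 | 50.0% |
# | INCORRECT | 1 | 25.0% |
# | UNCERTAIN | 0 | 0.0% |
# | Not Reviewed | 1 | 25.0% |
```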
---
## FAQ
### Q: Can I skip Priority 4-5 matches?
**A**: Yes, if time-constrained. Priority 1-2 (64.9% of fuzzy matches) capture most uncertainty. Priority 4-5 have 93-99% confidence and are likely correct.
### Q: What if I can't determine CORRECT vs INCORRECT?
**A**: Mark as UNCERTAIN and add detailed notes. Flag for expert review (Danish language expertise or local knowledge).
### Q: How do I handle merged institutions?
**A**: Check Wikidata for the "replaced by" (P1366) property. If our record describes the post-merger institution and the Wikidata item is the pre-merger entity, mark it INCORRECT. Document the merger date in the notes.
### Q: Should I edit Wikidata during review?
**A**: Optional but helpful. If you find missing Danish labels or incorrect data on Wikidata, you can edit (requires Wikidata account). Document edits in `validation_notes`.
### Q: What if ISIL codes don't match?
**A**: An ISIL mismatch is almost always INCORRECT, because the ISIL is the authoritative identifier. Exception: Wikidata may hold an outdated ISIL after a code reassignment.
### Q: How do I validate branch libraries?
**A**: Check Wikidata for the "part of" (P361) property. If the Wikidata entity is the parent library system, the match may still be CORRECT (an acceptable abstraction level). If one branch is matched to a different branch, mark it INCORRECT.

---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-19 | Initial report generation (185 fuzzy matches) |
---
## Contact
**Questions?** Open an issue on GitHub or contact project maintainer.

**Found a bug in scripts?** Report at: [GitHub Issues]

**Need Danish language help?** [Contact Danish institutional partners]

---
**Status**: 🟡 Awaiting Manual Review
**Next Milestone**: Priority 1-2 review completion (120 matches)
**Estimated Completion**: [Add date after work begins]