# Wikidata Fuzzy Match Review - Danish Dataset

## Executive Summary

**Dataset**: Danish GLAM institutions (2,348 total)
**Wikidata Coverage**: 769 institutions (32.8%)
**Fuzzy Matches Requiring Review**: 185 institutions (24.1% of linked)
**Match Score Range**: 85-99% confidence

**Review Status**: 🟡 **PENDING MANUAL REVIEW**

---

## Review Scope

### Match Distribution by Priority

| Priority | Score Range | Count | % of Fuzzy | Description |
|----------|-------------|-------|------------|-------------|
| **1** | 85-87% | 58 | 31.4% | Very uncertain - **REVIEW FIRST** |
| **2** | 87-90% | 62 | 33.5% | Uncertain - needs verification |
| **3** | 90-93% | 44 | 23.8% | Moderate confidence |
| **4** | 93-96% | 14 | 7.6% | Fairly confident |
| **5** | 96-99% | 7 | 3.8% | Mostly confident |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9% of fuzzy matches)
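
The score ranges in the table share their boundary values (87, 90, 93, 96), so a score that sits exactly on a boundary needs a tie-break rule. The sketch below shows one possible rule, assuming each band includes its lower bound and excludes its upper bound. That assumption is consistent with the sample records in this report (a score of 87.0 is listed as Priority 2), but it is not confirmed by the report script itself. The names `PRIORITY_BANDS` and `priority_for_score` are hypothetical.

```python
# Assumption: each band includes its lower bound and excludes its upper bound;
# the top band is taken to run up to (but not including) 100, because the CSV
# column reference lists scores as high as 99.x.
PRIORITY_BANDS = [
    (1, 85.0, 87.0),   # very uncertain, review first
    (2, 87.0, 90.0),   # uncertain
    (3, 90.0, 93.0),   # moderate confidence
    (4, 93.0, 96.0),   # fairly confident
    (5, 96.0, 100.0),  # mostly confident
]


def priority_for_score(score: float) -> int:
    """Return the review priority (1-5) for a fuzzy match score in percent."""
    for priority, low, high in PRIORITY_BANDS:
        if low <= score < high:
            return priority
    raise ValueError(f"score {score} is outside the fuzzy-match range 85-100")


print(priority_for_score(85.0))  # 1
print(priority_for_score(87.0))  # 2
print(priority_for_score(92.5))  # 3
```

If the real script uses inclusive upper bounds instead, only the comparison inside the loop would need to change.
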

### Institution Type Breakdown

| Type | Count | % of Fuzzy |
|------|-------|------------|
| **LIBRARY** | 152 | 82.2% |
| **ARCHIVE** | 33 | 17.8% |

**Observation**: Libraries dominate the fuzzy matches, likely because of branch naming variations.

---

## Generated Files

### 1. Review Report (CSV)

**File**: `data/review/denmark_wikidata_fuzzy_matches.csv`
**Rows**: 185 data rows (plus 1 header row)
**Columns**: 13

**Column Reference**:

| Column | Description | Action |
|--------|-------------|--------|
| `priority` | 1-5 (1=most uncertain) | Sort by this to prioritize |
| `match_score` | 85.0-99.x% | Fuzzy match confidence |
| `institution_name` | Our dataset name | Compare with wikidata_label |
| `wikidata_label` | Wikidata entity label | Compare with institution_name |
| `city` | Institution location | Cross-check with Wikidata |
| `institution_type` | LIBRARY or ARCHIVE | Verify on Wikidata (P31) |
| `isil_code` | ISIL identifier (if any) | Strong validation signal |
| `ghcid` | Our persistent ID | Reference only |
| `wikidata_qid` | Q-number (e.g. Q12345) | Link target |
| `wikidata_url` | Direct Wikidata link | **CLICK TO VERIFY** |
| `validation_status` | **FILL IN**: CORRECT \| INCORRECT \| UNCERTAIN | Your decision |
| `validation_notes` | **FILL IN**: Explanation | Document reasoning |
| `institution_id` | W3ID URI | For script processing |
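
For reviewers who prefer a script to a spreadsheet, the column layout above can be read with Python's standard `csv` module. This is a minimal sketch rather than part of the project's tooling: the helper name `load_review_rows` is hypothetical, and the inline sample uses only four of the 13 columns so that it runs without the real file.

```python
import csv
import io


def load_review_rows(csv_file):
    """Read the review CSV and return rows ordered most-uncertain first."""
    rows = list(csv.DictReader(csv_file))
    # Sort by priority, then by match score, so the weakest matches come first.
    rows.sort(key=lambda row: (int(row["priority"]), float(row["match_score"])))
    return rows


# Two rows borrowed from the sample records in this report, with only a few
# of the 13 columns, so the sketch runs without the real file.
sample = io.StringIO(
    "priority,match_score,institution_name,wikidata_label\n"
    "3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library\n"
    '1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek\n'
)
rows = load_review_rows(sample)
for row in rows:
    print(row["priority"], row["match_score"], row["institution_name"])
```

Against the real report, the same function could be called with `open("data/review/denmark_wikidata_fuzzy_matches.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption about how the file was written.
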

### 2. Validation Checklist

**File**: `docs/WIKIDATA_VALIDATION_CHECKLIST.md`
**Purpose**: Step-by-step guide for manual reviewers
**Contents**:
- Validation workflow (4 steps per row)
- 5 common validation scenarios
- Quality assurance checklist
- Research sources (Danish registries)
- Batch validation tips
- Example validation session

### 3. Processing Scripts

#### Generate Report Script
**File**: `scripts/generate_wikidata_review_report.py`
**Purpose**: Extract fuzzy matches from enriched dataset
**Status**: ✅ Already executed
**Output**: CSV report

#### Apply Validation Script
**File**: `scripts/apply_wikidata_validation.py`
**Purpose**: Update dataset based on manual review
**Status**: ⏳ Ready to run after manual review
**Input**: CSV with filled `validation_status` column
**Output**: `data/instances/denmark_complete_validated.json`

---

## Quick Start Guide

### For Reviewers (Immediate Action)

1. **Open CSV in spreadsheet software**:
   ```bash
   # Option A: Excel (on macOS, `open` launches the default app for .csv files)
   open data/review/denmark_wikidata_fuzzy_matches.csv

   # Option B: Google Sheets
   # Upload data/review/denmark_wikidata_fuzzy_matches.csv

   # Option C: LibreOffice Calc
   libreoffice --calc data/review/denmark_wikidata_fuzzy_matches.csv
   ```

2. **Sort by `priority` in ascending order** so that Priority 1 (most uncertain) comes first

3. **For each row**:
   - Compare `institution_name` vs `wikidata_label`
   - Click `wikidata_url` to verify match
   - Check `city`, `institution_type`, `isil_code`
   - Fill `validation_status`: CORRECT | INCORRECT | UNCERTAIN
   - Add `validation_notes` (recommended)

4. **Save CSV** (preserve column structure)

5. **Run update script**:
   ```bash
   python scripts/apply_wikidata_validation.py
   ```
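
Before running the update script in step 5, it may help to confirm that every row carries one of the three allowed `validation_status` values. The sketch below is one way to do that check. It assumes the apply script expects exactly `CORRECT`, `INCORRECT`, or `UNCERTAIN` in upper case, which this report does not confirm, and the helper name `find_status_problems` is hypothetical.

```python
import csv
import io

# Assumption: only these exact, upper-case values are accepted downstream.
ALLOWED_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def find_status_problems(csv_file):
    """Return (file_row_number, value) pairs for blank or unrecognised statuses."""
    problems = []
    # start=2 because row 1 of the file is the header.
    for row_number, row in enumerate(csv.DictReader(csv_file), start=2):
        status = (row.get("validation_status") or "").strip()
        if status not in ALLOWED_STATUSES:
            problems.append((row_number, status))
    return problems


# Made-up rows for illustration: one valid, one blank, one mistyped.
sample = io.StringIO(
    "institution_name,validation_status\n"
    "Aalborg Universitetsbibliotek,CORRECT\n"
    "Fur Lokalhistoriske Arkiv,\n"
    "Fredericia Gymnasium,maybe\n"
)
problems = find_status_problems(sample)
print(problems)  # [(3, ''), (4, 'maybe')]
```

An empty result means every row has been reviewed; blank values identify the rows that still need a decision.
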

### For Project Managers (Progress Tracking)

**Estimated Time**:
- Priority 1-2 (120 matches): ~4-6 hours (2-3 minutes per match)
- Priority 3-5 (65 matches): ~1-2 hours (1-2 minutes per match)
- **Total**: ~5-8 hours

**Milestones**:
- [ ] Priority 1 complete (58 matches)
- [ ] Priority 2 complete (62 matches)
- [ ] Priority 3 complete (44 matches)
- [ ] Priority 4-5 complete (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported with corrections

---

## Sample Review Records

### Example 1: Priority 1 - Likely Incorrect

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek,Vejle,LIBRARY,DK-861510,INCORRECT,"Wikidata is main library, ours is campus branch"
```

**Analysis**:
- Names are similar, but ours carries the ", Biblioteket" suffix
- Likely a branch matched to the main library
- Needs verification on Wikidata (check P361 "part of")

### Example 2: Priority 2 - Needs Research

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
2,87.0,Fur Lokalhistoriske Arkiv,Randers Lokalhistoriske Arkiv,Skive,ARCHIVE,,INCORRECT,"City mismatch: Fur vs Randers, different local archives"
```

**Analysis**:
- City mismatch (Skive vs Randers)
- Both are local historical archives
- Likely a wrong match caused by the similar names
- No ISIL to cross-check

### Example 3: Priority 3 - Likely Correct

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library,Aalborg,LIBRARY,DK-820010,CORRECT,"ISIL match, Danish/English variant, same entity"
```

**Analysis**:
- ISIL code match (DK-820010) = high confidence
- Danish vs English name
- City match
- Type match
- → Almost certainly CORRECT

---

## Validation Expectations

### Predicted Outcomes (Based on Match Scores)

| Status | Expected % | Expected Count | Description |
|--------|------------|----------------|-------------|
| **CORRECT** | 85-90% | 157-167 | Keep Wikidata link, update provenance |
| **INCORRECT** | 5-10% | 9-19 | Remove Wikidata link, document reason |
| **UNCERTAIN** | 5% | 9 | Flag for expert review, keep tentatively |

### Quality Thresholds

**Acceptable Quality**:
- ≥80% CORRECT
- ≤15% INCORRECT
- ≤10% UNCERTAIN

**High Quality**:
- ≥90% CORRECT
- ≤5% INCORRECT
- ≤5% UNCERTAIN

**Red Flags** (indicate algorithm issues):
- <70% CORRECT
- >20% INCORRECT
- >15% UNCERTAIN
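
The thresholds above can be turned into a small check that is run once the review counts are known. The sketch below is only one reading of them: it treats any single red-flag condition as enough to raise a red flag, and it labels results that meet neither the acceptable thresholds nor a red-flag condition (for example 75% CORRECT) as "BELOW ACCEPTABLE", a band the thresholds leave undefined. The function name `assess_quality` and the returned labels are hypothetical.

```python
def assess_quality(correct: int, incorrect: int, uncertain: int) -> str:
    """Classify review results against the quality thresholds in this report."""
    total = correct + incorrect + uncertain
    if total == 0:
        raise ValueError("no reviewed matches to assess")
    pct_correct = 100 * correct / total
    pct_incorrect = 100 * incorrect / total
    pct_uncertain = 100 * uncertain / total

    # Assumption: any one red-flag condition is sufficient.
    if pct_correct < 70 or pct_incorrect > 20 or pct_uncertain > 15:
        return "RED FLAG"
    if pct_correct >= 90 and pct_incorrect <= 5 and pct_uncertain <= 5:
        return "HIGH"
    if pct_correct >= 80 and pct_incorrect <= 15 and pct_uncertain <= 10:
        return "ACCEPTABLE"
    # Assumption: the report does not name this in-between band.
    return "BELOW ACCEPTABLE"


# Illustrative counts (not real review results), each summing to 185.
print(assess_quality(170, 8, 7))     # HIGH
print(assess_quality(160, 15, 10))   # ACCEPTABLE
print(assess_quality(120, 50, 15))   # RED FLAG
```
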

---

## Known Issues to Watch For

### Issue 1: Branch vs Main Library

**Pattern**: Institution name ends with ", Biblioteket" (the library)

**Example**:
- Ours: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"

**Likely Outcome**: INCORRECT (branch matched to main library)

**Fix**: Check Wikidata for "part of" (P361) relationship

---

### Issue 2: Gymnasium Libraries

**Pattern**: Institution name follows the form "[School Name] Gymnasium, Biblioteket"

**Example**:
- Ours: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek"

**Likely Outcome**: INCORRECT (school library matched to public library)

**Fix**: Verify institution type on Wikidata (P31)

---

### Issue 3: Location Mismatch

**Pattern**: City name differs between dataset and Wikidata

**Example**:
- Ours: "Fur Lokalhistoriske Arkiv" (Skive)
- Wikidata: "Randers Lokalhistoriske Arkiv" (Randers)

**Likely Outcome**: INCORRECT (similar names, different cities)

**Fix**: Google "[institution name] Denmark" to confirm location

---

### Issue 4: Historical Name Changes

**Pattern**: Institution was renamed or merged; Wikidata may carry either the old or the new name

**Example**:
- Ours: "Statsbiblioteket" (historical)
- Wikidata: "Royal Danish Library" (current, post-merger)

**Likely Outcome**: UNCERTAIN (need to check merger date)

**Fix**: Check Wikidata history, look for "replaced by" (P1366) or "end time" (P582)

---

### Issue 5: Multilingual Variants

**Pattern**: Danish name vs English Wikidata label

**Example**:
- Ours: "Rigsarkivet"
- Wikidata: "Danish National Archives"

**Likely Outcome**: CORRECT (same entity, different language)

**Fix**: Check Wikidata for Danish label/alias
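
Some of these patterns can be pre-flagged from the CSV columns alone, which may help a reviewer decide where to look first. The sketch below is a rough heuristic, not the matching algorithm used to produce the report: the function name `flag_known_issues` and the flag strings are hypothetical, and the city check only tests whether the city name appears in the Wikidata label, so it will raise false alarms for labels that never mention a place (for example "Danish National Archives"). The flags are hints for manual review, never verdicts.

```python
def flag_known_issues(name: str, wikidata_label: str, city: str) -> list[str]:
    """Return hint flags for Issues 1-3 based on simple string checks."""
    flags = []
    name_lower = name.lower()
    label_lower = wikidata_label.lower()

    # Issue 1: our name ends with ", Biblioteket" (library inside an institution).
    if name_lower.endswith(", biblioteket"):
        flags.append("branch-or-institution-library")

    # Issue 2: ours is a gymnasium library but the Wikidata label is not.
    if "gymnasium" in name_lower and "gymnasium" not in label_lower:
        flags.append("school-library")

    # Issue 3 (crude proxy): our city does not appear in the Wikidata label.
    if city.strip() and city.strip().lower() not in label_lower:
        flags.append("city-not-in-label")

    return flags


print(flag_known_issues("Campus Vejle, Biblioteket", "Vejle Bibliotek", "Vejle"))
print(flag_known_issues("Fredericia Gymnasium, Biblioteket", "Fredericia Bibliotek", "Fredericia"))
print(flag_known_issues("Fur Lokalhistoriske Arkiv", "Randers Lokalhistoriske Arkiv", "Skive"))
```
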

---

## Danish Language Resources

### Useful Terms

| Danish | English | Context |
|--------|---------|---------|
| Bibliotek | Library | General library |
| Bibliotekerne | The libraries | Library system |
| Hovedbiblioteket | Main library | Central/flagship branch |
| Kombi-bibliotek | Combined library | Library + community center |
| Lokalhistoriske Arkiv | Local history archive | Municipal archive |
| Rigsarkivet | National Archives | Denmark's national archive |
| Statsbiblioteket | State Library | Historical name (merged) |
| Universitetsbibliotek | University library | Academic library |
| Centralbibliotek | Central library | Main branch |
| Filial | Branch | Library branch |
| Gymnasium | High school | Upper secondary school |

### Danish ISIL Prefixes

- **DK-8xxxxx**: Libraries (6-digit codes)
- **DK-01x**: National institutions (Rigsarkivet = DK-011)
- **DK-xxx**: Archives and special collections

---

## Research Resources

### Primary Validation Sources

1. **Danish ISIL Registry**: https://isil.dk
   - Authoritative source for ISIL codes
   - Search by institution name or code
   - Official library/archive registry

2. **Wikidata Query Service**: https://query.wikidata.org
   - SPARQL endpoint for bulk queries
   - Check P791 (ISIL), P17 (country), P31 (type)

3. **Danish Library Portal**: https://bibliotek.dk
   - Public library directory
   - Search by city or name

4. **Danish National Archives**: https://www.sa.dk
   - Archive directory
   - Member institution list

5. **Danish Agency for Culture and Palaces**: https://slks.dk
   - Government heritage institutions
   - Official museum/gallery registers
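
For the Wikidata Query Service entry above, the sketch below builds a SPARQL query that looks up an entity by its ISIL code (P791) and also asks for its type (P31) and country (P17). It only constructs the query text and a request URL and does not contact the service. The helper names `build_isil_query` and `build_lookup_url` are hypothetical, and the `da,en` label fallback order is an assumption about what a reviewer would want to see. When actually fetching the URL, the service expects a descriptive User-Agent header.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_isil_query(isil_code: str) -> str:
    """Return a SPARQL query that finds Wikidata items with the given ISIL."""
    # Guard against characters that would break out of the string literal.
    if '"' in isil_code or "\\" in isil_code or "\n" in isil_code:
        raise ValueError("unexpected character in ISIL code")
    return (
        "SELECT ?item ?itemLabel ?typeLabel ?countryLabel WHERE {\n"
        f'  ?item wdt:P791 "{isil_code}" .\n'
        "  OPTIONAL { ?item wdt:P31 ?type . }\n"
        "  OPTIONAL { ?item wdt:P17 ?country . }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }\n'
        "}"
    )


def build_lookup_url(isil_code: str) -> str:
    """Return a GET URL for the query, asking for JSON results."""
    params = {"query": build_isil_query(isil_code), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


# DK-820010 is the ISIL from the Aalborg sample record in this report.
print(build_isil_query("DK-820010"))
print(build_lookup_url("DK-820010"))
```
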

---

## Post-Validation Workflow

### After Manual Review Completed

```bash
# Step 1: Apply validation results
python scripts/apply_wikidata_validation.py

# Expected output:
# - data/instances/denmark_complete_validated.json
# - Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN

# Step 2: Re-export RDF with corrections
python scripts/export_denmark_rdf.py \
  --input data/instances/denmark_complete_validated.json \
  --output data/rdf/denmark_validated

# Expected output:
# - denmark_validated.ttl
# - denmark_validated.rdf
# - denmark_validated.jsonld
# - denmark_validated.nt

# Step 3: Update documentation
# - PROGRESS.md: Add validation statistics
# - SESSION_SUMMARY: Document findings
# - data/rdf/README.md: Note validated version

# Step 4: Commit changes
git add data/review/denmark_wikidata_fuzzy_matches.csv
git add data/instances/denmark_complete_validated.json
git add data/rdf/denmark_validated.*
git commit -m "feat: Apply manual Wikidata validation to Danish dataset (185 fuzzy matches reviewed)"
```

---

## Validation Metrics Tracking

### Template for Progress Updates

```markdown
## Wikidata Validation Progress

**Date**: YYYY-MM-DD
**Reviewer**: [Name]

### Review Status

- [x] Report generated (185 matches)
- [ ] Priority 1 reviewed (58 matches)
- [ ] Priority 2 reviewed (62 matches)
- [ ] Priority 3 reviewed (44 matches)
- [ ] Priority 4-5 reviewed (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported

### Preliminary Results (after X matches reviewed)

| Status | Count | % |
|--------|-------|---|
| CORRECT | X | X% |
| INCORRECT | X | X% |
| UNCERTAIN | X | X% |
| Not Reviewed | X | X% |

### Common Issues Found

1. [Issue description]
2. [Issue description]

### Time Spent

- Priority 1: X hours
- Priority 2: X hours
- Total: X hours

### Next Steps

- [ ] [Action item]
- [ ] [Action item]
```
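
The "Preliminary Results" table in the template can be filled in from the `validation_status` column instead of by hand. The sketch below tallies a list of status values and prints the table rows. It assumes that any value other than the three allowed statuses, including a blank cell, counts as "Not Reviewed", and the function name `results_table` is hypothetical.

```python
from collections import Counter

STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")


def results_table(status_values: list[str]) -> str:
    """Render the Preliminary Results table for a list of status values."""
    # Assumption: blank or unrecognised values count as "Not Reviewed".
    counts = Counter(
        value.strip() if value.strip() in STATUSES else "Not Reviewed"
        for value in status_values
    )
    total = len(status_values)
    lines = ["| Status | Count | % |", "|--------|-------|---|"]
    for label in (*STATUSES, "Not Reviewed"):
        share = 100 * counts[label] / total if total else 0.0
        lines.append(f"| {label} | {counts[label]} | {share:.1f}% |")
    return "\n".join(lines)


# Illustrative values only; real input would be the CSV's validation_status column.
table = results_table(["CORRECT", "CORRECT", "INCORRECT", ""])
print(table)
```
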

---

## FAQ

### Q: Can I skip Priority 4-5 matches?

**A**: Yes, if time-constrained. Priority 1-2 (64.9% of fuzzy matches) capture most of the uncertainty. Priority 4-5 matches score 93-99% and are likely correct.

### Q: What if I can't determine CORRECT vs INCORRECT?

**A**: Mark as UNCERTAIN and add detailed notes. Flag for expert review (Danish language expertise or local knowledge).

### Q: How do I handle merged institutions?

**A**: Check Wikidata for the "replaced by" (P1366) property. If our record describes the post-merger institution and the Wikidata entity is the pre-merger one, mark it INCORRECT. Document the merger date in the notes.

### Q: Should I edit Wikidata during review?

**A**: Optional but helpful. If you find missing Danish labels or incorrect data on Wikidata, you can edit it (this requires a Wikidata account). Document edits in `validation_notes`.

### Q: What if ISIL codes don't match?

**A**: An ISIL mismatch is almost always INCORRECT, because ISIL is the authoritative identifier. Exception: Wikidata may hold an outdated ISIL after a code reassignment.

### Q: How do I validate branch libraries?

**A**: Check Wikidata for the "part of" (P361) property. If the Wikidata entity is the parent system, the match may still be CORRECT (acceptable abstraction level). If one branch has been matched to a different branch, mark it INCORRECT.

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-19 | Initial report generation (185 fuzzy matches) |

---

## Contact

**Questions?** Open an issue on GitHub or contact the project maintainer.

**Found a bug in scripts?** Report at: [GitHub Issues]

**Need Danish language help?** [Contact Danish institutional partners]

---

**Status**: 🟡 Awaiting Manual Review
**Next Milestone**: Priority 1-2 review completion (120 matches)
**Estimated Completion**: [Add date after work begins]