Wikidata Fuzzy Match Review - Danish Dataset
Executive Summary
Dataset: Danish GLAM institutions (2,348 total)
Wikidata Coverage: 769 institutions (32.8%)
Fuzzy Matches Requiring Review: 185 institutions (24.1% of linked)
Match Score Range: 85-99% confidence
Review Status: 🟡 PENDING MANUAL REVIEW
Review Scope
Match Distribution by Priority
| Priority | Score Range | Count | % of Fuzzy | Description |
|---|---|---|---|---|
| 1 | 85-87% | 58 | 31.4% | Very uncertain - REVIEW FIRST |
| 2 | 87-90% | 62 | 33.5% | Uncertain - needs verification |
| 3 | 90-93% | 44 | 23.8% | Moderate confidence |
| 4 | 93-96% | 14 | 7.6% | Fairly confident |
| 5 | 96-99% | 7 | 3.8% | Mostly confident |
Recommended Focus: Priority 1-2 (120 matches = 64.9% of fuzzy matches)
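The priority buckets can be reproduced from a match score with a small helper. This is a sketch, not the logic in scripts/generate_wikidata_review_report.py: the function name is hypothetical, and because the ranges in the table share their boundary values, it assumes each upper bound is exclusive (so 87.0 lands in Priority 2, as in the sample records). Treating scores of 100 as exact matches outside the fuzzy range is also an assumption.

```python
def priority_for_score(score: float) -> int:
    """Map a fuzzy match score (in percent) to a review priority from 1 to 5."""
    # Assumption: fuzzy matches cover 85 <= score < 100; anything else is rejected.
    if not 85.0 <= score < 100.0:
        raise ValueError(f"score {score} is outside the fuzzy-match range")
    # Assumption: upper bounds are exclusive, so a score of 87.0 is Priority 2.
    upper_bounds = [(87.0, 1), (90.0, 2), (93.0, 3), (96.0, 4)]
    for upper, priority in upper_bounds:
        if score < upper:
            return priority
    return 5


print(priority_for_score(85.0))  # 1
print(priority_for_score(92.5))  # 3
```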
Institution Type Breakdown
| Type | Count | % of Fuzzy |
|---|---|---|
| LIBRARY | 152 | 82.2% |
| ARCHIVE | 33 | 17.8% |
Observation: Libraries dominate fuzzy matches (likely due to branch naming variations)
Generated Files
1. Review Report (CSV)
File: data/review/denmark_wikidata_fuzzy_matches.csv
Rows: 185 data rows (plus 1 header row)
Columns: 13
Column Reference:
| Column | Description | Action |
|---|---|---|
| `priority` | 1-5 (1=most uncertain) | Sort by this to prioritize |
| `match_score` | 85.0-99.x% | Fuzzy match confidence |
| `institution_name` | Our dataset name | Compare with `wikidata_label` |
| `wikidata_label` | Wikidata entity label | Compare with `institution_name` |
| `city` | Institution location | Cross-check with Wikidata |
| `institution_type` | LIBRARY or ARCHIVE | Verify on Wikidata (P31) |
| `isil_code` | ISIL identifier (if any) | Strong validation signal |
| `ghcid` | Our persistent ID | Reference only |
| `wikidata_qid` | Q-number (e.g. Q12345) | Link target |
| `wikidata_url` | Direct Wikidata link | CLICK TO VERIFY |
| `validation_status` | FILL IN: CORRECT \| INCORRECT \| UNCERTAIN | Your decision |
| `validation_notes` | FILL IN: Explanation | Document reasoning |
| `institution_id` | W3ID URI | For script processing |
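Because the update script depends on the column structure, it can help to confirm that a saved CSV still has all 13 columns before handing it back. A minimal sketch using only the standard library; the helper name is hypothetical, and only the presence of each column is checked because the actual column order in the file is not confirmed here.

```python
import csv
import io

# Column names as listed in the column reference.
EXPECTED_COLUMNS = [
    "priority", "match_score", "institution_name", "wikidata_label", "city",
    "institution_type", "isil_code", "ghcid", "wikidata_qid", "wikidata_url",
    "validation_status", "validation_notes", "institution_id",
]


def missing_columns(csv_file) -> list[str]:
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    present = set(reader.fieldnames or [])
    return [name for name in EXPECTED_COLUMNS if name not in present]


# In-memory sample; for a real file pass open(path, newline="", encoding="utf-8").
sample = io.StringIO("priority,match_score,institution_name\n1,85.0,Example\n")
print(missing_columns(sample))
```

An empty list means every expected column is still present.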
2. Validation Checklist
File: docs/WIKIDATA_VALIDATION_CHECKLIST.md
Purpose: Step-by-step guide for manual reviewers
Contents:
- Validation workflow (4 steps per row)
- 5 common validation scenarios
- Quality assurance checklist
- Research sources (Danish registries)
- Batch validation tips
- Example validation session
3. Processing Scripts
Generate Report Script
File: scripts/generate_wikidata_review_report.py
Purpose: Extract fuzzy matches from enriched dataset
Status: ✅ Already executed
Output: CSV report
Apply Validation Script
File: scripts/apply_wikidata_validation.py
Purpose: Update dataset based on manual review
Status: ⏳ Ready to run after manual review
Input: CSV with filled validation_status column
Output: denmark_complete_validated.json
Quick Start Guide
For Reviewers (Immediate Action)
1. Open the CSV in spreadsheet software:

       # Option A: Excel
       open data/review/denmark_wikidata_fuzzy_matches.csv
       # Option B: Google Sheets
       # Upload data/review/denmark_wikidata_fuzzy_matches.csv
       # Option C: LibreOffice Calc
       libreoffice --calc data/review/denmark_wikidata_fuzzy_matches.csv

2. Sort by `priority` and start with Priority 1 (most uncertain).
3. For each row:
   - Compare `institution_name` vs `wikidata_label`
   - Click `wikidata_url` to verify the match
   - Check `city`, `institution_type`, `isil_code`
   - Fill `validation_status`: CORRECT | INCORRECT | UNCERTAIN
   - Add `validation_notes` (recommended)
4. Save the CSV (preserve the column structure).
5. Run the update script:

       python scripts/apply_wikidata_validation.py
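Before running the update script, a quick pre-flight check can list rows whose `validation_status` is blank or misspelled. This is a sketch with a hypothetical helper name; it compares values strictly (after trimming whitespace) because it is not documented whether apply_wikidata_validation.py tolerates other casing.

```python
import csv
import io

ALLOWED_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def find_invalid_rows(csv_file) -> list[tuple[int, str]]:
    """Return (spreadsheet_row, value) for rows with a blank or unrecognized status."""
    problems = []
    # start=2 because spreadsheet row 1 is the header.
    for sheet_row, row in enumerate(csv.DictReader(csv_file), start=2):
        status = (row.get("validation_status") or "").strip()
        if status not in ALLOWED_STATUSES:
            problems.append((sheet_row, status))
    return problems


# In-memory sample; for a real file pass open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "validation_status,validation_notes\n"
    "CORRECT,ok\n"
    ",\n"
    "maybe,x\n"
)
print(find_invalid_rows(sample))  # [(3, ''), (4, 'maybe')]
```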
For Project Managers (Progress Tracking)
Estimated Time:
- Priority 1-2 (120 matches): ~4-6 hours (2-3 minutes per match)
- Priority 3-5 (65 matches): ~1-2 hours (1-2 minutes per match)
- Total: ~5-8 hours
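The estimate follows directly from the match counts and the per-match minutes. The arithmetic, as a short sketch (the function name is illustrative):

```python
def review_hours(matches: int, min_minutes: float, max_minutes: float) -> tuple[float, float]:
    """Return the (low, high) review time in hours for a batch of matches."""
    return matches * min_minutes / 60, matches * max_minutes / 60


priority_1_2 = review_hours(120, 2, 3)
priority_3_5 = review_hours(65, 1, 2)
total = (priority_1_2[0] + priority_3_5[0], priority_1_2[1] + priority_3_5[1])

print(priority_1_2)                              # (4.0, 6.0)
print(tuple(round(h, 1) for h in priority_3_5))  # (1.1, 2.2)
print(tuple(round(h, 1) for h in total))         # (5.1, 8.2)
```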
Milestones:
- Priority 1 complete (58 matches)
- Priority 2 complete (62 matches)
- Priority 3 complete (44 matches)
- Priority 4-5 complete (21 matches)
- Validation applied to dataset
- RDF re-exported with corrections
Sample Review Records
Example 1: Priority 1 - Likely Incorrect
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek,Vejle,LIBRARY,DK-861510,INCORRECT,"Wikidata is main library, ours is campus branch"
Analysis:
- Name similar but ours has ", Biblioteket" suffix
- Likely branch vs main library mismatch
- Needs verification on Wikidata (check P361 "part of")
Example 2: Priority 2 - Needs Research
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
2,87.0,Fur Lokalhistoriske Arkiv,Randers Lokalhistoriske Arkiv,Skive,ARCHIVE,,INCORRECT,"City mismatch: Fur vs Randers, different local archives"
Analysis:
- City mismatch (Skive vs Randers)
- Both are local historical archives
- Likely wrong match due to similar names
- No ISIL to cross-check
Example 3: Priority 3 - Likely Correct
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library,Aalborg,LIBRARY,DK-820010,CORRECT,"ISIL match, Danish/English variant, same entity"
Analysis:
- ISIL code match (DK-820010) = high confidence
- Danish vs English name
- City match
- Type match
- → Almost certainly CORRECT
Validation Expectations
Predicted Outcomes (Based on Match Scores)
| Status | Expected % | Expected Count | Description |
|---|---|---|---|
| CORRECT | 85-90% | 157-167 | Keep Wikidata link, update provenance |
| INCORRECT | 5-10% | 9-19 | Remove Wikidata link, document reason |
| UNCERTAIN | 5% | 9 | Flag for expert review, keep tentatively |
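The three outcomes translate into three different updates to a record. The sketch below illustrates that mapping only; it is not the code in scripts/apply_wikidata_validation.py, and the record fields (`wikidata_qid`, `wikidata_review`, `needs_expert_review`) are hypothetical names chosen for the example.

```python
from copy import deepcopy


def apply_decision(record: dict, status: str, notes: str = "") -> dict:
    """Return a copy of the record updated according to a reviewer decision."""
    updated = deepcopy(record)
    review = {"status": status, "notes": notes}  # hypothetical provenance entry
    if status == "CORRECT":
        pass  # keep the Wikidata link; only provenance changes
    elif status == "INCORRECT":
        updated.pop("wikidata_qid", None)  # remove the link, keep the reason in notes
    elif status == "UNCERTAIN":
        review["needs_expert_review"] = True  # keep the link tentatively, flag it
    else:
        raise ValueError(f"unknown validation status: {status!r}")
    updated["wikidata_review"] = review
    return updated


record = {"name": "Example Bibliotek", "wikidata_qid": "Q12345"}
print(apply_decision(record, "INCORRECT", "branch matched to main library"))
```

Returning a copy keeps the input untouched, which makes it easier to compare the dataset before and after validation.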
Quality Thresholds
Acceptable Quality:
- ≥80% CORRECT
- ≤15% INCORRECT
- ≤10% UNCERTAIN
High Quality:
- ≥90% CORRECT
- ≤5% INCORRECT
- ≤5% UNCERTAIN
Red Flags (indicate algorithm issues):
- <70% CORRECT
- >20% INCORRECT
- >15% UNCERTAIN
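The thresholds can be checked mechanically once review counts are available. A minimal sketch with a hypothetical function name; it reads the red-flag limits as more than 20% INCORRECT or more than 15% UNCERTAIN, and it labels results that miss "acceptable" without hitting a red flag as "below acceptable", a category the thresholds leave unnamed.

```python
def quality_level(correct: int, incorrect: int, uncertain: int) -> str:
    """Classify review results against the quality thresholds."""
    total = correct + incorrect + uncertain
    if total == 0:
        raise ValueError("no reviewed matches")
    pct_correct, pct_incorrect, pct_uncertain = (
        100 * n / total for n in (correct, incorrect, uncertain)
    )
    if pct_correct >= 90 and pct_incorrect <= 5 and pct_uncertain <= 5:
        return "high"
    if pct_correct >= 80 and pct_incorrect <= 15 and pct_uncertain <= 10:
        return "acceptable"
    if pct_correct < 70 or pct_incorrect > 20 or pct_uncertain > 15:
        return "red flag"
    return "below acceptable"  # assumption: gap between acceptable and red flag


print(quality_level(170, 8, 7))    # high
print(quality_level(120, 45, 20))  # red flag
```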
Known Issues to Watch For
Issue 1: Branch vs Main Library
Pattern: Institution name ends with ", Biblioteket" (the library)
Example:
- Ours: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"
Likely Outcome: INCORRECT (branch matched to main library)
Fix: Check Wikidata for "part of" (P361) relationship
Issue 2: Gymnasium Libraries
Pattern: Institution name follows the form "[School Name] Gymnasium, Biblioteket"
Example:
- Ours: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek"
Likely Outcome: INCORRECT (school library matched to public library)
Fix: Verify institution type on Wikidata (P31)
Issue 3: Location Mismatch
Pattern: City name differs between dataset and Wikidata
Example:
- Ours: "Fur Lokalhistoriske Arkiv" (Skive)
- Wikidata: "Randers Lokalhistoriske Arkiv" (Randers)
Likely Outcome: INCORRECT (similar names, different cities)
Fix: Google "[institution name] Denmark" to confirm location
Issue 4: Historical Name Changes
Pattern: Institution renamed, Wikidata has old or new name
Example:
- Ours: "Statsbiblioteket" (historical)
- Wikidata: "Royal Danish Library" (current, post-merger)
Likely Outcome: UNCERTAIN (need to check merger date)
Fix: Check Wikidata history, look for "replaced by" (P1366) or "end time" (P582)
Issue 5: Multilingual Variants
Pattern: Danish name vs English Wikidata label
Example:
- Ours: "Rigsarkivet"
- Wikidata: "Danish National Archives"
Likely Outcome: CORRECT (same entity, different language)
Fix: Check Wikidata for Danish label/alias
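The name-based patterns in Issues 1 and 2 are regular enough to pre-flag rows before manual review. The sketch below is a heuristic aid, not part of the existing scripts; the function name and flag labels are hypothetical, and a flag only means "look closer", not INCORRECT.

```python
import re


def name_flags(institution_name: str, wikidata_label: str) -> list[str]:
    """Return heuristic warning flags based on the two names in a review row."""
    flags = []
    name = institution_name.strip()
    # Issue 1: ", Biblioteket" suffix often marks a branch or embedded library.
    if name.endswith(", Biblioteket"):
        flags.append("branch_suffix")
    # Issue 2: a Gymnasium in our name but not in the Wikidata label.
    if re.search(r"\bGymnasium\b", name) and "gymnasium" not in wikidata_label.lower():
        flags.append("school_library")
    return flags


print(name_flags("Campus Vejle, Biblioteket", "Vejle Bibliotek"))
print(name_flags("Fredericia Gymnasium, Biblioteket", "Fredericia Bibliotek"))
```

Issues 3 to 5 are not covered because they need information from Wikidata (location, history, labels) that the two name columns alone do not carry.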
Danish Language Resources
Useful Terms
| Danish | English | Context |
|---|---|---|
| Bibliotek | Library | General library |
| Bibliotekerne | The libraries | Library system |
| Hovedbiblioteket | Main library | Central/flagship branch |
| Kombi-bibliotek | Combined library | Library + community center |
| Lokalhistoriske Arkiv | Local history archive | Municipal archive |
| Rigsarkivet | National Archives | Denmark's national archive |
| Statsbiblioteket | State Library | Historical name (merged) |
| Universitetsbibliotek | University library | Academic library |
| Centralbibliotek | Central library | Main branch |
| Filial | Branch | Library branch |
| Gymnasium | High school | Upper secondary school |
Danish ISIL Prefixes
- DK-8xxxxx: Libraries (6-digit codes)
- DK-01x: National institutions (Rigsarkivet = DK-011)
- DK-xxx: Archives and special collections
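A literal reading of these prefixes can be turned into a rough format check for the `isil_code` column. This is a sketch only: the function name is hypothetical, the patterns encode just the three shapes listed above, and real ISIL codes may follow shapes this list does not mention, so "unrecognized" is a prompt to check the registry rather than an error.

```python
import re


def isil_category(isil_code: str) -> str:
    """Guess the institution category from the shape of a Danish ISIL code."""
    code = isil_code.strip().upper()
    if re.fullmatch(r"DK-8\d{5}", code):
        return "library"
    if re.fullmatch(r"DK-01\d", code):
        return "national institution"
    if re.fullmatch(r"DK-\d{3}", code):
        return "archive or special collection"
    return "unrecognized"


print(isil_category("DK-820010"))  # library
print(isil_category("DK-011"))     # national institution
```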
Research Resources
Primary Validation Sources
1. Danish ISIL Registry: https://isil.dk
   - Authoritative source for ISIL codes
   - Search by institution name or code
   - Official library/archive registry
2. Wikidata Query Service: https://query.wikidata.org
   - SPARQL endpoint for bulk queries
   - Check P791 (ISIL), P17 (country), P31 (type)
3. Danish Library Portal: https://bibliotek.dk
   - Public library directory
   - Search by city or name
4. Danish National Archives: https://www.sa.dk
   - Archive directory
   - Member institution list
5. Danish Agency for Culture: https://slks.dk
   - Government heritage institutions
   - Official museum/gallery registers
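For the Wikidata Query Service, an ISIL lookup is usually the fastest cross-check. The sketch below only builds the SPARQL text and a request URL for the public endpoint; it does not send the request. The function names are hypothetical, and the query shape (P791 with optional P31 and P17) is one reasonable way to ask the question, not a query taken from the project scripts.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def isil_lookup_query(isil_code: str) -> str:
    """Build a SPARQL query for Wikidata items carrying the given ISIL (P791)."""
    # Escape backslashes and quotes so the code is a valid SPARQL string literal.
    escaped = isil_code.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel ?type ?country WHERE {\n"
        f'  ?item wdt:P791 "{escaped}" .\n'
        "  OPTIONAL { ?item wdt:P31 ?type . }\n"
        "  OPTIONAL { ?item wdt:P17 ?country . }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }\n'
        "}"
    )


def isil_lookup_url(isil_code: str) -> str:
    """Return a GET URL that asks the query service for JSON results."""
    params = {"query": isil_lookup_query(isil_code), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


print(isil_lookup_query("DK-820010"))
```

The query text can also be pasted directly into the query editor at https://query.wikidata.org for one-off checks.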
Post-Validation Workflow
After Manual Review Completed
# Step 1: Apply validation results
python scripts/apply_wikidata_validation.py
# Expected output:
# - data/instances/denmark_complete_validated.json
# - Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN
# Step 2: Re-export RDF with corrections
python scripts/export_denmark_rdf.py \
--input data/instances/denmark_complete_validated.json \
--output data/rdf/denmark_validated
# Expected output:
# - denmark_validated.ttl
# - denmark_validated.rdf
# - denmark_validated.jsonld
# - denmark_validated.nt
# Step 3: Update documentation
# - PROGRESS.md: Add validation statistics
# - SESSION_SUMMARY: Document findings
# - data/rdf/README.md: Note validated version
# Step 4: Commit changes
git add data/review/denmark_wikidata_fuzzy_matches.csv
git add data/instances/denmark_complete_validated.json
git add data/rdf/denmark_validated.*
git commit -m "feat: Apply manual Wikidata validation to Danish dataset (185 fuzzy matches reviewed)"
Validation Metrics Tracking
Template for Progress Updates
## Wikidata Validation Progress
**Date**: YYYY-MM-DD
**Reviewer**: [Name]
### Review Status
- [x] Report generated (185 matches)
- [ ] Priority 1 reviewed (58 matches)
- [ ] Priority 2 reviewed (62 matches)
- [ ] Priority 3 reviewed (44 matches)
- [ ] Priority 4-5 reviewed (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported
### Preliminary Results (after X matches reviewed)
| Status | Count | % |
|--------|-------|---|
| CORRECT | X | X% |
| INCORRECT | X | X% |
| UNCERTAIN | X | X% |
| Not Reviewed | X | X% |
### Common Issues Found
1. [Issue description]
2. [Issue description]
### Time Spent
- Priority 1: X hours
- Priority 2: X hours
- Total: X hours
### Next Steps
- [ ] [Action item]
- [ ] [Action item]
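The "Preliminary Results" table in the template can be filled from the `validation_status` column rather than counted by hand. A minimal sketch with a hypothetical function name; it treats any value other than the three allowed statuses (including blanks) as "Not Reviewed".

```python
from collections import Counter

STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")


def results_table(statuses: list[str]) -> str:
    """Render the Preliminary Results table from a list of validation_status values."""
    counts = Counter(
        s.strip() if s.strip() in STATUSES else "Not Reviewed" for s in statuses
    )
    total = len(statuses)
    lines = ["| Status | Count | % |", "|--------|-------|---|"]
    for label in (*STATUSES, "Not Reviewed"):
        n = counts[label]
        pct = 100 * n / total if total else 0.0
        lines.append(f"| {label} | {n} | {pct:.1f}% |")
    return "\n".join(lines)


print(results_table(["CORRECT", "CORRECT", "INCORRECT", ""]))
```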
FAQ
Q: Can I skip Priority 4-5 matches?
A: Yes, if time-constrained. Priority 1-2 (64.9% of fuzzy matches) capture most uncertainty. Priority 4-5 have 93-99% confidence and are likely correct.
Q: What if I can't determine CORRECT vs INCORRECT?
A: Mark as UNCERTAIN and add detailed notes. Flag for expert review (Danish language expertise or local knowledge).
Q: How do I handle merged institutions?
A: Check Wikidata for the "replaced by" (P1366) property. If our record describes the post-merger institution and the Wikidata item is the pre-merger entity, mark it INCORRECT. Document the merger date in the notes.
Q: Should I edit Wikidata during review?
A: Optional but helpful. If you find missing Danish labels or incorrect data on Wikidata, you can edit (requires Wikidata account). Document edits in validation_notes.
Q: What if ISIL codes don't match?
A: An ISIL mismatch almost always means INCORRECT, because ISIL is the authoritative identifier. Exception: Wikidata may hold an outdated ISIL after a code reassignment.
Q: How do I validate branch libraries?
A: Check Wikidata for the "part of" (P361) property. If the Wikidata entity is the parent library system, the match may still be CORRECT (acceptable abstraction level). If one branch is matched to a different branch, mark it INCORRECT.
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-11-19 | Initial report generation (185 fuzzy matches) |
Contact
Questions? Open an issue on GitHub or contact project maintainer.
Found a bug in scripts? Report at: [GitHub Issues]
Need Danish language help? [Contact Danish institutional partners]
Status: 🟡 Awaiting Manual Review
Next Milestone: Priority 1-2 review completion (120 matches)
Estimated Completion: [Add date after work begins]