# Wikidata Fuzzy Match Review - Danish Dataset

## Executive Summary

**Dataset**: Danish GLAM institutions (2,348 total)
**Wikidata Coverage**: 769 institutions (32.8%)
**Fuzzy Matches Requiring Review**: 185 institutions (24.1% of linked)
**Match Score Range**: 85-99% confidence
**Review Status**: 🟡 **PENDING MANUAL REVIEW**

---

## Review Scope

### Match Distribution by Priority

| Priority | Score Range | Count | % of Fuzzy | Description |
|----------|-------------|-------|------------|-------------|
| **1** | 85-87% | 58 | 31.4% | Very uncertain - **REVIEW FIRST** |
| **2** | 87-90% | 62 | 33.5% | Uncertain - needs verification |
| **3** | 90-93% | 44 | 23.8% | Moderate confidence |
| **4** | 93-96% | 14 | 7.6% | Fairly confident |
| **5** | 96-99% | 7 | 3.8% | Mostly confident |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9% of fuzzy matches)

### Institution Type Breakdown

| Type | Count | % of Fuzzy |
|------|-------|------------|
| **LIBRARY** | 152 | 82.2% |
| **ARCHIVE** | 33 | 17.8% |

**Observation**: Libraries dominate the fuzzy matches, likely due to branch naming variations.

---

## Generated Files

### 1. Review Report (CSV)

**File**: `data/review/denmark_wikidata_fuzzy_matches.csv`
**Rows**: 186 (1 header + 185 data rows)
**Columns**: 13

**Column Reference**:

| Column | Description | Action |
|--------|-------------|--------|
| `priority` | 1-5 (1 = most uncertain) | Sort by this to prioritize |
| `match_score` | 85.0-99.x% | Fuzzy match confidence |
| `institution_name` | Our dataset name | Compare with `wikidata_label` |
| `wikidata_label` | Wikidata entity label | Compare with `institution_name` |
| `city` | Institution location | Cross-check with Wikidata |
| `institution_type` | LIBRARY or ARCHIVE | Verify on Wikidata (P31) |
| `isil_code` | ISIL identifier (if any) | Strong validation signal |
| `ghcid` | Our persistent ID | Reference only |
| `wikidata_qid` | Q-number (e.g. Q12345) | Link target |
| `wikidata_url` | Direct Wikidata link | **CLICK TO VERIFY** |
| `validation_status` | **FILL IN**: CORRECT \| INCORRECT \| UNCERTAIN | Your decision |
| `validation_notes` | **FILL IN**: Explanation | Document reasoning |
| `institution_id` | W3ID URI | For script processing |

### 2. Validation Checklist

**File**: `docs/WIKIDATA_VALIDATION_CHECKLIST.md`
**Purpose**: Step-by-step guide for manual reviewers
**Contents**:

- Validation workflow (4 steps per row)
- 5 common validation scenarios
- Quality assurance checklist
- Research sources (Danish registries)
- Batch validation tips
- Example validation session

### 3. Processing Scripts

#### Generate Report Script

**File**: `scripts/generate_wikidata_review_report.py`
**Purpose**: Extract fuzzy matches from the enriched dataset
**Status**: ✅ Already executed
**Output**: CSV report

#### Apply Validation Script

**File**: `scripts/apply_wikidata_validation.py`
**Purpose**: Update the dataset based on manual review
**Status**: ⏳ Ready to run after manual review
**Input**: CSV with filled `validation_status` column
**Output**: `denmark_complete_validated.json`

---

## Quick Start Guide

### For Reviewers (Immediate Action)

1. **Open CSV in spreadsheet software**:

   ```bash
   # Option A: Excel
   open data/review/denmark_wikidata_fuzzy_matches.csv

   # Option B: Google Sheets
   # Upload data/review/denmark_wikidata_fuzzy_matches.csv

   # Option C: LibreOffice Calc
   libreoffice --calc data/review/denmark_wikidata_fuzzy_matches.csv
   ```

2. **Sort by Priority 1** (most uncertain)
3. **For each row**:
   - Compare `institution_name` vs `wikidata_label`
   - Click `wikidata_url` to verify the match
   - Check `city`, `institution_type`, `isil_code`
   - Fill `validation_status`: CORRECT | INCORRECT | UNCERTAIN
   - Add `validation_notes` (recommended)
4. **Save CSV** (preserve column structure)
5. **Run update script**:

   ```bash
   python scripts/apply_wikidata_validation.py
   ```

### For Project Managers (Progress Tracking)

**Estimated Time**:

- Priority 1-2 (120 matches): ~4-6 hours (2-3 minutes per match)
- Priority 3-5 (65 matches): ~1-2 hours (1-2 minutes per match)
- **Total**: ~5-8 hours

**Milestones**:

- [ ] Priority 1 complete (58 matches)
- [ ] Priority 2 complete (62 matches)
- [ ] Priority 3 complete (44 matches)
- [ ] Priority 4-5 complete (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported with corrections

---

## Sample Review Records

### Example 1: Priority 1 - Likely Incorrect

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek,Vejle,LIBRARY,DK-861510,INCORRECT,"Wikidata is main library, ours is campus branch"
```

**Analysis**:

- Names are similar, but ours carries a ", Biblioteket" suffix
- Likely a branch vs main library mismatch
- Needs verification on Wikidata (check P361 "part of")

### Example 2: Priority 2 - Needs Research

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
2,87.0,Fur Lokalhistoriske Arkiv,Randers Lokalhistoriske Arkiv,Skive,ARCHIVE,,INCORRECT,"City mismatch: Fur vs Randers, different local archives"
```

**Analysis**:

- City mismatch (Skive vs Randers)
- Both are local historical archives
- Likely a wrong match due to similar names
- No ISIL to cross-check

### Example 3: Priority 3 - Likely Correct

```csv
priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library,Aalborg,LIBRARY,DK-820010,CORRECT,"ISIL match, Danish/English variant, same entity"
```

**Analysis**:

- ISIL code match (DK-820010) = high confidence
- Danish vs English name variant
- City match
- Type match
- → Almost certainly CORRECT

---
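Checks like the three analyses above can be pre-computed to sort rows before a human looks at them. A minimal sketch (the function name, return labels, and heuristics are illustrative, not part of the project scripts, and never replace the manual decision):

```python
def suggest_status(row: dict) -> str:
    """Heuristic triage for one review-CSV row; the final call stays manual."""
    name = row["institution_name"]
    label = row["wikidata_label"]
    city = row["city"]

    # Branch/school-library pattern: our record carries a ", Biblioteket"
    # suffix but the Wikidata label is a plain "<City> Bibliotek".
    if name.endswith(", Biblioteket") and not label.endswith(", Biblioteket"):
        return "LIKELY_INCORRECT"

    # Location-mismatch pattern: neither our name nor the Wikidata label
    # mentions the city we have on record.
    if city and city not in name and city not in label:
        return "LIKELY_INCORRECT"

    # An ISIL code on our side is a strong positive signal
    # (still verify P791 on the Wikidata entity).
    if row.get("isil_code"):
        return "LIKELY_CORRECT"

    return "UNCERTAIN"
```

Run over the CSV with `csv.DictReader`, this reproduces the three sample verdicts above, but it is only a pre-sort: every row still gets a human `validation_status`.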
## Validation Expectations

### Predicted Outcomes (Based on Match Scores)

| Status | Expected % | Expected Count | Description |
|--------|------------|----------------|-------------|
| **CORRECT** | 85-90% | 157-167 | Keep Wikidata link, update provenance |
| **INCORRECT** | 5-10% | 9-19 | Remove Wikidata link, document reason |
| **UNCERTAIN** | 5% | 9 | Flag for expert review, keep tentatively |

### Quality Thresholds

**Acceptable Quality**:
- ≥80% CORRECT
- ≤15% INCORRECT
- ≤10% UNCERTAIN

**High Quality**:
- ≥90% CORRECT
- ≤5% INCORRECT
- ≤5% UNCERTAIN

**Red Flags** (indicate algorithm issues):
- <70% CORRECT
- >20% INCORRECT
- >15% UNCERTAIN

---

## Known Issues to Watch For

### Issue 1: Branch vs Main Library

**Pattern**: Institution name ends with ", Biblioteket" (the library)

**Example**:
- Ours: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"

**Likely Outcome**: INCORRECT (branch matched to main library)
**Fix**: Check Wikidata for a "part of" (P361) relationship

---

### Issue 2: Gymnasium Libraries

**Pattern**: Institution name starts with "[School Name] Gymnasium, Biblioteket"

**Example**:
- Ours: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek"

**Likely Outcome**: INCORRECT (school library matched to public library)
**Fix**: Verify the institution type on Wikidata (P31)

---

### Issue 3: Location Mismatch

**Pattern**: City name differs between our dataset and Wikidata

**Example**:
- Ours: "Fur Lokalhistoriske Arkiv" (Skive)
- Wikidata: "Randers Lokalhistoriske Arkiv" (Randers)

**Likely Outcome**: INCORRECT (similar names, different cities)
**Fix**: Google "[institution name] Denmark" to confirm the location

---

### Issue 4: Historical Name Changes

**Pattern**: Institution renamed; Wikidata has the old or new name

**Example**:
- Ours: "Statsbiblioteket" (historical)
- Wikidata: "Royal Danish Library" (current, post-merger)

**Likely Outcome**: UNCERTAIN (need to check the merger date)
**Fix**: Check the Wikidata history; look for "replaced by" (P1366) or "end time" (P582)

---
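The "replaced by" check in Issue 4 can be scripted against Wikidata's public entity-data endpoint. A sketch, assuming the standard Wikibase JSON claim layout (the helper name is illustrative):

```python
import json
from urllib.request import urlopen  # only needed for the live lookup below

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def successor_qid(entity_json: dict, qid: str):
    """Return the target QID of a 'replaced by' (P1366) claim, or None.

    `entity_json` is the parsed response of the Special:EntityData
    endpoint, which wraps claims in the standard Wikibase JSON layout.
    """
    claims = entity_json["entities"][qid].get("claims", {})
    for claim in claims.get("P1366", []):
        return claim["mainsnak"]["datavalue"]["value"]["id"]
    return None


# Live usage (network access required; substitute a real QID):
# entity = json.load(urlopen(ENTITY_DATA.format(qid=some_qid)))
# print(successor_qid(entity, some_qid))
```

If `successor_qid` returns a QID for a matched entity, the row almost certainly needs the merged-institution treatment described in Issue 4.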
### Issue 5: Multilingual Variants

**Pattern**: Danish name vs English Wikidata label

**Example**:
- Ours: "Rigsarkivet"
- Wikidata: "Danish National Archives"

**Likely Outcome**: CORRECT (same entity, different language)
**Fix**: Check Wikidata for a Danish label/alias

---

## Danish Language Resources

### Useful Terms

| Danish | English | Context |
|--------|---------|---------|
| Bibliotek | Library | General library |
| Bibliotekerne | The libraries | Library system |
| Hovedbiblioteket | Main library | Central/flagship branch |
| Kombi-bibliotek | Combined library | Library + community center |
| Lokalhistoriske Arkiv | Local history archive | Municipal archive |
| Rigsarkivet | National Archives | Denmark's national archive |
| Statsbiblioteket | State Library | Historical name (merged) |
| Universitetsbibliotek | University library | Academic library |
| Centralbibliotek | Central library | Main branch |
| Filial | Branch | Library branch |
| Gymnasium | High school | Upper secondary school |

### Danish ISIL Prefixes

- **DK-8xxxxx**: Libraries (6-digit codes)
- **DK-01x**: National institutions (Rigsarkivet = DK-011)
- **DK-xxx**: Archives and special collections

---

## Research Resources

### Primary Validation Sources

1. **Danish ISIL Registry**: https://isil.dk
   - Authoritative source for ISIL codes
   - Search by institution name or code
   - Official library/archive registry
2. **Wikidata Query Service**: https://query.wikidata.org
   - SPARQL endpoint for bulk queries
   - Check P791 (ISIL), P17 (country), P31 (type)
3. **Danish Library Portal**: https://bibliotek.dk
   - Public library directory
   - Search by city or name
4. **Danish National Archives**: https://www.sa.dk
   - Archive directory
   - Member institution list
5. **Danish Agency for Culture**: https://slks.dk
   - Government heritage institutions
   - Official museum/gallery registers
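The Wikidata Query Service (resource 2 above) accepts SPARQL, so ISIL codes can be resolved in bulk rather than row by row. A sketch that only builds the query string (the helper name is illustrative; paste the output into https://query.wikidata.org or POST it to the endpoint):

```python
def isil_lookup_query(isil: str) -> str:
    """SPARQL for the Wikidata Query Service: find the entity carrying a
    given ISIL (P791) and return its label, country (P17) and type (P31)."""
    return f"""
SELECT ?item ?itemLabel ?countryLabel ?typeLabel WHERE {{
  ?item wdt:P791 "{isil}" .
  OPTIONAL {{ ?item wdt:P17 ?country . }}
  OPTIONAL {{ ?item wdt:P31 ?type . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "da,en". }}
}}
"""


print(isil_lookup_query("DK-820010"))
```

If the ISIL resolves to a different QID than the fuzzy match proposed, that is strong evidence for INCORRECT (modulo the code-reassignment caveat in the FAQ).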
---

## Post-Validation Workflow

### After Manual Review Completed

```bash
# Step 1: Apply validation results
python scripts/apply_wikidata_validation.py

# Expected output:
# - data/instances/denmark_complete_validated.json
# - Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN

# Step 2: Re-export RDF with corrections
python scripts/export_denmark_rdf.py \
  --input data/instances/denmark_complete_validated.json \
  --output data/rdf/denmark_validated

# Expected output:
# - denmark_validated.ttl
# - denmark_validated.rdf
# - denmark_validated.jsonld
# - denmark_validated.nt

# Step 3: Update documentation
# - PROGRESS.md: Add validation statistics
# - SESSION_SUMMARY: Document findings
# - data/rdf/README.md: Note validated version

# Step 4: Commit changes
git add data/review/denmark_wikidata_fuzzy_matches.csv
git add data/instances/denmark_complete_validated.json
git add data/rdf/denmark_validated.*
git commit -m "feat: Apply manual Wikidata validation to Danish dataset (185 fuzzy matches reviewed)"
```

---

## Validation Metrics Tracking

### Template for Progress Updates

```markdown
## Wikidata Validation Progress

**Date**: YYYY-MM-DD
**Reviewer**: [Name]

### Review Status

- [x] Report generated (185 matches)
- [ ] Priority 1 reviewed (58 matches)
- [ ] Priority 2 reviewed (62 matches)
- [ ] Priority 3 reviewed (44 matches)
- [ ] Priority 4-5 reviewed (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported

### Preliminary Results (after X matches reviewed)

| Status | Count | % |
|--------|-------|---|
| CORRECT | X | X% |
| INCORRECT | X | X% |
| UNCERTAIN | X | X% |
| Not Reviewed | X | X% |

### Common Issues Found

1. [Issue description]
2. [Issue description]

### Time Spent

- Priority 1: X hours
- Priority 2: X hours
- Total: X hours

### Next Steps

- [ ] [Action item]
- [ ] [Action item]
```

---

## FAQ

### Q: Can I skip Priority 4-5 matches?

**A**: Yes, if time-constrained. Priority 1-2 (64.9% of fuzzy matches) captures most of the uncertainty; Priority 4-5 matches have 93-99% confidence and are likely correct.

### Q: What if I can't determine CORRECT vs INCORRECT?

**A**: Mark the row UNCERTAIN and add detailed notes, then flag it for expert review (Danish language expertise or local knowledge).

### Q: How do I handle merged institutions?

**A**: Check Wikidata for the "replaced by" (P1366) property. If our data describes the post-merger institution and the Wikidata entity is pre-merger → INCORRECT. Document the merger date in the notes.

### Q: Should I edit Wikidata during review?

**A**: Optional but helpful. If you find missing Danish labels or incorrect data on Wikidata, you can edit it (requires a Wikidata account). Document any edits in `validation_notes`.

### Q: What if ISIL codes don't match?

**A**: An ISIL mismatch almost always means INCORRECT, since the ISIL is an authoritative identifier. Exception: Wikidata may hold an outdated ISIL after a code reassignment.

### Q: How do I validate branch libraries?

**A**: Check Wikidata for the "part of" (P361) property. If the Wikidata entity is the parent system, the match may still be CORRECT (an acceptable level of abstraction); a branch-to-branch mismatch → INCORRECT.

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-19 | Initial report generation (185 fuzzy matches) |

---

## Contact

**Questions?** Open an issue on GitHub or contact the project maintainer.
**Found a bug in the scripts?** Report at: [GitHub Issues]
**Need Danish language help?** [Contact Danish institutional partners]

---

**Status**: 🟡 Awaiting Manual Review
**Next Milestone**: Priority 1-2 review completion (120 matches)
**Estimated Completion**: [Add date after work begins]