# Wikidata Fuzzy Match Review Package

This directory contains files for manual validation of fuzzy Wikidata matches in the Danish GLAM dataset.

## 📋 Package Contents

### Review Data
- **`denmark_wikidata_fuzzy_matches.csv`** - Main review file (185 fuzzy matches)
  - **ACTION REQUIRED**: Fill in `validation_status` and `validation_notes` columns
  - Prioritized by match score (Priority 1 = most uncertain)

### Documentation
See `/docs/` directory:
- **`WIKIDATA_VALIDATION_CHECKLIST.md`** - Step-by-step validation guide
- **`WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md`** - Complete overview and FAQ

### Scripts
See `/scripts/` directory:
- **`generate_wikidata_review_report.py`** - ✅ Already run (generated this CSV)
- **`check_validation_progress.py`** - Check review progress anytime
- **`apply_wikidata_validation.py`** - Apply completed review to dataset

---

## 🚀 Quick Start

### 1. Open CSV for Review

```bash
# Option A: Excel (macOS; `open` uses the default app for .csv files)
open denmark_wikidata_fuzzy_matches.csv

# Option B: Google Sheets
# Upload denmark_wikidata_fuzzy_matches.csv

# Option C: LibreOffice
libreoffice --calc denmark_wikidata_fuzzy_matches.csv
```

### 2. Review Process (For Each Row)

1. **Compare names**: `institution_name` vs `wikidata_label`
2. **Click URL**: Visit `wikidata_url` to verify match
3. **Check metadata**: City, institution type, ISIL code
4. **Decide**: Fill `validation_status` with:
   - `CORRECT` - Keep Wikidata link
   - `INCORRECT` - Remove Wikidata link
   - `UNCERTAIN` - Flag for expert review
5. **Document**: Add explanation in `validation_notes` (optional but recommended)

### 3. Check Progress

```bash
python ../scripts/check_validation_progress.py
```

Output shows:
- Overall completion percentage
- Progress by priority level
- Estimated time remaining
- Quality indicators

### 4. Apply Validation Results

```bash
# After completing review
python ../scripts/apply_wikidata_validation.py
```

Output:
- `denmark_complete_validated.json` (updated dataset)
- Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN

---

## 📊 Review Statistics

- **Generated**: 2025-11-19
- **Total Fuzzy Matches**: 185
- **Match Score Range**: 85-99%

### Priority Breakdown

| Priority | Score Range | Count | Status |
|----------|-------------|-------|--------|
| 1 | 85-87% | 58 | ⬜ Not Started |
| 2 | 87-90% | 62 | ⬜ Not Started |
| 3 | 90-93% | 44 | ⬜ Not Started |
| 4 | 93-96% | 14 | ⬜ Not Started |
| 5 | 96-99% | 7 | ⬜ Not Started |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9%)

### Institution Types
- **LIBRARY**: 152 (82.2%)
- **ARCHIVE**: 33 (17.8%)

---

## ⏱️ Time Estimates

- **Priority 1-2** (120 matches): 4-6 hours (~2-3 min per match)
- **Priority 3-5** (65 matches): 1-2 hours (~1-2 min per match)
- **Total**: 5-8 hours

**Tips for Faster Review**:
- Sort by priority (start with 1)
- Filter by institution type (review all libraries, then archives)
- Use keyboard shortcuts in spreadsheet software
- Mark obvious matches first (ISIL match = instant CORRECT)

---

## ✅ Validation Checklist

Before running `apply_wikidata_validation.py`:

- [ ] All Priority 1 rows have `validation_status` filled
- [ ] All Priority 2 rows have `validation_status` filled
- [ ] At least 50% of rows have `validation_notes`
- [ ] Status values are exact: `CORRECT`, `INCORRECT`, or `UNCERTAIN`
- [ ] No accidentally deleted data in other columns
- [ ] CSV saved in original location

---

## 📚 Key Resources

### Danish Registries (For Research)
- **ISIL Registry**: https://isil.dk
- **Library Portal**: https://bibliotek.dk
- **National Archives**: https://www.sa.dk
- **Cultural Agency**: https://slks.dk

### Wikidata Tools
- **Query Service**: https://query.wikidata.org
- **Property Browser**: https://www.wikidata.org/wiki/Special:ListProperties
- **Entity Search**: https://www.wikidata.org/wiki/Special:Search

### Validation Key Properties
Check these on Wikidata:
- **P31** (instance of) - Should match `institution_type`
- **P17** (country) - Should be Q35 (Denmark)
- **P791** (ISIL) - Cross-check with `isil_code` column
- **P131** (located in) - Should match city
- **P625** (coordinates) - Verify location on map

---

## 🔍 Common Validation Scenarios

### Scenario 1: Danish vs English Names
- **Example**: "Rigsarkivet" vs "Danish National Archives"
- **Decision**: CORRECT (same entity, different language)
- **Check**: Wikidata should have Danish label as alias

### Scenario 2: Branch vs Main Library
- **Example**: "Campus Vejle, Biblioteket" vs "Vejle Bibliotek"
- **Decision**: INCORRECT (branch matched to main)
- **Check**: Wikidata P361 (part of) property

### Scenario 3: Type Mismatch
- **Example**: Our LIBRARY vs Wikidata MUSEUM
- **Decision**: INCORRECT (wrong entity type)
- **Check**: Wikidata P31 (instance of) property

### Scenario 4: Historical Merger
- **Example**: "Statsbiblioteket" vs "Royal Danish Library"
- **Decision**: UNCERTAIN (check merger date)
- **Check**: Wikidata P1366 (replaced by) or P582 (end time)

### Scenario 5: ISIL Match
- **Example**: Both have DK-820010
- **Decision**: CORRECT (authoritative identifier match)
- **Check**: No further validation needed

---

## 📈 Expected Outcomes

Based on match score distribution:

| Status | Expected % | Expected Count |
|--------|------------|----------------|
| CORRECT | 85-90% | 157-167 |
| INCORRECT | 5-10% | 9-19 |
| UNCERTAIN | 5% | 9 |

**Quality Thresholds**:
- ✅ Acceptable: ≥80% CORRECT, ≤15% INCORRECT
- 🌟 High Quality: ≥90% CORRECT, ≤5% INCORRECT
- 🚨 Red Flag: <70% CORRECT, >20% INCORRECT

---

## 🛠️ Troubleshooting

### Issue: CSV won't open with proper encoding
**Solution**: Ensure UTF-8 encoding when opening

```bash
# Excel: Data → From Text/CSV → File Origin: 65001: Unicode (UTF-8)
# Google Sheets: File → Import → Character encoding: UTF-8
# LibreOffice: Open → Character set: Unicode (UTF-8)
```

### Issue: Progress checker shows 0% but I've reviewed some
**Solution**: Ensure `validation_status` values are EXACT (all caps):
- ✅ `CORRECT` (accepted)
- ❌ `Correct`, `correct`, `OK` (not recognized)

### Issue: Apply script fails with "No validation results"
**Solution**: Check that `validation_status` column has values (not blank)

### Issue: Can't decide CORRECT vs INCORRECT
**Solution**: Mark as `UNCERTAIN` and add detailed notes. Flag for expert review.

---

## 📧 Support

- **Questions?** See `/docs/WIKIDATA_VALIDATION_CHECKLIST.md` for detailed guidance
- **Found issues?** Open GitHub issue or contact project maintainer
- **Need Danish expertise?** Contact institutional partners for local knowledge

---

## 🎯 Current Status

- **Progress**: 0/185 (0.0%)
- **Priority 1**: 0/58 (0.0%)
- **Priority 2**: 0/62 (0.0%)
- **Next Step**: Begin Priority 1 review

Run `python ../scripts/check_validation_progress.py` to update status.

---

- **Last Updated**: 2025-11-19
- **Dataset**: `denmark_complete_enriched.json` (2,348 institutions)
- **Wikidata Links**: 769 total (584 exact + 185 fuzzy)
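
## 🧪 Appendix: Pre-Apply Sanity Check (Sketch)

The items in the Validation Checklist above can also be checked mechanically before running `apply_wikidata_validation.py`. The following is a minimal sketch, not part of the project scripts: `check_review_rows` and `check_review_file` are hypothetical helpers, and the column name `priority` is an assumption (only `validation_status` and `validation_notes` are named in this README), so adjust it to the actual CSV header.

```python
import csv
import io

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def check_review_rows(rows, priority_column="priority"):
    """Return a list of human-readable problems found in reviewed rows.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    `priority_column` is an ASSUMED column name; change it to match the CSV.
    """
    rows = list(rows)
    problems = []
    notes_filled = 0
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        status = (row.get("validation_status") or "").strip()
        priority = (row.get(priority_column) or "").strip()
        if (row.get("validation_notes") or "").strip():
            notes_filled += 1
        if status and status not in VALID_STATUSES:
            # Catches values such as "Correct" or "OK"
            problems.append(f"row {line_no}: invalid status {status!r}")
        elif not status and priority in {"1", "2"}:
            problems.append(f"row {line_no}: Priority {priority} row has no status")
    if rows and notes_filled / len(rows) < 0.5:
        problems.append(
            f"only {notes_filled}/{len(rows)} rows have validation_notes (target: 50%)"
        )
    return problems


def check_review_file(path, priority_column="priority"):
    """Read the review CSV and run check_review_rows on it.

    "utf-8-sig" reads plain UTF-8 and also tolerates a byte-order mark,
    which some spreadsheet tools add when saving.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return check_review_rows(csv.DictReader(handle), priority_column)


# Made-up rows for illustration only; these are not real review data.
SAMPLE = """institution_name,priority,validation_status,validation_notes
Example Library A,1,CORRECT,ISIL matches
Example Library B,1,Correct,
Example Archive C,2,,
Example Library D,4,,
"""

if __name__ == "__main__":
    for problem in check_review_rows(csv.DictReader(io.StringIO(SAMPLE))):
        print(problem)
```

To try it on the real file, something like `check_review_file("denmark_wikidata_fuzzy_matches.csv")` should return an empty list once the first four checklist items are satisfied, provided the priority column name matches. It does not detect accidentally deleted data in other columns.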