Wikidata Fuzzy Match Review Package
This directory contains files for manual validation of fuzzy Wikidata matches in the Danish GLAM dataset.
📋 Package Contents
Review Data
- denmark_wikidata_fuzzy_matches.csv - Main review file (185 fuzzy matches)
  - ACTION REQUIRED: Fill in validation_status and validation_notes columns
  - Prioritized by match score (Priority 1 = most uncertain)
Documentation
See /docs/ directory:
- WIKIDATA_VALIDATION_CHECKLIST.md - Step-by-step validation guide
- WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md - Complete overview and FAQ
Scripts
See /scripts/ directory:
- generate_wikidata_review_report.py - ✅ Already run (generated this CSV)
- check_validation_progress.py - Check review progress anytime
- apply_wikidata_validation.py - Apply completed review to dataset
🚀 Quick Start
1. Open CSV for Review
# Option A: Excel (macOS; opens the default CSV application)
open denmark_wikidata_fuzzy_matches.csv
# Option B: Google Sheets
# Upload denmark_wikidata_fuzzy_matches.csv
# Option C: LibreOffice
libreoffice --calc denmark_wikidata_fuzzy_matches.csv
2. Review Process (For Each Row)
- Compare names: institution_name vs wikidata_label
- Click URL: Visit wikidata_url to verify match
- Check metadata: City, institution type, ISIL code
- Decide: Fill validation_status with:
  - CORRECT - Keep Wikidata link
  - INCORRECT - Remove Wikidata link
  - UNCERTAIN - Flag for expert review
- Document: Add explanation in validation_notes (optional but recommended)
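For reviewers who prefer to record decisions programmatically rather than in a spreadsheet, the Decide and Document steps could be sketched as follows. The helper `record_decision` is hypothetical and is not part of the project's scripts; only the column names and the three status values come from this README.

```python
# Hypothetical helper: not part of /scripts/. It only illustrates the
# "Decide" and "Document" steps using the column names from this README.
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def record_decision(row: dict, status: str, notes: str = "") -> dict:
    """Return a copy of the row with the review decision filled in."""
    if status not in VALID_STATUSES:
        raise ValueError(
            f"validation_status must be one of {sorted(VALID_STATUSES)}, got {status!r}"
        )
    updated = dict(row)  # leave the original row untouched
    updated["validation_status"] = status
    updated["validation_notes"] = notes
    return updated


row = {
    "institution_name": "Rigsarkivet",
    "wikidata_label": "Danish National Archives",
    "validation_status": "",
    "validation_notes": "",
}
reviewed = record_decision(row, "CORRECT", "Same entity; Danish vs English label")
print(reviewed["validation_status"], "-", reviewed["validation_notes"])
```

Rejecting anything other than the three exact values at entry time avoids the case-sensitivity problems described under Troubleshooting.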
3. Check Progress
python ../scripts/check_validation_progress.py
Output shows:
- Overall completion percentage
- Progress by priority level
- Estimated time remaining
- Quality indicators
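As a rough illustration of the per-priority figures, the sketch below counts filled validation_status cells per priority level. It is not the actual check_validation_progress.py; the `priority` column name is an assumption, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def progress_by_priority(rows):
    """Map each priority to (reviewed_count, total_count)."""
    totals = defaultdict(int)
    done = defaultdict(int)
    for row in rows:
        priority = row["priority"]  # assumed column name
        totals[priority] += 1
        if row["validation_status"].strip():
            done[priority] += 1
    return {p: (done[p], totals[p]) for p in sorted(totals)}


# Made-up sample standing in for denmark_wikidata_fuzzy_matches.csv
SAMPLE = """priority,institution_name,validation_status
1,A,CORRECT
1,B,
2,C,UNCERTAIN
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = progress_by_priority(rows)
for priority, (reviewed, total) in report.items():
    print(f"Priority {priority}: {reviewed}/{total} ({100 * reviewed / total:.1f}%)")
```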
4. Apply Validation Results
# After completing review
python ../scripts/apply_wikidata_validation.py
Output:
- denmark_complete_validated.json (updated dataset)
- Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN
📊 Review Statistics
Generated: 2025-11-19
Total Fuzzy Matches: 185
Match Score Range: 85-99%
Priority Breakdown
| Priority | Score Range | Count | Status |
|---|---|---|---|
| 1 | 85-87% | 58 | ⬜ Not Started |
| 2 | 87-90% | 62 | ⬜ Not Started |
| 3 | 90-93% | 44 | ⬜ Not Started |
| 4 | 93-96% | 14 | ⬜ Not Started |
| 5 | 96-99% | 7 | ⬜ Not Started |
Recommended Focus: Priority 1-2 (120 matches = 64.9%)
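The score ranges in the table share their boundary values (87, 90, 93, 96), so the exact bucketing rule is not stated here. The sketch below assumes each range includes its lower bound and excludes its upper bound, except the last, which includes 99. The function name `priority_for_score` is hypothetical.

```python
from bisect import bisect_right

# Lower bound of each priority band, taken from the table above.
# Assumption: bands are half-open, e.g. Priority 1 covers 85 <= score < 87.
LOWER_BOUNDS = [85, 87, 90, 93, 96]
COUNTS = {1: 58, 2: 62, 3: 44, 4: 14, 5: 7}


def priority_for_score(score: float) -> int:
    """Return the priority (1 = most uncertain) for a match score in percent."""
    if not 85 <= score <= 99:
        raise ValueError(f"score {score} is outside the 85-99% fuzzy range")
    return bisect_right(LOWER_BOUNDS, score)


print(priority_for_score(86.5))
print(priority_for_score(87))
print(sum(COUNTS.values()))
print(f"{100 * (COUNTS[1] + COUNTS[2]) / sum(COUNTS.values()):.1f}%")
```

The last two lines reproduce the totals quoted above: 185 matches overall, with Priority 1-2 making up 64.9%.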
Institution Types
- LIBRARY: 152 (82.2%)
- ARCHIVE: 33 (17.8%)
⏱️ Time Estimates
- Priority 1-2 (120 matches): 4-6 hours (~2-3 min per match)
- Priority 3-5 (65 matches): 1-2 hours (~1-2 min per match)
- Total: 5-8 hours
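The totals follow directly from the per-match estimates; a short check of the arithmetic, assuming the per-match minute ranges hold on average:

```python
def hours_range(matches: int, min_minutes: float, max_minutes: float):
    """Return (low, high) review time in hours for a batch of matches."""
    return matches * min_minutes / 60, matches * max_minutes / 60


p12_low, p12_high = hours_range(120, 2, 3)  # Priority 1-2
p35_low, p35_high = hours_range(65, 1, 2)   # Priority 3-5
total_low = p12_low + p35_low
total_high = p12_high + p35_high
print(f"Priority 1-2: {p12_low:.1f}-{p12_high:.1f} h")
print(f"Priority 3-5: {p35_low:.1f}-{p35_high:.1f} h")
print(f"Total: {total_low:.1f}-{total_high:.1f} h")
```

The unrounded total is slightly above the quoted 5-8 hours because the Priority 3-5 range is rounded down to 1-2 hours in the list.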
Tips for Faster Review:
- Sort by priority (start with 1)
- Filter by institution type (review all libraries, then archives)
- Use keyboard shortcuts in spreadsheet software
- Mark obvious matches first (ISIL match = instant CORRECT)
✅ Validation Checklist
Before running apply_wikidata_validation.py:
- All Priority 1 rows have validation_status filled
- All Priority 2 rows have validation_status filled
- At least 50% of rows have validation_notes
- Status values are exact: CORRECT, INCORRECT, or UNCERTAIN
- No accidentally deleted data in other columns
- CSV saved in original location
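The first four checklist items can be verified mechanically before applying results. The following is a minimal sketch, not a feature of apply_wikidata_validation.py; the function `preflight_problems` and the `priority` column name are assumptions. The last two items (deleted data, file location) still need a manual check.

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def preflight_problems(rows):
    """Return a list of checklist violations; an empty list means ready."""
    problems = []
    # Items 1-2: every Priority 1 and Priority 2 row has a status.
    for priority in ("1", "2"):
        missing = sum(
            1
            for r in rows
            if r["priority"] == priority and not r["validation_status"].strip()
        )
        if missing:
            problems.append(
                f"Priority {priority}: {missing} row(s) without validation_status"
            )
    # Item 4: filled statuses must be one of the three exact values.
    invalid = sorted(
        {
            r["validation_status"]
            for r in rows
            if r["validation_status"].strip()
            and r["validation_status"] not in VALID_STATUSES
        }
    )
    if invalid:
        problems.append(f"Invalid status values: {invalid}")
    # Item 3: at least half of all rows carry notes.
    with_notes = sum(1 for r in rows if r["validation_notes"].strip())
    if rows and with_notes / len(rows) < 0.5:
        problems.append(f"Only {with_notes}/{len(rows)} rows have validation_notes")
    return problems


# Made-up rows for illustration
rows = [
    {"priority": "1", "validation_status": "CORRECT", "validation_notes": "ISIL match"},
    {"priority": "1", "validation_status": "", "validation_notes": ""},
    {"priority": "2", "validation_status": "correct", "validation_notes": "looks fine"},
]
problems = preflight_problems(rows)
for problem in problems:
    print(problem)
```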
📚 Key Resources
Danish Registries (For Research)
- ISIL Registry: https://isil.dk
- Library Portal: https://bibliotek.dk
- National Archives: https://www.sa.dk
- Cultural Agency: https://slks.dk
Wikidata Tools
- Query Service: https://query.wikidata.org
- Property Browser: https://www.wikidata.org/wiki/Special:ListProperties
- Entity Search: https://www.wikidata.org/wiki/Special:Search
Validation Key Properties
Check these on Wikidata:
- P31 (instance of) - Should match institution_type
- P17 (country) - Should be Q35 (Denmark)
- P791 (ISIL) - Cross-check with isil_code column
- P131 (located in) - Should match city
- P625 (coordinates) - Verify location on map
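When checking many entities, the same properties can be read from the entity JSON that Wikidata serves at https://www.wikidata.org/wiki/Special:EntityData/<QID>.json. The sketch below only parses such a document; the helper `claim_values` is hypothetical, and the sample entity (including the ID Q00000) is a made-up stand-in that mimics the JSON layout rather than real data.

```python
def claim_values(entity: dict, prop: str) -> list:
    """Return the main values of one property from a Wikidata entity dict.

    Item-valued claims are reduced to their Q-ID; other values
    (strings, coordinates) are returned as stored.
    """
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            value = value["id"]
        values.append(value)
    return values


# Made-up entity in the shape of the "entities" -> QID object of the JSON export
entity = {
    "id": "Q00000",  # hypothetical ID
    "claims": {
        "P17": [{"mainsnak": {"snaktype": "value",
                              "datavalue": {"value": {"id": "Q35"}}}}],
        "P791": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": "DK-820010"}}}],
    },
}
print(claim_values(entity, "P17"))
print(claim_values(entity, "P791"))
```

A returned P17 value of Q35 confirms the country, and the P791 value can be compared directly with the isil_code column.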
🔍 Common Validation Scenarios
Scenario 1: Danish vs English Names
Example: "Rigsarkivet" vs "Danish National Archives"
Decision: CORRECT (same entity, different language)
Check: Wikidata should have Danish label as alias
Scenario 2: Branch vs Main Library
Example: "Campus Vejle, Biblioteket" vs "Vejle Bibliotek"
Decision: INCORRECT (branch matched to main)
Check: Wikidata P361 (part of) property
Scenario 3: Type Mismatch
Example: Our LIBRARY vs Wikidata MUSEUM
Decision: INCORRECT (wrong entity type)
Check: Wikidata P31 (instance of) property
Scenario 4: Historical Merger
Example: "Statsbiblioteket" vs "Royal Danish Library"
Decision: UNCERTAIN (check merger date)
Check: Wikidata P1366 (replaced by) or P582 (end time)
Scenario 5: ISIL Match
Example: Both have DK-820010
Decision: CORRECT (authoritative identifier match)
Check: No further validation needed
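Of the five scenarios, only the ISIL match lends itself to an automatic suggestion; the others need human judgement. A minimal sketch of that shortcut follows. The function `isil_shortcut` is hypothetical, and the case- and whitespace-insensitive comparison is an assumption about how codes should be normalized.

```python
def isil_shortcut(isil_code, wikidata_isils):
    """Suggest CORRECT when our ISIL code appears among Wikidata's P791 values.

    Returns None when there is no match, meaning a manual review is
    still required (it does NOT imply INCORRECT).
    """
    ours = (isil_code or "").strip().upper()
    theirs = {code.strip().upper() for code in wikidata_isils}
    if ours and ours in theirs:
        return "CORRECT"
    return None


print(isil_shortcut("DK-820010", ["DK-820010"]))
print(isil_shortcut("DK-820010", []))
```

A missing ISIL on either side says nothing about the match, which is why the function returns None rather than INCORRECT in that case.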
📈 Expected Outcomes
Based on match score distribution:
| Status | Expected % | Expected Count |
|---|---|---|
| CORRECT | 85-90% | 157-167 |
| INCORRECT | 5-10% | 9-19 |
| UNCERTAIN | 5% | 9 |
Quality Thresholds:
- ✅ Acceptable: ≥80% CORRECT, ≤15% INCORRECT
- 🌟 High Quality: ≥90% CORRECT, ≤5% INCORRECT
- 🚨 Red Flag: <70% CORRECT, >20% INCORRECT
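The thresholds can be applied to the final counts as sketched below. Two points are assumptions rather than stated rules: each comma in the thresholds is read as "and" for Acceptable and High Quality but as "or" for Red Flag, and results that fall between the Acceptable and Red Flag bands get the hypothetical label BETWEEN_THRESHOLDS.

```python
def quality_label(correct: int, incorrect: int, uncertain: int) -> str:
    """Classify a completed review against the quality thresholds."""
    total = correct + incorrect + uncertain
    if total == 0:
        raise ValueError("no reviewed rows")
    pct_correct = 100 * correct / total
    pct_incorrect = 100 * incorrect / total
    if pct_correct >= 90 and pct_incorrect <= 5:
        return "HIGH_QUALITY"
    if pct_correct >= 80 and pct_incorrect <= 15:
        return "ACCEPTABLE"
    if pct_correct < 70 or pct_incorrect > 20:  # assumption: either condition
        return "RED_FLAG"
    return "BETWEEN_THRESHOLDS"  # hypothetical label, not defined in this README


print(quality_label(167, 9, 9))
print(quality_label(157, 19, 9))
print(quality_label(120, 50, 15))
```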
🛠️ Troubleshooting
Issue: CSV won't open with proper encoding
Solution: Ensure UTF-8 encoding when opening
# Excel: Data → From Text/CSV → File Origin: 65001: Unicode (UTF-8)
# Google Sheets: File → Import → Character encoding: UTF-8
# LibreOffice: Open → Character set: Unicode (UTF-8)
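When reading the file from Python instead of a spreadsheet, the encoding can be set explicitly. This sketch assumes the CSV is UTF-8, possibly with a byte-order mark added by a spreadsheet export; the function `read_review_csv` is hypothetical and the temporary file with one sample row exists only to make the example self-contained.

```python
import csv
import os
import tempfile


def read_review_csv(path):
    """Read the review CSV as UTF-8; 'utf-8-sig' also strips a leading BOM."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


# Self-contained demo: write a small UTF-8 file with a BOM, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf-8-sig", newline="") as handle:
        handle.write("institution_name,validation_status\n")
        handle.write("Københavns Biblioteker,\n")
    rows = read_review_csv(sample_path)

print(rows[0]["institution_name"])
```

Without the BOM handling, the first column name would carry an invisible prefix and lookups by institution_name would fail.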
Issue: Progress checker shows 0% but I've reviewed some
Solution: Ensure validation_status values are EXACT (all caps):
- ✅ CORRECT (correct)
- ❌ Correct, correct, OK (incorrect)
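Near-miss values can be located before rerunning the progress checker. The sketch below suggests a fix only when a value differs from a valid status by case or surrounding whitespace; anything else, such as OK, is left for the reviewer to resolve. The function `suggest_status_fix` is hypothetical.

```python
VALID_STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")


def suggest_status_fix(value: str):
    """Return the exact status a near-miss probably meant, or None if unclear."""
    candidate = value.strip().upper()
    return candidate if candidate in VALID_STATUSES else None


for raw in ["CORRECT", "Correct", " correct ", "OK"]:
    print(repr(raw), "->", suggest_status_fix(raw))
```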
Issue: Apply script fails with "No validation results"
Solution: Check that validation_status column has values (not blank)
Issue: Can't decide CORRECT vs INCORRECT
Solution: Mark as UNCERTAIN and add detailed notes. Flag for expert review.
📧 Support
Questions? See /docs/WIKIDATA_VALIDATION_CHECKLIST.md for detailed guidance
Found issues? Open GitHub issue or contact project maintainer
Need Danish expertise? Contact institutional partners for local knowledge
🎯 Current Status
Progress: 0/185 (0.0%)
Priority 1: 0/58 (0.0%)
Priority 2: 0/62 (0.0%)
Next Step: Begin Priority 1 review
Run python ../scripts/check_validation_progress.py to update status.
Last Updated: 2025-11-19
Dataset: denmark_complete_enriched.json (2,348 institutions)
Wikidata Links: 769 total (584 exact + 185 fuzzy)