# Wikidata Fuzzy Match Review Package

This directory contains files for manual validation of fuzzy Wikidata matches in the Danish GLAM dataset.

## 📋 Package Contents

### Review Data

- **`denmark_wikidata_fuzzy_matches.csv`** - Main review file (185 fuzzy matches)
  - **ACTION REQUIRED**: Fill in `validation_status` and `validation_notes` columns
  - Prioritized by match score (Priority 1 = most uncertain)

### Documentation

See `/docs/` directory:

- **`WIKIDATA_VALIDATION_CHECKLIST.md`** - Step-by-step validation guide
- **`WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md`** - Complete overview and FAQ

### Scripts

See `/scripts/` directory:

- **`generate_wikidata_review_report.py`** - ✅ Already run (generated this CSV)
- **`check_validation_progress.py`** - Check review progress anytime
- **`apply_wikidata_validation.py`** - Apply completed review to dataset

---

## 🚀 Quick Start

### 1. Open CSV for Review

```bash
# Option A: Excel
open denmark_wikidata_fuzzy_matches.csv

# Option B: Google Sheets
# Upload denmark_wikidata_fuzzy_matches.csv

# Option C: LibreOffice
libreoffice --calc denmark_wikidata_fuzzy_matches.csv
```

### 2. Review Process (For Each Row)

1. **Compare names**: `institution_name` vs `wikidata_label`
2. **Click URL**: Visit `wikidata_url` to verify match
3. **Check metadata**: City, institution type, ISIL code
4. **Decide**: Fill `validation_status` with:
   - `CORRECT` - Keep Wikidata link
   - `INCORRECT` - Remove Wikidata link
   - `UNCERTAIN` - Flag for expert review
5. **Document**: Add explanation in `validation_notes` (optional but recommended)

### 3. Check Progress

```bash
python ../scripts/check_validation_progress.py
```

Output shows:
- Overall completion percentage
- Progress by priority level
- Estimated time remaining
- Quality indicators
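
For a rough idea of what a progress summary involves, the following is a minimal sketch and not the logic of `check_validation_progress.py` itself. It assumes the CSV has a `priority` column (a hypothetical name; only `validation_status` is confirmed by this README) and counts a row as reviewed only when its status is one of the three exact values.

```python
import csv
import io
from collections import defaultdict

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def summarize_progress(csv_file):
    """Return (reviewed, total, per_priority) for an open CSV file object.

    per_priority maps each priority value to [reviewed, total].
    """
    per_priority = defaultdict(lambda: [0, 0])
    reviewed = total = 0
    for row in csv.DictReader(csv_file):
        # "priority" is an assumed column name, used here for illustration.
        priority = row.get("priority") or "?"
        status = (row.get("validation_status") or "").strip()
        done = status in VALID_STATUSES
        total += 1
        reviewed += done
        per_priority[priority][1] += 1
        per_priority[priority][0] += done
    return reviewed, total, dict(per_priority)


# Made-up sample data: one reviewed row, one blank, one with a lowercase status.
sample = io.StringIO(
    "priority,validation_status\n"
    "1,CORRECT\n"
    "1,\n"
    "2,correct\n"
)
reviewed, total, by_priority = summarize_progress(sample)
print(f"{reviewed}/{total} reviewed ({100 * reviewed / total:.1f}%)")
```

To try it on the real file, pass `open("denmark_wikidata_fuzzy_matches.csv", encoding="utf-8-sig", newline="")` instead of the in-memory sample.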

### 4. Apply Validation Results

```bash
# After completing review
python ../scripts/apply_wikidata_validation.py
```

Output:
- `denmark_complete_validated.json` (updated dataset)
- Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN

---

## 📊 Review Statistics

**Generated**: 2025-11-19
**Total Fuzzy Matches**: 185
**Match Score Range**: 85-99%

### Priority Breakdown

| Priority | Score Range | Count | Status |
|----------|-------------|-------|--------|
| 1 | 85-87% | 58 | ⬜ Not Started |
| 2 | 87-90% | 62 | ⬜ Not Started |
| 3 | 90-93% | 44 | ⬜ Not Started |
| 4 | 93-96% | 14 | ⬜ Not Started |
| 5 | 96-99% | 7 | ⬜ Not Started |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9%)
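
The score ranges in the table share their boundary values (87% appears in both Priority 1 and Priority 2), so the exact cut-offs used by the report script are not stated here. The sketch below shows one plausible reading, assuming each band includes its lower bound and excludes its upper bound, except the last band, which includes 99%. The function name is hypothetical.

```python
# (low, high, priority) bands taken from the table above.
PRIORITY_BANDS = [
    (85, 87, 1),
    (87, 90, 2),
    (90, 93, 3),
    (93, 96, 4),
    (96, 99, 5),
]


def priority_for_score(score):
    """Map a match score in percent to a review priority.

    Assumption: bands are half-open [low, high); the top score 99 is
    treated as part of Priority 5.
    """
    for low, high, priority in PRIORITY_BANDS:
        if low <= score < high:
            return priority
    if score == 99:
        return 5
    raise ValueError(f"score {score} is outside the fuzzy range 85-99")


print(priority_for_score(86.5))
print(priority_for_score(87))
```

Under this reading a score of exactly 87 lands in Priority 2; if the real script uses inclusive upper bounds, the comparison would need to change accordingly.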

### Institution Types

- **LIBRARY**: 152 (82.2%)
- **ARCHIVE**: 33 (17.8%)

---

## ⏱️ Time Estimates

- **Priority 1-2** (120 matches): 4-6 hours (~2-3 min per match)
- **Priority 3-5** (65 matches): 1-2 hours (~1-2 min per match)
- **Total**: 5-8 hours

**Tips for Faster Review**:
- Sort by priority (start with 1)
- Filter by institution type (review all libraries, then archives)
- Use keyboard shortcuts in spreadsheet software
- Mark obvious matches first (ISIL match = instant CORRECT)

---

## ✅ Validation Checklist

Before running `apply_wikidata_validation.py`:

- [ ] All Priority 1 rows have `validation_status` filled
- [ ] All Priority 2 rows have `validation_status` filled
- [ ] At least 50% of rows have `validation_notes`
- [ ] Status values are exact: `CORRECT`, `INCORRECT`, or `UNCERTAIN`
- [ ] No accidentally deleted data in other columns
- [ ] CSV saved in original location
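
Several of the checklist items can be checked mechanically before applying the review. The following is a minimal sketch of such a pre-flight check, not part of the shipped scripts. It assumes a `priority` column (hypothetical name) alongside the `validation_status` and `validation_notes` columns named above, and the sample rows are made up.

```python
import csv
import io

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def preflight(rows):
    """Return a list of human-readable problems found in the review rows."""
    problems = []
    # Spreadsheet row 1 is the header, so data starts at row 2.
    for n, row in enumerate(rows, start=2):
        status = row.get("validation_status") or ""
        if status and status not in VALID_STATUSES:
            problems.append(f"row {n}: unrecognized status {status!r}")
        # "priority" is an assumed column name.
        if not status and row.get("priority") in {"1", "2"}:
            problems.append(f"row {n}: Priority {row['priority']} row has no status")
    with_notes = sum(1 for row in rows if (row.get("validation_notes") or "").strip())
    if rows and with_notes / len(rows) < 0.5:
        problems.append(f"only {with_notes}/{len(rows)} rows have validation_notes")
    return problems


sample = io.StringIO(
    "priority,validation_status,validation_notes\n"
    "1,CORRECT,ISIL matches\n"
    "2,,\n"
    "3,Correct,\n"
)
problems = preflight(list(csv.DictReader(sample)))
for problem in problems:
    print(problem)
```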

---

## 📚 Key Resources

### Danish Registries (For Research)

- **ISIL Registry**: https://isil.dk
- **Library Portal**: https://bibliotek.dk
- **National Archives**: https://www.sa.dk
- **Cultural Agency**: https://slks.dk

### Wikidata Tools

- **Query Service**: https://query.wikidata.org
- **Property Browser**: https://www.wikidata.org/wiki/Special:ListProperties
- **Entity Search**: https://www.wikidata.org/wiki/Special:Search

### Validation Key Properties

Check these on Wikidata:

- **P31** (instance of) - Should match institution_type
- **P17** (country) - Should be Q35 (Denmark)
- **P791** (ISIL) - Cross-check with isil_code column
- **P131** (located in) - Should match city
- **P625** (coordinates) - Verify location on map
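
Instead of opening each entity page, the five properties can be pulled in one query pasted into the Query Service listed above. The sketch below only builds the SPARQL text; it does not send any request. The function name is hypothetical, and `Q35` is used purely as a sample identifier.

```python
import re

# The five properties listed above.
PROPERTIES = ("P31", "P17", "P791", "P131", "P625")


def build_property_query(qid):
    """Build a SPARQL query returning the listed properties for one entity."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    values = " ".join(f"wdt:{prop}" for prop in PROPERTIES)
    return (
        "SELECT ?prop ?value ?valueLabel WHERE {\n"
        f"  VALUES ?prop {{ {values} }}\n"
        f"  wd:{qid} ?prop ?value .\n"
        # Prefer Danish labels and fall back to English.
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en". }\n'
        "}"
    )


query = build_property_query("Q35")
print(query)
```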

---

## 🔍 Common Validation Scenarios

### Scenario 1: Danish vs English Names

**Example**: "Rigsarkivet" vs "Danish National Archives"
**Decision**: CORRECT (same entity, different language)
**Check**: Wikidata should have Danish label as alias

### Scenario 2: Branch vs Main Library

**Example**: "Campus Vejle, Biblioteket" vs "Vejle Bibliotek"
**Decision**: INCORRECT (branch matched to main)
**Check**: Wikidata P361 (part of) property

### Scenario 3: Type Mismatch

**Example**: Our LIBRARY vs Wikidata MUSEUM
**Decision**: INCORRECT (wrong entity type)
**Check**: Wikidata P31 (instance of) property

### Scenario 4: Historical Merger

**Example**: "Statsbiblioteket" vs "Royal Danish Library"
**Decision**: UNCERTAIN (check merger date)
**Check**: Wikidata P1366 (replaced by) or P582 (end time)

### Scenario 5: ISIL Match

**Example**: Both have DK-820010
**Decision**: CORRECT (authoritative identifier match)
**Check**: No further validation needed

---

## 📈 Expected Outcomes

Based on match score distribution:

| Status | Expected % | Expected Count |
|--------|------------|----------------|
| CORRECT | 85-90% | 157-167 |
| INCORRECT | 5-10% | 9-19 |
| UNCERTAIN | 5% | 9 |

**Quality Thresholds**:
- ✅ Acceptable: ≥80% CORRECT, ≤15% INCORRECT
- 🌟 High Quality: ≥90% CORRECT, ≤5% INCORRECT
- 🚨 Red Flag: <70% CORRECT, >20% INCORRECT
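
The thresholds can be applied to the final counts with a few comparisons. This is a sketch under two assumptions that the list above does not settle: a red flag is raised when either condition holds, and results that fall between the bands are reported as "needs attention". The function name and return labels are illustrative.

```python
def classify_quality(correct, incorrect, total):
    """Classify review results against the quality thresholds above."""
    pct_correct = 100 * correct / total
    pct_incorrect = 100 * incorrect / total
    # Assumption: either condition alone is enough for a red flag.
    if pct_correct < 70 or pct_incorrect > 20:
        return "red flag"
    if pct_correct >= 90 and pct_incorrect <= 5:
        return "high quality"
    if pct_correct >= 80 and pct_incorrect <= 15:
        return "acceptable"
    # Assumption: anything not covered by a named band needs a closer look.
    return "needs attention"


print(classify_quality(correct=157, incorrect=19, total=185))
```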

---

## 🛠️ Troubleshooting

### Issue: CSV won't open with proper encoding

**Solution**: Ensure UTF-8 encoding when opening

```bash
# Excel: Data → Get Data → From Text/CSV → File Origin: 65001: Unicode (UTF-8)
# Google Sheets: File → Import → Character encoding: UTF-8
# LibreOffice: Open → Character set: Unicode (UTF-8)
```
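
To confirm that the file itself is valid UTF-8, independent of any spreadsheet program, it can be decoded directly in Python. The sketch below uses the standard `utf-8-sig` codec, which also accepts a leading byte-order mark such as Excel writes. The helper name and the sample institution name are illustrative.

```python
import csv
import io


def read_review_rows(raw_bytes):
    """Decode CSV bytes as UTF-8 (with or without BOM) and return row dicts.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8, which
    usually means the file was re-saved in a legacy encoding.
    """
    text = raw_bytes.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Made-up sample with a BOM and Danish characters.
sample = "\ufeffinstitution_name\nSønderborg Bibliotek\n".encode("utf-8")
rows = read_review_rows(sample)
print(rows[0]["institution_name"])
```

For the real file, read the bytes with `open("denmark_wikidata_fuzzy_matches.csv", "rb").read()` and pass them to the helper.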

### Issue: Progress checker shows 0% but I've reviewed some rows

**Solution**: Ensure `validation_status` values are EXACT (all caps):
- ✅ `CORRECT` (accepted)
- ❌ `Correct`, `correct`, `OK` (not recognized)
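
Case and whitespace slips can be repaired automatically, while values such as `OK` still need a human decision. The following is a small sketch of that distinction; the function name is hypothetical and the scripts in this package may behave differently.

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def suggest_fix(value):
    """Return the canonical status for a near-miss value, or None.

    Only differences in letter case and surrounding whitespace are
    repaired; anything else is left for the reviewer to resolve.
    """
    candidate = value.strip().upper()
    return candidate if candidate in VALID_STATUSES else None


for raw in ["Correct", " correct ", "OK"]:
    print(repr(raw), "->", suggest_fix(raw))
```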

### Issue: Apply script fails with "No validation results"

**Solution**: Check that `validation_status` column has values (not blank)

### Issue: Can't decide CORRECT vs INCORRECT

**Solution**: Mark as `UNCERTAIN` and add detailed notes. Flag for expert review.

---

## 📧 Support

**Questions?** See `/docs/WIKIDATA_VALIDATION_CHECKLIST.md` for detailed guidance

**Found issues?** Open GitHub issue or contact project maintainer

**Need Danish expertise?** Contact institutional partners for local knowledge

---

## 🎯 Current Status

**Progress**: 0/185 (0.0%)
**Priority 1**: 0/58 (0.0%)
**Priority 2**: 0/62 (0.0%)
**Next Step**: Begin Priority 1 review

Run `python ../scripts/check_validation_progress.py` to update status.

---

**Last Updated**: 2025-11-19
**Dataset**: `denmark_complete_enriched.json` (2,348 institutions)
**Wikidata Links**: 769 total (584 exact + 185 fuzzy)