# Wikidata Fuzzy Match Review Package
This directory contains files for manual validation of fuzzy Wikidata matches in the Danish GLAM dataset.
## 📋 Package Contents
### Review Data
- **`denmark_wikidata_fuzzy_matches.csv`** - Main review file (185 fuzzy matches)
  - **ACTION REQUIRED**: Fill in `validation_status` and `validation_notes` columns
  - Prioritized by match score (Priority 1 = most uncertain)
### Documentation
See `/docs/` directory:
- **`WIKIDATA_VALIDATION_CHECKLIST.md`** - Step-by-step validation guide
- **`WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md`** - Complete overview and FAQ
### Scripts
See `/scripts/` directory:
- **`generate_wikidata_review_report.py`** - ✅ Already run (generated this CSV)
- **`check_validation_progress.py`** - Check review progress anytime
- **`apply_wikidata_validation.py`** - Apply completed review to dataset
---
## 🚀 Quick Start
### 1. Open CSV for Review
```bash
# Option A: Excel (macOS: `open` launches the default app for .csv files)
open denmark_wikidata_fuzzy_matches.csv
# Option B: Google Sheets
# Upload denmark_wikidata_fuzzy_matches.csv
# Option C: LibreOffice
libreoffice --calc denmark_wikidata_fuzzy_matches.csv
```
### 2. Review Process (For Each Row)
1. **Compare names**: `institution_name` vs `wikidata_label`
2. **Click URL**: Visit `wikidata_url` to verify match
3. **Check metadata**: City, institution type, ISIL code
4. **Decide**: Fill `validation_status` with:
   - `CORRECT` - Keep Wikidata link
   - `INCORRECT` - Remove Wikidata link
   - `UNCERTAIN` - Flag for expert review
5. **Document**: Add explanation in `validation_notes` (optional but recommended)
### 3. Check Progress
```bash
python ../scripts/check_validation_progress.py
```
Output shows:
- Overall completion percentage
- Progress by priority level
- Estimated time remaining
- Quality indicators
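The per-priority progress figures could be derived from the CSV roughly as follows. This is a minimal sketch, not the actual `check_validation_progress.py`; the `priority` column name is an assumption (only `validation_status` and `validation_notes` are documented here), and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def progress_by_priority(rows):
    """Return {priority: (reviewed, total)} for an iterable of CSV row dicts."""
    totals = defaultdict(int)
    reviewed = defaultdict(int)
    for row in rows:
        priority = row["priority"]  # assumed column name
        totals[priority] += 1
        # Only exact, all-caps status values count as reviewed.
        if row["validation_status"].strip() in VALID_STATUSES:
            reviewed[priority] += 1
    return {p: (reviewed[p], totals[p]) for p in sorted(totals)}

# Tiny in-memory sample standing in for denmark_wikidata_fuzzy_matches.csv
sample = io.StringIO(
    "priority,institution_name,validation_status\n"
    "1,Example A,CORRECT\n"
    "1,Example B,\n"
    "2,Example C,correct\n"
)
result = progress_by_priority(csv.DictReader(sample))
for priority, (done, total) in result.items():
    print(f"Priority {priority}: {done}/{total} ({done / total:.1%})")
```

Note that the lowercase `correct` in the sample is not counted, which is the same behavior described under Troubleshooting.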
### 4. Apply Validation Results
```bash
# After completing review
python ../scripts/apply_wikidata_validation.py
```
Output:
- `denmark_complete_validated.json` (updated dataset)
- Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN
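Conceptually, applying the review means acting on each decision as described in the review process: `CORRECT` keeps the link, `INCORRECT` removes it, `UNCERTAIN` flags the record. The sketch below illustrates that logic only; it is not the code of `apply_wikidata_validation.py`, and the record keys `id`, `wikidata_id`, and `needs_expert_review` are hypothetical because the JSON schema is not described in this README.

```python
from collections import Counter

def apply_validation(institutions, decisions):
    """Apply review decisions to institution records (in place) and tally them.

    institutions: list of dicts with hypothetical keys "id" and "wikidata_id"
    decisions:    dict mapping institution id -> validation_status
    """
    stats = Counter()
    for inst in institutions:
        status = decisions.get(inst["id"])
        if status is None:
            continue  # record was not part of the fuzzy-match review
        stats[status] += 1
        if status == "INCORRECT":
            inst["wikidata_id"] = None          # remove the Wikidata link
        elif status == "UNCERTAIN":
            inst["needs_expert_review"] = True  # hypothetical flag
        # CORRECT: keep the link unchanged
    return stats

institutions = [
    {"id": "a", "wikidata_id": "Q1"},
    {"id": "b", "wikidata_id": "Q2"},
    {"id": "c", "wikidata_id": "Q3"},
]
decisions = {"a": "CORRECT", "b": "INCORRECT", "c": "UNCERTAIN"}
stats = apply_validation(institutions, decisions)
print(f"{stats['CORRECT']} CORRECT, {stats['INCORRECT']} INCORRECT, "
      f"{stats['UNCERTAIN']} UNCERTAIN")
```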
---
## 📊 Review Statistics
**Generated**: 2025-11-19
**Total Fuzzy Matches**: 185
**Match Score Range**: 85-99%
### Priority Breakdown
| Priority | Score Range | Count | Status |
|----------|-------------|-------|--------|
| 1 | 85-87% | 58 | ⬜ Not Started |
| 2 | 87-90% | 62 | ⬜ Not Started |
| 3 | 90-93% | 44 | ⬜ Not Started |
| 4 | 93-96% | 14 | ⬜ Not Started |
| 5 | 96-99% | 7 | ⬜ Not Started |

**Recommended Focus**: Priority 1-2 (120 matches = 64.9%)
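The score bands in the table share their boundary values (87, 90, 93, 96), so a score of exactly 87% could belong to either Priority 1 or 2. The sketch below shows one way to map a score to a priority, assuming each band includes its lower bound and excludes its upper bound; that boundary rule is an assumption, not something the report generator is known to do. It also reproduces the 64.9% figure from the counts in the table.

```python
import bisect

# Lower bounds of priorities 2..5; scores below 87 fall in priority 1.
BOUNDARIES = [87, 90, 93, 96]

def priority_for_score(score):
    """Map a match score (percent) to a review priority 1-5.

    Assumes half-open bands [85, 87), [87, 90), ... with 96-99 as the last.
    """
    if not 85 <= score <= 99:
        raise ValueError(f"score {score} is outside the fuzzy-match range 85-99")
    return bisect.bisect_right(BOUNDARIES, score) + 1

# Counts copied from the Priority Breakdown table.
COUNTS = {1: 58, 2: 62, 3: 44, 4: 14, 5: 7}
focus_share = (COUNTS[1] + COUNTS[2]) / sum(COUNTS.values())
print(f"Priority 1-2 share: {focus_share:.1%}")
```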
### Institution Types
- **LIBRARY**: 152 (82.2%)
- **ARCHIVE**: 33 (17.8%)
---
## ⏱️ Time Estimates
- **Priority 1-2** (120 matches): 4-6 hours (~2-3 min per match)
- **Priority 3-5** (65 matches): 1-2 hours (~1-2 min per match)
- **Total**: 5-8 hours
**Tips for Faster Review**:
- Sort by priority (start with 1)
- Filter by institution type (review all libraries, then archives)
- Use keyboard shortcuts in spreadsheet software
- Mark obvious matches first (ISIL match = instant CORRECT)
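The estimates above follow directly from the match counts and the per-match review times. A short worked calculation (the helper name `review_hours` is illustrative only):

```python
def review_hours(matches, min_per_match, max_per_match):
    """Return (low, high) review time in hours for a batch of matches."""
    return matches * min_per_match / 60, matches * max_per_match / 60

high_uncertainty = review_hours(120, 2, 3)  # Priority 1-2
low_uncertainty = review_hours(65, 1, 2)    # Priority 3-5
total = (
    high_uncertainty[0] + low_uncertainty[0],
    high_uncertainty[1] + low_uncertainty[1],
)
print(f"Priority 1-2: {high_uncertainty[0]:.1f}-{high_uncertainty[1]:.1f} h")
print(f"Priority 3-5: {low_uncertainty[0]:.1f}-{low_uncertainty[1]:.1f} h")
print(f"Total:        {total[0]:.1f}-{total[1]:.1f} h")
```

The unrounded total is slightly above the stated 5-8 hours because the Priority 3-5 range (65 to 130 minutes) is rounded down to 1-2 hours in the list.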
---
## ✅ Validation Checklist
Before running `apply_wikidata_validation.py`:
- [ ] All Priority 1 rows have `validation_status` filled
- [ ] All Priority 2 rows have `validation_status` filled
- [ ] At least 50% of rows have `validation_notes`
- [ ] Status values are exact: `CORRECT`, `INCORRECT`, or `UNCERTAIN`
- [ ] No accidentally deleted data in other columns
- [ ] CSV saved in original location
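Most of this checklist can be verified mechanically before running the apply script. The following is a hedged sketch of such a pre-flight check; it assumes a `priority` column holding the values `1` to `5` (the column name is not confirmed by this README), and the reported line numbers assume no field contains an embedded newline.

```python
import csv
import io

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def preflight(rows, required_priorities=("1", "2"), min_notes_share=0.5):
    """Return a list of human-readable problems; an empty list means ready."""
    rows = list(rows)
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        status = row["validation_status"].strip()
        if status and status not in VALID_STATUSES:
            problems.append(f"line {line_no}: invalid status {status!r}")
        if not status and row["priority"] in required_priorities:
            problems.append(
                f"line {line_no}: priority {row['priority']} row not reviewed"
            )
    with_notes = sum(1 for row in rows if row["validation_notes"].strip())
    if rows and with_notes / len(rows) < min_notes_share:
        problems.append(f"only {with_notes}/{len(rows)} rows have validation_notes")
    return problems

# Made-up sample rows standing in for the real CSV
sample = io.StringIO(
    "priority,validation_status,validation_notes\n"
    "1,CORRECT,ISIL matches\n"
    "1,,\n"
    "3,Correct,\n"
)
problems = preflight(csv.DictReader(sample))
for problem in problems:
    print(problem)
```

The check does not cover the last two checklist items (accidental edits to other columns, file location); those still need a manual look or a diff against the original file.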
---
## 📚 Key Resources
### Danish Registries (For Research)
- **ISIL Registry**: https://isil.dk
- **Library Portal**: https://bibliotek.dk
- **National Archives**: https://www.sa.dk
- **Cultural Agency**: https://slks.dk
### Wikidata Tools
- **Query Service**: https://query.wikidata.org
- **Property Browser**: https://www.wikidata.org/wiki/Special:ListProperties
- **Entity Search**: https://www.wikidata.org/wiki/Special:Search
### Validation Key Properties
Check these on Wikidata:
- **P31** (instance of) - Should match institution_type
- **P17** (country) - Should be Q35 (Denmark)
- **P791** (ISIL) - Cross-check with isil_code column
- **P131** (located in) - Should match city
- **P625** (coordinates) - Verify location on map
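To check all five properties for one entity in a single step, a SPARQL query can be pasted into the Query Service listed above. The helper below only builds the query text; it is a sketch, the function name is hypothetical, and `Q35` (Denmark) is used purely as a known-valid item ID for demonstration.

```python
def build_validation_query(qid):
    """Build a SPARQL query fetching the key validation properties for one item."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    # Each property is OPTIONAL so that a missing statement does not
    # suppress the rest of the result.
    return f"""
SELECT ?instanceOfLabel ?countryLabel ?isil ?locatedInLabel ?coords WHERE {{
  OPTIONAL {{ wd:{qid} wdt:P31 ?instanceOf. }}
  OPTIONAL {{ wd:{qid} wdt:P17 ?country. }}
  OPTIONAL {{ wd:{qid} wdt:P791 ?isil. }}
  OPTIONAL {{ wd:{qid} wdt:P131 ?locatedIn. }}
  OPTIONAL {{ wd:{qid} wdt:P625 ?coords. }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "da,en". }}
}}
""".strip()

query = build_validation_query("Q35")
print(query)
```

Requesting labels in `da,en` returns the Danish label first and falls back to English, which helps with the Danish vs English name comparison.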
---
## 🔍 Common Validation Scenarios
### Scenario 1: Danish vs English Names
**Example**: "Rigsarkivet" vs "Danish National Archives"
**Decision**: CORRECT (same entity, different language)
**Check**: Wikidata should have Danish label as alias
### Scenario 2: Branch vs Main Library
**Example**: "Campus Vejle, Biblioteket" vs "Vejle Bibliotek"
**Decision**: INCORRECT (branch matched to main)
**Check**: Wikidata P361 (part of) property
### Scenario 3: Type Mismatch
**Example**: Our LIBRARY vs Wikidata MUSEUM
**Decision**: INCORRECT (wrong entity type)
**Check**: Wikidata P31 (instance of) property
### Scenario 4: Historical Merger
**Example**: "Statsbiblioteket" vs "Royal Danish Library"
**Decision**: UNCERTAIN (check merger date)
**Check**: Wikidata P1366 (replaced by), P576 (dissolved, abolished or demolished date), or P582 (end time)
### Scenario 5: ISIL Match
**Example**: Both have DK-820010
**Decision**: CORRECT (authoritative identifier match)
**Check**: No further validation needed
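Scenario 5 is the easiest one to automate. A minimal sketch of the comparison, assuming that surrounding whitespace and letter case should be ignored (that normalization is an assumption about how the codes were entered, not a documented rule of this dataset):

```python
def normalize_isil(code):
    """Strip whitespace and upper-case an ISIL code for comparison."""
    return code.strip().upper()

def isil_matches(ours, theirs):
    """True when both ISIL codes are present and equal after normalization."""
    if not ours or not theirs:
        return False  # a missing code on either side proves nothing
    return normalize_isil(ours) == normalize_isil(theirs)

print(isil_matches("DK-820010", " dk-820010 "))
print(isil_matches("DK-820010", ""))
```

A missing ISIL on either side returns `False`, which means "no shortcut available", not "mismatch"; the row still needs the normal manual review.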
---
## 📈 Expected Outcomes
Based on match score distribution:
| Status | Expected % | Expected Count |
|--------|------------|----------------|
| CORRECT | 85-90% | 157-167 |
| INCORRECT | 5-10% | 9-19 |
| UNCERTAIN | 5% | 9 |

**Quality Thresholds**:
- ✅ Acceptable: ≥80% CORRECT, ≤15% INCORRECT
- 🌟 High Quality: ≥90% CORRECT, ≤5% INCORRECT
- 🚨 Red Flag: <70% CORRECT, >20% INCORRECT
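The thresholds can be turned into a small classifier once the review is complete. The list above does not say whether the two conditions in each tier combine with "and" or "or", and it leaves a gap between the tiers, so this sketch makes two assumptions: the positive tiers require both conditions, the red flag triggers on either, and anything in between is reported as `borderline` (a label introduced here, not used elsewhere in the project).

```python
def quality_rating(correct, incorrect, total):
    """Classify review outcomes against the quality thresholds."""
    correct_pct = 100 * correct / total
    incorrect_pct = 100 * incorrect / total
    if correct_pct >= 90 and incorrect_pct <= 5:
        return "high quality"
    if correct_pct >= 80 and incorrect_pct <= 15:
        return "acceptable"
    if correct_pct < 70 or incorrect_pct > 20:
        return "red flag"
    return "borderline"  # falls between the documented tiers

print(quality_rating(167, 9, 185))   # upper end of the expected range
print(quality_rating(157, 19, 185))  # lower end of the expected range
```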
---
## 🛠️ Troubleshooting
### Issue: CSV won't open with proper encoding
**Solution**: Ensure UTF-8 encoding when opening
```bash
# Excel: Data → From Text/CSV → File Origin: 65001: Unicode (UTF-8)
# Google Sheets: File → Import → Character encoding: UTF-8
# LibreOffice: Open → Character set: Unicode (UTF-8)
```
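When reading the file from Python instead of a spreadsheet, the `utf-8-sig` codec handles both plain UTF-8 and files saved with a byte order mark (which Excel often adds). The sketch below simulates such a file in memory; the institution name is a made-up sample value, and whether the real CSV carries a BOM is not known.

```python
import csv
import io

# Simulated file contents: UTF-8 with a leading byte order mark.
raw = "\ufeffinstitution_name,validation_status\nÆøå Eksempelbibliotek,\n".encode("utf-8")

# For a real file, the equivalent call would be:
#   open("denmark_wikidata_fuzzy_matches.csv", encoding="utf-8-sig", newline="")
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
reader = csv.DictReader(stream)
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["institution_name"])
```

With plain `utf-8`, the first column name would come back as `\ufeffinstitution_name`, and lookups by `institution_name` would fail.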
### Issue: Progress checker shows 0% but I've reviewed some
**Solution**: Ensure `validation_status` values are EXACT (all caps):
- `CORRECT` is recognized
- `Correct`, `correct`, `OK` are not recognized
### Issue: Apply script fails with "No validation results"
**Solution**: Check that `validation_status` column has values (not blank)
### Issue: Can't decide CORRECT vs INCORRECT
**Solution**: Mark as `UNCERTAIN` and add detailed notes. Flag for expert review.
---
## 📧 Support
**Questions?** See `/docs/WIKIDATA_VALIDATION_CHECKLIST.md` for detailed guidance
**Found issues?** Open GitHub issue or contact project maintainer
**Need Danish expertise?** Contact institutional partners for local knowledge
---
## 🎯 Current Status
**Progress**: 0/185 (0.0%)
**Priority 1**: 0/58 (0.0%)
**Priority 2**: 0/62 (0.0%)
**Next Step**: Begin Priority 1 review
Run `python ../scripts/check_validation_progress.py` to update status.
---
**Last Updated**: 2025-11-19
**Dataset**: `denmark_complete_enriched.json` (2,348 institutions)
**Wikidata Links**: 769 total (584 exact + 185 fuzzy)