glam/QUICK_REFERENCE_VALIDATION.md
2025-11-21 22:12:33 +01:00

# Quick Reference: Wikidata Validation Workflow
**Status**: 73/185 automatically validated | 75 need manual review
**Time saved**: 5.2 hours (67.6%) | **Remaining**: 2.5 hours
---
## 🚀 Quick Start (Manual Review)
### Option A: Streamlined Review (Recommended)
```bash
# Open this file in spreadsheet software (open is the macOS command; on Linux use xdg-open)
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
```
**Contains**: Only 75 ambiguous rows requiring your judgment
**Guide**: docs/PREFILLED_VALIDATION_GUIDE.md
### Option B: Full Review
```bash
# Open full dataset (all 185 rows)
open data/review/denmark_wikidata_fuzzy_matches_prefilled.csv
```
**Contains**: All 185 rows, 73 of them already validated. Filter for an empty `validation_status` to find the rows that still need a decision.
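
If you prefer to locate the pending rows outside a spreadsheet, the filter can be sketched in a few lines of Python. This is a minimal sketch, not part of the project's scripts: it assumes only the `validation_status` column named above, and the `name` column in the sample data is hypothetical.

```python
import csv
import io

def pending_rows(csv_file):
    """Yield rows whose validation_status is missing or blank."""
    for row in csv.DictReader(csv_file):
        if not (row.get("validation_status") or "").strip():
            yield row

# Tiny in-memory sample; the `name` column is hypothetical.
sample = io.StringIO(
    "name,validation_status,validation_notes\n"
    "Library A,INCORRECT,City mismatch\n"
    "Library B,,\n"
    "Library C,  ,\n"
)
todo = list(pending_rows(sample))
print([row["name"] for row in todo])  # ['Library B', 'Library C']
```

To run it on the real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption about how the CSV was exported.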
---
## ✍️ Fill These Columns
For each row, add:
| Column | Values | Example |
|--------|--------|---------|
| `validation_status` | CORRECT / INCORRECT / UNCERTAIN | INCORRECT |
| `validation_notes` | Why? Evidence? | "City mismatch: Viborg vs Aalborg, checked Q21107842" |
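
Notes such as the example above contain commas, so they must be quoted in the CSV. Spreadsheet software does this on save; if you edit the file by hand or by script, the sketch below shows how Python's standard `csv` module handles it. The two-column row is illustrative only; the real file has more columns.

```python
import csv
import io

row = ["INCORRECT", "City mismatch: Viborg vs Aalborg, checked Q21107842"]

# Write the row; the default quoting wraps any field containing a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line.strip())
# INCORRECT,"City mismatch: Viborg vs Aalborg, checked Q21107842"

# Reading it back recovers the note as a single field.
parsed = next(csv.reader(io.StringIO(line)))
```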
---
## 🎯 Decision Guide
### Mark CORRECT when:
- ✅ Branch library → Main library match
- ✅ Name variation (same institution)
- ✅ ISIL codes match
### Mark INCORRECT when:
- ❌ Different cities (already auto-marked: 71 cases)
- ❌ Different types (library vs museum)
- ❌ School library ≠ public library
- ❌ Very different names (no relationship)
### Mark UNCERTAIN when:
- ⁉️ Possible merger (need expert)
- ⁉️ Missing info (need research)
---
## ⚡ After Review
```bash
# Step 1: Apply your validation decisions
python scripts/apply_wikidata_validation.py
# Step 2: Check progress and results
python scripts/check_validation_progress.py
```
---
## 📂 Files at a Glance
```
data/review/
├── denmark_wikidata_fuzzy_matches_needs_review.csv ← Review this (75 rows)
├── denmark_wikidata_fuzzy_matches_prefilled.csv ← Or this (185 rows)
└── README.md ← Quick reference
docs/
├── PREFILLED_VALIDATION_GUIDE.md ← Complete workflow (read first!)
├── AUTOMATED_SPOT_CHECK_RESULTS.md ← How automation works
└── WIKIDATA_VALIDATION_CHECKLIST.md ← Original detailed guide
scripts/
├── prefill_obvious_errors.py ← Already run ✅
├── apply_wikidata_validation.py ← Run after review
└── check_validation_progress.py ← Check results
```
---
## 🔍 Common Patterns in Needs Review
| Pattern | Count | Typical Decision | Time |
|---------|-------|------------------|------|
| Name similarity issues | 11 | 60% INCORRECT | 22 min |
| Gymnasium libraries | 7 | 90% INCORRECT | 14 min |
| Branch suffix | 10 | 70% CORRECT | 20 min |
| Low confidence | 8 | 50/50 | 16 min |
| Priority 1-2 check | 19 | 95% CORRECT | 38 min |
| Other (not listed above) | 20 | - | 40 min |
| **Total** | **75** | - | **150 min** |
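
The time column works out to about 2 minutes per row (for example, 22 min for 11 rows). A minimal sketch of that arithmetic, with the per-row rate treated as an assumption read off the table rather than a measured value:

```python
MINUTES_PER_ROW = 2  # assumption read off the table, e.g. 22 min / 11 rows

def estimate_minutes(row_count, minutes_per_row=MINUTES_PER_ROW):
    """Estimated review time in minutes for a number of rows."""
    return row_count * minutes_per_row

remaining = estimate_minutes(75)
print(f"{remaining} min = {remaining / 60:.1f} h")  # 150 min = 2.5 h
```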
---
## 💡 Pro Tips
1. **Start with Priority 1** - Most important matches first
2. **Click Wikidata URLs** - Verify addresses, dates, locations
3. **Use batch validation** - Same error pattern → Same decision
4. **Document well** - Future you will thank you
5. **When uncertain** - Mark UNCERTAIN, don't guess
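
Tip 3 can be sketched as a small helper that applies one decision to every still-blank row matching a pattern. This is an illustration only: `apply_batch` and the `institution_name` column are hypothetical, and the sample rows are made up.

```python
def apply_batch(rows, matches, status, note):
    """Fill status and note on still-blank rows where matches(row) is true.

    Returns the number of rows changed.
    """
    changed = 0
    for row in rows:
        if (row.get("validation_status") or "").strip():
            continue  # leave existing decisions untouched
        if matches(row):
            row["validation_status"] = status
            row["validation_notes"] = note
            changed += 1
    return changed

# Made-up rows; `institution_name` is a hypothetical column.
rows = [
    {"institution_name": "Example Gymnasium Bibliotek", "validation_status": "", "validation_notes": ""},
    {"institution_name": "Example Bibliotek", "validation_status": "", "validation_notes": ""},
    {"institution_name": "Another Gymnasium", "validation_status": "CORRECT", "validation_notes": "verified by hand"},
]
changed = apply_batch(
    rows,
    lambda r: "gymnasium" in r["institution_name"].lower(),
    "INCORRECT",
    "School library matched to public library",
)
print(changed)  # 1
```

Because the typical decisions listed above are percentages rather than certainties, spot-check a batch before saving it.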
---
## 🆘 Need Help?
- **Workflow unclear?** → Read `docs/PREFILLED_VALIDATION_GUIDE.md`
- **Decision uncertain?** → Check the decision guide in `docs/PREFILLED_VALIDATION_GUIDE.md` (page 15)
- **Found automation error?** → Override it! Change status, add note
- **Need examples?** → See `docs/PREFILLED_VALIDATION_GUIDE.md`, pages 20-25
---
## 📊 Progress Tracker
```
┌─────────────────────────────────────┐
│ Wikidata Validation Progress │
├─────────────────────────────────────┤
│ ✅ Automated: 73/185 (39.5%) │
│ 📝 Needs review: 75/185 (40.5%) │
│ ⏸️ Lower priority: 37/185 (20.0%) │
├─────────────────────────────────────┤
│ Time saved: 5.2h / 7.7h (67.6%) │
│ Remaining: 2.5h (manual review) │
└─────────────────────────────────────┘
```
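
The shares in the tracker follow directly from the three row counts. A short sketch that recomputes them, useful if the counts change after a review pass (the dictionary labels simply mirror the tracker):

```python
TOTAL = 185
buckets = {"Automated": 73, "Needs review": 75, "Lower priority": 37}

# The three buckets should account for every row.
assert sum(buckets.values()) == TOTAL

shares = {name: round(100 * count / TOTAL, 1) for name, count in buckets.items()}
for name, pct in shares.items():
    print(f"{name}: {buckets[name]}/{TOTAL} ({pct:.1f}%)")
```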
---
## ✅ Checklist
Before running apply script:
- [ ] Opened `denmark_wikidata_fuzzy_matches_needs_review.csv` or `denmark_wikidata_fuzzy_matches_prefilled.csv`
- [ ] Reviewed all 75 rows (or, in the full file, every row with an empty `validation_status`)
- [ ] Filled `validation_status` for each row
- [ ] Filled `validation_notes` with reasoning
- [ ] Checked for any blank validation cells
- [ ] Saved CSV file
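
The blank-cell and allowed-value checks can also be run mechanically before applying. The sketch below is an assumption-laden helper, not one of the project's scripts: `find_problems` is a hypothetical name, and treating lowercase statuses as errors is a guess, since this reference does not say whether the apply script is case-sensitive.

```python
import csv
import io

ALLOWED = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def find_problems(csv_file):
    """Return (row_number, message) pairs for rows that are not ready to apply."""
    problems = []
    reader = csv.DictReader(csv_file)
    # start=2 because row 1 of the file is the header
    for row_no, row in enumerate(reader, start=2):
        status = (row.get("validation_status") or "").strip()
        notes = (row.get("validation_notes") or "").strip()
        if not status:
            problems.append((row_no, "validation_status is blank"))
        elif status not in ALLOWED:
            problems.append((row_no, f"unexpected status {status!r}"))
        if not notes:
            problems.append((row_no, "validation_notes is blank"))
    return problems

# In-memory sample with one clean row and two problem rows.
sample = io.StringIO(
    "validation_status,validation_notes\n"
    "CORRECT,Name variation of the same institution\n"
    "correct,ISIL codes match\n"
    ",\n"
)
problems = find_problems(sample)
for row_no, message in problems:
    print(f"row {row_no}: {message}")
```

An empty result means every row has an allowed status and a note.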
Ready to apply:
```bash
python scripts/apply_wikidata_validation.py
```
---
**Last Updated**: November 19, 2025
**Next**: Manual review → Apply validation → Verify results
**Goal**: 95%+ accurate Wikidata links (769 → ~680-700 high-quality links)