
# Quick Reference: Wikidata Validation Workflow

**Status:** 73/185 automatically validated | 75 need manual review
**Time saved:** 5.2 hours (67.6%) | **Remaining:** 2.5 hours


## 🚀 Quick Start (Manual Review)

**Option A: Focused Review (recommended)**

```bash
# Open this file in spreadsheet software
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
```

**Contains:** only the 75 ambiguous rows requiring your judgment
**Guide:** docs/PREFILLED_VALIDATION_GUIDE.md

**Option B: Full Review**

```bash
# Open the full dataset (all 185 rows)
open data/review/denmark_wikidata_fuzzy_matches_prefilled.csv
```

**Contains:** all 185 rows (73 already validated); filter for rows with an empty `validation_status`
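If you prefer the command line to a spreadsheet filter, the still-unvalidated rows can be pulled out with a short script. This is a sketch using the stdlib `csv` module; the `validation_status` column name comes from this workflow, and the function name is illustrative:

```python
import csv

def rows_needing_review(path):
    """Return the rows whose validation_status is still empty."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if not row.get("validation_status", "").strip()]

# pending = rows_needing_review("data/review/denmark_wikidata_fuzzy_matches_prefilled.csv")
# print(len(pending), "rows still need a decision")
```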


## ✍️ Fill These Columns

For each row, add:

| Column | Values | Example |
|---|---|---|
| `validation_status` | CORRECT / INCORRECT / UNCERTAIN | INCORRECT |
| `validation_notes` | Why? What evidence? | "City mismatch: Viborg vs Aalborg, checked Q21107842" |

## 🎯 Decision Guide

**Mark CORRECT when:**

- Branch library → main library match
- Name variation (same institution)
- ISIL codes match

**Mark INCORRECT when:**

- Different cities (already auto-marked: 71 cases)
- Different types (library vs museum)
- School library ≠ public library
- Very different names (no relationship)

**Mark UNCERTAIN when:**

- ⁉️ Possible merger (needs expert input)
- ⁉️ Missing info (needs research)
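The rules above can be sketched as a first-pass triage function. This is purely illustrative: the field names `city`, `institution_type`, and `isil` are assumptions, and anything the rules don't cover falls through to UNCERTAIN for a human to decide:

```python
def triage(local, wikidata):
    """First-pass decision for one fuzzy match, mirroring the guide above.

    `local` and `wikidata` are dicts with assumed keys:
    city, institution_type, isil.
    """
    if local.get("isil") and local.get("isil") == wikidata.get("isil"):
        return "CORRECT"    # ISIL codes match
    if local.get("city") and wikidata.get("city") and local["city"] != wikidata["city"]:
        return "INCORRECT"  # different cities
    if local.get("institution_type") != wikidata.get("institution_type"):
        return "INCORRECT"  # e.g. library vs museum, school vs public
    return "UNCERTAIN"      # possible merger / missing info: human call
```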

## After Review

```bash
# Step 1: Apply your validation decisions
python scripts/apply_wikidata_validation.py

# Step 2: Check progress and results
python scripts/check_validation_progress.py
```

## 📂 Files at a Glance

```text
data/review/
├── denmark_wikidata_fuzzy_matches_needs_review.csv  ← Review this (75 rows)
├── denmark_wikidata_fuzzy_matches_prefilled.csv     ← Or this (185 rows)
└── README.md                                        ← Quick reference

docs/
├── PREFILLED_VALIDATION_GUIDE.md    ← Complete workflow (read first!)
├── AUTOMATED_SPOT_CHECK_RESULTS.md  ← How automation works
└── WIKIDATA_VALIDATION_CHECKLIST.md ← Original detailed guide

scripts/
├── prefill_obvious_errors.py        ← Already run ✅
├── apply_wikidata_validation.py     ← Run after review
└── check_validation_progress.py     ← Check results
```

## 🔍 Common Patterns in Needs Review

| Pattern | Count | Typical Decision | Time |
|---|---|---|---|
| Name similarity issues | 11 | 60% INCORRECT | 22 min |
| Gymnasium libraries | 7 | 90% INCORRECT | 14 min |
| Branch suffix | 10 | 70% CORRECT | 20 min |
| Low confidence | 8 | 50/50 | 16 min |
| Priority 1-2 check | 19 | 95% CORRECT | 38 min |
| **Total** | **75** | - | **150 min** |

## 💡 Pro Tips

1. **Start with Priority 1** - most important matches first
2. **Click Wikidata URLs** - verify addresses, dates, locations
3. **Use batch validation** - same error pattern → same decision
4. **Document well** - future you will thank you
5. **When uncertain** - mark UNCERTAIN, don't guess
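Batch validation (tip 3) can also be scripted once you spot a repeating pattern. A sketch: `validation_status` and `validation_notes` come from this workflow, while `wikidata_label` is an assumed name for the matched-name column:

```python
def batch_mark(rows, pattern, status, note):
    """Apply one decision to every unvalidated row whose matched name
    contains `pattern` (case-insensitive). Returns the number of rows
    changed; rows that already have a decision are never overwritten.
    """
    changed = 0
    for row in rows:
        if row.get("validation_status", "").strip():
            continue  # never overwrite an existing decision
        if pattern.lower() in row.get("wikidata_label", "").lower():
            row["validation_status"] = status
            row["validation_notes"] = note
            changed += 1
    return changed

# Example: gymnasium libraries are ~90% INCORRECT per the table above
# batch_mark(rows, "gymnasium", "INCORRECT", "School library, not public library")
```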

## 🆘 Need Help?

- **Workflow unclear?** → Read docs/PREFILLED_VALIDATION_GUIDE.md
- **Decision uncertain?** → Check the decision guide (page 15)
- **Found an automation error?** → Override it: change the status and add a note
- **Need examples?** → See guide pages 20-25

## 📊 Progress Tracker

```text
┌─────────────────────────────────────┐
│ Wikidata Validation Progress        │
├─────────────────────────────────────┤
│ ✅ Automated:       73/185 (39.5%)  │
│ 📝 Needs review:    75/185 (40.5%)  │
│ ⏸️ Lower priority:  37/185 (20.0%)  │
├─────────────────────────────────────┤
│ Time saved:   5.2h / 7.7h (67.6%)   │
│ Remaining:    2.5h (manual review)  │
└─────────────────────────────────────┘
```

## Checklist

Before running the apply script:

- [ ] Opened needs_review.csv or prefilled.csv
- [ ] Reviewed all 75 rows (or filtered for empty validation_status)
- [ ] Filled validation_status for each row
- [ ] Filled validation_notes with reasoning
- [ ] Checked for blank validation cells
- [ ] Saved the CSV file
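The last checks are easy to automate: before applying, scan for blanks and typos in `validation_status`. A sketch using the stdlib `csv` module; the three allowed values come from this guide, everything else is illustrative:

```python
import csv

ALLOWED = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def check_csv(path):
    """Return (line_number, value) pairs where validation_status is blank
    or not one of the three allowed values. Line 1 is the header."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            value = row.get("validation_status", "").strip()
            if value not in ALLOWED:
                problems.append((line_no, value))
    return problems

# problems = check_csv("data/review/denmark_wikidata_fuzzy_matches_needs_review.csv")
# An empty list means the file is ready for the apply script.
```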

**Ready to apply:**

```bash
python scripts/apply_wikidata_validation.py
```

**Last Updated:** November 19, 2025
**Next:** Manual review → Apply validation → Verify results
**Goal:** ~95%+ accurate Wikidata links (769 → ~680-700 high-quality links)