Pre-filled Validation Guide: Denmark Wikidata Fuzzy Matches
Status: ✅ 73 obvious errors automatically marked INCORRECT (Nov 19, 2025)
Remaining: 75 matches require manual judgment
Summary of Automated Pre-fill
What Was Done
An automated script (scripts/prefill_obvious_errors.py) analyzed all 185 fuzzy Wikidata matches and:
- Identified 73 obvious errors based on clear criteria
- Automatically marked them as INCORRECT in validation_status
- Added explanatory notes documenting why each was flagged
- Generated a streamlined review file containing only the remaining 75 ambiguous cases
Automated Detection Rules
The script marked matches as INCORRECT when they had:
Rule 1: City Mismatch (71 matches)
- Pattern: 🚨 City mismatch: our 'X' but Wikidata mentions 'Y'
- Logic: Different cities = different institutions
- Confidence: Very high (>99% accuracy)
- Examples:
- Our: "Fur Lokalhistoriske Arkiv" (Skive) → Wikidata: "Randers Lokalhistoriske Arkiv" ❌
- Our: "Gladsaxe Bibliotekerne" (Søborg) → Wikidata: "Gentofte Bibliotekerne" ❌
Rule 2: Type Mismatch (1 match)
- Pattern: ⚠️ Type mismatch: we're LIBRARY but Wikidata mentions museum/gallery
- Logic: Fundamentally different institution types
- Example: Our LIBRARY matched to Wikidata museum entry
Rule 3: Very Low Name Similarity (1 match)
- Pattern: Low name similarity (<30%)
- Logic: Names too different to be same institution
- Example: "Lunds stadsbibliotek" vs "Billund Bibliotek" (29.6% similarity)
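The three rules above can be sketched roughly as follows. This is not the code in scripts/prefill_obvious_errors.py; it is a minimal illustration under stated assumptions. The `KNOWN_CITIES` set, the `CONTRADICTING_WORDS` map, the function names, and the use of `difflib.SequenceMatcher` as the similarity metric are all hypothetical, so similarity scores from this sketch will not necessarily equal the percentages quoted in this guide.

```python
import re
from difflib import SequenceMatcher

# Hypothetical inputs: the real script's city list and type keywords are not documented here.
KNOWN_CITIES = {"skive", "randers", "søborg", "gentofte"}
CONTRADICTING_WORDS = {"LIBRARY": {"museum", "gallery"}}


def words(text):
    """Lower-cased word set, so 'Lund' does not match inside 'Billund'."""
    return set(re.findall(r"\w+", text.lower()))


def obvious_error(our_name, our_city, our_type, wikidata_text):
    """Return an [AUTO] note if one of the three rules fires, else None."""
    wd_words = words(wikidata_text)
    city = our_city.lower()

    # Rule 1: Wikidata mentions a known city other than ours, and never ours.
    other_cities = (KNOWN_CITIES & wd_words) - {city}
    if other_cities and city not in wd_words:
        other = sorted(other_cities)[0]
        return f"[AUTO] City mismatch: our '{our_city}' but Wikidata mentions '{other}'"

    # Rule 2: Wikidata text contains a word that contradicts our institution type.
    if CONTRADICTING_WORDS.get(our_type, set()) & wd_words:
        return f"[AUTO] Type mismatch: we're {our_type} but Wikidata mentions museum/gallery"

    # Rule 3: names are less than 30% similar (stand-in metric, see lead-in).
    ratio = SequenceMatcher(None, our_name.lower(), wikidata_text.lower()).ratio()
    if ratio < 0.30:
        return f"[AUTO] Low name similarity ({ratio:.1%})"

    return None


print(obvious_error("Fur Lokalhistoriske Arkiv", "Skive", "ARCHIVE",
                    "Randers Lokalhistoriske Arkiv"))
```

Checking the rules in this order means a row receives only one note, which is consistent with the per-rule counts (71 + 1 + 1 = 73) summing to the total.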
Files Generated
1. Pre-filled Full CSV (All 185 matches)
File: data/review/denmark_wikidata_fuzzy_matches_prefilled.csv
Size: 64.3 KB
Contents: All 185 fuzzy matches with 73 pre-filled as INCORRECT
Use when:
- You want to see everything (validated + remaining)
- You want to verify automated decisions
- You need full context
How to use:
# Columns:
auto_flag → REVIEW_URGENT or OK
spot_check_issues → Detected problems
validation_status → INCORRECT (auto), or empty (needs review)
validation_notes → [AUTO] explanation or manual notes
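To get a quick overview of the pre-filled file before opening it in a spreadsheet, the rows can be tallied by validation_status. The sketch below uses only the column names listed above; the function name, the `NEEDS_REVIEW` label for empty values, and the in-memory sample rows are illustrative assumptions.

```python
import csv
import io
from collections import Counter


def tally_validation_status(rows):
    """Count rows per validation_status; an empty value means the row still needs review."""
    counts = Counter()
    for row in rows:
        status = (row.get("validation_status") or "").strip()
        counts[status or "NEEDS_REVIEW"] += 1
    return counts


# Illustrative in-memory sample. For the real file, pass
# open(path, newline="", encoding="utf-8") to csv.DictReader instead.
sample = io.StringIO(
    "institution_name,auto_flag,validation_status,validation_notes\n"
    "Fur Lokalhistoriske Arkiv,REVIEW_URGENT,INCORRECT,[AUTO] City mismatch detected.\n"
    '"Campus Vejle, Biblioteket",REVIEW_URGENT,,\n'
)
counts = tally_validation_status(csv.DictReader(sample))
print(dict(counts))
```

Reading with `csv.DictReader` rather than splitting on commas matters here, because many institution names (for example "Campus Vejle, Biblioteket") contain commas inside quoted fields.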
2. Streamlined Needs Review CSV (75 matches only)
File: data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
Size: 22.3 KB
Contents: ONLY the 75 matches requiring your judgment
Use when:
- You want to focus on remaining work (recommended!)
- You trust the automated decisions
- You want faster review
What's included:
- 56 flagged matches NOT automatically marked (ambiguous cases)
- 19 "OK" matches with Priority 1-2 (spot check for safety)
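The selection rule for the streamlined file can be expressed as a small predicate. This is a sketch of the rule as described above, not the generating script; the `priority` column name is an assumption, since this guide does not name the column that holds the priority level.

```python
def needs_manual_review(row):
    """True if a row belongs in the streamlined needs-review file."""
    # Rows already auto-marked INCORRECT are excluded.
    if (row.get("validation_status") or "").strip() == "INCORRECT":
        return False
    # Flagged but not auto-marked: ambiguous, needs human judgment.
    if row.get("auto_flag") == "REVIEW_URGENT":
        return True
    # "OK" rows are included only for Priority 1-2 (hypothetical 'priority' column).
    priority = int(row.get("priority") or 99)
    return row.get("auto_flag") == "OK" and priority <= 2


rows = [
    {"auto_flag": "REVIEW_URGENT", "validation_status": "INCORRECT", "priority": "1"},
    {"auto_flag": "REVIEW_URGENT", "validation_status": "", "priority": "3"},
    {"auto_flag": "OK", "validation_status": "", "priority": "2"},
    {"auto_flag": "OK", "validation_status": "", "priority": "4"},
]
selected = [r for r in rows if needs_manual_review(r)]
print(len(selected))
```

The last sample row (an "OK" match at Priority 4) is the kind of row that stays only in the full CSV.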
Time Estimates
Original Estimate (Before Automation)
- Total matches: 185
- Estimated time: 462 minutes (7.7 hours)
- Breakdown: 2.5 min/match average
After Automated Pre-fill
- Pre-filled INCORRECT: 73 matches (no review needed) ✅
- Needs manual review: 75 matches
- Estimated time: 150 minutes (2.5 hours)
- Time saved: 67.6% (312 minutes = 5.2 hours)
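The figures above follow from two per-match rates: 2.5 minutes before automation, and 2 minutes for the remaining matches (150 minutes over 75 matches). A short calculation, with the minute totals shown unrounded, reproduces them:

```python
total_matches = 185
remaining_matches = 75
minutes_per_match_before = 2.5
minutes_per_match_after = 2.0  # implied by 150 min / 75 matches

minutes_before = total_matches * minutes_per_match_before    # 462.5
minutes_after = remaining_matches * minutes_per_match_after  # 150.0
minutes_saved = minutes_before - minutes_after               # 312.5
percent_saved = minutes_saved / minutes_before * 100

print(f"before: {minutes_before} min ({minutes_before / 60:.1f} h)")
print(f"after:  {minutes_after} min ({minutes_after / 60:.1f} h)")
print(f"saved:  {minutes_saved} min ({minutes_saved / 60:.1f} h, {percent_saved:.1f}%)")
```

The 462 and 312 minute figures quoted above are these totals with the half minute dropped; the 67.6% is computed from the unrounded values.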
Breakdown of Remaining 75 Matches
| Category | Count | Est. Time | Description |
|---|---|---|---|
| Name pattern issues | 11 | 22 min | Low similarity, different first words |
| Gymnasium libraries | 7 | 14 min | School library vs public library |
| Branch vs main | 10 | 20 min | Branch suffix mismatch |
| Low confidence | 8 | 16 min | Score <87% without ISIL |
| Priority 1-2 spot check | 19 | 38 min | "OK" matches needing safety check |
| Other ambiguous | 20 | 40 min | Case-by-case judgment |
| Total | 75 | 150 min | (2.5 hours) |
Manual Review Workflow
Step 1: Open Streamlined CSV (Recommended)
# Open in Excel, Google Sheets, or text editor
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
Columns to focus on:
- auto_flag - REVIEW_URGENT = needs judgment
- spot_check_issues - What patterns were detected
- institution_name - Our institution
- wikidata_label - Wikidata entity label
- city - Our city (check consistency)
- wikidata_url - Click to verify on Wikidata
Step 2: Review by Category
A. Name Pattern Issues (11 matches)
Pattern: 🔍 Low name similarity or 🔍 First word differs
Decision guide:
- CORRECT if: Branch vs main library (e.g., "Campus Vejle, Biblioteket" → "Vejle Bibliotek")
- INCORRECT if: Truly different institutions (different names, no branch relationship)
Example:
"Campus Vejle, Biblioteket" → "Vejle Bibliotek"
Decision: CORRECT (campus branch of main library)
Notes: "Campus library is branch of main Vejle public library"
B. Gymnasium Libraries (7 matches)
Pattern: 🔍 Our 'Gymnasium' library matched to public library
Decision guide:
- INCORRECT: Usually school libraries ≠ public libraries
- CORRECT: Only if they genuinely share facilities/systems
Example:
"Fredericia Gymnasium, Biblioteket" → "Fredericia Bibliotek"
Decision: INCORRECT (school library vs public library)
Notes: "Gymnasium library is separate from public library system"
C. Branch vs Main (10 matches)
Pattern: Our name ends with the suffix ", Biblioteket"
Decision guide:
- Check Wikidata page - does it list branches?
- If Wikidata entry is MAIN library and ours is BRANCH → CORRECT
- If completely different institution → INCORRECT
D. Low Confidence (8 matches)
Pattern: ⚠️ Low confidence (<87%) with no ISIL to verify
Action: Visit Wikidata URL, verify:
- Address/location matches?
- Opening year matches?
- Type matches (library/archive/museum)?
E. Priority 1-2 Spot Check (19 matches)
Pattern: auto_flag = OK but Priority 1-2
Action: Quick sanity check only
- Most should be CORRECT (passed automated checks)
- Just verify names look reasonable
- Mark CORRECT if looks good
Step 3: Fill Validation Columns
For each row, fill:
validation_status (required):
- CORRECT - Wikidata match is correct
- INCORRECT - Wikidata match is wrong
- UNCERTAIN - Need expert review
validation_notes (required):
- Explain your decision
- Include URL visited, dates checked, etc.
Example entries:
CORRECT,"Branch library of main system, confirmed on Wikidata Q21107021"
INCORRECT,"Gymnasium library (school) incorrectly matched to public library"
INCORRECT,"Different cities (Viborg vs Aalborg), different institutions"
CORRECT,"Name variation, same institution confirmed by ISIL code DK-872150"
UNCERTAIN,"Need to verify with domain expert - possible historical merger?"
Step 4: Apply Validation
After filling all rows:
# Apply validation decisions to main dataset
python scripts/apply_wikidata_validation.py
# Check progress
python scripts/check_validation_progress.py
Automated Pre-fill Examples
Example 1: City Mismatch (Auto-INCORRECT)
Institution: "Fur Lokalhistoriske Arkiv"
City: Skive
Wikidata: "Randers Lokalhistoriske Arkiv" (Q12332829)
Score: 85.2%
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected. City mismatch: our 'Skive' but Wikidata mentions 'randers'
Why auto-marked: Different cities (Skive vs Randers) = different local archives
Example 2: Multiple City Mismatches (Pattern)
Common error pattern discovered: Many local archives incorrectly matched to "Randers Lokalhistoriske Arkiv"
Affected archives (all auto-marked INCORRECT):
- Fur Lokalhistoriske Arkiv (Skive)
- Aarup Lokalhistoriske Arkiv (Assens)
- Ikast Lokalhistoriske Arkiv (Ikast-Brande)
- Morsø Lokalhistoriske Arkiv (Morsø)
- Hover Lokalhistoriske Arkiv (Ringkøbing-Skjern)
- 20+ more...
Root cause: Fuzzy matcher incorrectly grouped local archives with similar names
Example 3: Type Mismatch (Auto-INCORRECT)
Institution: "Musikmuseet - Musikhistorisk Museum og Carl Claudius' Samling"
Type: LIBRARY
Wikidata: Q21107738 (Museum)
Score: 98.4%
validation_status: INCORRECT
validation_notes: [AUTO] Type mismatch: institution types fundamentally different (library vs museum)
Why auto-marked: Despite high name similarity, type mismatch is definitive
Validation Decision Guide
Quick Reference Table
| Issue Type | Default | Check For | Common Outcome |
|---|---|---|---|
| 🚨 City mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| ⚠️ Type mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| 🔍 Gymnasium library | INCORRECT | Shared facilities? | 90% INCORRECT |
| 🔍 Low similarity (<60%) | INCORRECT | Historical name? | 80% INCORRECT |
| 🔍 Branch suffix | CORRECT | Different inst? | 70% CORRECT |
| 🔍 First word differs | UNCERTAIN | City name? | 50/50 |
| ⚠️ Low score (<87%) | UNCERTAIN | Check Wikidata | 50/50 |
When to Mark CORRECT
✅ Branch vs Main Library
- Our name: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"
- Same library system, branch location
✅ Name Variation
- Our name: "Sjællands Stiftsbiblioteks gamle samling"
- Wikidata: "Sjællands Stiftsbibliotek"
- Historical vs current name, same institution
✅ Confirmed by ISIL
- Our ISIL: DK-872150
- Wikidata ISIL: DK-872150 (same)
- Names differ slightly but ISIL confirms match
When to Mark INCORRECT
❌ Different Cities
- Our city: Skive
- Wikidata city: Randers
- Local archives are inherently city-specific
❌ Different Types
- Our type: LIBRARY
- Wikidata type: MUSEUM
- Fundamentally different institution categories
❌ School vs Public
- Our name: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek" (public library)
- School library ≠ public library
❌ Very Different Names
- Our name: "Lunds stadsbibliotek"
- Wikidata: "Billund Bibliotek"
- Only 29.6% similarity, no relationship
When to Mark UNCERTAIN
⁉️ Possible Historical Merger
- Names differ + dates unclear
- Need expert to verify organizational history
⁉️ Ambiguous Branch Relationship
- Could be branch OR different institution
- Need domain knowledge
⁉️ Missing Data
- Not enough information to decide
- Flag for follow-up research
Validation Quality Standards
Minimum Requirements
For each validated row, ensure:
- ✅ validation_status is filled (CORRECT/INCORRECT/UNCERTAIN)
- ✅ validation_notes explains the decision
- ✅ Notes include evidence (URL checked, date verified, etc.)
- ✅ If UNCERTAIN, notes explain what info is missing
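Some of these requirements can be checked mechanically before applying the validation. The sketch below is a hypothetical helper, not part of the existing scripts; the `VAGUE_NOTES` phrases are taken from this guide's own examples of weak notes, and the check cannot judge whether the evidence in a note is actually sufficient.

```python
ALLOWED_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}
# Phrases this guide treats as too vague to count as an explanation.
VAGUE_NOTES = {"looks wrong", "probably correct", "incorrect", "different institutions"}


def check_row(row):
    """Return a list of problems with one validated row (empty list = passes)."""
    problems = []
    status = (row.get("validation_status") or "").strip()
    notes = (row.get("validation_notes") or "").strip()

    if status not in ALLOWED_STATUSES:
        problems.append("validation_status must be CORRECT, INCORRECT, or UNCERTAIN")
    if not notes:
        problems.append("validation_notes is empty")
    elif notes.lower().rstrip(".") in VAGUE_NOTES:
        problems.append("validation_notes is too vague; describe what was checked")
    return problems


good = {"validation_status": "CORRECT",
        "validation_notes": "Branch library confirmed on Wikidata page Q21107021."}
bad = {"validation_status": "WRONG", "validation_notes": "Looks wrong"}
print(check_row(good))
print(check_row(bad))
```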
Good Validation Notes Examples
CORRECT decision:
"Branch library confirmed on Wikidata page Q21107021. Main library system
operates multiple branch locations including this one."
INCORRECT decision:
"City mismatch: our institution in Viborg, Wikidata entity in Aalborg.
Checked Q21107842 - describes Aalborg gymnasium specifically."
UNCERTAIN decision:
"Names differ significantly but both in Roskilde. Possible historical name
change or merger. Recommend expert review to confirm organizational history."
Bad Validation Notes Examples
❌ Too vague:
"Looks wrong" → No evidence provided
"Probably correct" → No verification described
❌ Missing evidence:
"INCORRECT" → Why? What did you check?
"Different institutions" → How do you know?
❌ No investigation:
"Not sure, marked UNCERTAIN" → Did you check Wikidata page? Address?
After Validation
Step 1: Apply Validation Decisions
python scripts/apply_wikidata_validation.py
What this does:
- Reads your validation decisions from CSV
- Updates main dataset (denmark_complete_enriched.json)
- Removes INCORRECT Wikidata links
- Keeps CORRECT Wikidata links
- Flags UNCERTAIN for follow-up
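The core of that update can be pictured as follows. This is only a sketch of the behaviour listed above, not the contents of scripts/apply_wikidata_validation.py: the record fields `name`, `wikidata_id`, and `needs_followup`, and the keying of decisions by institution name, are assumptions, since the schema of denmark_complete_enriched.json is not described in this guide.

```python
def apply_decisions(records, decisions):
    """Apply validation decisions to dataset records in place.

    records:   list of dicts with hypothetical 'name' and 'wikidata_id' fields
    decisions: {institution name: CORRECT | INCORRECT | UNCERTAIN}
    """
    for record in records:
        status = decisions.get(record["name"])
        if status == "INCORRECT":
            # Remove the wrong Wikidata link.
            record.pop("wikidata_id", None)
        elif status == "UNCERTAIN":
            # Keep the link for now, but flag the record for follow-up.
            record["needs_followup"] = True
        # CORRECT (or no decision yet): leave the link untouched.
    return records


records = [
    {"name": "A", "wikidata_id": "Q1"},
    {"name": "B", "wikidata_id": "Q2"},
    {"name": "C", "wikidata_id": "Q3"},
]
apply_decisions(records, {"A": "INCORRECT", "B": "CORRECT", "C": "UNCERTAIN"})
print(records)
```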
Step 2: Check Progress
python scripts/check_validation_progress.py
Output:
- Total fuzzy matches reviewed
- Breakdown: CORRECT vs INCORRECT vs UNCERTAIN
- Remaining unvalidated matches
- Next steps
Step 3: Verify Results
Before validation:
- Wikidata links: 769 total (584 exact + 185 fuzzy)
- Fuzzy match accuracy: Unknown (need validation)
After validation (expected):
- Wikidata links: ~680-700 total
- Fuzzy CORRECT: ~100-110 (54-59%)
- Fuzzy INCORRECT: ~70-80 (38-43%) → Removed
- Overall accuracy: ~95%+
Step 4: Document Findings
Create summary report:
- Total matches validated
- Accuracy of fuzzy matching algorithm
- Common error patterns discovered
- Recommendations for improving fuzzy matching
Troubleshooting
Q: What if I disagree with an auto-marked INCORRECT?
A: You can override it! Change validation_status to CORRECT and add your reasoning in validation_notes. The automated decision is just a starting point.
Example:
# Original (auto):
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected...
# Your override:
validation_status: CORRECT
validation_notes: "Overriding auto-mark: Checked Wikidata, this is a branch
library that serves both cities. Confirmed with institution website."
Q: How do I know if a gymnasium library shares facilities?
A: Check:
- Visit Wikidata page → Look for "part of" relationships
- Search institution website → Look for shared catalog/systems
- Check ISIL codes → Same ISIL = shared system
Q: What if I can't decide after checking Wikidata?
A: Mark as UNCERTAIN and document what you checked:
validation_status: UNCERTAIN
validation_notes: "Checked Q21107861, addresses differ slightly. Possible
relocation or branch. Need institutional records to confirm."
Q: Can I batch-mark multiple rows?
A: Yes! If you find a pattern:
# Example: All these were matched to Q12332829 (Randers archive)
# All in different cities → All INCORRECT
validation_status: INCORRECT
validation_notes: "Batch validation: City mismatch, different local archives
incorrectly grouped by fuzzy matcher"
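If the pattern is regular enough, the batch can be filled in with a few lines of code instead of by hand. The sketch below assumes that wikidata_url ends with the entity ID (for example .../Q12332829) and that the caller supplies the city of the Wikidata entity; `batch_mark` is a hypothetical helper, not an existing script. It only touches rows whose validation_status is still empty, so earlier decisions are never overwritten.

```python
def batch_mark(rows, qid, wikidata_city, note):
    """Mark unvalidated rows matched to `qid` in a different city as INCORRECT.

    Returns the number of rows changed.
    """
    changed = 0
    for row in rows:
        matched_qid = row["wikidata_url"].rstrip("/").rsplit("/", 1)[-1]
        already_validated = bool((row.get("validation_status") or "").strip())
        if matched_qid == qid and row["city"] != wikidata_city and not already_validated:
            row["validation_status"] = "INCORRECT"
            row["validation_notes"] = note
            changed += 1
    return changed


rows = [
    {"institution_name": "Aarup Lokalhistoriske Arkiv", "city": "Assens",
     "wikidata_url": "https://www.wikidata.org/wiki/Q12332829",
     "validation_status": "", "validation_notes": ""},
    {"institution_name": "Randers Lokalhistoriske Arkiv", "city": "Randers",
     "wikidata_url": "https://www.wikidata.org/wiki/Q12332829",
     "validation_status": "", "validation_notes": ""},
]
note = ("Batch validation: City mismatch, different local archives "
        "incorrectly grouped by fuzzy matcher")
print(batch_mark(rows, "Q12332829", "Randers", note))
```

Rows changed this way should still be spot-checked, since a city string can differ for reasons other than a wrong match (for example a municipality name versus a town name).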
Progress Tracking
Current Status
| Metric | Count | Percentage |
|---|---|---|
| Total fuzzy matches | 185 | 100% |
| Auto-marked INCORRECT | 73 | 39.5% |
| Needs manual review | 75 | 40.5% |
| Remaining unvalidated | 37 | 20.0% |
Note: The 37 "remaining unvalidated" are Priority 3-5 matches in the full CSV that aren't in the streamlined needs_review file. You can validate these later if needed.
Validation Milestones
- Automated spot checks - 185 matches flagged (Nov 19)
- Automated pre-fill - 73 obvious errors marked (Nov 19)
- Manual review - 75 ambiguous cases (in progress)
- Apply validation - Update main dataset
- Quality check - Verify results
- Documentation - Write summary report
Contact & Support
Questions?
- Check: docs/WIKIDATA_VALIDATION_CHECKLIST.md - Detailed validation guide
- Check: docs/AUTOMATED_SPOT_CHECK_RESULTS.md - Spot check methodology
- Check: data/review/README.md - Quick reference
Found a bug in automated pre-fill?
- Script: scripts/prefill_obvious_errors.py
- Report issue with example row
Need expert review?
- Mark as UNCERTAIN
- Document what's unclear
- Escalate after validation complete
Last Updated: November 19, 2025
Status: 73/185 validated (39.5% complete)
Next Action: Manual review of 75 remaining matches