
Pre-filled Validation Guide: Denmark Wikidata Fuzzy Matches

Status: 73 obvious errors automatically marked INCORRECT (Nov 19, 2025)
Remaining: 75 matches require manual judgment


Summary of Automated Pre-fill

What Was Done

An automated script (scripts/prefill_obvious_errors.py) analyzed all 185 fuzzy Wikidata matches and:

  1. Identified 73 obvious errors based on clear criteria
  2. Automatically marked them as INCORRECT in validation_status
  3. Added explanatory notes documenting why each was flagged
  4. Generated a streamlined review file containing only the 75 remaining ambiguous cases

Automated Detection Rules

The script marked matches as INCORRECT when they had:

Rule 1: City Mismatch (71 matches)

  • Pattern: 🚨 City mismatch: our 'X' but Wikidata mentions 'Y'
  • Logic: Different cities = different institutions
  • Confidence: Very high (>99% accuracy)
  • Examples:
    • Our: "Fur Lokalhistoriske Arkiv" (Skive) → Wikidata: "Randers Lokalhistoriske Arkiv"
    • Our: "Gladsaxe Bibliotekerne" (Søborg) → Wikidata: "Gentofte Bibliotekerne"

Rule 2: Type Mismatch (1 match)

  • Pattern: ⚠️ Type mismatch: we're LIBRARY but Wikidata mentions museum/gallery
  • Logic: Fundamentally different institution types
  • Example: Our LIBRARY matched to Wikidata museum entry

Rule 3: Very Low Name Similarity (1 match)

  • Pattern: Low name similarity (<30%)
  • Logic: Names too different to be same institution
  • Example: "Lunds stadsbibliotek" vs "Billund Bibliotek" (29.6% similarity)
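The three rules above can be sketched as a single classifier. This is a minimal illustration, not the actual logic of scripts/prefill_obvious_errors.py; the field names (`city`, `wikidata_city`, `institution_type`, `wikidata_type`) are hypothetical, and the character-level similarity here may differ from the scorer the real matcher uses:

```python
from difflib import SequenceMatcher
from typing import Optional

def name_similarity(a: str, b: str) -> float:
    """Rough character-level similarity in [0, 1] (illustrative only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(row: dict) -> Optional[str]:
    """Return an 'INCORRECT: <reason>' string if a rule fires, else None."""
    # Rule 1: different cities = different institutions
    if row["city"] and row["wikidata_city"] and \
            row["city"].lower() != row["wikidata_city"].lower():
        return "INCORRECT: city mismatch"
    # Rule 2: fundamentally different institution types
    if row["institution_type"] == "LIBRARY" and \
            row["wikidata_type"] in ("museum", "gallery"):
        return "INCORRECT: type mismatch"
    # Rule 3: names too different to be the same institution
    if name_similarity(row["institution_name"], row["wikidata_label"]) < 0.30:
        return "INCORRECT: very low name similarity"
    return None  # ambiguous — leave empty for manual review
```

Rows where no rule fires keep an empty `validation_status` and flow into the needs-review file.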

Files Generated

1. Pre-filled Full CSV (All 185 matches)

File: data/review/denmark_wikidata_fuzzy_matches_prefilled.csv
Size: 64.3 KB
Contents: All 185 fuzzy matches with 73 pre-filled as INCORRECT

Use when:

  • You want to see everything (validated + remaining)
  • You want to verify automated decisions
  • You need full context

How to use:

# Columns:
auto_flag          → REVIEW_URGENT or OK
spot_check_issues  → Detected problems
validation_status  → INCORRECT (auto), or empty (needs review)
validation_notes   → [AUTO] explanation or manual notes
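If you prefer to work on the full CSV programmatically, the column semantics above make the split trivial: any row with a non-empty `validation_status` is already decided. A small sketch (the file path comes from this guide; everything else is plain stdlib `csv`):

```python
import csv

def split_rows(rows):
    """Partition prefilled rows into auto-validated vs still needing review."""
    auto, pending = [], []
    for row in rows:
        # A non-empty validation_status means the prefill script decided it
        (auto if row["validation_status"].strip() else pending).append(row)
    return auto, pending

def load_and_split(path="data/review/denmark_wikidata_fuzzy_matches_prefilled.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        return split_rows(list(csv.DictReader(f)))
```

On the prefilled file you should get 73 rows in `auto` and 112 (75 flagged plus 37 lower-priority) in `pending`.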

2. Streamlined Needs Review CSV (75 matches only)

File: data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
Size: 22.3 KB
Contents: ONLY the 75 matches requiring your judgment

Use when:

  • You want to focus on remaining work (recommended!)
  • You trust the automated decisions
  • You want faster review

What's included:

  • 56 flagged matches NOT automatically marked (ambiguous cases)
  • 19 "OK" matches with Priority 1-2 (spot check for safety)

Time Estimates

Original Estimate (Before Automation)

  • Total matches: 185
  • Estimated time: 462 minutes (7.7 hours)
  • Breakdown: 2.5 min/match average

After Automated Pre-fill

  • Pre-filled INCORRECT: 73 matches (no review needed)
  • Needs manual review: 75 matches
  • Estimated time: 150 minutes (2.5 hours)
  • Time saved: 67.5% (312 minutes = 5.2 hours)

Breakdown of Remaining 75 Matches

| Category | Count | Est. Time | Description |
|---|---|---|---|
| Name pattern issues | 11 | 22 min | Low similarity, different first words |
| Gymnasium libraries | 7 | 14 min | School library vs public library |
| Branch vs main | 10 | 20 min | Branch suffix mismatch |
| Low confidence | 8 | 16 min | Score <87% without ISIL |
| Priority 1-2 spot check | 19 | 38 min | "OK" matches needing safety check |
| Other ambiguous | 20 | 40 min | Case-by-case judgment |
| **Total** | **75** | **150 min (2.5 hours)** | |

Manual Review Workflow

Step 1: Open the Review File

# Open in Excel, Google Sheets, or a text editor
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv

Columns to focus on:

  • auto_flag - REVIEW_URGENT = needs judgment
  • spot_check_issues - What patterns were detected
  • institution_name - Our institution
  • wikidata_label - Wikidata entity label
  • city - Our city (check consistency)
  • wikidata_url - Click to verify on Wikidata

Step 2: Review by Category

A. Name Pattern Issues (11 matches)

Pattern: 🔍 Low name similarity or 🔍 First word differs

Decision guide:

  • CORRECT if: Branch vs main library (e.g., "Campus Vejle, Biblioteket" → "Vejle Bibliotek")
  • INCORRECT if: Truly different institutions (different names, no branch relationship)

Example:

"Campus Vejle, Biblioteket" → "Vejle Bibliotek"
Decision: CORRECT (campus branch of main library)
Notes: "Campus library is branch of main Vejle public library"

B. Gymnasium Libraries (7 matches)

Pattern: 🔍 Our 'Gymnasium' library matched to public library

Decision guide:

  • INCORRECT: Usually school libraries ≠ public libraries
  • CORRECT: Only if they genuinely share facilities/systems

Example:

"Fredericia Gymnasium, Biblioteket" → "Fredericia Bibliotek"
Decision: INCORRECT (school library vs public library)
Notes: "Gymnasium library is separate from public library system"

C. Branch vs Main (10 matches)

Pattern: trailing ", Biblioteket" suffix in our name

Decision guide:

  • Check Wikidata page - does it list branches?
  • If Wikidata entry is MAIN library and ours is BRANCH → CORRECT
  • If completely different institution → INCORRECT

D. Low Confidence (8 matches)

Pattern: ⚠️ Low confidence (<87%) with no ISIL to verify

Action: Visit Wikidata URL, verify:

  • Address/location matches?
  • Opening year matches?
  • Type matches (library/archive/museum)?

E. Priority 1-2 Spot Check (19 matches)

Pattern: auto_flag = OK but Priority 1-2

Action: Quick sanity check only

  • Most should be CORRECT (passed automated checks)
  • Just verify names look reasonable
  • Mark CORRECT if looks good

Step 3: Fill Validation Columns

For each row, fill:

validation_status (required):

  • CORRECT - Wikidata match is correct
  • INCORRECT - Wikidata match is wrong
  • UNCERTAIN - Need expert review

validation_notes (required):

  • Explain your decision
  • Include URL visited, dates checked, etc.

Example entries:

CORRECT,"Branch library of main system, confirmed on Wikidata Q21107021"
INCORRECT,"Gymnasium library (school) incorrectly matched to public library"
INCORRECT,"Different cities (Viborg vs Aalborg), different institutions"
CORRECT,"Name variation, same institution confirmed by ISIL code DK-872150"
UNCERTAIN,"Need to verify with domain expert - possible historical merger?"
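If you are making decisions outside a spreadsheet, a small helper can fill both required columns in one pass. This is a sketch, not a project script; it assumes `wikidata_url` is unique per row and usable as a lookup key:

```python
import csv

def apply_to_rows(rows, decisions):
    """Fill validation columns on in-memory rows.

    `decisions` maps wikidata_url -> (status, notes); any other unique
    column would work equally well as the key."""
    for row in rows:
        if row["wikidata_url"] in decisions:
            row["validation_status"], row["validation_notes"] = \
                decisions[row["wikidata_url"]]
    return rows

def fill_csv(in_path, out_path, decisions):
    """Round-trip the review CSV through apply_to_rows."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields, rows = reader.fieldnames, list(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(apply_to_rows(rows, decisions))
```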

Step 4: Apply Validation

After filling all rows:

# Apply validation decisions to main dataset
python scripts/apply_wikidata_validation.py

# Check progress
python scripts/check_validation_progress.py

Automated Pre-fill Examples

Example 1: City Mismatch (Auto-INCORRECT)

Institution: "Fur Lokalhistoriske Arkiv"
City: Skive
Wikidata: "Randers Lokalhistoriske Arkiv" (Q12332829)
Score: 85.2%

validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected. City mismatch: our 'Skive' but Wikidata mentions 'randers'

Why auto-marked: Different cities (Skive vs Randers) = different local archives

Example 2: Multiple City Mismatches (Pattern)

Common error pattern discovered: Many local archives incorrectly matched to "Randers Lokalhistoriske Arkiv"

Affected archives (all auto-marked INCORRECT):

  • Fur Lokalhistoriske Arkiv (Skive)
  • Aarup Lokalhistoriske Arkiv (Assens)
  • Ikast Lokalhistoriske Arkiv (Ikast-Brande)
  • Morsø Lokalhistoriske Arkiv (Morsø)
  • Hover Lokalhistoriske Arkiv (Ringkøbing-Skjern)
  • 20+ more...

Root cause: Fuzzy matcher incorrectly grouped local archives with similar names
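This many-to-one pattern can be found mechanically: count how many distinct institutions were matched to the same Wikidata entity. A quick sketch (the `wikidata_id` column name is an assumption):

```python
from collections import Counter

def overmatched_targets(rows, threshold=3):
    """Return Wikidata IDs matched by `threshold` or more institutions.

    Several institutions collapsing onto one entity is a strong hint
    that the fuzzy matcher grouped distinct local archives."""
    counts = Counter(row["wikidata_id"] for row in rows)
    return {qid: n for qid, n in counts.items() if n >= threshold}
```

Running this over the full match file would surface Q12332829 (the Randers archive) immediately.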

Example 3: Type Mismatch (Auto-INCORRECT)

Institution: "Musikmuseet - Musikhistorisk Museum og Carl Claudius' Samling"
Type: LIBRARY
Wikidata: Q21107738 (Museum)
Score: 98.4%

validation_status: INCORRECT
validation_notes: [AUTO] Type mismatch: institution types fundamentally different (library vs museum)

Why auto-marked: Despite high name similarity, type mismatch is definitive


Validation Decision Guide

Quick Reference Table

| Issue Type | Default | Check For | Common Outcome |
|---|---|---|---|
| 🚨 City mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| ⚠️ Type mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| 🔍 Gymnasium library | INCORRECT | Branch sharing? | 90% INCORRECT |
| 🔍 Low similarity (<60%) | INCORRECT | Historical name? | 80% INCORRECT |
| 🔍 Branch suffix | CORRECT | Different inst? | 70% CORRECT |
| 🔍 First word differs | UNCERTAIN | City name? | 50/50 |
| ⚠️ Low score (<87%) | UNCERTAIN | Check Wikidata | 50/50 |

When to Mark CORRECT

Branch vs Main Library

  • Our name: "Campus Vejle, Biblioteket"
  • Wikidata: "Vejle Bibliotek"
  • Same library system, branch location

Name Variation

  • Our name: "Sjællands Stiftsbiblioteks gamle samling"
  • Wikidata: "Sjællands Stiftsbibliotek"
  • Historical vs current name, same institution

Confirmed by ISIL

  • Our ISIL: DK-872150
  • Wikidata ISIL: DK-872150 (same)
  • Names differ slightly but ISIL confirms match
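The ISIL check is the one comparison that can be done exactly rather than by judgment. A minimal helper, assuming the only normalization needed is trimming whitespace and upper-casing:

```python
def isil_match(ours, theirs):
    """True when both ISIL codes are present and identical after
    trimming whitespace and upper-casing (assumed normalization)."""
    if not ours or not theirs:
        return False  # a missing code confirms nothing
    return ours.strip().upper() == theirs.strip().upper()
```

A `True` result justifies marking CORRECT even when the names differ; a missing code on either side means the ISIL cannot decide the match.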

When to Mark INCORRECT

Different Cities

  • Our city: Skive
  • Wikidata city: Randers
  • Local archives are inherently city-specific

Different Types

  • Our type: LIBRARY
  • Wikidata type: MUSEUM
  • Fundamentally different institution categories

School vs Public

  • Our name: "Fredericia Gymnasium, Biblioteket"
  • Wikidata: "Fredericia Bibliotek" (public library)
  • School library ≠ public library

Very Different Names

  • Our name: "Lunds stadsbibliotek"
  • Wikidata: "Billund Bibliotek"
  • Only 29.6% similarity, no relationship

When to Mark UNCERTAIN

⁉️ Possible Historical Merger

  • Names differ + dates unclear
  • Need expert to verify organizational history

⁉️ Ambiguous Branch Relationship

  • Could be branch OR different institution
  • Need domain knowledge

⁉️ Missing Data

  • Not enough information to decide
  • Flag for follow-up research

Validation Quality Standards

Minimum Requirements

For each validated row, ensure:

  1. validation_status is filled (CORRECT/INCORRECT/UNCERTAIN)
  2. validation_notes explains the decision
  3. Notes include evidence (URL checked, date verified, etc.)
  4. If UNCERTAIN, notes explain what info is missing
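The four requirements above can be enforced mechanically before running the apply script. A sketch with heuristic thresholds (the 20-character minimum and the UNCERTAIN keyword check are assumptions, not project rules):

```python
VALID_STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")

def row_problems(row):
    """Return a list of quality problems for one validated row."""
    problems = []
    status = row.get("validation_status", "").strip()
    notes = row.get("validation_notes", "").strip()
    if status not in VALID_STATUSES:
        problems.append("validation_status missing or invalid")
    if not notes:
        problems.append("validation_notes empty")
    elif len(notes) < 20:
        # Heuristic: very short notes rarely carry real evidence
        problems.append("validation_notes too short to carry evidence")
    if status == "UNCERTAIN" and "need" not in notes.lower():
        problems.append("UNCERTAIN rows should say what info is missing")
    return problems
```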

Good Validation Notes Examples

CORRECT decision:

"Branch library confirmed on Wikidata page Q21107021. Main library system 
operates multiple branch locations including this one."

INCORRECT decision:

"City mismatch: our institution in Viborg, Wikidata entity in Aalborg. 
Checked Q21107842 - describes Aalborg gymnasium specifically."

UNCERTAIN decision:

"Names differ significantly but both in Roskilde. Possible historical name 
change or merger. Recommend expert review to confirm organizational history."

Bad Validation Notes Examples

Too vague:

"Looks wrong"  → No evidence provided
"Probably correct"  → No verification described

Missing evidence:

"INCORRECT"  → Why? What did you check?
"Different institutions"  → How do you know?

No investigation:

"Not sure, marked UNCERTAIN"  → Did you check Wikidata page? Address?

After Validation

Step 1: Apply Validation Decisions

python scripts/apply_wikidata_validation.py

What this does:

  • Reads your validation decisions from CSV
  • Updates main dataset (denmark_complete_enriched.json)
  • Removes INCORRECT Wikidata links
  • Keeps CORRECT Wikidata links
  • Flags UNCERTAIN for follow-up
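The core of that update step can be sketched as follows. This is illustrative, not the code of scripts/apply_wikidata_validation.py, and the record field names (`wikidata_id`, `wikidata_needs_review`) are assumptions:

```python
def apply_decisions(records, decisions):
    """Apply validation decisions to enriched records in place.

    `records`: list of dicts, each with a 'wikidata_id' field.
    `decisions`: dict mapping wikidata_id -> status string.
    INCORRECT links are removed, CORRECT links are kept untouched,
    UNCERTAIN links are flagged for follow-up."""
    for rec in records:
        status = decisions.get(rec.get("wikidata_id"))
        if status == "INCORRECT":
            rec["wikidata_id"] = None          # drop the bad link
        elif status == "UNCERTAIN":
            rec["wikidata_needs_review"] = True
    return records
```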

Step 2: Check Progress

python scripts/check_validation_progress.py

Output:

  • Total fuzzy matches reviewed
  • Breakdown: CORRECT vs INCORRECT vs UNCERTAIN
  • Remaining unvalidated matches
  • Next steps

Step 3: Verify Results

Before validation:

  • Wikidata links: 769 total (584 exact + 185 fuzzy)
  • Fuzzy match accuracy: Unknown (need validation)

After validation (expected):

  • Wikidata links: ~680-700 total
  • Fuzzy CORRECT: ~100-110 (54-59%)
  • Fuzzy INCORRECT: ~70-80 (38-43%) → Removed
  • Overall accuracy: ~95%+

Step 4: Document Findings

Create summary report:

  • Total matches validated
  • Accuracy of fuzzy matching algorithm
  • Common error patterns discovered
  • Recommendations for improving fuzzy matching

Troubleshooting

Q: What if I disagree with an auto-marked INCORRECT?

A: You can override it! Change validation_status to CORRECT and add your reasoning in validation_notes. The automated decision is just a starting point.

Example:

# Original (auto):
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected...

# Your override:
validation_status: CORRECT
validation_notes: "Overriding auto-mark: Checked Wikidata, this is a branch 
library that serves both cities. Confirmed with institution website."

Q: How do I know if a gymnasium library shares facilities?

A: Check:

  1. Visit Wikidata page → Look for "part of" relationships
  2. Search institution website → Look for shared catalog/systems
  3. Check ISIL codes → Same ISIL = shared system

Q: What if I can't decide after checking Wikidata?

A: Mark as UNCERTAIN and document what you checked:

validation_status: UNCERTAIN
validation_notes: "Checked Q21107861, addresses differ slightly. Possible 
relocation or branch. Need institutional records to confirm."

Q: Can I batch-mark multiple rows?

A: Yes! If you find a pattern:

# Example: All these were matched to Q12332829 (Randers archive)
# All in different cities → All INCORRECT

validation_status: INCORRECT
validation_notes: "Batch validation: City mismatch, different local archives 
incorrectly grouped by fuzzy matcher"
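Batch-marking a pattern like the Q12332829 group can be scripted rather than done row by row. A sketch that only touches still-unvalidated rows, so it never overwrites earlier decisions (the `wikidata_id` column name is an assumption):

```python
def batch_mark(rows, wikidata_id, status, notes):
    """Mark every still-unvalidated row matched to `wikidata_id`.

    Returns the number of rows updated; rows that already carry a
    validation_status are left untouched."""
    updated = 0
    for row in rows:
        if row["wikidata_id"] == wikidata_id and \
                not row["validation_status"].strip():
            row["validation_status"] = status
            row["validation_notes"] = notes
            updated += 1
    return updated
```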

Progress Tracking

Current Status

| Metric | Count | Percentage |
|---|---|---|
| Total fuzzy matches | 185 | 100% |
| Auto-marked INCORRECT | 73 | 39.5% |
| Needs manual review | 75 | 40.5% |
| Remaining unvalidated | 37 | 20.0% |

Note: The 37 "remaining unvalidated" are Priority 3-5 matches in the full CSV that aren't in the streamlined needs_review file. You can validate these later if needed.

Validation Milestones

  • Automated spot checks - 185 matches flagged (Nov 19)
  • Automated pre-fill - 73 obvious errors marked (Nov 19)
  • Manual review - 75 ambiguous cases (in progress)
  • Apply validation - Update main dataset
  • Quality check - Verify results
  • Documentation - Write summary report

Contact & Support

Questions?

  • Check: docs/WIKIDATA_VALIDATION_CHECKLIST.md - Detailed validation guide
  • Check: docs/AUTOMATED_SPOT_CHECK_RESULTS.md - Spot check methodology
  • Check: data/review/README.md - Quick reference

Found a bug in automated pre-fill?

  • Script: scripts/prefill_obvious_errors.py
  • Report issue with example row

Need expert review?

  • Mark as UNCERTAIN
  • Document what's unclear
  • Escalate after validation complete

Last Updated: November 19, 2025
Status: 73/185 validated (39.5% complete)
Next Action: Manual review of 75 remaining matches