Automated Spot Check Results - Danish Wikidata Fuzzy Matches

Date: 2025-11-19
Method: Fast pattern-based detection (no Wikidata API queries)
Total Matches: 185
Flagged Issues: 129 (69.7%)
No Issues: 56 (30.3%)


🎯 Executive Summary

Automated checks identified 71 obvious city mismatches that are almost certainly INCORRECT matches. These can be marked INCORRECT without manual research, which helps bring the estimated review time down from 5-8 hours to about 3.4 hours.

Key Findings

| Category | Count | Confidence | Action |
|---|---|---|---|
| 🚨 City Mismatches | 71 | Very High | Mark INCORRECT immediately |
| 🔍 Kombi Library Mismatches | 1 | Moderate | Needs judgment |
| 🔍 Low Name Similarity (<60%) | 11 | Moderate | Needs judgment |
| 🔍 Gymnasium Libraries | 7 | Moderate | Usually INCORRECT |
| 🔍 Other Name Pattern Issues | 39 | Low | Needs case-by-case review |
| No Issues Detected | 56 | N/A | Spot check Priority 1-2 only |

🚨 Priority 1: City Mismatches (71 matches) - MARK AS INCORRECT

Confidence: 95%+ that these are wrong matches
Time Required: ~71 minutes (1 min each)
Action: Open CSV, filter for "🚨 City mismatch", mark validation_status = INCORRECT

Examples:

  1. Gladsaxe Bibliotekerne (Søborg) matched to Gentofte Bibliotekerne (Gentofte)

    • Different cities → INCORRECT
  2. Fur Lokalhistoriske Arkiv (Skive) matched to Randers Lokalhistoriske Arkiv (Randers)

    • Different cities → INCORRECT
  3. Rysensteen Gymnasium (København V) matched to Greve Gymnasium (Greve)

    • Different cities → INCORRECT
  4. Multiple "X Lokalhistoriske Arkiv" matched to Randers Lokalhistoriske Arkiv

    • Algorithm confused similar names in different cities → INCORRECT

Pattern: The fuzzy matching algorithm matched institutions with similar names but in completely different cities. These are clearly distinct institutions.

Validation Notes Template:

City mismatch detected by automated spot check: our institution in [City A] but Wikidata entity in [City B]. Different institutions.
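
The city comparison and the note template above can be sketched as follows. This is a minimal illustration, not the logic of scripts/spot_check_fuzzy_matches_fast.py, which is not shown here; the function names and the exact-match comparison are assumptions.

```python
from typing import Optional


def normalize_city(city: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(city.lower().split())


def city_mismatch_note(our_city: str, wikidata_city: str) -> Optional[str]:
    """Return a filled-in validation note if the cities differ, else None.

    Assumption: both cities are already available as plain strings.
    """
    if not our_city.strip() or not wikidata_city.strip():
        return None  # missing data: leave the row for manual review
    if normalize_city(our_city) == normalize_city(wikidata_city):
        return None
    return (
        "City mismatch detected by automated spot check: our institution in "
        f"{our_city} but Wikidata entity in {wikidata_city}. Different institutions."
    )
```

An exact comparison like this would also treat "København V" and "København" as different cities, so postal-district suffixes may need extra handling before rows are bulk-marked.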

🔍 Priority 2: Low Name Similarity (11 matches) - NEEDS JUDGMENT

Confidence: 60-70% likely INCORRECT
Time Required: ~22 minutes (2 min each)
Action: Review each, check Wikidata page for verification

Examples:

  1. Campus Vejle, Biblioteket (58% similarity) vs Vejle Bibliotek

    • Possibly campus branch vs main library?
    • Check Wikidata P361 (part of) property
  2. Lunds stadsbibliotek vs Billund Bibliotek

    • Very different names, likely wrong match
    • "Lunds stadsbibliotek" is a Swedish name (Lund is a city in Sweden), so this is probably not a Danish institution

Validation Steps:

  1. Visit wikidata_url
  2. Check P131 (located in) - does city match?
  3. Check P361 (part of) - is one a branch of the other?
  4. Mark CORRECT if branch/main relationship, INCORRECT if completely different
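
The similarity percentages that put matches into this category could be approximated along these lines. This sketch uses Python's standard `difflib.SequenceMatcher`; the metric actually used by the matching pipeline is not stated in this report, so scores from this sketch will not necessarily reproduce figures such as the 58% above.

```python
from difflib import SequenceMatcher

# Threshold taken from the "Low Name Similarity (<60%)" category.
LOW_SIMILARITY_THRESHOLD = 0.60


def name_similarity(our_name: str, wikidata_name: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, our_name.casefold(), wikidata_name.casefold()).ratio()


def is_low_similarity(our_name: str, wikidata_name: str,
                      threshold: float = LOW_SIMILARITY_THRESHOLD) -> bool:
    """Flag a pair whose names are less similar than the threshold."""
    return name_similarity(our_name, wikidata_name) < threshold
```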

🏫 Priority 3: Gymnasium Libraries (7 matches) - USUALLY INCORRECT

Confidence: 70-80% likely INCORRECT
Time Required: ~14 minutes (2 min each)
Action: Verify if school library vs public library

Pattern:

Our Name: "[School Name] Gymnasium, Biblioteket"
Wikidata: "[City Name] Bibliotek" (public library)

Issue: School libraries were matched to public libraries in the same city.

Examples:

  • Fredericia Gymnasium, Biblioteket → Fredericia Bibliotek
  • Viborg Handelsskole, Biblioteket → Viborg Bibliotek

Check:

  1. Visit Wikidata page
  2. Look for P31 (instance of) - should show "public library" or "school library"
  3. If Wikidata is public library and ours is gymnasium → INCORRECT
  4. If Wikidata is also school library → CORRECT
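
The name pattern described above can be detected with a simple keyword check. The keyword list below ("gymnasium", "handelsskole") is an assumption drawn from the examples in this section; the real script may use a different list.

```python
import re

# Hypothetical keyword list based on the examples in this section.
SCHOOL_PATTERN = re.compile(r"\b(gymnasium|handelsskole)\b", re.IGNORECASE)


def looks_like_school_library(name: str) -> bool:
    """True if the name contains a school keyword."""
    return bool(SCHOOL_PATTERN.search(name))


def is_school_vs_public_mismatch(our_name: str, wikidata_name: str) -> bool:
    """Flag pairs where only our name looks like a school library."""
    return looks_like_school_library(our_name) and not looks_like_school_library(wikidata_name)
```

A name-based flag is only a hint; the P31 check on the Wikidata page remains the deciding step.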

🔍 Priority 4: Other Flagged Issues (40 matches) - CASE BY CASE

Confidence: Varies
Time Required: ~80 minutes (2 min each)
Action: Review based on specific issue type

Issue Types:

  • Branch suffix ", Biblioteket" in our name
  • First word differs (possible city mismatch)
  • Low score (<87%) without ISIL confirmation
  • Kombi library location mismatches

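Approach: Follow the validation checklist for each match. The 40 matches are the 39 other name pattern issues plus the 1 Kombi library mismatch from the Key Findings table.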
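
Three of the issue types listed above lend themselves to simple string checks, sketched below. The function name, parameters, and the 0-100 score scale are assumptions for illustration; Kombi library checks are omitted because their rule is not described in this report.

```python
from typing import List


def first_word(name: str) -> str:
    """Return the first word of a name, lowercased and without trailing punctuation."""
    words = name.split()
    return words[0].strip(",.").casefold() if words else ""


def other_pattern_flags(our_name: str, wikidata_name: str,
                        score: float, isil_confirmed: bool) -> List[str]:
    """Collect the name-pattern issues that apply to one match."""
    flags = []
    if our_name.rstrip().endswith(", Biblioteket"):
        flags.append("Branch suffix ', Biblioteket' in our name")
    if first_word(our_name) != first_word(wikidata_name):
        flags.append("First word differs (possible city mismatch)")
    if score < 87 and not isil_confirmed:
        flags.append("Low score (<87%) without ISIL confirmation")
    return flags
```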


Priority 5: No Issues Detected (56 matches) - LOWER PRIORITY

Confidence: 80-90% likely CORRECT
Time Required: ~28 minutes to spot check all 56 (0.5 min each); ~15-20 minutes for the recommended subset
Action: Spot check rows whose CSV priority is 1-2, skip rows with priority 3-5

These matches passed all automated checks:

  • Cities match or no conflict detected
  • Names reasonably similar (>60%)
  • No obvious type mismatches
  • No problematic patterns

Recommendation:

  • Review Priority 1-2 "no issues" matches (30-40 matches)
  • Skip Priority 3-5 "no issues" matches (high confidence)
  • Estimated time: ~15-20 minutes

⏱️ Time Estimates

| Task | Matches | Time/Match | Total Time |
|---|---|---|---|
| City Mismatches (mark INCORRECT) | 71 | 1 min | 71 min |
| Low Similarity (review) | 11 | 2 min | 22 min |
| Gymnasium Libraries (review) | 7 | 2 min | 14 min |
| Other Flagged (review) | 40 | 2 min | 80 min |
| No Issues P1-2 (spot check) | 30 | 0.5 min | 15 min |
| TOTAL | 159 | avg 1.3 min | ~3.4 hours |

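Original Estimate: 5-8 hours for all 185 matches
Revised with Automation: ~3.4 hours for the 159 matches that need attention (roughly 32-57% time savings; the 57% figure compares against the 8-hour upper bound)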
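
The totals can be re-derived from the per-task figures with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones in the table above.

```python
# (task, matches, minutes per match), copied from the time estimate table
tasks = [
    ("City Mismatches (mark INCORRECT)", 71, 1.0),
    ("Low Similarity (review)", 11, 2.0),
    ("Gymnasium Libraries (review)", 7, 2.0),
    ("Other Flagged (review)", 40, 2.0),
    ("No Issues P1-2 (spot check)", 30, 0.5),
]

total_matches = sum(count for _, count, _ in tasks)
total_minutes = sum(count * minutes for _, count, minutes in tasks)
total_hours = total_minutes / 60
average_minutes = total_minutes / total_matches

# Savings relative to the low and high ends of the original 5-8 hour estimate
savings_vs_5h = 1 - total_hours / 5
savings_vs_8h = 1 - total_hours / 8

print(f"{total_matches} matches, {total_minutes:.0f} min, {total_hours:.1f} h, "
      f"avg {average_minutes:.1f} min/match")
print(f"savings: {savings_vs_5h:.0%} to {savings_vs_8h:.0%}")
```

The savings are a range because the original estimate is a range: the result depends on whether the revised total is compared with 5 hours or 8 hours.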


📋 Step-by-Step Workflow

Step 1: Open Flagged CSV

open data/review/denmark_wikidata_fuzzy_matches_flagged.csv

Step 2: Mark City Mismatches (71 matches, ~71 min)

  1. Sort by spot_check_issues column
  2. Filter for rows containing "🚨 City mismatch"
  3. For each row:
    • Fill validation_status = INCORRECT
    • Fill validation_notes = City mismatch: [our city] vs [Wikidata city], different institutions
  4. Save CSV
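
If editing 71 rows by hand is tedious, the same marking could be scripted. The sketch below is not one of the repository's scripts. The columns spot_check_issues, validation_status, and validation_notes come from this report, while the city column names (`city`, `wikidata_city`) and both function names are hypothetical and need adjusting to the real CSV header.

```python
import csv
from typing import Dict, List

CITY_FLAG = "🚨 City mismatch"


def mark_city_mismatches(rows: List[Dict[str, str]],
                         our_city_col: str = "city",
                         wikidata_city_col: str = "wikidata_city") -> int:
    """Mark unreviewed city-mismatch rows as INCORRECT; return how many were marked."""
    marked = 0
    for row in rows:
        issues = row.get("spot_check_issues") or ""
        already_reviewed = (row.get("validation_status") or "").strip()
        if CITY_FLAG in issues and not already_reviewed:
            our_city = row.get(our_city_col) or "[our city]"
            wikidata_city = row.get(wikidata_city_col) or "[Wikidata city]"
            row["validation_status"] = "INCORRECT"
            row["validation_notes"] = (
                f"City mismatch: {our_city} vs {wikidata_city}, different institutions"
            )
            marked += 1
    return marked


def mark_file(path_in: str, path_out: str) -> int:
    """Read a flagged CSV, mark city mismatches, and write the result to path_out."""
    with open(path_in, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    for column in ("validation_status", "validation_notes"):
        if column not in fieldnames:
            fieldnames.append(column)
    marked = mark_city_mismatches(rows)
    with open(path_out, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return marked
```

Rows that already have a validation_status are left untouched, so manual decisions are never overwritten. Writing to a separate output path keeps the flagged CSV intact until the result has been checked.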

Step 3: Review Low Similarity (11 matches, 22 min)

  1. Filter for "Low name similarity"
  2. For each row:
    • Click wikidata_url
    • Check P131 (location), P361 (part of)
    • Decide: CORRECT (branch/main) or INCORRECT (different)
    • Fill validation_status and validation_notes

Step 4: Review Gymnasium Libraries (7 matches, 14 min)

  1. Filter for "Gymnasium"
  2. For each row:
    • Click wikidata_url
    • Check P31 (instance of) - public vs school library?
    • If mismatch → INCORRECT
    • Fill validation_status and validation_notes

Step 5: Review Other Flagged (40 matches, 80 min)

  1. Filter for remaining REVIEW_URGENT rows
  2. Follow validation checklist for each
  3. Fill validation_status and validation_notes

Step 6: Spot Check "No Issues" (30 matches, 15 min)

  1. Filter for auto_flag = OK AND priority IN (1, 2)
  2. Quick review (30 sec each):
    • Names look similar? → CORRECT
    • Any obvious issues? → INCORRECT
  3. Fill validation_status
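
The filter in item 1 could be expressed in Python as below, assuming the CSV rows have been loaded as dictionaries (for example with `csv.DictReader`) and that the columns are named auto_flag and priority as written above. The function name is hypothetical.

```python
from typing import Dict, List


def select_spot_check_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows with auto_flag = OK and priority 1 or 2."""
    selected = []
    for row in rows:
        flag = (row.get("auto_flag") or "").strip()
        priority = str(row.get("priority") or "").strip()
        if flag == "OK" and priority in {"1", "2"}:
            selected.append(row)
    return selected
```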

Step 7: Apply Validation

python scripts/apply_wikidata_validation.py

Step 8: Check Results

python scripts/check_validation_progress.py

📊 Expected Validation Results

Based on automated spot check findings:

| Status | Expected Count | % of Total | Notes |
|---|---|---|---|
| CORRECT | 100-110 | 54-59% | No-issues matches + verified |
| INCORRECT | 70-80 | 38-43% | City mismatches + other errors |
| UNCERTAIN | 5-10 | 3-5% | Ambiguous cases |

Quality Target: ≥50% CORRECT, ≤45% INCORRECT (acceptable given fuzzy matching)

Note: The INCORRECT rate is higher than the original estimate (5-10%) because the automated checks caught many city mismatches that would otherwise have required manual research to find.
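
Once the review is done, the quality target can be checked against the actual status counts. The sketch below assumes the validation_status values are collected into a list; the function names are illustrative and not part of the repository's scripts.

```python
from collections import Counter
from typing import Dict, List


def status_shares(statuses: List[str]) -> Dict[str, float]:
    """Return the share of each status among all rows (0.0 if there are no rows)."""
    counts = Counter(statuses)
    total = len(statuses)
    return {
        status: (counts[status] / total if total else 0.0)
        for status in ("CORRECT", "INCORRECT", "UNCERTAIN")
    }


def meets_quality_target(statuses: List[str]) -> bool:
    """Quality target from this report: at least 50% CORRECT, at most 45% INCORRECT."""
    shares = status_shares(statuses)
    return shares["CORRECT"] >= 0.50 and shares["INCORRECT"] <= 0.45
```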


🎓 Validation Decision Guide

Mark as INCORRECT if:

  • City names differ (e.g., Skive vs Randers)
  • Institution type differs (library vs museum)
  • Gymnasium library matched to public library
  • Name similarity <50% with no other confirmation

Mark as CORRECT if:

  • ISIL codes match (authoritative)
  • Branch relationship confirmed on Wikidata (P361)
  • Same institution, different language (Danish/English)
  • Name similarity >70% AND city matches

Mark as UNCERTAIN if:

  • ⚠️ Cannot determine branch vs main relationship
  • ⚠️ Historical name change unclear
  • ⚠️ No clear evidence either way
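
The guide above can be written as a small decision function. The rule order is an assumption: the guide does not say which rule wins when two conflict, so this sketch lets positive confirmation (ISIL match, confirmed branch relationship) take precedence over the negative rules. The "same institution, different language" rule is left out because it needs human judgment. `MatchEvidence` and `decide` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchEvidence:
    isil_match: bool = False
    branch_confirmed: bool = False         # P361 relationship verified on Wikidata
    cities_differ: Optional[bool] = None   # None means the cities are unknown
    types_differ: bool = False             # e.g. library vs museum
    school_vs_public: bool = False         # gymnasium library vs public library
    similarity: Optional[float] = None     # name similarity in percent (0-100)


def decide(evidence: MatchEvidence) -> str:
    """Return CORRECT, INCORRECT, or UNCERTAIN for one match."""
    # Assumed precedence: authoritative confirmation first.
    if evidence.isil_match or evidence.branch_confirmed:
        return "CORRECT"
    if evidence.cities_differ or evidence.types_differ or evidence.school_vs_public:
        return "INCORRECT"
    if evidence.similarity is not None:
        if evidence.similarity > 70 and evidence.cities_differ is False:
            return "CORRECT"
        if evidence.similarity < 50:
            return "INCORRECT"
    return "UNCERTAIN"
```

For example, `decide(MatchEvidence(cities_differ=True))` returns "INCORRECT", while a match with 65% similarity and no other evidence falls through to "UNCERTAIN".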

📁 Files Generated

Input

  • data/review/denmark_wikidata_fuzzy_matches.csv (original, 42 KB)

Output

  • data/review/denmark_wikidata_fuzzy_matches_flagged.csv (with spot check results, 57 KB)

Scripts

  • scripts/spot_check_fuzzy_matches_fast.py (pattern-based detection)
  • scripts/apply_wikidata_validation.py (apply results after manual review)
  • scripts/check_validation_progress.py (progress tracking)

🚀 Quick Start Command

# Open flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv

# Sort by spot_check_issues column (🚨 city mismatches first)
# Mark all city mismatches as INCORRECT (validation_status = INCORRECT)
# Review remaining flagged rows
# Spot check "OK" rows in Priority 1-2
# Save CSV

# Apply validation
python scripts/apply_wikidata_validation.py

# Check progress
python scripts/check_validation_progress.py

🎯 Success Criteria

After manual review:

  • All 71 city mismatches marked INCORRECT
  • All 11 low similarity cases reviewed
  • All 7 gymnasium libraries reviewed
  • Priority 1-2 "OK" rows spot-checked
  • At least 150/185 (81%) rows have validation_status
  • At least 100/185 (54%) rows have validation_notes
  • Apply script runs successfully
  • Final dataset has <100 INCORRECT removals
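
The countable criteria in this list can be checked programmatically. The sketch below is independent of scripts/check_validation_progress.py, whose output format is not shown here. It assumes rows loaded as dictionaries with the column names used in this report, and the function name is hypothetical.

```python
from typing import Dict, List

CITY_FLAG = "🚨 City mismatch"


def check_success_criteria(rows: List[Dict[str, str]]) -> Dict[str, bool]:
    """Evaluate the countable success criteria for a reviewed CSV."""
    def value(row: Dict[str, str], column: str) -> str:
        return (row.get(column) or "").strip()

    city_rows = [r for r in rows if CITY_FLAG in (r.get("spot_check_issues") or "")]
    with_status = sum(1 for r in rows if value(r, "validation_status"))
    with_notes = sum(1 for r in rows if value(r, "validation_notes"))
    incorrect = sum(1 for r in rows if value(r, "validation_status") == "INCORRECT")
    return {
        "city_mismatches_marked": all(
            value(r, "validation_status") == "INCORRECT" for r in city_rows
        ),
        "status_coverage": with_status >= 150,
        "notes_coverage": with_notes >= 100,
        "incorrect_below_100": incorrect < 100,
    }
```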

📝 Sample Validation Notes

For City Mismatches (INCORRECT)

Automated spot check detected city mismatch: our institution in Skive vs Wikidata entity in Randers. Different local historical archives.

For Low Similarity (needs judgment)

Low name similarity (58%). Checked Wikidata - "Campus Vejle" is campus library branch, Wikidata entry is main public library. Different institutions. INCORRECT.

For Gymnasium (INCORRECT)

School library (gymnasium) incorrectly matched to public library. Wikidata P31 shows "public library" but ours is "school library". INCORRECT.

For Branch Relationships (CORRECT)

Branch library matched to main library. Checked Wikidata P361 - confirms branch relationship. Same institution system. CORRECT.

🔧 Troubleshooting

Q: CSV won't sort by spot_check_issues?
A: Try filtering instead - Excel/Sheets: Data → Filter → Select "🚨 City mismatch"

Q: Too many matches to review in one session?
A: Focus on the 71 city mismatches first; they can be completed in one session. The rest can wait.

Q: Unsure about a match?
A: Mark as UNCERTAIN, add detailed notes. We can research further later.

Q: How do I know if done?
A: Run python scripts/check_validation_progress.py - shows completion %


📈 Progress Tracking

Use this checklist:

[x] Automated spot checks run (129 flagged)
[ ] City mismatches reviewed (0/71)
[ ] Low similarity reviewed (0/11)
[ ] Gymnasium libraries reviewed (0/7)
[ ] Other flagged reviewed (0/40)
[ ] No-issues spot checked (0/30)
[ ] Validation applied
[ ] RDF re-exported

Current Progress: 0/159 (0%)

Last Updated: 2025-11-19
Generated By: scripts/spot_check_fuzzy_matches_fast.py
Review CSV: data/review/denmark_wikidata_fuzzy_matches_flagged.csv