Automated Spot Check Results - Danish Wikidata Fuzzy Matches
Date: 2025-11-19
Method: Fast pattern-based detection (no Wikidata API queries)
Total Matches: 185
Flagged Issues: 129 (69.7%)
No Issues: 56 (30.3%)
🎯 Executive Summary
Automated checks identified 71 obvious city mismatches that are almost certainly INCORRECT matches. These can be marked INCORRECT without manual research, which helps bring the estimated review time down from 5-8 hours to ~3.4 hours.
Key Findings
| Category | Count | Confidence | Action |
|---|---|---|---|
| 🚨 City Mismatches | 71 | Very High | Mark INCORRECT immediately |
| 🔍 Kombi Library Mismatches | 1 | Moderate | Needs judgment |
| 🔍 Low Name Similarity (<60%) | 11 | Moderate | Needs judgment |
| 🔍 Gymnasium Libraries | 7 | Moderate | Usually INCORRECT |
| 🔍 Other Name Pattern Issues | 39 | Low | Needs case-by-case review |
| ✅ No Issues Detected | 56 | N/A | Spot check Priority 1-2 only |
🚨 Priority 1: City Mismatches (71 matches) - MARK AS INCORRECT
Confidence: 95%+ likely INCORRECT
Time Required: ~71 minutes (1 min each)
Action: Open CSV, filter for "🚨 City mismatch", mark validation_status = INCORRECT
Examples:
- Gladsaxe Bibliotekerne (Søborg) matched to Gentofte Bibliotekerne (Gentofte)
  - Different cities → INCORRECT
- Fur Lokalhistoriske Arkiv (Skive) matched to Randers Lokalhistoriske Arkiv (Randers)
  - Different cities → INCORRECT
- Rysensteen Gymnasium (København V) matched to Greve Gymnasium (Greve)
  - Different cities → INCORRECT
- Multiple "X Lokalhistoriske Arkiv" matched to Randers Lokalhistoriske Arkiv
  - Algorithm confused similar names in different cities → INCORRECT
Pattern: The fuzzy matching algorithm matched institutions with similar names but in completely different cities. These are clearly distinct institutions.
Validation Notes Template:
City mismatch detected by automated spot check: our institution in [City A] but Wikidata entity in [City B]. Different institutions.
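The city comparison behind this pattern can be sketched in a few lines. This is an illustrative sketch only, not the logic in scripts/spot_check_fuzzy_matches_fast.py (which is not shown in this report); the function names and the normalization rule (case-folding and whitespace collapsing) are assumptions.

```python
def normalize_city(city: str) -> str:
    """Case-fold and collapse whitespace so ' Randers' equals 'randers'."""
    return " ".join(city.casefold().split())


def is_city_mismatch(our_city: str, wikidata_city: str) -> bool:
    """Return True only when both cities are known and they differ."""
    ours, theirs = normalize_city(our_city), normalize_city(wikidata_city)
    return bool(ours) and bool(theirs) and ours != theirs


def city_mismatch_note(our_city: str, wikidata_city: str) -> str:
    """Fill the validation notes template with the two city names."""
    return (
        "City mismatch detected by automated spot check: our institution in "
        f"{our_city} but Wikidata entity in {wikidata_city}. Different institutions."
    )


if is_city_mismatch("Skive", "Randers"):
    print(city_mismatch_note("Skive", "Randers"))
```

A plain string comparison like this treats a postal town and its municipality as different places, so a quick glance at each flagged row is still worthwhile before marking it.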
🔍 Priority 2: Low Name Similarity (11 matches) - NEEDS JUDGMENT
Confidence: 60-70% likely INCORRECT
Time Required: ~22 minutes (2 min each)
Action: Review each, check Wikidata page for verification
Examples:
- Campus Vejle, Biblioteket (58% similarity) vs Vejle Bibliotek
  - Possibly campus branch vs main library?
  - Check Wikidata P361 (part of) property
- Lunds stadsbibliotek vs Billund Bibliotek
  - Very different names, likely wrong match
  - "Lunds" suggests Sweden, not Denmark?
Validation Steps:
- Visit wikidata_url
- Check P131 (located in) - does city match?
- Check P361 (part of) - is one a branch of the other?
- Mark CORRECT if branch/main relationship, INCORRECT if completely different
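To get a feel for the 60% threshold, name similarity can be approximated with the standard library. The report does not say which metric produced the 58% figure, so `difflib.SequenceMatcher` is used here purely as a stand-in and its scores may differ from the ones in the CSV; the function names are hypothetical.

```python
from difflib import SequenceMatcher

LOW_SIMILARITY_THRESHOLD = 0.60  # the report flags matches below 60%


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two case-folded names."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def needs_judgment(our_name: str, wikidata_name: str) -> bool:
    """True when the pair falls under the low-similarity threshold."""
    return name_similarity(our_name, wikidata_name) < LOW_SIMILARITY_THRESHOLD


for ours, theirs in [
    ("Campus Vejle, Biblioteket", "Vejle Bibliotek"),
    ("Lunds stadsbibliotek", "Billund Bibliotek"),
]:
    print(f"{ours!r} vs {theirs!r}: {name_similarity(ours, theirs):.0%}")
```

A low score only says the names differ; the P131 and P361 checks above are still what decides CORRECT versus INCORRECT.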
🏫 Priority 3: Gymnasium Libraries (7 matches) - USUALLY INCORRECT
Confidence: 70-80% likely INCORRECT
Time Required: ~14 minutes (2 min each)
Action: Verify if school library vs public library
Pattern:
Our Name: "[School Name] Gymnasium, Biblioteket"
Wikidata: "[City Name] Bibliotek" (public library)
Issue: School libraries matched to public libraries in same city.
Examples:
- Fredericia Gymnasium, Biblioteket → Fredericia Bibliotek
- Viborg Handelsskole, Biblioteket → Viborg Bibliotek
Check:
- Visit Wikidata page
- Look for P31 (instance of) - should show "public library" or "school library"
- If Wikidata is public library and ours is gymnasium → INCORRECT
- If Wikidata is also school library → CORRECT
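The check above reduces to a small lookup once the P31 labels have been read off the Wikidata page. The sketch below assumes the labels are copied in by hand (no API call is made, in line with the method stated at the top of this report); the function name and the UNCERTAIN fallback for any other label are assumptions.

```python
def gymnasium_verdict(wikidata_p31_labels: list[str]) -> str:
    """Verdict for a row whose own record is a gymnasium (school) library."""
    labels = {label.casefold().strip() for label in wikidata_p31_labels}
    if "school library" in labels:
        return "CORRECT"      # Wikidata entity is also a school library
    if "public library" in labels:
        return "INCORRECT"    # school library matched to a public library
    return "UNCERTAIN"        # P31 missing or some other type: review by hand


print(gymnasium_verdict(["public library"]))
```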
🔍 Priority 4: Other Flagged Issues (40 matches: 39 name pattern issues + 1 Kombi library mismatch) - CASE BY CASE
Confidence: Varies
Time Required: ~80 minutes (2 min each)
Action: Review based on specific issue type
Issue Types:
- Branch suffix ", Biblioteket" in our name
- First word differs (possible city mismatch)
- Low score (<87%) without ISIL confirmation
- Kombi library location mismatches
Approach: Follow validation checklist for each match.
✅ Priority 5: No Issues Detected (56 matches) - LOWER PRIORITY
Confidence: 80-90% likely CORRECT
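Time Required: ~28 minutes to spot check all 56 (0.5 min each); ~15-20 minutes if limited to the recommended subset
Action: Spot check only rows whose priority column in the CSV is 1 or 2; skip rows with priority 3-5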
These matches passed all automated checks:
- Cities match or no conflict detected
- Names reasonably similar (>60%)
- No obvious type mismatches
- No problematic patterns
Recommendation:
- Review Priority 1-2 "no issues" matches (30-40 matches)
- Skip Priority 3-5 "no issues" matches (high confidence)
- Estimated time: ~15-20 minutes
⏱️ Time Estimates
| Task | Matches | Time/Match | Total Time |
|---|---|---|---|
| City Mismatches (mark INCORRECT) | 71 | 1 min | 71 min |
| Low Similarity (review) | 11 | 2 min | 22 min |
| Gymnasium Libraries (review) | 7 | 2 min | 14 min |
| Other Flagged (review) | 40 | 2 min | 80 min |
| No Issues P1-2 (spot check) | 30 | 0.5 min | 15 min |
| ──────────────── | ─── | ───── | ────── |
| TOTAL | 159 | avg 1.3 min | ~3.4 hours |
Original Estimate: 5-8 hours for all 185 matches
Revised with Automation: ~3.4 hours (roughly 33-58% time savings against the 5-8 hour estimate)
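The totals in the table can be re-derived from the per-task figures, which also shows that the size of the saving depends on which end of the 5-8 hour estimate is used as the baseline. The variable names below are illustrative.

```python
# (task, matches, minutes per match), copied from the table above
tasks = [
    ("City mismatches", 71, 1.0),
    ("Low similarity", 11, 2.0),
    ("Gymnasium libraries", 7, 2.0),
    ("Other flagged", 40, 2.0),
    ("No issues, priority 1-2", 30, 0.5),
]

total_matches = sum(count for _, count, _ in tasks)
total_minutes = sum(count * minutes for _, count, minutes in tasks)
total_hours = total_minutes / 60

# Savings relative to the low and high ends of the original estimate
savings_vs_5h = 1 - total_hours / 5
savings_vs_8h = 1 - total_hours / 8

print(f"{total_matches} matches, {total_minutes:.0f} min, {total_hours:.1f} h")
print(f"savings: {savings_vs_5h:.0%} to {savings_vs_8h:.0%}")
```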
📋 Step-by-Step Workflow
Step 1: Open Flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
Step 2: Mark City Mismatches (71 matches, 1 hour)
- Sort by the spot_check_issues column
- Filter for rows containing "🚨 City mismatch"
- For each row:
  - Fill validation_status = INCORRECT
  - Fill validation_notes = City mismatch: [our city] vs [Wikidata city], different institutions
- Save CSV
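If filling 71 rows by hand is too tedious, the same step could be scripted. The following is a minimal sketch, not one of the project's scripts: it assumes the CSV has the spot_check_issues, validation_status, and validation_notes columns named in this report, and the `city` and `wikidata_city` column names are hypothetical (the bracketed placeholders are kept when they are absent).

```python
import csv

CITY_FLAG = "City mismatch"  # substring of the flag text in spot_check_issues


def mark_city_mismatches(rows: list[dict]) -> int:
    """Set status and notes on flagged rows; return how many rows changed."""
    changed = 0
    for row in rows:
        if CITY_FLAG not in (row.get("spot_check_issues") or ""):
            continue
        if (row.get("validation_status") or "").strip():
            continue  # never overwrite a decision a reviewer already made
        # "city" and "wikidata_city" are hypothetical column names.
        ours = row.get("city") or "[our city]"
        theirs = row.get("wikidata_city") or "[Wikidata city]"
        row["validation_status"] = "INCORRECT"
        row["validation_notes"] = (
            f"City mismatch: {ours} vs {theirs}, different institutions"
        )
        changed += 1
    return changed


def process_file(path: str) -> int:
    """Read the flagged CSV, mark city mismatches, and write it back in place."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    for column in ("validation_status", "validation_notes"):
        if column not in fieldnames:
            fieldnames.append(column)
    changed = mark_city_mismatches(rows)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return changed
```

Calling `process_file("data/review/denmark_wikidata_fuzzy_matches_flagged.csv")` would rewrite the file in place, so working on a copy first is the safer choice.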
Step 3: Review Low Similarity (11 matches, 22 min)
- Filter for "Low name similarity"
- For each row:
  - Click wikidata_url
  - Check P131 (location), P361 (part of)
  - Decide: CORRECT (branch/main) or INCORRECT (different)
  - Fill validation_status and validation_notes
Step 4: Review Gymnasium Libraries (7 matches, 14 min)
- Filter for "Gymnasium"
- For each row:
  - Click wikidata_url
  - Check P31 (instance of) - public vs school library?
  - If mismatch → INCORRECT
  - Fill validation_status and validation_notes
Step 5: Review Other Flagged (40 matches, 80 min)
- Filter for remaining REVIEW_URGENT rows
- Follow validation checklist for each
- Fill validation_status and validation_notes
Step 6: Spot Check "No Issues" (30 matches, 15 min)
- Filter for auto_flag = OK AND priority IN (1, 2)
- Quick review (30 sec each):
  - Names look similar? → CORRECT
  - Any obvious issues? → INCORRECT
- Fill validation_status
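The filter in this step can also be expressed in code for anyone working outside a spreadsheet. This sketch assumes rows loaded with `csv.DictReader`, so that auto_flag and priority arrive as strings; the function name is hypothetical.

```python
def spot_check_candidates(rows: list[dict]) -> list[dict]:
    """Rows with auto_flag == OK and priority 1 or 2 that still lack a status."""
    return [
        row
        for row in rows
        if (row.get("auto_flag") or "").strip() == "OK"
        and (row.get("priority") or "").strip() in {"1", "2"}
        and not (row.get("validation_status") or "").strip()
    ]


sample = [
    {"auto_flag": "OK", "priority": "1", "validation_status": ""},
    {"auto_flag": "OK", "priority": "4", "validation_status": ""},
    {"auto_flag": "REVIEW_URGENT", "priority": "1", "validation_status": ""},
]
print(len(spot_check_candidates(sample)))
```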
Step 7: Apply Validation
python scripts/apply_wikidata_validation.py
Step 8: Check Results
python scripts/check_validation_progress.py
📊 Expected Validation Results
Based on automated spot check findings:
| Status | Expected Count | % of Total | Notes |
|---|---|---|---|
| CORRECT | 100-110 | 54-59% | No-issues matches + verified |
| INCORRECT | 70-80 | 38-43% | City mismatches + other errors |
| UNCERTAIN | 5-10 | 3-5% | Ambiguous cases |
Quality Target: ≥50% CORRECT, ≤45% INCORRECT (acceptable given fuzzy matching)
Note: The expected INCORRECT rate is higher than the original estimate (5-10%) because the automated checks caught many city mismatches that would otherwise have required manual research to find.
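Once statuses are filled in, the quality target can be checked mechanically. The sketch below measures shares against all 185 matches, as the table does; the function name is an assumption, and the sample data is just the midpoint of the expected ranges, not a real result.

```python
from collections import Counter

TOTAL_MATCHES = 185  # from the report header


def meets_quality_target(statuses: list[str], total: int = TOTAL_MATCHES) -> bool:
    """Check the >=50% CORRECT and <=45% INCORRECT target against `total`."""
    counts = Counter(statuses)
    correct_share = counts["CORRECT"] / total
    incorrect_share = counts["INCORRECT"] / total
    return correct_share >= 0.50 and incorrect_share <= 0.45


# Midpoints of the expected ranges: 105 CORRECT, 75 INCORRECT, 5 UNCERTAIN
midpoint = ["CORRECT"] * 105 + ["INCORRECT"] * 75 + ["UNCERTAIN"] * 5
print(meets_quality_target(midpoint))
```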
🎓 Validation Decision Guide
Mark as INCORRECT if:
- ✅ City names differ (e.g., Skive vs Randers)
- ✅ Institution type differs (library vs museum)
- ✅ Gymnasium library matched to public library
- ✅ Name similarity <50% with no other confirmation
Mark as CORRECT if:
- ✅ ISIL codes match (authoritative)
- ✅ Branch relationship confirmed on Wikidata (P361)
- ✅ Same institution, different language (Danish/English)
- ✅ Name similarity >70% AND city matches
Mark as UNCERTAIN if:
- ⚠️ Cannot determine branch vs main relationship
- ⚠️ Historical name change unclear
- ⚠️ No clear evidence either way
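The guide above can be read as an ordered rule list. The sketch below is one possible encoding: the guide does not state which rule wins when several apply, so the precedence chosen here (authoritative confirmations first, then disqualifiers, then similarity) is an assumption, as are the `Evidence` class and its field names. The "same institution, different language" rule is left to the reviewer and is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    isil_match: bool = False                 # ISIL codes match
    branch_confirmed: bool = False           # P361 relationship seen on Wikidata
    city_match: Optional[bool] = None        # None means not checked / unknown
    type_match: Optional[bool] = None        # e.g. library vs museum
    name_similarity: Optional[float] = None  # 0.0 to 1.0


def decide(e: Evidence) -> str:
    """Map collected evidence to CORRECT, INCORRECT, or UNCERTAIN."""
    if e.isil_match or e.branch_confirmed:
        return "CORRECT"
    if e.city_match is False or e.type_match is False:
        return "INCORRECT"
    if e.name_similarity is not None:
        if e.name_similarity < 0.50:
            return "INCORRECT"  # low similarity and nothing else confirms it
        if e.name_similarity > 0.70 and e.city_match:
            return "CORRECT"
    return "UNCERTAIN"


print(decide(Evidence(city_match=False, name_similarity=0.85)))
```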
📁 Files Generated
Input
data/review/denmark_wikidata_fuzzy_matches.csv (original, 42 KB)
Output
data/review/denmark_wikidata_fuzzy_matches_flagged.csv (with spot check results, 57 KB)
Scripts
- scripts/spot_check_fuzzy_matches_fast.py (pattern-based detection)
- scripts/apply_wikidata_validation.py (apply results after manual review)
- scripts/check_validation_progress.py (progress tracking)
🚀 Quick Start Command
# Open flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
# Sort by spot_check_issues column (🚨 city mismatches first)
# Mark all city mismatches as INCORRECT (validation_status = INCORRECT)
# Review remaining flagged rows
# Spot check "OK" rows in Priority 1-2
# Save CSV
# Apply validation
python scripts/apply_wikidata_validation.py
# Check progress
python scripts/check_validation_progress.py
🎯 Success Criteria
After manual review:
- All 71 city mismatches marked INCORRECT
- All 11 low similarity cases reviewed
- All 7 gymnasium libraries reviewed
- Priority 1-2 "OK" rows spot-checked
- At least 150/185 (81%) rows have validation_status
- At least 100/185 (54%) rows have validation_notes
- Apply script runs successfully
- Final dataset has <100 INCORRECT removals
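The countable criteria in this list could be verified against the reviewed CSV rows. This is a hypothetical helper, not the behavior of scripts/check_validation_progress.py (whose output format is not described here); it covers only the criteria that can be derived from the columns named in this report.

```python
def success_report(rows: list[dict]) -> dict[str, bool]:
    """Evaluate the countable success criteria on reviewed CSV rows."""

    def value(row: dict, column: str) -> str:
        return (row.get(column) or "").strip()

    city_rows = [r for r in rows if "City mismatch" in value(r, "spot_check_issues")]
    with_status = sum(1 for r in rows if value(r, "validation_status"))
    with_notes = sum(1 for r in rows if value(r, "validation_notes"))
    incorrect = sum(1 for r in rows if value(r, "validation_status") == "INCORRECT")

    return {
        "city_mismatches_marked": all(
            value(r, "validation_status") == "INCORRECT" for r in city_rows
        ),
        "status_coverage": with_status >= 150,   # at least 150/185 rows
        "notes_coverage": with_notes >= 100,     # at least 100/185 rows
        "incorrect_under_100": incorrect < 100,
    }
```

The review-completeness items (low similarity, gymnasium, and priority 1-2 rows) and the apply script's exit status still need to be checked separately.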
📝 Sample Validation Notes
For City Mismatches (INCORRECT)
Automated spot check detected city mismatch: our institution in Skive vs Wikidata entity in Randers. Different local historical archives.
For Low Similarity (needs judgment)
Low name similarity (58%). Checked Wikidata - "Campus Vejle" is campus library branch, Wikidata entry is main public library. Different institutions. INCORRECT.
For Gymnasium (INCORRECT)
School library (gymnasium) incorrectly matched to public library. Wikidata P31 shows "public library" but ours is "school library". INCORRECT.
For Branch Relationships (CORRECT)
Branch library matched to main library. Checked Wikidata P361 - confirms branch relationship. Same institution system. CORRECT.
🔧 Troubleshooting
Q: CSV won't sort by spot_check_issues?
A: Try filtering instead - Excel/Sheets: Data → Filter → Select "🚨 City mismatch"
Q: Too many matches to review in one session?
A: Focus on city mismatches first (71 matches), complete in 1 session. Rest can wait.
Q: Unsure about a match?
A: Mark as UNCERTAIN, add detailed notes. We can research further later.
Q: How do I know if done?
A: Run python scripts/check_validation_progress.py - shows completion %
📈 Progress Tracking
Use this checklist:
[x] Automated spot checks run (129 flagged)
[ ] City mismatches reviewed (0/71)
[ ] Low similarity reviewed (0/11)
[ ] Gymnasium libraries reviewed (0/7)
[ ] Other flagged reviewed (0/40)
[ ] No-issues spot checked (0/30)
[ ] Validation applied
[ ] RDF re-exported
Current Progress: 0/159 (0%)
Last Updated: 2025-11-19
Generated By: scripts/spot_check_fuzzy_matches_fast.py
Review CSV: data/review/denmark_wikidata_fuzzy_matches_flagged.csv