# Automated Spot Check Results - Danish Wikidata Fuzzy Matches

**Date**: 2025-11-19
**Method**: Fast pattern-based detection (no Wikidata API queries)
**Total Matches**: 185
**Flagged Issues**: 129 (69.7%)
**No Issues**: 56 (30.3%)

---

## 🎯 Executive Summary

Automated checks identified **71 obvious city mismatches** that are almost certainly INCORRECT matches. These can be quickly marked INCORRECT without manual research, reducing review time significantly.

### Key Findings

| Category | Count | Confidence | Action |
|----------|-------|------------|--------|
| 🚨 **City Mismatches** | **71** | Very High | Mark INCORRECT immediately |
| 🔍 Kombi Library Mismatches | 1 | Moderate | Needs judgment |
| 🔍 Low Name Similarity (<60%) | 11 | Moderate | Needs judgment |
| 🔍 Gymnasium Libraries | 7 | Moderate | Usually INCORRECT |
| 🔍 Other Name Pattern Issues | 39 | Low | Needs case-by-case review |
| ✅ No Issues Detected | 56 | N/A | Spot check Priority 1-2 only |

---

## 🚨 Priority 1: City Mismatches (71 matches) - MARK AS INCORRECT

**Confidence**: 95%+ these are wrong matches
**Time Required**: ~71 minutes (1 min each)
**Action**: Open CSV, filter for "🚨 City mismatch", mark validation_status = INCORRECT

### Examples:

1. **Gladsaxe Bibliotekerne** (Søborg) matched to **Gentofte Bibliotekerne** (Gentofte)
   - Different cities → INCORRECT

2. **Fur Lokalhistoriske Arkiv** (Skive) matched to **Randers Lokalhistoriske Arkiv** (Randers)
   - Different cities → INCORRECT

3. **Rysensteen Gymnasium** (København V) matched to **Greve Gymnasium** (Greve)
   - Different cities → INCORRECT

4. **Multiple "X Lokalhistoriske Arkiv"** matched to **Randers Lokalhistoriske Arkiv**
   - Algorithm confused similar names in different cities → INCORRECT

**Pattern**: The fuzzy matching algorithm matched institutions with similar names but in completely different cities. These are clearly distinct institutions.

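The pattern described above can be illustrated with a minimal sketch. It assumes the place name is the first word of each label, which is a simplification and not necessarily how `scripts/spot_check_fuzzy_matches_fast.py` detects mismatches; the function names `split_place_prefix` and `looks_like_city_mismatch` are hypothetical.

```python
def split_place_prefix(name: str) -> tuple[str, str]:
    """Split a label into (first word, remainder), both case-folded.

    Assumption: the first word is the place name, e.g.
    'Randers Lokalhistoriske Arkiv' -> ('randers', 'lokalhistoriske arkiv').
    """
    head, _, tail = name.strip().partition(" ")
    return head.casefold(), tail.strip().casefold()


def looks_like_city_mismatch(our_name: str, wikidata_label: str) -> bool:
    """Flag pairs that share a generic suffix but differ in the leading word."""
    our_place, our_rest = split_place_prefix(our_name)
    wd_place, wd_rest = split_place_prefix(wikidata_label)
    # An identical remainder with a different first word is the
    # "X Lokalhistoriske Arkiv" vs "Randers Lokalhistoriske Arkiv" pattern.
    return bool(our_rest) and our_rest == wd_rest and our_place != wd_place
```

Under this heuristic, the pair "Fur Lokalhistoriske Arkiv" and "Randers Lokalhistoriske Arkiv" is flagged, while "Campus Vejle, Biblioteket" and "Vejle Bibliotek" is not, because the remainders differ. A leading word such as "Rysensteen" is not a city name, so a flag from this sketch is a hint to compare the city fields rather than a verdict.
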
**Validation Notes Template**:
```
City mismatch detected by automated spot check: our institution in [City A] but Wikidata entity in [City B]. Different institutions.
```

---

## 🔍 Priority 2: Low Name Similarity (11 matches) - NEEDS JUDGMENT

**Confidence**: 60-70% likely INCORRECT
**Time Required**: ~22 minutes (2 min each)
**Action**: Review each, check Wikidata page for verification

### Examples:

1. **Campus Vejle, Biblioteket** (58% similarity) vs **Vejle Bibliotek**
   - Possibly campus branch vs main library?
   - Check Wikidata P361 (part of) property

2. **Lunds stadsbibliotek** vs **Billund Bibliotek**
   - Very different names, likely wrong match
   - "Lunds" suggests Sweden, not Denmark?

**Validation Steps**:
1. Visit wikidata_url
2. Check P131 (located in) - does city match?
3. Check P361 (part of) - is one a branch of the other?
4. Mark CORRECT if branch/main relationship, INCORRECT if completely different

---

## 🏫 Priority 3: Gymnasium Libraries (7 matches) - USUALLY INCORRECT

**Confidence**: 70-80% likely INCORRECT
**Time Required**: ~14 minutes (2 min each)
**Action**: Verify if school library vs public library

### Pattern:

**Our Name**: "[School Name] Gymnasium, Biblioteket"
**Wikidata**: "[City Name] Bibliotek" (public library)

**Issue**: School libraries matched to public libraries in same city.

**Examples**:
- Fredericia Gymnasium, Biblioteket → Fredericia Bibliotek
- Viborg Handelsskole, Biblioteket → Viborg Bibliotek

**Check**:
1. Visit Wikidata page
2. Look for P31 (instance of) - should show "public library" or "school library"
3. If Wikidata is public library and ours is gymnasium → INCORRECT
4. If Wikidata is also school library → CORRECT

---

## 🔍 Priority 4: Other Flagged Issues (40 matches) - CASE BY CASE

**Confidence**: Varies
**Time Required**: ~80 minutes (2 min each)
**Action**: Review based on specific issue type

**Issue Types**:
- Branch suffix ", Biblioteket" in our name
- First word differs (possible city mismatch)
- Low score (<87%) without ISIL confirmation
- Kombi library location mismatches

**Approach**: Follow validation checklist for each match.

---

## ✅ Priority 5: No Issues Detected (56 matches) - LOWER PRIORITY

**Confidence**: 80-90% likely CORRECT
**Time Required**: ~15-20 minutes for the recommended spot check (~28 minutes if all 56 were checked at 30 sec each)
**Action**: Spot check matches whose CSV `priority` value is 1-2, skip those with `priority` 3-5

These matches passed all automated checks:
- Cities match or no conflict detected
- Names reasonably similar (>60%)
- No obvious type mismatches
- No problematic patterns

**Recommendation**:
- Review Priority 1-2 "no issues" matches (30-40 matches)
- Skip Priority 3-5 "no issues" matches (high confidence)
- Estimated time: ~15-20 minutes

---

## ⏱️ Time Estimates

| Task | Matches | Time/Match | Total Time |
|------|---------|------------|------------|
| **City Mismatches** (mark INCORRECT) | 71 | 1 min | 71 min |
| **Low Similarity** (review) | 11 | 2 min | 22 min |
| **Gymnasium Libraries** (review) | 7 | 2 min | 14 min |
| **Other Flagged** (review) | 40 | 2 min | 80 min |
| **No Issues P1-2** (spot check) | 30 | 0.5 min | 15 min |
| **────────────────** | **───** | **─────** | **──────** |
| **TOTAL** | **159** | **avg 1.3 min** | **~3.4 hours** |

**Original Estimate**: 5-8 hours for all 185 matches
**Revised with Automation**: ~3.4 hours for 159 matches (32-57% time savings relative to the 5-8 hour estimate)

---

## 📋 Step-by-Step Workflow

### Step 1: Open Flagged CSV
```bash
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
```

### Step 2: Mark City Mismatches (71 matches, ~1 hour)
1. Sort by `spot_check_issues` column
2. Filter for rows containing "🚨 City mismatch"
3. For each row:
   - Fill `validation_status` = `INCORRECT`
   - Fill `validation_notes` = `City mismatch: [our city] vs [Wikidata city], different institutions`
4. Save CSV

### Step 3: Review Low Similarity (11 matches, 22 min)
1. Filter for "Low name similarity"
2. For each row:
   - Click `wikidata_url`
   - Check P131 (location), P361 (part of)
   - Decide: CORRECT (branch/main) or INCORRECT (different)
   - Fill `validation_status` and `validation_notes`

### Step 4: Review Gymnasium Libraries (7 matches, 14 min)
1. Filter for "Gymnasium"
2. For each row:
   - Click `wikidata_url`
   - Check P31 (instance of) - public vs school library?
   - If mismatch → INCORRECT
   - Fill `validation_status` and `validation_notes`

### Step 5: Review Other Flagged (40 matches, 80 min)
1. Filter for remaining `REVIEW_URGENT` rows
2. Follow validation checklist for each
3. Fill `validation_status` and `validation_notes`

### Step 6: Spot Check "No Issues" (30 matches, 15 min)
1. Filter for `auto_flag = OK` AND `priority IN (1, 2)`
2. Quick review (30 sec each):
   - Names look similar? → CORRECT
   - Any obvious issues? → INCORRECT
3. Fill `validation_status`

### Step 7: Apply Validation
```bash
python scripts/apply_wikidata_validation.py
```

### Step 8: Check Results
```bash
python scripts/check_validation_progress.py
```

---

## 📊 Expected Validation Results

Based on automated spot check findings:

| Status | Expected Count | % of Total | Notes |
|--------|----------------|------------|-------|
| **CORRECT** | 100-110 | 54-59% | No-issues matches + verified |
| **INCORRECT** | 70-80 | 38-43% | City mismatches + other errors |
| **UNCERTAIN** | 5-10 | 3-5% | Ambiguous cases |

**Quality Target**: ≥50% CORRECT, ≤45% INCORRECT (acceptable given fuzzy matching)

**Note**: The expected INCORRECT rate is higher than the original estimate (5-10%) because the automated checks caught many city mismatch errors that would otherwise have required manual research.

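The quality target above can be checked mechanically once `validation_status` has been filled in. The following is a minimal sketch, not the logic of `scripts/check_validation_progress.py`: the helper names are hypothetical, and the `PENDING` label for blank cells is an assumption made for illustration. Only the `validation_status` column and the 50%/45% thresholds come from this report.

```python
import csv
from collections import Counter
from typing import Iterable, Mapping


def summarize_statuses(rows: Iterable[Mapping[str, str]]) -> dict[str, float]:
    """Return each validation_status as a share of all rows.

    Blank or missing cells are counted under the hypothetical label 'PENDING'.
    """
    counts = Counter(
        (row.get("validation_status") or "").strip().upper() or "PENDING"
        for row in rows
    )
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()} if total else {}


def meets_quality_target(shares: Mapping[str, float]) -> bool:
    """Quality target from this report: >=50% CORRECT and <=45% INCORRECT."""
    return shares.get("CORRECT", 0.0) >= 0.50 and shares.get("INCORRECT", 0.0) <= 0.45


def summarize_csv(path: str) -> dict[str, float]:
    """Read a reviewed CSV and summarize its validation_status column."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize_statuses(csv.DictReader(handle))
```

Calling `meets_quality_target(summarize_csv("data/review/denmark_wikidata_fuzzy_matches_flagged.csv"))` after the manual review would indicate whether the reviewed file falls inside the target band. Shares are computed over all rows, so unreviewed rows lower the CORRECT share until they are filled in.
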

---

## 🎓 Validation Decision Guide

### Mark as INCORRECT if:
- ✅ City names differ (e.g., Skive vs Randers)
- ✅ Institution type differs (library vs museum)
- ✅ Gymnasium library matched to public library
- ✅ Name similarity <50% with no other confirmation

### Mark as CORRECT if:
- ✅ ISIL codes match (authoritative)
- ✅ Branch relationship confirmed on Wikidata (P361)
- ✅ Same institution, different language (Danish/English)
- ✅ Name similarity >70% AND city matches

### Mark as UNCERTAIN if:
- ⚠️ Cannot determine branch vs main relationship
- ⚠️ Historical name change unclear
- ⚠️ No clear evidence either way

---

## 📁 Files Generated

### Input
- `data/review/denmark_wikidata_fuzzy_matches.csv` (original, 42 KB)

### Output
- `data/review/denmark_wikidata_fuzzy_matches_flagged.csv` (with spot check results, 57 KB)

### Scripts
- `scripts/spot_check_fuzzy_matches_fast.py` (pattern-based detection)
- `scripts/apply_wikidata_validation.py` (apply results after manual review)
- `scripts/check_validation_progress.py` (progress tracking)

---

## 🚀 Quick Start Command

```bash
# Open flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv

# Sort by spot_check_issues column (🚨 city mismatches first)
# Mark all city mismatches as INCORRECT (validation_status = INCORRECT)
# Review remaining flagged rows
# Spot check "OK" rows in Priority 1-2
# Save CSV

# Apply validation
python scripts/apply_wikidata_validation.py

# Check progress
python scripts/check_validation_progress.py
```

---

## 🎯 Success Criteria

After manual review:

- [ ] All 71 city mismatches marked INCORRECT
- [ ] All 11 low similarity cases reviewed
- [ ] All 7 gymnasium libraries reviewed
- [ ] Priority 1-2 "OK" rows spot-checked
- [ ] At least 150/185 (81%) rows have validation_status
- [ ] At least 100/185 (54%) rows have validation_notes
- [ ] Apply script runs successfully
- [ ] Final dataset has <100 INCORRECT removals

---

## 📝 Sample Validation Notes

### For City Mismatches (INCORRECT)
```
Automated spot check detected city mismatch: our institution in Skive vs Wikidata entity in Randers. Different local historical archives.
```

### For Low Similarity (needs judgment)
```
Low name similarity (58%). Checked Wikidata - "Campus Vejle" is campus library branch, Wikidata entry is main public library. Different institutions. INCORRECT.
```

### For Gymnasium (INCORRECT)
```
School library (gymnasium) incorrectly matched to public library. Wikidata P31 shows "public library" but ours is "school library". INCORRECT.
```

### For Branch Relationships (CORRECT)
```
Branch library matched to main library. Checked Wikidata P361 - confirms branch relationship. Same institution system. CORRECT.
```

---

## 🔧 Troubleshooting

**Q: CSV won't sort by spot_check_issues?**
A: Try filtering instead - Excel/Sheets: Data → Filter → Select "🚨 City mismatch"

**Q: Too many matches to review in one session?**
A: Focus on city mismatches first (71 matches), complete in 1 session. Rest can wait.

**Q: Unsure about a match?**
A: Mark as UNCERTAIN, add detailed notes. We can research further later.

**Q: How do I know if done?**
A: Run `python scripts/check_validation_progress.py` - shows completion %

---

## 📈 Progress Tracking

Use this checklist:

```
[x] Automated spot checks run (129 flagged)
[ ] City mismatches reviewed (0/71)
[ ] Low similarity reviewed (0/11)
[ ] Gymnasium libraries reviewed (0/7)
[ ] Other flagged reviewed (0/40)
[ ] No-issues spot checked (0/30)
[ ] Validation applied
[ ] RDF re-exported

Current Progress: 0/159 (0%)
```

---

**Last Updated**: 2025-11-19
**Generated By**: `scripts/spot_check_fuzzy_matches_fast.py`
**Review CSV**: `data/review/denmark_wikidata_fuzzy_matches_flagged.csv`