glam/docs/AUTOMATED_SPOT_CHECK_RESULTS.md
2025-11-21 22:12:33 +01:00

# Automated Spot Check Results - Danish Wikidata Fuzzy Matches
**Date**: 2025-11-19
**Method**: Fast pattern-based detection (no Wikidata API queries)
**Total Matches**: 185
**Flagged Issues**: 129 (69.7%)
**No Issues**: 56 (30.3%)
---
## 🎯 Executive Summary
Automated checks identified **71 obvious city mismatches** that are almost certainly INCORRECT matches. These can be marked INCORRECT without manual research, which cuts the estimated review time from 5-8 hours to about 3.4 hours.
### Key Findings
| Category | Count | Confidence | Action |
|----------|-------|------------|--------|
| 🚨 **City Mismatches** | **71** | Very High | Mark INCORRECT immediately |
| 🔍 Kombi Library Mismatches | 1 | Moderate | Needs judgment |
| 🔍 Low Name Similarity (<60%) | 11 | Moderate | Needs judgment |
| 🔍 Gymnasium Libraries | 7 | Moderate | Usually INCORRECT |
| 🔍 Other Name Pattern Issues | 39 | Low | Needs case-by-case review |
| No Issues Detected | 56 | N/A | Spot check Priority 1-2 only |
---
## 🚨 Priority 1: City Mismatches (71 matches) - MARK AS INCORRECT
**Confidence**: 95%+ likely that these are wrong matches
**Time Required**: ~71 minutes (1 min each)
**Action**: Open CSV, filter for "🚨 City mismatch", mark validation_status = INCORRECT
### Examples:
1. **Gladsaxe Bibliotekerne** (Søborg) matched to **Gentofte Bibliotekerne** (Gentofte)
- Different cities → INCORRECT
2. **Fur Lokalhistoriske Arkiv** (Skive) matched to **Randers Lokalhistoriske Arkiv** (Randers)
- Different cities → INCORRECT
3. **Rysensteen Gymnasium** (København V) matched to **Greve Gymnasium** (Greve)
- Different cities → INCORRECT
4. **Multiple "X Lokalhistoriske Arkiv"** matched to **Randers Lokalhistoriske Arkiv**
- Algorithm confused similar names in different cities → INCORRECT
**Pattern**: The fuzzy matching algorithm matched institutions with similar names but in completely different cities. These are clearly distinct institutions.
**Validation Notes Template**:
```
City mismatch detected by automated spot check: our institution in [City A] but Wikidata entity in [City B]. Different institutions.
```
---
## 🔍 Priority 2: Low Name Similarity (11 matches) - NEEDS JUDGMENT
**Confidence**: 60-70% likely INCORRECT
**Time Required**: ~22 minutes (2 min each)
**Action**: Review each, check Wikidata page for verification
### Examples:
1. **Campus Vejle, Biblioteket** (58% similarity) vs **Vejle Bibliotek**
- Possibly campus branch vs main library?
- Check Wikidata P361 (part of) property
2. **Lunds stadsbibliotek** vs **Billund Bibliotek**
- Very different names, likely wrong match
- "Lunds" suggests Sweden, not Denmark?
**Validation Steps**:
1. Visit wikidata_url
2. Check P131 (located in) - does city match?
3. Check P361 (part of) - is one a branch of the other?
4. Mark CORRECT if branch/main relationship, INCORRECT if completely different
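The P131 and P361 checks above can also be read from a Wikidata entity's JSON instead of the web page. The following is a minimal sketch, assuming the standard Wikibase entity JSON layout (`claims`, then property, then `mainsnak` and `datavalue`). The helper names `claim_ids` and `_statement`, and the sample entity with its placeholder Q-ids, are hypothetical; no API call is made and this is not part of the existing scripts.

```python
def claim_ids(entity, prop):
    """Return the item ids (Q-ids) stated for `prop` on a Wikidata entity dict."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" statements
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def _statement(qid):
    # Hypothetical helper: builds one item-valued statement for the sample below.
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item", "id": qid}}}}


# Placeholder entity; the Q-ids are made up for illustration only.
sample_entity = {"claims": {"P131": [_statement("Q_CITY")],
                            "P361": [_statement("Q_MAIN_LIBRARY")]}}

located_in = claim_ids(sample_entity, "P131")  # step 2: does the city match?
part_of = claim_ids(sample_entity, "P361")     # step 3: is it part of another entity?
print(located_in, part_of)
```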
---
## 🏫 Priority 3: Gymnasium Libraries (7 matches) - USUALLY INCORRECT
**Confidence**: 70-80% likely INCORRECT
**Time Required**: ~14 minutes (2 min each)
**Action**: Verify if school library vs public library
### Pattern:
**Our Name**: "[School Name] Gymnasium, Biblioteket"
**Wikidata**: "[City Name] Bibliotek" (public library)
**Issue**: School libraries matched to public libraries in same city.
**Examples**:
- Fredericia Gymnasium, Biblioteket → Fredericia Bibliotek
- Viborg Handelsskole, Biblioteket → Viborg Bibliotek
**Check**:
1. Visit Wikidata page
2. Look for P31 (instance of) - should show "public library" or "school library"
3. If Wikidata is public library and ours is gymnasium → INCORRECT
4. If Wikidata is also school library → CORRECT
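The check above can be written down as a small rule. This is a sketch only: the function name `gymnasium_verdict`, the `SCHOOL_MARKERS` tuple, and the assumption that the P31 labels have already been collected by the reviewer as English strings are all hypothetical.

```python
# Substrings that mark our record as a school library (assumed list)
SCHOOL_MARKERS = ("gymnasium", "handelsskole")


def gymnasium_verdict(our_name, wikidata_p31_labels):
    """Suggest a validation_status from our name and the Wikidata P31 labels."""
    ours_is_school = any(marker in our_name.lower() for marker in SCHOOL_MARKERS)
    labels = {label.lower() for label in wikidata_p31_labels}
    if ours_is_school and "school library" in labels:
        return "CORRECT"      # both sides are school libraries
    if ours_is_school and "public library" in labels:
        return "INCORRECT"    # school library matched to a public library
    return "UNCERTAIN"        # P31 gives no clear answer; needs manual judgment


print(gymnasium_verdict("Fredericia Gymnasium, Biblioteket", ["public library"]))
```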
---
## 🔍 Priority 4: Other Flagged Issues (40 matches) - CASE BY CASE
**Confidence**: Varies
**Time Required**: ~80 minutes (2 min each)
**Action**: Review based on specific issue type
**Issue Types** (39 name-pattern issues plus 1 Kombi library mismatch):
- Branch suffix ", Biblioteket" in our name
- First word differs (possible city mismatch)
- Low score (<87%) without ISIL confirmation
- Kombi library location mismatches
**Approach**: Follow validation checklist for each match.
---
## ✅ Priority 5: No Issues Detected (56 matches) - LOWER PRIORITY
**Confidence**: 80-90% likely CORRECT
**Time Required**: ~15 minutes for the recommended subset (~28 minutes if all 56 are checked at 0.5 min each)
**Action**: Spot check matches whose CSV `priority` is 1-2, skip those with `priority` 3-5
These matches passed all automated checks:
- Cities match or no conflict detected
- Names reasonably similar (>60%)
- No obvious type mismatches
- No problematic patterns
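The name-similarity check can be approximated as follows. This is a sketch under assumptions: it uses Python's `difflib.SequenceMatcher`, which may not be the metric `scripts/spot_check_fuzzy_matches_fast.py` actually uses, so scores can differ from the percentages quoted in this report. The function names are hypothetical; the 60% threshold is the one stated above.

```python
from difflib import SequenceMatcher

LOW_SIMILARITY_THRESHOLD = 60.0  # percent, from the "<60%" flag in this report


def name_similarity(ours, theirs):
    """Case-insensitive similarity of two names as a percentage (0-100)."""
    return 100.0 * SequenceMatcher(None, ours.lower(), theirs.lower()).ratio()


def is_low_similarity(ours, theirs):
    """True when the pair would be flagged for manual judgment."""
    return name_similarity(ours, theirs) < LOW_SIMILARITY_THRESHOLD


score = name_similarity("Campus Vejle, Biblioteket", "Vejle Bibliotek")
print(f"{score:.0f}% similar, flagged: {score < LOW_SIMILARITY_THRESHOLD}")
```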
**Recommendation**:
- Review "no issues" matches with CSV `priority` 1-2 (30-40 matches)
- Skip "no issues" matches with CSV `priority` 3-5 (high confidence)
- Estimated time: ~15-20 minutes
---
## ⏱️ Time Estimates
| Task | Matches | Time/Match | Total Time |
|------|---------|------------|------------|
| **City Mismatches** (mark INCORRECT) | 71 | 1 min | 71 min |
| **Low Similarity** (review) | 11 | 2 min | 22 min |
| **Gymnasium Libraries** (review) | 7 | 2 min | 14 min |
| **Other Flagged** (review) | 40 | 2 min | 80 min |
| **No Issues P1-2** (spot check) | 30 | 0.5 min | 15 min |
| **────────────────** | **───** | **─────** | **──────** |
| **TOTAL** | **159** | **avg 1.3 min** | **~3.4 hours** |
**Original Estimate**: 5-8 hours for all 185 matches
**Revised with Automation**: ~3.4 hours (about 58% less than the 8-hour upper bound of the original estimate, about 33% less than the 5-hour lower bound)
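The totals in the table can be re-derived from its rows. The snippet below only repeats the arithmetic; the dictionary keys are shorthand labels, not identifiers from the scripts.

```python
# (matches, minutes per match) for each row of the table above
tasks = {
    "City mismatches": (71, 1.0),
    "Low similarity": (11, 2.0),
    "Gymnasium libraries": (7, 2.0),
    "Other flagged": (40, 2.0),
    "No issues, priority 1-2": (30, 0.5),
}

total_matches = sum(n for n, _ in tasks.values())
total_minutes = sum(n * per_match for n, per_match in tasks.values())
total_hours = total_minutes / 60
avg_minutes = total_minutes / total_matches

print(total_matches, total_minutes)  # 159 202.0
print(round(total_hours, 1))         # 3.4
print(round(avg_minutes, 1))         # 1.3

# Savings relative to each end of the original 5-8 hour estimate
for original_hours in (5, 8):
    saving = 1 - total_hours / original_hours
    print(f"vs {original_hours} h: {saving:.0%}")
```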
---
## 📋 Step-by-Step Workflow
### Step 1: Open Flagged CSV
```bash
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
```
### Step 2: Mark City Mismatches (71 matches, 1 hour)
1. Sort by `spot_check_issues` column
2. Filter for rows containing "🚨 City mismatch"
3. For each row:
- Fill `validation_status` = `INCORRECT`
- Fill `validation_notes` = `City mismatch: [our city] vs [Wikidata city], different institutions`
4. Save CSV
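Step 2 can also be done in bulk instead of row by row. This is a minimal sketch, not part of the existing scripts: the column names `spot_check_issues`, `validation_status` and `validation_notes` come from this report, while the function name `mark_city_mismatches`, the `institution_name` column and the two sample rows are hypothetical.

```python
def mark_city_mismatches(rows):
    """Mark rows flagged as city mismatches INCORRECT; return how many changed."""
    changed = 0
    for row in rows:
        flagged = "City mismatch" in row.get("spot_check_issues", "")
        if flagged and not row.get("validation_status"):
            row["validation_status"] = "INCORRECT"
            row["validation_notes"] = (
                "City mismatch detected by automated spot check. "
                "Different institutions."
            )
            changed += 1
    return changed


# Two hand-written rows standing in for the flagged CSV
rows = [
    {"institution_name": "Fur Lokalhistoriske Arkiv",
     "spot_check_issues": "🚨 City mismatch",
     "validation_status": "", "validation_notes": ""},
    {"institution_name": "Campus Vejle, Biblioteket",
     "spot_check_issues": "Low name similarity",
     "validation_status": "", "validation_notes": ""},
]
print(mark_city_mismatches(rows))  # 1
```

With the real file, the rows could be loaded with `csv.DictReader` and written back with `csv.DictWriter`. Rows that already have a `validation_status` are left alone, and the generic note does not name the two cities, so those still need to be filled in per row.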
### Step 3: Review Low Similarity (11 matches, 22 min)
1. Filter for "Low name similarity"
2. For each row:
- Click `wikidata_url`
- Check P131 (location), P361 (part of)
- Decide: CORRECT (branch/main) or INCORRECT (different)
- Fill `validation_status` and `validation_notes`
### Step 4: Review Gymnasium Libraries (7 matches, 14 min)
1. Filter for "Gymnasium"
2. For each row:
- Click `wikidata_url`
- Check P31 (instance of) - public vs school library?
- If mismatch → INCORRECT
- Fill `validation_status` and `validation_notes`
### Step 5: Review Other Flagged (40 matches, 80 min)
1. Filter for remaining `REVIEW_URGENT` rows
2. Follow validation checklist for each
3. Fill `validation_status` and `validation_notes`
### Step 6: Spot Check "No Issues" (30 matches, 15 min)
1. Filter for `auto_flag = OK` AND `priority IN (1, 2)`
2. Quick review (30 sec each):
- Names look similar? → CORRECT
- Any obvious issues? → INCORRECT
3. Fill `validation_status`
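The Step 6 filter can be expressed as a short function. This is a sketch assuming `auto_flag` and `priority` are read from the CSV as strings; the function name and the `id` field in the sample rows are hypothetical.

```python
def spot_check_candidates(rows):
    """Rows with auto_flag OK and priority 1 or 2 (the Step 6 filter)."""
    return [
        row for row in rows
        if row.get("auto_flag", "").strip() == "OK"
        and row.get("priority", "").strip() in {"1", "2"}
    ]


# Hand-written rows standing in for the flagged CSV
rows = [
    {"id": "a", "auto_flag": "OK", "priority": "1"},
    {"id": "b", "auto_flag": "OK", "priority": "4"},
    {"id": "c", "auto_flag": "REVIEW_URGENT", "priority": "2"},
]
print([row["id"] for row in spot_check_candidates(rows)])  # ['a']
```

Values are compared as strings because `csv.DictReader` does not convert numbers.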
### Step 7: Apply Validation
```bash
python scripts/apply_wikidata_validation.py
```
### Step 8: Check Results
```bash
python scripts/check_validation_progress.py
```
---
## 📊 Expected Validation Results
Based on automated spot check findings:
| Status | Expected Count | % of Total | Notes |
|--------|----------------|------------|-------|
| **CORRECT** | 100-110 | 54-59% | No-issues matches + verified |
| **INCORRECT** | 70-80 | 38-43% | City mismatches + other errors |
| **UNCERTAIN** | 5-10 | 3-5% | Ambiguous cases |
**Quality Target**: ≥50% CORRECT, ≤45% INCORRECT (acceptable given fuzzy matching)
**Note**: The expected INCORRECT rate is higher than the original estimate (5-10%) because the automated checks caught many city mismatches that would otherwise have required manual research to find.
---
## 🎓 Validation Decision Guide
### Mark as INCORRECT if:
- ✅ City names differ (e.g., Skive vs Randers)
- ✅ Institution type differs (library vs museum)
- ✅ Gymnasium library matched to public library
- ✅ Name similarity <50% with no other confirmation
### Mark as CORRECT if:
- ISIL codes match (authoritative)
- Branch relationship confirmed on Wikidata (P361)
- Same institution, different language (Danish/English)
- Name similarity >70% AND city matches
### Mark as UNCERTAIN if:
- ⚠️ Cannot determine branch vs main relationship
- ⚠️ Historical name change unclear
- ⚠️ No clear evidence either way
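The guide can be condensed into one function for consistency between reviewers. This is a sketch, not an authoritative rule set: the function name, the parameter names, and above all the order of precedence (ISIL match first, then mismatches, then branch confirmation, then similarity) are assumptions. The "same institution, different language" rule is not covered because it needs human judgment.

```python
def suggest_status(*, isil_match=False, cities_differ=False, types_differ=False,
                   branch_confirmed=False, city_matches=False,
                   name_similarity=None):
    """Suggest CORRECT / INCORRECT / UNCERTAIN from reviewer-supplied evidence."""
    if isil_match:
        return "CORRECT"        # ISIL codes are treated as authoritative
    if cities_differ or types_differ:
        return "INCORRECT"      # e.g. Skive vs Randers, library vs museum
    if branch_confirmed:
        return "CORRECT"        # P361 confirms a branch/main relationship
    if name_similarity is not None:
        if name_similarity > 70 and city_matches:
            return "CORRECT"
        if name_similarity < 50:
            return "INCORRECT"  # low similarity with no other confirmation
    return "UNCERTAIN"          # no clear evidence either way


print(suggest_status(cities_differ=True))                        # INCORRECT
print(suggest_status(name_similarity=58.0, city_matches=True))   # UNCERTAIN
```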
---
## 📁 Files Generated
### Input
- `data/review/denmark_wikidata_fuzzy_matches.csv` (original, 42 KB)
### Output
- `data/review/denmark_wikidata_fuzzy_matches_flagged.csv` (with spot check results, 57 KB)
### Scripts
- `scripts/spot_check_fuzzy_matches_fast.py` (pattern-based detection)
- `scripts/apply_wikidata_validation.py` (apply results after manual review)
- `scripts/check_validation_progress.py` (progress tracking)
---
## 🚀 Quick Start Command
```bash
# Open flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
# Sort by spot_check_issues column (🚨 city mismatches first)
# Mark all city mismatches as INCORRECT (validation_status = INCORRECT)
# Review remaining flagged rows
# Spot check "OK" rows in Priority 1-2
# Save CSV
# Apply validation
python scripts/apply_wikidata_validation.py
# Check progress
python scripts/check_validation_progress.py
```
---
## 🎯 Success Criteria
After manual review:
- [ ] All 71 city mismatches marked INCORRECT
- [ ] All 11 low similarity cases reviewed
- [ ] All 7 gymnasium libraries reviewed
- [ ] Priority 1-2 "OK" rows spot-checked
- [ ] At least 150/185 (81%) rows have validation_status
- [ ] At least 100/185 (54%) rows have validation_notes
- [ ] Apply script runs successfully
- [ ] Final dataset has <100 INCORRECT removals
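Several of these criteria can be checked mechanically once the CSV is filled in. The sketch below is not `scripts/check_validation_progress.py`; the function name `check_success` and the synthetic rows are hypothetical, and the thresholds (150 rows with a status, 100 with notes, fewer than 100 INCORRECT) are the ones listed above.

```python
def check_success(rows):
    """Evaluate the countable success criteria on a list of CSV row dicts."""
    city_rows = [r for r in rows
                 if "City mismatch" in r.get("spot_check_issues", "")]
    with_status = sum(1 for r in rows if r.get("validation_status"))
    with_notes = sum(1 for r in rows if r.get("validation_notes"))
    incorrect = sum(1 for r in rows
                    if r.get("validation_status") == "INCORRECT")
    return {
        "city_mismatches_all_incorrect": all(
            r.get("validation_status") == "INCORRECT" for r in city_rows),
        "status_coverage_ok": with_status >= 150,
        "notes_coverage_ok": with_notes >= 100,
        "incorrect_under_100": incorrect < 100,
    }


# Synthetic data: 71 city mismatches marked with notes, 114 others marked without
rows = (
    [{"spot_check_issues": "🚨 City mismatch", "validation_status": "INCORRECT",
      "validation_notes": "City mismatch"} for _ in range(71)]
    + [{"spot_check_issues": "", "validation_status": "CORRECT",
        "validation_notes": ""} for _ in range(114)]
)
result = check_success(rows)
print(result)
```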
---
## 📝 Sample Validation Notes
### For City Mismatches (INCORRECT)
```
Automated spot check detected city mismatch: our institution in Skive vs Wikidata entity in Randers. Different local historical archives.
```
### For Low Similarity (needs judgment)
```
Low name similarity (58%). Checked Wikidata - "Campus Vejle" is campus library branch, Wikidata entry is main public library. Different institutions. INCORRECT.
```
### For Gymnasium (INCORRECT)
```
School library (gymnasium) incorrectly matched to public library. Wikidata P31 shows "public library" but ours is "school library". INCORRECT.
```
### For Branch Relationships (CORRECT)
```
Branch library matched to main library. Checked Wikidata P361 - confirms branch relationship. Same institution system. CORRECT.
```
---
## 🔧 Troubleshooting
**Q: CSV won't sort by spot_check_issues?**
A: Try filtering instead. In Excel/Sheets: Data → Filter → select "🚨 City mismatch"
**Q: Too many matches to review in one session?**
A: Focus on city mismatches first (71 matches), complete in 1 session. Rest can wait.
**Q: Unsure about a match?**
A: Mark as UNCERTAIN, add detailed notes. We can research further later.
**Q: How do I know if done?**
A: Run `python scripts/check_validation_progress.py` - shows completion %
---
## 📈 Progress Tracking
Use this checklist:
```
[x] Automated spot checks run (129 flagged)
[ ] City mismatches reviewed (0/71)
[ ] Low similarity reviewed (0/11)
[ ] Gymnasium libraries reviewed (0/7)
[ ] Other flagged reviewed (0/40)
[ ] No-issues spot checked (0/30)
[ ] Validation applied
[ ] RDF re-exported
Current Progress: 0/159 (0%)
```
---
**Last Updated**: 2025-11-19
**Generated By**: `scripts/spot_check_fuzzy_matches_fast.py`
**Review CSV**: `data/review/denmark_wikidata_fuzzy_matches_flagged.csv`