# Automated Spot Check Results - Danish Wikidata Fuzzy Matches

**Date**: 2025-11-19
**Method**: Fast pattern-based detection (no Wikidata API queries)
**Total Matches**: 185
**Flagged Issues**: 129 (69.7%)
**No Issues**: 56 (30.3%)

---

## 🎯 Executive Summary

Automated checks identified **71 obvious city mismatches** that are almost certainly INCORRECT matches. These can be marked INCORRECT without manual research, which takes more than half of the 129 flagged rows out of the detailed-review queue.

### Key Findings

| Category | Count | Confidence | Action |
|----------|-------|------------|--------|
| 🚨 **City Mismatches** | **71** | Very High | Mark INCORRECT immediately |
| 🔍 Kombi Library Mismatches | 1 | Moderate | Needs judgment |
| 🔍 Low Name Similarity (<60%) | 11 | Moderate | Needs judgment |
| 🔍 Gymnasium Libraries | 7 | Moderate | Usually INCORRECT |
| 🔍 Other Name Pattern Issues | 39 | Low | Needs case-by-case review |
| ✅ No Issues Detected | 56 | N/A | Spot check Priority 1-2 only |

---

## 🚨 Priority 1: City Mismatches (71 matches) - MARK AS INCORRECT

**Confidence**: 95%+ that these are wrong matches
**Time Required**: ~71 minutes (1 min each)
**Action**: Open CSV, filter for "🚨 City mismatch", mark `validation_status` = `INCORRECT`

### Examples:

1. **Gladsaxe Bibliotekerne** (Søborg) matched to **Gentofte Bibliotekerne** (Gentofte)
   - Different cities → INCORRECT

2. **Fur Lokalhistoriske Arkiv** (Skive) matched to **Randers Lokalhistoriske Arkiv** (Randers)
   - Different cities → INCORRECT

3. **Rysensteen Gymnasium** (København V) matched to **Greve Gymnasium** (Greve)
   - Different cities → INCORRECT

4. **Multiple "X Lokalhistoriske Arkiv"** matched to **Randers Lokalhistoriske Arkiv**
   - Algorithm confused similar names in different cities → INCORRECT

**Pattern**: The fuzzy matching algorithm paired institutions whose names share a generic part (for example "Lokalhistoriske Arkiv" or "Bibliotekerne") even though they are located in different cities. These are distinct institutions.

**Validation Notes Template**:
```
City mismatch detected by automated spot check: our institution in [City A] but Wikidata entity in [City B]. Different institutions.
```
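
The rule behind this flag, and the way the template gets filled in, can be sketched in a few lines of Python. This is a minimal illustration, not the logic of `scripts/spot_check_fuzzy_matches_fast.py`: the function name `city_mismatch_note` and the case-insensitive comparison are assumptions, and only the note wording comes from the template. A plain string comparison like this would also flag "København V" against "København", so a real check would likely need some normalization of district suffixes.

```python
from typing import Optional


def city_mismatch_note(our_city: str, wikidata_city: str) -> Optional[str]:
    """Return a filled-in validation note when the cities differ, else None."""
    # Assumption: cities are compared case-insensitively after trimming whitespace.
    if our_city.strip().casefold() == wikidata_city.strip().casefold():
        return None
    return (
        "City mismatch detected by automated spot check: "
        f"our institution in {our_city} but Wikidata entity in {wikidata_city}. "
        "Different institutions."
    )


# Example from this report: Fur Lokalhistoriske Arkiv vs Randers Lokalhistoriske Arkiv.
print(city_mismatch_note("Skive", "Randers"))
```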

---

## 🔍 Priority 2: Low Name Similarity (11 matches) - NEEDS JUDGMENT

**Confidence**: 60-70% likely INCORRECT
**Time Required**: ~22 minutes (2 min each)
**Action**: Review each match and verify it against the Wikidata page

### Examples:

1. **Campus Vejle, Biblioteket** (58% similarity) vs **Vejle Bibliotek**
   - Possibly a campus branch vs the main library?
   - Check the Wikidata P361 (part of) property

2. **Lunds stadsbibliotek** vs **Billund Bibliotek**
   - Very different names, likely a wrong match
   - "Lunds stadsbibliotek" is Swedish ("Lund city library"), which suggests a Swedish rather than a Danish institution

**Validation Steps**:
1. Visit `wikidata_url`
2. Check P131 (located in the administrative territorial entity) - does the city match?
3. Check P361 (part of) - is one a branch of the other?
4. Mark CORRECT if a branch/main relationship is confirmed, INCORRECT if the institutions are unrelated
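
To get a feel for what a "low name similarity" flag means, the comparison can be approximated with Python's standard `difflib`. This is only a sketch: the report does not say which similarity metric the spot-check script uses, so the scores printed here will not necessarily reproduce the 58% quoted above. The function names are hypothetical; the 60% threshold is the one used in this report.

```python
from difflib import SequenceMatcher

LOW_SIMILARITY_THRESHOLD = 0.60  # the report flags name similarity below 60%


def name_similarity(our_name: str, wikidata_name: str) -> float:
    """Similarity ratio in [0.0, 1.0], ignoring case."""
    return SequenceMatcher(None, our_name.casefold(), wikidata_name.casefold()).ratio()


def is_low_similarity(our_name: str, wikidata_name: str) -> bool:
    """True when the pair falls below the report's 60% threshold."""
    return name_similarity(our_name, wikidata_name) < LOW_SIMILARITY_THRESHOLD


# The two example pairs from this section.
for ours, theirs in [
    ("Campus Vejle, Biblioteket", "Vejle Bibliotek"),
    ("Lunds stadsbibliotek", "Billund Bibliotek"),
]:
    print(f"{ours!r} vs {theirs!r}: {name_similarity(ours, theirs):.0%}")
```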

---

## 🏫 Priority 3: Gymnasium Libraries (7 matches) - USUALLY INCORRECT

**Confidence**: 70-80% likely INCORRECT
**Time Required**: ~14 minutes (2 min each)
**Action**: Verify whether each side is a school library or a public library

### Pattern:

**Our Name**: "[School Name] Gymnasium, Biblioteket"
**Wikidata**: "[City Name] Bibliotek" (public library)

**Issue**: School libraries matched to public libraries in the same city.

**Examples**:
- Fredericia Gymnasium, Biblioteket → Fredericia Bibliotek
- Viborg Handelsskole, Biblioteket → Viborg Bibliotek

**Check**:
1. Visit the Wikidata page
2. Look at P31 (instance of) - it should show "public library" or "school library"
3. If Wikidata is a public library and ours is a gymnasium library → INCORRECT
4. If Wikidata is also a school library → CORRECT
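
The check above boils down to a small rule, sketched here in Python. Everything in it is hypothetical apart from the rule itself: the marker words are taken from the two examples in this section, the P31 labels are assumed to be copied by hand from the Wikidata page (the spot check makes no API queries), and `judge_school_match` is an illustrative name.

```python
from typing import Iterable

# Assumption: markers derived from the two examples above; extend as needed.
SCHOOL_MARKERS = ("gymnasium", "handelsskole")


def looks_like_school_library(our_name: str) -> bool:
    """True when our institution name contains a school marker."""
    lowered = our_name.casefold()
    return any(marker in lowered for marker in SCHOOL_MARKERS)


def judge_school_match(our_name: str, p31_labels: Iterable[str]) -> str:
    """Apply the P31 check to one match and return a validation status."""
    labels = {label.casefold() for label in p31_labels}
    if not looks_like_school_library(our_name):
        return "UNCERTAIN"  # the gymnasium rule does not apply to this row
    if "school library" in labels:
        return "CORRECT"
    if "public library" in labels:
        return "INCORRECT"
    return "UNCERTAIN"  # P31 missing, or a type this check does not cover


print(judge_school_match("Fredericia Gymnasium, Biblioteket", ["public library"]))
```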

---

## 🔍 Priority 4: Other Flagged Issues (40 matches) - CASE BY CASE

**Confidence**: Varies
**Time Required**: ~80 minutes (2 min each)
**Action**: Review based on specific issue type

**Issue Types** (39 name-pattern issues plus 1 Kombi library mismatch):
- Branch suffix ", Biblioteket" in our name
- First word differs (possible city mismatch)
- Low score (<87%) without ISIL confirmation
- Kombi library location mismatches

**Approach**: Follow the validation checklist for each match.

---

## ✅ Priority 5: No Issues Detected (56 matches) - LOWER PRIORITY

**Confidence**: 80-90% likely CORRECT
**Time Required**: ~28 minutes if all 56 rows are spot-checked (0.5 min each); ~15-20 minutes for the recommended subset
**Action**: Spot check rows with `priority` 1-2 in the CSV, skip rows with `priority` 3-5

These matches passed all automated checks:
- Cities match or no conflict detected
- Names reasonably similar (>60%)
- No obvious type mismatches
- No problematic patterns

**Recommendation**:
- Review "no issues" rows with `priority` 1-2 (30-40 matches)
- Skip "no issues" rows with `priority` 3-5 (high confidence)
- Estimated time: ~15-20 minutes

---

## ⏱️ Time Estimates

| Task | Matches | Time/Match | Total Time |
|------|---------|------------|------------|
| **City Mismatches** (mark INCORRECT) | 71 | 1 min | 71 min |
| **Low Similarity** (review) | 11 | 2 min | 22 min |
| **Gymnasium Libraries** (review) | 7 | 2 min | 14 min |
| **Other Flagged** (review) | 40 | 2 min | 80 min |
| **No Issues P1-2** (spot check) | 30 | 0.5 min | 15 min |
| **TOTAL** | **159** | **avg 1.3 min** | **202 min (~3.4 hours)** |

**Original Estimate**: 5-8 hours for all 185 matches
**Revised with Automation**: ~3.4 hours (roughly 33-58% time savings, depending on which end of the original 5-8 hour range is used)

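
The totals and the savings range can be re-derived from the per-task rows. The short script below is just that arithmetic; the counts and per-match times are copied from the table, and the variable names are illustrative.

```python
# (task, matches, minutes per match), copied from the table above
TASKS = [
    ("City mismatches", 71, 1.0),
    ("Low similarity", 11, 2.0),
    ("Gymnasium libraries", 7, 2.0),
    ("Other flagged", 40, 2.0),
    ("No issues P1-2", 30, 0.5),
]

total_matches = sum(count for _, count, _ in TASKS)
total_minutes = sum(count * minutes for _, count, minutes in TASKS)
total_hours = total_minutes / 60
average_minutes = total_minutes / total_matches

print(f"{total_matches} matches, {total_minutes:.0f} min, {total_hours:.1f} h")
print(f"average {average_minutes:.1f} min per match")

# Savings relative to each end of the original 5-8 hour estimate
for original_hours in (5, 8):
    savings = 1 - total_hours / original_hours
    print(f"vs {original_hours} h: {savings:.0%} saved")
```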

---

## 📋 Step-by-Step Workflow

### Step 1: Open Flagged CSV

```bash
# `open` is the macOS launcher; on Linux, `xdg-open` does the same job
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv
```

### Step 2: Mark City Mismatches (71 matches, ~71 min)

1. Sort by `spot_check_issues` column
2. Filter for rows containing "🚨 City mismatch"
3. For each row:
   - Fill `validation_status` = `INCORRECT`
   - Fill `validation_notes` = `City mismatch: [our city] vs [Wikidata city], different institutions`
4. Save CSV

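
If editing 71 rows by hand is too tedious, the same marking could be scripted. The sketch below is an assumption-laden illustration, not part of the existing tooling: it relies only on the `spot_check_issues`, `validation_status`, and `validation_notes` columns named in this report, uses a hypothetical `name` column in its sample data, and runs against an in-memory stand-in rather than the real file.

```python
import csv
import io

CITY_FLAG = "City mismatch"  # substring of the "🚨 City mismatch" label


def mark_city_mismatches(rows: list) -> int:
    """Mark unreviewed city-mismatch rows as INCORRECT in place; return how many."""
    marked = 0
    for row in rows:
        flagged = CITY_FLAG in (row.get("spot_check_issues") or "")
        if flagged and not row.get("validation_status"):
            row["validation_status"] = "INCORRECT"
            # The report does not name the city columns, so the cities stay as
            # placeholders for the reviewer to fill in.
            row["validation_notes"] = (
                "City mismatch: [our city] vs [Wikidata city], different institutions"
            )
            marked += 1
    return marked


# Tiny in-memory stand-in for the flagged CSV ("name" is a hypothetical column).
sample = io.StringIO(
    "name,spot_check_issues,validation_status,validation_notes\n"
    "Fur Lokalhistoriske Arkiv,🚨 City mismatch,,\n"
    '"Campus Vejle, Biblioteket",Low name similarity,,\n'
)
rows = list(csv.DictReader(sample))
print(mark_city_mismatches(rows), "row(s) marked")
```

Rows that already carry a `validation_status` are skipped, so running the function twice does not overwrite manual decisions. To use it on the real file, the rows would be read with `csv.DictReader` from a file opened with `newline=""` and `encoding="utf-8"`, then written back with `csv.DictWriter`.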
### Step 3: Review Low Similarity (11 matches, 22 min)

1. Filter for "Low name similarity"
2. For each row:
   - Click `wikidata_url`
   - Check P131 (location), P361 (part of)
   - Decide: CORRECT (branch/main) or INCORRECT (different)
   - Fill `validation_status` and `validation_notes`

### Step 4: Review Gymnasium Libraries (7 matches, 14 min)

1. Filter for "Gymnasium"
2. For each row:
   - Click `wikidata_url`
   - Check P31 (instance of) - public vs school library?
   - If mismatch → INCORRECT
   - Fill `validation_status` and `validation_notes`

### Step 5: Review Other Flagged (40 matches, 80 min)

1. Filter for remaining `REVIEW_URGENT` rows
2. Follow validation checklist for each
3. Fill `validation_status` and `validation_notes`

### Step 6: Spot Check "No Issues" (30 matches, 15 min)

1. Filter for `auto_flag = OK` AND `priority IN (1, 2)`
2. Quick review (30 sec each):
   - Names look similar? → CORRECT
   - Any obvious issues? → INCORRECT
3. Fill `validation_status`

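
The filter in item 1 of Step 6 can be expressed as a small predicate over CSV rows. This is a sketch under the assumption that `auto_flag` and `priority` are plain columns in the flagged CSV, as the filter expression suggests; the function name `needs_spot_check` and the sample rows are made up for illustration.

```python
def needs_spot_check(row: dict) -> bool:
    """True for rows that passed the automated checks and have priority 1 or 2."""
    # csv.DictReader yields strings, so priority is compared as text.
    priority = str(row.get("priority", "")).strip()
    return row.get("auto_flag") == "OK" and priority in {"1", "2"}


# Made-up rows covering the three cases: selected, wrong priority, flagged.
rows = [
    {"auto_flag": "OK", "priority": "1"},
    {"auto_flag": "OK", "priority": "4"},
    {"auto_flag": "REVIEW_URGENT", "priority": "2"},
]
to_check = [row for row in rows if needs_spot_check(row)]
print(len(to_check), "row(s) to spot check")
```

Only the first sample row is selected: the second has the wrong priority and the third is flagged for review rather than marked OK.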
### Step 7: Apply Validation

```bash
python scripts/apply_wikidata_validation.py
```

### Step 8: Check Results

```bash
python scripts/check_validation_progress.py
```

---

## 📊 Expected Validation Results

Based on automated spot check findings:

| Status | Expected Count | % of Total | Notes |
|--------|----------------|------------|-------|
| **CORRECT** | 100-110 | 54-59% | No-issues matches + verified |
| **INCORRECT** | 70-80 | 38-43% | City mismatches + other errors |
| **UNCERTAIN** | 5-10 | 3-5% | Ambiguous cases |

**Quality Target**: ≥50% CORRECT, ≤45% INCORRECT (acceptable given fuzzy matching)

**Note**: The expected INCORRECT rate is much higher than the original estimate (5-10%). The automated checks surfaced many city mismatches that would otherwise have been found only through manual research.

---

## 🎓 Validation Decision Guide

### Mark as INCORRECT if:
- ✅ City names differ (e.g., Skive vs Randers)
- ✅ Institution type differs (library vs museum)
- ✅ Gymnasium library matched to public library
- ✅ Name similarity <50% with no other confirmation

### Mark as CORRECT if:
- ✅ ISIL codes match (authoritative)
- ✅ Branch relationship confirmed on Wikidata (P361)
- ✅ Same institution, different language (Danish/English)
- ✅ Name similarity >70% AND city matches

### Mark as UNCERTAIN if:
- ⚠️ Cannot determine branch vs main relationship
- ⚠️ Historical name change unclear
- ⚠️ No clear evidence either way
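
The three lists above can be read as a decision procedure. The sketch below is one possible encoding, with an important caveat: the guide does not rank its rules, so the precedence used here (ISIL or confirmed branch first, then hard mismatches, then similarity thresholds) is an assumption, as are the `Evidence` fields and the `decide` function name. The 50% and 70% thresholds come from the guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    name_similarity: float             # 0.0-1.0
    isil_match: bool = False           # ISIL codes agree (authoritative)
    branch_confirmed: bool = False     # P361 relationship seen on Wikidata
    city_match: Optional[bool] = None  # None means not checked or unknown
    type_match: Optional[bool] = None  # e.g. school vs public library


def decide(e: Evidence) -> str:
    """Return CORRECT, INCORRECT, or UNCERTAIN for one match."""
    # Assumed precedence: positive confirmation outranks every other signal.
    if e.isil_match or e.branch_confirmed:
        return "CORRECT"
    # An explicit mismatch (False, not None) in city or type marks it wrong.
    if e.city_match is False or e.type_match is False:
        return "INCORRECT"
    if e.name_similarity > 0.70 and e.city_match:
        return "CORRECT"
    if e.name_similarity < 0.50:
        return "INCORRECT"  # low similarity and no other confirmation
    return "UNCERTAIN"


print(decide(Evidence(name_similarity=0.58, city_match=True)))
```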

---

## 📁 Files Generated

### Input
- `data/review/denmark_wikidata_fuzzy_matches.csv` (original, 42 KB)

### Output
- `data/review/denmark_wikidata_fuzzy_matches_flagged.csv` (with spot check results, 57 KB)

### Scripts
- `scripts/spot_check_fuzzy_matches_fast.py` (pattern-based detection)
- `scripts/apply_wikidata_validation.py` (apply results after manual review)
- `scripts/check_validation_progress.py` (progress tracking)

---

## 🚀 Quick Start Command

```bash
# Open flagged CSV
open data/review/denmark_wikidata_fuzzy_matches_flagged.csv

# Sort by spot_check_issues column (🚨 city mismatches first)
# Mark all city mismatches as INCORRECT (validation_status = INCORRECT)
# Review remaining flagged rows
# Spot check "OK" rows in Priority 1-2
# Save CSV

# Apply validation
python scripts/apply_wikidata_validation.py

# Check progress
python scripts/check_validation_progress.py
```

---

## 🎯 Success Criteria

After manual review:

- [ ] All 71 city mismatches marked INCORRECT
- [ ] All 11 low similarity cases reviewed
- [ ] All 7 gymnasium libraries reviewed
- [ ] Priority 1-2 "OK" rows spot-checked
- [ ] At least 150/185 (81%) rows have `validation_status`
- [ ] At least 100/185 (54%) rows have `validation_notes`
- [ ] Apply script runs successfully
- [ ] Final dataset has <100 INCORRECT removals
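
The three numeric criteria in this checklist are easy to verify mechanically. The following sketch is hypothetical and separate from `scripts/check_validation_progress.py`, whose output format this report does not describe; it assumes the reviewed CSV has been loaded as a list of dicts, and the sample rows are synthetic.

```python
def check_success(rows: list) -> dict:
    """Evaluate the row-count criteria above against reviewed CSV rows."""
    with_status = sum(1 for row in rows if row.get("validation_status"))
    with_notes = sum(1 for row in rows if row.get("validation_notes"))
    incorrect = sum(1 for row in rows if row.get("validation_status") == "INCORRECT")
    return {
        "status_coverage": with_status >= 150,  # at least 150/185 rows
        "notes_coverage": with_notes >= 100,    # at least 100/185 rows
        "incorrect_removals": incorrect < 100,  # fewer than 100 removals
    }


# Synthetic rows for illustration only, not real review data.
reviewed = (
    [{"validation_status": "INCORRECT", "validation_notes": "City mismatch"}] * 71
    + [{"validation_status": "CORRECT", "validation_notes": "ISIL match"}] * 29
    + [{"validation_status": "CORRECT", "validation_notes": ""}] * 50
    + [{"validation_status": "", "validation_notes": ""}] * 35
)
print(check_success(reviewed))
```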

---

## 📝 Sample Validation Notes

### For City Mismatches (INCORRECT)
```
Automated spot check detected city mismatch: our institution in Skive vs Wikidata entity in Randers. Different local historical archives.
```

### For Low Similarity (needs judgment)
```
Low name similarity (58%). Checked Wikidata - "Campus Vejle" is campus library branch, Wikidata entry is main public library. Different institutions. INCORRECT.
```

### For Gymnasium (INCORRECT)
```
School library (gymnasium) incorrectly matched to public library. Wikidata P31 shows "public library" but ours is "school library". INCORRECT.
```

### For Branch Relationships (CORRECT)
```
Branch library matched to main library. Checked Wikidata P361 - confirms branch relationship. Same institution system. CORRECT.
```

---

## 🔧 Troubleshooting

**Q: CSV won't sort by spot_check_issues?**
A: Try filtering instead - Excel/Sheets: Data → Filter → Select "🚨 City mismatch"

**Q: Too many matches to review in one session?**
A: Focus on the city mismatches first (71 matches); they can be completed in one session. The rest can wait.

**Q: Unsure about a match?**
A: Mark it as UNCERTAIN and add detailed notes. It can be researched further later.

**Q: How do I know if I'm done?**
A: Run `python scripts/check_validation_progress.py` - it shows the completion percentage.

---

## 📈 Progress Tracking

Use this checklist:

```
[x] Automated spot checks run (129 flagged)
[ ] City mismatches reviewed (0/71)
[ ] Low similarity reviewed (0/11)
[ ] Gymnasium libraries reviewed (0/7)
[ ] Other flagged reviewed (0/40)
[ ] No-issues spot checked (0/30)
[ ] Validation applied
[ ] RDF re-exported

Current Progress: 0/159 (0%)
```

---

**Last Updated**: 2025-11-19
**Generated By**: `scripts/spot_check_fuzzy_matches_fast.py`
**Review CSV**: `data/review/denmark_wikidata_fuzzy_matches_flagged.csv`