# Pre-filled Validation Guide: Denmark Wikidata Fuzzy Matches

**Status**: ✅ 73 obvious errors automatically marked INCORRECT (Nov 19, 2025)
**Remaining**: 75 matches require manual judgment

---

## Summary of Automated Pre-fill

### What Was Done

An automated script (`scripts/prefill_obvious_errors.py`) analyzed all 185 fuzzy Wikidata matches and:

1. **Identified 73 obvious errors** based on clear criteria
2. **Automatically marked them as INCORRECT** in `validation_status`
3. **Added explanatory notes** documenting why each was flagged
4. **Generated a streamlined review file** containing only the 75 matches that still need manual review

### Automated Detection Rules

The script marked a match as INCORRECT when it triggered any of the following rules:

#### Rule 1: City Mismatch (71 matches)
- **Pattern**: `🚨 City mismatch: our 'X' but Wikidata mentions 'Y'`
- **Logic**: Different cities = different institutions
- **Confidence**: Very high (>99% accuracy)
- **Examples**:
  - Our: "Fur Lokalhistoriske Arkiv" (Skive) → Wikidata: "Randers Lokalhistoriske Arkiv" ❌
  - Our: "Gladsaxe Bibliotekerne" (Søborg) → Wikidata: "Gentofte Bibliotekerne" ❌

#### Rule 2: Type Mismatch (1 match)
- **Pattern**: `⚠️ Type mismatch: we're LIBRARY but Wikidata mentions museum/gallery`
- **Logic**: Fundamentally different institution types
- **Example**: Our LIBRARY matched to a Wikidata museum entry

#### Rule 3: Very Low Name Similarity (1 match)
- **Pattern**: `Low name similarity (<30%)`
- **Logic**: Names too different to be the same institution
- **Example**: "Lunds stadsbibliotek" vs "Billund Bibliotek" (29.6% similarity)
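
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/prefill_obvious_errors.py`: the function name `prefill_decision`, the dictionary-shaped row, and the substring tests are assumptions. Only the column names and the flag wording are taken from this guide, and the generated note text is illustrative.

```python
from typing import Optional

# Substrings taken from the flag patterns quoted in the rules above.
# How the real script recognises them is an assumption of this sketch.
AUTO_INCORRECT_MARKERS = (
    "City mismatch",               # Rule 1
    "Type mismatch",               # Rule 2
    "Low name similarity (<30%)",  # Rule 3
)


def prefill_decision(row: dict) -> Optional[tuple]:
    """Return (status, note) if a rule fires, else None (needs manual review)."""
    issues = row.get("spot_check_issues", "")
    for marker in AUTO_INCORRECT_MARKERS:
        if marker in issues:
            return "INCORRECT", f"[AUTO] {marker} detected. {issues}"
    return None


flagged = {"spot_check_issues": "City mismatch: our 'Skive' but Wikidata mentions 'randers'"}
ambiguous = {"spot_check_issues": "First word differs"}

print(prefill_decision(flagged))    # auto-marked INCORRECT with an [AUTO] note
print(prefill_decision(ambiguous))  # None: left for manual review
```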

---

## Files Generated

### 1. Pre-filled Full CSV (All 185 matches)
**File**: `data/review/denmark_wikidata_fuzzy_matches_prefilled.csv`
**Size**: 64.3 KB
**Contents**: All 185 fuzzy matches with 73 pre-filled as INCORRECT

**Use when**:
- You want to see everything (validated + remaining)
- You want to verify automated decisions
- You need full context

**How to use**:
```csv
# Columns:
auto_flag           → REVIEW_URGENT or OK
spot_check_issues   → Detected problems
validation_status   → INCORRECT (auto), or empty (needs review)
validation_notes    → [AUTO] explanation or manual notes
```

### 2. Streamlined Needs Review CSV (75 matches only)
**File**: `data/review/denmark_wikidata_fuzzy_matches_needs_review.csv`
**Size**: 22.3 KB
**Contents**: ONLY the 75 matches requiring your judgment

**Use when**:
- You want to focus on remaining work (recommended!)
- You trust the automated decisions
- You want faster review

**What's included**:
- 56 flagged matches NOT automatically marked (ambiguous cases)
- 19 "OK" matches with Priority 1-2 (spot check for safety)
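
The selection described above could look like the sketch below. It assumes a `priority` column holding the 1-5 priority mentioned in this guide (the actual column name is not documented here), and the sample rows are made up for illustration. The helper name `needs_manual_review` is hypothetical.

```python
def needs_manual_review(row: dict) -> bool:
    """True if a row belongs in the streamlined needs_review file."""
    if row.get("validation_status"):  # already decided, e.g. auto-marked INCORRECT
        return False
    if row["auto_flag"] == "REVIEW_URGENT":
        return True
    # "OK" rows are kept only as a Priority 1-2 safety spot check.
    return row["auto_flag"] == "OK" and int(row["priority"]) <= 2


# Made-up rows covering the four situations described in this guide.
rows = [
    {"auto_flag": "REVIEW_URGENT", "priority": "1", "validation_status": "INCORRECT"},
    {"auto_flag": "REVIEW_URGENT", "priority": "2", "validation_status": ""},
    {"auto_flag": "OK", "priority": "2", "validation_status": ""},
    {"auto_flag": "OK", "priority": "4", "validation_status": ""},
]
selected = [r for r in rows if needs_manual_review(r)]
print(len(selected))  # the flagged-but-undecided row and the Priority 2 "OK" row
```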

---

## Time Estimates

### Original Estimate (Before Automation)
- **Total matches**: 185
- **Estimated time**: 462 minutes (7.7 hours)
- **Breakdown**: 2.5 min/match average

### After Automated Pre-fill
- **Pre-filled INCORRECT**: 73 matches (no review needed) ✅
- **Needs manual review**: 75 matches
- **Estimated time**: 150 minutes (2.5 hours)
- **Time saved**: **67.6%** (312 minutes = 5.2 hours)
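
The figures above follow from simple arithmetic, reproduced here as a quick check. Note that 185 × 2.5 is 462.5 minutes, which the guide rounds down to 462, and that the 67.6% saving is computed from the unrounded values. The estimate covers only the 75 matches in the streamlined file. The variable names are illustrative.

```python
TOTAL_MATCHES = 185
MINUTES_PER_MATCH_BEFORE = 2.5   # original average per match
REMAINING_MATCHES = 75
MINUTES_AFTER = 150              # estimate for the remaining manual review

minutes_before = TOTAL_MATCHES * MINUTES_PER_MATCH_BEFORE  # 462.5
minutes_saved = minutes_before - MINUTES_AFTER             # 312.5
percent_saved = 100 * minutes_saved / minutes_before

print(f"before: {minutes_before:.1f} min ({minutes_before / 60:.1f} h)")
print(f"after:  {MINUTES_AFTER} min ({MINUTES_AFTER / REMAINING_MATCHES:.1f} min/match)")
print(f"saved:  {minutes_saved:.1f} min ({minutes_saved / 60:.1f} h, {percent_saved:.1f}%)")
```

The remaining review therefore assumes about 2 minutes per match, slightly faster than the original 2.5-minute average.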

### Breakdown of Remaining 75 Matches

| Category | Count | Est. Time | Description |
|----------|-------|-----------|-------------|
| **Name pattern issues** | 11 | 22 min | Low similarity, different first words |
| **Gymnasium libraries** | 7 | 14 min | School library vs public library |
| **Branch vs main** | 10 | 20 min | Branch suffix mismatch |
| **Low confidence** | 8 | 16 min | Score <87% without ISIL |
| **Priority 1-2 spot check** | 19 | 38 min | "OK" matches needing safety check |
| **Other ambiguous** | 20 | 40 min | Case-by-case judgment |
| **Total** | **75** | **150 min** | **(2.5 hours)** |

---

## Manual Review Workflow

### Step 1: Open Streamlined CSV (Recommended)

```bash
# Open in Excel, Google Sheets, or a text editor
# (`open` is the macOS command; on Linux use `xdg-open` instead)
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
```

**Columns to focus on**:
- `auto_flag` - REVIEW_URGENT = needs judgment
- `spot_check_issues` - What patterns were detected
- `institution_name` - Our institution
- `wikidata_label` - Wikidata entity label
- `city` - Our city (check consistency)
- `wikidata_url` - Click to verify on Wikidata

### Step 2: Review by Category

#### A. Name Pattern Issues (11 matches)
**Pattern**: `🔍 Low name similarity` or `🔍 First word differs`

**Decision guide**:
- **CORRECT if**: Branch vs main library (e.g., "Campus Vejle, Biblioteket" → "Vejle Bibliotek")
- **INCORRECT if**: Truly different institutions (different names, no branch relationship)

**Example**:
```csv
"Campus Vejle, Biblioteket" → "Vejle Bibliotek"
Decision: CORRECT (campus branch of main library)
Notes: "Campus library is branch of main Vejle public library"
```

#### B. Gymnasium Libraries (7 matches)
**Pattern**: `🔍 Our 'Gymnasium' library matched to public library`

**Decision guide**:
- **INCORRECT**: Usually school libraries ≠ public libraries
- **CORRECT**: Only if they genuinely share facilities/systems

**Example**:
```csv
"Fredericia Gymnasium, Biblioteket" → "Fredericia Bibliotek"
Decision: INCORRECT (school library vs public library)
Notes: "Gymnasium library is separate from public library system"
```

#### C. Branch vs Main (10 matches)
**Pattern**: `, Biblioteket` suffix in our name

**Decision guide**:
- Check the Wikidata page: does it list branches?
- If the Wikidata entry is the MAIN library and ours is a BRANCH → CORRECT
- If it is a completely different institution → INCORRECT

#### D. Low Confidence (8 matches)
**Pattern**: `⚠️ Low confidence (<87%) with no ISIL to verify`

**Action**: Visit the Wikidata URL and verify:
- Address/location matches?
- Opening year matches?
- Type matches (library/archive/museum)?

#### E. Priority 1-2 Spot Check (19 matches)
**Pattern**: `auto_flag = OK` but Priority 1-2

**Action**: Quick sanity check only
- Most should be CORRECT (passed automated checks)
- Just verify that the names look reasonable
- Mark CORRECT if it looks good
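
To work through the streamlined file one category at a time, rows can be bucketed by the patterns listed above. The sketch below is one possible way to do it. The function name `review_category` and the order in which the checks are applied are assumptions (for example, a gymnasium library whose name also ends in `, Biblioteket` is placed in category B here), so the resulting counts may not match the breakdown table exactly.

```python
def review_category(row: dict) -> str:
    """Assign a row to one of the review categories A-E, or to 'Other ambiguous'."""
    issues = row.get("spot_check_issues", "")
    name = row.get("institution_name", "")
    if row.get("auto_flag") == "OK":
        return "E. Priority 1-2 spot check"
    if "Gymnasium" in issues:
        return "B. Gymnasium libraries"
    if "Low name similarity" in issues or "First word differs" in issues:
        return "A. Name pattern issues"
    if name.endswith(", Biblioteket"):
        return "C. Branch vs main"
    if "Low confidence" in issues:
        return "D. Low confidence"
    return "Other ambiguous"


example = {
    "auto_flag": "REVIEW_URGENT",
    "spot_check_issues": "Our 'Gymnasium' library matched to public library",
    "institution_name": "Fredericia Gymnasium, Biblioteket",
}
print(review_category(example))
```

Sorting the spreadsheet by this label keeps similar decisions together, which makes it easier to apply the decision guides consistently.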

### Step 3: Fill Validation Columns

For each row, fill:

**validation_status** (required):
- `CORRECT` - Wikidata match is correct
- `INCORRECT` - Wikidata match is wrong
- `UNCERTAIN` - Need expert review

**validation_notes** (required):
- Explain your decision
- Include URL visited, dates checked, etc.

**Example entries**:
```csv
CORRECT,"Branch library of main system, confirmed on Wikidata Q21107021"
INCORRECT,"Gymnasium library (school) incorrectly matched to public library"
INCORRECT,"Different cities (Viborg vs Aalborg), different institutions"
CORRECT,"Name variation, same institution confirmed by ISIL code DK-872150"
UNCERTAIN,"Need to verify with domain expert - possible historical merger?"
```

### Step 4: Apply Validation

After filling all rows:

```bash
# Apply validation decisions to main dataset
python scripts/apply_wikidata_validation.py

# Check progress
python scripts/check_validation_progress.py
```

---

## Automated Pre-fill Examples

### Example 1: City Mismatch (Auto-INCORRECT)

```csv
Institution: "Fur Lokalhistoriske Arkiv"
City: Skive
Wikidata: "Randers Lokalhistoriske Arkiv" (Q12332829)
Score: 85.2%

validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected. City mismatch: our 'Skive' but Wikidata mentions 'randers'
```

**Why auto-marked**: Different cities (Skive vs Randers) = different local archives

### Example 2: Multiple City Mismatches (Pattern)

**Common error pattern discovered**: Many local archives incorrectly matched to "Randers Lokalhistoriske Arkiv"

Affected archives (all auto-marked INCORRECT):
- Fur Lokalhistoriske Arkiv (Skive)
- Aarup Lokalhistoriske Arkiv (Assens)
- Ikast Lokalhistoriske Arkiv (Ikast-Brande)
- Morsø Lokalhistoriske Arkiv (Morsø)
- Hover Lokalhistoriske Arkiv (Ringkøbing-Skjern)
- 20+ more...

**Root cause**: Fuzzy matcher incorrectly grouped local archives with similar names

### Example 3: Type Mismatch (Auto-INCORRECT)

```csv
Institution: "Musikmuseet - Musikhistorisk Museum og Carl Claudius' Samling"
Type: LIBRARY
Wikidata: Q21107738 (Museum)
Score: 98.4%

validation_status: INCORRECT
validation_notes: [AUTO] Type mismatch: institution types fundamentally different (library vs museum)
```

**Why auto-marked**: Despite high name similarity, type mismatch is definitive

---

## Validation Decision Guide

### Quick Reference Table

| Issue Type | Default | Check For | Common Outcome |
|------------|---------|-----------|----------------|
| 🚨 City mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| ⚠️ Type mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| 🔍 Gymnasium library | INCORRECT | Branch sharing? | 90% INCORRECT |
| 🔍 Low similarity (<60%) | INCORRECT | Historical name? | 80% INCORRECT |
| 🔍 Branch suffix | CORRECT | Different inst? | 70% CORRECT |
| 🔍 First word differs | UNCERTAIN | City name? | 50/50 |
| ⚠️ Low score (<87%) | UNCERTAIN | Check Wikidata | 50/50 |

### When to Mark CORRECT

✅ **Branch vs Main Library**
- Our name: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"
- Same library system, branch location

✅ **Name Variation**
- Our name: "Sjællands Stiftsbiblioteks gamle samling"
- Wikidata: "Sjællands Stiftsbibliotek"
- Historical vs current name, same institution

✅ **Confirmed by ISIL**
- Our ISIL: DK-872150
- Wikidata ISIL: DK-872150 (same)
- Names differ slightly but ISIL confirms match

### When to Mark INCORRECT

❌ **Different Cities**
- Our city: Skive
- Wikidata city: Randers
- Local archives are inherently city-specific

❌ **Different Types**
- Our type: LIBRARY
- Wikidata type: MUSEUM
- Fundamentally different institution categories

❌ **School vs Public**
- Our name: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek" (public library)
- School library ≠ public library

❌ **Very Different Names**
- Our name: "Lunds stadsbibliotek"
- Wikidata: "Billund Bibliotek"
- Only 29.6% similarity, no relationship

### When to Mark UNCERTAIN

⁉️ **Possible Historical Merger**
- Names differ + dates unclear
- Need expert to verify organizational history

⁉️ **Ambiguous Branch Relationship**
- Could be branch OR different institution
- Need domain knowledge

⁉️ **Missing Data**
- Not enough information to decide
- Flag for follow-up research

---

## Validation Quality Standards

### Minimum Requirements

For each validated row, ensure:

1. ✅ **validation_status** is filled (CORRECT/INCORRECT/UNCERTAIN)
2. ✅ **validation_notes** explains the decision
3. ✅ Notes include evidence (URL checked, date verified, etc.)
4. ✅ If UNCERTAIN, notes explain what info is missing
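
Part of this checklist can be verified mechanically before applying the decisions. The sketch below checks requirements 1 and 2 and warns about very short notes as a rough proxy for missing evidence. The function name `validation_problems` and the 30-character threshold are assumptions of this example. Whether a note really contains evidence still has to be judged by a person.

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}
MIN_NOTE_LENGTH = 30  # hypothetical threshold for "too short to contain evidence"


def validation_problems(row: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    status = (row.get("validation_status") or "").strip()
    notes = (row.get("validation_notes") or "").strip()
    if status not in VALID_STATUSES:
        problems.append("validation_status must be CORRECT, INCORRECT, or UNCERTAIN")
    if not notes:
        problems.append("validation_notes is empty")
    elif len(notes) < MIN_NOTE_LENGTH:
        problems.append("validation_notes is very short; add the evidence you checked")
    return problems


good = {
    "validation_status": "INCORRECT",
    "validation_notes": "Gymnasium library (school) incorrectly matched to public library",
}
bad = {"validation_status": "", "validation_notes": "Looks wrong"}

print(validation_problems(good))
print(validation_problems(bad))
```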

### Good Validation Notes Examples

**CORRECT decision**:
```
"Branch library confirmed on Wikidata page Q21107021. Main library system
operates multiple branch locations including this one."
```

**INCORRECT decision**:
```
"City mismatch: our institution in Viborg, Wikidata entity in Aalborg.
Checked Q21107842 - describes Aalborg gymnasium specifically."
```

**UNCERTAIN decision**:
```
"Names differ significantly but both in Roskilde. Possible historical name
change or merger. Recommend expert review to confirm organizational history."
```

### Bad Validation Notes Examples

❌ **Too vague**:
```
"Looks wrong" → No evidence provided
"Probably correct" → No verification described
```

❌ **Missing evidence**:
```
"INCORRECT" → Why? What did you check?
"Different institutions" → How do you know?
```

❌ **No investigation**:
```
"Not sure, marked UNCERTAIN" → Did you check Wikidata page? Address?
```

---

## After Validation

### Step 1: Apply Validation Decisions

```bash
python scripts/apply_wikidata_validation.py
```

**What this does**:
- Reads your validation decisions from CSV
- Updates main dataset (`denmark_complete_enriched.json`)
- Removes INCORRECT Wikidata links
- Keeps CORRECT Wikidata links
- Flags UNCERTAIN for follow-up
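
The behaviour listed above can be pictured with the sketch below. It is not the code of `scripts/apply_wikidata_validation.py`: the record layout (`id`, `wikidata_id`), the `needs_follow_up` flag, and the choice to keep the link on UNCERTAIN rows are all assumptions made for illustration, since the structure of `denmark_complete_enriched.json` is not described in this guide.

```python
def apply_decisions(records: list, decisions: dict) -> list:
    """Return updated copies of records; `decisions` maps record id -> status."""
    updated = []
    for record in records:
        record = dict(record)  # shallow copy so the input list is left untouched
        status = decisions.get(record["id"])
        if status == "INCORRECT":
            record.pop("wikidata_id", None)    # remove the wrong link
        elif status == "UNCERTAIN":
            record["needs_follow_up"] = True   # assumed: keep the link but flag it
        # CORRECT, or no decision yet: keep the link unchanged
        updated.append(record)
    return updated


# Hypothetical records; the QIDs are ones quoted elsewhere in this guide.
records = [
    {"id": "dk-001", "wikidata_id": "Q12332829"},
    {"id": "dk-002", "wikidata_id": "Q21107021"},
    {"id": "dk-003", "wikidata_id": "Q21107861"},
]
decisions = {"dk-001": "INCORRECT", "dk-002": "CORRECT", "dk-003": "UNCERTAIN"}
result = apply_decisions(records, decisions)
print(result)
```

Working on copies means the original records stay available for comparison, which is convenient when verifying the results afterwards.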

### Step 2: Check Progress

```bash
python scripts/check_validation_progress.py
```

**Output**:
- Total fuzzy matches reviewed
- Breakdown: CORRECT vs INCORRECT vs UNCERTAIN
- Remaining unvalidated matches
- Next steps

### Step 3: Verify Results

**Before validation**:
- Wikidata links: 769 total (584 exact + 185 fuzzy)
- Fuzzy match accuracy: Unknown (need validation)

**After validation** (expected):
- Wikidata links: ~680-700 total
- Fuzzy CORRECT: ~100-110 (54-59%)
- Fuzzy INCORRECT: ~70-80 (38-43%) → Removed
- Overall accuracy: ~95%+

### Step 4: Document Findings

Create summary report:
- Total matches validated
- Accuracy of fuzzy matching algorithm
- Common error patterns discovered
- Recommendations for improving fuzzy matching

---

## Troubleshooting

### Q: What if I disagree with an auto-marked INCORRECT?

**A**: You can override it! Change `validation_status` to `CORRECT` and add your reasoning in `validation_notes`. The automated decision is just a starting point.

Example:
```csv
# Original (auto):
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected...

# Your override:
validation_status: CORRECT
validation_notes: "Overriding auto-mark: Checked Wikidata, this is a branch
library that serves both cities. Confirmed with institution website."
```

### Q: How do I know if a gymnasium library shares facilities?

**A**: Check:
1. Visit Wikidata page → Look for "part of" relationships
2. Search institution website → Look for shared catalog/systems
3. Check ISIL codes → Same ISIL = shared system

### Q: What if I can't decide after checking Wikidata?

**A**: Mark as `UNCERTAIN` and document what you checked:
```csv
validation_status: UNCERTAIN
validation_notes: "Checked Q21107861, addresses differ slightly. Possible
relocation or branch. Need institutional records to confirm."
```

### Q: Can I batch-mark multiple rows?

**A**: Yes! If you find a pattern:
```csv
# Example: All these were matched to Q12332829 (Randers archive)
# All in different cities → All INCORRECT

validation_status: INCORRECT
validation_notes: "Batch validation: City mismatch, different local archives
incorrectly grouped by fuzzy matcher"
```
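
If the review file is edited with a script rather than a spreadsheet, a batch mark like the one above might be sketched as follows. The helper `batch_mark` is hypothetical, and it assumes `wikidata_url` ends with the entity ID (as in `https://www.wikidata.org/wiki/Q12332829`). It only touches rows without a decision, so earlier manual work is not overwritten. Before using something like this, confirm that none of the affected rows is the institution the Wikidata entry really describes. The third sample row is made up.

```python
def batch_mark(rows: list, qid: str, status: str, note: str) -> int:
    """Set status and note on every undecided row linked to `qid`; return the count."""
    marked = 0
    for row in rows:
        row_qid = row["wikidata_url"].rstrip("/").rsplit("/", 1)[-1]
        if row_qid == qid and not row.get("validation_status"):
            row["validation_status"] = status
            row["validation_notes"] = note
            marked += 1
    return marked


rows = [
    {"institution_name": "Fur Lokalhistoriske Arkiv",
     "wikidata_url": "https://www.wikidata.org/wiki/Q12332829", "validation_status": ""},
    {"institution_name": "Aarup Lokalhistoriske Arkiv",
     "wikidata_url": "https://www.wikidata.org/wiki/Q12332829", "validation_status": ""},
    {"institution_name": "Example Bibliotek (made up)",
     "wikidata_url": "https://www.wikidata.org/wiki/Q21107021", "validation_status": ""},
]
count = batch_mark(
    rows,
    "Q12332829",
    "INCORRECT",
    "Batch validation: City mismatch, different local archives "
    "incorrectly grouped by fuzzy matcher",
)
print(count)
```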

---

## Progress Tracking

### Current Status

| Metric | Count | Percentage |
|--------|-------|------------|
| **Total fuzzy matches** | 185 | 100% |
| **Auto-marked INCORRECT** | 73 | 39.5% |
| **Needs manual review** | 75 | 40.5% |
| **Remaining unvalidated** | 37 | 20.0% |

**Note**: The 37 "remaining unvalidated" are Priority 3-5 matches in the full CSV that aren't in the streamlined needs_review file. You can validate these later if needed.
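
The percentages in the table can be recomputed from the counts, which is a useful sanity check whenever the numbers are updated. This is only an arithmetic sketch; the dictionary below simply restates the table.

```python
counts = {
    "Auto-marked INCORRECT": 73,
    "Needs manual review": 75,
    "Remaining unvalidated": 37,
}
total = sum(counts.values())  # should equal the 185 fuzzy matches

percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
for label, pct in percentages.items():
    print(f"{label}: {counts[label]} ({pct}%)")
```

If the three counts ever stop adding up to 185, a row has been counted twice or dropped between the full and streamlined CSV files.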

### Validation Milestones

- [x] **Automated spot checks** - 185 matches flagged (Nov 19)
- [x] **Automated pre-fill** - 73 obvious errors marked (Nov 19)
- [ ] **Manual review** - 75 ambiguous cases (in progress)
- [ ] **Apply validation** - Update main dataset
- [ ] **Quality check** - Verify results
- [ ] **Documentation** - Write summary report

---

## Contact & Support

**Questions?**
- Check: `docs/WIKIDATA_VALIDATION_CHECKLIST.md` - Detailed validation guide
- Check: `docs/AUTOMATED_SPOT_CHECK_RESULTS.md` - Spot check methodology
- Check: `data/review/README.md` - Quick reference

**Found a bug in automated pre-fill?**
- Script: `scripts/prefill_obvious_errors.py`
- Report issue with example row

**Need expert review?**
- Mark as UNCERTAIN
- Document what's unclear
- Escalate after validation complete

---

**Last Updated**: November 19, 2025
**Status**: 73/185 validated (39.5% complete)
**Next Action**: Manual review of 75 remaining matches