# Pre-filled Validation Guide: Denmark Wikidata Fuzzy Matches
**Status**: ✅ 73 obvious errors automatically marked INCORRECT (Nov 19, 2025)
**Remaining**: 75 matches require manual judgment
---
## Summary of Automated Pre-fill
### What Was Done
An automated script (`scripts/prefill_obvious_errors.py`) analyzed all 185 fuzzy Wikidata matches and:
1. **Identified 73 obvious errors** based on clear criteria
2. **Automatically marked them as INCORRECT** in `validation_status`
3. **Added explanatory notes** documenting why each was flagged
4. **Generated streamlined review file** with only remaining 75 ambiguous cases
### Automated Detection Rules
The script marked matches as INCORRECT when they had:
#### Rule 1: City Mismatch (71 matches)
- **Pattern**: `🚨 City mismatch: our 'X' but Wikidata mentions 'Y'`
- **Logic**: Different cities = different institutions
- **Confidence**: Very high (>99% accuracy)
- **Examples**:
- Our: "Fur Lokalhistoriske Arkiv" (Skive) → Wikidata: "Randers Lokalhistoriske Arkiv" ❌
- Our: "Gladsaxe Bibliotekerne" (Søborg) → Wikidata: "Gentofte Bibliotekerne" ❌
#### Rule 2: Type Mismatch (1 match)
- **Pattern**: `⚠️ Type mismatch: we're LIBRARY but Wikidata mentions museum/gallery`
- **Logic**: Fundamentally different institution types
- **Example**: Our LIBRARY matched to Wikidata museum entry
#### Rule 3: Very Low Name Similarity (1 match)
- **Pattern**: `Low name similarity (<30%)`
- **Logic**: Names too different to be same institution
- **Example**: "Lunds stadsbibliotek" vs "Billund Bibliotek" (29.6% similarity)
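The three rules above can be sketched as a single decision function. This is a simplified illustration, not the actual `prefill_obvious_errors.py` logic; the parameter names and the `difflib`-based similarity stand-in are assumptions:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; stand-in for the matcher's score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def prefill_status(our_city, wd_city, our_type, wd_type, our_name, wd_name):
    """Return (status, note), or (None, None) when the row needs manual review."""
    # Rule 1: different cities = different institutions
    if wd_city and our_city.lower() != wd_city.lower():
        return ("INCORRECT",
                f"[AUTO] City mismatch: our '{our_city}' but Wikidata mentions '{wd_city}'")
    # Rule 2: fundamentally different institution types
    if our_type == "LIBRARY" and wd_type in ("museum", "gallery"):
        return "INCORRECT", "[AUTO] Type mismatch: library vs " + wd_type
    # Rule 3: names too different to be the same institution
    if name_similarity(our_name, wd_name) < 0.30:
        return "INCORRECT", "[AUTO] Low name similarity (<30%)"
    return None, None  # ambiguous: leave validation_status empty
```

Anything the function returns `(None, None)` for stays in the manual-review pile.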
---
## Files Generated
### 1. Pre-filled Full CSV (All 185 matches)
**File**: `data/review/denmark_wikidata_fuzzy_matches_prefilled.csv`
**Size**: 64.3 KB
**Contents**: All 185 fuzzy matches with 73 pre-filled as INCORRECT
**Use when**:
- You want to see everything (validated + remaining)
- You want to verify automated decisions
- You need full context
**How to use**:
```csv
# Columns:
auto_flag → REVIEW_URGENT or OK
spot_check_issues → Detected problems
validation_status → INCORRECT (auto), or empty (needs review)
validation_notes → [AUTO] explanation or manual notes
```
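The split between pre-filled and still-open rows follows directly from `validation_status` being filled or empty. A minimal sketch with the stdlib `csv` module (the two inline rows are hypothetical samples mirroring the columns above):

```python
import csv
import io

# Hypothetical two-row sample mirroring the column layout described above.
SAMPLE = '''institution_name,auto_flag,validation_status,validation_notes
Fur Lokalhistoriske Arkiv,REVIEW_URGENT,INCORRECT,[AUTO] City mismatch detected
"Campus Vejle, Biblioteket",REVIEW_URGENT,,
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
prefilled = [r for r in rows if r["validation_status"]]      # auto-decided
needs_review = [r for r in rows if not r["validation_status"]]  # still open
print(len(prefilled), len(needs_review))  # prints: 1 1
```

On the real file, replace the `StringIO` sample with `open("data/review/denmark_wikidata_fuzzy_matches_prefilled.csv")`.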
### 2. Streamlined Needs Review CSV (75 matches only)
**File**: `data/review/denmark_wikidata_fuzzy_matches_needs_review.csv`
**Size**: 22.3 KB
**Contents**: ONLY the 75 matches requiring your judgment
**Use when**:
- You want to focus on remaining work (recommended!)
- You trust the automated decisions
- You want faster review
**What's included**:
- 56 flagged matches NOT automatically marked (ambiguous cases)
- 19 "OK" matches with Priority 1-2 (spot check for safety)
---
## Time Estimates
### Original Estimate (Before Automation)
- **Total matches**: 185
- **Estimated time**: 462 minutes (7.7 hours)
- **Breakdown**: 2.5 min/match average
### After Automated Pre-fill
- **Pre-filled INCORRECT**: 73 matches (no review needed) ✅
- **Needs manual review**: 75 matches
- **Estimated time**: 150 minutes (2.5 hours)
- **Time saved**: **67.5%** (312 minutes = 5.2 hours)
### Breakdown of Remaining 75 Matches
| Category | Count | Est. Time | Description |
|----------|-------|-----------|-------------|
| **Name pattern issues** | 11 | 22 min | Low similarity, different first words |
| **Gymnasium libraries** | 7 | 14 min | School library vs public library |
| **Branch vs main** | 10 | 20 min | Branch suffix mismatch |
| **Low confidence** | 8 | 16 min | Score <87% without ISIL |
| **Priority 1-2 spot check** | 19 | 38 min | "OK" matches needing safety check |
| **Other ambiguous** | 20 | 40 min | Case-by-case judgment |
| **Total** | **75** | **150 min** | **(2.5 hours)** |
---
## Manual Review Workflow
### Step 1: Open Streamlined CSV (Recommended)
```bash
# Open in Excel, Google Sheets, or text editor
open data/review/denmark_wikidata_fuzzy_matches_needs_review.csv
```
**Columns to focus on**:
- `auto_flag` - REVIEW_URGENT = needs judgment
- `spot_check_issues` - What patterns were detected
- `institution_name` - Our institution
- `wikidata_label` - Wikidata entity label
- `city` - Our city (check consistency)
- `wikidata_url` - Click to verify on Wikidata
### Step 2: Review by Category
#### A. Name Pattern Issues (11 matches)
**Pattern**: `🔍 Low name similarity` or `🔍 First word differs`
**Decision guide**:
- **CORRECT if**: Branch vs main library (e.g., "Campus Vejle, Biblioteket" → "Vejle Bibliotek")
- **INCORRECT if**: Truly different institutions (different names, no branch relationship)
**Example**:
```csv
"Campus Vejle, Biblioteket" → "Vejle Bibliotek"
Decision: CORRECT (campus branch of main library)
Notes: "Campus library is branch of main Vejle public library"
```
#### B. Gymnasium Libraries (7 matches)
**Pattern**: `🔍 Our 'Gymnasium' library matched to public library`
**Decision guide**:
- **INCORRECT**: Usually (school libraries ≠ public libraries)
- **CORRECT**: Only if they genuinely share facilities/systems
**Example**:
```csv
"Fredericia Gymnasium, Biblioteket" → "Fredericia Bibliotek"
Decision: INCORRECT (school library vs public library)
Notes: "Gymnasium library is separate from public library system"
```
#### C. Branch vs Main (10 matches)
**Pattern**: `, Biblioteket` suffix in our name
**Decision guide**:
- Check Wikidata page - does it list branches?
- If Wikidata entry is MAIN library and ours is BRANCH → CORRECT
- If completely different institution → INCORRECT
#### D. Low Confidence (8 matches)
**Pattern**: `⚠️ Low confidence (<87%) with no ISIL to verify`
**Action**: Visit Wikidata URL, verify:
- Address/location matches?
- Opening year matches?
- Type matches (library/archive/museum)?
#### E. Priority 1-2 Spot Check (19 matches)
**Pattern**: `auto_flag = OK` but Priority 1-2
**Action**: Quick sanity check only
- Most should be CORRECT (passed automated checks)
- Just verify names look reasonable
- Mark CORRECT if it looks good
### Step 3: Fill Validation Columns
For each row, fill:
**validation_status** (required):
- `CORRECT` - Wikidata match is correct
- `INCORRECT` - Wikidata match is wrong
- `UNCERTAIN` - Need expert review
**validation_notes** (required):
- Explain your decision
- Include URL visited, dates checked, etc.
**Example entries**:
```csv
CORRECT,"Branch library of main system, confirmed on Wikidata Q21107021"
INCORRECT,"Gymnasium library (school) incorrectly matched to public library"
INCORRECT,"Different cities (Viborg vs Aalborg), different institutions"
CORRECT,"Name variation, same institution confirmed by ISIL code DK-872150"
UNCERTAIN,"Need to verify with domain expert - possible historical merger?"
```
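Both requirements are easy to lint mechanically before applying anything. A small hypothetical checker (it only looks at the two validation columns):

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def lint_validation(status: str, notes: str) -> list:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    if status not in VALID_STATUSES:
        problems.append(f"invalid validation_status: {status!r}")
    if not notes.strip():
        problems.append("validation_notes is empty")
    return problems
```

Running this over every row before Step 4 catches typos like `correct` (lowercase) and missing notes early.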
### Step 4: Apply Validation
After filling all rows:
```bash
# Apply validation decisions to main dataset
python scripts/apply_wikidata_validation.py
# Check progress
python scripts/check_validation_progress.py
```
---
## Automated Pre-fill Examples
### Example 1: City Mismatch (Auto-INCORRECT)
```csv
Institution: "Fur Lokalhistoriske Arkiv"
City: Skive
Wikidata: "Randers Lokalhistoriske Arkiv" (Q12332829)
Score: 85.2%
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected. City mismatch: our 'Skive' but Wikidata mentions 'randers'
```
**Why auto-marked**: Different cities (Skive vs Randers) = different local archives
### Example 2: Multiple City Mismatches (Pattern)
**Common error pattern discovered**: Many local archives incorrectly matched to "Randers Lokalhistoriske Arkiv"
Affected archives (all auto-marked INCORRECT):
- Fur Lokalhistoriske Arkiv (Skive)
- Aarup Lokalhistoriske Arkiv (Assens)
- Ikast Lokalhistoriske Arkiv (Ikast-Brande)
- Morsø Lokalhistoriske Arkiv (Morsø)
- Hover Lokalhistoriske Arkiv (Ringkøbing-Skjern)
- 20+ more...
**Root cause**: Fuzzy matcher incorrectly grouped local archives with similar names
### Example 3: Type Mismatch (Auto-INCORRECT)
```csv
Institution: "Musikmuseet - Musikhistorisk Museum og Carl Claudius' Samling"
Type: LIBRARY
Wikidata: Q21107738 (Museum)
Score: 98.4%
validation_status: INCORRECT
validation_notes: [AUTO] Type mismatch: institution types fundamentally different (library vs museum)
```
**Why auto-marked**: Despite high name similarity, type mismatch is definitive
---
## Validation Decision Guide
### Quick Reference Table
| Issue Type | Default | Check For | Common Outcome |
|------------|---------|-----------|----------------|
| 🚨 City mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| Type mismatch | INCORRECT | Auto-filled | 100% INCORRECT |
| 🔍 Gymnasium library | INCORRECT | Branch sharing? | 90% INCORRECT |
| 🔍 Low similarity (<60%) | INCORRECT | Historical name? | 80% INCORRECT |
| 🔍 Branch suffix | CORRECT | Different inst? | 70% CORRECT |
| 🔍 First word differs | UNCERTAIN | City name? | 50/50 |
| Low score (<87%) | UNCERTAIN | Check Wikidata | 50/50 |
### When to Mark CORRECT
**Branch vs Main Library**
- Our name: "Campus Vejle, Biblioteket"
- Wikidata: "Vejle Bibliotek"
- Same library system, branch location
**Name Variation**
- Our name: "Sjællands Stiftsbiblioteks gamle samling"
- Wikidata: "Sjællands Stiftsbibliotek"
- Historical vs current name, same institution
**Confirmed by ISIL**
- Our ISIL: DK-872150
- Wikidata ISIL: DK-872150 (same)
- Names differ slightly but ISIL confirms match
### When to Mark INCORRECT
**Different Cities**
- Our city: Skive
- Wikidata city: Randers
- Local archives are inherently city-specific
**Different Types**
- Our type: LIBRARY
- Wikidata type: MUSEUM
- Fundamentally different institution categories
**School vs Public**
- Our name: "Fredericia Gymnasium, Biblioteket"
- Wikidata: "Fredericia Bibliotek" (public library)
- School library ≠ public library
**Very Different Names**
- Our name: "Lunds stadsbibliotek"
- Wikidata: "Billund Bibliotek"
- Only 29.6% similarity, no relationship
### When to Mark UNCERTAIN
**Possible Historical Merger**
- Names differ + dates unclear
- Need expert to verify organizational history
**Ambiguous Branch Relationship**
- Could be branch OR different institution
- Need domain knowledge
**Missing Data**
- Not enough information to decide
- Flag for follow-up research
---
## Validation Quality Standards
### Minimum Requirements
For each validated row, ensure:
1. **validation_status** is filled (CORRECT/INCORRECT/UNCERTAIN)
2. **validation_notes** explains the decision
3. Notes include evidence (URL checked, date verified, etc.)
4. If UNCERTAIN, notes explain what info is missing
### Good Validation Notes Examples
**CORRECT decision**:
```
"Branch library confirmed on Wikidata page Q21107021. Main library system
operates multiple branch locations including this one."
```
**INCORRECT decision**:
```
"City mismatch: our institution in Viborg, Wikidata entity in Aalborg.
Checked Q21107842 - describes Aalborg gymnasium specifically."
```
**UNCERTAIN decision**:
```
"Names differ significantly but both in Roskilde. Possible historical name
change or merger. Recommend expert review to confirm organizational history."
```
### Bad Validation Notes Examples
**Too vague**:
```
"Looks wrong" → No evidence provided
"Probably correct" → No verification described
```
**Missing evidence**:
```
"INCORRECT" → Why? What did you check?
"Different institutions" → How do you know?
```
**No investigation**:
```
"Not sure, marked UNCERTAIN" → Did you check Wikidata page? Address?
```
---
## After Validation
### Step 1: Apply Validation Decisions
```bash
python scripts/apply_wikidata_validation.py
```
**What this does**:
- Reads your validation decisions from CSV
- Updates main dataset (`denmark_complete_enriched.json`)
- Removes INCORRECT Wikidata links
- Keeps CORRECT Wikidata links
- Flags UNCERTAIN for follow-up
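The core of that update step could look like this (a sketch only, assuming institutions carry a `wikidata_id` field and decisions are keyed by institution name; the real script may differ):

```python
def apply_decisions(institutions, decisions):
    """Drop INCORRECT links, keep CORRECT ones, flag UNCERTAIN for follow-up.

    institutions: list of dicts with 'name' and optionally 'wikidata_id'
    decisions:    dict mapping institution name -> validation_status
    """
    for inst in institutions:
        verdict = decisions.get(inst["name"])
        if verdict == "INCORRECT":
            inst.pop("wikidata_id", None)   # remove the bad Wikidata link
        elif verdict == "UNCERTAIN":
            inst["needs_followup"] = True   # keep the link but flag for review
        # CORRECT (or no decision yet): leave the record unchanged
    return institutions
```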
### Step 2: Check Progress
```bash
python scripts/check_validation_progress.py
```
**Output**:
- Total fuzzy matches reviewed
- Breakdown: CORRECT vs INCORRECT vs UNCERTAIN
- Remaining unvalidated matches
- Next steps
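The same summary can be derived with a few lines of stdlib Python (a hypothetical sketch; the real script reads the CSV first, and an empty status means unreviewed):

```python
from collections import Counter

def validation_progress(statuses, total=185):
    """Summarise a column of validation_status values (empty = unreviewed)."""
    counts = Counter(s for s in statuses if s)
    reviewed = sum(counts.values())
    return {"reviewed": reviewed, "remaining": total - reviewed, **counts}

summary = validation_progress(["INCORRECT"] * 73 + ["CORRECT"] * 2 + [""] * 110)
print(summary)  # {'reviewed': 75, 'remaining': 110, 'INCORRECT': 73, 'CORRECT': 2}
```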
### Step 3: Verify Results
**Before validation**:
- Wikidata links: 769 total (584 exact + 185 fuzzy)
- Fuzzy match accuracy: Unknown (need validation)
**After validation** (expected):
- Wikidata links: ~680-700 total
- Fuzzy CORRECT: ~100-110 (54-59%)
- Fuzzy INCORRECT: ~70-80 (38-43%) → Removed
- Overall accuracy: ~95%+
### Step 4: Document Findings
Create summary report:
- Total matches validated
- Accuracy of fuzzy matching algorithm
- Common error patterns discovered
- Recommendations for improving fuzzy matching
---
## Troubleshooting
### Q: What if I disagree with an auto-marked INCORRECT?
**A**: You can override it! Change `validation_status` to `CORRECT` and add your reasoning in `validation_notes`. The automated decision is just a starting point.
Example:
```csv
# Original (auto):
validation_status: INCORRECT
validation_notes: [AUTO] City mismatch detected...
# Your override:
validation_status: CORRECT
validation_notes: "Overriding auto-mark: Checked Wikidata, this is a branch
library that serves both cities. Confirmed with institution website."
```
### Q: How do I know if a gymnasium library shares facilities?
**A**: Check:
1. Visit the Wikidata page → look for "part of" relationships
2. Search the institution website → look for shared catalog/systems
3. Check ISIL codes → same ISIL = shared system
### Q: What if I can't decide after checking Wikidata?
**A**: Mark as `UNCERTAIN` and document what you checked:
```csv
validation_status: UNCERTAIN
validation_notes: "Checked Q21107861, addresses differ slightly. Possible
relocation or branch. Need institutional records to confirm."
```
### Q: Can I batch-mark multiple rows?
**A**: Yes! If you find a pattern:
```csv
# Example: All these were matched to Q12332829 (Randers archive)
# All in different cities → All INCORRECT
validation_status: INCORRECT
validation_notes: "Batch validation: City mismatch, different local archives
incorrectly grouped by fuzzy matcher"
```
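A batch pass like the Randers case above can be scripted instead of edited row by row (a sketch; `wikidata_id` and the other column names are assumptions about the CSV layout):

```python
def batch_mark_incorrect(rows, wikidata_id, note):
    """Mark every still-unvalidated row matched to `wikidata_id` as INCORRECT."""
    marked = 0
    for row in rows:
        # Only touch rows that have no manual decision yet
        if row["wikidata_id"] == wikidata_id and not row["validation_status"]:
            row["validation_status"] = "INCORRECT"
            row["validation_notes"] = note
            marked += 1
    return marked
```

Skipping rows that already have a status ensures a batch pass never overwrites a manual decision.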
---
## Progress Tracking
### Current Status
| Metric | Count | Percentage |
|--------|-------|------------|
| **Total fuzzy matches** | 185 | 100% |
| **Auto-marked INCORRECT** | 73 | 39.5% |
| **Needs manual review** | 75 | 40.5% |
| **Remaining unvalidated** | 37 | 20.0% |
**Note**: The 37 "remaining unvalidated" are Priority 3-5 matches in the full CSV that aren't in the streamlined needs_review file. You can validate these later if needed.
### Validation Milestones
- [x] **Automated spot checks** - 185 matches flagged (Nov 19)
- [x] **Automated pre-fill** - 73 obvious errors marked (Nov 19)
- [ ] **Manual review** - 75 ambiguous cases (in progress)
- [ ] **Apply validation** - Update main dataset
- [ ] **Quality check** - Verify results
- [ ] **Documentation** - Write summary report
---
## Contact & Support
**Questions?**
- Check: `docs/WIKIDATA_VALIDATION_CHECKLIST.md` - Detailed validation guide
- Check: `docs/AUTOMATED_SPOT_CHECK_RESULTS.md` - Spot check methodology
- Check: `data/review/README.md` - Quick reference
**Found a bug in automated pre-fill?**
- Script: `scripts/prefill_obvious_errors.py`
- Report issue with example row
**Need expert review?**
- Mark as UNCERTAIN
- Document what's unclear
- Escalate after validation complete
---
**Last Updated**: November 19, 2025
**Status**: 73/185 validated (39.5% complete)
**Next Action**: Manual review of 75 remaining matches