# Wikidata Fuzzy Match Validation Checklist
## Overview
This checklist helps reviewers manually validate fuzzy Wikidata matches (85-99% confidence) to ensure data quality before RDF publication.
**Dataset**: Danish GLAM institutions
**Fuzzy Matches**: 185 institutions with match scores 85-99%
**Review File**: `data/review/denmark_wikidata_fuzzy_matches.csv`
---
## Quick Start
### 1. Generate Review Report
```bash
python scripts/generate_wikidata_review_report.py
```
**Output**: `data/review/denmark_wikidata_fuzzy_matches.csv`
### 2. Open CSV in Spreadsheet Software
- **Excel**: Import via Data → From Text/CSV and select UTF-8 (65001) as the file origin so Danish characters (æ, ø, å) display correctly
- **Google Sheets**: File → Import → Upload CSV
- **LibreOffice Calc**: Open, select UTF-8 encoding
### 3. Review Priority Order
Start with **Priority 1** (most uncertain) matches and work downward:
| Priority | Score Range | Count | Description |
|----------|-------------|-------|-------------|
| 1 | 85-87% | Varies | Very uncertain - review first |
| 2 | 87-90% | Varies | Uncertain - needs verification |
| 3 | 90-93% | Varies | Moderate confidence |
| 4 | 93-96% | Varies | Fairly confident |
| 5 | 96-99% | Varies | Mostly confident |
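The ranges in the table share their endpoints (87% appears under both Priority 1 and Priority 2), so how boundary scores are assigned depends on the report script. As a minimal sketch, assuming each range includes its lower bound and excludes its upper bound, the mapping could look like the following. The function name `priority_for_score` is hypothetical and not taken from `generate_wikidata_review_report.py`.

```python
def priority_for_score(score: float) -> int:
    """Map a fuzzy match score (85-99) to a review priority (1 = review first).

    Assumption: each range includes its lower bound and excludes its upper
    bound, except the last range, which includes 99.
    """
    if not 85 <= score <= 99:
        raise ValueError(f"score {score} is outside the fuzzy range 85-99")
    # (exclusive upper bound, priority) pairs, checked in ascending order
    upper_bounds = [(87, 1), (90, 2), (93, 3), (96, 4)]
    for upper, priority in upper_bounds:
        if score < upper:
            return priority
    return 5
```

Under this assumption a score of exactly 87 lands in Priority 2, not Priority 1.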
---
## Validation Process
### For Each Row:
#### Step 1: Compare Names
**Check**: Does `institution_name` match `wikidata_label`?
**Examples**:
✅ **CORRECT Match**:
- Institution: "Nationalmuseet"
- Wikidata: "National Museum of Denmark"
- Verdict: **Different languages, same entity** → CORRECT
✅ **CORRECT Match**:
- Institution: "Rigsarkivet"
- Wikidata: "Danish National Archives"
- Verdict: **Official name + English translation** → CORRECT
❌ **INCORRECT Match**:
- Institution: "Bibliotek Svendborg"
- Wikidata: "Svendborg Museum"
- Verdict: **Library vs Museum** → INCORRECT
⚠️ **UNCERTAIN Match**:
- Institution: "Herning Bibliotekerne"
- Wikidata: "Herning Libraries"
- Verdict: **Plural form, ambiguous** → UNCERTAIN (needs further research)
---
#### Step 2: Verify on Wikidata
**Click** the `wikidata_url` link to open the Wikidata entity page.
**Check**:
1. **Instance of (P31)**: Does the Wikidata type match `institution_type`?
- LIBRARY → Check for `Q7075` (library) or subclass
- ARCHIVE → Check for `Q166118` (archive) or subclass
- MUSEUM → Check for `Q33506` (museum) or subclass
2. **Country (P17)**: Should be `Q35` (Denmark)
3. **ISIL code (P791)**: If present, compare with `isil_code` column
- ✅ Match = high confidence CORRECT
- ❌ Mismatch = likely INCORRECT
4. **Located in (P131)**: Should match or be near `city` column
5. **Coordinates (P625)**: Check on Wikidata map if location makes sense
6. **Official website (P856)**: Visit to confirm institution identity
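The first three checks can be partly automated once the entity's JSON is available. The sketch below assumes the entity dict follows the standard Wikibase JSON layout (as served, for example, under `https://www.wikidata.org/wiki/Special:EntityData/<QID>.json`, inside the `entities` key). The helper names `claim_values` and `check_entity` are hypothetical. It only compares direct P31 values, so a `False` type result still needs a manual look for subclasses.

```python
def claim_values(entity: dict, prop: str) -> list:
    """Return the plain values of all statements for one property.

    Item values come back as Q-ids (e.g. "Q7075"); string values such as
    ISIL codes come back as-is. Statements without a value are skipped.
    """
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        is_item = isinstance(value, dict) and "id" in value
        values.append(value["id"] if is_item else value)
    return values


# Direct type Q-ids from the checklist; subclasses are NOT resolved here.
TYPE_QIDS = {"LIBRARY": "Q7075", "ARCHIVE": "Q166118", "MUSEUM": "Q33506"}


def check_entity(entity: dict, institution_type: str, isil_code: str = "") -> dict:
    """Compare one CSV row with a Wikidata entity; None means 'cannot tell'."""
    isils = claim_values(entity, "P791")
    return {
        "type_matches": TYPE_QIDS.get(institution_type) in claim_values(entity, "P31"),
        "in_denmark": "Q35" in claim_values(entity, "P17"),
        # Only meaningful when both sides actually have an ISIL code
        "isil_matches": (isil_code in isils) if (isil_code and isils) else None,
    }
```

Location (P131, P625) and website (P856) checks are left to the reviewer, since they call for judgment rather than an exact comparison.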
---
#### Step 3: Research If Uncertain
**When to research further**:
- Names are similar but not identical
- Institution type unclear on Wikidata
- No ISIL code to cross-reference
- Multiple institutions with similar names in same city
**Research Sources**:
1. **Danish ISIL Registry**: [https://isil.dk](https://isil.dk)
- Official authoritative source
- Search by institution name or ISIL code
2. **Institution Website**: Google `[institution_name] [city] Denmark`
- Check "About" page
- Verify address, type, mission
3. **Danish Library Directory**: [https://bibliotek.dk](https://bibliotek.dk)
- For libraries only
- Search by name or location
4. **Danish National Archives**: [https://www.sa.dk](https://www.sa.dk)
- For archives only
- Check member directory
5. **Danish Agency for Culture and Palaces**: [https://slks.dk](https://slks.dk)
- Government cultural institutions
- Official registries
---
#### Step 4: Record Validation Decision
Fill in the CSV columns:
**`validation_status`**: Choose ONE:
- **`CORRECT`** - Confirmed match, keep Wikidata link
- **`INCORRECT`** - Wrong match, remove Wikidata link
- **`UNCERTAIN`** - Needs expert review, flag for follow-up
**`validation_notes`**: Add explanation (optional but recommended)
**Examples**:
```csv
validation_status,validation_notes
CORRECT,"ISIL codes match (DK-870970), same location"
INCORRECT,"Wrong institution - Wikidata is museum, we have library"
UNCERTAIN,"Similar names but no ISIL to confirm, needs local knowledge"
CORRECT,"Verified on official website, same address"
INCORRECT,"Wikidata entity is defunct branch, ours is current main library"
```
---
## Common Validation Scenarios
### Scenario 1: Danish vs English Names
**Problem**: Institution name in Danish, Wikidata label in English
**Example**:
- Institution: "Kgl. Bibliotek"
- Wikidata: "Royal Danish Library"
**Solution**:
1. Check Wikidata for Danish label (alias)
2. If match found → CORRECT
3. Add note: "Danish/English name variant"
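When many rows fall under this scenario, the Danish label and aliases can be read from the entity JSON instead of the web page. This is a minimal sketch assuming the standard Wikibase JSON layout (`labels` and `aliases` keyed by language code); the function names are hypothetical.

```python
def danish_names(entity: dict) -> set[str]:
    """Collect the Danish label and all Danish aliases of an entity."""
    names = set()
    label = entity.get("labels", {}).get("da")
    if label:
        names.add(label["value"])
    for alias in entity.get("aliases", {}).get("da", []):
        names.add(alias["value"])
    return names


def has_danish_name(entity: dict, institution_name: str) -> bool:
    """True if the institution name equals a Danish label or alias (case-insensitive)."""
    wanted = institution_name.casefold().strip()
    return any(name.casefold().strip() == wanted for name in danish_names(entity))
```

An exact hit supports CORRECT; a miss proves nothing, because the alias may simply not have been added to Wikidata yet.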
---
### Scenario 2: Historical Name Changes
**Problem**: Institution renamed, Wikidata has old name
**Example**:
- Institution: "Københavns Bibliotek" (current)
- Wikidata: "Copenhagen Public Library" (outdated label)
**Solution**:
1. Check Wikidata history/aliases
2. If same entity, different era → CORRECT
3. Add note: "Historical name, now [current name]"
4. Optionally: Edit Wikidata to add current name as alias
---
### Scenario 3: Merged Institutions
**Problem**: Two institutions merged, Wikidata shows pre-merger entity
**Example**:
- Institution: "Statsbiblioteket" (merged into KB)
- Wikidata: "State and University Library" (defunct)
**Solution**:
1. Check Wikidata for "replaced by" (P1366), "dissolved, abolished or demolished date" (P576), or an "end time" (P582) qualifier
2. If defunct and our data is current → INCORRECT
3. Add note: "Wikidata entity is defunct, merged into [new name] in [year]"
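The defunct check can be sketched against the entity JSON as well. The code assumes the standard Wikibase JSON layout and makes two assumptions beyond the checklist: it also looks at P576 (dissolved, abolished or demolished date), and it looks for P582 as a qualifier on any statement, which is how "end time" usually appears. The function name is hypothetical.

```python
# Main-statement properties that suggest an entity no longer exists
DEFUNCT_MAIN = {
    "P1366": "replaced by",
    "P576": "dissolved, abolished or demolished date",  # assumption: not in the checklist
}


def defunct_signals(entity: dict) -> list[str]:
    """List the hints that a Wikidata entity may be defunct (empty list = none found)."""
    claims = entity.get("claims", {})
    signals = [f"{prop} ({label})" for prop, label in DEFUNCT_MAIN.items() if claims.get(prop)]
    # "end time" is normally a qualifier, so scan the qualifiers of every statement
    for prop, statements in claims.items():
        if any("P582" in s.get("qualifiers", {}) for s in statements):
            signals.append(f"P582 (end time) qualifier on {prop}")
    return signals
```

A non-empty result is a prompt to read the entity page, not a verdict: an ended statement may describe a former address or name and not the institution itself.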
---
### Scenario 4: Branch vs Main Library
**Problem**: Wikidata refers to branch, our data is main library (or vice versa)
**Example**:
- Institution: "Aarhus Bibliotek" (main)
- Wikidata: "Aarhus Public Libraries" (system)
**Solution**:
1. Check Wikidata for "part of" (P361) relationship
2. If Wikidata is parent system → CORRECT (acceptable level)
3. If Wikidata is specific branch → UNCERTAIN
4. Add note: "Wikidata is [parent/branch], check hierarchy"
---
### Scenario 5: Missing ISIL Codes
**Problem**: Institution has no ISIL code, can't cross-reference
**Solution**:
1. Rely on city + name + institution type match
2. If all three match → CORRECT
3. If any mismatch → UNCERTAIN
4. Add note: "No ISIL to verify, based on [city/name/type] match"
---
## Batch Validation Tips
### For Large Datasets (100+ matches):
1. **Sort by Priority**: Review Priority 1 first (most uncertain)
2. **Filter by Institution Type**: Review all libraries, then all archives
- Easier to maintain context
- Faster lookup in type-specific registries
3. **Group by City**: Review all institutions in same city together
- Easier to distinguish similar names
- Local knowledge helps
4. **Use Find & Replace** for common notes:
- "ISIL match confirmed" (for CORRECT + ISIL match)
- "Name variant, same entity" (for CORRECT + translation)
- "Type mismatch" (for INCORRECT + wrong type)
5. **Mark Quick Wins First**:
- ISIL codes match → Instant CORRECT
- Type mismatch (Library vs Museum) → Instant INCORRECT
- Then tackle ambiguous cases
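The quick-win rule in tip 5 is simple enough to express as a function. In this sketch, `isil_code` and `institution_type` are the CSV columns named in this checklist, while `wikidata_isil` and `wikidata_type` are hypothetical inputs the reviewer has already read from the entity page (the type normalised to LIBRARY, ARCHIVE, or MUSEUM).

```python
from typing import Optional


def quick_win(row: dict, wikidata_isil: Optional[str], wikidata_type: Optional[str]) -> Optional[str]:
    """Return CORRECT or INCORRECT for clear-cut rows, or None if the row is ambiguous."""
    isil = (row.get("isil_code") or "").strip()
    # Matching ISIL codes: instant CORRECT
    if isil and wikidata_isil and isil == wikidata_isil.strip():
        return "CORRECT"
    # Known Wikidata type that differs from ours: instant INCORRECT
    if wikidata_type and wikidata_type != row.get("institution_type"):
        return "INCORRECT"
    # Everything else needs the full validation process
    return None
```

The ISIL comparison is checked first, following the checklist's treatment of a matching ISIL code as the strongest evidence.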
---
## Quality Assurance Checks
### Before Running Update Script:
**Completeness**: All Priority 1-2 rows have `validation_status`
**Consistency**:
- CORRECT matches should have high confidence reasons (ISIL, website, type match)
- INCORRECT matches should have clear explanation (type mismatch, wrong entity)
- UNCERTAIN should be rare (<5% of total)
**Documentation**: At least 50% of rows have `validation_notes`
**Spelling**: `validation_status` values must match exactly:
- ✅ Valid: `CORRECT` (all caps; likewise `INCORRECT` and `UNCERTAIN`)
- ❌ Invalid: `Correct`, `correct`, `OK`, `YES`
**No Blanks**: Ensure no accidentally deleted data in other columns
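Most of these checks can be run mechanically before the update script. The sketch below assumes the CSV has a `priority` column holding the values 1 to 5 (an assumption; only `validation_status` and `validation_notes` are named in this checklist) and measures the UNCERTAIN share against reviewed rows. The function names are hypothetical.

```python
import csv
from collections import Counter

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def load_rows(path: str) -> list[dict]:
    """Read the review CSV as a list of dicts, one per data row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def qa_problems(rows: list[dict]) -> list[str]:
    """Return human-readable QA problems; an empty list means all checks passed."""
    problems = []
    statuses = Counter()
    noted = 0
    for number, row in enumerate(rows, start=1):
        status = row.get("validation_status") or ""
        notes = (row.get("validation_notes") or "").strip()
        noted += bool(notes)
        if status in VALID_STATUSES:
            statuses[status] += 1
        elif status.strip():
            problems.append(f"row {number}: invalid status {status!r}")
        elif row.get("priority") in {"1", "2"}:
            problems.append(f"row {number}: priority {row['priority']} row not reviewed")
        if status == "INCORRECT" and not notes:
            problems.append(f"row {number}: INCORRECT without validation_notes")
    reviewed = sum(statuses.values())
    if reviewed and statuses["UNCERTAIN"] / reviewed >= 0.05:
        problems.append("UNCERTAIN is 5% or more of reviewed rows")
    if rows and noted / len(rows) < 0.5:
        problems.append("fewer than 50% of rows have validation_notes")
    return problems
```

A typical call would be `qa_problems(load_rows("data/review/denmark_wikidata_fuzzy_matches.csv"))`. The "No Blanks" check is not covered, because it needs the original file for comparison.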
---
## Apply Validated Results
### After Completing Manual Review:
```bash
# Apply validation results to dataset
python scripts/apply_wikidata_validation.py
```
**Output**: `data/instances/denmark_complete_validated.json`
### Expected Statistics:
- **CORRECT**: ~85-90% of reviewed (keep Wikidata link)
- **INCORRECT**: ~5-10% of reviewed (remove Wikidata link)
- **UNCERTAIN**: ~5% of reviewed (flag for expert review)
---
## Re-export RDF
### After Validation Applied:
```bash
# Re-export RDF with corrected Wikidata links
python scripts/export_denmark_rdf.py \
--input data/instances/denmark_complete_validated.json \
--output data/rdf/denmark_validated
```
**Result**: New RDF files with only verified Wikidata links
---
## Validation Metrics
### Track Your Progress:
| Metric | Target | Actual |
|--------|--------|--------|
| Priority 1 reviewed | 100% | ___ |
| Priority 2 reviewed | 100% | ___ |
| Priority 3 reviewed | 80% | ___ |
| Priority 4-5 reviewed | 50% | ___ |
| Incorrect matches found | 5-10% | ___ |
| Uncertain matches | <5% | ___ |
| Time spent (hours) | - | ___ |
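The "Actual" column for the review-coverage rows can be filled in from the CSV. This sketch again assumes a `priority` column with values 1 to 5 (not confirmed by this checklist) and treats any non-empty `validation_status` as reviewed. `review_progress` is a hypothetical helper.

```python
from collections import defaultdict

# Review targets from the table above, as fractions
TARGETS = {"1": 1.00, "2": 1.00, "3": 0.80, "4-5": 0.50}


def review_progress(rows: list[dict]) -> dict:
    """Return {bucket: (reviewed_fraction, target_met)} for buckets that have rows."""
    total = defaultdict(int)
    done = defaultdict(int)
    for row in rows:
        priority = (row.get("priority") or "").strip()
        bucket = "4-5" if priority in {"4", "5"} else priority
        if bucket not in TARGETS:
            continue  # unknown or missing priority: ignore the row
        total[bucket] += 1
        if (row.get("validation_status") or "").strip():
            done[bucket] += 1
    return {
        bucket: (done[bucket] / total[bucket], done[bucket] / total[bucket] >= TARGETS[bucket])
        for bucket in TARGETS
        if total[bucket]
    }
```

Passing the rows from `csv.DictReader` over the review file gives one entry per priority bucket, ready to copy into the table.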
---
## Common Mistakes to Avoid
- ❌ **DON'T**: Mark as CORRECT just because names are similar
- ✅ **DO**: Verify at least 2 data points (city + type, or ISIL + name)
- ❌ **DON'T**: Mark as INCORRECT without checking the Wikidata page
- ✅ **DO**: Always visit `wikidata_url` to verify entity details
- ❌ **DON'T**: Leave `validation_notes` blank for INCORRECT
- ✅ **DO**: Explain why the match is wrong ("type mismatch", "different location", etc.)
- ❌ **DON'T**: Use UNCERTAIN as the default when unsure
- ✅ **DO**: Research further using Danish registries before marking UNCERTAIN
- ❌ **DON'T**: Assume the English Wikipedia article title is the Wikidata label
- ✅ **DO**: Check Wikidata directly for Danish labels (aliases)
---
## Escalation Process
### When to Ask for Help:
1. **>20% marked UNCERTAIN** - May indicate systematic issue with matching algorithm
2. **Local Knowledge Required** - Danish language expertise needed for name variants
3. **Conflicting Sources** - ISIL registry vs Wikidata vs institution website disagree
4. **Legal/Historical Complexity** - Mergers, acquisitions, name changes unclear
**Contact**: [Project maintainer email/Slack/GitHub issue]
---
## Example Validation Session
### Sample Review (5 institutions):
**Row 1**:
- Institution: "Aalborg Bibliotek"
- Wikidata: "Aalborg Public Library"
- Match Score: 86%
- ISIL: DK-820010 (match found on Wikidata)
- **Decision**: CORRECT
- **Notes**: "ISIL match confirmed, Danish/English variant"
**Row 2**:
- Institution: "Arkiv Svendborg"
- Wikidata: "Svendborg Museum"
- Match Score: 87%
- Type: ARCHIVE vs MUSEUM (mismatch)
- **Decision**: INCORRECT
- **Notes**: "Type mismatch - archive vs museum"
**Row 3**:
- Institution: "Herning Bibliotekerne"
- Wikidata: "Herning Centralbibliotek"
- Match Score: 88%
- City: Herning (match)
- **Decision**: UNCERTAIN
- **Notes**: "Central library vs library system, check hierarchy"
**Row 4**:
- Institution: "Rigsarkivet"
- Wikidata: "Danish National Archives"
- Match Score: 92%
- ISIL: DK-011 (match), Official website verified
- **Decision**: CORRECT
- **Notes**: "Official name + English translation, website confirmed"
**Row 5**:
- Institution: "Bibliotek Brønderslev"
- Wikidata: "Brønderslev Library"
- Match Score: 95%
- City: Brønderslev (match), Type: LIBRARY (match)
- **Decision**: CORRECT
- **Notes**: "City and type match, high confidence"
---
## Post-Validation Documentation
### Update Project Documentation:
1. **PROGRESS.md**: Add validation statistics
2. **SESSION_SUMMARY**: Document findings and corrections
3. **data/rdf/README.md**: Note dataset version with validation applied
4. **Commit Message**: "feat: Apply manual Wikidata validation (185 fuzzy matches reviewed)"
---
## References
- **Wikidata SPARQL Query Service**: https://query.wikidata.org
- **Wikidata Property Browser**: https://www.wikidata.org/wiki/Special:ListProperties
- **Danish ISIL Registry**: https://isil.dk
- **RapidFuzz Library** (matching algorithm): https://github.com/maxbachmann/RapidFuzz
- **Fuzzy String Matching**: https://en.wikipedia.org/wiki/Approximate_string_matching
---
**Version**: 1.0
**Last Updated**: 2025-11-19
**Maintained By**: GLAM Data Extraction Project