# Wikidata Fuzzy Match Validation Checklist

## Overview

This checklist helps reviewers manually validate fuzzy Wikidata matches (85-99% confidence) to ensure data quality before RDF publication.

**Dataset**: Danish GLAM institutions
**Fuzzy Matches**: 185 institutions with match scores 85-99%
**Review File**: `data/review/denmark_wikidata_fuzzy_matches.csv`

---

## Quick Start

### 1. Generate Review Report

```bash
python scripts/generate_wikidata_review_report.py
```

**Output**: `data/review/denmark_wikidata_fuzzy_matches.csv`

### 2. Open CSV in Spreadsheet Software

- **Excel**: Import via Data → From Text/CSV and select UTF-8 as the file origin (opening the file directly can garble æ, ø, and å)
- **Google Sheets**: File → Import → Upload CSV
- **LibreOffice Calc**: Open, select UTF-8 encoding

### 3. Review Priority Order

Start with **Priority 1** (most uncertain) matches and work downward:

| Priority | Score Range | Count | Description |
|----------|-------------|-------|-------------|
| 1 | 85-87% | Varies | Very uncertain - review first |
| 2 | 87-90% | Varies | Uncertain - needs verification |
| 3 | 90-93% | Varies | Moderate confidence |
| 4 | 93-96% | Varies | Fairly confident |
| 5 | 96-99% | Varies | Mostly confident |
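
The priority bands above can be expressed as a small lookup. This is only a sketch: the function name `review_priority` is hypothetical, and because the table's ranges share their boundary values (87, 90, 93, 96), it assumes each band includes its lower bound and excludes its upper bound. The report script may bucket scores differently.

```python
def review_priority(score: float) -> int:
    """Map a fuzzy match score (85-99) to a review priority, 1 = review first."""
    if not 85 <= score < 100:
        raise ValueError(f"score {score} is outside the fuzzy-match range 85-99")
    # Assumption: a score on a shared boundary (e.g. 87) belongs to the higher band.
    upper_bounds = [(87, 1), (90, 2), (93, 3), (96, 4)]
    for upper, priority in upper_bounds:
        if score < upper:
            return priority
    return 5  # 96-99%


if __name__ == "__main__":
    for sample in (86, 88, 92, 95):
        print(sample, "-> priority", review_priority(sample))
```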

---

## Validation Process

### For Each Row:

#### Step 1: Compare Names

**Check**: Does `institution_name` refer to the same entity as `wikidata_label`? The two strings do not need to be identical.

**Examples**:

✅ **CORRECT Match**:
- Institution: "Nationalmuseet"
- Wikidata: "National Museum of Denmark"
- Verdict: **Different languages, same entity** → CORRECT

✅ **CORRECT Match**:
- Institution: "Rigsarkivet"
- Wikidata: "Danish National Archives"
- Verdict: **Official name + English translation** → CORRECT

❌ **INCORRECT Match**:
- Institution: "Bibliotek Svendborg"
- Wikidata: "Svendborg Museum"
- Verdict: **Library vs Museum** → INCORRECT

⚠️ **UNCERTAIN Match**:
- Institution: "Herning Bibliotekerne"
- Wikidata: "Herning Libraries"
- Verdict: **Plural form, ambiguous** → UNCERTAIN (needs further research)

---

#### Step 2: Verify on Wikidata

**Click** the `wikidata_url` link to open the Wikidata entity page.

**Check**:

1. **Instance of (P31)**: Does the Wikidata type match `institution_type`?
   - LIBRARY → Check for `Q7075` (library) or subclass
   - ARCHIVE → Check for `Q166118` (archive) or subclass
   - MUSEUM → Check for `Q33506` (museum) or subclass

2. **Country (P17)**: Should be `Q35` (Denmark)

3. **ISIL code (P791)**: If present, compare with `isil_code` column
   - ✅ Match = high confidence CORRECT
   - ❌ Mismatch = likely INCORRECT

4. **Located in the administrative territorial entity (P131)**: Should match or be near `city` column

5. **Coordinate location (P625)**: Check on Wikidata map if location makes sense

6. **Official website (P856)**: Visit to confirm institution identity
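
Some of these checks can be pre-computed from the entity JSON that Wikidata serves (the `claims` → `mainsnak` → `datavalue` structure). The sketch below is illustrative only: the helper names `claim_values` and `check_entity` are hypothetical, the entity is passed in as an already-loaded dict, and the type check compares direct P31 values only, so a subclass of library, archive, or museum still needs a manual look.

```python
# Wikidata classes named in the checklist above.
EXPECTED_TYPE = {"LIBRARY": "Q7075", "ARCHIVE": "Q166118", "MUSEUM": "Q33506"}


def claim_values(entity: dict, prop: str) -> list:
    """Return the plain values of every statement for one property."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is None:  # "no value" / "unknown value" statements
            continue
        value = datavalue["value"]
        # Item-valued properties hold a dict with an "id"; ISIL is a plain string.
        if isinstance(value, dict) and "id" in value:
            value = value["id"]
        values.append(value)
    return values


def check_entity(entity: dict, institution_type: str, isil_code=None) -> dict:
    """Summarise the P31, P17 and P791 checks for one candidate match."""
    isils = claim_values(entity, "P791")
    return {
        # Direct P31 match only; subclasses are NOT resolved here.
        "type_direct_match": EXPECTED_TYPE.get(institution_type) in claim_values(entity, "P31"),
        "in_denmark": "Q35" in claim_values(entity, "P17"),
        # None means "cannot compare" (no ISIL on one side or the other).
        "isil_match": (isil_code in isils) if (isil_code and isils) else None,
    }
```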

---

#### Step 3: Research If Uncertain

**When to research further**:

- Names are similar but not identical
- Institution type unclear on Wikidata
- No ISIL code to cross-reference
- Multiple institutions with similar names in same city

**Research Sources**:

1. **Danish ISIL Registry**: [https://isil.dk](https://isil.dk)
   - Official authoritative source
   - Search by institution name or ISIL code

2. **Institution Website**: Search the web for `[institution_name] [city] Denmark`
   - Check "About" page
   - Verify address, type, mission

3. **Danish Library Directory**: [https://bibliotek.dk](https://bibliotek.dk)
   - For libraries only
   - Search by name or location

4. **Danish National Archives**: [https://www.sa.dk](https://www.sa.dk)
   - For archives only
   - Check member directory

5. **Danish Agency for Culture and Palaces**: [https://slks.dk](https://slks.dk)
   - Government cultural institutions
   - Official registries

---

#### Step 4: Record Validation Decision

Fill in the CSV columns:

**`validation_status`**: Choose ONE:

- **`CORRECT`** - Confirmed match, keep Wikidata link
- **`INCORRECT`** - Wrong match, remove Wikidata link
- **`UNCERTAIN`** - Needs expert review, flag for follow-up

**`validation_notes`**: Add explanation (optional but recommended)

**Examples**:

```csv
validation_status,validation_notes
CORRECT,"ISIL codes match (DK-870970), same location"
INCORRECT,"Wrong institution - Wikidata is museum, we have library"
UNCERTAIN,"Similar names but no ISIL to confirm, needs local knowledge"
CORRECT,"Verified on official website, same address"
INCORRECT,"Wikidata entity is defunct branch, ours is current main library"
```

---

## Common Validation Scenarios

### Scenario 1: Danish vs English Names

**Problem**: Institution name in Danish, Wikidata label in English

**Example**:
- Institution: "Kgl. Bibliotek"
- Wikidata: "Royal Danish Library"

**Solution**:
1. Check the Wikidata entity for a Danish label or alias
2. If the Danish label or alias matches the institution name → CORRECT
3. Add note: "Danish/English name variant"

---

### Scenario 2: Historical Name Changes

**Problem**: Institution renamed, Wikidata has old name

**Example**:
- Institution: "Københavns Bibliotek" (current)
- Wikidata: "Copenhagen Public Library" (outdated label)

**Solution**:
1. Check the Wikidata aliases and revision history
2. If same entity, different era → CORRECT
3. Add note: "Historical name, now [current name]"
4. Optionally: Edit Wikidata to add current name as alias

---

### Scenario 3: Merged Institutions

**Problem**: Two institutions merged, Wikidata shows pre-merger entity

**Example**:
- Institution: "Statsbiblioteket" (merged into KB)
- Wikidata: "State and University Library" (defunct)

**Solution**:
1. Check Wikidata for "replaced by" (P1366), "end time" (P582), or "dissolved, abolished or demolished date" (P576)
2. If defunct and our data is current → INCORRECT
3. Add note: "Wikidata entity is defunct, merged into [new name] in [year]"

---

### Scenario 4: Branch vs Main Library

**Problem**: Wikidata refers to branch, our data is main library (or vice versa)

**Example**:
- Institution: "Aarhus Bibliotek" (main)
- Wikidata: "Aarhus Public Libraries" (system)

**Solution**:
1. Check Wikidata for "part of" (P361) relationship
2. If Wikidata is parent system → CORRECT (acceptable level of granularity)
3. If Wikidata is specific branch → UNCERTAIN
4. Add note: "Wikidata is [parent/branch], check hierarchy"

---

### Scenario 5: Missing ISIL Codes

**Problem**: Institution has no ISIL code, so it cannot be cross-referenced

**Solution**:
1. Rely on city + name + institution type match
2. If all three match → CORRECT
3. If any of the three does not match → UNCERTAIN
4. Add note: "No ISIL to verify, based on [city/name/type] match"

---

## Batch Validation Tips

### For Large Datasets (100+ matches):

1. **Sort by Priority**: Review Priority 1 first (most uncertain)

2. **Filter by Institution Type**: Review all libraries, then all archives
   - Easier to maintain context
   - Faster lookup in type-specific registries

3. **Group by City**: Review all institutions in same city together
   - Easier to distinguish similar names
   - Local knowledge helps

4. **Use Find & Replace** for common notes:
   - "ISIL match confirmed" (for CORRECT + ISIL match)
   - "Name variant, same entity" (for CORRECT + translation)
   - "Type mismatch" (for INCORRECT + wrong type)

5. **Mark Quick Wins First**:
   - ISIL codes match → Instant CORRECT
   - Type mismatch (Library vs Museum) → Instant INCORRECT
   - Then tackle ambiguous cases
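
The quick wins could be pre-filled before manual review starts. The following is a rough sketch, not part of the project's scripts: `quick_win` is a hypothetical helper, and the `wikidata_isil` and `wikidata_type` keys are assumed columns that the review CSV may not contain (`isil_code` and `institution_type` are the columns named in this checklist). Rows for which it returns `None` still need the full manual process.

```python
from typing import Optional


def quick_win(row: dict) -> Optional[str]:
    """Suggest a status for clear-cut rows; return None when a human must decide."""
    ours, theirs = row.get("isil_code"), row.get("wikidata_isil")
    if ours and theirs and ours == theirs:
        return "CORRECT"  # ISIL codes match
    our_type, wd_type = row.get("institution_type"), row.get("wikidata_type")
    if our_type and wd_type and our_type != wd_type:
        return "INCORRECT"  # e.g. LIBRARY vs MUSEUM
    return None  # ambiguous: review manually


if __name__ == "__main__":
    example = {"institution_type": "ARCHIVE", "wikidata_type": "MUSEUM"}
    print(quick_win(example))
```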

---

## Quality Assurance Checks

### Before Running Update Script:

✅ **Completeness**: All Priority 1-2 rows have `validation_status`

✅ **Consistency**:
- CORRECT matches should have high confidence reasons (ISIL, website, type match)
- INCORRECT matches should have clear explanation (type mismatch, wrong entity)
- UNCERTAIN should be rare (<5% of total)

✅ **Documentation**: At least 50% of rows have `validation_notes`

✅ **Spelling**: `validation_status` values are exact:
- ✅ `CORRECT` (all caps)
- ❌ `Correct`, `correct`, `OK`, `YES`

✅ **No Blanks**: Ensure no accidentally deleted data in other columns
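
The completeness, documentation, and spelling checks can be automated with the standard `csv` module. This is a minimal sketch under assumptions: `qa_report` is a hypothetical helper, and the `priority` column name is a guess at how the review CSV labels the priority band; `validation_status` and `validation_notes` are the columns described in this checklist.

```python
import csv
import io

ALLOWED = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def qa_report(rows) -> list:
    """Return human-readable problems found in the reviewed rows."""
    rows = list(rows)
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        status = row.get("validation_status") or ""
        if status and status not in ALLOWED:
            problems.append(f"line {line_no}: invalid status {status!r}")
        # Assumption: the priority band is stored in a column named "priority".
        if row.get("priority") in {"1", "2"} and not status:
            problems.append(f"line {line_no}: priority {row['priority']} row has no status")
    noted = sum(1 for r in rows if (r.get("validation_notes") or "").strip())
    if rows and noted / len(rows) < 0.5:
        problems.append("fewer than 50% of rows have validation_notes")
    return problems


if __name__ == "__main__":
    sample = io.StringIO(
        "priority,validation_status,validation_notes\n"
        "1,CORRECT,ISIL match confirmed\n"
        "2,Correct,\n"
    )
    for problem in qa_report(csv.DictReader(sample)):
        print(problem)
```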

---

## Apply Validated Results

### After Completing Manual Review:

```bash
# Apply validation results to dataset
python scripts/apply_wikidata_validation.py
```

**Output**: `data/instances/denmark_complete_validated.json`

### Statistics Expected:

- **CORRECT**: ~85-90% of reviewed (keep Wikidata link)
- **INCORRECT**: ~5-10% of reviewed (remove Wikidata link)
- **UNCERTAIN**: ~5% of reviewed (flag for expert review)
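
To compare a finished review against these expected shares, the status column can be tallied with `collections.Counter`. The helper name `status_shares` is hypothetical, and the sketch assumes blank statuses mean "not yet reviewed" and are excluded from the denominator.

```python
from collections import Counter

STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")


def status_shares(statuses) -> dict:
    """Return each status as a percentage of reviewed rows (blank = unreviewed)."""
    counts = Counter(s for s in statuses if s in STATUSES)
    reviewed = sum(counts.values())
    if reviewed == 0:
        return {}
    return {name: round(100 * counts[name] / reviewed, 1) for name in STATUSES}


if __name__ == "__main__":
    column = ["CORRECT"] * 17 + ["INCORRECT"] * 2 + ["UNCERTAIN"] + [""] * 5
    print(status_shares(column))
```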

---

## Re-export RDF

### After Validation Applied:

```bash
# Re-export RDF with corrected Wikidata links
python scripts/export_denmark_rdf.py \
  --input data/instances/denmark_complete_validated.json \
  --output data/rdf/denmark_validated
```

**Result**: New RDF files with only verified Wikidata links

---

## Validation Metrics

### Track Your Progress:

| Metric | Target | Actual |
|--------|--------|--------|
| Priority 1 reviewed | 100% | ___ |
| Priority 2 reviewed | 100% | ___ |
| Priority 3 reviewed | 80% | ___ |
| Priority 4-5 reviewed | 50% | ___ |
| Incorrect matches found | 5-10% | ___ |
| Uncertain matches | <5% | ___ |
| Time spent (hours) | - | ___ |

---

## Common Mistakes to Avoid

❌ **DON'T**: Mark as CORRECT just because names are similar
✅ **DO**: Verify at least 2 data points (city + type, or ISIL + name)

❌ **DON'T**: Mark as INCORRECT without checking Wikidata page
✅ **DO**: Always visit `wikidata_url` to verify entity details

❌ **DON'T**: Leave `validation_notes` blank for INCORRECT
✅ **DO**: Explain why match is wrong ("type mismatch", "different location", etc.)

❌ **DON'T**: Use UNCERTAIN as default when unsure
✅ **DO**: Research further using Danish registries before marking UNCERTAIN

❌ **DON'T**: Assume the English Wikipedia title equals the Wikidata label
✅ **DO**: Check Wikidata directly for Danish labels and aliases

---

## Escalation Process

### When to Ask for Help:

1. **>20% marked UNCERTAIN** - May indicate systematic issue with matching algorithm

2. **Local Knowledge Required** - Danish language expertise needed for name variants

3. **Conflicting Sources** - ISIL registry vs Wikidata vs institution website disagree

4. **Legal/Historical Complexity** - Mergers, acquisitions, name changes unclear

**Contact**: [Project maintainer email/Slack/GitHub issue]

---

## Example Validation Session

### Sample Review (5 institutions):

**Row 1**:
- Institution: "Aalborg Bibliotek"
- Wikidata: "Aalborg Public Library"
- Match Score: 86%
- ISIL: DK-820010 (match found on Wikidata)
- **Decision**: CORRECT
- **Notes**: "ISIL match confirmed, Danish/English variant"

**Row 2**:
- Institution: "Arkiv Svendborg"
- Wikidata: "Svendborg Museum"
- Match Score: 87%
- Type: ARCHIVE vs MUSEUM (mismatch)
- **Decision**: INCORRECT
- **Notes**: "Type mismatch - archive vs museum"

**Row 3**:
- Institution: "Herning Bibliotekerne"
- Wikidata: "Herning Centralbibliotek"
- Match Score: 88%
- City: Herning (match)
- **Decision**: UNCERTAIN
- **Notes**: "Central library vs library system, check hierarchy"

**Row 4**:
- Institution: "Rigsarkivet"
- Wikidata: "Danish National Archives"
- Match Score: 92%
- ISIL: DK-011 (match), Official website verified
- **Decision**: CORRECT
- **Notes**: "Official name + English translation, website confirmed"

**Row 5**:
- Institution: "Bibliotek Brønderslev"
- Wikidata: "Brønderslev Library"
- Match Score: 95%
- City: Brønderslev (match), Type: LIBRARY (match)
- **Decision**: CORRECT
- **Notes**: "City and type match, high confidence"

---

## Post-Validation Documentation

### Update Project Documentation:

1. **PROGRESS.md**: Add validation statistics
2. **SESSION_SUMMARY**: Document findings and corrections
3. **data/rdf/README.md**: Note dataset version with validation applied
4. **Commit Message**: "feat: Apply manual Wikidata validation (185 fuzzy matches reviewed)"

---

## References

- **Wikidata SPARQL Query Service**: https://query.wikidata.org
- **Wikidata Property Browser**: https://www.wikidata.org/wiki/Special:ListProperties
- **Danish ISIL Registry**: https://isil.dk
- **RapidFuzz Library** (matching algorithm): https://github.com/maxbachmann/RapidFuzz
- **Fuzzy String Matching**: https://en.wikipedia.org/wiki/Approximate_string_matching

---

**Version**: 1.0
**Last Updated**: 2025-11-19
**Maintained By**: GLAM Data Extraction Project