glam/docs/WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md
2025-11-19 23:25:22 +01:00

Wikidata Fuzzy Match Review - Danish Dataset

Executive Summary

Dataset: Danish GLAM institutions (2,348 total)
Wikidata Coverage: 769 institutions (32.8%)
Fuzzy Matches Requiring Review: 185 institutions (24.1% of linked)
Match Score Range: 85-99% confidence

Review Status: 🟡 PENDING MANUAL REVIEW


Review Scope

Match Distribution by Priority

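| Priority | Score Range | Count | % of Fuzzy | Description |
|----------|-------------|-------|------------|-------------|
| 1 | 85-87% | 58 | 31.4% | Very uncertain - REVIEW FIRST |
| 2 | 87-90% | 62 | 33.5% | Uncertain - needs verification |
| 3 | 90-93% | 44 | 23.8% | Moderate confidence |
| 4 | 93-96% | 14 | 7.6% | Fairly confident |
| 5 | 96-99% | 7 | 3.8% | Mostly confident |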

Recommended Focus: Priority 1-2 (120 matches = 64.9% of fuzzy matches)
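
The score ranges above share their boundary values (87% appears in both Priority 1 and Priority 2), so the exact bucketing rule is not stated here. The sketch below assumes half-open intervals with an inclusive lower bound, which agrees with the sample records in this document (85.0 is Priority 1, 87.0 is Priority 2, 92.5 is Priority 3). `priority_for_score` is a hypothetical helper, not necessarily what `generate_wikidata_review_report.py` does.

```python
# Hypothetical helper. Bucket edges are taken from the priority table;
# the half-open intervals [low, high) are an assumption.
PRIORITY_BUCKETS = [
    (85.0, 87.0, 1),
    (87.0, 90.0, 2),
    (90.0, 93.0, 3),
    (93.0, 96.0, 4),
    (96.0, 100.0, 5),  # 100.0 is excluded: an exact match is not a fuzzy match
]


def priority_for_score(match_score: float) -> int:
    """Map a fuzzy match score (in percent) to a review priority from 1 to 5."""
    for low, high, priority in PRIORITY_BUCKETS:
        if low <= match_score < high:
            return priority
    raise ValueError(f"score {match_score} is outside the fuzzy range 85-100")
```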

Institution Type Breakdown

| Type | Count | % of Fuzzy |
|------|-------|------------|
| LIBRARY | 152 | 82.2% |
| ARCHIVE | 33 | 17.8% |

Observation: Libraries dominate fuzzy matches (likely due to branch naming variations)


Generated Files

1. Review Report (CSV)

File: data/review/denmark_wikidata_fuzzy_matches.csv
Rows: 185 data rows (plus 1 header row)
Columns: 13

Column Reference:

| Column | Description | Action |
|--------|-------------|--------|
| priority | 1-5 (1=most uncertain) | Sort by this to prioritize |
| match_score | 85.0-99.x% | Fuzzy match confidence |
| institution_name | Our dataset name | Compare with wikidata_label |
| wikidata_label | Wikidata entity label | Compare with institution_name |
| city | Institution location | Cross-check with Wikidata |
| institution_type | LIBRARY or ARCHIVE | Verify on Wikidata (P31) |
| isil_code | ISIL identifier (if any) | Strong validation signal |
| ghcid | Our persistent ID | Reference only |
| wikidata_qid | Q-number (e.g. Q12345) | Link target |
| wikidata_url | Direct Wikidata link | CLICK TO VERIFY |
| validation_status | FILL IN: CORRECT \| INCORRECT \| UNCERTAIN | Your decision |
| validation_notes | FILL IN: Explanation | Document reasoning |
| institution_id | W3ID URI | For script processing |
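
Spreadsheet tools sometimes drop or rename columns on save. A minimal sketch for confirming that all 13 columns survived a round trip follows. `missing_columns` is a hypothetical helper, and it checks only that each column is present, since the column order in the actual file is not confirmed here.

```python
import csv
import io

# Column names copied from the reference table above.
EXPECTED_COLUMNS = [
    "priority", "match_score", "institution_name", "wikidata_label",
    "city", "institution_type", "isil_code", "ghcid", "wikidata_qid",
    "wikidata_url", "validation_status", "validation_notes", "institution_id",
]


def missing_columns(csv_text: str) -> list[str]:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

To check the real file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text in. An empty list means the structure is intact.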

2. Validation Checklist

File: docs/WIKIDATA_VALIDATION_CHECKLIST.md
Purpose: Step-by-step guide for manual reviewers
Contents:

  • Validation workflow (4 steps per row)
  • 5 common validation scenarios
  • Quality assurance checklist
  • Research sources (Danish registries)
  • Batch validation tips
  • Example validation session

3. Processing Scripts

Generate Report Script

File: scripts/generate_wikidata_review_report.py
Purpose: Extract fuzzy matches from enriched dataset
Status: Already executed
Output: CSV report

Apply Validation Script

File: scripts/apply_wikidata_validation.py
Purpose: Update dataset based on manual review
Status: Ready to run after manual review
Input: CSV with filled validation_status column
Output: denmark_complete_validated.json


Quick Start Guide

For Reviewers (Immediate Action)

  1. Open CSV in spreadsheet software:

    # Option A: Excel
    open data/review/denmark_wikidata_fuzzy_matches.csv
    
    # Option B: Google Sheets
    # Upload data/review/denmark_wikidata_fuzzy_matches.csv
    
    # Option C: LibreOffice Calc
    libreoffice --calc data/review/denmark_wikidata_fuzzy_matches.csv
    
  2. Sort by the priority column, ascending (1 = most uncertain)

  3. For each row:

    • Compare institution_name vs wikidata_label
    • Click wikidata_url to verify match
    • Check city, institution_type, isil_code
    • Fill validation_status: CORRECT | INCORRECT | UNCERTAIN
    • Add validation_notes (recommended)
  4. Save CSV (preserve column structure)

  5. Run update script:

    python scripts/apply_wikidata_validation.py
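
Before step 5, it can help to list rows whose validation_status is still blank or misspelled. The sketch below is a hypothetical pre-flight check. It is not part of `apply_wikidata_validation.py`, and treating the status as case-insensitive is an assumption.

```python
import csv
import io

ALLOWED_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def rows_needing_attention(csv_text: str) -> list[tuple[int, str, str]]:
    """List (data_row_number, institution_name, status) for rows that are
    unreviewed (blank status) or carry a value outside the allowed set."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        status = (row.get("validation_status") or "").strip().upper()
        if status not in ALLOWED_STATUSES:
            problems.append((row_number, row.get("institution_name") or "", status))
    return problems
```

An empty result means every row carries one of the three allowed decisions.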
    

For Project Managers (Progress Tracking)

Estimated Time:

  • Priority 1-2 (120 matches): ~4-6 hours (2-3 minutes per match)
  • Priority 3-5 (65 matches): ~1-2 hours (1-2 minutes per match)
  • Total: ~5-8 hours

Milestones:

  • Priority 1 complete (58 matches)
  • Priority 2 complete (62 matches)
  • Priority 3 complete (44 matches)
  • Priority 4-5 complete (21 matches)
  • Validation applied to dataset
  • RDF re-exported with corrections

Sample Review Records

Example 1: Priority 1 - Likely Incorrect

priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
1,85.0,"Campus Vejle, Biblioteket",Vejle Bibliotek,Vejle,LIBRARY,DK-861510,INCORRECT,"Wikidata is main library, ours is campus branch"

Analysis:

  • Name similar but ours has ", Biblioteket" suffix
  • Likely branch vs main library mismatch
  • Needs verification on Wikidata (check P361 "part of")

Example 2: Priority 2 - Needs Research

priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
2,87.0,Fur Lokalhistoriske Arkiv,Randers Lokalhistoriske Arkiv,Skive,ARCHIVE,,INCORRECT,"City mismatch: Fur vs Randers, different local archives"

Analysis:

  • City mismatch (Skive vs Randers)
  • Both are local historical archives
  • Likely wrong match due to similar names
  • No ISIL to cross-check

Example 3: Priority 3 - Likely Correct

priority,match_score,institution_name,wikidata_label,city,institution_type,isil_code,validation_status,validation_notes
3,92.5,Aalborg Universitetsbibliotek,Aalborg University Library,Aalborg,LIBRARY,DK-820010,CORRECT,"ISIL match, Danish/English variant, same entity"

Analysis:

  • ISIL code match (DK-820010) = high confidence
  • Danish vs English name
  • City match
  • Type match
  • → Almost certainly CORRECT

Validation Expectations

Predicted Outcomes (Based on Match Scores)

| Status | Expected % | Expected Count | Description |
|--------|------------|----------------|-------------|
| CORRECT | 85-90% | 157-167 | Keep Wikidata link, update provenance |
| INCORRECT | 5-10% | 9-19 | Remove Wikidata link, document reason |
| UNCERTAIN | 5% | 9 | Flag for expert review, keep tentatively |
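
The three actions in the Description column can be sketched as a single function. This only illustrates the rules in the table. The record fields `wikidata_qid` and `wikidata_review` are hypothetical names, and the real `apply_wikidata_validation.py` may structure provenance differently.

```python
from copy import deepcopy


def apply_decision(record: dict, status: str, notes: str = "") -> dict:
    """Return a copy of `record` updated according to the reviewer's decision."""
    updated = deepcopy(record)
    review = {"status": status, "notes": notes}  # hypothetical provenance block
    if status == "CORRECT":
        review["method"] = "manual_review"       # keep link, update provenance
    elif status == "INCORRECT":
        updated.pop("wikidata_qid", None)        # remove link, keep the reason
    elif status == "UNCERTAIN":
        review["needs_expert_review"] = True     # keep link tentatively, flag it
    else:
        raise ValueError(f"unknown validation_status: {status!r}")
    updated["wikidata_review"] = review
    return updated
```

Working on a copy leaves the input record untouched, which makes it easy to diff the dataset before and after validation.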

Quality Thresholds

Acceptable Quality:

  • ≥80% CORRECT
  • ≤15% INCORRECT
  • ≤10% UNCERTAIN

High Quality:

  • ≥90% CORRECT
  • ≤5% INCORRECT
  • ≤5% UNCERTAIN

Red Flags (indicate algorithm issues):

  • <70% CORRECT
  • >20% INCORRECT
  • >15% UNCERTAIN
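
Once the review counts are in, the thresholds can be checked mechanically. The sketch below assumes the red-flag lines mean "more than 20% INCORRECT" and "more than 15% UNCERTAIN", and that a result meeting none of the three tiers should be reported separately. `assess_quality` and its return labels are hypothetical.

```python
def assess_quality(correct: int, incorrect: int, uncertain: int) -> str:
    """Classify a finished review against the quality thresholds."""
    total = correct + incorrect + uncertain
    if total == 0:
        raise ValueError("no reviewed matches to assess")
    pct_correct = 100 * correct / total
    pct_incorrect = 100 * incorrect / total
    pct_uncertain = 100 * uncertain / total

    # Red flags are checked first: any one of them signals an algorithm issue.
    if pct_correct < 70 or pct_incorrect > 20 or pct_uncertain > 15:
        return "red_flag"
    if pct_correct >= 90 and pct_incorrect <= 5 and pct_uncertain <= 5:
        return "high"
    if pct_correct >= 80 and pct_incorrect <= 15 and pct_uncertain <= 10:
        return "acceptable"
    return "below_acceptable"
```

For example, 157 CORRECT, 19 INCORRECT and 9 UNCERTAIN (the low end of the predicted outcomes) gives 84.9% CORRECT and 10.3% INCORRECT, which falls in the acceptable tier.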


Known Issues to Watch For

Issue 1: Branch vs Main Library

Pattern: Institution name ends with ", Biblioteket" (the library)

Example:

  • Ours: "Campus Vejle, Biblioteket"
  • Wikidata: "Vejle Bibliotek"

Likely Outcome: INCORRECT (branch matched to main library)

Fix: Check Wikidata for "part of" (P361) relationship


Issue 2: Gymnasium Libraries

Pattern: Institution name follows the form "[School Name] Gymnasium, Biblioteket"

Example:

  • Ours: "Fredericia Gymnasium, Biblioteket"
  • Wikidata: "Fredericia Bibliotek"

Likely Outcome: INCORRECT (school library matched to public library)

Fix: Verify institution type on Wikidata (P31)


Issue 3: Location Mismatch

Pattern: City name differs between dataset and Wikidata

Example:

  • Ours: "Fur Lokalhistoriske Arkiv" (Skive)
  • Wikidata: "Randers Lokalhistoriske Arkiv" (Randers)

Likely Outcome: INCORRECT (similar names, different cities)

Fix: Google "[institution name] Denmark" to confirm location


Issue 4: Historical Name Changes

Pattern: Institution renamed, Wikidata has old or new name

Example:

  • Ours: "Statsbiblioteket" (historical)
  • Wikidata: "Royal Danish Library" (current, post-merger)

Likely Outcome: UNCERTAIN (need to check merger date)

Fix: Check Wikidata history, look for "replaced by" (P1366) or "end time" (P582)


Issue 5: Multilingual Variants

Pattern: Danish name vs English Wikidata label

Example:

  • Ours: "Rigsarkivet"
  • Wikidata: "Danish National Archives"

Likely Outcome: CORRECT (same entity, different language)

Fix: Check Wikidata for Danish label/alias
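
Several of the fixes above come down to reading one property (P361, P31, P1366) from a Wikidata item. A minimal sketch using Wikidata's `Special:EntityData` JSON export follows. The function names and the User-Agent string are placeholders, and redirected Q-numbers are not handled.

```python
import json
from urllib.request import Request, urlopen

ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid: str) -> dict:
    """Download the JSON description of one Wikidata item (network call)."""
    request = Request(
        ENTITY_DATA_URL.format(qid=qid),
        headers={"User-Agent": "glam-review-sketch/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)["entities"][qid]


def claim_targets(entity: dict, prop: str) -> list[str]:
    """Return the Q-numbers that `prop` points to, e.g. P361 ("part of")."""
    targets = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets
```

A non-empty `claim_targets(entity, "P361")` suggests the item is a branch of something larger. A non-empty result for P1366 suggests the item has been superseded.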


Danish Language Resources

Useful Terms

| Danish | English | Context |
|--------|---------|---------|
| Bibliotek | Library | General library |
| Bibliotekerne | The libraries | Library system |
| Hovedbiblioteket | Main library | Central/flagship branch |
| Kombi-bibliotek | Combined library | Library + community center |
| Lokalhistoriske Arkiv | Local history archive | Municipal archive |
| Rigsarkivet | National Archives | Denmark's national archive |
| Statsbiblioteket | State Library | Historical name (merged) |
| Universitetsbibliotek | University library | Academic library |
| Centralbibliotek | Central library | Main branch |
| Filial | Branch | Library branch |
| Gymnasium | High school | Upper secondary school |

Danish ISIL Prefixes

  • DK-8xxxxx: Libraries (6-digit codes)
  • DK-01x: National institutions (Rigsarkivet = DK-011)
  • DK-xxx: Archives and special collections
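
The prefix patterns above can serve as a quick sanity check against the institution_type column. The sketch below encodes only the three patterns listed here. It is a heuristic, not an official ISIL rule, and `classify_isil` and its labels are hypothetical.

```python
import re


def classify_isil(code: str) -> str:
    """Guess the kind of institution from the shape of a Danish ISIL code."""
    match = re.fullmatch(r"DK-(\d+)", code.strip())
    if not match:
        return "unknown"
    digits = match.group(1)
    if len(digits) == 6 and digits.startswith("8"):
        return "library"             # DK-8xxxxx
    if len(digits) == 3 and digits.startswith("01"):
        return "national"            # DK-01x, e.g. Rigsarkivet = DK-011
    if len(digits) == 3:
        return "archive_or_special"  # DK-xxx
    return "unknown"
```

A row whose institution_type is ARCHIVE but whose code classifies as "library" deserves a closer look before it is marked CORRECT.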

Research Resources

Primary Validation Sources

  1. Danish ISIL Registry: https://isil.dk

    • Authoritative source for ISIL codes
    • Search by institution name or code
    • Official library/archive registry
  2. Wikidata Query Service: https://query.wikidata.org

    • SPARQL endpoint for bulk queries
    • Check P791 (ISIL), P17 (country), P31 (type)
  3. Danish Library Portal: https://bibliotek.dk

    • Public library directory
    • Search by city or name
  4. Danish National Archives: https://www.sa.dk

    • Archive directory
    • Member institution list
  5. Danish Agency for Culture and Palaces: https://slks.dk

    • Government heritage institutions
    • Official museum/gallery registers
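
When a row has an ISIL code, the Wikidata Query Service can show which item actually carries that code in P791. A minimal sketch follows. The function names and User-Agent string are placeholders, and the character check is a conservative assumption to keep the code safe inside the query string.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_isil_query(isil_code: str) -> str:
    """Build a SPARQL query for items whose ISIL (P791) equals `isil_code`."""
    if not re.fullmatch(r"[A-Za-z0-9:/-]+", isil_code):
        raise ValueError(f"unexpected characters in ISIL code: {isil_code!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:P791 "{isil_code}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }\n'
        "}"
    )


def lookup_isil(isil_code: str) -> list[str]:
    """Return the Q-numbers of items carrying this ISIL code (network call)."""
    params = urlencode({"query": build_isil_query(isil_code), "format": "json"})
    request = Request(
        f"{SPARQL_ENDPOINT}?{params}",
        headers={"User-Agent": "glam-review-sketch/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=30) as response:
        bindings = json.load(response)["results"]["bindings"]
    # Item URIs look like http://www.wikidata.org/entity/Q12345
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]
```

If the Q-number returned for a row's isil_code differs from its wikidata_qid, the fuzzy match probably points at the wrong item.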

Post-Validation Workflow

After Manual Review Completed

# Step 1: Apply validation results
python scripts/apply_wikidata_validation.py

# Expected output:
# - data/instances/denmark_complete_validated.json
# - Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN

# Step 2: Re-export RDF with corrections
python scripts/export_denmark_rdf.py \
  --input data/instances/denmark_complete_validated.json \
  --output data/rdf/denmark_validated

# Expected output:
# - denmark_validated.ttl
# - denmark_validated.rdf
# - denmark_validated.jsonld
# - denmark_validated.nt

# Step 3: Update documentation
# - PROGRESS.md: Add validation statistics
# - SESSION_SUMMARY: Document findings
# - data/rdf/README.md: Note validated version

# Step 4: Commit changes
git add data/review/denmark_wikidata_fuzzy_matches.csv
git add data/instances/denmark_complete_validated.json
git add data/rdf/denmark_validated.*
git commit -m "feat: Apply manual Wikidata validation to Danish dataset (185 fuzzy matches reviewed)"

Validation Metrics Tracking

Template for Progress Updates

## Wikidata Validation Progress

**Date**: YYYY-MM-DD  
**Reviewer**: [Name]

### Review Status

- [x] Report generated (185 matches)
- [ ] Priority 1 reviewed (58 matches)
- [ ] Priority 2 reviewed (62 matches)
- [ ] Priority 3 reviewed (44 matches)
- [ ] Priority 4-5 reviewed (21 matches)
- [ ] Validation applied to dataset
- [ ] RDF re-exported

### Preliminary Results (after X matches reviewed)

| Status | Count | % |
|--------|-------|---|
| CORRECT | X | X% |
| INCORRECT | X | X% |
| UNCERTAIN | X | X% |
| Not Reviewed | X | X% |

### Common Issues Found

1. [Issue description]
2. [Issue description]

### Time Spent

- Priority 1: X hours
- Priority 2: X hours
- Total: X hours

### Next Steps

- [ ] [Action item]
- [ ] [Action item]
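
The Preliminary Results table in the template can be filled from the CSV instead of counting by hand. The sketch below is a hypothetical helper that assumes any blank or unrecognized validation_status counts as "Not Reviewed".

```python
import csv
import io
from collections import Counter

STATUSES = ("CORRECT", "INCORRECT", "UNCERTAIN")


def tally_statuses(csv_text: str) -> dict[str, tuple[int, float]]:
    """Return {label: (count, percent)} for the progress table."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = (row.get("validation_status") or "").strip().upper()
        counts[status if status in STATUSES else "Not Reviewed"] += 1
    total = sum(counts.values())
    return {
        label: (counts[label], round(100 * counts[label] / total, 1) if total else 0.0)
        for label in (*STATUSES, "Not Reviewed")
    }
```

Each entry maps directly to one row of the table: the first number is the Count column and the second is the % column.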

FAQ

Q: Can I skip Priority 4-5 matches?

A: Yes, if time-constrained. Priority 1-2 (64.9% of fuzzy matches) capture most uncertainty. Priority 4-5 have 93-99% confidence and are likely correct.

Q: What if I can't determine CORRECT vs INCORRECT?

A: Mark as UNCERTAIN and add detailed notes. Flag for expert review (Danish language expertise or local knowledge).

Q: How do I handle merged institutions?

A: Check Wikidata for the "replaced by" (P1366) property. If our record describes the post-merger institution and the Wikidata item is the pre-merger entity → INCORRECT. Document the merger date in notes.

Q: Should I edit Wikidata during review?

A: Optional but helpful. If you find missing Danish labels or incorrect data on Wikidata, you can edit (requires Wikidata account). Document edits in validation_notes.

Q: What if ISIL codes don't match?

A: ISIL mismatch = almost always INCORRECT. ISIL is the authoritative identifier. Exception: Wikidata may have an outdated ISIL after code reassignment.

Q: How do I validate branch libraries?

A: Check Wikidata for the "part of" (P361) property. If the Wikidata entity is the parent system, the match may still be CORRECT (acceptable abstraction level). If our branch was matched to a different branch → INCORRECT.


Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-19 | Initial report generation (185 fuzzy matches) |

Contact

Questions? Open an issue on GitHub or contact project maintainer.

Found a bug in scripts? Report at: [GitHub Issues]

Need Danish language help? [Contact Danish institutional partners]


Status: 🟡 Awaiting Manual Review
Next Milestone: Priority 1-2 review completion (120 matches)
Estimated Completion: [Add date after work begins]