Wikidata Fuzzy Match Validation Checklist

Overview

This checklist helps reviewers manually validate fuzzy Wikidata matches (85-99% confidence) to ensure data quality before RDF publication.

Dataset: Danish GLAM institutions
Fuzzy Matches: 185 institutions with match scores 85-99%
Review File: data/review/denmark_wikidata_fuzzy_matches.csv


Quick Start

1. Generate Review Report

python scripts/generate_wikidata_review_report.py

Output: data/review/denmark_wikidata_fuzzy_matches.csv

2. Open CSV in Spreadsheet Software

  • Excel: Use Data → From Text/CSV and select 65001: Unicode (UTF-8) as the file origin; opening the file directly can garble æ, ø, and å
  • Google Sheets: File → Import → Upload CSV
  • LibreOffice Calc: Open, select UTF-8 encoding

3. Review Priority Order

Start with Priority 1 (most uncertain) matches and work downward:

Priority  Score Range  Count   Description
1         85-87%       Varies  Very uncertain - review first
2         87-90%       Varies  Uncertain - needs verification
3         90-93%       Varies  Moderate confidence
4         93-96%       Varies  Fairly confident
5         96-99%       Varies  Mostly confident
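
The score ranges above share their boundary values (87, 90, 93, 96), so a score of exactly 87 could belong to Priority 1 or 2. The following is a minimal sketch of one way to bucket scores, assuming each band includes its lower bound and excludes its upper bound. The function name and the boundary rule are assumptions, not taken from the report script, which may bucket differently.

```python
def priority_for_score(score: float) -> int:
    """Map a fuzzy match score (percent) to a review priority from 1 to 5.

    Assumption: each band is half-open, [lower, upper), so a score of
    exactly 87 lands in Priority 2. The real report script may differ.
    """
    if not 85 <= score < 100:
        raise ValueError(f"score {score} is outside the fuzzy range 85-99")
    # Upper bounds of Priority 1 through 4; anything at or above 96 is Priority 5.
    upper_bounds = [87, 90, 93, 96]
    for priority, upper in enumerate(upper_bounds, start=1):
        if score < upper:
            return priority
    return 5
```

If the priority column in the CSV disagrees with this rule for boundary scores, trust the CSV.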

Validation Process

For Each Row:

Step 1: Compare Names

Check: Do institution_name and wikidata_label refer to the same institution?

Examples:

CORRECT Match:

  • Institution: "Nationalmuseet"
  • Wikidata: "National Museum of Denmark"
  • Verdict: Different languages, same entity → CORRECT

CORRECT Match:

  • Institution: "Rigsarkivet"
  • Wikidata: "Danish National Archives"
  • Verdict: Official name + English translation → CORRECT

INCORRECT Match:

  • Institution: "Bibliotek Svendborg"
  • Wikidata: "Svendborg Museum"
  • Verdict: Library vs Museum → INCORRECT

UNCERTAIN Match:

  • Institution: "Herning Bibliotekerne"
  • Wikidata: "Herning Libraries"
  • Verdict: Plural form may denote a whole library system rather than a single library → UNCERTAIN (needs further research)

Step 2: Verify on Wikidata

Click the wikidata_url link to open the Wikidata entity page.

Check:

  1. Instance of (P31): Does the Wikidata type match institution_type?

    • LIBRARY → Check for Q7075 (library) or subclass
    • ARCHIVE → Check for Q166118 (archive) or subclass
    • MUSEUM → Check for Q33506 (museum) or subclass
  2. Country (P17): Should be Q35 (Denmark)

  3. ISIL code (P791): If present, compare with isil_code column

    • Match = high confidence CORRECT
    • Mismatch = likely INCORRECT
  4. Located in the administrative territorial entity (P131): Should match the city column or the municipality that contains it

  5. Coordinates (P625): Check on Wikidata map if location makes sense

  6. Official website (P856): Visit to confirm institution identity
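
Some of the checks above can be pre-computed before opening the page. The sketch below is one possible approach, assuming the entity JSON served by Wikidata's Special:EntityData endpoint. The function names, the User-Agent string, and the result keys are illustrative only. It compares direct P31 values, so a False type result can still be a valid subclass and needs a manual look.

```python
import json
import urllib.request

ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

# QIDs taken from the checklist above.
EXPECTED_TYPE_QIDS = {"LIBRARY": "Q7075", "ARCHIVE": "Q166118", "MUSEUM": "Q33506"}
DENMARK_QID = "Q35"


def fetch_claims(qid: str) -> dict:
    """Download one entity's claims from Wikidata (requires network access)."""
    request = urllib.request.Request(
        ENTITY_DATA_URL.format(qid=qid),
        headers={"User-Agent": "glam-validation-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    # Redirected QIDs come back keyed by the target ID, so take the only entry.
    return next(iter(data["entities"].values()))["claims"]


def claim_values(claims: dict, prop: str) -> list:
    """Return plain values for a property: QIDs for items, raw values otherwise."""
    values = []
    for statement in claims.get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is None:  # "no value" or "unknown value" snaks
            continue
        value = datavalue["value"]
        is_item = isinstance(value, dict) and "id" in value
        values.append(value["id"] if is_item else value)
    return values


def check_entity(claims: dict, institution_type: str, isil_code: str = "") -> dict:
    """Run the direct P31, P17, and P791 checks; None means 'could not compare'."""
    expected = EXPECTED_TYPE_QIDS.get(institution_type)
    wikidata_isils = claim_values(claims, "P791")
    can_compare_isil = bool(isil_code) and bool(wikidata_isils)
    return {
        # False here may still be a subclass of the expected type.
        "type_matches": (expected in claim_values(claims, "P31")) if expected else None,
        "in_denmark": DENMARK_QID in claim_values(claims, "P17"),
        "isil_matches": (isil_code in wikidata_isils) if can_compare_isil else None,
    }
```

A call such as check_entity(fetch_claims(qid), row["institution_type"], row["isil_code"]) would give a reviewer a first impression, but it does not replace reading the entity page.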


Step 3: Research If Uncertain

When to research further:

  • Names are similar but not identical
  • Institution type unclear on Wikidata
  • No ISIL code to cross-reference
  • Multiple institutions with similar names in same city

Research Sources:

  1. Danish ISIL Registry: https://isil.dk

    • Official authoritative source
    • Search by institution name or ISIL code
  2. Institution Website: Search the web for "[institution_name] [city] Denmark"

    • Check "About" page
    • Verify address, type, mission
  3. Danish Library Directory: https://bibliotek.dk

    • For libraries only
    • Search by name or location
  4. Danish National Archives: https://www.sa.dk

    • For archives only
    • Check member directory
  5. Danish Agency for Culture and Palaces: https://slks.dk

    • Government cultural institutions
    • Official registries

Step 4: Record Validation Decision

Fill in the CSV columns:

validation_status: Choose ONE:

  • CORRECT - Confirmed match, keep Wikidata link
  • INCORRECT - Wrong match, remove Wikidata link
  • UNCERTAIN - Needs expert review, flag for follow-up

validation_notes: Add explanation (optional but recommended)

Examples:

validation_status,validation_notes
CORRECT,"ISIL codes match (DK-870970), same location"
INCORRECT,"Wrong institution - Wikidata is museum, we have library"
UNCERTAIN,"Similar names but no ISIL to confirm, needs local knowledge"
CORRECT,"Verified on official website, same address"
INCORRECT,"Wikidata entity is defunct branch, ours is current main library"

Common Validation Scenarios

Scenario 1: Danish vs English Names

Problem: Institution name in Danish, Wikidata label in English

Example:

  • Institution: "Kgl. Bibliotek"
  • Wikidata: "Royal Danish Library"

Solution:

  1. Check the Wikidata entity's Danish (da) label and aliases
  2. If match found → CORRECT
  3. Add note: "Danish/English name variant"

Scenario 2: Historical Name Changes

Problem: Institution renamed, Wikidata has old name

Example:

  • Institution: "Københavns Bibliotek" (current)
  • Wikidata: "Copenhagen Public Library" (outdated label)

Solution:

  1. Check Wikidata history/aliases
  2. If same entity, different era → CORRECT
  3. Add note: "Historical name, now [current name]"
  4. Optionally: Edit Wikidata to add current name as alias

Scenario 3: Merged Institutions

Problem: Two institutions merged, Wikidata shows pre-merger entity

Example:

  • Institution: "Statsbiblioteket" (merged into KB)
  • Wikidata: "State and University Library" (defunct)

Solution:

  1. Check Wikidata for "replaced by" (P1366), "dissolved, abolished or demolished date" (P576), or "end time" (P582)
  2. If defunct and our data is current → INCORRECT
  3. Add note: "Wikidata entity is defunct, merged into [new name] in [year]"
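
The defunct-entity check can be sketched as a small helper over an entity's claims in Wikidata's JSON layout (a dict keyed by property ID). This is an illustration under assumptions: the function names are hypothetical, and the inclusion of P576 (dissolved, abolished or demolished date) alongside P1366 and P582 is a choice made here because organisations commonly carry it. The reviewer still supplies the year and successor name for the note.

```python
from typing import List, Optional


def defunct_signals(claims: dict) -> List[str]:
    """List the property IDs present that suggest the entity no longer exists."""
    # P1366 = replaced by, P576 = dissolved/abolished date, P582 = end time.
    signal_props = ("P1366", "P576", "P582")
    return [prop for prop in signal_props if claims.get(prop)]


def suggest_status(claims: dict, our_record_is_current: bool) -> Optional[str]:
    """Suggest INCORRECT when Wikidata looks defunct but our record is current.

    Returns None when the rule does not apply and manual review is needed.
    """
    if defunct_signals(claims) and our_record_is_current:
        return "INCORRECT"
    return None
```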

Scenario 4: Branch vs Main Library

Problem: Wikidata refers to branch, our data is main library (or vice versa)

Example:

  • Institution: "Aarhus Bibliotek" (main)
  • Wikidata: "Aarhus Public Libraries" (system)

Solution:

  1. Check Wikidata for "part of" (P361) relationship
  2. If Wikidata is parent system → CORRECT (acceptable level)
  3. If Wikidata is specific branch → UNCERTAIN
  4. Add note: "Wikidata is [parent/branch], check hierarchy"

Scenario 5: Missing ISIL Codes

Problem: Institution has no ISIL code, can't cross-reference

Solution:

  1. Rely on city + name + institution type match
  2. If all three match → CORRECT
  3. If any mismatch → UNCERTAIN
  4. Add note: "No ISIL to verify, based on [city/name/type] match"
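
The three-way rule above is simple enough to write down directly. A minimal sketch, assuming the reviewer has already judged each of the three comparisons; the function name and the wording of the mismatch note are illustrative.

```python
from typing import Tuple


def decide_without_isil(name_matches: bool, city_matches: bool,
                        type_matches: bool) -> Tuple[str, str]:
    """Apply the no-ISIL rule: all three checks must pass for CORRECT."""
    checks = {"name": name_matches, "city": city_matches, "type": type_matches}
    failed = [label for label, ok in checks.items() if not ok]
    if not failed:
        return "CORRECT", "No ISIL to verify, based on city/name/type match"
    # Any mismatch downgrades the row to UNCERTAIN and names what failed.
    return "UNCERTAIN", "No ISIL to verify, mismatch on " + "/".join(failed)
```

A clear type mismatch such as library versus museum may still justify INCORRECT rather than UNCERTAIN; this sketch follows Scenario 5 literally.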

Batch Validation Tips

For Large Datasets (100+ matches):

  1. Sort by Priority: Review Priority 1 first (most uncertain)

  2. Filter by Institution Type: Review all libraries, then all archives

    • Easier to maintain context
    • Faster lookup in type-specific registries
  3. Group by City: Review all institutions in same city together

    • Easier to distinguish similar names
    • Local knowledge helps
  4. Use Find & Replace for common notes:

    • "ISIL match confirmed" (for CORRECT + ISIL match)
    • "Name variant, same entity" (for CORRECT + translation)
    • "Type mismatch" (for INCORRECT + wrong type)
  5. Mark Quick Wins First:

    • ISIL codes match → Instant CORRECT
    • Type mismatch (Library vs Museum) → Instant INCORRECT
    • Then tackle ambiguous cases
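
The ordering and quick-win tips can be sketched as two helpers over rows read with csv.DictReader. The columns priority, wikidata_isil, and wikidata_type are hypothetical: the review CSV may not contain them, in which case those values would have to be looked up on Wikidata first. Treat the output as suggestions to confirm, not final decisions.

```python
from typing import List, Optional, Tuple


def review_order(rows: List[dict]) -> List[dict]:
    """Sort rows for review: priority first, then institution type, then city."""
    return sorted(
        rows,
        key=lambda row: (int(row["priority"]), row["institution_type"], row["city"]),
    )


def quick_win(row: dict) -> Optional[Tuple[str, str]]:
    """Return a suggested (status, note), or None if the row needs manual review."""
    ours, theirs = row.get("isil_code", ""), row.get("wikidata_isil", "")
    if ours and theirs and ours == theirs:
        return "CORRECT", "ISIL match confirmed"
    our_type = row.get("institution_type", "")
    their_type = row.get("wikidata_type", "")
    if our_type and their_type and our_type != their_type:
        return "INCORRECT", "Type mismatch"
    return None  # ambiguous: leave for the manual steps above
```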

Quality Assurance Checks

Before Running Update Script:

Completeness: All Priority 1-2 rows have validation_status

Consistency:

  • CORRECT matches should have high confidence reasons (ISIL, website, type match)
  • INCORRECT matches should have clear explanation (type mismatch, wrong entity)
  • UNCERTAIN should be rare (<5% of total)

Documentation: At least 50% of rows have validation_notes

Spelling: validation_status values are exact:

  • Valid: CORRECT, INCORRECT, UNCERTAIN (all caps)
  • Invalid: Correct, correct, OK, YES

No Blanks: Ensure no accidentally deleted data in other columns
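
Most of these checks can be automated before the update script runs. The sketch below is one possible pre-flight check, not part of the project scripts. It assumes the CSV has a priority column holding "1" to "5" (an assumption) in addition to validation_status and validation_notes, and it measures the notes and UNCERTAIN shares against all rows.

```python
import csv
from typing import List

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def load_rows(path: str) -> List[dict]:
    """Read the review CSV; utf-8-sig also accepts files saved with a BOM."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def qa_problems(rows: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for number, row in enumerate(rows, start=2):  # line 1 of the CSV is the header
        status = row.get("validation_status") or ""
        notes = (row.get("validation_notes") or "").strip()
        if status and status not in VALID_STATUSES:
            problems.append(f"row {number}: invalid validation_status {status!r}")
        if not status and row.get("priority") in {"1", "2"}:
            problems.append(f"row {number}: Priority {row['priority']} row not reviewed")
        if status == "INCORRECT" and not notes:
            problems.append(f"row {number}: INCORRECT without validation_notes")
    if rows:
        with_notes = sum(1 for r in rows if (r.get("validation_notes") or "").strip())
        uncertain = sum(1 for r in rows if r.get("validation_status") == "UNCERTAIN")
        if with_notes / len(rows) < 0.5:
            problems.append("fewer than 50% of rows have validation_notes")
        if uncertain / len(rows) >= 0.05:
            problems.append("UNCERTAIN is 5% or more of all rows")
    return problems
```

Running qa_problems(load_rows("data/review/denmark_wikidata_fuzzy_matches.csv")) and fixing every reported line before applying the results keeps misspelled statuses from being silently skipped.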


Apply Validated Results

After Completing Manual Review:

# Apply validation results to dataset
python scripts/apply_wikidata_validation.py

Output: data/instances/denmark_complete_validated.json

Statistics Expected:

  • CORRECT: ~85-90% of reviewed (keep Wikidata link)
  • INCORRECT: ~5-10% of reviewed (remove Wikidata link)
  • UNCERTAIN: <5% of reviewed (flag for expert review)
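
To compare a finished review against these expectations, the status column can be tallied as below. This is a minimal sketch with hypothetical function names. The 20% escalation threshold comes from the escalation guidance later in this checklist, and shares are taken over every status passed in, including any misspelled values.

```python
from collections import Counter
from typing import Dict, List


def status_shares(statuses: List[str]) -> Dict[str, float]:
    """Return the percentage of each valid status among the given values."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {
        status: (100 * counts[status] / total if total else 0.0)
        for status in ("CORRECT", "INCORRECT", "UNCERTAIN")
    }


def needs_escalation(statuses: List[str], threshold: float = 20.0) -> bool:
    """True when the UNCERTAIN share exceeds the escalation threshold."""
    return status_shares(statuses)["UNCERTAIN"] > threshold
```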

Re-export RDF

After Validation Applied:

# Re-export RDF with corrected Wikidata links
python scripts/export_denmark_rdf.py \
  --input data/instances/denmark_complete_validated.json \
  --output data/rdf/denmark_validated

Result: New RDF files with only verified Wikidata links


Validation Metrics

Track Your Progress:

Metric                   Target  Actual
Priority 1 reviewed      100%    ___
Priority 2 reviewed      100%    ___
Priority 3 reviewed      80%     ___
Priority 4-5 reviewed    50%     ___
Incorrect matches found  5-10%   ___
Uncertain matches        <5%     ___
Time spent (hours)       -       ___
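
The "reviewed" rows of this table can be filled in from the CSV rather than counted by hand. A sketch under the assumption that each row has a priority column with values "1" to "5" (hypothetical) and that a non-empty validation_status means the row was reviewed; the targets are copied from the table above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Review targets (percent) from the metrics table.
TARGETS = {"1": 100, "2": 100, "3": 80, "4-5": 50}


def review_progress(rows: List[dict]) -> Dict[str, Tuple[float, int]]:
    """Return {priority group: (percent reviewed, target percent)}."""
    totals, done = Counter(), Counter()
    for row in rows:
        # Priorities 4 and 5 share one target, so they are grouped together.
        group = "4-5" if row["priority"] in {"4", "5"} else row["priority"]
        totals[group] += 1
        if row.get("validation_status"):
            done[group] += 1
    return {
        group: (100 * done[group] / totals[group], target)
        for group, target in TARGETS.items()
        if totals[group]
    }
```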

Common Mistakes to Avoid

DON'T: Mark as CORRECT just because names are similar
DO: Verify at least 2 data points (city + type, or ISIL + name)

DON'T: Mark as INCORRECT without checking Wikidata page
DO: Always visit wikidata_url to verify entity details

DON'T: Leave validation_notes blank for INCORRECT
DO: Explain why match is wrong ("type mismatch", "different location", etc.)

DON'T: Use UNCERTAIN as default when unsure
DO: Research further using Danish registries before marking UNCERTAIN

DON'T: Assume the English Wikipedia article title is the Wikidata label
DO: Check Wikidata directly for the Danish label and aliases


Escalation Process

When to Ask for Help:

  1. >20% marked UNCERTAIN - May indicate systematic issue with matching algorithm

  2. Local Knowledge Required - Danish language expertise needed for name variants

  3. Conflicting Sources - ISIL registry vs Wikidata vs institution website disagree

  4. Legal/Historical Complexity - Mergers, acquisitions, name changes unclear

Contact: [Project maintainer email/Slack/GitHub issue]


Example Validation Session

Sample Review (5 institutions):

Row 1:

  • Institution: "Aalborg Bibliotek"
  • Wikidata: "Aalborg Public Library"
  • Match Score: 86%
  • ISIL: DK-820010 (match found on Wikidata)
  • Decision: CORRECT
  • Notes: "ISIL match confirmed, Danish/English variant"

Row 2:

  • Institution: "Arkiv Svendborg"
  • Wikidata: "Svendborg Museum"
  • Match Score: 87%
  • Type: ARCHIVE vs MUSEUM (mismatch)
  • Decision: INCORRECT
  • Notes: "Type mismatch - archive vs museum"

Row 3:

  • Institution: "Herning Bibliotekerne"
  • Wikidata: "Herning Centralbibliotek"
  • Match Score: 88%
  • City: Herning (match)
  • Decision: UNCERTAIN
  • Notes: "Central library vs library system, check hierarchy"

Row 4:

  • Institution: "Rigsarkivet"
  • Wikidata: "Danish National Archives"
  • Match Score: 92%
  • ISIL: DK-011 (match), Official website verified
  • Decision: CORRECT
  • Notes: "Official name + English translation, website confirmed"

Row 5:

  • Institution: "Bibliotek Brønderslev"
  • Wikidata: "Brønderslev Library"
  • Match Score: 95%
  • City: Brønderslev (match), Type: LIBRARY (match)
  • Decision: CORRECT
  • Notes: "City and type match, high confidence"

Post-Validation Documentation

Update Project Documentation:

  1. PROGRESS.md: Add validation statistics
  2. SESSION_SUMMARY: Document findings and corrections
  3. data/rdf/README.md: Note dataset version with validation applied
  4. Commit Message: "feat: Apply manual Wikidata validation (185 fuzzy matches reviewed)"

Version: 1.0
Last Updated: 2025-11-19
Maintained By: GLAM Data Extraction Project