
Wikidata Fuzzy Match Review Package

This directory contains files for manual validation of fuzzy Wikidata matches in the Danish GLAM dataset.

📋 Package Contents

Review Data

  • denmark_wikidata_fuzzy_matches.csv - Main review file (185 fuzzy matches)
    • ACTION REQUIRED: Fill in validation_status and validation_notes columns
    • Prioritized by match score (Priority 1 = most uncertain)

Documentation

See /docs/ directory:

  • WIKIDATA_VALIDATION_CHECKLIST.md - Step-by-step validation guide
  • WIKIDATA_FUZZY_MATCH_REVIEW_SUMMARY.md - Complete overview and FAQ

Scripts

See /scripts/ directory:

  • generate_wikidata_review_report.py - Already run (generated this CSV)
  • check_validation_progress.py - Check review progress anytime
  • apply_wikidata_validation.py - Apply completed review to dataset

🚀 Quick Start

1. Open CSV for Review

# Option A: Excel (macOS)
open -a "Microsoft Excel" denmark_wikidata_fuzzy_matches.csv

# Option B: Google Sheets
# Upload denmark_wikidata_fuzzy_matches.csv

# Option C: LibreOffice
libreoffice --calc denmark_wikidata_fuzzy_matches.csv

2. Review Process (For Each Row)

  1. Compare names: institution_name vs wikidata_label
  2. Click URL: Visit wikidata_url to verify match
  3. Check metadata: City, institution type, ISIL code
  4. Decide: Fill validation_status with:
    • CORRECT - Keep Wikidata link
    • INCORRECT - Remove Wikidata link
    • UNCERTAIN - Flag for expert review
  5. Document: Add explanation in validation_notes (optional but recommended)

3. Check Progress

python ../scripts/check_validation_progress.py

Output shows:

  • Overall completion percentage
  • Progress by priority level
  • Estimated time remaining
  • Quality indicators
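
The per-priority breakdown can be approximated with a few lines of Python. This is a minimal sketch, not the actual check_validation_progress.py; the `priority` column name and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def progress_by_priority(rows):
    """Return {priority: (reviewed, total)} for an iterable of CSV row dicts."""
    totals = defaultdict(int)
    reviewed = defaultdict(int)
    for row in rows:
        priority = row["priority"]  # assumed column name
        totals[priority] += 1
        # A row counts as reviewed only if the status is one of the exact values.
        if row["validation_status"].strip() in VALID_STATUSES:
            reviewed[priority] += 1
    return {p: (reviewed[p], totals[p]) for p in sorted(totals)}

# Made-up sample rows standing in for the review CSV.
sample = io.StringIO(
    "priority,institution_name,validation_status\n"
    "1,Example Library A,CORRECT\n"
    "1,Example Library B,\n"
    "2,Example Archive C,UNCERTAIN\n"
)
progress = progress_by_priority(csv.DictReader(sample))
for priority, (done, total) in progress.items():
    print(f"Priority {priority}: {done}/{total} ({done / total:.1%})")
```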

4. Apply Validation Results

# After completing review
python ../scripts/apply_wikidata_validation.py

Output:

  • denmark_complete_validated.json (updated dataset)
  • Statistics: X CORRECT, Y INCORRECT, Z UNCERTAIN
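
Conceptually, applying the review means keeping or dropping each Wikidata link according to its status. The sketch below illustrates that idea only; the record keys `name` and `wikidata_url`, the lookup by name, and the placeholder URLs are hypothetical and may not match what apply_wikidata_validation.py actually does.

```python
from collections import Counter

def apply_decisions(institutions, decisions):
    """Drop the Wikidata link from records whose match was marked INCORRECT.

    institutions: list of dicts with hypothetical keys "name" and "wikidata_url".
    decisions: dict mapping institution name to its validation_status.
    """
    counts = Counter()
    updated = []
    for record in institutions:
        record = dict(record)  # copy so the input list is left untouched
        status = decisions.get(record["name"])
        if status is not None:
            counts[status] += 1
        if status == "INCORRECT":
            record.pop("wikidata_url", None)
        updated.append(record)
    return updated, counts

# Made-up records with placeholder URLs.
institutions = [
    {"name": "Example Library A", "wikidata_url": "https://example.org/entity/A"},
    {"name": "Example Library B", "wikidata_url": "https://example.org/entity/B"},
]
decisions = {"Example Library A": "CORRECT", "Example Library B": "INCORRECT"}

updated, counts = apply_decisions(institutions, decisions)
print(f"Statistics: {counts['CORRECT']} CORRECT, "
      f"{counts['INCORRECT']} INCORRECT, {counts['UNCERTAIN']} UNCERTAIN")
```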

📊 Review Statistics

Generated: 2025-11-19
Total Fuzzy Matches: 185
Match Score Range: 85-99%

Priority Breakdown

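| Priority | Score Range | Count | Status      |
|----------|-------------|-------|-------------|
| 1        | 85-87%      | 58    | Not Started |
| 2        | 87-90%      | 62    | Not Started |
| 3        | 90-93%      | 44    | Not Started |
| 4        | 93-96%      | 14    | Not Started |
| 5        | 96-99%      | 7     | Not Started |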
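
The score ranges share their boundary values (87, 90, 93, 96), so a score that falls exactly on a boundary could belong to either band. The sketch below shows one possible mapping, assuming each lower bound is inclusive; the function name is hypothetical and the real report script may bucket boundary scores differently.

```python
def priority_for_score(score):
    """Map a match score (percent) to a review priority, 1 = most uncertain."""
    if not 85 <= score <= 99:
        raise ValueError(f"score {score} is outside the fuzzy-match range 85-99")
    # Assumption: lower bounds are inclusive, so exactly 87 is Priority 2.
    lower_bounds = [(96, 5), (93, 4), (90, 3), (87, 2), (85, 1)]
    for lower, priority in lower_bounds:
        if score >= lower:
            return priority

print(priority_for_score(86.4))  # 1
print(priority_for_score(87))    # 2
```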

Recommended Focus: Priority 1-2 (120 matches = 64.9%)

Institution Types

  • LIBRARY: 152 (82.2%)
  • ARCHIVE: 33 (17.8%)

⏱️ Time Estimates

  • Priority 1-2 (120 matches): 4-6 hours (~2-3 min per match)
  • Priority 3-5 (65 matches): 1-2 hours (~1-2 min per match)
  • Total: 5-8 hours
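
The estimates follow directly from the match counts and the per-match minutes. A small sketch of the arithmetic (the helper name is hypothetical):

```python
def hours_range(matches, min_minutes, max_minutes):
    """Return (low, high) review time in hours for a group of matches."""
    return matches * min_minutes / 60, matches * max_minutes / 60

high_priority = hours_range(120, 2, 3)  # Priority 1-2
low_priority = hours_range(65, 1, 2)    # Priority 3-5
total = (high_priority[0] + low_priority[0], high_priority[1] + low_priority[1])

print(f"Priority 1-2: {high_priority[0]:.1f}-{high_priority[1]:.1f} hours")
print(f"Priority 3-5: {low_priority[0]:.1f}-{low_priority[1]:.1f} hours")
print(f"Total: {total[0]:.1f}-{total[1]:.1f} hours")
```

The total comes out at roughly 5.1 to 8.2 hours, which the list above rounds to 5-8 hours.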

Tips for Faster Review:

  • Sort by priority (start with 1)
  • Filter by institution type (review all libraries, then archives)
  • Use keyboard shortcuts in spreadsheet software
  • Mark obvious matches first (ISIL match = instant CORRECT)

Validation Checklist

Before running apply_wikidata_validation.py:

  • All Priority 1 rows have validation_status filled
  • All Priority 2 rows have validation_status filled
  • At least 50% of rows have validation_notes
  • Status values are exact: CORRECT, INCORRECT, or UNCERTAIN
  • No accidentally deleted data in other columns
  • CSV saved in original location
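
Most of this checklist can be verified mechanically before running the apply script. The following is a rough sketch under stated assumptions: the `priority` column name, the helper name, and the sample rows are illustrative and not part of the project's scripts.

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def preflight_problems(rows, required_priorities=("1", "2"), min_notes_share=0.5):
    """Return a list of human-readable problems found in the review rows."""
    problems = []
    for number, row in enumerate(rows, start=1):
        status = row["validation_status"]
        if status and status not in VALID_STATUSES:
            problems.append(f"row {number}: unexpected status {status!r}")
        # "priority" is an assumed column name.
        if not status and row["priority"] in required_priorities:
            problems.append(f"row {number}: Priority {row['priority']} row has no status")
    with_notes = sum(1 for row in rows if row["validation_notes"].strip())
    if rows and with_notes / len(rows) < min_notes_share:
        problems.append(f"only {with_notes}/{len(rows)} rows have notes")
    return problems

# Made-up rows: one complete, one blank Priority 2 row, one mis-cased status.
rows = [
    {"priority": "1", "validation_status": "CORRECT", "validation_notes": "ISIL matches"},
    {"priority": "2", "validation_status": "", "validation_notes": ""},
    {"priority": "4", "validation_status": "Correct", "validation_notes": ""},
]
problems = preflight_problems(rows)
for problem in problems:
    print(problem)
```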

📚 Key Resources

Danish Registries (For Research)

Wikidata Tools

Key Validation Properties

Check these on Wikidata:

  • P31 (instance of) - Should match institution_type
  • P17 (country) - Should be Q35 (Denmark)
  • P791 (ISIL) - Cross-check with isil_code column
  • P131 (located in) - Should match city
  • P625 (coordinates) - Verify location on map
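
For reviewers who prefer to check these properties programmatically, Wikidata serves each entity as JSON (for example via Special:EntityData/<QID>.json), with statements grouped by property under `claims`. The sketch below reads values from such a structure; the helper names are hypothetical, and the sample entity is a hand-built fragment reusing the ISIL code from the scenarios in this README, not real Wikidata output.

```python
def claim_values(entity, property_id):
    """Return the plain values of all statements for one property."""
    values = []
    for claim in entity.get("claims", {}).get(property_id, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is None:  # "no value" / "unknown value" statements
            continue
        value = datavalue["value"]
        # Item values are dicts carrying an "id" such as "Q35"; strings stay as-is.
        if isinstance(value, dict) and "id" in value:
            values.append(value["id"])
        else:
            values.append(value)
    return values

def check_entity(entity, isil_code):
    """Run two of the checks listed above against an entity dict."""
    return {
        "country_is_denmark": "Q35" in claim_values(entity, "P17"),
        "isil_matches": isil_code in claim_values(entity, "P791"),
    }

# Hand-built fragment in the shape of Wikidata's entity JSON.
sample_entity = {
    "claims": {
        "P17": [{"mainsnak": {"datavalue": {"value": {"id": "Q35"}}}}],
        "P791": [{"mainsnak": {"datavalue": {"value": "DK-820010"}}}],
    }
}
result = check_entity(sample_entity, "DK-820010")
print(result)
```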

🔍 Common Validation Scenarios

Scenario 1: Danish vs English Names

Example: "Rigsarkivet" vs "Danish National Archives"
Decision: CORRECT (same entity, different language)
Check: Wikidata should have Danish label as alias

Scenario 2: Branch vs Main Library

Example: "Campus Vejle, Biblioteket" vs "Vejle Bibliotek"
Decision: INCORRECT (branch matched to main)
Check: Wikidata P361 (part of) property

Scenario 3: Type Mismatch

Example: Our LIBRARY vs Wikidata MUSEUM
Decision: INCORRECT (wrong entity type)
Check: Wikidata P31 (instance of) property

Scenario 4: Historical Merger

Example: "Statsbiblioteket" vs "Royal Danish Library"
Decision: UNCERTAIN (check merger date)
Check: Wikidata P1366 (replaced by) or P582 (end time)

Scenario 5: ISIL Match

Example: Both have DK-820010
Decision: CORRECT (authoritative identifier match)
Check: No further validation needed


📈 Expected Outcomes

Based on match score distribution:

| Status    | Expected % | Expected Count |
|-----------|------------|----------------|
| CORRECT   | 85-90%     | 157-167        |
| INCORRECT | 5-10%      | 9-19           |
| UNCERTAIN | 5%         | 9              |

Quality Thresholds:

  • Acceptable: ≥80% CORRECT and ≤15% INCORRECT
  • 🌟 High Quality: ≥90% CORRECT and ≤5% INCORRECT
  • 🚨 Red Flag: <70% CORRECT or >20% INCORRECT
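
These thresholds can be turned into a simple classifier. The sketch below assumes that the Acceptable and High Quality tiers require both conditions, that either Red Flag condition alone is enough to raise the flag, and that results falling between the tiers are reported as "Unclassified"; none of this is confirmed by the project's scripts.

```python
def quality_label(correct, incorrect, total):
    """Classify review results against the quality thresholds."""
    correct_share = correct / total
    incorrect_share = incorrect / total
    if correct_share >= 0.90 and incorrect_share <= 0.05:
        return "High Quality"
    if correct_share >= 0.80 and incorrect_share <= 0.15:
        return "Acceptable"
    if correct_share < 0.70 or incorrect_share > 0.20:
        return "Red Flag"
    # The thresholds leave a gap (for example 70-80% CORRECT) with no named tier.
    return "Unclassified"

print(quality_label(167, 9, 185))   # High Quality
print(quality_label(157, 19, 185))  # Acceptable
print(quality_label(120, 40, 185))  # Red Flag
```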

🛠️ Troubleshooting

Issue: CSV won't open with proper encoding

Solution: Ensure UTF-8 encoding when opening

# Excel: Data → From Text/CSV → File Origin: 65001: Unicode (UTF-8)
# Google Sheets: File → Import → Character encoding: UTF-8
# LibreOffice: Open → Character set: Unicode (UTF-8)
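
When the CSV is read from Python rather than a spreadsheet, the encoding can be set explicitly. A minimal sketch, assuming the file is UTF-8 and may or may not start with a byte-order mark; the helper name and the sample row are made up for illustration.

```python
import csv
import os
import tempfile

def read_review_csv(path):
    """Read the review CSV as a list of dicts, tolerating a UTF-8 BOM."""
    # "utf-8-sig" reads plain UTF-8 and also strips a byte-order mark if present.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))

# Round-trip a made-up row containing Danish characters.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        handle.write("institution_name,validation_status\nEksempel Bibliotek Ærø,\n")
    rows = read_review_csv(path)

print(rows[0]["institution_name"])
```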

Issue: Progress checker shows 0% but I've reviewed some rows

Solution: Ensure validation_status values are EXACT (all caps):

  • Accepted: CORRECT, INCORRECT, UNCERTAIN
  • Not accepted: Correct, correct, OK
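
If many cells were typed with the wrong case or stray spaces, they can be cleaned up in bulk before re-running the checker. A minimal sketch with a hypothetical helper: it fixes only case and whitespace, and leaves anything else (such as "OK") for manual correction.

```python
VALID_STATUSES = {"CORRECT", "INCORRECT", "UNCERTAIN"}

def normalize_status(raw):
    """Return the canonical status, "" for a blank cell, or None if unrecognized."""
    cleaned = raw.strip().upper()
    if cleaned == "" or cleaned in VALID_STATUSES:
        return cleaned
    return None

for value in ["CORRECT", " correct ", "Uncertain", "OK", ""]:
    print(repr(value), "->", repr(normalize_status(value)))
```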

Issue: Apply script fails with "No validation results"

Solution: Check that validation_status column has values (not blank)

Issue: Can't decide CORRECT vs INCORRECT

Solution: Mark as UNCERTAIN and add detailed notes. Flag for expert review.


📧 Support

Questions? See /docs/WIKIDATA_VALIDATION_CHECKLIST.md for detailed guidance

Found issues? Open a GitHub issue or contact the project maintainer

Need Danish expertise? Contact institutional partners for local knowledge


🎯 Current Status

Progress: 0/185 (0.0%)
Priority 1: 0/58 (0.0%)
Priority 2: 0/62 (0.0%)
Next Step: Begin Priority 1 review

Run python ../scripts/check_validation_progress.py to update status.


Last Updated: 2025-11-19
Dataset: denmark_complete_enriched.json (2,348 institutions)
Wikidata Links: 769 total (584 exact + 185 fuzzy)