NDE Wikidata Enrichment - Quick Resume Guide

Last Updated: 2025-11-17
Status: Test batch complete, ready to scale


Current State

  • 10 records processed from Drenthe province (test batch)
  • 80% success rate (8 matched, 2 no-match)
  • YAML file updated with wikidata_id fields
  • LinkML schema updated
  • All queries logged
  • 📊 1,341 of 1,351 records remaining (99.3% of dataset)

Files Location

/Users/kempersc/apps/glam/
├── data/nde/
│   ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml  # Enriched data
│   ├── linkml/nde_yaml_target.yaml  # Updated schema
│   └── sparql/  # Query logs
├── scripts/
│   ├── update_nde_yaml_with_wikidata_test_batch.py  # Test batch (done)
│   └── enrich_nde_with_wikidata.py  # Full dataset (ready)
└── docs/
    ├── NDE_WIKIDATA_ENRICHMENT_REPORT.md  # Full report
    └── sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md

To Resume Work

Option 1: Scale to Full Dataset

cd /Users/kempersc/apps/glam
python3 scripts/enrich_nde_with_wikidata.py --start-index 10 --batch-size 50

Expected:

  • Processing time: 2-3 hours
  • Success rate: 70-85% (based on test batch)
  • ~950-1,150 organizations will get Wikidata IDs
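
The expected figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the 70-85% range applies to the 1,341 remaining records; the function name `projection` is illustrative and not part of the project scripts.

```python
import math

REMAINING = 1_341                  # records still to process (see Current State)
BATCH_SIZE = 50                    # recommended batch size
RATE_LOW, RATE_HIGH = 0.70, 0.85   # expected success range from the test batch


def projection(remaining, batch_size, low, high):
    """Return (batches, min_matches, max_matches) for a full run."""
    batches = math.ceil(remaining / batch_size)
    return batches, round(remaining * low), round(remaining * high)


batches, low_matches, high_matches = projection(
    REMAINING, BATCH_SIZE, RATE_LOW, RATE_HIGH
)
print(f"{batches} batches, roughly {low_matches}-{high_matches} matches expected")
```

The result lands slightly under the rounded ~950-1,150 figure, which appears to be based on the full 1,351 records rather than the remaining 1,341.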

Option 2: Check Current Status

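# Run from the repository root (/Users/kempersc/apps/glam)
python3 -c "
import yaml
path = 'data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml'
with open(path, 'r', encoding='utf-8') as f:
    orgs = yaml.safe_load(f)  # expected: a list of organization dicts
# Count records that carry a non-empty wikidata_id
enriched = sum(1 for o in orgs if o.get('wikidata_id'))
print(f'Enriched: {enriched}/{len(orgs)} ({enriched/len(orgs)*100:.1f}%)')
"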

Option 3: Validate Enrichment

# Check for duplicates and verify Q-numbers
python3 scripts/validate_wikidata_enrichment.py
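
The contents of validate_wikidata_enrichment.py are not shown here, so the following is only a sketch of what the two checks named in the comment (duplicates and Q-number verification) might look like. The function name `check_wikidata_ids` and the `name` key in the sample data are hypothetical; only `wikidata_id` comes from this guide.

```python
import re
from collections import Counter

# Wikidata item IDs are "Q" followed by a positive integer, e.g. Q1258370.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def check_wikidata_ids(orgs):
    """Return (malformed_ids, duplicate_ids) for a list of organization dicts."""
    ids = [str(o["wikidata_id"]) for o in orgs if "wikidata_id" in o]
    malformed = [q for q in ids if not QID_PATTERN.match(q)]
    duplicates = sorted(q for q, n in Counter(ids).items() if n > 1)
    return malformed, duplicates


# Hypothetical sample records ("name" is an assumed key).
sample = [
    {"name": "Drents Museum", "wikidata_id": "Q1258370"},
    {"name": "Duplicate entry", "wikidata_id": "Q1258370"},
    {"name": "Missing prefix", "wikidata_id": "1258370"},
    {"name": "Not enriched yet"},
]
malformed, duplicates = check_wikidata_ids(sample)
print("Malformed:", malformed)
print("Duplicates:", duplicates)
```

Duplicates are better treated as items to review than as errors: municipal archives are matched to their municipality's Q-number, so an archive and its municipality can legitimately share an ID.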

Key Commands

Search Wikidata Entity

from wikidata_mcp import search_entity
q_number = search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804

Verify Q-Number

from wikidata_mcp import get_metadata
metadata = get_metadata("Q190804", language="nl")
# Returns: {"Label": "Rijksmuseum", "Description": "museum in Amsterdam"}

SPARQL Query (for municipalities)

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .  # Dutch municipality
  ?item rdfs:label "Amsterdam"@nl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
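
To run the query above from Python, one option is the public Wikidata Query Service endpoint with only the standard library. This is a sketch and not necessarily how the project scripts issue queries; the `build_request` and `run_query` helpers and the User-Agent string are assumptions.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

# Doubled braces are literal braces; {name} is filled in by str.format.
QUERY_TEMPLATE = """
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q2039348 .
  ?item rdfs:label "{name}"@nl .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en". }}
}}
"""


def build_request(name, user_agent="nde-enrichment-sketch/0.1 (placeholder)"):
    """Build (but do not send) a GET request for the municipality query."""
    # Escape backslashes and double quotes for use inside a SPARQL string literal.
    safe_name = name.replace("\\", "\\\\").replace('"', '\\"')
    query = QUERY_TEMPLATE.format(name=safe_name)
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(name):
    """Send the request and return the matching item URIs."""
    with urllib.request.urlopen(build_request(name), timeout=30) as resp:
        data = json.load(resp)
    return [b["item"]["value"] for b in data["results"]["bindings"]]
```

Calling `run_query("Amsterdam")` sends one request to the endpoint, so it counts against the rate limit like any other lookup.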

Enriched Records (Test Batch)

| Organization | Wikidata ID | URL |
|---|---|---|
| Herinneringscentrum Kamp Westerbork | Q22246632 | https://www.wikidata.org/wiki/Q22246632 |
| Hunebedcentrum | Q2679819 | https://www.wikidata.org/wiki/Q2679819 |
| Drents Archief | Q1978308 | https://www.wikidata.org/wiki/Q1978308 |
| Drents Museum | Q1258370 | https://www.wikidata.org/wiki/Q1258370 |
| Gemeente Aa en Hunze | Q300665 | https://www.wikidata.org/wiki/Q300665 |
| Gemeente Borger-Odoorn | Q835118 | https://www.wikidata.org/wiki/Q835118 |
| Gemeente Coevorden | Q60453 | https://www.wikidata.org/wiki/Q60453 |
| Gemeente De Wolden | Q835108 | https://www.wikidata.org/wiki/Q835108 |

Success Patterns

Based on the 10-record test batch only, so each percentage rests on a handful of records:

  • Museums with national/international recognition: 100% match
  • Municipal archives: 100% match (use municipality Q-number)
  • Regional archives: 100% match
  • ⚠️ Branch locations: Low coverage (often not in Wikidata)
  • ⚠️ Inter-municipal partnerships: Low coverage
  • ⚠️ Small local societies: Varies (50-70% expected)


Next Milestones

  1. First 100 records enriched → Review success rate, adjust strategy
  2. First 500 records enriched → Identify patterns in no-matches
  3. First 1,000 records enriched → Statistical analysis
  4. All 1,351 records processed → Final report and integration

Important Notes

  • Rate limit: 1,000 requests/hour (Wikidata API)
  • Batch size: 50 records per batch (recommended)
  • Verification: Always verify Q-numbers with get_metadata
  • Backup: Created before enrichment (.backup.*.yaml)
  • Logging: All queries logged in /data/nde/sparql/
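
The rate limit above translates into a minimum spacing of 3.6 seconds between requests. A minimal throttling sketch follows. The `Throttle` class is illustrative and not taken from the project scripts, and the injectable `clock` and `sleep` parameters exist only to make it testable without waiting.

```python
import time


class Throttle:
    """Keep consecutive calls at least 3600 / max_per_hour seconds apart."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # 3.6 s at 1,000 requests/hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call throttle.wait() immediately before each Wikidata request.
throttle = Throttle(max_per_hour=1000)
```

At this spacing, 1,341 records take about 80 minutes with one request per record and about 2.7 hours with two. The 2-3 hour estimate is therefore consistent with roughly two requests per record (for example a search plus a get_metadata verification), although that breakdown is an assumption.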

Questions to Address

  1. Should we create Wikidata entries for missing institutions?
  2. How to handle branch locations (link to parent or omit)?
  3. What should the manual review threshold be (currently 85% confidence)?
  4. When to integrate with main GLAM project schema?
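
To make the threshold question concrete, here is a sketch of how an 85% cut-off could route matches to manual review. This guide does not say how confidence is actually computed, so the label similarity from `difflib.SequenceMatcher` is a stand-in and the names `label_similarity` and `triage` are hypothetical.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.85  # the 85% figure from question 3


def label_similarity(org_name, wikidata_label):
    """Case-insensitive similarity in [0, 1]; a stand-in for the real score."""
    return SequenceMatcher(
        None, org_name.casefold(), wikidata_label.casefold()
    ).ratio()


def triage(org_name, wikidata_label, threshold=REVIEW_THRESHOLD):
    """Return 'accept' at or above the threshold, otherwise 'manual_review'."""
    score = label_similarity(org_name, wikidata_label)
    return "accept" if score >= threshold else "manual_review"


print(triage("Drents Museum", "Drents Museum"))
print(triage("Rijksmuseum Amsterdam", "Rijksmuseum"))
```

With this stand-in measure, an exact label match is accepted, while "Rijksmuseum Amsterdam" against the label "Rijksmuseum" falls below 85% and goes to review even though the match is correct. A pure string-similarity score may therefore need a lower threshold or extra signals such as location.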

Documentation

📄 Full Report: /docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md
📄 Dataset Guide: /data/nde/README.md
📄 Session Summary: /docs/sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md
📄 Agent Instructions: /AGENTS.md (Wikidata enrichment section)


Ready to Scale: YES
Confidence: HIGH
Expected Success: 70-85%