# NDE Wikidata Enrichment - Quick Resume Guide

**Last Updated**: 2025-11-17
**Status**: Test batch complete, ready to scale

---

## Current State

- ✅ **10 records** enriched from Drenthe province
- ✅ **80% success rate** (8 matched, 2 no-match)
- ✅ YAML file updated with `wikidata_id` fields
- ✅ LinkML schema updated
- ✅ All queries logged
- 📊 **1,341 records remaining** (99.3% of dataset)

---

## Files Location

```
/Users/kempersc/apps/glam/
├── data/nde/
│   ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml  # Enriched data
│   ├── linkml/nde_yaml_target.yaml                                          # Updated schema
│   └── sparql/                                                              # Query logs
├── scripts/
│   ├── update_nde_yaml_with_wikidata_test_batch.py  # Test batch (done)
│   └── enrich_nde_with_wikidata.py                  # Full dataset (ready)
└── docs/
    ├── NDE_WIKIDATA_ENRICHMENT_REPORT.md            # Full report
    └── sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md
```

---

## To Resume Work

### Option 1: Scale to Full Dataset

```bash
cd /Users/kempersc/apps/glam
python3 scripts/enrich_nde_with_wikidata.py --start-index 10 --batch-size 50
```

**Expected**:
- Processing time: 2-3 hours
- Success rate: 70-85% (based on test batch)
- ~950-1,150 organizations will get Wikidata IDs

### Option 2: Check Current Status

```bash
python3 -c "
import yaml
with open('data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r') as f:
    orgs = yaml.safe_load(f)
enriched = len([o for o in orgs if 'wikidata_id' in o])
print(f'Enriched: {enriched}/{len(orgs)} ({enriched/len(orgs)*100:.1f}%)')
"
```

### Option 3: Validate Enrichment

```bash
# Check for duplicates and verify Q-numbers
python3 scripts/validate_wikidata_enrichment.py
```

---

## Key Commands

### Search Wikidata Entity

```python
from wikidata_mcp import search_entity

q_number = search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804
```

### Verify Q-Number

```python
from wikidata_mcp import get_metadata

metadata = get_metadata("Q190804", language="nl")
# Returns: {"Label": "Rijksmuseum", "Description": "museum in Amsterdam"}
```

### SPARQL Query (for municipalities)

```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .  # Dutch municipality
  ?item rdfs:label "Amsterdam"@nl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
```

---

## Enriched Records (Test Batch)

| Organization | Wikidata ID | URL |
|--------------|-------------|-----|
| Herinneringscentrum Kamp Westerbork | Q22246632 | https://www.wikidata.org/wiki/Q22246632 |
| Hunebedcentrum | Q2679819 | https://www.wikidata.org/wiki/Q2679819 |
| Drents Archief | Q1978308 | https://www.wikidata.org/wiki/Q1978308 |
| Drents Museum | Q1258370 | https://www.wikidata.org/wiki/Q1258370 |
| Gemeente Aa en Hunze | Q300665 | https://www.wikidata.org/wiki/Q300665 |
| Gemeente Borger-Odoorn | Q835118 | https://www.wikidata.org/wiki/Q835118 |
| Gemeente Coevorden | Q60453 | https://www.wikidata.org/wiki/Q60453 |
| Gemeente De Wolden | Q835108 | https://www.wikidata.org/wiki/Q835108 |

---

## Success Patterns

- ✅ **Museums with national/international recognition**: 100% match
- ✅ **Municipal archives**: 100% match (use municipality Q-number)
- ✅ **Regional archives**: 100% match
- ⚠️ **Branch locations**: Low coverage (often not in Wikidata)
- ⚠️ **Inter-municipal partnerships**: Low coverage
- ⚠️ **Small local societies**: Varies (50-70% expected)

---

## Next Milestones

1. **First 100 records** enriched → Review success rate, adjust strategy
2. **First 500 records** enriched → Identify patterns in no-matches
3. **First 1,000 records** enriched → Statistical analysis
4. **All 1,351 records** processed → Final report and integration

---

## Important Notes

- **Rate limit**: 1,000 requests/hour (Wikidata API)
- **Batch size**: 50 records per batch (recommended)
- **Verification**: Always verify Q-numbers with `get_metadata`
- **Backup**: Created before enrichment (`.backup.*.yaml`)
- **Logging**: All queries logged in `/data/nde/sparql/`

---

## Questions to Address

1. Should we create Wikidata entries for missing institutions?
2. How to handle branch locations (link to parent or omit)?
3. Manual review threshold (currently 85% confidence)?
4. When to integrate with main GLAM project schema?

---

## Documentation

- 📄 **Full Report**: `/docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md`
- 📄 **Dataset Guide**: `/data/nde/README.md`
- 📄 **Session Summary**: `/docs/sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md`
- 📄 **Agent Instructions**: `/AGENTS.md` (Wikidata enrichment section)

---

**Ready to Scale**: YES ✅
**Confidence**: HIGH
**Expected Success**: 70-85%
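
---

## Sketch: Duplicate and Q-Number Checks

The validation step under Option 3 is described only as "check for duplicates and verify Q-numbers", so the following is a minimal sketch of what such a check could look like, not the contents of `validate_wikidata_enrichment.py`. It assumes the YAML loads as a list of dicts carrying an optional `wikidata_id` key (as the Option 2 snippet implies). The function name `validate_wikidata_ids` and the sample records are hypothetical; the Q-numbers in the sample are taken from the test-batch table. It checks only the format of each ID, so confirming that a Q-number points at the right entity would still need `get_metadata`.

```python
import re
from collections import defaultdict

# A Wikidata item ID is "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def validate_wikidata_ids(orgs):
    """Hypothetical helper: return (malformed, duplicates) for a list of record dicts.

    malformed:  list of (record index, value) pairs whose wikidata_id is not a Q-number
    duplicates: dict mapping each shared Q-number to the indexes of the records using it
    """
    malformed = []
    seen = defaultdict(list)
    for index, org in enumerate(orgs):
        qid = org.get("wikidata_id")
        if qid is None:
            continue  # record not enriched yet, nothing to check
        if not isinstance(qid, str) or not QID_PATTERN.match(qid):
            malformed.append((index, qid))
            continue
        seen[qid].append(index)
    # Keep only Q-numbers that occur on more than one record.
    duplicates = {qid: idxs for qid, idxs in seen.items() if len(idxs) > 1}
    return malformed, duplicates


# Illustrative records only (assumed shape, not real dataset rows).
sample = [
    {"wikidata_id": "Q1978308"},
    {"wikidata_id": "Q1258370"},
    {"wikidata_id": "Q1978308"},  # same ID as record 0
    {"wikidata_id": "1258370"},   # missing the "Q" prefix
    {},                           # not enriched
]
malformed, duplicates = validate_wikidata_ids(sample)
print(malformed)   # [(3, '1258370')]
print(duplicates)  # {'Q1978308': [0, 2]}
```

Because municipal archives are matched to their municipality's Q-number, a shared ID is not necessarily an error. Treating duplicates as items for manual review rather than failures is probably the safer default.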