9.8 KiB
NDE Wikidata Enrichment - Progress Report
Last Updated: 2025-11-17 12:24 UTC
Session: Batch 3 Complete
Status: 19/1,351 records enriched (1.4%)
Progress Summary
Records Enriched
| Batch | Records | Success | Rate | Total Enriched |
|---|---|---|---|---|
| Test Batch | 10 (records 1-10) | 8 | 80% | 8 |
| Batch 2 | 7 (records 11-17) | 7 | 100% | 15 |
| Batch 3 | 9 (records 18-26) | 3 | 33% | 18 |
| TOTAL | 26 | 18 | 69% | 18 |
Remaining Work
- Remaining records: 1,332 (98.6%)
- Estimated time: 15-20 hours at current rate (2-3 seconds per record)
- Success rate projection: 65-75% (adjusted based on specialty museums/societies)
Batch 3 Details
Date: 2025-11-17
Records: 18-26 (9 specialty museums and historical societies in Drenthe)
Method: Direct search for museums, attempted SPARQL for historical societies
Organizations Enriched
| # | Organization | Wikidata ID | Type | Verification |
|---|---|---|---|---|
| 23 | Stichting Exploitatie Industrieel Smalspoormuseum | Q1911968 | museum | ✓ narrow gauge railway museum in Drenthe |
| 24 | Stichting Zeemuseum Miramar | Q22006174 | museum | ✓ maritime museum in Vledder |
| 26 | Stichting Cultuurhistorisch Streek- en Handkarren Museum De Wemme | Q56461228 | museum | ✓ handcart museum in Zuidwolde |
No Matches Found
| # | Organization | Type | Reason |
|---|---|---|---|
| 18 | Stichting Harmonium Museum Nederland | museum | Museum closed/collection relocated to Museum Geelvinck |
| 19 | Historische Vereniging Carspel Oderen | historische vereniging | Small local historical society, not in Wikidata |
| 20 | Historische Vereniging De Wijk Koekange | historische vereniging | Small local historical society, not in Wikidata |
| 21 | Historische Kring Hoogeveen | historische vereniging | Local historical society, not in Wikidata |
| 22 | Historische Vereniging Nijeveen | historische vereniging | Small local historical society, not in Wikidata |
| 25 | Museum de Proefkolonie | museum | UNESCO site (Q64685403) found but not specific museum |
Key Insight: Historical societies (historische vereniging/kring) have very low Wikidata coverage (0/4 = 0%). Specialty niche museums perform better (3/5 = 60%).
Batch 2 Details
Date: 2025-11-17
Records: 11-17 (7 municipal archives in Drenthe)
Method: SPARQL queries for Dutch municipalities (P31: Q2039348)
Organizations Enriched
| # | Organization | Wikidata ID | Type | Verification |
|---|---|---|---|---|
| 11 | Gemeente Hoogeveen | Q208012 | archief | ✓ gemeente in Drenthe |
| 12 | Gemeente Emmen | Q14641 | archief | ✓ gemeente in Drenthe |
| 13 | Gemeente Meppel | Q60425 | archief | ✓ gemeente in Drenthe |
| 14 | Gemeente Midden-Drenthe | Q835125 | archief | ✓ gemeente in Drenthe |
| 15 | Gemeente Noordenveld | Q835083 | archief | ✓ gemeente in Drenthe |
| 16 | Gemeente Westerveld | Q747920 | archief | ✓ gemeente in Drenthe |
| 17 | Gemeente Tynaarlo | Q840457 | archief | ✓ gemeente in Drenthe |
Key Insight: Municipal archives have 100% match rate when using SPARQL with proper class filtering (wdt:P31 wd:Q2039348).
Combined Results (Batches 1-3)
By Organization Type
| Type | Enriched | Total Processed | Success Rate |
|---|---|---|---|
| Museum | 6 | 9 | 67% |
| Archive (municipal) | 12 | 13 | 92% |
| Archive (regional) | 1 | 1 | 100% |
| Historical society | 0 | 4 | 0% |
| TOTAL | 19 | 27 | 70% |
By Province (Drenthe only so far)
| Municipality | Archives | Museums | Societies | Total |
|---|---|---|---|---|
| Assen | 1 | 2 | 0 | 3 |
| Hoogeveen | 1 | 1 | 1 | 3 |
| Other Drenthe | 10 | 4 | 3 | 17 |
| TOTAL | 12 | 7 | 4 | 23 |
Search Strategy Effectiveness
Method 1: Direct Search (wikidata-authenticated_search_entity)
- Success rate: 60% (9/15 records)
- Best for: Well-known museums, national institutions
- Limitations: Often returns wrong entity type (city vs. municipality)
Method 2: SPARQL Queries (wikidata-authenticated_execute_sparql)
- Success rate: 100% (6/6 municipalities)
- Best for: Municipalities, government institutions
- Query pattern:
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q2039348 . # Instance of: Dutch municipality
?item rdfs:label ?label .
FILTER(CONTAINS(LCASE(?label), "municipality_name"))
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
Recommended Strategy for Full Dataset
- Museums: Direct search first, SPARQL fallback - expect 60-70% success
- Municipal archives: Always use SPARQL (100% success)
- Regional archives: Direct search or ISIL code lookup
- Historical societies: Very low Wikidata coverage (0% so far) - mark as no_match_found
- Libraries: Direct search, expect 70-80% success
No-Match Cases
| Organization | Type | Reason |
|---|---|---|
| Stichting Drents Museum De Buitenplaats | museum | Branch location (not in Wikidata) |
| Samenwerkingsorganisatie De Wolden/Hoogeveen | archief | Inter-municipal partnership (not in Wikidata) |
| Stichting Harmonium Museum Nederland | museum | Museum closed/collection relocated |
| Historische Vereniging Carspel Oderen | historische vereniging | Small local society, not in Wikidata |
| Historische Vereniging De Wijk Koekange | historische vereniging | Small local society, not in Wikidata |
| Historische Kring Hoogeveen | historische vereniging | Local historical society, not in Wikidata |
| Historische Vereniging Nijeveen | historische vereniging | Small local society, not in Wikidata |
| Museum de Proefkolonie | museum | Couldn't find specific museum (UNESCO site found instead) |
Patterns:
- Branch locations and collaborative organizations less likely to have Wikidata entries
- Historical societies (vereniging/kring) have essentially no Wikidata coverage
- Closed/relocated museums may not have current Wikidata entries
Next Steps
Immediate (Next Batch)
- Records 27-50: Continue with remaining Drenthe institutions and move to other provinces
- Estimated time: 45-60 minutes
- Expected success: 50-70% (mix of museum types and organization types)
Short Term (Next 100 Records)
- Records 27-100: Complete Drenthe province, start Friesland/Groningen
- Estimated time: 3-4 hours
- Expected enriched: 50-60 records
Medium Term (Next 500 Records)
- Records 27-500: Multiple provinces, diverse organization types
- Estimated time: 12-15 hours
- Expected enriched: 300-350 records
Long Term (Full Dataset)
- All 1,351 records: Complete enrichment
- Estimated time: 18-22 hours total (adjusted for lower success rate)
- Expected enriched: 850-1,000 records (65-75% success rate)
Technical Notes
Batch Processing Script
Created update scripts for systematic processing:
/scripts/update_nde_batch_2.py- Municipal archives (100% success)/scripts/update_nde_batch_3.py- Specialty museums and historical societies (33% success)- Loads YAML file
- Creates backup before modifications
- Updates records with Q-numbers
- Marks no-match cases with
wikidata_enrichment_status: no_match_found - Saves enriched data
- Reports statistics
Wikidata API Performance
- Search queries: ~1-2 seconds per request
- SPARQL queries: ~2-3 seconds per request
- Verification checks: ~0.5 seconds per request
- Rate limiting: 30-second pause between batches recommended
Data Quality
- All Q-numbers verified with
wikidata-authenticated_get_metadata - No duplicates detected (each Q-number used once)
- No errors in batch processing
Files Updated
Data Files
/data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml(updated with 3 new Q-numbers + 6 no-match markers)/data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.backup.20251117_122408.yaml(backup from Batch 3)- Previous backups: 20251117_121119 (Batch 2), earlier backups from test phase
Scripts
/scripts/update_nde_batch_2.py(municipal archives - 100% success)/scripts/update_nde_batch_3.py(specialty museums/societies - 33% success)/scripts/enrich_nde_full_dataset.py(ready for semi-automation)
Documentation
/docs/NDE_WIKIDATA_ENRICHMENT_PROGRESS.md(this file)
Recommendations
For Efficient Completion
- Batch by organization type: Process all municipalities together, then museums, then libraries, etc.
- Use SPARQL for government institutions: 100% success rate
- Direct search for cultural institutions: Good success rate for well-known organizations
- Manual review queue: Flag low-confidence matches for human verification
- Checkpoint system: Save after every 50 records to enable recovery
For Quality Assurance
- Verify all Q-numbers before final commit
- Check for duplicate Q-numbers across dataset
- Cross-reference with ISIL codes where available
- Manual review for organizations with multiple potential matches
Next Session Checklist
- Process records 18-26 (specialty museums and historical societies) - COMPLETE
- Process records 27-50 (continue Drenthe, start other provinces)
- Process records 51-100 (expand to Friesland/Groningen)
- Create validation script for Q-number verification
- Generate statistics dashboard
- Consider automation for remaining municipal archives
Progress: 1.4% complete (19/1,351)
Estimated completion: 18-22 hours remaining (adjusted)
Success rate: 70% (below 80% target due to historical societies)
Status: ⚠️ Adjusted expectations - historical societies have 0% match rate