# NDE Wikidata Enrichment - Quick Resume Guide

**Last Updated**: 2025-11-17
**Status**: Test batch complete, ready to scale

---

## Current State

- ✅ **10 records** processed (test batch, Drenthe province)
- ✅ **80% success rate** (8 matched, 2 no-match)
- ✅ YAML file updated with `wikidata_id` fields
- ✅ LinkML schema updated
- ✅ All queries logged
- 📊 **1,341 records remaining** (99.3% of dataset)

---

## Files Location

```
/Users/kempersc/apps/glam/
├── data/nde/
│   ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml  # Enriched data
│   ├── linkml/nde_yaml_target.yaml  # Updated schema
│   └── sparql/  # Query logs
├── scripts/
│   ├── update_nde_yaml_with_wikidata_test_batch.py  # Test batch (done)
│   └── enrich_nde_with_wikidata.py  # Full dataset (ready)
└── docs/
    ├── NDE_WIKIDATA_ENRICHMENT_REPORT.md  # Full report
    └── sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md
```

---

## To Resume Work

### Option 1: Scale to Full Dataset

```bash
cd /Users/kempersc/apps/glam
python3 scripts/enrich_nde_with_wikidata.py --start-index 10 --batch-size 50
```

**Expected**:
- Processing time: 2-3 hours
- Success rate: 70-85% (based on test batch)
- ~950-1,150 organizations will get Wikidata IDs
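
As a rough sanity check on the estimate above, the sketch below applies the 70-85% success range to the records that remain after the test batch and adds the 8 matches already recorded. The function and constant names are illustrative and are not taken from the project scripts.

```python
# Back-of-the-envelope check of the expected yield (figures from this guide).
TOTAL_RECORDS = 1351         # full dataset size
ALREADY_PROCESSED = 10       # test batch
ALREADY_MATCHED = 8          # matches found in the test batch
SUCCESS_RANGE = (0.70, 0.85)


def expected_matches(total, processed, matched, rate_range):
    """Return (low, high) estimates of records that end up with a Wikidata ID."""
    remaining = total - processed
    low, high = (matched + round(remaining * rate) for rate in rate_range)
    return low, high


low, high = expected_matches(
    TOTAL_RECORDS, ALREADY_PROCESSED, ALREADY_MATCHED, SUCCESS_RANGE
)
print(f"Expected organizations with Wikidata IDs: {low}-{high}")
```

This works out to 947-1148, in line with the ~950-1,150 figure quoted above.
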

### Option 2: Check Current Status

```bash
# Run from the repository root so the relative path resolves
python3 -c "
import yaml
with open('data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r') as f:
    orgs = yaml.safe_load(f)
# Count records that have a wikidata_id key
enriched = len([o for o in orgs if 'wikidata_id' in o])
print(f'Enriched: {enriched}/{len(orgs)} ({enriched/len(orgs)*100:.1f}%)')
"
```

### Option 3: Validate Enrichment

```bash
# Check for duplicates and verify Q-numbers
python3 scripts/validate_wikidata_enrichment.py
```
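
The contents of `validate_wikidata_enrichment.py` are not shown in this guide. As a minimal sketch of the two checks named in the comment above (duplicate detection and Q-number format), assuming each record is a dict with an optional `wikidata_id` key, something like the following could work. The sample records marked "Hypothetical" are invented for illustration.

```python
import re
from collections import Counter

# Wikidata item IDs are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def validate(orgs):
    """Return (malformed_ids, duplicate_ids) for a list of organization dicts."""
    ids = [o["wikidata_id"] for o in orgs if o.get("wikidata_id")]
    malformed = [q for q in ids if not QID_PATTERN.match(str(q))]
    # A shared ID is not necessarily an error: this guide maps municipal
    # archives to the municipality's Q-number, so flag these for review only.
    duplicates = sorted(q for q, n in Counter(ids).items() if n > 1)
    return malformed, duplicates


sample = [
    {"name": "Drents Archief", "wikidata_id": "Q1978308"},
    {"name": "Drents Museum", "wikidata_id": "Q1258370"},
    {"name": "Hypothetical duplicate", "wikidata_id": "Q1258370"},
    {"name": "Hypothetical typo", "wikidata_id": "1258370"},
    {"name": "Hypothetical unmatched record"},
]
malformed, duplicates = validate(sample)
print("Malformed:", malformed)
print("Duplicates:", duplicates)
```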

---

## Key Commands

### Search Wikidata Entity
```python
from wikidata_mcp import search_entity
q_number = search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804
```

### Verify Q-Number
```python
from wikidata_mcp import get_metadata
metadata = get_metadata("Q190804", language="nl")
# Returns: {"Label": "Rijksmuseum", "Description": "museum in Amsterdam"}
```
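
The two calls above can be chained into a search-then-verify step. The sketch below is one possible way to do that, not the project's actual matching logic: it takes the search and metadata functions as parameters (so it runs offline with stand-ins), and it uses `difflib.SequenceMatcher` as an assumed similarity measure against the 85% review threshold mentioned later in this guide. The stand-in functions and the status strings are hypothetical.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.85  # assumed to correspond to the guide's 85% confidence


def match_and_verify(name, search, fetch, threshold=REVIEW_THRESHOLD):
    """Search for `name`, then verify the hit by comparing its Dutch label.

    `search` and `fetch` are callables shaped like the search_entity and
    get_metadata calls shown above.
    """
    q_number = search(name)
    if not q_number:
        return {"wikidata_id": None, "status": "no-match"}
    label = fetch(q_number, language="nl").get("Label", "")
    score = SequenceMatcher(None, name.lower(), label.lower()).ratio()
    status = "accepted" if score >= threshold else "needs-review"
    return {
        "wikidata_id": q_number,
        "label": label,
        "score": round(score, 2),
        "status": status,
    }


# Offline stand-ins so the sketch runs without the real client.
def fake_search(name):
    return {"Drents Museum": "Q1258370"}.get(name)


def fake_fetch(q_number, language="nl"):
    return {"Label": "Drents Museum", "Description": "museum in Assen"}


print(match_and_verify("Drents Museum", fake_search, fake_fetch))
print(match_and_verify("Onbekende Vereniging", fake_search, fake_fetch))
```

With this particular similarity measure, the guide's own example would be routed to review: "Rijksmuseum Amsterdam" against the label "Rijksmuseum" scores below 0.85, so a production matcher would likely need something more tolerant of added place names.
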

### SPARQL Query (for municipalities)
```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .  # instance of: municipality of the Netherlands
  ?item rdfs:label "Amsterdam"@nl .  # exact Dutch label match
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
```
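
To run the query above from a script, one option is the public Wikidata Query Service endpoint, which accepts the query as a `query` parameter and returns SPARQL JSON results when `format=json` is set. The sketch below only builds the request URL and parses a result document, so it runs offline. The helper names are illustrative, and the sample response is hand-written from the Coevorden row in the test-batch table rather than captured from a live call.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_TEMPLATE = """
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q2039348 .
  ?item rdfs:label "{label}"@nl .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en". }}
}}
"""


def build_request_url(municipality):
    """Return a GET URL for the municipality lookup query."""
    # Escape backslashes and quotes so the name stays inside the string literal.
    safe = municipality.replace("\\", "\\\\").replace('"', '\\"')
    query = QUERY_TEMPLATE.format(label=safe).strip()
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def extract_q_numbers(response_json):
    """Pull Q-numbers out of a SPARQL JSON results document."""
    bindings = response_json["results"]["bindings"]
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]


# When sending the request, set a descriptive User-Agent header.
print(build_request_url("Coevorden"))

# Hand-written sample in the SPARQL JSON results shape.
sample_response = {
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q60453"}}
        ]
    }
}
print(extract_q_numbers(sample_response))
```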

---

## Enriched Records (Test Batch)

| Organization | Wikidata ID | URL |
|--------------|-------------|-----|
| Herinneringscentrum Kamp Westerbork | Q22246632 | https://www.wikidata.org/wiki/Q22246632 |
| Hunebedcentrum | Q2679819 | https://www.wikidata.org/wiki/Q2679819 |
| Drents Archief | Q1978308 | https://www.wikidata.org/wiki/Q1978308 |
| Drents Museum | Q1258370 | https://www.wikidata.org/wiki/Q1258370 |
| Gemeente Aa en Hunze | Q300665 | https://www.wikidata.org/wiki/Q300665 |
| Gemeente Borger-Odoorn | Q835118 | https://www.wikidata.org/wiki/Q835118 |
| Gemeente Coevorden | Q60453 | https://www.wikidata.org/wiki/Q60453 |
| Gemeente De Wolden | Q835108 | https://www.wikidata.org/wiki/Q835108 |

---

## Success Patterns

- ✅ **Museums with national/international recognition**: 100% match
- ✅ **Municipal archives**: 100% match (use municipality Q-number)
- ✅ **Regional archives**: 100% match
- ⚠️ **Branch locations**: Low coverage (often not in Wikidata)
- ⚠️ **Inter-municipal partnerships**: Low coverage
- ⚠️ **Small local societies**: Varies (50-70% expected)

---

## Next Milestones

1. **First 100 records** enriched → Review success rate, adjust strategy
2. **First 500 records** enriched → Identify patterns in no-matches
3. **First 1,000 records** enriched → Statistical analysis
4. **All 1,351 records** processed → Final report and integration

---

## Important Notes

- **Rate limit**: 1,000 requests/hour (Wikidata API)
- **Batch size**: 50 records per batch (recommended)
- **Verification**: Always verify Q-numbers with `get_metadata`
- **Backup**: Created before enrichment (`.backup.*.yaml`)
- **Logging**: All queries logged in `/data/nde/sparql/`
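
The rate limit and batch size above determine how the full run breaks down. The sketch below works through that arithmetic, assuming that `--start-index` is a zero-based record index and that each record costs two requests (one search plus one verification); neither assumption is confirmed by this guide. The helper names are illustrative.

```python
RATE_LIMIT_PER_HOUR = 1000   # from the notes above
BATCH_SIZE = 50              # recommended batch size
TOTAL_RECORDS = 1351
START_INDEX = 10             # test batch covered records 0-9 (assumed zero-based)

# Minimum spacing between requests that stays under the hourly limit: 3.6 s.
MIN_INTERVAL_S = 3600 / RATE_LIMIT_PER_HOUR


def batch_ranges(start, total, size):
    """Yield (first_index, end_index_exclusive) pairs covering start..total."""
    for first in range(start, total, size):
        yield first, min(first + size, total)


def estimated_hours(records, requests_per_record, interval_s=MIN_INTERVAL_S):
    """Lower bound on wall-clock time when requests are evenly spaced."""
    return records * requests_per_record * interval_s / 3600


batches = list(batch_ranges(START_INDEX, TOTAL_RECORDS, BATCH_SIZE))
remaining = TOTAL_RECORDS - START_INDEX

# 27 batches; the last one, (1310, 1351), is a partial batch of 41 records.
print(f"{len(batches)} batches, first {batches[0]}, last {batches[-1]}")
print(f"Minimum spacing between requests: {MIN_INTERVAL_S:.1f} s")
# About 2.7 h at two requests per record, consistent with the 2-3 hour estimate.
print(f"Estimated duration: {estimated_hours(remaining, 2):.1f} h")
```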

---

## Questions to Address

1. Should we create Wikidata entries for missing institutions?
2. How should we handle branch locations (link to parent or omit)?
3. What should the manual review threshold be (currently 85% confidence)?
4. When should we integrate with the main GLAM project schema?

---

## Documentation

- 📄 **Full Report**: `/docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md`
- 📄 **Dataset Guide**: `/data/nde/README.md`
- 📄 **Session Summary**: `/docs/sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md`
- 📄 **Agent Instructions**: `/AGENTS.md` (Wikidata enrichment section)

---

**Ready to Scale**: YES ✅
**Confidence**: HIGH
**Expected Success**: 70-85%