glam/docs/NDE_WIKIDATA_QUICK_RESUME.md
2025-11-19 23:25:22 +01:00

# NDE Wikidata Enrichment - Quick Resume Guide
**Last Updated**: 2025-11-17

**Status**: Test batch complete, ready to scale

---
## Current State
- ✅ **10 records** enriched from Drenthe province
- ✅ **80% success rate** (8 matched, 2 no-match)
- ✅ YAML file updated with `wikidata_id` fields
- ✅ LinkML schema updated
- ✅ All queries logged
- 📊 **1,341 records remaining** (99.3% of dataset)
---
## Files Location
```
/Users/kempersc/apps/glam/
├── data/nde/
│   ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml  # Enriched data
│   ├── linkml/nde_yaml_target.yaml                                          # Updated schema
│   └── sparql/                                                              # Query logs
├── scripts/
│   ├── update_nde_yaml_with_wikidata_test_batch.py  # Test batch (done)
│   └── enrich_nde_with_wikidata.py                  # Full dataset (ready)
└── docs/
    ├── NDE_WIKIDATA_ENRICHMENT_REPORT.md            # Full report
    └── sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md
```
---
## To Resume Work
### Option 1: Scale to Full Dataset
```bash
cd /Users/kempersc/apps/glam
python3 scripts/enrich_nde_with_wikidata.py --start-index 10 --batch-size 50
```
**Expected**:
- Processing time: 2-3 hours
- Success rate: 70-85% (based on test batch)
- ~950-1,150 organizations will get Wikidata IDs
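
The figures above follow from simple arithmetic on numbers already stated in this guide (1,351 records in total, 1,341 remaining, batch size 50, 70-85% match rate). A minimal sketch of that arithmetic; the function names are hypothetical and are not part of `enrich_nde_with_wikidata.py`:

```python
import math


def batches_needed(remaining: int, batch_size: int) -> int:
    """Number of batches required to cover the remaining records."""
    return math.ceil(remaining / batch_size)


def expected_matches(records: int, low_rate: float, high_rate: float) -> tuple[int, int]:
    """Projected (low, high) match counts for a given success-rate range."""
    return round(records * low_rate), round(records * high_rate)


# Batches left after the 10-record test batch
print(batches_needed(1341, 50))
# Projected matches across all 1,351 records at 70-85%
print(expected_matches(1351, 0.70, 0.85))
```

This gives 27 batches and a projected range of 946 to 1,148 matches, which is where the rounded ~950-1,150 estimate comes from.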
### Option 2: Check Current Status
```bash
python3 -c "
import yaml
with open('data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r', encoding='utf-8') as f:
    orgs = yaml.safe_load(f)
# Count records that already carry a wikidata_id field
enriched = sum(1 for o in orgs if 'wikidata_id' in o)
print(f'Enriched: {enriched}/{len(orgs)} ({enriched/len(orgs)*100:.1f}%)')
"
```
### Option 3: Validate Enrichment
```bash
# Check for duplicates and verify Q-numbers
python3 scripts/validate_wikidata_enrichment.py
```
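
The contents of the validation script are not shown here. As a rough sketch of what "check for duplicates and verify Q-numbers" could look like, assuming the YAML loads as a list of dicts with an optional `wikidata_id` field (the helper name and the `name` key in the sample are hypothetical):

```python
import re
from collections import Counter

# A Q-number is the letter Q followed by a positive integer without leading zeros
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def check_wikidata_ids(orgs):
    """Return (malformed_ids, duplicated_ids) for a list of organization dicts."""
    ids = [
        str(o["wikidata_id"])
        for o in orgs
        if isinstance(o, dict) and o.get("wikidata_id")
    ]
    malformed = [q for q in ids if not QID_PATTERN.match(q)]
    duplicated = sorted(q for q, n in Counter(ids).items() if n > 1)
    return malformed, duplicated


sample = [
    {"name": "Drents Museum", "wikidata_id": "Q1258370"},
    {"name": "Drents Archief", "wikidata_id": "Q1978308"},
    {"name": "Duplicate entry", "wikidata_id": "Q1978308"},
    {"name": "Typo entry", "wikidata_id": "1258370"},
    {"name": "No match yet"},
]
malformed, duplicated = check_wikidata_ids(sample)
print("Malformed:", malformed)
print("Duplicated:", duplicated)
```

A duplicate is not necessarily an error: municipal archives are matched to their municipality's Q-number, so a municipality and its archive may legitimately share an ID. Duplicates are better treated as candidates for manual review.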
---
## Key Commands
### Search Wikidata Entity
```python
from wikidata_mcp import search_entity
q_number = search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804
```
### Verify Q-Number
```python
from wikidata_mcp import get_metadata
metadata = get_metadata("Q190804", language="nl")
# Returns: {"Label": "Rijksmuseum", "Description": "museum in Amsterdam"}
```
### SPARQL Query (for municipalities)
```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .  # Dutch municipality
  ?item rdfs:label "Amsterdam"@nl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
```
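
To reuse this query for other municipalities, the label can be parameterized. The sketch below only builds the query text and a request URL for the public Wikidata Query Service endpoint; it does not send anything. The function names are hypothetical, and whether the enrichment scripts build their queries this way is an assumption:

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def municipality_query(name: str) -> str:
    """Build the municipality lookup query for a Dutch-language label."""
    # Escape backslashes and double quotes so the label stays a valid SPARQL literal
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        "  ?item wdt:P31 wd:Q2039348 .\n"
        f'  ?item rdfs:label "{escaped}"@nl .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "}"
    )


def query_url(name: str) -> str:
    """Return a GET URL that requests JSON results for the query."""
    return WDQS_ENDPOINT + "?" + urlencode(
        {"query": municipality_query(name), "format": "json"}
    )


print(municipality_query("Coevorden"))
print(query_url("Coevorden"))
```

Matching on an exact `rdfs:label` is strict: a name with different spelling or casing in Wikidata returns no rows, so an empty result should be treated as "not found by label" rather than "not in Wikidata".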
---
## Enriched Records (Test Batch)
| Organization | Wikidata ID | URL |
|--------------|-------------|-----|
| Herinneringscentrum Kamp Westerbork | Q22246632 | https://www.wikidata.org/wiki/Q22246632 |
| Hunebedcentrum | Q2679819 | https://www.wikidata.org/wiki/Q2679819 |
| Drents Archief | Q1978308 | https://www.wikidata.org/wiki/Q1978308 |
| Drents Museum | Q1258370 | https://www.wikidata.org/wiki/Q1258370 |
| Gemeente Aa en Hunze | Q300665 | https://www.wikidata.org/wiki/Q300665 |
| Gemeente Borger-Odoorn | Q835118 | https://www.wikidata.org/wiki/Q835118 |
| Gemeente Coevorden | Q60453 | https://www.wikidata.org/wiki/Q60453 |
| Gemeente De Wolden | Q835108 | https://www.wikidata.org/wiki/Q835108 |
---
## Success Patterns
- ✅ **Museums with national/international recognition**: 100% match
- ✅ **Municipal archives**: 100% match (use municipality Q-number)
- ✅ **Regional archives**: 100% match
- ⚠️ **Branch locations**: Low coverage (often not in Wikidata)
- ⚠️ **Inter-municipal partnerships**: Low coverage
- ⚠️ **Small local societies**: Varies (50-70% expected)
---
## Next Milestones
1. **First 100 records** enriched → Review success rate, adjust strategy
2. **First 500 records** enriched → Identify patterns in no-matches
3. **First 1,000 records** enriched → Statistical analysis
4. **All 1,351 records** processed → Final report and integration
---
## Important Notes
- **Rate limit**: 1,000 requests/hour (Wikidata API)
- **Batch size**: 50 records per batch (recommended)
- **Verification**: Always verify Q-numbers with `get_metadata`
- **Backup**: Created before enrichment (`.backup.*.yaml`)
- **Logging**: All queries logged in `/data/nde/sparql/`
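
The rate limit and batch size above can be combined into a simple throttled loop. This is a sketch only, assuming one request per record and the 1,000 requests/hour figure stated in this guide; `run_throttled` and `batched` are hypothetical helpers, not functions from the enrichment script:

```python
import time


def min_interval(requests_per_hour: int) -> float:
    """Seconds to wait between requests to stay under the hourly limit."""
    return 3600.0 / requests_per_hour


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_throttled(items, handler, requests_per_hour=1000, sleep=time.sleep):
    """Call handler(item) for each item, pausing between calls."""
    pause = min_interval(requests_per_hour)
    results = []
    for i, item in enumerate(items):
        if i:  # no pause needed before the first request
            sleep(pause)
        results.append(handler(item))
    return results


# Dry run with a stand-in handler and a no-op sleep, so nothing is actually delayed
records = [f"record-{n}" for n in range(120)]
for batch in batched(records, 50):
    processed = run_throttled(batch, handler=str.upper, sleep=lambda s: None)
    print(len(processed), "records processed")
```

At 1,000 requests/hour the pause is 3.6 seconds. If a record needs more than one request (for example a search followed by a metadata check), the effective per-record time scales accordingly.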
---
## Questions to Address
1. Should we create Wikidata entries for missing institutions?
2. How to handle branch locations (link to parent or omit)?
3. Manual review threshold (currently 85% confidence)?
4. When to integrate with main GLAM project schema?
---
## Documentation
- 📄 **Full Report**: `/docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md`
- 📄 **Dataset Guide**: `/data/nde/README.md`
- 📄 **Session Summary**: `/docs/sessions/SESSION_SUMMARY_20251117_NDE_WIKIDATA_TEST_BATCH.md`
- 📄 **Agent Instructions**: `/AGENTS.md` (Wikidata enrichment section)
---
- **Ready to Scale**: YES ✅
- **Confidence**: HIGH
- **Expected Success**: 70-85%