glam/data/nde/README.md
2025-11-19 23:25:22 +01:00

# NDE Dutch Heritage Organizations Dataset
**Dataset Name**: Voorbeeld lijst organisaties en diensten - Totaallijst Nederland
**Source**: Network Digital Heritage (NDE)
**Records**: 1,351 Dutch heritage organizations
**Last Updated**: 2025-11-17
**Enrichment Status**: Test batch complete (10 records processed, 8 matched to Wikidata IDs)
---
## Dataset Overview
This directory contains the NDE dataset of Dutch heritage organizations, converted from CSV to YAML. Wikidata enrichment has so far been applied to a 10-record test batch only.
### Files
| File | Size | Description |
|------|------|-------------|
| `voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv` | 168 KB | Original CSV source (1,351 records) |
| `voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml` | 259 KB | Converted YAML with enrichment |
| `voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.backup.*.yaml` | 259 KB | Backup before enrichment |
| `sample_yaml_for_validation.yaml` | 2 KB | Sample for validation testing |
### Subdirectories
- **`linkml/`** - LinkML schemas for CSV source, YAML target, and field mappings
- **`sparql/`** - SPARQL query logs and enrichment results
---
## Dataset Statistics
### Record Counts by Type
| Type | Count | Percentage |
|------|-------|------------|
| Archive (archief) | ~600 | 44% |
| Museum | ~500 | 37% |
| Library (bibliotheek) | ~150 | 11% |
| Historical Society (historische vereniging) | ~100 | 7% |
| **Total** | **1,351** | **100%** |
### Geographic Coverage
- **Provinces**: All 12 Dutch provinces
- **Cities**: 475+ municipalities
- **Focus**: Drenthe province (test batch)
### Data Quality
- **ISIL Codes**: 1,119 records (83%)
- **Websites**: 1,200+ records (89%)
- **Digital Platforms**: 1,119 records (83%)
- **Wikidata IDs**: 8 records (0.6%) - *test batch only*
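
Coverage figures like the ones above can be recomputed from the loaded YAML. The following is a minimal sketch, not the project's own tooling: `field_coverage` is a hypothetical helper, and it assumes a field counts as present when its value is non-empty after stripping whitespace. The sample rows are illustrative placeholders, not dataset records.

```python
def field_coverage(records, fields):
    """Return {field: (count, percentage)} of records with a non-empty value."""
    total = len(records)
    coverage = {}
    for field in fields:
        # Treat missing keys, None, and whitespace-only strings as empty.
        count = sum(1 for rec in records if str(rec.get(field) or "").strip())
        pct = round(100 * count / total, 1) if total else 0.0
        coverage[field] = (count, pct)
    return coverage


# Illustrative placeholder rows (NOT taken from the dataset).
sample = [
    {"organisatie": "A", "isil-code_na": "NL-XXX", "wikidata_id": "Q1"},
    {"organisatie": "B", "isil-code_na": "", "webadres_organisatie": "https://example.org/"},
    {"organisatie": "C", "isil-code_na": None},
    {"organisatie": "D", "isil-code_na": "NL-YYY"},
]

stats = field_coverage(sample, ["isil-code_na", "webadres_organisatie", "wikidata_id"])
for field, (count, pct) in stats.items():
    print(f"{field}: {count} records ({pct}%)")
```

Passing the list returned by `yaml.safe_load` instead of `sample` would give dataset-wide figures under the same emptiness assumption.
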
---
## Wikidata Enrichment Status
### Current Progress
- **Test Batch**: 10 records processed ✓
- **Success Rate**: 80% (8/10 matched)
- **Full Dataset**: Pending (1,341 records remaining)
### Enriched Records
See `/docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md` for complete enrichment results.
**Sample enriched record**:
```yaml
- plaatsnaam_bezoekadres: Assen
  straat_en_huisnummer_bezoekadres: Brink 1
  organisatie: Stichting Drents Museum
  webadres_organisatie: https://drentsmuseum.nl/
  type_organisatie: museum
  isil-code_na: NL-AsnDM
  wikidata_id: Q1258370  # ← Wikidata enrichment
```
### No-Match Records
Records without a Wikidata match are flagged with `wikidata_enrichment_status: no_match_found`. They fall into categories such as:
1. Branch locations (e.g., museum extensions)
2. Inter-municipal partnerships
3. Small local societies
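
To see how many records fall into each enrichment state, the status flag can be tallied directly. This is a sketch under assumptions: only `no_match_found` is documented here, so the labels `matched` and `not_processed` are hypothetical names introduced for illustration, and the two unmatched sample records are invented placeholders.

```python
from collections import Counter


def enrichment_summary(records):
    """Count records per enrichment state."""
    statuses = Counter()
    for rec in records:
        if rec.get("wikidata_id"):
            statuses["matched"] += 1  # hypothetical label
        elif rec.get("wikidata_enrichment_status") == "no_match_found":
            statuses["no_match_found"] += 1  # flag documented in this README
        else:
            statuses["not_processed"] += 1  # hypothetical label
    return statuses


def no_match_records(records):
    """Return the records explicitly flagged as having no Wikidata match."""
    return [
        rec for rec in records
        if rec.get("wikidata_enrichment_status") == "no_match_found"
    ]


records = [
    {"organisatie": "Stichting Drents Museum", "wikidata_id": "Q1258370"},
    # Placeholder records for illustration only.
    {"organisatie": "Voorbeeld Dependance", "wikidata_enrichment_status": "no_match_found"},
    {"organisatie": "Voorbeeld Vereniging"},
]

summary = enrichment_summary(records)
print(dict(summary))
print([rec["organisatie"] for rec in no_match_records(records)])
```
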
---
## Schema Documentation
### LinkML Schemas
Located in `linkml/` subdirectory:
1. **`nde_csv_source.yaml`** - Original CSV structure (33 columns)
2. **`nde_yaml_target.yaml`** - Normalized YAML structure (34 fields including Wikidata)
3. **`nde_csv_to_yaml_mapping.yaml`** - Field transformation documentation
### Field Definitions
**Core Fields**:
- `organisatie` - Organization name
- `type_organisatie` - Organization type (museum, archief, bibliotheek, etc.)
- `plaatsnaam_bezoekadres` - City/town
- `straat_en_huisnummer_bezoekadres` - Street address
- `webadres_organisatie` - Website URL
- `isil-code_na` - ISIL identifier (NL-XXX format)
**Enrichment Fields** (NEW):
- `wikidata_id` - Wikidata Q-number (e.g., Q1258370)
- `wikidata_enrichment_status` - Enrichment status flag
**Platform Integration** (40+ fields):
- Collection management systems (Atlantis, MAIS, etc.)
- Aggregation platforms (Collectie Nederland, Archieven.nl, etc.)
- Thematic networks (WO2Net, Modemuze, Van Gogh Worldwide, etc.)
See `/docs/CSV_TO_YAML_QUICK_REFERENCE.md` for complete field reference.
---
## Usage Examples
### Load YAML Data (Python)
```python
import yaml

# Load the full dataset (a list of organization records)
with open('voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r', encoding='utf-8') as f:
    organizations = yaml.safe_load(f)

# Filter by type
museums = [org for org in organizations if org.get('type_organisatie') == 'museum']

# Find organizations with a non-empty Wikidata ID
enriched = [org for org in organizations if org.get('wikidata_id')]

# Find organizations with a non-empty ISIL code
with_isil = [org for org in organizations if org.get('isil-code_na')]
```
### Query Wikidata-Enriched Records
```python
import yaml

with open('voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r', encoding='utf-8') as f:
    organizations = yaml.safe_load(f)

# Get all enriched records (non-empty wikidata_id)
enriched = [
    org for org in organizations
    if org.get('wikidata_id')
]

# Print each organization with a link to its Wikidata page
for org in enriched:
    print(f"{org['organisatie']}: https://www.wikidata.org/wiki/{org['wikidata_id']}")
```
### Validate Against LinkML Schema
```bash
linkml-validate \
  -s linkml/nde_yaml_target.yaml \
  voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml
```
---
## Conversion & Enrichment Scripts
Located in `/scripts/`:
### CSV to YAML Conversion
- `convert_nde_csv_to_yaml.py` - Initial CSV → YAML conversion
- `validate_csv_to_yaml_conversion.py` - Validation script (zero data loss verified)
### Wikidata Enrichment
- `update_nde_yaml_with_wikidata_test_batch.py` - Test batch enrichment (10 records) ✓
- `enrich_nde_with_wikidata.py` - Full dataset enrichment (prepared, not yet run)
- `prepare_wikidata_enrichment.py` - Interactive enrichment helper
---
## SPARQL Query Logs
All Wikidata queries logged in `sparql/` subdirectory:
### Query Types
1. **Direct entity search** - By organization name
2. **SPARQL queries** - For municipalities and specialized searches
3. **Metadata verification** - Confirm Q-number matches
### Log Files
- `*_prepared.json` - Prepared SPARQL queries (10 files)
- `enrichment_log_test_batch_*.json` - Enrichment results
- `master_query_log_*.json` - Consolidated query history
### Example SPARQL Query
```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .        # Instance of: Dutch municipality
  ?item wdt:P131 wd:Q770 .           # Located in: Drenthe
  ?item rdfs:label "Coevorden"@nl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
```
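
A query of this shape can be generated per municipality rather than written by hand. The sketch below is not the logic of the enrichment scripts listed above: `build_municipality_query` and `run_query` are hypothetical helpers, the province Q-number is passed in exactly as it appears in the example query, and the `User-Agent` string is a placeholder. `run_query` targets the public Wikidata SPARQL endpoint and needs network access, so it is defined but not called here.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_municipality_query(label, province_qid, language="nl"):
    """Build a SPARQL query for a Dutch municipality by label and province."""
    if not re.fullmatch(r"Q[1-9][0-9]*", province_qid):
        raise ValueError(f"not a Wikidata Q-number: {province_qid!r}")
    # Escape backslashes and double quotes so the label is a valid SPARQL literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        "  ?item wdt:P31 wd:Q2039348 .\n"
        f"  ?item wdt:P131 wd:{province_qid} .\n"
        f'  ?item rdfs:label "{escaped}"@{language} .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "}"
    )


def run_query(query, user_agent="nde-readme-example/0.1"):
    """Send the query to the endpoint and return the result bindings (needs network)."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


query = build_municipality_query("Coevorden", "Q770")
print(query)
```

Validating the Q-number and escaping the label keeps organization names that contain quotes from breaking the generated query.
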
---
## Integration with Main GLAM Project
### Mapping to HeritageCustodian Schema
NDE organizations will be converted to the main project's `HeritageCustodian` LinkML schema:
**Field Mappings**:
```yaml
HeritageCustodian:
  name: organisatie
  institution_type: type_organisatie  # Mapped to GLAMORCUBESFIXPHDNT taxonomy
  locations:
    - city: plaatsnaam_bezoekadres
      street_address: straat_en_huisnummer_bezoekadres
  identifiers:
    - identifier_scheme: "ISIL"
      identifier_value: isil-code_na
    - identifier_scheme: "Wikidata"
      identifier_value: wikidata_id
```
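
The mapping above could be applied to a single record roughly as follows. This is a sketch, not the project's converter: `to_heritage_custodian` is a hypothetical function that returns a plain dictionary shaped like the mapping, and it passes `type_organisatie` through unchanged because the taxonomy mapping is not specified in this README.

```python
def to_heritage_custodian(record):
    """Map one NDE record to a dict shaped like the HeritageCustodian mapping."""
    identifiers = []
    for scheme, field in (("ISIL", "isil-code_na"), ("Wikidata", "wikidata_id")):
        value = record.get(field)
        if value:  # skip missing or empty identifiers
            identifiers.append(
                {"identifier_scheme": scheme, "identifier_value": value}
            )
    return {
        "name": record.get("organisatie"),
        # Raw NDE value; the taxonomy mapping is assumed to happen elsewhere.
        "institution_type": record.get("type_organisatie"),
        "locations": [
            {
                "city": record.get("plaatsnaam_bezoekadres"),
                "street_address": record.get("straat_en_huisnummer_bezoekadres"),
            }
        ],
        "identifiers": identifiers,
    }


# The sample enriched record shown earlier in this README.
drents_museum = {
    "plaatsnaam_bezoekadres": "Assen",
    "straat_en_huisnummer_bezoekadres": "Brink 1",
    "organisatie": "Stichting Drents Museum",
    "webadres_organisatie": "https://drentsmuseum.nl/",
    "type_organisatie": "museum",
    "isil-code_na": "NL-AsnDM",
    "wikidata_id": "Q1258370",
}

custodian = to_heritage_custodian(drents_museum)
print(custodian)
```
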
### GHCID Generation
All NDE organizations will receive Global Heritage Custodian Identifiers:
```
NL-DR-ASN-M-DM # Stichting Drents Museum
NL-DR-ASN-A-DA # Drents Archief
NL-DR-BOR-M-HC # Hunebedcentrum
```
Format: `{Country}-{Province}-{City}-{Type}-{Abbreviation}`
See `/docs/PERSISTENT_IDENTIFIERS.md` for GHCID specification.
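
The five-part format can be composed and parsed with a small helper. This sketch assumes that no component contains a hyphen and applies no other rules; the `GHCID` class and `parse_ghcid` are hypothetical names, and the specification referenced above remains authoritative.

```python
from typing import NamedTuple


class GHCID(NamedTuple):
    """The five components of {Country}-{Province}-{City}-{Type}-{Abbreviation}."""
    country: str
    province: str
    city: str
    inst_type: str
    abbreviation: str

    def __str__(self):
        return "-".join(self)


def parse_ghcid(value):
    """Split a GHCID string into components (assumes hyphen-free components)."""
    parts = value.split("-")
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty components: {value!r}")
    return GHCID(*parts)


drents_museum_id = GHCID("NL", "DR", "ASN", "M", "DM")
print(str(drents_museum_id))

drents_archief_id = parse_ghcid("NL-DR-ASN-A-DA")
print(drents_archief_id.city, drents_archief_id.inst_type)
```
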
---
## Data Quality Notes
### Known Issues
1. **Unnamed first column**: Some records have province/region in unnamed column
2. **ISIL code format**: Some non-standard codes (e.g., "Drente" instead of NL-XXX format)
3. **Multiline addresses**: Some addresses span multiple fields
4. **Closed institutions**: Some organizations marked as closed (check `unnamed_field`)
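
Non-standard ISIL values such as the one in item 2 can be flagged with a simple pattern check. The pattern below is a deliberately simplified assumption (prefix `NL-` followed by 1 to 11 characters from letters, digits, `:`, `/`, `-`), and `is_plausible_dutch_isil` is a hypothetical helper. It checks shape only and does not confirm that a code is registered.

```python
import re

# Simplified shape check; an assumption for illustration, not a full ISIL validator.
ISIL_NL_PATTERN = re.compile(r"NL-[A-Za-z0-9:/-]{1,11}")


def is_plausible_dutch_isil(value):
    """Return True if value looks like a Dutch ISIL code."""
    return isinstance(value, str) and ISIL_NL_PATTERN.fullmatch(value.strip()) is not None


def suspicious_isil_records(records):
    """Return (organisatie, code) pairs whose non-empty ISIL value fails the check."""
    return [
        (rec.get("organisatie"), rec["isil-code_na"])
        for rec in records
        if rec.get("isil-code_na") and not is_plausible_dutch_isil(rec["isil-code_na"])
    ]


records = [
    {"organisatie": "Stichting Drents Museum", "isil-code_na": "NL-AsnDM"},
    # Placeholder record reproducing the non-standard value mentioned above.
    {"organisatie": "Voorbeeld Organisatie", "isil-code_na": "Drente"},
    {"organisatie": "Zonder Code"},
]

flagged = suspicious_isil_records(records)
print(flagged)
```
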
### Validation Results
From `scripts/validate_csv_to_yaml_conversion.py`:
- ✓ All 33 CSV columns mapped
- ✓ All 6,980 non-empty cells preserved
- ✓ Zero data loss
- ✓ Zero mismatches
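
A cell-preservation check of this kind can be approximated by counting non-empty values on both sides. The sketch below is not the logic of `validate_csv_to_yaml_conversion.py`: the function names are hypothetical, the inline CSV is a two-row placeholder, and enrichment fields are excluded from the YAML count on the assumption that they were added after conversion.

```python
import csv
import io

ENRICHMENT_FIELDS = {"wikidata_id", "wikidata_enrichment_status"}


def count_nonempty_csv_cells(csv_text):
    """Count non-empty data cells in CSV text (header row excluded)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        for value in row.values()
        if isinstance(value, str) and value.strip()
    )


def count_nonempty_record_values(records, ignore=ENRICHMENT_FIELDS):
    """Count non-empty values in YAML records, skipping enrichment fields."""
    return sum(
        1
        for rec in records
        for key, value in rec.items()
        if key not in ignore and value is not None and str(value).strip()
    )


# Placeholder data for illustration only.
csv_text = (
    "organisatie,type_organisatie,isil-code_na\n"
    "Stichting Drents Museum,museum,NL-AsnDM\n"
    "Voorbeeld Vereniging,historische vereniging,\n"
)
records = [
    {
        "organisatie": "Stichting Drents Museum",
        "type_organisatie": "museum",
        "isil-code_na": "NL-AsnDM",
        "wikidata_id": "Q1258370",
    },
    {"organisatie": "Voorbeeld Vereniging", "type_organisatie": "historische vereniging"},
]

csv_cells = count_nonempty_csv_cells(csv_text)
yaml_cells = count_nonempty_record_values(records)
print(csv_cells, yaml_cells, csv_cells == yaml_cells)
```
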
---
## Next Steps
### Immediate Tasks
1. **Scale Wikidata enrichment** to full dataset (1,341 records)
2. **Handle ambiguous matches** - Set up manual review queue
3. **Create Wikidata entries** for missing high-priority organizations
4. **Validate all Q-numbers** - Verify they resolve correctly
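
Before checking that Q-numbers resolve, malformed values can be caught offline with a syntactic test. This sketch covers only the format (`Q` followed by a positive integer without leading zeros) and does not contact Wikidata. The helper names are hypothetical and the malformed sample value is a placeholder.

```python
import re

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_valid_qid(value):
    """Return True if value is syntactically a Wikidata Q-number."""
    return isinstance(value, str) and QID_PATTERN.fullmatch(value) is not None


def wikidata_url(qid):
    """Build the entity page URL for a syntactically valid Q-number."""
    if not is_valid_qid(qid):
        raise ValueError(f"not a Wikidata Q-number: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"


def invalid_qids(records):
    """Return (organisatie, wikidata_id) pairs with a malformed Q-number."""
    return [
        (rec.get("organisatie"), rec["wikidata_id"])
        for rec in records
        if rec.get("wikidata_id") and not is_valid_qid(rec["wikidata_id"])
    ]


records = [
    {"organisatie": "Stichting Drents Museum", "wikidata_id": "Q1258370"},
    # Placeholder record with a deliberately malformed value.
    {"organisatie": "Voorbeeld Organisatie", "wikidata_id": "1258370"},
]

print(wikidata_url("Q1258370"))
print(invalid_qids(records))
```
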
### Integration Tasks
5. **Convert to HeritageCustodian format** - Map to main LinkML schema
6. **Generate GHCIDs** - Create persistent identifiers
7. **Export to RDF/JSON-LD** - With Wikidata links
8. **Merge with ISIL registry** - Cross-link with Dutch ISIL dataset
### Documentation Updates
9. Update project `PROGRESS.md` with NDE statistics
10. Create NDE-specific extraction guide
11. Document manual Wikidata creation workflow
---
## References
- **Main Documentation**: `/docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md`
- **Schema Reference**: `/docs/CSV_TO_YAML_QUICK_REFERENCE.md`
- **Validation Report**: `/docs/NDE_CSV_TO_YAML_LINKML_VALIDATION.md`
- **Project Guide**: `/AGENTS.md` (AI agent instructions)
---
## Contact & Support
**Project**: GLAM Data Extraction Project
**Repository**: `/Users/kempersc/apps/glam`
**Dataset Version**: v1.1 (with Wikidata enrichment)
**Last Enrichment**: 2025-11-17 (test batch)
---
**End of README**