
NDE Dutch Heritage Organizations Dataset

Dataset Name: Voorbeeld lijst organisaties en diensten - Totaallijst Nederland
Source: Network Digital Heritage (NDE)
Records: 1,351 Dutch heritage organizations
Last Updated: 2025-11-17
Enrichment Status: Test batch complete (10 records processed, 8 matched to Wikidata IDs)


Dataset Overview

This directory contains the NDE dataset of Dutch heritage organizations, converted from CSV to YAML format with Wikidata enrichment.

Files

File | Size | Description
voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv | 168 KB | Original CSV source (1,351 records)
voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml | 259 KB | Converted YAML with enrichment
voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.backup.*.yaml | 259 KB | Backup before enrichment
sample_yaml_for_validation.yaml | 2 KB | Sample for validation testing

Subdirectories

  • linkml/ - LinkML schemas for CSV source, YAML target, and field mappings
  • sparql/ - SPARQL query logs and enrichment results

Dataset Statistics

Record Counts by Type

Type | Count | Percentage
Archive (archief) | ~600 | 44%
Museum | ~500 | 37%
Library (bibliotheek) | ~150 | 11%
Historical Society (historische vereniging) | ~100 | 7%
Total | 1,351 | 100%

Per-type counts are approximate, so the rounded percentages sum to 99%.

Geographic Coverage

  • Provinces: All 12 Dutch provinces
  • Cities: 475+ distinct cities and towns
  • Focus: Drenthe province (test batch)

Data Quality

  • ISIL Codes: 1,119 records (83%)
  • Websites: 1,200+ records (89%)
  • Digital Platforms: 1,119 records (83%)
  • Wikidata IDs: 8 records (0.6%) - test batch only
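
The coverage figures above can be recomputed from the loaded records. The following is a minimal sketch, assuming each record is a dict keyed by the field names documented in this README and that an absent, None, or empty value counts as missing. The helper name field_coverage and the second sample record are hypothetical.

```python
def field_coverage(records, field):
    """Return (count, percentage) of records with a non-empty value for `field`."""
    filled = sum(1 for rec in records if rec.get(field) not in (None, ""))
    pct = round(100 * filled / len(records), 1) if records else 0.0
    return filled, pct


# Tiny illustrative sample; the second record is made up for the example.
sample = [
    {"organisatie": "Stichting Drents Museum",
     "isil-code_na": "NL-AsnDM",
     "wikidata_id": "Q1258370"},
    {"organisatie": "Example Society", "isil-code_na": ""},
]

for field in ("isil-code_na", "webadres_organisatie", "wikidata_id"):
    count, pct = field_coverage(sample, field)
    print(f"{field}: {count} records ({pct}%)")
```

Running the same helper over the full dataset would show whether the percentages listed here are still current after later enrichment runs.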

Wikidata Enrichment Status

Current Progress

  • Test Batch: 10 records processed ✓
  • Success Rate: 80% (8/10 matched)
  • Full Dataset: Pending (1,341 records remaining)

Enriched Records

See /docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md for complete enrichment results.

Sample enriched record:

- plaatsnaam_bezoekadres: Assen
  straat_en_huisnummer_bezoekadres: Brink 1
  organisatie: Stichting Drents Museum
  webadres_organisatie: https://drentsmuseum.nl/
  type_organisatie: museum
  isil-code_na: NL-AsnDM
  wikidata_id: Q1258370  # ← Wikidata enrichment

No-Match Records

Records flagged with wikidata_enrichment_status: no_match_found fall into these categories:

  1. Branch locations (e.g., museum extensions)
  2. Inter-municipal partnerships
  3. Small local societies
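
To collect the no-match records for manual review, something like the sketch below could be used. It assumes the status flag is stored exactly as the string no_match_found in the wikidata_enrichment_status field. The helper names and the second sample record are hypothetical.

```python
from collections import Counter


def no_match_records(records):
    """Select records where enrichment ran but found no Wikidata match."""
    return [
        rec for rec in records
        if rec.get("wikidata_enrichment_status") == "no_match_found"
    ]


def no_match_by_type(records):
    """Count no-match records per organization type."""
    return Counter(
        rec.get("type_organisatie", "unknown")
        for rec in no_match_records(records)
    )


# Illustrative sample; the second record is made up for the example.
sample = [
    {"organisatie": "Stichting Drents Museum",
     "type_organisatie": "museum",
     "wikidata_id": "Q1258370"},
    {"organisatie": "Example Local Society",
     "type_organisatie": "historische vereniging",
     "wikidata_enrichment_status": "no_match_found"},
]

for rec in no_match_records(sample):
    print(rec["organisatie"])
```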

Schema Documentation

LinkML Schemas

Located in the linkml/ subdirectory:

  1. nde_csv_source.yaml - Original CSV structure (33 columns)
  2. nde_yaml_target.yaml - Normalized YAML structure (34 fields including Wikidata)
  3. nde_csv_to_yaml_mapping.yaml - Field transformation documentation

Field Definitions

Core Fields:

  • organisatie - Organization name
  • type_organisatie - Organization type (museum, archief, bibliotheek, etc.)
  • plaatsnaam_bezoekadres - City/town
  • straat_en_huisnummer_bezoekadres - Street address
  • webadres_organisatie - Website URL
  • isil-code_na - ISIL identifier (NL-XXX format)

Enrichment Fields (NEW):

  • wikidata_id - Wikidata Q-number (e.g., Q1258370)
  • wikidata_enrichment_status - Enrichment status flag

Platform Integration (40+ fields):

  • Collection management systems (Atlantis, MAIS, etc.)
  • Aggregation platforms (Collectie Nederland, Archieven.nl, etc.)
  • Thematic networks (WO2Net, Modemuze, Van Gogh Worldwide, etc.)

See /docs/CSV_TO_YAML_QUICK_REFERENCE.md for complete field reference.


Usage Examples

Load YAML Data (Python)

import yaml

# Load the converted dataset (a YAML list of organization records)
with open('voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r', encoding='utf-8') as f:
    organizations = yaml.safe_load(f)

# Filter by type
museums = [org for org in organizations if org.get('type_organisatie') == 'museum']

# Find organizations with a non-empty Wikidata ID
enriched = [org for org in organizations if org.get('wikidata_id')]

# Find organizations with a non-empty ISIL code
with_isil = [org for org in organizations if org.get('isil-code_na')]

Query Wikidata-Enriched Records

import yaml

with open('voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml', 'r', encoding='utf-8') as f:
    organizations = yaml.safe_load(f)

# Get all enriched records (non-empty wikidata_id)
enriched = [
    org for org in organizations
    if org.get('wikidata_id')
]

# Print each organization name with a link to its Wikidata entity page
for org in enriched:
    print(f"{org.get('organisatie')}: https://www.wikidata.org/wiki/{org['wikidata_id']}")

Validate Against LinkML Schema

linkml-validate \
  -s linkml/nde_yaml_target.yaml \
  voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml

Conversion & Enrichment Scripts

Located in /scripts/:

CSV to YAML Conversion

  • convert_nde_csv_to_yaml.py - Initial CSV → YAML conversion
  • validate_csv_to_yaml_conversion.py - Validation script (zero data loss verified)

Wikidata Enrichment

  • update_nde_yaml_with_wikidata_test_batch.py - Test batch enrichment (10 records) ✓
  • enrich_nde_with_wikidata.py - Full dataset enrichment (prepared, not yet run)
  • prepare_wikidata_enrichment.py - Interactive enrichment helper

SPARQL Query Logs

All Wikidata queries are logged in the sparql/ subdirectory:

Query Types

  1. Direct entity search - By organization name
  2. SPARQL queries - For municipalities and specialized searches
  3. Metadata verification - Confirm Q-number matches

Log Files

  • *_prepared.json - Prepared SPARQL queries (10 files)
  • enrichment_log_test_batch_*.json - Enrichment results
  • master_query_log_*.json - Consolidated query history

Example SPARQL Query

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q2039348 .  # Instance of: Dutch municipality
  ?item wdt:P131 wd:Q772 .     # Located in: Drenthe
  ?item rdfs:label "Coevorden"@nl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
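
Queries of this shape can be generated per place name rather than written by hand. The sketch below only builds the query text and a request URL for the public Wikidata Query Service endpoint and does not send anything. The function names are hypothetical, and the province filter is left optional so the caller supplies the province Q-number.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service


def build_municipality_query(name, province_qid=None):
    """Build a SPARQL query for a Dutch municipality with the given Dutch label."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal
    safe_name = name.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "SELECT ?item ?itemLabel WHERE {",
        "  ?item wdt:P31 wd:Q2039348 .",
    ]
    if province_qid:
        # Optional restriction to a province (P131: located in)
        lines.append(f"  ?item wdt:P131 wd:{province_qid} .")
    lines += [
        f'  ?item rdfs:label "{safe_name}"@nl .',
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }',
        "}",
    ]
    return "\n".join(lines)


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


query = build_municipality_query("Coevorden")
print(query)
print(build_request_url(query))
```

Keeping query construction separate from the HTTP call makes it easy to write the prepared query to a log file before it is executed.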

Integration with Main GLAM Project

Mapping to HeritageCustodian Schema

NDE organizations will be converted to the main project's HeritageCustodian LinkML schema:

Field Mappings:

HeritageCustodian:
  name: organisatie
  institution_type: type_organisatie  # Mapped to GLAMORCUBESFIXPHDNT taxonomy
  locations:
    - city: plaatsnaam_bezoekadres
      street_address: straat_en_huisnummer_bezoekadres
  identifiers:
    - identifier_scheme: "ISIL"
      identifier_value: isil-code_na
    - identifier_scheme: "Wikidata"
      identifier_value: wikidata_id
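
The mapping above can be expressed as a small conversion function. This is a sketch under assumptions: it produces a plain dict shaped like the field mapping, not an instance of the real HeritageCustodian class, and it passes type_organisatie through unchanged because the taxonomy mapping is not described here. The function name to_heritage_custodian is hypothetical.

```python
def to_heritage_custodian(record):
    """Convert one NDE record (dict) to a HeritageCustodian-shaped dict."""
    identifiers = []
    if record.get("isil-code_na"):
        identifiers.append({
            "identifier_scheme": "ISIL",
            "identifier_value": record["isil-code_na"],
        })
    if record.get("wikidata_id"):
        identifiers.append({
            "identifier_scheme": "Wikidata",
            "identifier_value": record["wikidata_id"],
        })
    return {
        "name": record.get("organisatie"),
        # Raw NDE value; mapping to the project taxonomy is not applied here
        "institution_type": record.get("type_organisatie"),
        "locations": [{
            "city": record.get("plaatsnaam_bezoekadres"),
            "street_address": record.get("straat_en_huisnummer_bezoekadres"),
        }],
        "identifiers": identifiers,
    }


# The sample enriched record from this README
drents_museum = {
    "plaatsnaam_bezoekadres": "Assen",
    "straat_en_huisnummer_bezoekadres": "Brink 1",
    "organisatie": "Stichting Drents Museum",
    "webadres_organisatie": "https://drentsmuseum.nl/",
    "type_organisatie": "museum",
    "isil-code_na": "NL-AsnDM",
    "wikidata_id": "Q1258370",
}
custodian = to_heritage_custodian(drents_museum)
print(custodian["name"], [i["identifier_scheme"] for i in custodian["identifiers"]])
```

Identifiers are only emitted when the source field is non-empty, so records without an ISIL code or Wikidata ID do not get placeholder entries.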

GHCID Generation

All NDE organizations will receive Global Heritage Custodian Identifiers:

NL-DR-ASN-M-DM  # Stichting Drents Museum
NL-DR-ASN-A-DA  # Drents Archief
NL-DR-BOR-M-HC  # Hunebedcentrum

Format: {Country}-{Province}-{City}-{Type}-{Abbreviation}

See /docs/PERSISTENT_IDENTIFIERS.md for GHCID specification.
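
As an illustration of the five-part format, the sketch below joins and splits the components on hyphens. It assumes that no component itself contains a hyphen and treats the codes as opaque strings. The function names are hypothetical, and the actual rules for deriving each code are defined in the GHCID specification, not here.

```python
GHCID_PARTS = ("country", "province", "city", "type", "abbreviation")


def format_ghcid(country, province, city, type_code, abbreviation):
    """Join the five components into a GHCID-style string."""
    parts = [country, province, city, type_code, abbreviation]
    if not all(parts):
        raise ValueError("all five GHCID components are required")
    return "-".join(part.upper() for part in parts)


def parse_ghcid(ghcid):
    """Split a GHCID-style string back into its named components."""
    parts = ghcid.split("-")
    if len(parts) != len(GHCID_PARTS):
        raise ValueError(f"expected 5 hyphen-separated parts, got {len(parts)}")
    return dict(zip(GHCID_PARTS, parts))


print(format_ghcid("NL", "DR", "ASN", "M", "DM"))
print(parse_ghcid("NL-DR-ASN-A-DA"))
```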


Data Quality Notes

Known Issues

  1. Unnamed first column: Some records carry the province/region in an unnamed first column
  2. ISIL code format: Some codes are non-standard (e.g., "Drente" instead of the NL-XXX format)
  3. Multiline addresses: Some addresses span multiple fields
  4. Closed institutions: Some organizations are marked as closed (check unnamed_field)
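
Non-standard ISIL codes (issue 2) can be flagged with a simple pattern check. The sketch below is a loose check, not a full ISIL validator: it assumes a valid Dutch code is the prefix NL- followed by 1 to 11 letters, digits, colons, slashes, or hyphens. The pattern, the function name, and the second sample record are illustrative assumptions.

```python
import re

# Loose shape check for Dutch ISIL codes: "NL-" plus 1-11 permitted characters
ISIL_NL_PATTERN = re.compile(r"NL-[A-Za-z0-9:/-]{1,11}")


def nonstandard_isil(records):
    """Return (organization, code) pairs whose ISIL code does not fit the pattern."""
    flagged = []
    for rec in records:
        code = rec.get("isil-code_na")
        if code and not ISIL_NL_PATTERN.fullmatch(str(code).strip()):
            flagged.append((rec.get("organisatie"), code))
    return flagged


# Illustrative sample; the second record is made up for the example.
sample = [
    {"organisatie": "Stichting Drents Museum", "isil-code_na": "NL-AsnDM"},
    {"organisatie": "Example Archive", "isil-code_na": "Drente"},
    {"organisatie": "Example Society"},  # no ISIL code at all: not flagged
]
print(nonstandard_isil(sample))
```

Records with no ISIL code are skipped on purpose, since a missing code is a coverage gap rather than a format error.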

Validation Results

From scripts/validate_csv_to_yaml_conversion.py:

  • ✓ All 33 CSV columns mapped
  • ✓ All 6,980 non-empty cells preserved
  • ✓ Zero data loss
  • ✓ Zero mismatches
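
One way to arrive at a "cells preserved" figure is to count the non-empty values on both sides of the conversion and compare. The sketch below is not the project's validation script. It uses a small in-memory CSV and a hand-written list in place of the real files, and it assumes empty CSV cells are simply omitted from the YAML records.

```python
import csv
import io


def count_non_empty(rows):
    """Count values that are neither None nor blank across a list of dicts."""
    return sum(
        1
        for row in rows
        for value in row.values()
        if value is not None and str(value).strip() != ""
    )


# Stand-in for the CSV source; the second row is made up for the example.
csv_text = (
    "organisatie,type_organisatie,isil-code_na\n"
    "Stichting Drents Museum,museum,NL-AsnDM\n"
    "Example Society,historische vereniging,\n"
)
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Stand-in for the converted YAML records (empty cells omitted)
yaml_records = [
    {"organisatie": "Stichting Drents Museum",
     "type_organisatie": "museum",
     "isil-code_na": "NL-AsnDM"},
    {"organisatie": "Example Society",
     "type_organisatie": "historische vereniging"},
]

csv_cells = count_non_empty(csv_rows)
yaml_cells = count_non_empty(yaml_records)
print(f"CSV: {csv_cells}, YAML: {yaml_cells}, match: {csv_cells == yaml_cells}")
```

Equal counts only show that no cells were dropped. Detecting mismatched values would additionally require a field-by-field comparison.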

Next Steps

Immediate Tasks

  1. Scale Wikidata enrichment to full dataset (1,341 records)
  2. Handle ambiguous matches - Set up manual review queue
  3. Create Wikidata entries for missing high-priority organizations
  4. Validate all Q-numbers - Verify they resolve correctly
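
A first, offline pass for task 4 is to check that every stored value at least has the shape of a Q-number before any lookups are made. The sketch below assumes a well-formed ID is the letter Q followed by a positive integer without leading zeros. It does not confirm that the entity exists or is the right match. The function name and the second sample record are hypothetical.

```python
import re

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def malformed_qids(records):
    """Return (organization, value) pairs whose wikidata_id is not shaped like a Q-number."""
    bad = []
    for rec in records:
        qid = rec.get("wikidata_id")
        if qid and not QID_PATTERN.fullmatch(str(qid)):
            bad.append((rec.get("organisatie"), qid))
    return bad


# Illustrative sample; the second record is made up for the example.
sample = [
    {"organisatie": "Stichting Drents Museum", "wikidata_id": "Q1258370"},
    {"organisatie": "Example Archive", "wikidata_id": "1258370"},
]
print(malformed_qids(sample))
```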

Integration Tasks

  1. Convert to HeritageCustodian format - Map to main LinkML schema
  2. Generate GHCIDs - Create persistent identifiers
  3. Export to RDF/JSON-LD - With Wikidata links
  4. Merge with ISIL registry - Cross-link with Dutch ISIL dataset

Documentation Updates

  1. Update project PROGRESS.md with NDE statistics
  2. Create NDE-specific extraction guide
  3. Document manual Wikidata creation workflow

References

  • Main Documentation: /docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md
  • Schema Reference: /docs/CSV_TO_YAML_QUICK_REFERENCE.md
  • Validation Report: /docs/NDE_CSV_TO_YAML_LINKML_VALIDATION.md
  • Project Guide: /AGENTS.md (AI agent instructions)

Contact & Support

Project: GLAM Data Extraction Project
Repository: /Users/kempersc/apps/glam
Dataset Version: v1.1 (with Wikidata enrichment)
Last Enrichment: 2025-11-17 (test batch)


End of README