ISIL CSV to YAML Conversion Report

Date: 2025-11-17
Input: /data/isil/nl/nan/ISIL-codes_2025-11-06.csv
Output: /data/isil/nl/nan/ISIL-codes_2025-11-06.yaml
Script: /scripts/convert_isil_csv_to_yaml.py


Conversion Summary

Records Processed

  • Total records: 371 Dutch ISIL codes
  • Field preservation: 100% (2,226 fields, i.e. 371 records × 6 CSV fields, preserved exactly)
  • Value mismatches: 0 (perfect fidelity)

CSV Structure (Original)

The input CSV had a malformed structure:

  • All fields of a row packed into a single cell, separated by ","
  • Extra trailing semicolons (;;;;)
  • Latin-1 encoding (not UTF-8)
  • Header includes sequence number as first field

Fields:

  1. Row number (sequence)
  2. Plaats (city)
  3. Instelling (institution name)
  4. ISIL code
  5. Toegekend op (assigned date)
  6. Opmerking (remarks)

YAML Structure (Output)

Each record contains:

CSV Fields (preserved exactly):

  • csv_row_number: Original row number
  • csv_plaats: City name
  • csv_instelling: Institution name
  • csv_isil_code: ISIL identifier code
  • csv_toegekend_op: Assignment date (YYYY-MM-DD)
  • csv_opmerking: Remarks/notes (18 records have remarks)

LinkML Mapped Fields:

  • name: Institution name (mapped from csv_instelling)
  • locations: List with city and country (NL)
  • identifiers: ISIL identifier with scheme, value, URL, assigned date
  • provenance: Data source metadata (TIER_1_AUTHORITATIVE)
  • description: Created from opmerking when present (optional)
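The mapping from the preserved CSV fields to the LinkML fields can be sketched roughly as follows. This is an illustration, not the code in convert_isil_csv_to_yaml.py: the function name `map_record` and the keys `city`, `country`, and `data_tier` are assumptions, while the identifier keys follow the identifier structure shown in this report.

```python
def map_record(row: dict) -> dict:
    """Build one output record from a parsed CSV row (hypothetical helper)."""
    record = dict(row)  # keep every csv_* field exactly as parsed
    code = row["csv_isil_code"]
    record["name"] = row["csv_instelling"]
    # "city" and "country" are assumed key names for the location entry
    record["locations"] = [{"city": row["csv_plaats"], "country": "NL"}]
    record["identifiers"] = [{
        "identifier_scheme": "ISIL",
        "identifier_value": code,
        "identifier_url": f"https://isil.org/{code}",
        "assigned_date": row["csv_toegekend_op"],
    }]
    # "data_tier" is an assumed key name for the tier marker
    record["provenance"] = {"data_tier": "TIER_1_AUTHORITATIVE"}
    # description is optional: only set when a remark is present
    if row.get("csv_opmerking"):
        record["description"] = row["csv_opmerking"]
    return record
```

Copying the row first keeps the `csv_*` fields untouched, so the mapped fields are purely additive.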

Data Quality Findings

Geographic Distribution

  • Unique cities: 201 across Netherlands
  • Top cities:
    1. Den Haag: 34 institutions
    2. Amsterdam: 28 institutions
    3. Leiden: 8 institutions
    4. Rotterdam: 8 institutions
    5. Zwolle: 8 institutions
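Counts like these can be reproduced from the converted records with a few lines of Python. The sketch below assumes the records are already loaded as a list of dicts carrying the `csv_plaats` field; the function name `top_cities` is hypothetical.

```python
from collections import Counter


def top_cities(records, n=5):
    """Return (number of unique cities, the n most frequent cities)."""
    counts = Counter(r["csv_plaats"] for r in records)
    return len(counts), counts.most_common(n)
```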

Temporal Coverage

  • Date range: 2008-10-10 to 2025-09-18
  • 18 records with remarks documenting:
    • Organizational mergers (8 cases)
    • Name changes (7 cases)
    • Institutional history (3 cases)

ISIL Code Patterns

  • Total codes: 371 (all unique, no duplicates)
  • Standard format: NL-{CityCode}{InstitutionAbbreviation}
  • Code lengths: 7 to 17 characters
  • Shortest: NL-AhMA (Alkmaarsche Historiën)
  • Longest: NL-LlsBatavialand (Batavialand museum/archief)
  • Non-standard: 1 code with lowercase prefix (Nl-GdSAMH)

Remarks Field Analysis

18 institutions (4.9%) have remarks documenting:

Mergers (8 institutions), including:

  • Historisch Centrum Limburg (2020: RHCL + Rijckheyt)
  • Archief Gooi- en Vechtstreek (2024: SAGV + Gemeentearchief Gooise Meren)
  • Noord-Veluws Archief (multiple archives consolidated)
  • Stichting OverO (Stadskamer Zwolle + OB Kampen)

Name Changes (7 institutions), including:

  • Historisch Centrum Overijssel (2021: added vestiging designation)
  • Het Nieuwe Instituut (2024: abbreviation change)
  • Tracé/SHCL (2024: rebranded from Sociaal Historisch Centrum)
  • Nederlands Instituut voor Militaire Historie (2023: name correction)

Deprecated Codes (3 institutions):

  • Marked "in onbruik" (no longer in use) due to merger/renaming
  • References to successor organizations provided
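Deprecated codes could be picked out of the remarks with a simple substring test on the phrase "in onbruik". The sketch below is only an illustration under that assumption; the helper names are hypothetical, and real remarks may need looser matching.

```python
def is_deprecated(remark: str) -> bool:
    """True when a remark contains the phrase 'in onbruik' (any casing)."""
    return "in onbruik" in remark.casefold()


def deprecated_codes(records):
    """List the ISIL codes whose csv_opmerking marks them as deprecated."""
    return [
        r["csv_isil_code"]
        for r in records
        if is_deprecated(r.get("csv_opmerking") or "")
    ]
```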

LinkML Schema Compliance

Required Fields

All 371 records contain:

  • name (institution name)
  • locations (city + country)
  • identifiers (ISIL code details)
  • provenance (data source metadata)

Identifier Structure

Each ISIL identifier includes:

identifiers:
  - identifier_scheme: ISIL
    identifier_value: NL-AsdRM
    identifier_url: https://isil.org/NL-AsdRM
    assigned_date: '2013-03-07'

Provenance Metadata

All records are marked as TIER_1_AUTHORITATIVE.


Validation Results

Field Preservation Test

Total records:       371
Total fields:        2,226
Fields preserved:    2,226
Value mismatches:    0
Preservation rate:   100.0%

VALIDATION PASSED
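A field preservation check of this kind can be sketched as a field-by-field comparison between the parsed CSV rows and the generated records. The function below is a hypothetical reconstruction, not the validation code in the conversion script; it assumes both sides are lists of dicts in the same order.

```python
CSV_FIELDS = [
    "csv_row_number", "csv_plaats", "csv_instelling",
    "csv_isil_code", "csv_toegekend_op", "csv_opmerking",
]


def check_preservation(csv_rows, yaml_records):
    """Compare every csv_* field of every record and summarise the result."""
    if len(csv_rows) != len(yaml_records):
        raise ValueError("record counts differ")
    total = preserved = 0
    mismatches = []  # (record index, field name) pairs
    for index, (src, out) in enumerate(zip(csv_rows, yaml_records)):
        for field in CSV_FIELDS:
            total += 1
            if out.get(field) == src.get(field):
                preserved += 1
            else:
                mismatches.append((index, field))
    rate = 100.0 * preserved / total if total else 0.0
    return {
        "records": len(csv_rows),
        "fields": total,
        "preserved": preserved,
        "mismatches": mismatches,
        "rate": rate,
    }
```

With 371 records and 6 fields each, a fully faithful conversion yields 2,226 compared fields and an empty mismatch list.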

LinkML Schema Compliance

  • All required fields present
  • All CSV fields preserved
  • No data loss during conversion
  • YAML structure valid


Use Cases

This YAML file can be used for:

  1. Cross-referencing: Link Dutch heritage institutions to authoritative ISIL codes
  2. Geocoding: City names can be geocoded to coordinates
  3. Merger tracking: Remarks document organizational history
  4. Data integration: Merge with other datasets (NDE organizations, Wikidata)
  5. LinkML validation: Test schema compliance with ISIL registry data

Next Steps

Data Enrichment

  • Geocode city names to latitude/longitude
  • Add institution type classification (museum, archive, library)
  • Cross-link with NDE organization dataset
  • Query Wikidata for Q-numbers
  • Extract merger/name change events into ChangeEvent objects

Schema Enhancement

  • Add institution_type field based on institution name patterns
  • Create change_history entries from opmerking field
  • Link related organizations (predecessors/successors)
  • Add website URLs where available
  • Classify by heritage custodian type (GLAMORCUBESFIXPHDNT taxonomy)

Integration

  • Merge with /data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml
  • Identify institutions with both ISIL codes and NDE platform data
  • Create unified heritage custodian records
  • Generate GHCID identifiers for all institutions

Files Created

Data

  • /data/isil/nl/nan/ISIL-codes_2025-11-06.yaml (8,184 lines, 371 records)

Scripts

  • /scripts/convert_isil_csv_to_yaml.py (conversion + validation)

Documentation

  • /docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md (this file)

Technical Notes

CSV Parsing Strategy

The malformed CSV required custom parsing:

  1. Read with latin-1 encoding (UTF-8 failed)
  2. Split each line on "," delimiter
  3. Strip quotes and trailing semicolons
  4. Handle empty opmerking fields
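The four steps above can be sketched as follows. This is a reconstruction under assumptions, not the script itself: it assumes each line looks like `"1","City","Name","NL-Xx","date","remark";;;;`, that the delimiter is the three-character sequence `","`, and that the first line is a header. The function names are hypothetical.

```python
FIELDS = [
    "csv_row_number", "csv_plaats", "csv_instelling",
    "csv_isil_code", "csv_toegekend_op", "csv_opmerking",
]


def parse_line(line: str) -> dict:
    """Parse one malformed CSV line into a dict keyed by FIELDS."""
    # Drop surrounding whitespace and the trailing ';;;;' run
    cleaned = line.strip().rstrip(";")
    # Split on the quote-comma-quote delimiter, then strip leftover quotes
    parts = [p.strip('"').strip() for p in cleaned.split('","')]
    # Pad short rows so a missing opmerking becomes an empty string
    parts += [""] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))


def read_records(path: str) -> list:
    """Read the file as latin-1 and parse every non-empty data line."""
    with open(path, encoding="latin-1") as handle:
        lines = [ln for ln in handle if ln.strip()]
    return [parse_line(ln) for ln in lines[1:]]  # skip the header line
```

Splitting before stripping quotes matters: an empty last field (`""`) would otherwise be swallowed together with the closing quote of the previous field.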

YAML Generation

Used PyYAML with settings:

  • allow_unicode=True (preserve Dutch characters)
  • default_flow_style=False (readable block style)
  • sort_keys=False (preserve field order)
  • width=120 (line wrapping)
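Put together, these settings correspond to a call along the following lines. The sketch assumes `yaml.safe_dump` is used (the report only says PyYAML); the wrapper name `dump_records` is hypothetical.

```python
import yaml  # PyYAML


def dump_records(records) -> str:
    """Serialise records to block-style YAML with the settings listed above."""
    return yaml.safe_dump(
        records,
        allow_unicode=True,        # keep characters such as 'ë' unescaped
        default_flow_style=False,  # block style instead of inline {...}
        sort_keys=False,           # keep the field order of each record
        width=120,                 # wrap long scalars at 120 columns
    )
```

Without `sort_keys=False`, PyYAML sorts mapping keys alphabetically, which would place the `csv_*` fields ahead of `name` regardless of how the record was built.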

Performance

  • Parsing: ~0.1 seconds
  • Mapping: ~0.2 seconds
  • Validation: ~0.1 seconds
  • YAML write: ~0.5 seconds
  • Total time: < 1 second

Status: Conversion complete
Quality: 100% field preservation
Ready for: Data enrichment and integration