# ISIL CSV to YAML Conversion Report

**Date**: 2025-11-17
**Input**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.csv`
**Output**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml`
**Script**: `/scripts/convert_isil_csv_to_yaml.py`

---

## Conversion Summary

### Records Processed
- **Total records**: 371 Dutch ISIL codes
- **Field preservation**: 100% (2,226 fields preserved exactly)
- **Value mismatches**: 0 (perfect fidelity)

### CSV Structure (Original)
The input CSV had a malformed structure:
- Each row is stored in a single cell, with fields separated by the literal sequence `","`
- Extra trailing semicolons (`;;;;`)
- Latin-1 encoding (not UTF-8)
- The header includes the sequence number as its first field

**Fields**:
1. Row number (sequence)
2. Plaats (city)
3. Instelling (institution name)
4. ISIL code
5. Toegekend op (assigned date)
6. Opmerking (remarks)

### YAML Structure (Output)

Each record contains:

**CSV Fields (preserved exactly)**:
- `csv_row_number`: Original row number
- `csv_plaats`: City name
- `csv_instelling`: Institution name
- `csv_isil_code`: ISIL identifier code
- `csv_toegekend_op`: Assignment date (YYYY-MM-DD)
- `csv_opmerking`: Remarks/notes (18 records have remarks)

**LinkML Mapped Fields**:
- `name`: Institution name (mapped from `csv_instelling`)
- `locations`: List with city and country (NL)
- `identifiers`: ISIL identifier with scheme, value, URL, assigned date
- `provenance`: Data source metadata (TIER_1_AUTHORITATIVE)
- `description`: Created from `csv_opmerking` when present (optional)

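The mapping from the `csv_*` fields to the LinkML fields listed above could look roughly like the sketch below. It is not taken from `convert_isil_csv_to_yaml.py`. The function name `map_row`, the nested key names `city`, `country`, `data_source`, `data_tier`, `source_url` and `confidence_score`, and the sample values are assumptions made for illustration. The `identifier_*` keys, the `https://isil.org/<code>` URL pattern and the provenance values follow what this report shows.

```python
def map_row(row: dict) -> dict:
    """Build one output record from a parsed CSV row (illustrative sketch)."""
    record = dict(row)  # copy the csv_* fields unchanged
    record["name"] = row["csv_instelling"]
    record["locations"] = [{"city": row["csv_plaats"], "country": "NL"}]
    record["identifiers"] = [
        {
            "identifier_scheme": "ISIL",
            "identifier_value": row["csv_isil_code"],
            "identifier_url": f"https://isil.org/{row['csv_isil_code']}",
            "assigned_date": row["csv_toegekend_op"],
        }
    ]
    # Key names below are assumed; the values are the ones this report lists.
    record["provenance"] = {
        "data_source": "ISIL_REGISTRY",
        "data_tier": "TIER_1_AUTHORITATIVE",
        "source_url": "https://www.nationaalarchief.nl/isil",
        "confidence_score": 1.0,
    }
    # description is optional: only created when a remark is present.
    if row.get("csv_opmerking"):
        record["description"] = row["csv_opmerking"]
    return record


# Hypothetical row; the city, institution and code are placeholders.
row = {
    "csv_row_number": "1",
    "csv_plaats": "Voorbeeldstad",
    "csv_instelling": "Voorbeeld Archief",
    "csv_isil_code": "NL-XxxVA",
    "csv_toegekend_op": "2013-03-07",
    "csv_opmerking": "",
}
record = map_row(row)
```

Copying the row first keeps every `csv_*` value untouched, so the mapped fields are purely additive and the preservation check can compare the two layers directly.
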
---

## Data Quality Findings

### Geographic Distribution
- **Unique cities**: 201 across the Netherlands
- **Top cities**:
  1. Den Haag: 34 institutions
  2. Amsterdam: 28 institutions
  3. Leiden: 8 institutions
  4. Rotterdam: 8 institutions
  5. Zwolle: 8 institutions

### Temporal Coverage
- **Date range**: 2008-10-10 to 2025-09-18
- **18 records with remarks** documenting:
  - Organizational mergers (8 cases)
  - Name changes (7 cases)
  - Institutional history (3 cases)

### ISIL Code Patterns
- **Total codes**: 371 (all unique, no duplicates)
- **Standard format**: `NL-{CityCode}{InstitutionAbbreviation}`
- **Code lengths**: 7 to 17 characters
- **Shortest**: `NL-AhMA` (Alkmaarsche Historiën)
- **Longest**: `NL-LlsBatavialand` (Batavialand museum/archief)
- **Non-standard**: 1 code whose country prefix is not fully uppercase (`Nl-GdSAMH` instead of `NL-...`)

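The pattern checks above (uniqueness, prefix casing, length range) can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the report's actual analysis code. The function name `audit_codes` is hypothetical, and the regular expression is a simplified assumption about what counts as "standard" here, not the full ISIL syntax.

```python
import re
from collections import Counter

# Simplified assumption: uppercase "NL-" prefix followed by letters or digits.
STANDARD_NL_CODE = re.compile(r"^NL-[A-Za-z0-9]+$")


def audit_codes(codes: list[str]) -> dict:
    """Summarise duplicates, non-standard codes and the length range."""
    counts = Counter(codes)
    return {
        "duplicates": sorted(code for code, n in counts.items() if n > 1),
        "non_standard": sorted(code for code in counts if not STANDARD_NL_CODE.match(code)),
        "min_length": min(len(code) for code in codes),
        "max_length": max(len(code) for code in codes),
    }


# Codes quoted in this report.
report = audit_codes(["NL-AhMA", "NL-LlsBatavialand", "Nl-GdSAMH", "NL-AsdRM"])
print(report)
```

On these four codes the sketch flags only `Nl-GdSAMH` as non-standard and reports lengths from 7 to 17, matching the figures above.
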
### Remarks Field Analysis

18 institutions (4.9%) have remarks documenting:

**Mergers** (8 institutions):
- Historisch Centrum Limburg (2020: RHCL + Rijckheyt)
- Archief Gooi- en Vechtstreek (2024: SAGV + Gemeentearchief Gooise Meren)
- Noord-Veluws Archief (multiple archives consolidated)
- Stichting OverO (Stadskamer Zwolle + OB Kampen)

**Name Changes** (7 institutions):
- Historisch Centrum Overijssel (2021: added vestiging designation)
- Het Nieuwe Instituut (2024: abbreviation change)
- Tracé/SHCL (2024: rebranded from Sociaal Historisch Centrum)
- Nederlands Instituut voor Militaire Historie (2023: name correction)

**Deprecated Codes** (3 institutions):
- Marked "in onbruik" (no longer in use) due to merger/renaming
- References to successor organizations provided

---

## LinkML Schema Compliance

### Required Fields
✅ All 371 records contain:
- `name` (institution name)
- `locations` (city + country)
- `identifiers` (ISIL code details)
- `provenance` (data source metadata)

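A required-field check of this kind can be approximated in plain Python, as sketched below. This is a stand-in for illustration and not LinkML's own validator. The helper name `missing_required` and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("name", "locations", "identifiers", "provenance")


def missing_required(record: dict) -> list[str]:
    """Return the required field names that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


complete = {
    "name": "Voorbeeld Archief",  # hypothetical institution
    "locations": [{"city": "Voorbeeldstad", "country": "NL"}],
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "NL-XxxVA"}],
    "provenance": {"data_source": "ISIL_REGISTRY"},
}
incomplete = {"name": "Voorbeeld Archief", "locations": []}

print(missing_required(complete))    # []
print(missing_required(incomplete))  # ['locations', 'identifiers', 'provenance']
```

Treating an empty list as missing is a deliberate choice in this sketch: a record with `locations: []` has the key but carries no location.
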
### Identifier Structure
Each ISIL identifier includes:
```yaml
identifiers:
  - identifier_scheme: ISIL
    identifier_value: NL-AsdRM
    identifier_url: https://isil.org/NL-AsdRM
    assigned_date: '2013-03-07'
```

### Provenance Metadata
All records marked as:
- **Data source**: ISIL_REGISTRY
- **Data tier**: TIER_1_AUTHORITATIVE
- **Source URL**: https://www.nationaalarchief.nl/isil
- **Confidence score**: 1.0 (authoritative)

---

## Validation Results

### Field Preservation Test
```
Total records: 371
Total fields: 2,226
Fields preserved: 2,226
Value mismatches: 0
Preservation rate: 100.0%
```

✅ **VALIDATION PASSED**

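The totals above follow from 371 records with 6 CSV fields each (371 × 6 = 2,226). A field-by-field comparison that produces such a summary might look like the sketch below. The function name `check_preservation` and the sample rows are hypothetical, and the script's real validation logic may differ.

```python
CSV_FIELDS = [
    "csv_row_number",
    "csv_plaats",
    "csv_instelling",
    "csv_isil_code",
    "csv_toegekend_op",
    "csv_opmerking",
]


def check_preservation(source_rows: list[dict], yaml_records: list[dict]) -> dict:
    """Compare every csv_* value in the output against the parsed source row."""
    if len(source_rows) != len(yaml_records):
        raise ValueError("record counts differ; cannot compare row by row")
    total = preserved = 0
    mismatches = []
    for index, (row, record) in enumerate(zip(source_rows, yaml_records)):
        for field in CSV_FIELDS:
            total += 1
            if record.get(field) == row.get(field):
                preserved += 1
            else:
                mismatches.append((index, field))
    rate = 100.0 * preserved / total if total else 0.0
    return {
        "total_fields": total,
        "preserved": preserved,
        "mismatches": mismatches,
        "rate": rate,
    }


# Two hypothetical rows; the second output record has one altered value.
rows = [
    {field: f"value-{i}-{field}" for field in CSV_FIELDS} for i in range(2)
]
records = [dict(rows[0]), {**rows[1], "csv_plaats": "changed"}]
result = check_preservation(rows, records)
```

Recording the `(row index, field)` pair for each mismatch, and not only a count, makes it possible to trace any failure back to a specific source line.
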
### LinkML Schema Compliance
✅ All required fields present
✅ All CSV fields preserved
✅ No data loss during conversion
✅ YAML structure valid

---

## Use Cases

This YAML file can be used for:

1. **Cross-referencing**: Link Dutch heritage institutions to authoritative ISIL codes
2. **Geocoding**: City names can be geocoded to coordinates
3. **Merger tracking**: Remarks document organizational history
4. **Data integration**: Merge with other datasets (NDE organizations, Wikidata)
5. **LinkML validation**: Test schema compliance with ISIL registry data

---

## Next Steps

### Data Enrichment
- [ ] Geocode city names to latitude/longitude
- [ ] Add institution type classification (museum, archive, library)
- [ ] Cross-link with NDE organization dataset
- [ ] Query Wikidata for Q-numbers
- [ ] Extract merger/name change events into ChangeEvent objects

### Schema Enhancement
- [ ] Add `institution_type` field based on institution name patterns
- [ ] Create `change_history` entries from opmerking field
- [ ] Link related organizations (predecessors/successors)
- [ ] Add website URLs where available
- [ ] Classify by heritage custodian type (GLAMORCUBESFIXPHDNT taxonomy)

### Integration
- [ ] Merge with `/data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml`
- [ ] Identify institutions with both ISIL codes and NDE platform data
- [ ] Create unified heritage custodian records
- [ ] Generate GHCID identifiers for all institutions

---

## Files Created

### Data
- `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml` (8,184 lines, 371 records)

### Scripts
- `/scripts/convert_isil_csv_to_yaml.py` (conversion + validation)

### Documentation
- `/docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md` (this file)

---

## Technical Notes

### CSV Parsing Strategy
The malformed CSV required custom parsing:
1. Read the file with `latin-1` encoding (decoding as UTF-8 failed)
2. Split each line on the literal `","` delimiter
3. Strip the remaining quotes and the trailing semicolons
4. Handle empty opmerking fields

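The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `convert_isil_csv_to_yaml.py`. The exact raw line shape, the helper names `parse_line` and `parse_file`, the single header line that is skipped, and the sample values are all assumed.

```python
FIELD_NAMES = [
    "csv_row_number",
    "csv_plaats",
    "csv_instelling",
    "csv_isil_code",
    "csv_toegekend_op",
    "csv_opmerking",
]


def parse_line(raw_line: str) -> dict[str, str]:
    """Parse one malformed CSV line into a dict keyed by the csv_* names."""
    # Step 3 (first half): drop the line ending and the trailing semicolons.
    line = raw_line.rstrip("\r\n").rstrip(";")
    # Step 2: split on the literal three-character delimiter `","`.
    parts = line.split('","')
    # Step 3 (second half): strip the quotes left on the edges of each part.
    values = [part.strip('"').strip() for part in parts]
    # Step 4: pad missing trailing fields (such as an empty opmerking) with "".
    values += [""] * (len(FIELD_NAMES) - len(values))
    return dict(zip(FIELD_NAMES, values))


def parse_file(path: str) -> list[dict[str, str]]:
    """Read the whole file; assumes one header line followed by data lines."""
    # Step 1: decode explicitly as Latin-1.
    with open(path, encoding="latin-1") as handle:
        next(handle, None)  # skip the header line
        return [parse_line(line) for line in handle if line.strip(";\r\n ")]


# Hypothetical line in the assumed shape; the city, name and code are placeholders.
sample = '"1","Voorbeeldstad","Voorbeeld Archief","NL-XxxVA","2013-03-07","";;;;\n'
record = parse_line(sample)
print(record)
```

Stripping the semicolons before splitting matters: the last field still ends in a quote at that point, so a semicolon inside a quoted remark is left alone.
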
### YAML Generation
Used PyYAML with settings:
- `allow_unicode=True` (preserve Dutch characters)
- `default_flow_style=False` (readable block style)
- `sort_keys=False` (preserve field order)
- `width=120` (line wrapping)

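A call using these settings might look like the sketch below. Whether the script uses `yaml.dump` or `yaml.safe_dump` is not stated in this report, so `safe_dump` is an assumption, and the sample record is hypothetical.

```python
import yaml  # PyYAML

# Hypothetical record; values are placeholders, keys follow this report.
records = [
    {
        "csv_plaats": "Voorbeeldstad",
        "csv_instelling": "Historiën Voorbeeld",
        "identifiers": [
            {
                "identifier_scheme": "ISIL",
                "identifier_value": "NL-XxxVA",
                "assigned_date": "2013-03-07",
            }
        ],
    }
]

yaml_text = yaml.safe_dump(
    records,
    allow_unicode=True,        # write "ë" as-is instead of an escape sequence
    default_flow_style=False,  # block style, one key per line
    sort_keys=False,           # keep the insertion order of the dict keys
    width=120,                 # wrap long scalars at 120 columns
)
print(yaml_text)
```

Because the date is stored as a string, PyYAML quotes it on output, which is why `assigned_date` appears as `'2013-03-07'` in the identifier example and loads back as a string instead of a date object.
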
### Performance
- Parsing: ~0.1 seconds
- Mapping: ~0.2 seconds
- Validation: ~0.1 seconds
- YAML write: ~0.5 seconds
- **Total time**: < 1 second

---

**Status**: ✅ Conversion complete
**Quality**: 100% field preservation
**Ready for**: Data enrichment and integration