glam/docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md
2025-11-19 23:25:22 +01:00

# ISIL CSV to YAML Conversion Report
**Date**: 2025-11-17
**Input**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.csv`
**Output**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml`
**Script**: `/scripts/convert_isil_csv_to_yaml.py`

---
## Conversion Summary
### Records Processed
- **Total records**: 371 Dutch ISIL codes
- **Field preservation**: 100% (2,226 fields preserved exactly)
- **Value mismatches**: 0 (perfect fidelity)
### CSV Structure (Original)
The input CSV was malformed:
- Each row's fields were packed into a single cell, separated by `","`
- Each line ended with extra trailing semicolons (`;;;;`)
- The file was encoded as Latin-1, not UTF-8
- The header row includes the sequence number as its first field
**Fields**:
1. Row number (sequence)
2. Plaats (city)
3. Instelling (institution name)
4. ISIL code
5. Toegekend op (assigned date)
6. Opmerking (remarks)
### YAML Structure (Output)
Each record contains:
**CSV Fields (preserved exactly)**:
- `csv_row_number`: Original row number
- `csv_plaats`: City name
- `csv_instelling`: Institution name
- `csv_isil_code`: ISIL identifier code
- `csv_toegekend_op`: Assignment date (YYYY-MM-DD)
- `csv_opmerking`: Remarks/notes (18 records have remarks)
**LinkML Mapped Fields**:
- `name`: Institution name (mapped from `csv_instelling`)
- `locations`: List with city and country (NL)
- `identifiers`: ISIL identifier with scheme, value, URL, and assigned date
- `provenance`: Data source metadata (TIER_1_AUTHORITATIVE)
- `description`: Optional; created from `csv_opmerking` when a remark is present
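
The mapping described above can be sketched as a single function. This is a minimal illustration, not the code in `convert_isil_csv_to_yaml.py`: the input row keys (`row_number`, `plaats`, ...), the `city`/`country` keys inside `locations`, and the provenance key names are assumptions inferred from the field descriptions in this report, and the sample city and institution name are hypothetical.

```python
def map_record(row: dict) -> dict:
    """Build one output record from a parsed CSV row (row keys are assumed)."""
    code = row["isil_code"]
    record = {
        # CSV fields, copied verbatim under a csv_ prefix
        "csv_row_number": row["row_number"],
        "csv_plaats": row["plaats"],
        "csv_instelling": row["instelling"],
        "csv_isil_code": code,
        "csv_toegekend_op": row["toegekend_op"],
        "csv_opmerking": row["opmerking"],
        # LinkML mapped fields
        "name": row["instelling"],
        "locations": [{"city": row["plaats"], "country": "NL"}],  # assumed key names
        "identifiers": [{
            "identifier_scheme": "ISIL",
            "identifier_value": code,
            "identifier_url": f"https://isil.org/{code}",
            "assigned_date": row["toegekend_op"],
        }],
        "provenance": {  # assumed key names
            "data_source": "ISIL_REGISTRY",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "source_url": "https://www.nationaalarchief.nl/isil",
            "confidence_score": 1.0,
        },
    }
    # description is optional: only set when the remark is non-empty
    if row["opmerking"]:
        record["description"] = row["opmerking"]
    return record


# Hypothetical sample row (city and institution name are placeholders)
example = map_record({
    "row_number": "1",
    "plaats": "Amsterdam",
    "instelling": "Example Museum",
    "isil_code": "NL-AsdRM",
    "toegekend_op": "2013-03-07",
    "opmerking": "",
})
```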
---
## Data Quality Findings
### Geographic Distribution
- **Unique cities**: 201 across the Netherlands
- **Top cities**:
  1. Den Haag: 34 institutions
  2. Amsterdam: 28 institutions
  3. Leiden: 8 institutions
  4. Rotterdam: 8 institutions
  5. Zwolle: 8 institutions
### Temporal Coverage
- **Date range**: 2008-10-10 to 2025-09-18
- **18 records with remarks** documenting:
  - Organizational mergers (8 cases)
  - Name changes (7 cases)
  - Institutional history (3 cases)
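
The figures in the Geographic Distribution and Temporal Coverage subsections can be recomputed from the converted records with a few lines of standard-library Python. The sketch below assumes each record is a dict carrying the `csv_plaats`, `csv_toegekend_op`, and `csv_opmerking` fields; the function name `summarize` is hypothetical.

```python
from collections import Counter
from datetime import date


def summarize(records, top_n=5):
    """Return city counts, the assignment date range, and the remark count."""
    cities = Counter(r["csv_plaats"] for r in records)
    # Parse ISO dates so min/max compare as dates; skip empty values
    dates = [
        date.fromisoformat(r["csv_toegekend_op"])
        for r in records
        if r["csv_toegekend_op"]
    ]
    return {
        "unique_cities": len(cities),
        "top_cities": cities.most_common(top_n),
        "date_range": (min(dates).isoformat(), max(dates).isoformat()) if dates else None,
        "with_remarks": sum(1 for r in records if r["csv_opmerking"]),
    }
```

Called on the full list of 371 records, `summarize(records)` should reproduce the counts listed above if the assumptions about field names hold.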
### ISIL Code Patterns
- **Total codes**: 371 (all unique, no duplicates)
- **Standard format**: `NL-{CityCode}{InstitutionAbbreviation}`
- **Code lengths**: 7 to 17 characters
- **Shortest**: `NL-AhMA` (Alkmaarsche Historiën)
- **Longest**: `NL-LlsBatavialand` (Batavialand museum/archief)
- **Non-standard**: 1 code with a mixed-case country prefix (`Nl-GdSAMH` instead of `NL-`)
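
A check along these lines can surface duplicates, length extremes, and prefix anomalies. It is a sketch under the assumption that "standard" means the code starts with an uppercase `NL-`; the helper names `classify_prefix` and `code_report` are hypothetical and do not come from the conversion script.

```python
from collections import Counter


def classify_prefix(code: str) -> str:
    """Label a code by its country prefix."""
    if code.startswith("NL-"):
        return "standard"
    if code.upper().startswith("NL-"):
        return "nonstandard_case"  # e.g. "Nl-" or "nl-"
    return "other"


def code_report(codes):
    """Summarize duplicates, length range, and non-standard prefixes."""
    counts = Counter(codes)
    lengths = [len(c) for c in codes]
    return {
        "total": len(codes),
        "duplicates": sorted(c for c, n in counts.items() if n > 1),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "nonstandard": [c for c in codes if classify_prefix(c) != "standard"],
    }


# The three codes named in this report
report = code_report(["NL-AhMA", "NL-LlsBatavialand", "Nl-GdSAMH"])
```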
### Remarks Field Analysis
18 institutions (4.9%) have remarks documenting:
**Mergers** (8 institutions):
- Historisch Centrum Limburg (2020: RHCL + Rijckheyt)
- Archief Gooi- en Vechtstreek (2024: SAGV + Gemeentearchief Gooise Meren)
- Noord-Veluws Archief (multiple archives consolidated)
- Stichting OverO (Stadskamer Zwolle + OB Kampen)
**Name Changes** (7 institutions):
- Historisch Centrum Overijssel (2021: added vestiging designation)
- Het Nieuwe Instituut (2024: abbreviation change)
- Tracé/SHCL (2024: rebranded from Sociaal Historisch Centrum)
- Nederlands Instituut voor Militaire Historie (2023: name correction)
**Deprecated Codes** (3 institutions):
- Marked "in onbruik" (no longer in use) due to merger/renaming
- References to successor organizations provided
---
## LinkML Schema Compliance
### Required Fields
✅ All 371 records contain:
- `name` (institution name)
- `locations` (city + country)
- `identifiers` (ISIL code details)
- `provenance` (data source metadata)
### Identifier Structure
Each ISIL identifier includes:
```yaml
identifiers:
- identifier_scheme: ISIL
  identifier_value: NL-AsdRM
  identifier_url: https://isil.org/NL-AsdRM
  assigned_date: '2013-03-07'
```
### Provenance Metadata
All records marked as:
- **Data source**: ISIL_REGISTRY
- **Data tier**: TIER_1_AUTHORITATIVE
- **Source URL**: https://www.nationaalarchief.nl/isil
- **Confidence score**: 1.0 (authoritative)
---
## Validation Results
### Field Preservation Test
```
Total records: 371
Total fields: 2,226
Fields preserved: 2,226
Value mismatches: 0
Preservation rate: 100.0%
```
**VALIDATION PASSED**
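
The total of 2,226 fields corresponds to 371 records × 6 CSV fields. A preservation check of this kind could look like the sketch below, which compares every source value with its `csv_`-prefixed counterpart. The function name, the source row keys, and the shape of the result are assumptions for illustration; the script's own validation may differ.

```python
CSV_FIELDS = ("row_number", "plaats", "instelling", "isil_code", "toegekend_op", "opmerking")


def check_preservation(source_rows, records):
    """Count how many source values reappear unchanged in the output records."""
    if len(source_rows) != len(records):
        raise ValueError("record count differs from source row count")
    total = preserved = 0
    mismatches = []
    for index, (row, record) in enumerate(zip(source_rows, records)):
        for field in CSV_FIELDS:
            total += 1
            if record.get(f"csv_{field}") == row[field]:
                preserved += 1
            else:
                mismatches.append((index, field))
    # An empty input is treated as trivially preserved
    rate = 100.0 * preserved / total if total else 100.0
    return {
        "records": len(records),
        "total_fields": total,
        "preserved": preserved,
        "mismatches": mismatches,
        "rate": rate,
    }
```

Validation passes when `mismatches` is empty, which is equivalent to a preservation rate of 100.0%.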
### LinkML Schema Compliance
- ✅ All required fields present
- ✅ All CSV fields preserved
- ✅ No data loss during conversion
- ✅ YAML structure valid

---
## Use Cases
This YAML file can be used for:
1. **Cross-referencing**: Link Dutch heritage institutions to authoritative ISIL codes
2. **Geocoding**: City names can be geocoded to coordinates
3. **Merger tracking**: Remarks document organizational history
4. **Data integration**: Merge with other datasets (NDE organizations, Wikidata)
5. **LinkML validation**: Test schema compliance with ISIL registry data
---
## Next Steps
### Data Enrichment
- [ ] Geocode city names to latitude/longitude
- [ ] Add institution type classification (museum, archive, library)
- [ ] Cross-link with NDE organization dataset
- [ ] Query Wikidata for Q-numbers
- [ ] Extract merger/name change events into ChangeEvent objects
### Schema Enhancement
- [ ] Add `institution_type` field based on institution name patterns
- [ ] Create `change_history` entries from opmerking field
- [ ] Link related organizations (predecessors/successors)
- [ ] Add website URLs where available
- [ ] Classify by heritage custodian type (GLAMORCUBESFIXPHDNT taxonomy)
### Integration
- [ ] Merge with `/data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml`
- [ ] Identify institutions with both ISIL codes and NDE platform data
- [ ] Create unified heritage custodian records
- [ ] Generate GHCID identifiers for all institutions
---
## Files Created
### Data
- `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml` (8,184 lines, 371 records)
### Scripts
- `/scripts/convert_isil_csv_to_yaml.py` (conversion + validation)
### Documentation
- `/docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md` (this file)
---
## Technical Notes
### CSV Parsing Strategy
The malformed CSV required custom parsing:
1. Read the file with `latin-1` encoding (decoding as UTF-8 failed)
2. Split each line on `","` delimiter
3. Strip quotes and trailing semicolons
4. Handle empty opmerking fields
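
The four steps above can be sketched as follows. This is an illustration rather than the script's actual code: the exact line layout (`"1","Plaats",...,"";;;;`), the rule of skipping any line whose first field is not numeric (to drop the header), and all function and key names are assumptions.

```python
from pathlib import Path

FIELD_NAMES = ("row_number", "plaats", "instelling", "isil_code", "toegekend_op", "opmerking")


def parse_line(raw: str) -> list:
    """Split one malformed line into its field values."""
    # Step 3: drop trailing semicolons, then the outermost quote pair only
    line = raw.strip().rstrip(";").strip()
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    # Step 2: split on the literal "," delimiter
    return [part.strip() for part in line.split('","')]


def parse_rows(lines):
    """Turn raw lines into dicts, skipping blank lines and the header."""
    rows = []
    for raw in lines:
        if not raw.strip():
            continue
        fields = parse_line(raw)
        if not fields[0].isdigit():
            continue  # assumed header rule: first field is not a sequence number
        if len(fields) != len(FIELD_NAMES):
            raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}: {raw!r}")
        # Step 4: an empty opmerking stays an empty string
        rows.append(dict(zip(FIELD_NAMES, fields)))
    return rows


def read_rows(path):
    """Step 1: read the file as Latin-1 and parse every line."""
    text = Path(path).read_text(encoding="latin-1")
    return parse_rows(text.splitlines())
```

Raising on an unexpected field count, rather than silently skipping the line, keeps a parsing slip from going unnoticed when the goal is exact field preservation.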
### YAML Generation
Used PyYAML with settings:
- `allow_unicode=True` (preserve Dutch characters)
- `default_flow_style=False` (readable block style)
- `sort_keys=False` (preserve field order)
- `width=120` (line wrapping)
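
With these settings, a serialization call might look like the sketch below. It assumes `yaml.safe_dump`, which accepts the same keyword arguments as `yaml.dump`; the script may use either, and the wrapper name `dump_records` is hypothetical.

```python
import yaml  # PyYAML


def dump_records(records) -> str:
    """Serialize records to block-style YAML with the settings listed above."""
    return yaml.safe_dump(
        records,
        allow_unicode=True,        # write "ë" etc. as-is instead of escape sequences
        default_flow_style=False,  # block style, one key per line
        sort_keys=False,           # keep insertion order of the dict keys
        width=120,                 # wrap long scalars at 120 columns
    )


# Hypothetical two-key record to show ordering and Unicode handling
text = dump_records([{"name": "Alkmaarsche Historiën", "csv_plaats": "Alkmaar"}])
```

Without `sort_keys=False`, PyYAML sorts mapping keys alphabetically, which would move `csv_plaats` ahead of `name` in the output.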
### Performance
- Parsing: ~0.1 seconds
- Mapping: ~0.2 seconds
- Validation: ~0.1 seconds
- YAML write: ~0.5 seconds
- **Total time**: < 1 second
---
**Status**: Conversion complete
**Quality**: 100% field preservation
**Ready for**: Data enrichment and integration