# Dutch National Archive ISIL Registry - LinkML Documentation

This directory contains LinkML schema documentation for the Dutch National Archive ISIL registry conversion from CSV to YAML format.

## Overview

**Source**: [Nationaal Archief ISIL Registry](https://www.nationaalarchief.nl/isil)
**Records**: 371 Dutch heritage institutions
**Date Range**: 2008-10-10 to 2025-09-18
**Geographic Coverage**: 201 unique cities across the Netherlands
**Data Quality**: TIER_1_AUTHORITATIVE (confidence score: 1.0)

## Files in This Directory

### `schema.yaml`
LinkML schema definition documenting the structure of ISIL registry records after conversion to HeritageCustodian format.

**Key classes**:
- `ISILRegistryRecord` - Main record structure with CSV fields and LinkML mappings
- `Location` - Geographic location (city, country)
- `Identifier` - ISIL code structure (scheme, value, URL, assigned_date)
- `Provenance` - Data source and quality metadata

**Enumerations**:
- `InstitutionTypeEnum` - Heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- `DataSourceEnum` - ISIL_REGISTRY
- `DataTierEnum` - TIER_1_AUTHORITATIVE

### `mapping.yaml`
Complete field-by-field mapping documentation showing how each CSV column transforms into LinkML YAML structure.

**Covers**:
- CSV structure and parsing challenges (latin-1 encoding, malformed cells)
- Field mappings with examples (6 CSV columns → LinkML attributes)
- Transformation rules (date parsing, URL generation, description formatting)
- Data quality metrics (100% field preservation, 2,226 fields)
- Organizational change event detection (18 records with merger/closure notes)

## Dataset Characteristics

### ISIL Code Format
- **Pattern**: `NL-{CityAbbrev}{InstitutionAbbrev}`
- **Length**: Variable (7-17 characters)
- **Encoding**: Semantic (city + institution abbreviations)
- **Examples**:
  - `NL-AsdRM` - Rijksmuseum (Amsterdam)
  - `NL-HaNa` - Nationaal Archief (Den Haag)
  - `NL-LlsBatavialand` - Batavialand (Lelystad)

### Data Completeness
| Field | Coverage | Notes |
|-------|----------|-------|
| Row number | 100% (371/371) | Sequential 1-371 |
| City (Plaats) | 100% (371/371) | 201 unique cities |
| Institution (Instelling) | 100% (371/371) | All institution names present |
| ISIL code | 100% (371/371) | All unique, no duplicates |
| Assignment date (Toegekend op) | ~95% | Most have dates, some empty |
| Remarks (Opmerking) | 4.9% (18/371) | Organizational history notes |

### Top Cities by Institution Count
1. **Den Haag** - 38 institutions (10.2%)
2. **Amsterdam** - 29 institutions (7.8%)
3. **Deventer** - 11 institutions (3.0%)
4. **Groningen** - 10 institutions (2.7%)

### Organizational Change Events
18 records (4.9%) contain organizational history in the `csv_opmerking` field:

**Event types detected**:
- **MERGER**: "fusie tussen", "samenvoeging"
  - Example: RHCL-Rijckheyt merger (2020)
- **NAME_CHANGE**: "naamswijziging", "hernoemd"
- **CLOSURE**: "in onbruik", "gesloten"
- **RELOCATION**: "verhuisd naar", "overgebracht naar"

**Future processing**: These remarks can be extracted as structured `ChangeEvent` objects in the HeritageCustodian schema.

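The keyword lists above lend themselves to a simple lookup table. The following is a minimal sketch, assuming case-insensitive substring matching on the remark text; `EVENT_KEYWORDS` and `detect_event_types` are hypothetical names and are not taken from the conversion script.

```python
# Hypothetical keyword table built from the event types listed above.
EVENT_KEYWORDS = {
    "MERGER": ["fusie tussen", "samenvoeging"],
    "NAME_CHANGE": ["naamswijziging", "hernoemd"],
    "CLOSURE": ["in onbruik", "gesloten"],
    "RELOCATION": ["verhuisd naar", "overgebracht naar"],
}


def detect_event_types(remark: str) -> list:
    """Return every event type whose keywords occur in the remark."""
    text = remark.lower()
    return [
        event_type
        for event_type, keywords in EVENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


print(detect_event_types("Hernoemd en verhuisd naar Utrecht"))
```

A single remark can match more than one event type, so the function returns a list rather than a single label. Substring matching is crude and may need manual review for the 18 affected records.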
## CSV Parsing Challenges

The original CSV file had several issues requiring custom parsing:

### Encoding
- **Issue**: File uses `latin-1` encoding (not UTF-8)
- **Solution**: `encoding='latin-1'` parameter in file reader

### Malformed Structure
- **Issue**: All fields stored in single CSV cell with `","` delimiter
- **Solution**: Split on `","` pattern, strip quotes and semicolons

### Header Row
- **Issue**: Contains sequence number as first field before actual headers
- **Solution**: Extract headers from indices 1-5, skip index 0

### Example Raw CSV Row
```
"1","Amsterdam","Rijksmuseum","NL-AsdRM","2013-03-07","";"
```

### After Parsing
```yaml
csv_row_number: 1
csv_plaats: Amsterdam
csv_instelling: Rijksmuseum
csv_isil_code: NL-AsdRM
csv_toegekend_op: "2013-03-07"
csv_opmerking: ""
```

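The parsing steps described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code in `convert_isil_csv_to_yaml.py`: the names `FIELD_NAMES`, `parse_raw_line`, and `read_registry` are hypothetical, and the header is assumed to be the first non-empty line of the file.

```python
# Hypothetical field order, taken from the "After Parsing" example.
FIELD_NAMES = [
    "csv_row_number", "csv_plaats", "csv_instelling",
    "csv_isil_code", "csv_toegekend_op", "csv_opmerking",
]


def parse_raw_line(line: str) -> dict:
    """Split one malformed registry line into the six csv_* fields."""
    # Split on the literal `","` sequence, then strip leftover quotes,
    # semicolons, and whitespace from each piece.
    parts = [p.strip().strip('";').strip() for p in line.split('","')]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    record = dict(zip(FIELD_NAMES, parts))
    record["csv_row_number"] = int(record["csv_row_number"])
    return record


def read_registry(path: str) -> list:
    """Read the latin-1 encoded file; assumes the first line is the header."""
    with open(path, encoding="latin-1") as f:
        lines = [ln for ln in f if ln.strip()]
    return [parse_raw_line(ln) for ln in lines[1:]]


raw = '"1","Amsterdam","Rijksmuseum","NL-AsdRM","2013-03-07","";"'
print(parse_raw_line(raw))
```

One limitation of this approach: a remark that itself contains the sequence `","` would be split into extra fields, which is why the sketch raises an error on an unexpected field count instead of guessing.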
## Conversion Process

### Input
```
/data/isil/nl/nan/ISIL-codes_2025-11-06.csv
```

### Conversion Script
```
/scripts/convert_isil_csv_to_yaml.py
```

### Output
```
/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml
```

### Validation
- ✅ 371 records converted
- ✅ 2,226 fields preserved (100% preservation)
- ✅ 0 validation errors
- ✅ All ISIL codes match pattern `^NL-[A-Za-z0-9]+`
- ✅ All non-empty dates parse as ISO 8601 (YYYY-MM-DD)
- ✅ No duplicate ISIL codes

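Checks of this kind can be reproduced with a few lines of standard-library Python. The sketch below is an assumption about how such validation might look, not the project's actual validator; `validate_records` is a hypothetical name, and empty assignment dates are treated as allowed because some records have none.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern as stated in the validation list.
ISIL_PATTERN = re.compile(r"^NL-[A-Za-z0-9]+")


def validate_records(records: list) -> list:
    """Return human-readable validation errors (empty list if clean)."""
    errors = []
    for r in records:
        code = r["csv_isil_code"]
        if not ISIL_PATTERN.match(code):
            errors.append(f"{code}: does not match ISIL pattern")
        assigned = r.get("csv_toegekend_op")
        if assigned:  # empty dates are skipped, not flagged
            try:
                datetime.strptime(assigned, "%Y-%m-%d")
            except ValueError:
                errors.append(f"{code}: invalid date {assigned!r}")
    # Report each duplicated code once.
    counts = Counter(r["csv_isil_code"] for r in records)
    errors += [f"{c}: duplicate ISIL code" for c, n in counts.items() if n > 1]
    return errors
```

Calling `validate_records(records)` on the loaded dataset should return an empty list if the checks listed above hold.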
## LinkML Schema Compliance

All converted records conform to the HeritageCustodian schema:

```yaml
- id: https://w3id.org/heritage/custodian/nl/{slug}
  name: {csv_instelling}
  institution_type: {ARCHIVE|MUSEUM|LIBRARY|...}  # Requires classification
  locations:
    - city: {csv_plaats}
      country: NL
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: {csv_isil_code}
      identifier_url: https://isil.org/{csv_isil_code}
      assigned_date: {csv_toegekend_op}
  description: "Opmerking: {csv_opmerking}"  # If present
  provenance:
    data_source: ISIL_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: {timestamp}
    extraction_method: "CSV to YAML conversion (National Archive ISIL codes)"
    source_url: https://www.nationaalarchief.nl/isil
    confidence_score: 1.0
```

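To make the template concrete, here is a rough sketch of how one parsed record could be mapped onto this structure. The function name `to_heritage_custodian` is hypothetical, and the slug rule (lower-cased ISIL code) is purely an assumption for illustration, since the actual slug generation is not documented here. `institution_type` is omitted because it requires classification.

```python
from datetime import datetime, timezone


def to_heritage_custodian(rec: dict) -> dict:
    """Map one parsed csv_* record onto the template's structure."""
    code = rec["csv_isil_code"]
    out = {
        # ASSUMPTION: slug is the lower-cased ISIL code (not documented).
        "id": f"https://w3id.org/heritage/custodian/nl/{code.lower()}",
        "name": rec["csv_instelling"],
        "locations": [{"city": rec["csv_plaats"], "country": "NL"}],
        "identifiers": [{
            "identifier_scheme": "ISIL",
            "identifier_value": code,
            "identifier_url": f"https://isil.org/{code}",
            # Empty assignment dates become None.
            "assigned_date": rec.get("csv_toegekend_op") or None,
        }],
        "provenance": {
            "data_source": "ISIL_REGISTRY",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
            "extraction_method": "CSV to YAML conversion (National Archive ISIL codes)",
            "source_url": "https://www.nationaalarchief.nl/isil",
            "confidence_score": 1.0,
        },
    }
    # The description is only emitted when a remark is present.
    if rec.get("csv_opmerking"):
        out["description"] = f"Opmerking: {rec['csv_opmerking']}"
    return out
```

Keeping the mapping in one pure function makes it easy to test against a handful of known records before converting all 371.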
## Statistics

| Metric | Value |
|--------|-------|
| Total records | 371 |
| Total fields preserved | 2,226 (100%) |
| Unique cities | 201 |
| Unique ISIL codes | 371 (no duplicates) |
| Records with remarks | 18 (4.9%) |
| ISIL code length (min) | 7 characters |
| ISIL code length (max) | 17 characters |
| ISIL code length (mean) | 10.3 characters |
| Earliest assignment date | 2008-10-10 |
| Latest assignment date | 2025-09-18 |

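Figures like these can be recomputed from the parsed records. The following sketch assumes each record is a dict with the `csv_*` keys shown earlier; `summarize` is a hypothetical helper, not part of the conversion script.

```python
from collections import Counter
from statistics import mean


def summarize(records: list) -> dict:
    """Compute summary metrics over a list of csv_* record dicts."""
    codes = [r["csv_isil_code"] for r in records]
    # ISO 8601 dates sort correctly as plain strings; empty dates are skipped.
    dates = sorted(r["csv_toegekend_op"] for r in records if r.get("csv_toegekend_op"))
    cities = Counter(r["csv_plaats"] for r in records)
    return {
        "total_records": len(records),
        "unique_cities": len(cities),
        "unique_isil_codes": len(set(codes)),
        "records_with_remarks": sum(1 for r in records if r.get("csv_opmerking")),
        "code_length_min": min(len(c) for c in codes),
        "code_length_max": max(len(c) for c in codes),
        "code_length_mean": round(mean(len(c) for c in codes), 1),
        "earliest_date": dates[0] if dates else None,
        "latest_date": dates[-1] if dates else None,
        "top_cities": cities.most_common(4),
    }
```

Recomputing the metrics after each registry update is a cheap way to spot unexpected changes, such as a drop in the record count or a new duplicate code.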
## Related Documentation

- **Conversion Report**: `/docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md`
- **Source CSV**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.csv`
- **Output YAML**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml`
- **Conversion Script**: `/scripts/convert_isil_csv_to_yaml.py`
- **Main Schema**: `/schemas/heritage_custodian.yaml`

## Usage Examples

### Load YAML Data
```python
import yaml

with open('data/isil/nl/nan/ISIL-codes_2025-11-06.yaml', 'r') as f:
    records = yaml.safe_load(f)

print(f"Loaded {len(records)} institutions")
```

### Query by City
```python
amsterdam_records = [r for r in records if r['csv_plaats'] == 'Amsterdam']
print(f"Amsterdam has {len(amsterdam_records)} institutions")
```

### Extract Change Events
```python
change_events = [
    r for r in records
    if r.get('csv_opmerking') and any(
        keyword in r['csv_opmerking'].lower()
        for keyword in ['fusie', 'naamswijziging', 'in onbruik']
    )
]
print(f"Found {len(change_events)} institutions with organizational changes")
```

### SPARQL Query (Future RDF Export)
```sparql
PREFIX hc: <https://w3id.org/heritage/custodian/>
PREFIX isil: <https://isil.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?institution ?name ?isil_code WHERE {
  ?institution hc:name ?name ;
               hc:identifier ?id .
  ?id dcterms:identifier ?isil_code ;
      dcterms:type "ISIL" .
  FILTER(STRSTARTS(?isil_code, "NL-Asd"))  # Amsterdam institutions
}
```

## Future Work

1. **Institution Type Classification**: Assign `institution_type` (ARCHIVE, MUSEUM, LIBRARY) using NLP or manual review
2. **Change Event Extraction**: Parse organizational history from `csv_opmerking` into structured `ChangeEvent` objects
3. **Geocoding**: Add latitude/longitude to `Location` objects using Nominatim API
4. **Wikidata Enrichment**: Link institutions to Wikidata entities (Q-numbers)
5. **Cross-linking**: Match with KB library ISIL dataset (524 total Dutch ISIL codes)
6. **RDF Export**: Generate RDF/Turtle serialization for SPARQL querying

## Contact

For questions about the ISIL registry conversion or schema:
- **Data Source**: [Nationaal Archief ISIL](https://www.nationaalarchief.nl/isil)
- **Project**: GLAM Heritage Custodian Data Pipeline
- **Schema Version**: v0.2.1 (modular LinkML)