glam/data/isil/nl/nan/linkml/README.md
2025-11-19 23:25:22 +01:00


# Dutch National Archive ISIL Registry - LinkML Documentation
This directory contains LinkML schema documentation for the Dutch National Archive ISIL registry conversion from CSV to YAML format.
## Overview
- **Source**: [Nationaal Archief ISIL Registry](https://www.nationaalarchief.nl/isil)
- **Records**: 371 Dutch heritage institutions
- **Date Range**: 2008-10-10 to 2025-09-18
- **Geographic Coverage**: 201 unique cities across the Netherlands
- **Data Quality**: TIER_1_AUTHORITATIVE (confidence score: 1.0)
## Files in This Directory
### `schema.yaml`
LinkML schema definition documenting the structure of ISIL registry records after conversion to HeritageCustodian format.
**Key classes**:
- `ISILRegistryRecord` - Main record structure with CSV fields and LinkML mappings
- `Location` - Geographic location (city, country)
- `Identifier` - ISIL code structure (scheme, value, URL, assigned_date)
- `Provenance` - Data source and quality metadata
**Enumerations**:
- `InstitutionTypeEnum` - Heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- `DataSourceEnum` - ISIL_REGISTRY
- `DataTierEnum` - TIER_1_AUTHORITATIVE
### `mapping.yaml`
Complete field-by-field mapping documentation showing how each CSV column transforms into LinkML YAML structure.
**Covers**:
- CSV structure and parsing challenges (latin-1 encoding, malformed cells)
- Field mappings with examples (6 CSV columns → LinkML attributes)
- Transformation rules (date parsing, URL generation, description formatting)
- Data quality metrics (100% field preservation, 2,226 fields)
- Organizational change event detection (18 records with merger/closure notes)
## Dataset Characteristics
### ISIL Code Format
- **Pattern**: `NL-{CityAbbrev}{InstitutionAbbrev}`
- **Length**: Variable (7-17 characters)
- **Encoding**: Semantic (city + institution abbreviations)
- **Examples**:
  - `NL-AsdRM` - Rijksmuseum (Amsterdam)
  - `NL-HaNa` - Nationaal Archief (Den Haag)
  - `NL-LlsBatavialand` - Batavialand (Lelystad)
### Data Completeness
| Field | Coverage | Notes |
|-------|----------|-------|
| Row number | 100% (371/371) | Sequential 1-371 |
| City (Plaats) | 100% (371/371) | 201 unique cities |
| Institution (Instelling) | 100% (371/371) | All institution names present |
| ISIL code | 100% (371/371) | All unique, no duplicates |
| Assignment date (Toegekend op) | ~95% | Most have dates, some empty |
| Remarks (Opmerking) | 4.9% (18/371) | Organizational history notes |
### Top Cities by Institution Count
1. **Den Haag** - 38 institutions (10.2%)
2. **Amsterdam** - 29 institutions (7.8%)
3. **Deventer** - 11 institutions (3.0%)
4. **Groningen** - 10 institutions (2.7%)
### Organizational Change Events
18 records (4.9%) contain organizational history in the `csv_opmerking` field:
**Event types detected**:
- **MERGER**: "fusie tussen", "samenvoeging"
  - Example: RHCL-Rijckheyt merger (2020)
- **NAME_CHANGE**: "naamswijziging", "hernoemd"
- **CLOSURE**: "in onbruik", "gesloten"
- **RELOCATION**: "verhuisd naar", "overgebracht naar"
**Future processing**: These remarks can be extracted as structured `ChangeEvent` objects in the HeritageCustodian schema.
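
The keyword matching described above can be sketched as a small classifier. This is a minimal illustration, not the pipeline's implementation: the keyword lists come from the event types listed in this section, while the names `EVENT_KEYWORDS` and `detect_event_types` and the list-of-labels return value are assumptions. The structure of `ChangeEvent` is not defined in this document, so the sketch only returns event type labels.

```python
# Keywords are copied from the event types listed above.
# EVENT_KEYWORDS and detect_event_types are hypothetical names.
EVENT_KEYWORDS = {
    "MERGER": ["fusie tussen", "samenvoeging"],
    "NAME_CHANGE": ["naamswijziging", "hernoemd"],
    "CLOSURE": ["in onbruik", "gesloten"],
    "RELOCATION": ["verhuisd naar", "overgebracht naar"],
}


def detect_event_types(remark):
    """Return every event type whose keywords occur in the remark."""
    text = (remark or "").lower()
    return [
        event_type
        for event_type, keywords in EVENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


# Made-up remarks, used only to exercise the function.
print(detect_event_types("Fusie tussen archief A en archief B"))
print(detect_event_types("Hernoemd en verhuisd naar Zwolle"))
```

Plain substring matching can produce false positives (for example, "gesloten" inside a longer unrelated phrase), so the detected types are best treated as candidates for manual review.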
## CSV Parsing Challenges
The original CSV file had several issues requiring custom parsing:
### Encoding
- **Issue**: File uses `latin-1` encoding (not UTF-8)
- **Solution**: `encoding='latin-1'` parameter in file reader
### Malformed Structure
- **Issue**: All fields stored in single CSV cell with `","` delimiter
- **Solution**: Split on `","` pattern, strip quotes and semicolons
### Header Row
- **Issue**: Contains sequence number as first field before actual headers
- **Solution**: Extract headers from indices 1-5, skip index 0
### Example Raw CSV Row
```
"1","Amsterdam","Rijksmuseum","NL-AsdRM","2013-03-07","";"
```
### After Parsing
```yaml
csv_row_number: 1
csv_plaats: Amsterdam
csv_instelling: Rijksmuseum
csv_isil_code: NL-AsdRM
csv_toegekend_op: "2013-03-07"
csv_opmerking: ""
```
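
The parsing steps above can be sketched as follows. This is not the code from the conversion script; the function names `parse_raw_line` and `read_registry` are hypothetical, and the sketch assumes one record per line and no literal `","` sequence inside a remark.

```python
# Field names follow the "After Parsing" example above.
FIELD_NAMES = [
    "csv_row_number",
    "csv_plaats",
    "csv_instelling",
    "csv_isil_code",
    "csv_toegekend_op",
    "csv_opmerking",
]


def parse_raw_line(line):
    """Split one malformed line on '","' and clean each piece."""
    # Split first, then strip each piece: stripping the whole line first
    # would remove the quotes that mark a trailing empty field.
    parts = [part.strip().strip('";').strip() for part in line.strip().split('","')]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    record = dict(zip(FIELD_NAMES, parts))
    record["csv_row_number"] = int(record["csv_row_number"])
    return record


def read_registry(path):
    """Read the latin-1 file and skip the header row (assumes one record per line)."""
    with open(path, encoding="latin-1") as handle:
        lines = [line for line in handle if line.strip()]
    return [parse_raw_line(line) for line in lines[1:]]


raw = '"1","Amsterdam","Rijksmuseum","NL-AsdRM","2013-03-07","";"'
print(parse_raw_line(raw))
```

One side effect of stripping quotes and semicolons from each piece is that a remark that legitimately ends in a semicolon would lose it.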
## Conversion Process
### Input
```
/data/isil/nl/nan/ISIL-codes_2025-11-06.csv
```
### Conversion Script
```
/scripts/convert_isil_csv_to_yaml.py
```
### Output
```
/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml
```
### Validation
- ✅ 371 records converted
- ✅ 2,226 fields preserved (100% preservation)
- ✅ 0 validation errors
- ✅ All ISIL codes match pattern `^NL-[A-Za-z0-9]+`
- ✅ All non-empty assignment dates parse as ISO 8601 (YYYY-MM-DD)
- ✅ No duplicate ISIL codes
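
The pattern, date, and duplicate checks listed above could be reproduced along these lines. This is a sketch under assumptions: `validate_records` is a hypothetical helper, not part of the conversion script, and it expects records keyed by the `csv_*` field names shown in the parsing example.

```python
import re
from collections import Counter
from datetime import date

ISIL_PATTERN = re.compile(r"^NL-[A-Za-z0-9]+")   # pattern quoted in the checklist
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")    # strict YYYY-MM-DD shape


def validate_records(records):
    """Return a list of human-readable validation errors (empty if all pass)."""
    errors = []
    code_counts = Counter(r["csv_isil_code"] for r in records)
    for code, count in code_counts.items():
        if not ISIL_PATTERN.match(code):
            errors.append(f"{code}: does not match the ISIL pattern")
        if count > 1:
            errors.append(f"{code}: appears {count} times")
    for r in records:
        assigned = r.get("csv_toegekend_op")
        if not assigned:
            continue  # empty assignment dates are allowed
        try:
            if not DATE_SHAPE.fullmatch(assigned):
                raise ValueError("not in YYYY-MM-DD form")
            date.fromisoformat(assigned)  # rejects impossible dates such as 2013-02-30
        except ValueError:
            errors.append(f"{r['csv_isil_code']}: invalid date {assigned!r}")
    return errors


sample = [
    {"csv_isil_code": "NL-AsdRM", "csv_toegekend_op": "2013-03-07"},
    {"csv_isil_code": "NL-HaNa", "csv_toegekend_op": ""},
]
print(validate_records(sample))
```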
## LinkML Schema Compliance
All converted records conform to the HeritageCustodian schema:
```yaml
- id: https://w3id.org/heritage/custodian/nl/{slug}
  name: {csv_instelling}
  institution_type: {ARCHIVE|MUSEUM|LIBRARY|...}  # Requires classification
  locations:
    - city: {csv_plaats}
      country: NL
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: {csv_isil_code}
      identifier_url: https://isil.org/{csv_isil_code}
      assigned_date: {csv_toegekend_op}
  description: "Opmerking: {csv_opmerking}"  # If present
  provenance:
    data_source: ISIL_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: {timestamp}
    extraction_method: "CSV to YAML conversion (National Archive ISIL codes)"
    source_url: https://www.nationaalarchief.nl/isil
    confidence_score: 1.0
```
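
As an illustration of how the template above might be filled from one parsed record, here is a minimal sketch. The function `to_heritage_custodian` is hypothetical. Because this document does not say how `{slug}` is derived, the slug is passed in by the caller. Omitting `assigned_date` when the CSV value is empty and using a UTC ISO timestamp for `extraction_date` are also assumptions.

```python
from datetime import datetime, timezone


def to_heritage_custodian(record, slug):
    """Build a HeritageCustodian-shaped dict from one parsed CSV record."""
    isil = record["csv_isil_code"]
    identifier = {
        "identifier_scheme": "ISIL",
        "identifier_value": isil,
        "identifier_url": f"https://isil.org/{isil}",
    }
    if record.get("csv_toegekend_op"):  # assumption: omit empty dates
        identifier["assigned_date"] = record["csv_toegekend_op"]

    custodian = {
        "id": f"https://w3id.org/heritage/custodian/nl/{slug}",
        "name": record["csv_instelling"],
        # institution_type is left out: the template marks it as requiring classification.
        "locations": [{"city": record["csv_plaats"], "country": "NL"}],
        "identifiers": [identifier],
        "provenance": {
            "data_source": "ISIL_REGISTRY",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
            "extraction_method": "CSV to YAML conversion (National Archive ISIL codes)",
            "source_url": "https://www.nationaalarchief.nl/isil",
            "confidence_score": 1.0,
        },
    }
    if record.get("csv_opmerking"):  # description only if a remark is present
        custodian["description"] = f"Opmerking: {record['csv_opmerking']}"
    return custodian


record = {
    "csv_plaats": "Amsterdam",
    "csv_instelling": "Rijksmuseum",
    "csv_isil_code": "NL-AsdRM",
    "csv_toegekend_op": "2013-03-07",
    "csv_opmerking": "",
}
custodian = to_heritage_custodian(record, slug="rijksmuseum")  # slug value is illustrative
print(custodian["id"])
```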
## Statistics
| Metric | Value |
|--------|-------|
| Total records | 371 |
| Total fields preserved | 2,226 (100%) |
| Unique cities | 201 |
| Unique ISIL codes | 371 (no duplicates) |
| Records with remarks | 18 (4.9%) |
| ISIL code length (min) | 7 characters |
| ISIL code length (max) | 17 characters |
| ISIL code length (mean) | 10.3 characters |
| Earliest assignment date | 2008-10-10 |
| Latest assignment date | 2025-09-18 |
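
The metrics in this table could be recomputed from the parsed records with a helper like the one below. This is a sketch: `summarize` is a hypothetical name, the records are assumed to use the `csv_*` keys from the parsing example, and the record list is assumed to be non-empty. The three sample records reuse the ISIL codes shown earlier; the second and third dates are made up for illustration and are not taken from the registry.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Compute the summary metrics for a non-empty list of parsed records."""
    codes = [r["csv_isil_code"] for r in records]
    lengths = [len(code) for code in codes]
    # ISO 8601 dates sort correctly as plain strings.
    dates = sorted(r["csv_toegekend_op"] for r in records if r.get("csv_toegekend_op"))
    cities = Counter(r["csv_plaats"] for r in records)
    return {
        "total_records": len(records),
        "unique_cities": len(cities),
        "unique_isil_codes": len(set(codes)),
        "records_with_remarks": sum(1 for r in records if r.get("csv_opmerking")),
        "isil_length_min": min(lengths),
        "isil_length_max": max(lengths),
        "isil_length_mean": round(mean(lengths), 1),
        "earliest_date": dates[0] if dates else None,
        "latest_date": dates[-1] if dates else None,
        "top_cities": cities.most_common(4),
    }


sample = [
    {"csv_plaats": "Amsterdam", "csv_isil_code": "NL-AsdRM",
     "csv_toegekend_op": "2013-03-07", "csv_opmerking": ""},
    {"csv_plaats": "Den Haag", "csv_isil_code": "NL-HaNa",
     "csv_toegekend_op": "", "csv_opmerking": ""},              # empty date
    {"csv_plaats": "Lelystad", "csv_isil_code": "NL-LlsBatavialand",
     "csv_toegekend_op": "2020-01-01", "csv_opmerking": "x"},   # made-up date and remark
]
stats = summarize(sample)
print(stats["isil_length_min"], stats["isil_length_max"], stats["isil_length_mean"])
```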
## Related Documentation
- **Conversion Report**: `/docs/ISIL_CSV_TO_YAML_CONVERSION_REPORT.md`
- **Source CSV**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.csv`
- **Output YAML**: `/data/isil/nl/nan/ISIL-codes_2025-11-06.yaml`
- **Conversion Script**: `/scripts/convert_isil_csv_to_yaml.py`
- **Main Schema**: `/schemas/heritage_custodian.yaml`
## Usage Examples
### Load YAML Data
```python
import yaml

# Load the converted registry; safe_load avoids executing arbitrary YAML tags.
with open('data/isil/nl/nan/ISIL-codes_2025-11-06.yaml', 'r') as f:
    records = yaml.safe_load(f)

print(f"Loaded {len(records)} institutions")
```
### Query by City
```python
amsterdam_records = [r for r in records if r['csv_plaats'] == 'Amsterdam']
print(f"Amsterdam has {len(amsterdam_records)} institutions")
```
### Extract Change Events
```python
change_events = [
    r for r in records
    if r.get('csv_opmerking') and any(
        keyword in r['csv_opmerking'].lower()
        for keyword in ['fusie', 'naamswijziging', 'in onbruik']
    )
]
print(f"Found {len(change_events)} institutions with organizational changes")
```
### SPARQL Query (Future RDF Export)
```sparql
PREFIX hc: <https://w3id.org/heritage/custodian/>
PREFIX isil: <https://isil.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?institution ?name ?isil_code WHERE {
  ?institution hc:name ?name ;
               hc:identifier ?id .
  ?id dcterms:identifier ?isil_code ;
      dcterms:type "ISIL" .
  FILTER(STRSTARTS(?isil_code, "NL-Asd"))  # Amsterdam institutions
}
```
## Future Work
1. **Institution Type Classification**: Assign institution_type (ARCHIVE, MUSEUM, LIBRARY) using NLP or manual review
2. **Change Event Extraction**: Parse organizational history from csv_opmerking into structured ChangeEvent objects
3. **Geocoding**: Add latitude/longitude to Location objects using Nominatim API
4. **Wikidata Enrichment**: Link institutions to Wikidata entities (Q-numbers)
5. **Cross-linking**: Match with KB library ISIL dataset (524 total Dutch ISIL codes)
6. **RDF Export**: Generate RDF/Turtle serialization for SPARQL querying
## Contact
For questions about the ISIL registry conversion or schema:
- **Data Source**: [Nationaal Archief ISIL](https://www.nationaalarchief.nl/isil)
- **Project**: GLAM Heritage Custodian Data Pipeline
- **Schema Version**: v0.2.1 (modular LinkML)