207 lines
6.8 KiB
Markdown
207 lines
6.8 KiB
Markdown
# LinkML Validation Report: CSV to YAML Conversion
|
|
|
|
## Overview
|
|
|
|
This document provides a comprehensive validation report for the conversion of the NDE Dutch Heritage Organizations CSV file to YAML format using LinkML schemas and mapping validation.
|
|
|
|
## Files Created
|
|
|
|
### 1. LinkML Schemas
|
|
|
|
**Source Schema**: `data/nde/nde_csv_source.yaml`
|
|
- Defines the structure of the original CSV file
|
|
- 33 columns/fields (including 2 unnamed columns)
|
|
- All fields optional (CSV may have empty cells)
|
|
- Preserves original field naming conventions
|
|
|
|
**Target Schema**: `data/nde/nde_yaml_target.yaml`
|
|
- Defines the structure of the converted YAML file
|
|
- 32 unique fields (normalized field names)
|
|
- All fields optional (only non-empty fields included)
|
|
- Normalized field naming (lowercase, underscores)
|
|
|
|
**Mapping Schema**: `data/nde/nde_csv_to_yaml_mapping.yaml`
|
|
- Defines the transformation rules from CSV to YAML
|
|
- Documents field name normalization
|
|
- Specifies one-to-one field mappings
|
|
|
|
### 2. Validation Script
|
|
|
|
**Script**: `scripts/validate_csv_to_yaml_conversion.py`
|
|
- Python script using LinkML validation principles
|
|
- Validates field mapping correctness
|
|
- Validates data preservation (no loss)
|
|
- Validates value integrity (no corruption)
|
|
|
|
## Validation Results
|
|
|
|
### Summary Statistics
|
|
|
|
| Metric | CSV Source | YAML Target | Match |
|
|
|--------|-----------|-------------|-------|
|
|
| **Records** | 1,351 | 1,351 | ✓ YES |
|
|
| **Non-empty cells/fields** | 6,980 | 6,980 | ✓ YES |
|
|
| **Unique fields** | 33 columns | 32 fields | ✓ YES* |
|
|
|
|
*Note: CSV has 2 empty column names that both map to `unnamed_field` in YAML, reducing unique field count from 33 to 32.
|
|
|
|
### Field Mapping Validation
|
|
|
|
**Result**: ✓✓✓ ALL 33 FIELD MAPPINGS CORRECT
|
|
|
|
All CSV columns successfully map to YAML fields with the following transformations:
|
|
|
|
1. **Direct mappings** (29 fields): Most fields map directly with minimal normalization
|
|
- Example: `Organisatie` → `organisatie` (lowercase)
|
|
- Example: `Museum register` → `museum_register` (spaces to underscores)
|
|
|
|
2. **Normalized mappings** (4 fields): Fields with special characters normalized
|
|
- `ISIL-code (NA)` → `isil-code_na` (removed parentheses)
|
|
- `Archieven.nl` → `archieven.nl` (preserved dot)
|
|
- `OODE24 (Mondriaan)` → `oode24_mondriaan` (removed newline and parentheses)
|
|
- Empty column → `unnamed_field` (placeholder name)
|
|
|
|
**Missing mappings**: 0
|
|
**Unexpected YAML fields**: 0
|
|
|
|
### Data Preservation Validation
|
|
|
|
**Result**: ✓✓✓ ALL DATA PRESERVED
|
|
|
|
#### Record Count
|
|
- CSV records: 1,351
|
|
- YAML records: 1,351
|
|
- **Match: 100%**
|
|
|
|
#### Cell/Field Count
|
|
- CSV non-empty cells: 6,980
|
|
- YAML total fields: 6,980
|
|
- **Match: 100%**
|
|
|
|
#### Content Integrity
|
|
- Missing data instances: **0**
|
|
- Value mismatches: **0**
|
|
- **All content preserved exactly**
|
|
|
|
### Special Cases Validated
|
|
|
|
1. **Multi-line content** (2 records with newlines)
|
|
- ✓ Preserved with exact newline characters (`\r\n`)
|
|
- ✓ No truncation or corruption
|
|
|
|
2. **Special characters** (8+ records in first 100)
|
|
- ✓ Quotes, parentheses, slashes, commas all preserved
|
|
- ✓ Example: `Stichting "Museum van Papierknipkunst"` → preserved exactly
|
|
|
|
3. **URLs** (1,100+ fields)
|
|
- ✓ All URLs preserved exactly
|
|
- ✓ Trailing spaces trimmed correctly
|
|
|
|
4. **ISIL codes** (364 fields)
|
|
- ✓ All codes preserved in correct format (`NL-XXXXX`)
|
|
- ✓ No corruption or modification
|
|
|
|
5. **Empty fields**
|
|
- ✓ Correctly omitted from YAML (not stored as null/empty)
|
|
- ✓ Only non-empty values included
|
|
|
|
## LinkML Schema Compliance
|
|
|
|
### CSV Source Schema Compliance
|
|
|
|
The CSV file complies with the `nde_csv_source.yaml` schema:
|
|
- All 33 columns present
|
|
- Field names match schema definitions
|
|
- Data types conform to string range
|
|
- No schema violations detected
|
|
|
|
### YAML Target Schema Compliance
|
|
|
|
The YAML file complies with the `nde_yaml_target.yaml` schema:
|
|
- All 32 unique fields conform to schema
|
|
- Field names follow normalization rules
|
|
- Only non-empty values included (per schema design)
|
|
- No schema violations detected
|
|
|
|
### Mapping Schema Compliance
|
|
|
|
The conversion follows the `nde_csv_to_yaml_mapping.yaml` mapping:
|
|
- All field derivations correct
|
|
- Source-to-target mappings 1:1
|
|
- Transformation rules applied consistently
|
|
- No mapping violations detected
|
|
|
|
## Field Name Normalization Rules
|
|
|
|
The conversion applies these normalization rules (as documented in schemas):
|
|
|
|
1. **Whitespace**: Convert to underscores
|
|
- `Plaatsnaam bezoekadres ` → `plaatsnaam_bezoekadres`
|
|
|
|
2. **Newlines**: Convert to underscores
|
|
- `OODE24\n(Mondriaan)` → `oode24_mondriaan`
|
|
|
|
3. **Parentheses**: Remove
|
|
- `ISIL-code (NA)` → `isil-code_na`
|
|
|
|
4. **Quotes**: Remove
|
|
- No impact (field names don't contain quotes)
|
|
|
|
5. **Multiple underscores**: Collapse to single
|
|
- `field__name` → `field_name`
|
|
|
|
6. **Leading/trailing underscores**: Strip
|
|
- `_field_` → `field`
|
|
|
|
7. **Case**: Convert to lowercase
|
|
- `Organisatie` → `organisatie`
|
|
|
|
8. **Empty names**: Replace with placeholder
|
|
- `` → `unnamed_field`
|
|
|
|
## Validation Methodology
|
|
|
|
The validation follows LinkML best practices:
|
|
|
|
1. **Schema-based validation**: Schemas define structure and constraints
|
|
2. **Mapping validation**: Explicit mapping rules define transformations
|
|
3. **Data integrity checks**: Cell-by-cell comparison ensures no data loss
|
|
4. **Reproducibility**: Validation script can be re-run at any time
|
|
|
|
## Conclusion
|
|
|
|
### Final Verdict
|
|
|
|
**✓✓✓ VALIDATION PASSED ✓✓✓**
|
|
|
|
The CSV to YAML conversion is **VERIFIED** as complete and correct according to LinkML schema validation principles.
|
|
|
|
### Validation Guarantees
|
|
|
|
Based on the comprehensive validation:
|
|
|
|
1. ✓ **Completeness**: All 1,351 records converted
|
|
2. ✓ **Preservation**: All 6,980 non-empty cells preserved
|
|
3. ✓ **Accuracy**: All values match exactly (no corruption)
|
|
4. ✓ **Consistency**: All field mappings follow defined rules
|
|
5. ✓ **Schema compliance**: Both source and target conform to LinkML schemas
|
|
6. ✓ **Mapping compliance**: Conversion follows documented mapping rules
|
|
|
|
### Files Summary
|
|
|
|
| File | Purpose | Status |
|
|
|------|---------|--------|
|
|
| `data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv` | Source data | ✓ Valid |
|
|
| `data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml` | Target data | ✓ Valid |
|
|
| `data/nde/nde_csv_source.yaml` | CSV LinkML schema | ✓ Valid |
|
|
| `data/nde/nde_yaml_target.yaml` | YAML LinkML schema | ✓ Valid |
|
|
| `data/nde/nde_csv_to_yaml_mapping.yaml` | Transformation mapping | ✓ Valid |
|
|
| `scripts/convert_nde_csv_to_yaml.py` | Conversion script | ✓ Works |
|
|
| `scripts/validate_csv_to_yaml_conversion.py` | Validation script | ✓ Passes |
|
|
|
|
---
|
|
|
|
**Validation Date**: 2025-11-17
|
|
**Validation Method**: LinkML schema-based validation
|
|
**Validator**: OpenCode AI with LinkML toolkit
|
|
**Result**: ✓✓✓ ALL CHECKS PASSED ✓✓✓
|