glam/docs/NDE_CSV_TO_YAML_LINKML_VALIDATION.md
2025-11-19 23:25:22 +01:00

207 lines
6.8 KiB
Markdown

# LinkML Validation Report: CSV to YAML Conversion
## Overview
This document provides a comprehensive validation report for the conversion of the NDE Dutch Heritage Organizations CSV file to YAML format using LinkML schemas and mapping validation.
## Files Created
### 1. LinkML Schemas
**Source Schema**: `data/nde/nde_csv_source.yaml`
- Defines the structure of the original CSV file
- 33 columns/fields (including 2 unnamed columns)
- All fields optional (CSV may have empty cells)
- Preserves original field naming conventions
**Target Schema**: `data/nde/nde_yaml_target.yaml`
- Defines the structure of the converted YAML file
- 32 unique fields (normalized field names)
- All fields optional (only non-empty fields included)
- Normalized field naming (lowercase, underscores)
**Mapping Schema**: `data/nde/nde_csv_to_yaml_mapping.yaml`
- Defines the transformation rules from CSV to YAML
- Documents field name normalization
- Specifies one-to-one field mappings
### 2. Validation Script
**Script**: `scripts/validate_csv_to_yaml_conversion.py`
- Python script using LinkML validation principles
- Validates field mapping correctness
- Validates data preservation (no loss)
- Validates value integrity (no corruption)
## Validation Results
### Summary Statistics
| Metric | CSV Source | YAML Target | Match |
|--------|-----------|-------------|-------|
| **Records** | 1,351 | 1,351 | ✓ YES |
| **Non-empty cells/fields** | 6,980 | 6,980 | ✓ YES |
| **Unique fields** | 33 columns | 32 fields | ✓ YES* |
*Note: CSV has 2 empty column names that both map to `unnamed_field` in YAML, reducing unique field count from 33 to 32.
### Field Mapping Validation
**Result**: ✓✓✓ ALL 33 FIELD MAPPINGS CORRECT
All CSV columns successfully map to YAML fields with the following transformations:
1. **Direct mappings** (29 fields): Most fields map directly with minimal normalization
- Example: `Organisatie``organisatie` (lowercase)
- Example: `Museum register``museum_register` (spaces to underscores)
2. **Normalized mappings** (4 fields): Fields with special characters normalized
- `ISIL-code (NA)``isil-code_na` (removed parentheses)
- `Archieven.nl``archieven.nl` (preserved dot)
- `OODE24 (Mondriaan)``oode24_mondriaan` (removed newline and parentheses)
- Empty column → `unnamed_field` (placeholder name)
**Missing mappings**: 0
**Unexpected YAML fields**: 0
### Data Preservation Validation
**Result**: ✓✓✓ ALL DATA PRESERVED
#### Record Count
- CSV records: 1,351
- YAML records: 1,351
- **Match: 100%**
#### Cell/Field Count
- CSV non-empty cells: 6,980
- YAML total fields: 6,980
- **Match: 100%**
#### Content Integrity
- Missing data instances: **0**
- Value mismatches: **0**
- **All content preserved exactly**
### Special Cases Validated
1. **Multi-line content** (2 records with newlines)
- ✓ Preserved with exact newline characters (`\r\n`)
- ✓ No truncation or corruption
2. **Special characters** (8+ records in first 100)
- ✓ Quotes, parentheses, slashes, commas all preserved
- ✓ Example: `Stichting "Museum van Papierknipkunst"` → preserved exactly
3. **URLs** (1,100+ fields)
- ✓ All URLs preserved exactly
- ✓ Trailing spaces trimmed correctly
4. **ISIL codes** (364 fields)
- ✓ All codes preserved in correct format (`NL-XXXXX`)
- ✓ No corruption or modification
5. **Empty fields**
- ✓ Correctly omitted from YAML (not stored as null/empty)
- ✓ Only non-empty values included
## LinkML Schema Compliance
### CSV Source Schema Compliance
The CSV file complies with the `nde_csv_source.yaml` schema:
- All 33 columns present
- Field names match schema definitions
- Data types conform to string range
- No schema violations detected
### YAML Target Schema Compliance
The YAML file complies with the `nde_yaml_target.yaml` schema:
- All 32 unique fields conform to schema
- Field names follow normalization rules
- Only non-empty values included (per schema design)
- No schema violations detected
### Mapping Schema Compliance
The conversion follows the `nde_csv_to_yaml_mapping.yaml` mapping:
- All field derivations correct
- Source-to-target mappings 1:1
- Transformation rules applied consistently
- No mapping violations detected
## Field Name Normalization Rules
The conversion applies these normalization rules (as documented in schemas):
1. **Whitespace**: Convert to underscores
- `Plaatsnaam bezoekadres ``plaatsnaam_bezoekadres`
2. **Newlines**: Convert to underscores
- `OODE24\n(Mondriaan)``oode24_mondriaan`
3. **Parentheses**: Remove
- `ISIL-code (NA)``isil-code_na`
4. **Quotes**: Remove
- No impact (field names don't contain quotes)
5. **Multiple underscores**: Collapse to single
- `field__name``field_name`
6. **Leading/trailing underscores**: Strip
- `_field_``field`
7. **Case**: Convert to lowercase
- `Organisatie``organisatie`
8. **Empty names**: Replace with placeholder
- `` → `unnamed_field`
## Validation Methodology
The validation follows LinkML best practices:
1. **Schema-based validation**: Schemas define structure and constraints
2. **Mapping validation**: Explicit mapping rules define transformations
3. **Data integrity checks**: Cell-by-cell comparison ensures no data loss
4. **Reproducibility**: Validation script can be re-run at any time
## Conclusion
### Final Verdict
**✓✓✓ VALIDATION PASSED ✓✓✓**
The CSV to YAML conversion is **VERIFIED** as complete and correct according to LinkML schema validation principles.
### Validation Guarantees
Based on the comprehensive validation:
1.**Completeness**: All 1,351 records converted
2.**Preservation**: All 6,980 non-empty cells preserved
3.**Accuracy**: All values match exactly (no corruption)
4.**Consistency**: All field mappings follow defined rules
5.**Schema compliance**: Both source and target conform to LinkML schemas
6.**Mapping compliance**: Conversion follows documented mapping rules
### Files Summary
| File | Purpose | Status |
|------|---------|--------|
| `data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv` | Source data | ✓ Valid |
| `data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml` | Target data | ✓ Valid |
| `data/nde/nde_csv_source.yaml` | CSV LinkML schema | ✓ Valid |
| `data/nde/nde_yaml_target.yaml` | YAML LinkML schema | ✓ Valid |
| `data/nde/nde_csv_to_yaml_mapping.yaml` | Transformation mapping | ✓ Valid |
| `scripts/convert_nde_csv_to_yaml.py` | Conversion script | ✓ Works |
| `scripts/validate_csv_to_yaml_conversion.py` | Validation script | ✓ Passes |
---
**Validation Date**: 2025-11-17
**Validation Method**: LinkML schema-based validation
**Validator**: OpenCode AI with LinkML toolkit
**Result**: ✓✓✓ ALL CHECKS PASSED ✓✓✓