# LinkML Constraints and Validation

**Version**: 1.0
**Date**: 2025-11-22
**Status**: Phase 8 Complete

This document describes the LinkML-level validation approach for the Heritage Custodian Ontology, including built-in constraints, custom validators, and integration patterns.

---

## Table of Contents

1. [Overview](#overview)
2. [Three-Layer Validation Strategy](#three-layer-validation-strategy)
3. [LinkML Built-in Constraints](#linkml-built-in-constraints)
4. [Custom Python Validators](#custom-python-validators)
5. [Usage Examples](#usage-examples)
6. [Validation Test Suite](#validation-test-suite)
7. [Integration Patterns](#integration-patterns)
8. [Comparison with Other Approaches](#comparison-with-other-approaches)
9. [Troubleshooting](#troubleshooting)

---

## Overview

**Goal**: Validate heritage custodian data at the **YAML instance level** BEFORE converting to RDF.

**Why Validate at LinkML Level?**

- ✅ **Early Detection**: Catch errors before expensive RDF conversion
- ✅ **Fast Feedback**: YAML validation is faster than RDF/SHACL validation
- ✅ **Developer-Friendly**: Error messages reference YAML structure (not RDF triples)
- ✅ **CI/CD Integration**: Validate data pipelines before publishing

**What LinkML Validates**:

1. **Schema Compliance**: Data types, required fields, cardinality
2. **Format Constraints**: Date formats, regex patterns, enumerations
3. **Custom Business Rules**: Temporal consistency, bidirectional relationships (via Python validators)

---

## Three-Layer Validation Strategy

The Heritage Custodian Ontology uses **complementary validation at three levels**:

| Layer | Technology | When | Purpose | Speed |
|-------|------------|------|---------|-------|
| **Layer 1: LinkML** | Python validators | YAML loading | Validate BEFORE RDF conversion | ⚡ Fast (ms) |
| **Layer 2: SHACL** | RDF shapes | RDF ingestion | Validate DURING triple store loading | 🐢 Moderate (sec) |
| **Layer 3: SPARQL** | Query-based | Runtime | Validate AFTER data is stored | 🐢 Slow (sec-min) |

**Recommended Workflow**:

```
1. Create YAML instance
   ↓
2. Validate with LinkML (Layer 1) ← THIS DOCUMENT
   ↓
3. If valid → Convert to RDF
   ↓
4. Validate with SHACL (Layer 2)
   ↓
5. If valid → Load into triple store
   ↓
6. Monitor with SPARQL (Layer 3)
```

**See Also**:

- Layer 2: `docs/SHACL_VALIDATION_SHAPES.md`
- Layer 3: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md`

---

## LinkML Built-in Constraints

LinkML provides **declarative constraints** that can be embedded directly in schema YAML files.

### 1. Required Fields

**Schema Syntax**:

```yaml
# schemas/20251121/linkml/modules/classes/HeritageCustodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - custodian_aspect  # ← Required
    slot_usage:
      name:
        required: true  # ← Must be present
```

(Note: per-class overrides of slot properties use the LinkML `slot_usage` key.)

**Validation**:

```python
from linkml_runtime.loaders import yaml_loader

# HeritageCustodian is the Python class generated from the schema
# ❌ This will fail validation (missing required field)
data = {"id": "test", "description": "No name provided"}
try:
    instance = yaml_loader.load(data, target_class=HeritageCustodian)
except ValueError as e:
    print(f"Error: {e}")  # "Missing required field: name"
```

---

### 2. Data Type Constraints

**Schema Syntax**:

```yaml
slots:
  valid_from:
    range: date  # ← Must be a valid date

  latitude:
    range: float  # ← Must be a float

  institution_type:
    range: InstitutionTypeEnum  # ← Must be one of the enum values
```

**Validation**:

```python
# ❌ This will fail (invalid date format)
data = {
    "valid_from": "not-a-date"  # Should be "YYYY-MM-DD"
}
# Error: "Expected date, got string 'not-a-date'"

# ❌ This will fail (invalid enum value)
data = {
    "institution_type": "FAKE_TYPE"  # Should be MUSEUM, LIBRARY, etc.
}
# Error: "Value 'FAKE_TYPE' not in InstitutionTypeEnum"
```

---

### 3. Pattern Constraints (Regex)

**Schema Syntax**:

```yaml
# schemas/20251121/linkml/modules/slots/valid_from.yaml
valid_from:
  description: Start date of temporal validity (ISO 8601 format)
  range: date
  pattern: "^\\d{4}-\\d{2}-\\d{2}$"  # ← Regex pattern for YYYY-MM-DD
  examples:
    - value: "2000-01-01"
    - value: "1923-05-15"
```

**Validation**:

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# ✅ Valid dates
DATE_PATTERN.match("2000-01-01")   # Pass
DATE_PATTERN.match("1923-05-15")   # Pass

# ❌ Invalid dates
DATE_PATTERN.match("2000/01/01")   # Fail (wrong separator)
DATE_PATTERN.match("Jan 1, 2000")  # Fail (wrong format)
DATE_PATTERN.match("2000-1-1")     # Fail (missing leading zeros)
```

---

### 4. Cardinality Constraints

**Schema Syntax**:

```yaml
slots:
  locations:
    multivalued: true   # ← Can have multiple values
    required: false     # ← But list can be empty

  custodian_aspect:
    multivalued: false  # ← Only one value allowed
    required: true      # ← Must be present
```

**Validation**:

```python
# ✅ Valid: Multiple locations
data = {
    "locations": [
        {"city": "Amsterdam", "country": "NL"},
        {"city": "The Hague", "country": "NL"}
    ]
}

# ❌ Invalid: Multiple custodian_aspect values (should be single)
data = {
    "custodian_aspect": [
        {"name": "Museum A"},
        {"name": "Museum B"}
    ]
}
# Error: "custodian_aspect must be single-valued"
```

---

### 5. Minimum/Maximum Value Constraints

**Schema Syntax** (example for future use):

```yaml
latitude:
  range: float
  minimum_value: -90.0   # ← Latitude bounds
  maximum_value: 90.0

longitude:
  range: float
  minimum_value: -180.0
  maximum_value: 180.0

confidence_score:
  range: float
  minimum_value: 0.0  # ← Confidence between 0.0 and 1.0
  maximum_value: 1.0
```
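Since these bounds are marked for future use, the check they would impose can be sketched as a plain inclusive range test; `check_bounds` below is an illustrative helper, not part of `scripts/linkml_validators.py`:

```python
def check_bounds(value: float, minimum: float, maximum: float) -> bool:
    """Return True when value lies within [minimum, maximum] inclusive,
    mirroring LinkML's minimum_value/maximum_value semantics."""
    return minimum <= value <= maximum

# Latitude/longitude bounds as declared above
print(check_bounds(52.37, -90.0, 90.0))    # Amsterdam latitude → True
print(check_bounds(190.0, -180.0, 180.0))  # Out-of-range longitude → False
```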

---

## Custom Python Validators

For **complex business rules** that can't be expressed with built-in constraints, use **custom Python validators**.

### Location: `scripts/linkml_validators.py`

This script provides 5 custom validation functions implementing organizational structure rules:
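Each validator returns a list of `ValidationError` records. The authoritative definition lives in `scripts/linkml_validators.py`; a minimal sketch of the shape the examples below assume:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ValidationError:
    """One rule violation found in a YAML instance (assumed shape)."""
    rule: str       # Rule identifier, e.g. "COLLECTION_UNIT_TEMPORAL"
    severity: str   # "ERROR" or "WARNING"
    message: str    # Human-readable description
    context: Dict[str, Any] = field(default_factory=dict)  # IDs and dates involved
```

Keeping `rule` and `context` machine-readable lets callers filter or aggregate violations, while `message` stays readable in CLI output.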

---

### Validator 1: Collection-Unit Temporal Consistency

**Rule**: A collection's `valid_from` date must be >= its managing unit's `valid_from` date.

**Rationale**: A collection cannot be managed by a unit that doesn't yet exist.

**Function**:

```python
def validate_collection_unit_temporal(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate that collections are not founded before their managing units.

    Rule 1: collection.valid_from >= unit.valid_from
    """
    errors = []

    # Extract organizational units
    units = data.get('organizational_structure', [])
    unit_dates = {unit['id']: unit.get('valid_from') for unit in units}

    # Extract collections
    collections = data.get('collections_aspect', [])

    for collection in collections:
        collection_valid_from = collection.get('valid_from')
        managing_units = collection.get('managed_by_unit', [])

        for unit_id in managing_units:
            unit_valid_from = unit_dates.get(unit_id)

            if collection_valid_from and unit_valid_from:
                if collection_valid_from < unit_valid_from:
                    errors.append(ValidationError(
                        rule="COLLECTION_UNIT_TEMPORAL",
                        severity="ERROR",
                        message="Collection founded before its managing unit",
                        context={
                            "collection_id": collection.get('id'),
                            "collection_valid_from": collection_valid_from,
                            "unit_id": unit_id,
                            "unit_valid_from": unit_valid_from
                        }
                    ))

    return errors
```

**Example Violation**:

```yaml
# ❌ Collection founded in 2002, but unit not established until 2005
organizational_structure:
  - id: unit-001
    valid_from: "2005-01-01"  # Unit founded 2005

collections_aspect:
  - id: collection-001
    valid_from: "2002-03-15"  # ❌ Collection founded 2002 (before unit!)
    managed_by_unit:
      - unit-001
```

**Expected Error**:

```
ERROR: Collection founded before its managing unit
  Collection: collection-001 (valid_from: 2002-03-15)
  Managing Unit: unit-001 (valid_from: 2005-01-01)
  Violation: 2002-03-15 < 2005-01-01
```
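Note that the validator compares `valid_from` values as plain strings. That is safe here because zero-padded ISO 8601 dates sort chronologically under lexicographic comparison — exactly the format the `pattern` constraint on `valid_from` enforces:

```python
# Lexicographic order matches chronological order for zero-padded ISO dates
print("2002-03-15" < "2005-01-01")  # True
print("1999-12-31" < "2000-01-01")  # True

# ...but only when the format is enforced; unpadded dates break the assumption
print("2000-1-1" < "2000-01-02")    # False (lexicographic, not chronological)
```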

---

### Validator 2: Collection-Unit Bidirectional Consistency

**Rule**: If a collection references a unit via `managed_by_unit`, the unit must reference the collection back via `manages_collections`.

**Rationale**: Bidirectional relationships ensure graph consistency (required for the W3C Organization Ontology).

**Function**:

```python
def validate_collection_unit_bidirectional(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate bidirectional relationships between collections and units.

    Rule 2: If collection → unit, then unit → collection (inverse).
    """
    errors = []

    # Build inverse mapping: unit_id → collections managed by unit
    units = data.get('organizational_structure', [])
    unit_collections = {unit['id']: unit.get('manages_collections', []) for unit in units}

    # Check collections
    collections = data.get('collections_aspect', [])

    for collection in collections:
        collection_id = collection.get('id')
        managing_units = collection.get('managed_by_unit', [])

        for unit_id in managing_units:
            # Check whether the unit references the collection back
            if collection_id not in unit_collections.get(unit_id, []):
                errors.append(ValidationError(
                    rule="COLLECTION_UNIT_BIDIRECTIONAL",
                    severity="ERROR",
                    message="Collection references unit, but unit doesn't reference collection",
                    context={
                        "collection_id": collection_id,
                        "unit_id": unit_id,
                        "unit_manages_collections": unit_collections.get(unit_id, [])
                    }
                ))

    return errors
```

**Example Violation**:

```yaml
# ❌ Collection → Unit exists, but Unit → Collection missing
organizational_structure:
  - id: unit-001
    # Missing: manages_collections: [collection-001]

collections_aspect:
  - id: collection-001
    managed_by_unit:
      - unit-001  # ✓ Forward reference exists
```

**Expected Error**:

```
ERROR: Collection references unit, but unit doesn't reference collection
  Collection: collection-001
  Unit: unit-001
  Unit's manages_collections: [] (empty - should include collection-001)
```

---

### Validator 3: Staff-Unit Temporal Consistency

**Rule**: A staff member's `valid_from` date must be >= their employing unit's `valid_from` date.

**Rationale**: A person cannot be employed by a unit that doesn't yet exist.

**Function**:

```python
def validate_staff_unit_temporal(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate that staff employment dates are consistent with unit founding dates.

    Rule 4: staff.valid_from >= unit.valid_from
    """
    errors = []

    # Extract organizational units
    units = data.get('organizational_structure', [])
    unit_dates = {unit['id']: unit.get('valid_from') for unit in units}

    # Extract staff
    staff = data.get('staff_aspect', [])

    for person in staff:
        person_obs = person.get('person_observation', {})
        person_valid_from = person_obs.get('valid_from')
        employing_units = person.get('employed_by_unit', [])

        for unit_id in employing_units:
            unit_valid_from = unit_dates.get(unit_id)

            if person_valid_from and unit_valid_from:
                if person_valid_from < unit_valid_from:
                    errors.append(ValidationError(
                        rule="STAFF_UNIT_TEMPORAL",
                        severity="ERROR",
                        message="Staff employment started before unit existed",
                        context={
                            "staff_id": person.get('id'),
                            "staff_valid_from": person_valid_from,
                            "unit_id": unit_id,
                            "unit_valid_from": unit_valid_from
                        }
                    ))

    return errors
```

---

### Validator 4: Staff-Unit Bidirectional Consistency

**Rule**: If a staff member references a unit via `employed_by_unit`, the unit must reference the staff member back via `employs_staff`.

**Function**: Similar structure to Validator 2 (see `scripts/linkml_validators.py` for the implementation).
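For completeness, a minimal sketch of what that implementation looks like, mirroring Validator 2. The inline `ValidationError` is a stand-in and the `employs_staff` slot name follows the rule stated above; the authoritative version is in `scripts/linkml_validators.py`:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ValidationError:  # minimal stand-in for the class in linkml_validators.py
    rule: str
    severity: str
    message: str
    context: Dict[str, Any] = field(default_factory=dict)

def validate_staff_unit_bidirectional(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Rule 5: If staff → unit (employed_by_unit), then unit → staff (employs_staff).
    """
    errors = []

    # Build inverse mapping: unit_id → staff employed by the unit
    units = data.get('organizational_structure', [])
    unit_staff = {unit['id']: unit.get('employs_staff', []) for unit in units}

    for person in data.get('staff_aspect', []):
        person_id = person.get('id')
        for unit_id in person.get('employed_by_unit', []):
            if person_id not in unit_staff.get(unit_id, []):
                errors.append(ValidationError(
                    rule="STAFF_UNIT_BIDIRECTIONAL",
                    severity="ERROR",
                    message="Staff references unit, but unit doesn't reference staff",
                    context={"staff_id": person_id, "unit_id": unit_id},
                ))
    return errors
```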

---

### Validator 5: Batch Validation

**Function**: Run all validators at once and return the combined results.

```python
def validate_all(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Run all validation rules and return combined results.
    """
    errors = []
    errors.extend(validate_collection_unit_temporal(data))
    errors.extend(validate_collection_unit_bidirectional(data))
    errors.extend(validate_staff_unit_temporal(data))
    errors.extend(validate_staff_unit_bidirectional(data))
    return errors
```

---

## Usage Examples

### Command-Line Interface

The `linkml_validators.py` script provides a CLI for standalone validation:

```bash
# Validate a single YAML file
python scripts/linkml_validators.py \
    schemas/20251121/examples/validation_tests/valid_complete_example.yaml

# ✅ Output (valid file):
# Validation successful! No errors found.
# File: valid_complete_example.yaml

# Validate an invalid file
python scripts/linkml_validators.py \
    schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml

# ❌ Output (invalid file):
# Validation failed with 4 errors:
#
# ERROR: Collection founded before its managing unit
#   Collection: early-collection (valid_from: 2002-03-15)
#   Unit: curatorial-dept-002 (valid_from: 2005-01-01)
#
# ERROR: Collection founded before its managing unit
#   Collection: another-early-collection (valid_from: 2008-09-01)
#   Unit: research-dept-002 (valid_from: 2010-06-01)
#
# ERROR: Staff employment started before unit existed
#   Staff: early-curator (valid_from: 2003-01-15)
#   Unit: curatorial-dept-002 (valid_from: 2005-01-01)
#
# ERROR: Staff employment started before unit existed
#   Staff: early-researcher (valid_from: 2009-03-01)
#   Unit: research-dept-002 (valid_from: 2010-06-01)
```

---

### Python API

Import and use the validators in your own Python code:

```python
from linkml_validators import validate_all, ValidationError
import yaml

# Load YAML data
with open('data/instance.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Run validation
errors = validate_all(data)

if errors:
    print(f"Validation failed with {len(errors)} errors:")
    for error in errors:
        print(f"  {error.severity}: {error.message}")
        print(f"    Rule: {error.rule}")
        print(f"    Context: {error.context}")
else:
    print("Validation successful!")
```

---

### Integration with Data Pipelines

**Pattern 1: Validate Before Conversion**

```python
from linkml_validators import validate_all
from linkml_runtime import SchemaView
from linkml_runtime.dumpers import rdflib_dumper
import yaml

def convert_yaml_to_rdf(yaml_path, rdf_path, schema_path):
    """Convert YAML to RDF with validation."""

    # Load YAML
    with open(yaml_path, 'r') as f:
        data = yaml.safe_load(f)

    # Validate FIRST (Layer 1)
    errors = validate_all(data)
    if errors:
        print(f"❌ Validation failed: {len(errors)} errors")
        for error in errors:
            print(f"  - {error.message}")
        return False

    # Convert to RDF (only if validation passed)
    print("✅ Validation passed, converting to RDF...")
    instance = HeritageCustodian(**data)  # generated LinkML Python class
    # rdflib_dumper needs a SchemaView to resolve slot URIs
    rdflib_dumper.dump(instance, to_file=rdf_path,
                       schemaview=SchemaView(schema_path), fmt='turtle')
    print(f"✅ RDF written to {rdf_path}")
    return True
```

---

### Integration with CI/CD

**GitHub Actions Example**:

```yaml
# .github/workflows/validate-data.yml
name: Validate Heritage Custodian Data

on:
  push:
    paths:
      - 'data/instances/**/*.yaml'
  pull_request:
    paths:
      - 'data/instances/**/*.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install pyyaml linkml-runtime

      - name: Validate YAML instances
        run: |
          # Enable recursive globbing, then validate all files in data/instances/
          shopt -s globstar
          for file in data/instances/**/*.yaml; do
            echo "Validating $file..."
            if ! python scripts/linkml_validators.py "$file"; then
              echo "❌ Validation failed for $file"
              exit 1
            fi
          done
          echo "✅ All files validated successfully"
```

**Exit Codes**:

- `0`: Validation successful
- `1`: Validation failed (errors found)
- `2`: Script error (file not found, invalid YAML syntax)
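This convention can be implemented with a small wrapper along these lines — a sketch only, with a stub `validate_all`; the real CLI lives in `scripts/linkml_validators.py` and imports the genuine validators (PyYAML required):

```python
import sys
import yaml  # PyYAML

def validate_all(data):
    """Stand-in; in practice, import validate_all from linkml_validators."""
    return []

def run_cli(yaml_path: str) -> int:
    """Validate one file and return an exit code per the convention above."""
    try:
        with open(yaml_path, 'r') as f:
            data = yaml.safe_load(f)
    except (OSError, yaml.YAMLError) as e:
        print(f"Script error: {e}")
        return 2  # exit 2: file not found / invalid YAML syntax

    errors = validate_all(data)
    if errors:
        print(f"Validation failed with {len(errors)} errors")
        return 1  # exit 1: validation errors found
    print("Validation successful! No errors found.")
    return 0  # exit 0: success

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_cli(sys.argv[1]))
```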
---

## Validation Test Suite

The project includes **3 comprehensive test examples** demonstrating validation behavior:

### Test 1: Valid Complete Example

**File**: `schemas/20251121/examples/validation_tests/valid_complete_example.yaml`

**Description**: A fictional heritage museum with:

- 3 organizational units (departments)
- 2 collections (properly aligned temporally)
- 3 staff members (properly aligned temporally)
- All bidirectional relationships correct

**Expected Result**: ✅ **PASS** (no validation errors)

**Key Features**:

- All `valid_from` dates are consistent (collections/staff founded after their units)
- All inverse relationships present (`manages_collections` ↔ `managed_by_unit`)
- Demonstrates best practices for data modeling

---

### Test 2: Invalid Temporal Violation

**File**: `schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml`

**Description**: A museum with **temporal inconsistencies**:

- Collection founded in 2002, but managing unit not established until 2005
- Collection founded in 2008, but managing unit not established until 2010
- Staff employed in 2003, but employing unit not established until 2005
- Staff employed in 2009, but employing unit not established until 2010

**Expected Result**: ❌ **FAIL** with 4 errors

**Violations**:

1. Collection `early-collection`: `valid_from: 2002-03-15` < Unit `valid_from: 2005-01-01`
2. Collection `another-early-collection`: `valid_from: 2008-09-01` < Unit `valid_from: 2010-06-01`
3. Staff `early-curator`: `valid_from: 2003-01-15` < Unit `valid_from: 2005-01-01`
4. Staff `early-researcher`: `valid_from: 2009-03-01` < Unit `valid_from: 2010-06-01`

---

### Test 3: Invalid Bidirectional Violation

**File**: `schemas/20251121/examples/validation_tests/invalid_bidirectional_violation.yaml`

**Description**: A museum with **missing inverse relationships**:

- A collection references its managing unit, but the unit doesn't reference the collection back
- A staff member references their employing unit, but the unit doesn't reference the staff member back

**Expected Result**: ❌ **FAIL** with 2 errors

**Violations**:

1. Collection `paintings-collection-003` → Unit `curatorial-dept-003` (forward exists), but Unit → Collection (inverse missing)
2. Staff `researcher-001-003` → Unit `research-dept-003` (forward exists), but Unit → Staff (inverse missing)

---

### Running Tests

```bash
# Test 1: Valid example (should pass)
python scripts/linkml_validators.py \
    schemas/20251121/examples/validation_tests/valid_complete_example.yaml
# ✅ Expected: "Validation successful! No errors found."

# Test 2: Temporal violations (should fail)
python scripts/linkml_validators.py \
    schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml
# ❌ Expected: "Validation failed with 4 errors"

# Test 3: Bidirectional violations (should fail)
python scripts/linkml_validators.py \
    schemas/20251121/examples/validation_tests/invalid_bidirectional_violation.yaml
# ❌ Expected: "Validation failed with 2 errors"
```

---

## Integration Patterns

### Pattern 1: Validate on Data Import

```python
def import_heritage_custodian(yaml_path):
    """Import and validate a heritage custodian YAML file."""
    import yaml
    from linkml_validators import validate_all

    # Load YAML
    with open(yaml_path, 'r') as f:
        data = yaml.safe_load(f)

    # Validate FIRST
    errors = validate_all(data)
    if errors:
        raise ValueError(f"Validation failed: {errors}")

    # Process data (convert to RDF, store in a database, etc.)
    process_data(data)
```

---

### Pattern 2: Pre-commit Hook

**File**: `.git/hooks/pre-commit`

```bash
#!/bin/bash
# Validate all staged YAML files before commit

echo "Validating heritage custodian YAML files..."

# Find all staged YAML files in data/instances/
staged_files=$(git diff --cached --name-only --diff-filter=ACM | grep "data/instances/.*\.yaml$")

if [ -z "$staged_files" ]; then
    echo "No YAML files staged, skipping validation."
    exit 0
fi

# Validate each file
for file in $staged_files; do
    echo "  Validating $file..."
    if ! python scripts/linkml_validators.py "$file"; then
        echo "❌ Validation failed for $file"
        echo "Commit aborted. Fix validation errors and try again."
        exit 1
    fi
done

echo "✅ All YAML files validated successfully."
exit 0
```

**Installation**:

```bash
chmod +x .git/hooks/pre-commit
```

---

### Pattern 3: Batch Validation

```python
def validate_directory(directory_path):
    """Validate all YAML files in a directory."""
    import os
    import yaml
    from linkml_validators import validate_all

    results = {"passed": [], "failed": []}

    for root, _dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith('.yaml'):
                yaml_path = os.path.join(root, file)

                with open(yaml_path, 'r') as f:
                    data = yaml.safe_load(f)

                errors = validate_all(data)
                if errors:
                    results["failed"].append({
                        "file": yaml_path,
                        "errors": errors
                    })
                else:
                    results["passed"].append(yaml_path)

    # Report results
    print(f"✅ Passed: {len(results['passed'])} files")
    print(f"❌ Failed: {len(results['failed'])} files")

    for failure in results["failed"]:
        print(f"\n{failure['file']}:")
        for error in failure["errors"]:
            print(f"  - {error.message}")

    return results
```

---

## Comparison with Other Approaches

### LinkML vs. Python Validator (Phase 5)

| Feature | LinkML Validators | Phase 5 Python Validator |
|---------|-------------------|--------------------------|
| **Input** | YAML instances | RDF triples (after conversion) |
| **Speed** | ⚡ Fast (ms) | 🐢 Moderate (sec) |
| **Error Location** | YAML field names | RDF triple patterns |
| **Use Case** | Development, CI/CD | Post-conversion validation |
| **Integration** | Data pipeline ingestion | RDF quality assurance |

**Recommendation**: Use **both** for defense-in-depth validation.

---

### LinkML vs. SHACL (Phase 7)

| Feature | LinkML Validators | SHACL Shapes |
|---------|-------------------|--------------|
| **Input** | YAML instances | RDF graphs |
| **Validation Time** | Before RDF conversion | During RDF ingestion |
| **Error Messages** | Python-friendly | RDF-centric |
| **Extensibility** | Python code | SPARQL-based constraints |
| **Standards** | LinkML metamodel | W3C SHACL standard |
| **Use Case** | Development | Triple store ingestion |

**Recommendation**:

- Use **LinkML** for early validation (development phase)
- Use **SHACL** for production validation (RDF ingestion)

---

### LinkML vs. SPARQL Queries (Phase 6)

| Feature | LinkML Validators | SPARQL Queries |
|---------|-------------------|----------------|
| **Input** | YAML instances | RDF triple store |
| **Timing** | Before RDF conversion | After data is stored |
| **Purpose** | **Prevention** | **Detection** |
| **Speed** | ⚡ Fast | 🐢 Slow (depends on data size) |
| **Use Case** | Data quality gates | Monitoring, auditing |

**Recommendation**:

- Use **LinkML** to **prevent** invalid data from entering the system
- Use **SPARQL** to **detect** existing violations in production data

---

## Troubleshooting

### Issue 1: "Missing required field" Error

**Symptom**:

```
ValueError: Missing required field: name
```

**Cause**: The YAML instance is missing a required field defined in the schema.

**Solution**:

```yaml
# ❌ Missing required field
id: https://example.org/custodian/001
description: Some museum

# ✅ Add the required field
id: https://example.org/custodian/001
name: Example Museum  # ← Add this
description: Some museum
```

---

### Issue 2: "Expected date, got string" Error

**Symptom**:

```
ValueError: Expected date, got string '2000/01/01'
```

**Cause**: The date format doesn't match the ISO 8601 pattern (`YYYY-MM-DD`).

**Solution**:

```yaml
# ❌ Wrong date format
valid_from: "2000/01/01"  # Slashes instead of hyphens

# ✅ Correct date format
valid_from: "2000-01-01"  # ISO 8601: YYYY-MM-DD
```

---

### Issue 3: Validation Passes but SHACL Fails

**Symptom**: LinkML validation passes, but SHACL validation fails on the same data.

**Cause**: LinkML validators check **YAML structure**, while SHACL validates **RDF graph patterns**. Some constraints (e.g., inverse relationships) may be implicit in YAML but explicit in RDF.

**Solution**: Ensure the YAML data includes **all required inverse relationships**:

```yaml
# ✅ Explicit bidirectional relationships in YAML
organizational_structure:
  - id: unit-001
    manages_collections:  # ← Inverse relationship
      - collection-001

collections_aspect:
  - id: collection-001
    managed_by_unit:  # ← Forward relationship
      - unit-001
```

---

### Issue 4: "List index out of range" or "KeyError"

**Symptom**: A Python exception is raised during validation.

**Cause**: The YAML structure doesn't match the expected schema (e.g., missing nested fields).

**Solution**: Use defensive programming in custom validators:

```python
# ❌ Unsafe access
unit_valid_from = data['organizational_structure'][0]['valid_from']

# ✅ Safe access with defaults
units = data.get('organizational_structure', [])
unit_valid_from = units[0].get('valid_from') if units else None
```

---

### Issue 5: Slow Validation Performance

**Symptom**: Validation takes a long time on large datasets.

**Cause**: Custom validators may have O(n²) complexity when checking relationships.

**Solution**: Use indexed lookups:

```python
# ❌ Slow: O(n²) nested loops
for collection in collections:
    for unit in units:
        if unit['id'] in collection['managed_by_unit']:
            ...  # check the relationship

# ✅ Fast: O(n) with a dict index
unit_dates = {unit['id']: unit['valid_from'] for unit in units}
for collection in collections:
    for unit_id in collection['managed_by_unit']:
        unit_date = unit_dates.get(unit_id)  # O(1) lookup
```

---

## Summary

**LinkML Constraints Capabilities**:

✅ **Built-in Constraints** (declarative):

- Required fields (`required: true`)
- Data types (`range: date`, `range: float`)
- Regex patterns (`pattern: "^\\d{4}-\\d{2}-\\d{2}$"`)
- Cardinality (`multivalued: true/false`)
- Min/max values (`minimum_value`, `maximum_value`)

✅ **Custom Validators** (programmatic):

- Temporal consistency (collections/staff cannot predate their units)
- Bidirectional relationships (forward ↔ inverse)
- Complex business rules (Python functions)

✅ **Integration**:

- Command-line interface (`linkml_validators.py`)
- Python API (`import linkml_validators`)
- CI/CD workflows (GitHub Actions, pre-commit hooks)
- Data pipelines (validate before RDF conversion)

✅ **Test Suite**:

- Valid example (passes all rules)
- Temporal violations (fails Rules 1 & 4)
- Bidirectional violations (fails Rules 2 & 5)

**Next Steps**:

1. ✅ **Phase 8 Complete**: LinkML constraints documented
2. ⏳ **Phase 9**: Apply validators to real-world heritage institution data
3. ⏳ **Performance Testing**: Benchmark validation speed on large datasets (10K+ institutions)
4. ⏳ **Additional Rules**: Extend validators for custody transfer events and legal form constraints

---

## References

- **Phase 5**: `docs/VALIDATION_RULES.md` (Python validator)
- **Phase 6**: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md` (SPARQL queries)
- **Phase 7**: `docs/SHACL_VALIDATION_SHAPES.md` (SHACL shapes)
- **Phase 8**: This document (LinkML constraints)
- **Schema**: `schemas/20251121/linkml/01_custodian_name_modular.yaml`
- **Validators**: `scripts/linkml_validators.py`
- **Test Suite**: `schemas/20251121/examples/validation_tests/`
- **LinkML Documentation**: https://linkml.io/

---

**Version**: 1.0
**Phase**: 8 (Complete)
**Date**: 2025-11-22