# LinkML Constraints and Validation
**Version**: 1.0
**Date**: 2025-11-22
**Status**: Phase 8 Complete
This document describes the LinkML-level validation approach for the Heritage Custodian Ontology, including built-in constraints, custom validators, and integration patterns.
---
## Table of Contents
1. [Overview](#overview)
2. [Three-Layer Validation Strategy](#three-layer-validation-strategy)
3. [LinkML Built-in Constraints](#linkml-built-in-constraints)
4. [Custom Python Validators](#custom-python-validators)
5. [Usage Examples](#usage-examples)
6. [Validation Test Suite](#validation-test-suite)
7. [Integration Patterns](#integration-patterns)
8. [Comparison with Other Approaches](#comparison-with-other-approaches)
9. [Troubleshooting](#troubleshooting)
---
## Overview
**Goal**: Validate heritage custodian data at the **YAML instance level** BEFORE converting to RDF.
**Why Validate at LinkML Level?**
- **Early Detection**: Catch errors before expensive RDF conversion
- **Fast Feedback**: YAML validation is faster than RDF/SHACL validation
- **Developer-Friendly**: Error messages reference YAML structure (not RDF triples)
- **CI/CD Integration**: Validate data pipelines before publishing
**What LinkML Validates**:
1. **Schema Compliance**: Data types, required fields, cardinality
2. **Format Constraints**: Date formats, regex patterns, enumerations
3. **Custom Business Rules**: Temporal consistency, bidirectional relationships (via Python validators)
---
## Three-Layer Validation Strategy
The Heritage Custodian Ontology uses **complementary validation at three levels**:
| Layer | Technology | When | Purpose | Speed |
|-------|------------|------|---------|-------|
| **Layer 1: LinkML** | Python validators | YAML loading | Validate BEFORE RDF conversion | ⚡ Fast (ms) |
| **Layer 2: SHACL** | RDF shapes | RDF ingestion | Validate DURING triple store loading | 🐢 Moderate (sec) |
| **Layer 3: SPARQL** | Query-based | Runtime | Validate AFTER data is stored | 🐢 Slow (sec-min) |
**Recommended Workflow**:
```
1. Create YAML instance
2. Validate with LinkML (Layer 1) ← THIS DOCUMENT
3. If valid → Convert to RDF
4. Validate with SHACL (Layer 2)
5. If valid → Load into triple store
6. Monitor with SPARQL (Layer 3)
```
**See Also**:
- Layer 2: `docs/SHACL_VALIDATION_SHAPES.md`
- Layer 3: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md`
---
## LinkML Built-in Constraints
LinkML provides **declarative constraints** that can be embedded directly in schema YAML files.
### 1. Required Fields
**Schema Syntax**:
```yaml
# schemas/20251121/linkml/modules/classes/HeritageCustodian.yaml
slots:
  - name
  - custodian_aspect      # ← Required
slot_usage:
  name:
    required: true        # ← Must be present
```
**Validation**:
```python
from linkml_runtime.loaders import yaml_loader

# ❌ This will fail validation (missing required field)
data = {"id": "test", "description": "No name provided"}

try:
    instance = yaml_loader.load(data, target_class=HeritageCustodian)
except ValueError as e:
    print(f"Error: {e}")  # "Missing required field: name"
```
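The same check can be reproduced at the plain-`dict` level, which is handy inside custom validators that run before LinkML loading. A minimal sketch (the helper and the field list are illustrative, not part of linkml-runtime):

```python
# Illustrative dict-level pre-check mirroring LinkML's `required: true`.
# The helper is NOT part of linkml-runtime; the field names come from this schema.
def check_required(data: dict, required_fields: list[str]) -> list[str]:
    """Return an error message for every required field that is missing or empty."""
    return [
        f"Missing required field: {field}"
        for field in required_fields
        if data.get(field) in (None, "", [])
    ]

errors = check_required(
    {"id": "test", "description": "No name provided"},
    required_fields=["name", "custodian_aspect"],
)
# errors now lists both missing fields, in schema order
```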
---
### 2. Data Type Constraints
**Schema Syntax**:
```yaml
slots:
  valid_from:
    range: date                   # ← Must be a valid date
  latitude:
    range: float                  # ← Must be a float
  institution_type:
    range: InstitutionTypeEnum    # ← Must be one of the enum values
```
**Validation**:
```python
# ❌ This will fail (invalid date format)
data = {
    "valid_from": "not-a-date"         # Should be "YYYY-MM-DD"
}
# Error: "Expected date, got string 'not-a-date'"

# ❌ This will fail (invalid enum value)
data = {
    "institution_type": "FAKE_TYPE"    # Should be MUSEUM, LIBRARY, etc.
}
# Error: "Value 'FAKE_TYPE' not in InstitutionTypeEnum"
```
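For a quick standalone membership check before full schema validation, the enumeration can be mirrored as a Python `Enum`. A sketch (the values are trimmed to three for illustration; the full list lives in the schema):

```python
from enum import Enum

class InstitutionTypeEnum(str, Enum):
    # Illustrative subset of the schema's enum values
    MUSEUM = "MUSEUM"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"

def is_valid_institution_type(value: str) -> bool:
    """True if value is one of the permitted enum values."""
    try:
        InstitutionTypeEnum(value)
        return True
    except ValueError:
        return False
```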
---
### 3. Pattern Constraints (Regex)
**Schema Syntax**:
```yaml
# schemas/20251121/linkml/modules/slots/valid_from.yaml
valid_from:
  description: Start date of temporal validity (ISO 8601 format)
  range: date
  pattern: "^\\d{4}-\\d{2}-\\d{2}$"   # ← Regex pattern for YYYY-MM-DD
  examples:
    - value: "2000-01-01"
    - value: "1923-05-15"
```
**Validation**:
```python
# ✅ Valid dates
"2000-01-01"    # Pass
"1923-05-15"    # Pass

# ❌ Invalid dates
"2000/01/01"    # Fail (wrong separator)
"Jan 1, 2000"   # Fail (wrong format)
"2000-1-1"      # Fail (missing leading zeros)
```
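The pattern can be exercised directly with Python's `re` module. Note that the regex alone only checks the *shape* of the string; pairing it with `date.fromisoformat` also rejects impossible dates such as `2000-13-01`, which the pattern would accept:

```python
import re
from datetime import date

# The same pattern as in the slot definition above
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid_date(value: str) -> bool:
    """Shape check via regex, then a real calendar check via fromisoformat."""
    if not DATE_PATTERN.match(value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False
```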
---
### 4. Cardinality Constraints
**Schema Syntax**:
```yaml
slots:
  locations:
    multivalued: true     # ← Can have multiple values
    required: false       # ← But the list can be empty
  custodian_aspect:
    multivalued: false    # ← Only one value allowed
    required: true        # ← Must be present
```
**Validation**:
```python
# ✅ Valid: Multiple locations
data = {
    "locations": [
        {"city": "Amsterdam", "country": "NL"},
        {"city": "The Hague", "country": "NL"}
    ]
}

# ❌ Invalid: Multiple custodian_aspect values (should be single)
data = {
    "custodian_aspect": [
        {"name": "Museum A"},
        {"name": "Museum B"}
    ]
}
# Error: "custodian_aspect must be single-valued"
```
---
### 5. Minimum/Maximum Value Constraints
**Schema Syntax** (example for future use):
```yaml
latitude:
  range: float
  minimum_value: -90.0    # ← Latitude bounds
  maximum_value: 90.0
longitude:
  range: float
  minimum_value: -180.0
  maximum_value: 180.0
confidence_score:
  range: float
  minimum_value: 0.0      # ← Confidence between 0.0 and 1.0
  maximum_value: 1.0
```
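Until those bounds land in the schema, the same checks can be applied in a custom validator. A minimal sketch (the `BOUNDS` table mirrors the hypothetical schema fragment above; it is not part of the project scripts):

```python
# Hypothetical bounds mirroring the schema sketch above; not yet enforced by the schema.
BOUNDS = {
    "latitude": (-90.0, 90.0),
    "longitude": (-180.0, 180.0),
    "confidence_score": (0.0, 1.0),
}

def check_bounds(record: dict) -> list[str]:
    """Return an error string for every numeric field outside its declared range."""
    errors = []
    for field, (lo, hi) in BOUNDS.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            errors.append(f"{field}={value} outside [{lo}, {hi}]")
    return errors
```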
---
## Custom Python Validators
For **complex business rules** that can't be expressed with built-in constraints, use **custom Python validators**.
### Location: `scripts/linkml_validators.py`
This script provides 5 custom validation functions implementing organizational structure rules:
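All of these functions construct and return `ValidationError` objects. The exact definition lives in `scripts/linkml_validators.py`; a plausible minimal shape, consistent with the fields used below (`rule`, `severity`, `message`, `context`), is:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ValidationError:
    """One rule violation, with enough context to locate it in the source YAML."""
    rule: str                 # e.g. "COLLECTION_UNIT_TEMPORAL"
    severity: str             # "ERROR" or "WARNING"
    message: str              # human-readable summary
    context: Dict[str, Any] = field(default_factory=dict)   # offending ids/dates
```

This is a sketch only: the real class may carry additional fields, so treat `scripts/linkml_validators.py` as authoritative.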
---
### Validator 1: Collection-Unit Temporal Consistency
**Rule**: A collection's `valid_from` date must be >= its managing unit's `valid_from` date.
**Rationale**: A collection cannot be managed by a unit that doesn't yet exist.
**Function**:
```python
from typing import Any, Dict, List

def validate_collection_unit_temporal(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate that collections are not founded before their managing units.

    Rule 1: collection.valid_from >= unit.valid_from
    """
    errors = []

    # Extract organizational units and index founding dates by unit id
    units = data.get('organizational_structure', [])
    unit_dates = {unit['id']: unit.get('valid_from') for unit in units}

    # Extract collections
    collections = data.get('collections_aspect', [])
    for collection in collections:
        collection_valid_from = collection.get('valid_from')
        managing_units = collection.get('managed_by_unit', [])

        for unit_id in managing_units:
            unit_valid_from = unit_dates.get(unit_id)
            if collection_valid_from and unit_valid_from:
                if collection_valid_from < unit_valid_from:
                    errors.append(ValidationError(
                        rule="COLLECTION_UNIT_TEMPORAL",
                        severity="ERROR",
                        message="Collection founded before its managing unit",
                        context={
                            "collection_id": collection.get('id'),
                            "collection_valid_from": collection_valid_from,
                            "unit_id": unit_id,
                            "unit_valid_from": unit_valid_from
                        }
                    ))
    return errors
```
**Example Violation**:
```yaml
# ❌ Collection founded in 2002, but unit not established until 2005
organizational_structure:
  - id: unit-001
    valid_from: "2005-01-01"   # Unit founded 2005

collections_aspect:
  - id: collection-001
    valid_from: "2002-03-15"   # ❌ Collection founded 2002 (before unit!)
    managed_by_unit:
      - unit-001
```
**Expected Error**:
```
ERROR: Collection founded before its managing unit
Collection: collection-001 (valid_from: 2002-03-15)
Managing Unit: unit-001 (valid_from: 2005-01-01)
Violation: 2002-03-15 < 2005-01-01
```
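Note that the validator compares `valid_from` values as plain strings. That is safe here only because zero-padded ISO 8601 dates (enforced by the `pattern` constraint) sort lexicographically in the same order as chronologically. A quick self-contained check:

```python
from datetime import date

# Zero-padded ISO dates: string order agrees with calendar order
pairs = [("2002-03-15", "2005-01-01"), ("1923-05-15", "2000-01-01")]
for a, b in pairs:
    assert (a < b) == (date.fromisoformat(a) < date.fromisoformat(b))

# Without zero-padding the equivalence breaks: '9' sorts after '1' character-wise
assert "2000-9-1" > "2000-10-01"
```

This is one practical reason the `YYYY-MM-DD` pattern constraint matters beyond cosmetics.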
---
### Validator 2: Collection-Unit Bidirectional Consistency
**Rule**: If a collection references a unit via `managed_by_unit`, the unit must reference the collection back via `manages_collections`.
**Rationale**: Bidirectional relationships ensure graph consistency (required for W3C Org Ontology).
**Function**:
```python
def validate_collection_unit_bidirectional(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate bidirectional relationships between collections and units.

    Rule 2: If collection → unit, then unit → collection (inverse).
    """
    errors = []

    # Build inverse mapping: unit_id → collections managed by that unit
    units = data.get('organizational_structure', [])
    unit_collections = {unit['id']: unit.get('manages_collections', []) for unit in units}

    # Check collections
    collections = data.get('collections_aspect', [])
    for collection in collections:
        collection_id = collection.get('id')
        managing_units = collection.get('managed_by_unit', [])

        for unit_id in managing_units:
            # Check if the unit references the collection back
            if collection_id not in unit_collections.get(unit_id, []):
                errors.append(ValidationError(
                    rule="COLLECTION_UNIT_BIDIRECTIONAL",
                    severity="ERROR",
                    message="Collection references unit, but unit doesn't reference collection",
                    context={
                        "collection_id": collection_id,
                        "unit_id": unit_id,
                        "unit_manages_collections": unit_collections.get(unit_id, [])
                    }
                ))
    return errors
```
**Example Violation**:
```yaml
# ❌ Collection → Unit exists, but Unit → Collection missing
organizational_structure:
  - id: unit-001
    # Missing: manages_collections: [collection-001]

collections_aspect:
  - id: collection-001
    managed_by_unit:
      - unit-001          # ✓ Forward reference exists
```
**Expected Error**:
```
ERROR: Collection references unit, but unit doesn't reference collection
Collection: collection-001
Unit: unit-001
Unit's manages_collections: [] (empty - should include collection-001)
```
---
### Validator 3: Staff-Unit Temporal Consistency
**Rule**: A staff member's `valid_from` date must be >= their employing unit's `valid_from` date.
**Rationale**: A person cannot be employed by a unit that doesn't yet exist.
**Function**:
```python
def validate_staff_unit_temporal(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Validate that staff employment dates are consistent with unit founding dates.

    Rule 4: staff.valid_from >= unit.valid_from
    """
    errors = []

    # Extract organizational units and index founding dates by unit id
    units = data.get('organizational_structure', [])
    unit_dates = {unit['id']: unit.get('valid_from') for unit in units}

    # Extract staff
    staff = data.get('staff_aspect', [])
    for person in staff:
        person_obs = person.get('person_observation', {})
        person_valid_from = person_obs.get('valid_from')
        employing_units = person.get('employed_by_unit', [])

        for unit_id in employing_units:
            unit_valid_from = unit_dates.get(unit_id)
            if person_valid_from and unit_valid_from:
                if person_valid_from < unit_valid_from:
                    errors.append(ValidationError(
                        rule="STAFF_UNIT_TEMPORAL",
                        severity="ERROR",
                        message="Staff employment started before unit existed",
                        context={
                            "staff_id": person.get('id'),
                            "staff_valid_from": person_valid_from,
                            "unit_id": unit_id,
                            "unit_valid_from": unit_valid_from
                        }
                    ))
    return errors
```
---
### Validator 4: Staff-Unit Bidirectional Consistency
**Rule**: If staff references a unit via `employed_by_unit`, the unit must reference the staff back via `employs_staff`.
**Function**: Similar structure to Validator 2 (see `scripts/linkml_validators.py` for implementation).
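A hedged sketch of what that implementation plausibly looks like, mirroring Validator 2 with the staff slots (`employed_by_unit` / `employs_staff`). The slot names follow the conventions used elsewhere in this document, and `ValidationError` is stubbed here only to keep the sketch self-contained; `scripts/linkml_validators.py` remains authoritative:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ValidationError:
    """Stand-in for the class defined in scripts/linkml_validators.py."""
    rule: str
    severity: str
    message: str
    context: Dict[str, Any] = field(default_factory=dict)

def validate_staff_unit_bidirectional(data: Dict[str, Any]) -> List[ValidationError]:
    """Rule 5: if staff → unit (employed_by_unit), then unit → staff (employs_staff)."""
    errors = []

    # Build inverse mapping: unit_id → staff employed by that unit
    units = data.get('organizational_structure', [])
    unit_staff = {unit['id']: unit.get('employs_staff', []) for unit in units}

    for person in data.get('staff_aspect', []):
        for unit_id in person.get('employed_by_unit', []):
            if person.get('id') not in unit_staff.get(unit_id, []):
                errors.append(ValidationError(
                    rule="STAFF_UNIT_BIDIRECTIONAL",
                    severity="ERROR",
                    message="Staff references unit, but unit doesn't reference staff",
                    context={"staff_id": person.get('id'), "unit_id": unit_id},
                ))
    return errors

# unit-001 lacks the inverse employs_staff entry → one error expected
sample = {
    "organizational_structure": [{"id": "unit-001"}],
    "staff_aspect": [{"id": "person-001", "employed_by_unit": ["unit-001"]}],
}
```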
---
### Validator 5: Batch Validation
**Function**: Run all validators at once and return combined results.
```python
def validate_all(data: Dict[str, Any]) -> List[ValidationError]:
    """
    Run all validation rules and return combined results.
    """
    errors = []
    errors.extend(validate_collection_unit_temporal(data))
    errors.extend(validate_collection_unit_bidirectional(data))
    errors.extend(validate_staff_unit_temporal(data))
    errors.extend(validate_staff_unit_bidirectional(data))
    return errors
```
---
## Usage Examples
### Command-Line Interface
The `linkml_validators.py` script provides a CLI for standalone validation:
```bash
# Validate a single YAML file
python scripts/linkml_validators.py \
  schemas/20251121/examples/validation_tests/valid_complete_example.yaml

# ✅ Output (valid file):
# Validation successful! No errors found.
# File: valid_complete_example.yaml

# Validate an invalid file
python scripts/linkml_validators.py \
  schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml

# ❌ Output (invalid file):
# Validation failed with 4 errors:
#
# ERROR: Collection founded before its managing unit
#   Collection: early-collection (valid_from: 2002-03-15)
#   Unit: curatorial-dept-002 (valid_from: 2005-01-01)
#
# ERROR: Collection founded before its managing unit
#   Collection: another-early-collection (valid_from: 2008-09-01)
#   Unit: research-dept-002 (valid_from: 2010-06-01)
#
# ERROR: Staff employment started before unit existed
#   Staff: early-curator (valid_from: 2003-01-15)
#   Unit: curatorial-dept-002 (valid_from: 2005-01-01)
#
# ERROR: Staff employment started before unit existed
#   Staff: early-researcher (valid_from: 2009-03-01)
#   Unit: research-dept-002 (valid_from: 2010-06-01)
```
---
### Python API
Import and use validators in your Python code:
```python
from linkml_validators import validate_all, ValidationError
import yaml

# Load YAML data
with open('data/instance.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Run validation
errors = validate_all(data)

if errors:
    print(f"Validation failed with {len(errors)} errors:")
    for error in errors:
        print(f"  {error.severity}: {error.message}")
        print(f"  Rule: {error.rule}")
        print(f"  Context: {error.context}")
else:
    print("Validation successful!")
```
---
### Integration with Data Pipelines
**Pattern 1: Validate Before Conversion**
```python
from linkml_validators import validate_all
from linkml_runtime.dumpers import rdflib_dumper
import yaml

def convert_yaml_to_rdf(yaml_path, rdf_path):
    """Convert YAML to RDF with validation."""
    # Load YAML
    with open(yaml_path, 'r') as f:
        data = yaml.safe_load(f)

    # Validate FIRST (Layer 1)
    errors = validate_all(data)
    if errors:
        print(f"❌ Validation failed: {len(errors)} errors")
        for error in errors:
            print(f"  - {error.message}")
        return False

    # Convert to RDF (only if validation passed)
    print("✅ Validation passed, converting to RDF...")
    graph = rdflib_dumper.dump(data, target_class=HeritageCustodian)
    graph.serialize(rdf_path, format='turtle')
    print(f"✅ RDF written to {rdf_path}")
    return True
```
---
### Integration with CI/CD
**GitHub Actions Example**:
```yaml
# .github/workflows/validate-data.yml
name: Validate Heritage Custodian Data

on:
  push:
    paths:
      - 'data/instances/**/*.yaml'
  pull_request:
    paths:
      - 'data/instances/**/*.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install pyyaml linkml-runtime

      - name: Validate YAML instances
        run: |
          # Enable recursive globbing so ** matches nested directories
          shopt -s globstar
          for file in data/instances/**/*.yaml; do
            echo "Validating $file..."
            python scripts/linkml_validators.py "$file"
            if [ $? -ne 0 ]; then
              echo "❌ Validation failed for $file"
              exit 1
            fi
          done
          echo "✅ All files validated successfully"
```
**Exit Codes**:
- `0`: Validation successful
- `1`: Validation failed (errors found)
- `2`: Script error (file not found, invalid YAML syntax)
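A hedged sketch of how a CLI entry point can map outcomes onto those exit codes (the function name and the injected `validate` callable are illustrative; see `scripts/linkml_validators.py` for the actual entry point):

```python
import sys
import yaml

def run_cli(path: str, validate) -> int:
    """Map validation outcomes to the documented exit codes (sketch).

    `validate` is a callable like validate_all(data) -> list of errors;
    it is passed in here only to keep the sketch self-contained.
    """
    try:
        with open(path) as f:
            data = yaml.safe_load(f)
    except (OSError, yaml.YAMLError) as exc:
        print(f"Script error: {exc}", file=sys.stderr)
        return 2                       # script error (missing file, bad YAML syntax)

    errors = validate(data)
    if errors:
        print(f"Validation failed with {len(errors)} errors:")
        return 1                       # validation failed
    print("Validation successful! No errors found.")
    return 0                           # validation successful
```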
---
## Validation Test Suite
The project includes **3 comprehensive test examples** demonstrating validation behavior:
### Test 1: Valid Complete Example
**File**: `schemas/20251121/examples/validation_tests/valid_complete_example.yaml`
**Description**: A fictional heritage museum with:
- 3 organizational units (departments)
- 2 collections (properly aligned temporally)
- 3 staff members (properly aligned temporally)
- All bidirectional relationships correct
**Expected Result**: ✅ **PASS** (no validation errors)
**Key Features**:
- All `valid_from` dates are consistent (collections/staff after units)
- All inverse relationships present (`manages_collections` ↔ `managed_by_unit`)
- Demonstrates best practices for data modeling
---
### Test 2: Invalid Temporal Violation
**File**: `schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml`
**Description**: A museum with **temporal inconsistencies**:
- Collection founded in 2002, but managing unit not established until 2005
- Collection founded in 2008, but managing unit not established until 2010
- Staff employed in 2003, but employing unit not established until 2005
- Staff employed in 2009, but employing unit not established until 2010
**Expected Result**: ❌ **FAIL** with 4 errors
**Violations**:
1. Collection `early-collection`: `valid_from: 2002-03-15` < Unit `valid_from: 2005-01-01`
2. Collection `another-early-collection`: `valid_from: 2008-09-01` < Unit `valid_from: 2010-06-01`
3. Staff `early-curator`: `valid_from: 2003-01-15` < Unit `valid_from: 2005-01-01`
4. Staff `early-researcher`: `valid_from: 2009-03-01` < Unit `valid_from: 2010-06-01`
---
### Test 3: Invalid Bidirectional Violation
**File**: `schemas/20251121/examples/validation_tests/invalid_bidirectional_violation.yaml`
**Description**: A museum with **missing inverse relationships**:
- Collection references managing unit, but unit doesn't reference collection back
- Staff references employing unit, but unit doesn't reference staff back
**Expected Result**: ❌ **FAIL** with 2 errors
**Violations**:
1. Collection `paintings-collection-003` → Unit `curatorial-dept-003` (forward exists), but Unit → Collection (inverse missing)
2. Staff `researcher-001-003` → Unit `research-dept-003` (forward exists), but Unit → Staff (inverse missing)
---
### Running Tests
```bash
# Test 1: Valid example (should pass)
python scripts/linkml_validators.py \
  schemas/20251121/examples/validation_tests/valid_complete_example.yaml
# ✅ Expected: "Validation successful! No errors found."

# Test 2: Temporal violations (should fail)
python scripts/linkml_validators.py \
  schemas/20251121/examples/validation_tests/invalid_temporal_violation.yaml
# ❌ Expected: "Validation failed with 4 errors"

# Test 3: Bidirectional violations (should fail)
python scripts/linkml_validators.py \
  schemas/20251121/examples/validation_tests/invalid_bidirectional_violation.yaml
# ❌ Expected: "Validation failed with 2 errors"
```
---
## Integration Patterns
### Pattern 1: Validate on Data Import
```python
def import_heritage_custodian(yaml_path):
    """Import and validate a heritage custodian YAML file."""
    import yaml
    from linkml_validators import validate_all

    # Load YAML
    with open(yaml_path, 'r') as f:
        data = yaml.safe_load(f)

    # Validate FIRST
    errors = validate_all(data)
    if errors:
        raise ValueError(f"Validation failed: {errors}")

    # Process data (convert to RDF, store in database, etc.)
    process_data(data)
```
---
### Pattern 2: Pre-commit Hook
**File**: `.git/hooks/pre-commit`
```bash
#!/bin/bash
# Validate all staged YAML files before commit
echo "Validating heritage custodian YAML files..."

# Find all staged YAML files in data/instances/
staged_files=$(git diff --cached --name-only --diff-filter=ACM | grep "data/instances/.*\.yaml$")

if [ -z "$staged_files" ]; then
  echo "No YAML files staged, skipping validation."
  exit 0
fi

# Validate each file
for file in $staged_files; do
  echo "  Validating $file..."
  python scripts/linkml_validators.py "$file"
  if [ $? -ne 0 ]; then
    echo "❌ Validation failed for $file"
    echo "Commit aborted. Fix validation errors and try again."
    exit 1
  fi
done

echo "✅ All YAML files validated successfully."
exit 0
```
**Installation**:
```bash
chmod +x .git/hooks/pre-commit
```
---
### Pattern 3: Batch Validation
```python
def validate_directory(directory_path):
    """Validate all YAML files in a directory."""
    import os
    import yaml
    from linkml_validators import validate_all

    results = {"passed": [], "failed": []}

    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith('.yaml'):
                yaml_path = os.path.join(root, file)
                with open(yaml_path, 'r') as f:
                    data = yaml.safe_load(f)

                errors = validate_all(data)
                if errors:
                    results["failed"].append({
                        "file": yaml_path,
                        "errors": errors
                    })
                else:
                    results["passed"].append(yaml_path)

    # Report results
    print(f"✅ Passed: {len(results['passed'])} files")
    print(f"❌ Failed: {len(results['failed'])} files")
    for failure in results["failed"]:
        print(f"\n{failure['file']}:")
        for error in failure["errors"]:
            print(f"  - {error.message}")
    return results
```
---
## Comparison with Other Approaches
### LinkML vs. Python Validator (Phase 5)
| Feature | LinkML Validators | Phase 5 Python Validator |
|---------|-------------------|--------------------------|
| **Input** | YAML instances | RDF triples (after conversion) |
| **Speed** | ⚡ Fast (ms) | 🐢 Moderate (sec) |
| **Error Location** | YAML field names | RDF triple patterns |
| **Use Case** | Development, CI/CD | Post-conversion validation |
| **Integration** | Data pipeline ingestion | RDF quality assurance |
**Recommendation**: Use **both** for defense-in-depth validation.
---
### LinkML vs. SHACL (Phase 7)
| Feature | LinkML Validators | SHACL Shapes |
|---------|-------------------|--------------|
| **Input** | YAML instances | RDF graphs |
| **Validation Time** | Before RDF conversion | During RDF ingestion |
| **Error Messages** | Python-friendly | RDF-centric |
| **Extensibility** | Python code | SPARQL-based constraints |
| **Standards** | LinkML metamodel | W3C SHACL standard |
| **Use Case** | Development | Triple store ingestion |
**Recommendation**:
- Use **LinkML** for early validation (development phase)
- Use **SHACL** for production validation (RDF ingestion)
---
### LinkML vs. SPARQL Queries (Phase 6)
| Feature | LinkML Validators | SPARQL Queries |
|---------|-------------------|----------------|
| **Input** | YAML instances | RDF triple store |
| **Timing** | Before RDF conversion | After data is stored |
| **Purpose** | **Prevention** | **Detection** |
| **Speed** | ⚡ Fast (ms) | 🐢 Slow (depends on data size) |
| **Use Case** | Data quality gates | Monitoring, auditing |
**Recommendation**:
- Use **LinkML** to **prevent** invalid data from entering system
- Use **SPARQL** to **detect** existing violations in production data
---
## Troubleshooting
### Issue 1: "Missing required field" Error
**Symptom**:
```
ValueError: Missing required field: name
```
**Cause**: YAML instance is missing a required field defined in the schema.
**Solution**:
```yaml
# ❌ Missing required field
id: https://example.org/custodian/001
description: Some museum

# ✅ Add required field
id: https://example.org/custodian/001
name: Example Museum        # ← Add this
description: Some museum
```
---
### Issue 2: "Expected date, got string" Error
**Symptom**:
```
ValueError: Expected date, got string '2000/01/01'
```
**Cause**: Date format doesn't match ISO 8601 pattern (`YYYY-MM-DD`).
**Solution**:
```yaml
# ❌ Wrong date format
valid_from: "2000/01/01"    # Slashes instead of hyphens

# ✅ Correct date format
valid_from: "2000-01-01"    # ISO 8601: YYYY-MM-DD
```
---
### Issue 3: Validation Passes but SHACL Fails
**Symptom**: LinkML validation passes, but SHACL validation fails with the same data.
**Cause**: LinkML validators check **YAML structure**, SHACL validates **RDF graph patterns**. Some constraints (e.g., inverse relationships) may be implicit in YAML but explicit in RDF.
**Solution**: Ensure YAML data includes **all required inverse relationships**:
```yaml
# ✅ Explicit bidirectional relationships in YAML
organizational_structure:
  - id: unit-001
    manages_collections:    # ← Inverse relationship
      - collection-001

collections_aspect:
  - id: collection-001
    managed_by_unit:        # ← Forward relationship
      - unit-001
```
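When instances are authored by hand, these inverses can also be derived mechanically before RDF conversion. A hedged sketch (the helper below is illustrative and not part of the project scripts):

```python
# Illustrative helper: derive unit.manages_collections from collection.managed_by_unit.
def fill_inverse_relationships(data: dict) -> dict:
    """Add missing inverse references in place and return the data."""
    units_by_id = {u["id"]: u for u in data.get("organizational_structure", [])}
    for collection in data.get("collections_aspect", []):
        for unit_id in collection.get("managed_by_unit", []):
            unit = units_by_id.get(unit_id)
            if unit is None:
                continue  # dangling reference; leave it for the validators to report
            inverse = unit.setdefault("manages_collections", [])
            if collection["id"] not in inverse:
                inverse.append(collection["id"])
    return data
```

Running the bidirectional validators afterwards should then report no missing inverses for collections.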
---
### Issue 4: "List index out of range" or "KeyError"
**Symptom**: Python exception during validation.
**Cause**: YAML structure doesn't match expected schema (e.g., missing nested fields).
**Solution**: Use defensive programming in custom validators:
```python
# ❌ Unsafe access: raises KeyError / IndexError on malformed input
unit_valid_from = data['organizational_structure'][0]['valid_from']

# ✅ Safe access with defaults
units = data.get('organizational_structure', [])
unit_valid_from = units[0].get('valid_from') if units else None
```
---
### Issue 5: Slow Validation Performance
**Symptom**: Validation takes a long time for large datasets.
**Cause**: Custom validators may have O(n²) complexity when checking relationships.
**Solution**: Use indexed lookups:
```python
# ❌ Slow: O(n²) nested loops
for collection in collections:
    for unit in units:
        if unit['id'] in collection['managed_by_unit']:
            check_dates(collection, unit)     # illustrative per-pair relationship check

# ✅ Fast: O(n) with dict lookup
unit_dates = {unit['id']: unit['valid_from'] for unit in units}
for collection in collections:
    for unit_id in collection['managed_by_unit']:
        unit_date = unit_dates.get(unit_id)   # O(1) lookup
```
---
## Summary
**LinkML Constraints Capabilities**:
**Built-in Constraints** (declarative):
- Required fields (`required: true`)
- Data types (`range: date`, `range: float`)
- Regex patterns (`pattern: "^\\d{4}-\\d{2}-\\d{2}$"`)
- Cardinality (`multivalued: true/false`)
- Min/max values (`minimum_value`, `maximum_value`)
**Custom Validators** (programmatic):
- Temporal consistency (collections/staff must not predate their units)
- Bidirectional relationships (forward ↔ inverse)
- Complex business rules (Python functions)
**Integration**:
- Command-line interface (`linkml_validators.py`)
- Python API (`import linkml_validators`)
- CI/CD workflows (GitHub Actions, pre-commit hooks)
- Data pipelines (validate before RDF conversion)
**Test Suite**:
- Valid example (passes all rules)
- Temporal violations (fails Rules 1 & 4)
- Bidirectional violations (fails Rules 2 & 5)
**Next Steps**:
1. **Phase 8 Complete**: LinkML constraints documented
2. **Phase 9**: Apply validators to real-world heritage institution data
3. **Performance Testing**: Benchmark validation speed on large datasets (10K+ institutions)
4. **Additional Rules**: Extend validators for custody transfer events, legal form constraints
---
## References
- **Phase 5**: `docs/VALIDATION_RULES.md` (Python validator)
- **Phase 6**: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md` (SPARQL queries)
- **Phase 7**: `docs/SHACL_VALIDATION_SHAPES.md` (SHACL shapes)
- **Phase 8**: This document (LinkML constraints)
- **Schema**: `schemas/20251121/linkml/01_custodian_name_modular.yaml`
- **Validators**: `scripts/linkml_validators.py`
- **Test Suite**: `schemas/20251121/examples/validation_tests/`
- **LinkML Documentation**: https://linkml.io/
---
**Version**: 1.0
**Phase**: 8 (Complete)
**Date**: 2025-11-22