glam/docs/SHACL_VALIDATION_SHAPES.md
kempersc 6eb18700f0 Add SHACL validation shapes and validation script for Heritage Custodian Ontology
- Created SHACL shapes for validating temporal consistency and bidirectional relationships in custodial collections and staff observations.
- Implemented a Python script to validate RDF data against the defined SHACL shapes using the pyshacl library.
- Added command-line interface for validation with options for specifying data formats and output reports.
- Included detailed error handling and reporting for validation results.
2025-11-22 23:22:10 +01:00


# SHACL Validation Shapes for Heritage Custodian Ontology
**Version**: 1.0.0
**Schema Version**: v0.7.0
**Created**: 2025-11-22
**SHACL Spec**: https://www.w3.org/TR/shacl/
---
## Table of Contents
1. [Overview](#overview)
2. [Installation](#installation)
3. [Usage](#usage)
4. [Validation Rules](#validation-rules)
5. [Shape Definitions](#shape-definitions)
6. [Examples](#examples)
7. [Integration](#integration)
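8. [Comparison with Python Validator](#comparison-with-python-validator)
9. [Advanced Usage](#advanced-usage)
10. [Troubleshooting](#troubleshooting)
11. [References](#references)
12. [Next Steps](#next-steps)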
---
## Overview
This document describes the **SHACL (Shapes Constraint Language)** validation shapes for the Heritage Custodian Ontology. Applied at RDF ingestion time, the shapes enforce data quality constraints so that invalid data can be rejected before it enters a triple store.
### What is SHACL?
**SHACL** is a W3C Recommendation for validating RDF graphs against a set of conditions (shapes). SPARQL queries run against a store **detect** violations after the data is already in place; SHACL shapes applied during loading can **prevent** those violations from being stored at all.
### Benefits of SHACL Validation
- **Prevention over Detection**: Reject invalid data before storage
- **Standardized Reports**: Machine-readable validation results
- **Triple Store Integration**: Native support in GraphDB, Jena, Virtuoso
- **Declarative Constraints**: Express rules in RDF (no external scripts)
- **Detailed Error Messages**: Precise identification of failing triples
---
## Installation
### Prerequisites
Install Python dependencies:
```bash
pip install pyshacl rdflib
```
**Libraries**:
- **pyshacl** (v0.25.0+): SHACL validator for Python
- **rdflib** (v7.0.0+): RDF graph library
### Verify Installation
```bash
python3 -c "import pyshacl; print(pyshacl.__version__)"
# Expected output: 0.25.0 (or later)
```
---
## Usage
### Command Line Validation
**Basic Usage**:
```bash
python scripts/validate_with_shacl.py data.ttl
```
**With Custom Shapes**:
```bash
python scripts/validate_with_shacl.py data.ttl --shapes custom_shapes.ttl
```
**Different RDF Formats**:
```bash
# JSON-LD data
python scripts/validate_with_shacl.py data.jsonld --format jsonld
# N-Triples data
python scripts/validate_with_shacl.py data.nt --format nt
```
**Save Validation Report**:
```bash
python scripts/validate_with_shacl.py data.ttl --output report.ttl
```
**Verbose Output**:
```bash
python scripts/validate_with_shacl.py data.ttl --verbose
```
### Python Library Usage
```python
from scripts.validate_with_shacl import validate_file

# Validate with default shapes
if validate_file("data.ttl"):
    print("✅ Data is valid")
else:
    print("❌ Data has violations")

# Validate with custom shapes
if validate_file("data.ttl", shapes_file="custom_shapes.ttl"):
    print("✅ Valid")
```
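For readers who want to call pyshacl without the wrapper script, the following is a minimal sketch. It makes no claim about how `validate_with_shacl.py` is implemented; the helper names `guess_format` and `run_shacl` and the suffix table are hypothetical, while `pyshacl.validate` and `rdflib.Graph.parse` are the libraries' public entry points.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to an rdflib parser name.
FORMAT_BY_SUFFIX = {".ttl": "turtle", ".nt": "nt", ".jsonld": "json-ld", ".rdf": "xml"}


def guess_format(path: str) -> str:
    """Return an rdflib format name for a file path, defaulting to Turtle."""
    return FORMAT_BY_SUFFIX.get(Path(path).suffix.lower(), "turtle")


def run_shacl(data_path: str, shapes_path: str) -> bool:
    """Validate data_path against shapes_path and print the text report on failure."""
    # Imported here so that guess_format stays usable without pyshacl installed.
    from pyshacl import validate
    from rdflib import Graph

    data = Graph().parse(data_path, format=guess_format(data_path))
    shapes = Graph().parse(shapes_path, format=guess_format(shapes_path))
    # validate() returns (conforms, report graph, human-readable report text).
    conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
    if not conforms:
        print(report_text)
    return conforms
```

A call such as `run_shacl("data.ttl", "schemas/20251121/shacl/custodian_validation_shapes.ttl")` would then return `True` only when the report conforms.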
### Triple Store Integration
**Apache Jena Fuseki**:
```bash
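# Fuseki does not validate SPARQL UPDATE requests automatically.
# Enable the SHACL service on the dataset (fuseki:operation fuseki:shacl),
# then POST the shapes to it to validate a graph that is already loaded.
# The dataset name "ds" and port 3030 are placeholders.
curl -XPOST --data-binary @custodian_validation_shapes.ttl \
     --header 'Content-type: text/turtle' \
     'http://localhost:3030/ds/shacl?graph=default'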
```
**GraphDB**:
1. Create repository with SHACL validation enabled
2. Import shapes file into the reserved shapes graph: `http://rdf4j.org/schema/rdf4j#SHACLShapeGraph`
3. GraphDB validates all data changes automatically
---
## Validation Rules
This SHACL shapes file implements **5 core validation rules** from Phase 5:
| Rule ID | Name | Severity | Description |
|---------|------|----------|-------------|
| **Rule 1** | Collection-Unit Temporal Consistency | ERROR | Collection custody dates must fall within managing unit's validity period |
| **Rule 2** | Collection-Unit Bidirectional | ERROR | Collection → unit must have inverse unit → collection |
| **Rule 3** | Custody Transfer Continuity | WARNING | Custody transfers must be continuous (no gaps/overlaps) |
| **Rule 4** | Staff-Unit Temporal Consistency | ERROR | Staff employment dates must fall within unit's validity period |
| **Rule 5** | Staff-Unit Bidirectional | ERROR | Person → unit must have inverse unit → person |
Plus **3 additional shapes** for type and format constraints.
---
## Shape Definitions
### Rule 1: Collection-Unit Temporal Consistency
**Shape ID**: `custodian:CollectionUnitTemporalConsistencyShape`
**Target**: All instances of `custodian:CustodianCollection`
**Constraints**:
#### Constraint 1.1: Collection Starts After Unit Founding
```turtle
sh:sparql [
    sh:message "Collection valid_from ({?collectionStart}) must be >= managing unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?collectionStart ?unitStart ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_from ?collectionStart .
            ?managingUnit custodian:valid_from ?unitStart .
            # VIOLATION: Collection starts before unit exists
            FILTER(?collectionStart < ?unitStart)
        }
    """ ;
] .
```
**Example Violation**:
```turtle
# Unit founded 2010
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date .

# Collection started 2005 (INVALID!)
<https://example.org/collection/col-1>
    a custodian:CustodianCollection ;
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2005-01-01"^^xsd:date .
```
**Violation Report**:
```
❌ Validation Result [Constraint Component: sh:SPARQLConstraintComponent]
Severity: sh:Violation
Message: Collection valid_from (2005-01-01) must be >= managing unit valid_from (2010-01-01)
Focus Node: https://example.org/collection/col-1
```
---
#### Constraint 1.2: Collection Ends Before Unit Dissolution
```turtle
sh:sparql [
    sh:message "Collection valid_to ({?collectionEnd}) must be <= managing unit valid_to ({?unitEnd})" ;
    sh:select """
        SELECT $this ?collectionEnd ?unitEnd ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_to ?collectionEnd .
            ?managingUnit custodian:valid_to ?unitEnd .
            # Unit is dissolved
            FILTER(BOUND(?unitEnd))
            # VIOLATION: Collection custody ends after unit dissolution
            FILTER(?collectionEnd > ?unitEnd)
        }
    """ ;
] .
```
**Example Violation**:
```turtle
# Unit dissolved 2020
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date ;
    custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ended 2023 (INVALID!)
<https://example.org/collection/col-1>
    a custodian:CustodianCollection ;
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2015-01-01"^^xsd:date ;
    custodian:valid_to "2023-06-01"^^xsd:date .
```
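The logic of Constraints 1.1 and 1.2 can be restated outside SPARQL. The sketch below is an illustration only, using plain `datetime.date` values in place of RDF literals; the function name `temporal_violations` is hypothetical and is not part of the shapes file or the validation script.

```python
from datetime import date
from typing import Optional


def temporal_violations(
    coll_from: date,
    coll_to: Optional[date],
    unit_from: date,
    unit_to: Optional[date],
) -> list[str]:
    """Return one message per violated constraint (1.1 and/or 1.2)."""
    problems = []
    # Constraint 1.1: custody may not start before the unit exists.
    if coll_from < unit_from:
        problems.append(f"valid_from {coll_from} precedes unit valid_from {unit_from}")
    # Constraint 1.2: only checked when both end dates are present.
    if coll_to is not None and unit_to is not None and coll_to > unit_to:
        problems.append(f"valid_to {coll_to} follows unit valid_to {unit_to}")
    return problems


# The two example violations from this section, expressed as plain dates.
print(temporal_violations(date(2005, 1, 1), None, date(2010, 1, 1), None))
print(temporal_violations(date(2015, 1, 1), date(2023, 6, 1),
                          date(2010, 1, 1), date(2020, 12, 31)))
```

Like the SPARQL version, the end-date check is skipped when either end date is missing, which is why ongoing custody needs the separate warning described next.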
---
#### Warning: Ongoing Custody After Unit Dissolution
```turtle
sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Collection has ongoing custody but managing unit was dissolved" ;
    sh:select """
        SELECT $this ?managingUnit ?unitEnd
        WHERE {
            $this custodian:managing_unit ?managingUnit .
            # Collection has no end date (ongoing)
            FILTER NOT EXISTS { $this custodian:valid_to ?collectionEnd }
            # But unit is dissolved
            ?managingUnit custodian:valid_to ?unitEnd .
        }
    """ ;
] .
```
**Example Warning**:
```turtle
# Unit dissolved 2020
<https://example.org/unit/dept-1>
    custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ongoing (WARNING!)
<https://example.org/collection/col-1>
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2015-01-01"^^xsd:date .
    # No valid_to → custody still active
```
**Interpretation**: Collection likely transferred to another unit but custody history not updated.
---
### Rule 2: Collection-Unit Bidirectional Relationships
**Shape ID**: `custodian:CollectionUnitBidirectionalShape`
**Target**: All instances of `custodian:CustodianCollection`
**Constraint**: If collection references `managing_unit`, unit must reference collection in `managed_collections`.
```turtle
sh:sparql [
    sh:message "Collection references managing_unit {?unit} but unit does not list collection in managed_collections" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:managing_unit ?unit .
            # VIOLATION: Unit does not reference collection back
            FILTER NOT EXISTS {
                ?unit custodian:managed_collections $this
            }
        }
    """ ;
] .
```
**Example Violation**:
```turtle
# Collection references unit
<https://example.org/collection/col-1>
    custodian:managing_unit <https://example.org/unit/dept-1> .

# But unit does NOT reference collection (INVALID!)
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure .
    # Missing: custodian:managed_collections <https://example.org/collection/col-1>
```
**Fix**:
```turtle
# Add inverse relationship
<https://example.org/unit/dept-1>
    custodian:managed_collections <https://example.org/collection/col-1> .
```
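The bidirectional rule amounts to checking that every forward link has a matching inverse. As a rough sketch, assuming the links have already been extracted into plain dictionaries (the function name and both data structures are hypothetical, and one managing unit per collection is assumed for brevity):

```python
def missing_inverses(
    managing_unit: dict[str, str],
    managed_collections: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (collection, unit) pairs whose inverse link is absent."""
    return [
        (collection, unit)
        for collection, unit in managing_unit.items()
        # Violation when the unit does not list the collection back.
        if collection not in managed_collections.get(unit, set())
    ]


forward = {"col-1": "dept-1"}
print(missing_inverses(forward, {"dept-1": set()}))      # the violation shown in this section
print(missing_inverses(forward, {"dept-1": {"col-1"}}))  # after adding the inverse link
```

The same pattern applies to Rule 5, with `unit_affiliation` as the forward link and `staff_members` or `org:hasMember` as the inverse.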
---
### Rule 3: Custody Transfer Continuity
**Shape ID**: `custodian:CustodyTransferContinuityShape`
**Target**: All instances of `custodian:CustodianCollection`
**Constraints**:
#### Check for Gaps in Custody Chain
```turtle
sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Custody gap detected: previous custody ended on {?prevEnd} but next custody started on {?nextStart}" ;
    sh:select """
        SELECT $this ?prevEnd ?nextStart ?gapDays
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .
            ?event1 custodian:transfer_date ?prevEnd .
            ?event2 custodian:transfer_date ?nextStart .
            FILTER(?nextStart > ?prevEnd)
            BIND((xsd:date(?nextStart) - xsd:date(?prevEnd)) AS ?gapDays)
            # WARNING: Gap > 1 day
            FILTER(?gapDays > 1)
        }
    """ ;
] .
```
**Note**: Subtracting two `xsd:date` values is not defined by SPARQL 1.1 itself. Engines that support it as an extension typically return a duration rather than a number, so the `?gapDays > 1` comparison may need adapting to the engine in use.
**Example Warning**:
```turtle
<https://example.org/collection/col-1>
    custodian:custody_history <https://example.org/event/transfer-1> ;
    custodian:custody_history <https://example.org/event/transfer-2> .

<https://example.org/event/transfer-1>
    custodian:transfer_date "2010-01-01"^^xsd:date .

<https://example.org/event/transfer-2>
    custodian:transfer_date "2010-02-01"^^xsd:date .
# Gap of 31 days between transfers
```
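The 31-day figure in the example can be reproduced with ordinary date arithmetic. The sketch below is a simplified illustration: unlike the SPARQL query, which compares every pair of events, it sorts the dates and compares consecutive ones only. The function name `custody_gaps` and its `max_gap_days` parameter are hypothetical.

```python
from datetime import date


def custody_gaps(
    transfer_dates: list[date], max_gap_days: int = 1
) -> list[tuple[date, date, int]]:
    """Return (previous, next, gap_in_days) for consecutive transfers too far apart."""
    ordered = sorted(transfer_dates)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap_days = (nxt - prev).days  # timedelta -> whole days
        if gap_days > max_gap_days:
            gaps.append((prev, nxt, gap_days))
    return gaps


# The two transfers from the example warning.
print(custody_gaps([date(2010, 1, 1), date(2010, 2, 1)]))
```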
---
#### Check for Overlaps in Custody Chain
```turtle
sh:sparql [
    sh:message "Custody overlap detected: collection managed by {?custodian1} until {?end1} and simultaneously by {?custodian2} from {?start2}" ;
    sh:select """
        SELECT $this ?custodian1 ?end1 ?custodian2 ?start2
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .
            ?event1 custodian:new_custodian ?custodian1 ;
                    custodian:custody_end_date ?end1 .
            ?event2 custodian:new_custodian ?custodian2 ;
                    custodian:transfer_date ?start2 .
            FILTER(?custodian1 != ?custodian2)
            FILTER(?start2 < ?end1)  # Overlap!
        }
    """ ;
] .
```
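A plain-Python sketch of the overlap test may help when reasoning about edge cases. Note that the query above only tests `?start2 < ?end1`; the sketch additionally requires the second period to start no earlier than the first, an assumption added here so that a strictly earlier custody is not reported as overlapping a later one. The `Custody` type and `overlapping` function are hypothetical names.

```python
from datetime import date
from typing import NamedTuple


class Custody(NamedTuple):
    custodian: str
    start: date  # corresponds to transfer_date
    end: date    # corresponds to custody_end_date


def overlapping(periods: list[Custody]) -> list[tuple[str, str]]:
    """Return (first, second) custodian pairs where second starts inside first's period."""
    hits = []
    for first in periods:
        for second in periods:
            if (first.custodian != second.custodian
                    and first.start <= second.start < first.end):
                hits.append((first.custodian, second.custodian))
    return hits


a = Custody("unit-a", date(2010, 1, 1), date(2015, 1, 1))
b = Custody("unit-b", date(2014, 6, 1), date(2020, 1, 1))
print(overlapping([a, b]))
```

A handover on the same day (one custody ending exactly when the next begins) is not treated as an overlap in this sketch.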
---
### Rule 4: Staff-Unit Temporal Consistency
**Shape ID**: `custodian:StaffUnitTemporalConsistencyShape`
**Target**: All instances of `custodian:PersonObservation`
**Constraints**: Same as Rule 1, but for staff employment dates vs. unit validity period.
#### Constraint 4.1: Employment Starts After Unit Founding
```turtle
sh:sparql [
    sh:message "Staff employment_start_date ({?employmentStart}) must be >= unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?employmentStart ?unitStart ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit ;
                  custodian:employment_start_date ?employmentStart .
            ?unit custodian:valid_from ?unitStart .
            FILTER(?employmentStart < ?unitStart)
        }
    """ ;
] .
```
**Example Violation**:
```turtle
# Unit founded 2015
<https://example.org/unit/dept-1>
    custodian:valid_from "2015-01-01"^^xsd:date .

# Staff employed 2010 (INVALID!)
<https://example.org/person/john-doe>
    custodian:unit_affiliation <https://example.org/unit/dept-1> ;
    custodian:employment_start_date "2010-01-01"^^xsd:date .
```
---
### Rule 5: Staff-Unit Bidirectional Relationships
**Shape ID**: `custodian:StaffUnitBidirectionalShape`
**Target**: All instances of `custodian:PersonObservation`
**Constraint**: If person references `unit_affiliation`, unit must reference person in `staff_members` or `org:hasMember`.
```turtle
sh:sparql [
    sh:message "Person references unit_affiliation {?unit} but unit does not list person in staff_members" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit .
            # VIOLATION: Unit does not reference person back
            FILTER NOT EXISTS {
                { ?unit custodian:staff_members $this }
                UNION
                { ?unit org:hasMember $this }
            }
        }
    """ ;
] .
```
---
### Additional Shapes: Type and Format Constraints
#### Type Constraint: managing_unit Must Be OrganizationalStructure
```turtle
custodian:CollectionManagingUnitTypeShape
    sh:property [
        sh:path custodian:managing_unit ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "managing_unit must be an instance of OrganizationalStructure" ;
    ] .
```
#### Type Constraint: unit_affiliation Must Be OrganizationalStructure
```turtle
custodian:PersonUnitAffiliationTypeShape
    sh:property [
        sh:path custodian:unit_affiliation ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "unit_affiliation must be an instance of OrganizationalStructure" ;
    ] .
```
#### Format Constraint: Dates Must Be xsd:date or xsd:dateTime
```turtle
custodian:DatetimeFormatShape
    sh:property [
        sh:path custodian:valid_from ;
        sh:or (
            [ sh:datatype xsd:date ]
            [ sh:datatype xsd:dateTime ]
        ) ;
    ] .
```
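`sh:datatype` checks both the literal's datatype IRI and that its lexical form is well-formed for that datatype. The sketch below approximates that two-part check in plain Python, for example as a sanity check before RDF conversion. It is an approximation under stated assumptions: Python's ISO parsers are more lenient than the XSD lexical rules in some cases and stricter in others (time zones on dates, for instance), and the names `PARSERS` and `is_valid_temporal` are hypothetical.

```python
from datetime import date, datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

# Accepted datatypes, each paired with a parser for its lexical form.
PARSERS = {
    XSD + "date": date.fromisoformat,
    XSD + "dateTime": datetime.fromisoformat,
}


def is_valid_temporal(lexical: str, datatype: str) -> bool:
    """True if datatype is xsd:date or xsd:dateTime and the lexical form parses."""
    parser = PARSERS.get(datatype)
    if parser is None:
        return False  # wrong datatype, e.g. xsd:string
    try:
        parser(lexical)
    except ValueError:
        return False  # ill-formed lexical form
    return True


print(is_valid_temporal("2010-01-01", XSD + "date"))
print(is_valid_temporal("2010-01-01", XSD + "string"))
```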
---
## Examples
### Example 1: Valid Collection-Unit Relationship
**Valid RDF Data**:
```turtle
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/unit/paintings-dept>
    a custodian:OrganizationalStructure ;
    custodian:unit_name "Paintings Department" ;
    custodian:valid_from "1985-01-01"^^xsd:date ;
    custodian:managed_collections <https://example.org/collection/dutch-paintings> .

<https://example.org/collection/dutch-paintings>
    a custodian:CustodianCollection ;
    custodian:collection_name "Dutch Paintings" ;
    custodian:managing_unit <https://example.org/unit/paintings-dept> ;
    custodian:valid_from "1995-01-01"^^xsd:date .
```
**Validation**:
```bash
python scripts/validate_with_shacl.py valid_data.ttl
# ✅ VALIDATION PASSED
# No constraint violations found.
```
---
### Example 2: Invalid - Temporal Violation
**Invalid RDF Data**:
```turtle
<https://example.org/unit/paintings-dept>
    custodian:valid_from "1985-01-01"^^xsd:date .

<https://example.org/collection/dutch-paintings>
    custodian:managing_unit <https://example.org/unit/paintings-dept> ;
    custodian:valid_from "1970-01-01"^^xsd:date .  # Before unit exists!
```
**Validation**:
```bash
python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result [Constraint Component: sh:SPARQLConstraintComponent]:
# Severity: sh:Violation
# Message: Collection valid_from (1970-01-01) must be >= managing unit valid_from (1985-01-01)
# Focus Node: https://example.org/collection/dutch-paintings
# Result Path: -
# Source Shape: custodian:CollectionUnitTemporalConsistencyShape
```
---
### Example 3: Invalid - Missing Bidirectional Relationship
**Invalid RDF Data**:
```turtle
<https://example.org/collection/dutch-paintings>
    custodian:managing_unit <https://example.org/unit/paintings-dept> .

<https://example.org/unit/paintings-dept>
    a custodian:OrganizationalStructure .
    # Missing: custodian:managed_collections <https://example.org/collection/dutch-paintings>
```
**Validation**:
```bash
python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result:
# Severity: sh:Violation
# Message: Collection references managing_unit https://example.org/unit/paintings-dept
# but unit does not list collection in managed_collections
# Focus Node: https://example.org/collection/dutch-paintings
```
---
## Integration
### CI/CD Pipeline Integration
**GitHub Actions Example**:
```yaml
name: SHACL Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install pyshacl rdflib
      - name: Validate RDF data
        run: |
          # One report per input file; the step stops at the first non-zero exit code.
          # Assumes the script writes the report before exiting.
          mkdir -p reports
          for file in data/instances/*.ttl; do
            python scripts/validate_with_shacl.py "$file" \
              --output "reports/$(basename "$file" .ttl)_report.ttl"
          done
      - name: Upload validation reports
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: reports/
```
---
### Pre-commit Hook
**`.git/hooks/pre-commit`**:
```bash
#!/bin/bash
# Validate RDF files before commit
echo "Running SHACL validation..."
for file in data/instances/*.ttl; do
    # Skip the unexpanded pattern when no .ttl files exist
    [ -e "$file" ] || continue
    if ! python scripts/validate_with_shacl.py "$file" --quiet; then
        echo "❌ SHACL validation failed for $file"
        echo "Fix violations before committing."
        exit 1
    fi
done
echo "✅ All files pass SHACL validation"
exit 0
```
---
## Comparison with Python Validator
### Phase 5 Python Validator vs. Phase 7 SHACL Shapes
| Aspect | Python Validator (Phase 5) | SHACL Shapes (Phase 7) |
|--------|---------------------------|------------------------|
| **Input Format** | YAML (LinkML instances) | RDF (Turtle, JSON-LD, etc.) |
| **Execution** | Standalone script | Triple store integrated OR pyshacl |
| **Performance** | Fast for <1,000 records | Optimized for >10,000 records |
| **Deployment** | Python runtime required | RDF triple store native |
| **Error Messages** | Custom CLI output | Standardized SHACL reports |
| **CI/CD** | Exit codes (0/1/2) | Exit codes (0/1/2) + RDF report |
| **Use Case** | Development validation | Production runtime validation |
### When to Use Which?
**Use Python Validator** (`validate_temporal_consistency.py`):
- ✅ During schema development (fast feedback on YAML instances)
- ✅ Pre-commit hooks for LinkML files
- ✅ Unit testing LinkML examples
- ✅ Before RDF conversion
**Use SHACL Shapes** (`validate_with_shacl.py`):
- ✅ Production RDF triple stores (GraphDB, Fuseki)
- ✅ Data ingestion pipelines
- ✅ Continuous monitoring (real-time validation)
- ✅ After RDF conversion (final quality gate)
**Best Practice**: Use **both**:
1. Python validator during development (YAML → validate → RDF)
2. SHACL shapes in production (RDF → validate → store)
---
## Advanced Usage
### Generate Validation Report
```bash
python scripts/validate_with_shacl.py data.ttl --output report.ttl
```
**Report Format** (Turtle):
```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .

[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode <https://example.org/collection/col-1> ;
        sh:resultMessage "Collection valid_from (1970-01-01) must be >= ..." ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
        sh:sourceShape custodian:CollectionUnitTemporalConsistencyShape
    ]
] .
```
---
### Custom Severity Levels
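SHACL defines three severity levels. The comments below describe how this project treats each one; under the SHACL specification itself, a validation result of any severity makes the report non-conforming, so whether a warning blocks loading is decided by the validating tool or pipeline (pyshacl, for example, has an `allow_warnings` option):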
```turtle
sh:severity sh:Violation ; # ERROR (blocks data loading)
sh:severity sh:Warning ; # WARNING (logged but allowed)
sh:severity sh:Info ; # INFO (informational only)
```
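How a pipeline turns severities into a load/reject decision is a policy choice rather than something SHACL prescribes. A minimal sketch of such a policy, assuming the severities of all results have already been read from the report (the names `RANK` and `blocks_loading` are hypothetical):

```python
SH = "http://www.w3.org/ns/shacl#"

# Order the three SHACL severities from least to most serious.
RANK = {SH + "Info": 0, SH + "Warning": 1, SH + "Violation": 2}


def blocks_loading(result_severities: list[str], threshold: str = SH + "Violation") -> bool:
    """Return True if any result is at least as severe as the threshold."""
    return any(RANK[severity] >= RANK[threshold] for severity in result_severities)


# Warnings alone are allowed under the default policy ...
print(blocks_loading([SH + "Warning"]))
# ... but a stricter pipeline can lower the threshold.
print(blocks_loading([SH + "Warning"], threshold=SH + "Warning"))
```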
**Example**: Custody gap is a **warning** (data quality issue but not invalid):
```turtle
custodian:CustodyTransferContinuityShape
    sh:sparql [
        sh:severity sh:Warning ;  # Allow data but log warning
        sh:message "Custody gap detected..." ;
        ...
    ] .
```
---
### Extending Shapes
Add custom validation rules by creating new shapes:
```turtle
# Custom rule: Collection name must not be empty
custodian:CollectionNameNotEmptyShape
    a sh:NodeShape ;
    sh:targetClass custodian:CustodianCollection ;
    sh:property [
        sh:path custodian:collection_name ;
        sh:minLength 1 ;
        sh:message "Collection name must not be empty" ;
    ] .
```
---
## Troubleshooting
### Common Issues
#### Issue 1: "pyshacl not found"
**Solution**:
```bash
pip install pyshacl rdflib
```
#### Issue 2: "Parse error: Invalid Turtle syntax"
**Solution**: Validate RDF syntax first:
```bash
rdfpipe -i turtle data.ttl > /dev/null
# If errors, fix syntax before SHACL validation
```
#### Issue 3: "No violations found but data is clearly invalid"
**Solution**: Check namespace prefixes match between shapes and data:
```turtle
# Shapes file uses:
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
# Data file must use same namespace:
<https://nde.nl/ontology/hc/custodian/CustodianCollection>
```
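A quick way to spot this kind of mismatch is to compare the `@prefix` declarations of the two files. The sketch below is a rough aid, not a Turtle parser: it only sees `@prefix` lines (not SPARQL-style `PREFIX` or fully written IRIs), and the function names are hypothetical.

```python
import re

# Matches Turtle declarations such as: @prefix custodian: <https://...> .
PREFIX_RE = re.compile(r"@prefix\s+([\w-]*):\s*<([^>]*)>")


def prefixes(turtle_text: str) -> dict[str, str]:
    """Map each declared prefix name to its namespace IRI."""
    return dict(PREFIX_RE.findall(turtle_text))


def prefix_mismatches(shapes_ttl: str, data_ttl: str) -> dict[str, tuple[str, str]]:
    """Return prefixes declared in both texts but bound to different namespaces."""
    shapes, data = prefixes(shapes_ttl), prefixes(data_ttl)
    return {
        name: (shapes[name], data[name])
        for name in shapes.keys() & data.keys()
        if shapes[name] != data[name]
    }


shapes_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian/> ."
data_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian#> ."
print(prefix_mismatches(shapes_text, data_text))
```

An empty result does not prove the namespaces match, since data may spell out full IRIs without declaring a prefix at all.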
---
## References
- **SHACL Specification**: https://www.w3.org/TR/shacl/
- **pyshacl Documentation**: https://github.com/RDFLib/pySHACL
- **SHACL Advanced Features**: https://www.w3.org/TR/shacl-af/
- **Python Validator (Phase 5)**: `scripts/validate_temporal_consistency.py`
- **SPARQL Queries (Phase 6)**: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md`
- **Schema (v0.7.0)**: `schemas/20251121/linkml/01_custodian_name_modular.yaml`
---
## Next Steps
### Phase 8: LinkML Schema Constraints
Embed validation rules directly into LinkML schema using:
- `minimum_value` / `maximum_value` for date comparisons
- `pattern` for format validation
- Custom validators with Python functions
- Slot-level constraints
**Goal**: Validate at **schema definition** level, not just RDF level.
---
**Document Version**: 1.0.0
**Schema Version**: v0.7.0
**Last Updated**: 2025-11-22
**SHACL Shapes File**: `schemas/20251121/shacl/custodian_validation_shapes.ttl` (474 lines)
**Validation Script**: `scripts/validate_with_shacl.py` (289 lines)