SHACL Validation Shapes for Heritage Custodian Ontology
Version: 1.0.0
Schema Version: v0.7.0
Created: 2025-11-22
SHACL Spec: https://www.w3.org/TR/shacl/
Table of Contents
- Overview
- Installation
- Usage
- Validation Rules
- Shape Definitions
- Examples
- Integration
- Comparison with Python Validator
- Advanced Usage
- Troubleshooting
- References
- Next Steps
Overview
This document describes the SHACL (Shapes Constraint Language) validation shapes for the Heritage Custodian Ontology. SHACL shapes enforce data quality constraints at RDF ingestion time, preventing invalid data from entering triple stores.
What is SHACL?
SHACL is a W3C recommendation for validating RDF graphs against a set of conditions (shapes). SPARQL queries typically detect violations after data has been stored; SHACL shapes can be applied at load time, so invalid data is rejected before it reaches the store.
Benefits of SHACL Validation
✅ Prevention over Detection: Reject invalid data before storage
✅ Standardized Reports: Machine-readable validation results
✅ Triple Store Integration: SHACL support in stores such as GraphDB and Apache Jena; verify support in other stores (for example Virtuoso) before relying on it
✅ Declarative Constraints: Express rules in RDF (no external scripts)
✅ Detailed Error Messages: Precise identification of failing triples
Installation
Prerequisites
Install Python dependencies:
pip install pyshacl rdflib
Libraries:
- pyshacl (v0.25.0+): SHACL validator for Python
- rdflib (v7.0.0+): RDF graph library
Verify Installation
python3 -c "import pyshacl; print(pyshacl.__version__)"
# Expected output: 0.25.0 (or later)
Usage
Command Line Validation
Basic Usage:
python scripts/validate_with_shacl.py data.ttl
With Custom Shapes:
python scripts/validate_with_shacl.py data.ttl --shapes custom_shapes.ttl
Different RDF Formats:
# JSON-LD data
python scripts/validate_with_shacl.py data.jsonld --format jsonld
# N-Triples data
python scripts/validate_with_shacl.py data.nt --format nt
Save Validation Report:
python scripts/validate_with_shacl.py data.ttl --output report.ttl
Verbose Output:
python scripts/validate_with_shacl.py data.ttl --verbose
Python Library Usage
from scripts.validate_with_shacl import validate_file

# Validate with default shapes
if validate_file("data.ttl"):
    print("✅ Data is valid")
else:
    print("❌ Data has violations")

# Validate with custom shapes
if validate_file("data.ttl", shapes_file="custom_shapes.ttl"):
    print("✅ Valid")
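For orientation, a function like validate_file can be a thin wrapper around pyshacl.validate. The following is a minimal sketch under stated assumptions, not the actual contents of scripts/validate_with_shacl.py: the names validate_file_sketch and guess_format and the extension table are hypothetical, and the default shapes path is the one listed at the end of this document.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to an rdflib parser name.
FORMATS = {".ttl": "turtle", ".jsonld": "json-ld", ".nt": "nt", ".rdf": "xml"}

# Shapes file location as listed at the end of this document.
DEFAULT_SHAPES = "schemas/20251121/shacl/custodian_validation_shapes.ttl"


def guess_format(path: str) -> str:
    """Return an rdflib format name for a path, defaulting to Turtle."""
    return FORMATS.get(Path(path).suffix.lower(), "turtle")


def validate_file_sketch(data_file: str, shapes_file: Optional[str] = None) -> bool:
    """Validate data_file against shapes_file; return True if it conforms."""
    # Imported lazily so this module loads even when pyshacl is not installed.
    from pyshacl import validate

    conforms, _report_graph, report_text = validate(
        data_file,
        shacl_graph=shapes_file or DEFAULT_SHAPES,
        data_graph_format=guess_format(data_file),
        shacl_graph_format="turtle",
    )
    if not conforms:
        print(report_text)
    return conforms
```

pyshacl.validate returns a tuple of (conforms, report graph, report text), so a wrapper of this kind can return the boolean and leave report handling to the caller.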
Triple Store Integration
Apache Jena Fuseki:
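# Fuseki does not validate SPARQL UPDATE requests against loaded shapes automatically.
# Instead, enable the SHACL service on the dataset in the server configuration
# (fuseki:serviceShacl "shacl"), then POST the shapes file to validate a graph.
# Host, port and the dataset name "ds" below are placeholders:
curl -XPOST --data-binary @custodian_validation_shapes.ttl \
  --header 'Content-type: text/turtle' \
  'http://localhost:3030/ds/shacl?graph=default'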
GraphDB:
- Create repository with SHACL validation enabled
- Import shapes file into dedicated context: http://shacl/shapes
- GraphDB validates all data changes automatically
Validation Rules
This SHACL shapes file implements 5 core validation rules from Phase 5:
| Rule ID | Name | Severity | Description |
|---|---|---|---|
| Rule 1 | Collection-Unit Temporal Consistency | ERROR | Collection custody dates must fall within managing unit's validity period |
| Rule 2 | Collection-Unit Bidirectional | ERROR | Collection → unit must have inverse unit → collection |
| Rule 3 | Custody Transfer Continuity | WARNING | Custody transfers must be continuous (no gaps/overlaps) |
| Rule 4 | Staff-Unit Temporal Consistency | ERROR | Staff employment dates must fall within unit's validity period |
| Rule 5 | Staff-Unit Bidirectional | ERROR | Person → unit must have inverse unit → person |
Plus 3 additional shapes for type and format constraints.
Shape Definitions
Rule 1: Collection-Unit Temporal Consistency
Shape ID: custodian:CollectionUnitTemporalConsistencyShape
Target: All instances of custodian:CustodianCollection
Constraints:
Constraint 1.1: Collection Starts After Unit Founding
sh:sparql [
sh:message "Collection valid_from ({?collectionStart}) must be >= managing unit valid_from ({?unitStart})" ;
sh:select """
SELECT $this ?collectionStart ?unitStart ?managingUnit
WHERE {
$this custodian:managing_unit ?managingUnit ;
custodian:valid_from ?collectionStart .
?managingUnit custodian:valid_from ?unitStart .
# VIOLATION: Collection starts before unit exists
FILTER(?collectionStart < ?unitStart)
}
""" ;
] .
Example Violation:
# Unit founded 2010
<https://example.org/unit/dept-1>
a custodian:OrganizationalStructure ;
custodian:valid_from "2010-01-01"^^xsd:date .
# Collection started 2005 (INVALID!)
<https://example.org/collection/col-1>
a custodian:CustodianCollection ;
custodian:managing_unit <https://example.org/unit/dept-1> ;
custodian:valid_from "2005-01-01"^^xsd:date .
Violation Report:
❌ Validation Result [Constraint Component: sh:SPARQLConstraintComponent]
Severity: sh:Violation
Message: Collection valid_from (2005-01-01) must be >= managing unit valid_from (2010-01-01)
Focus Node: https://example.org/collection/col-1
Constraint 1.2: Collection Ends Before Unit Dissolution
sh:sparql [
sh:message "Collection valid_to ({?collectionEnd}) must be <= managing unit valid_to ({?unitEnd})" ;
sh:select """
SELECT $this ?collectionEnd ?unitEnd ?managingUnit
WHERE {
$this custodian:managing_unit ?managingUnit ;
custodian:valid_to ?collectionEnd .
?managingUnit custodian:valid_to ?unitEnd .
# Unit is dissolved
FILTER(BOUND(?unitEnd))
# VIOLATION: Collection custody ends after unit dissolution
FILTER(?collectionEnd > ?unitEnd)
}
""" ;
] .
Example Violation:
# Unit dissolved 2020
<https://example.org/unit/dept-1>
a custodian:OrganizationalStructure ;
custodian:valid_from "2010-01-01"^^xsd:date ;
custodian:valid_to "2020-12-31"^^xsd:date .
# Collection custody ended 2023 (INVALID!)
<https://example.org/collection/col-1>
a custodian:CustodianCollection ;
custodian:managing_unit <https://example.org/unit/dept-1> ;
custodian:valid_from "2015-01-01"^^xsd:date ;
custodian:valid_to "2023-06-01"^^xsd:date .
Warning: Ongoing Custody After Unit Dissolution
sh:sparql [
sh:severity sh:Warning ;
sh:message "Collection has ongoing custody but managing unit was dissolved" ;
sh:select """
SELECT $this ?managingUnit ?unitEnd
WHERE {
$this custodian:managing_unit ?managingUnit .
# Collection has no end date (ongoing)
FILTER NOT EXISTS { $this custodian:valid_to ?collectionEnd }
# But unit is dissolved
?managingUnit custodian:valid_to ?unitEnd .
}
""" ;
] .
Example Warning:
# Unit dissolved 2020
<https://example.org/unit/dept-1>
custodian:valid_to "2020-12-31"^^xsd:date .
# Collection custody ongoing (WARNING!)
<https://example.org/collection/col-1>
custodian:managing_unit <https://example.org/unit/dept-1> ;
custodian:valid_from "2015-01-01"^^xsd:date .
# No valid_to → custody still active
Interpretation: Collection likely transferred to another unit but custody history not updated.
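The three Rule 1 checks can also be restated outside SPARQL. The following is a minimal pure-Python sketch of the same logic, not the project's Phase 5 validator: the function name, the Period tuple, and the message wording are assumptions made for illustration.

```python
from datetime import date
from typing import List, Optional, Tuple

# (valid_from, valid_to); valid_to is None while the period is ongoing.
Period = Tuple[date, Optional[date]]


def check_collection_within_unit(collection: Period, unit: Period) -> List[str]:
    """Mirror constraints 1.1 and 1.2 plus the ongoing-custody warning."""
    findings = []
    c_start, c_end = collection
    u_start, u_end = unit
    if c_start < u_start:
        findings.append("violation: collection starts before unit exists")
    if u_end is not None:  # unit is dissolved
        if c_end is None:
            findings.append("warning: ongoing custody but unit was dissolved")
        elif c_end > u_end:
            findings.append("violation: custody ends after unit dissolution")
    return findings


unit = (date(2010, 1, 1), date(2020, 12, 31))
print(check_collection_within_unit((date(2005, 1, 1), None), unit))
print(check_collection_within_unit((date(2015, 1, 1), date(2023, 6, 1)), unit))
print(check_collection_within_unit((date(2015, 1, 1), date(2019, 1, 1)), unit))
```

The first call reports both a start-date violation and the ongoing-custody warning, the second reports the end-date violation, and the third returns an empty list.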
Rule 2: Collection-Unit Bidirectional Relationships
Shape ID: custodian:CollectionUnitBidirectionalShape
Target: All instances of custodian:CustodianCollection
Constraint: If collection references managing_unit, unit must reference collection in managed_collections.
sh:sparql [
sh:message "Collection references managing_unit {?unit} but unit does not list collection in managed_collections" ;
sh:select """
SELECT $this ?unit
WHERE {
$this custodian:managing_unit ?unit .
# VIOLATION: Unit does not reference collection back
FILTER NOT EXISTS {
?unit custodian:managed_collections $this
}
}
""" ;
] .
Example Violation:
# Collection references unit
<https://example.org/collection/col-1>
custodian:managing_unit <https://example.org/unit/dept-1> .
# But unit does NOT reference collection (INVALID!)
<https://example.org/unit/dept-1>
a custodian:OrganizationalStructure .
# Missing: custodian:managed_collections <https://example.org/collection/col-1>
Fix:
# Add inverse relationship
<https://example.org/unit/dept-1>
custodian:managed_collections <https://example.org/collection/col-1> .
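To see what the FILTER NOT EXISTS pattern in Rule 2 computes, the check can be sketched over plain tuples. This is an illustrative sketch only: the helper name missing_inverses and the shortened identifiers (col-1, dept-1, unprefixed property names) are assumptions, not part of the shapes file.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def missing_inverses(
    triples: Iterable[Triple], forward: str, inverse: str
) -> List[Tuple[str, str]]:
    """Return (subject, object) pairs that have `forward` but no `inverse` back."""
    triple_set: Set[Triple] = set(triples)
    return sorted(
        (s, o)
        for (s, p, o) in triple_set
        if p == forward and (o, inverse, s) not in triple_set
    )


data = [
    ("col-1", "managing_unit", "dept-1"),
    ("col-2", "managing_unit", "dept-1"),
    ("dept-1", "managed_collections", "col-2"),
]
# col-1 is reported; col-2 is listed by dept-1 and passes.
print(missing_inverses(data, "managing_unit", "managed_collections"))
```

Adding the triple ("dept-1", "managed_collections", "col-1") to the data makes the result empty, which corresponds to the fix shown above.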
Rule 3: Custody Transfer Continuity
Shape ID: custodian:CustodyTransferContinuityShape
Target: All instances of custodian:CustodianCollection
Constraints:
Check for Gaps in Custody Chain
sh:sparql [
sh:severity sh:Warning ;
sh:message "Custody gap detected: previous custody ended on {?prevEnd} but next custody started on {?nextStart}" ;
sh:select """
SELECT $this ?prevEnd ?nextStart ?gapDays
WHERE {
$this custodian:custody_history ?event1 ;
custodian:custody_history ?event2 .
?event1 custodian:transfer_date ?prevEnd .
?event2 custodian:transfer_date ?nextStart .
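# Every ordered pair of events is compared, so non-adjacent events in a long chain also match
FILTER(?nextStart > ?prevEnd)
# NOTE: subtracting dates is not defined by SPARQL 1.1, so the type of ?gapDays
# (and whether the comparison below succeeds) depends on the SPARQL engine
BIND((xsd:date(?nextStart) - xsd:date(?prevEnd)) AS ?gapDays)
# WARNING: Gap > 1 day
FILTER(?gapDays > 1)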
}
""" ;
] .
Example Warning:
<https://example.org/collection/col-1>
custodian:custody_history <https://example.org/event/transfer-1> ;
custodian:custody_history <https://example.org/event/transfer-2> .
<https://example.org/event/transfer-1>
custodian:transfer_date "2010-01-01"^^xsd:date .
<https://example.org/event/transfer-2>
custodian:transfer_date "2010-02-01"^^xsd:date .
# Gap of 31 days between transfers
Check for Overlaps in Custody Chain
sh:sparql [
sh:message "Custody overlap detected: collection managed by {?custodian1} until {?end1} and simultaneously by {?custodian2} from {?start2}" ;
sh:select """
SELECT $this ?custodian1 ?end1 ?custodian2 ?start2
WHERE {
$this custodian:custody_history ?event1 ;
custodian:custody_history ?event2 .
?event1 custodian:new_custodian ?custodian1 ;
custodian:custody_end_date ?end1 .
?event2 custodian:new_custodian ?custodian2 ;
custodian:transfer_date ?start2 .
FILTER(?custodian1 != ?custodian2)
FILTER(?start2 < ?end1) # Overlap!
}
""" ;
] .
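The gap and overlap checks are easier to reason about when custody periods are sorted and only neighbouring periods are compared. The sketch below illustrates that idea in plain Python; it is not the SHACL implementation (which compares event pairs in SPARQL), and the Custody tuple, the function name, and the sample chain are assumptions.

```python
from datetime import date
from typing import List, Tuple

Custody = Tuple[str, date, date]  # (custodian, start, end)


def check_custody_chain(periods: List[Custody], max_gap_days: int = 1) -> List[str]:
    """Compare each custody period with the next one in chronological order."""
    ordered = sorted(periods, key=lambda p: p[1])
    findings = []
    for (c1, _start1, end1), (c2, start2, _end2) in zip(ordered, ordered[1:]):
        gap = (start2 - end1).days
        if gap > max_gap_days:
            findings.append(f"gap of {gap} days between {c1} and {c2}")
        elif gap < 0 and c1 != c2:
            findings.append(f"overlap of {-gap} days between {c1} and {c2}")
    return findings


chain = [
    ("dept-1", date(2010, 1, 1), date(2012, 12, 31)),
    ("dept-2", date(2013, 2, 1), date(2015, 6, 30)),
    ("dept-3", date(2015, 6, 1), date(2020, 1, 1)),
]
for finding in check_custody_chain(chain):
    print(finding)
```

Comparing only adjacent periods avoids reporting the same gap once per later event, at the cost of requiring a start and end date for every period.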
Rule 4: Staff-Unit Temporal Consistency
Shape ID: custodian:StaffUnitTemporalConsistencyShape
Target: All instances of custodian:PersonObservation
Constraints: Same as Rule 1, but for staff employment dates vs. unit validity period.
Constraint 4.1: Employment Starts After Unit Founding
sh:sparql [
sh:message "Staff employment_start_date ({?employmentStart}) must be >= unit valid_from ({?unitStart})" ;
sh:select """
SELECT $this ?employmentStart ?unitStart ?unit
WHERE {
$this custodian:unit_affiliation ?unit ;
custodian:employment_start_date ?employmentStart .
?unit custodian:valid_from ?unitStart .
FILTER(?employmentStart < ?unitStart)
}
""" ;
] .
Example Violation:
# Unit founded 2015
<https://example.org/unit/dept-1>
custodian:valid_from "2015-01-01"^^xsd:date .
# Staff employed 2010 (INVALID!)
<https://example.org/person/john-doe>
custodian:unit_affiliation <https://example.org/unit/dept-1> ;
custodian:employment_start_date "2010-01-01"^^xsd:date .
Rule 5: Staff-Unit Bidirectional Relationships
Shape ID: custodian:StaffUnitBidirectionalShape
Target: All instances of custodian:PersonObservation
Constraint: If person references unit_affiliation, unit must reference person in staff_members or org:hasMember.
sh:sparql [
sh:message "Person references unit_affiliation {?unit} but unit does not list person in staff_members" ;
sh:select """
SELECT $this ?unit
WHERE {
$this custodian:unit_affiliation ?unit .
# VIOLATION: Unit does not reference person back
FILTER NOT EXISTS {
{ ?unit custodian:staff_members $this }
UNION
{ ?unit org:hasMember $this }
}
}
""" ;
] .
Additional Shapes: Type and Format Constraints
Type Constraint: managing_unit Must Be OrganizationalStructure
custodian:CollectionManagingUnitTypeShape
sh:property [
sh:path custodian:managing_unit ;
sh:class custodian:OrganizationalStructure ;
sh:message "managing_unit must be an instance of OrganizationalStructure" ;
] .
Type Constraint: unit_affiliation Must Be OrganizationalStructure
custodian:PersonUnitAffiliationTypeShape
sh:property [
sh:path custodian:unit_affiliation ;
sh:class custodian:OrganizationalStructure ;
sh:message "unit_affiliation must be an instance of OrganizationalStructure" ;
] .
Format Constraint: Dates Must Be xsd:date or xsd:dateTime
custodian:DatetimeFormatShape
sh:property [
sh:path custodian:valid_from ;
sh:or (
[ sh:datatype xsd:date ]
[ sh:datatype xsd:dateTime ]
) ;
] .
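As a rough illustration of what the date format constraint accepts, the check below tests a literal's datatype and lexical form in plain Python. It is a simplified sketch: the function name is hypothetical, and the patterns deliberately ignore parts of the XSD lexical space (time zones on xsd:date, negative years, 24:00:00).

```python
import re
from datetime import date, time

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical patterns; see the caveats in the text above.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
DATETIME_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
)


def is_valid_temporal_literal(lexical: str, datatype: str) -> bool:
    """Return True for a well-formed xsd:date or xsd:dateTime literal."""
    try:
        if datatype == XSD + "date" and DATE_RE.fullmatch(lexical):
            date.fromisoformat(lexical)  # rejects dates such as 2010-02-30
            return True
        match = DATETIME_RE.fullmatch(lexical) if datatype == XSD + "dateTime" else None
        if match:
            date.fromisoformat(match.group(1))
            time(int(match.group(2)), int(match.group(3)), int(match.group(4)))
            return True
    except ValueError:
        return False
    return False


print(is_valid_temporal_literal("2010-01-01", XSD + "date"))
print(is_valid_temporal_literal("2010-01-01", XSD + "string"))
```

A plain string literal such as "2010-01-01" without a datatype fails this check, which is the typical reason the shape reports a violation on otherwise plausible data.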
Examples
Example 1: Valid Collection-Unit Relationship
Valid RDF Data:
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://example.org/unit/paintings-dept>
a custodian:OrganizationalStructure ;
custodian:unit_name "Paintings Department" ;
custodian:valid_from "1985-01-01"^^xsd:date ;
custodian:managed_collections <https://example.org/collection/dutch-paintings> .
<https://example.org/collection/dutch-paintings>
a custodian:CustodianCollection ;
custodian:collection_name "Dutch Paintings" ;
custodian:managing_unit <https://example.org/unit/paintings-dept> ;
custodian:valid_from "1995-01-01"^^xsd:date .
Validation:
python scripts/validate_with_shacl.py valid_data.ttl
# ✅ VALIDATION PASSED
# No constraint violations found.
Example 2: Invalid - Temporal Violation
Invalid RDF Data:
<https://example.org/unit/paintings-dept>
custodian:valid_from "1985-01-01"^^xsd:date .
<https://example.org/collection/dutch-paintings>
custodian:managing_unit <https://example.org/unit/paintings-dept> ;
custodian:valid_from "1970-01-01"^^xsd:date . # Before unit exists!
Validation:
python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result [Constraint Component: sh:SPARQLConstraintComponent]:
# Severity: sh:Violation
# Message: Collection valid_from (1970-01-01) must be >= managing unit valid_from (1985-01-01)
# Focus Node: https://example.org/collection/dutch-paintings
# Result Path: -
# Source Shape: custodian:CollectionUnitTemporalConsistencyShape
Example 3: Invalid - Missing Bidirectional Relationship
Invalid RDF Data:
<https://example.org/collection/dutch-paintings>
custodian:managing_unit <https://example.org/unit/paintings-dept> .
<https://example.org/unit/paintings-dept>
a custodian:OrganizationalStructure .
# Missing: custodian:managed_collections <https://example.org/collection/dutch-paintings>
Validation:
python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result:
# Severity: sh:Violation
# Message: Collection references managing_unit https://example.org/unit/paintings-dept
# but unit does not list collection in managed_collections
# Focus Node: https://example.org/collection/dutch-paintings
Integration
CI/CD Pipeline Integration
GitHub Actions Example:
name: SHACL Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install pyshacl rdflib
      - name: Validate RDF data
        run: |
          # Validate one file per invocation and write a report for each
          mkdir -p reports
          status=0
          for file in data/instances/*.ttl; do
            python scripts/validate_with_shacl.py "$file" \
              --output "reports/$(basename "$file" .ttl)_report.ttl" || status=1
          done
          exit $status
      - name: Upload validation report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: reports/
Pre-commit Hook
.git/hooks/pre-commit:
#!/bin/bash
# Validate RDF files before commit
echo "Running SHACL validation..."
for file in data/instances/*.ttl; do
    # Skip the unexpanded pattern when no .ttl files exist
    [ -e "$file" ] || continue
    if ! python scripts/validate_with_shacl.py "$file" --quiet; then
        echo "❌ SHACL validation failed for $file"
        echo "Fix violations before committing."
        exit 1
    fi
done
echo "✅ All files pass SHACL validation"
exit 0
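The same loop can be written in Python when the hook needs to report every failing file instead of stopping at the first one. This is a sketch under assumptions: the function name is hypothetical, the validator is passed in as a callable (for example validate_file), and the exit-code meanings (0 valid, 1 violations, 2 error) are an assumed reading of the 0/1/2 codes mentioned in this document.

```python
from typing import Callable, Iterable


def aggregate_exit_code(paths: Iterable[str], validate: Callable[[str], bool]) -> int:
    """Return 0 if all files conform, 1 if any has violations, 2 on an error."""
    exit_code = 0
    for path in paths:
        try:
            ok = validate(path)
        except Exception as exc:  # parse errors, missing files, and similar
            print(f"error validating {path}: {exc}")
            return 2
        if not ok:
            print(f"SHACL validation failed for {path}")
            exit_code = 1
    return exit_code


# Stub validator for demonstration: only b.ttl "fails".
print(aggregate_exit_code(["a.ttl", "b.ttl", "c.ttl"], lambda p: p != "b.ttl"))
```

In a real hook the return value would be passed to sys.exit so that Git aborts the commit on any non-zero code.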
Comparison with Python Validator
Phase 5 Python Validator vs. Phase 7 SHACL Shapes
| Aspect | Python Validator (Phase 5) | SHACL Shapes (Phase 7) |
|---|---|---|
| Input Format | YAML (LinkML instances) | RDF (Turtle, JSON-LD, etc.) |
| Execution | Standalone script | Triple store integrated OR pyshacl |
| Performance | Fast for <1,000 records | Optimized for >10,000 records |
| Deployment | Python runtime required | RDF triple store native |
| Error Messages | Custom CLI output | Standardized SHACL reports |
| CI/CD | Exit codes (0/1/2) | Exit codes (0/1/2) + RDF report |
| Use Case | Development validation | Production runtime validation |
When to Use Which?
Use Python Validator (validate_temporal_consistency.py):
- ✅ During schema development (fast feedback on YAML instances)
- ✅ Pre-commit hooks for LinkML files
- ✅ Unit testing LinkML examples
- ✅ Before RDF conversion
Use SHACL Shapes (validate_with_shacl.py):
- ✅ Production RDF triple stores (GraphDB, Fuseki)
- ✅ Data ingestion pipelines
- ✅ Continuous monitoring (real-time validation)
- ✅ After RDF conversion (final quality gate)
Best Practice: Use both:
- Python validator during development (YAML → validate → RDF)
- SHACL shapes in production (RDF → validate → store)
Advanced Usage
Generate Validation Report
python scripts/validate_with_shacl.py data.ttl --output report.ttl
Report Format (Turtle):
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
[ a sh:ValidationReport ;
sh:conforms false ;
sh:result [
a sh:ValidationResult ;
sh:focusNode <https://example.org/collection/col-1> ;
sh:resultMessage "Collection valid_from (1970-01-01) must be >= ..." ;
sh:resultSeverity sh:Violation ;
sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
sh:sourceShape custodian:CollectionUnitTemporalConsistencyShape
]
] .
Custom Severity Levels
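SHACL supports three severity levels. Severity is a label on each validation result: under the SHACL specification any result, whatever its severity, makes sh:conforms false, so whether a warning actually blocks loading depends on how the validator or triple store is configured (pyshacl, for instance, has an allow_warnings option). The intended policy here is: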
sh:severity sh:Violation ; # ERROR (blocks data loading)
sh:severity sh:Warning ; # WARNING (logged but allowed)
sh:severity sh:Info ; # INFO (informational only)
Example: Custody gap is a warning (data quality issue but not invalid):
custodian:CustodyTransferContinuityShape
sh:sparql [
sh:severity sh:Warning ; # Allow data but log warning
sh:message "Custody gap detected..." ;
...
] .
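Because severity only labels a result, the decision to block or allow a load is made by the validator or the calling code. The sketch below shows one such policy in plain Python; the function name is hypothetical, and the two flags are modelled on the allow_infos and allow_warnings options of pyshacl rather than taken from the validation script.

```python
from typing import Iterable

SH = "http://www.w3.org/ns/shacl#"


def should_reject(
    severities: Iterable[str], allow_infos: bool = False, allow_warnings: bool = False
) -> bool:
    """Return True if any result severity is not tolerated by the policy."""
    tolerated = set()
    if allow_infos or allow_warnings:
        tolerated.add(SH + "Info")
    if allow_warnings:  # tolerating warnings also tolerates infos
        tolerated.add(SH + "Warning")
    return any(severity not in tolerated for severity in severities)


results = [SH + "Warning"]
print(should_reject(results))                       # strict: any result rejects
print(should_reject(results, allow_warnings=True))  # warnings are only logged
```

With the strict default a custody-gap warning alone rejects the data; with allow_warnings set, only sh:Violation results do.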
Extending Shapes
Add custom validation rules by creating new shapes:
# Custom rule: Collection name must not be empty
custodian:CollectionNameNotEmptyShape
a sh:NodeShape ;
sh:targetClass custodian:CustodianCollection ;
sh:property [
sh:path custodian:collection_name ;
sh:minLength 1 ;
sh:message "Collection name must not be empty" ;
] .
Troubleshooting
Common Issues
Issue 1: "pyshacl not found"
Solution:
pip install pyshacl rdflib
Issue 2: "Parse error: Invalid Turtle syntax"
Solution: Validate RDF syntax first:
rdfpipe -i turtle data.ttl > /dev/null
# If errors, fix syntax before SHACL validation
Issue 3: "No violations found but data is clearly invalid"
Solution: Check namespace prefixes match between shapes and data:
# Shapes file uses:
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
# Data file must use same namespace:
<https://nde.nl/ontology/hc/custodian/CustodianCollection>
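A quick way to spot such a mismatch is to compare the prefix declarations of the two files. The following is a regex-based sketch, not a Turtle parser: it only reads @prefix lines (not SPARQL-style PREFIX), and the function names and the deliberately wrong namespace in the sample data are assumptions for illustration.

```python
import re
from typing import Dict, Tuple

PREFIX_RE = re.compile(r"@prefix\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>\s*\.")


def prefixes(turtle_text: str) -> Dict[str, str]:
    """Map each declared prefix name to its namespace IRI."""
    return {name: iri for name, iri in PREFIX_RE.findall(turtle_text)}


def prefix_mismatches(shapes_ttl: str, data_ttl: str) -> Dict[str, Tuple[str, str]]:
    """Return prefixes declared in both texts but bound to different IRIs."""
    shapes, data = prefixes(shapes_ttl), prefixes(data_ttl)
    return {
        name: (shapes[name], data[name])
        for name in sorted(shapes.keys() & data.keys())
        if shapes[name] != data[name]
    }


shapes_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian/> ."
data_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian#> ."
print(prefix_mismatches(shapes_text, data_text))
```

Here the data file ends the namespace with # instead of /, so no node in the data matches the target class of the shapes and validation reports no violations.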
References
- SHACL Specification: https://www.w3.org/TR/shacl/
- pyshacl Documentation: https://github.com/RDFLib/pySHACL
- SHACL Advanced Features: https://www.w3.org/TR/shacl-af/
- Python Validator (Phase 5): scripts/validate_temporal_consistency.py
- SPARQL Queries (Phase 6): docs/SPARQL_QUERIES_ORGANIZATIONAL.md
- Schema (v0.7.0): schemas/20251121/linkml/01_custodian_name_modular.yaml
Next Steps
Phase 8: LinkML Schema Constraints
Embed validation rules directly into LinkML schema using:
- minimum_value/maximum_value for date comparisons
- pattern for format validation
- Custom validators with Python functions
- Slot-level constraints
Goal: Validate at schema definition level, not just RDF level.
Document Version: 1.0.0
Schema Version: v0.7.0
Last Updated: 2025-11-22
SHACL Shapes File: schemas/20251121/shacl/custodian_validation_shapes.ttl (474 lines)
Validation Script: scripts/validate_with_shacl.py (289 lines)