# SHACL Validation Shapes for Heritage Custodian Ontology

**Version**: 1.0.0  
**Schema Version**: v0.7.0  
**Created**: 2025-11-22  
**SHACL Spec**: https://www.w3.org/TR/shacl/

---

## Table of Contents

1. [Overview](#overview)
2. [Installation](#installation)
3. [Usage](#usage)
4. [Validation Rules](#validation-rules)
5. [Shape Definitions](#shape-definitions)
6. [Examples](#examples)
7. [Integration](#integration)
8. [Comparison with Python Validator](#comparison-with-python-validator)
9. [Advanced Usage](#advanced-usage)
10. [Troubleshooting](#troubleshooting)
11. [References](#references)
12. [Next Steps](#next-steps)

---

## Overview

This document describes the **SHACL (Shapes Constraint Language)** validation shapes for the Heritage Custodian Ontology. SHACL shapes enforce data quality constraints at RDF ingestion time, preventing invalid data from entering triple stores.

### What is SHACL?

**SHACL** is a W3C recommendation for validating RDF graphs against a set of conditions (shapes). Unlike SPARQL queries that **detect** violations after data is stored, SHACL shapes applied at load time **prevent** violations from being stored in the first place.

### Benefits of SHACL Validation

- ✅ **Prevention over Detection**: Reject invalid data before storage
- ✅ **Standardized Reports**: Machine-readable validation results
- ✅ **Triple Store Integration**: Native support in GraphDB, Jena, Virtuoso
- ✅ **Declarative Constraints**: Express rules in RDF (no external scripts)
- ✅ **Detailed Error Messages**: Precise identification of failing triples

---

## Installation

### Prerequisites

Install Python dependencies:

```bash
pip install pyshacl rdflib
```

**Libraries**:

- **pyshacl** (v0.25.0+): SHACL validator for Python
- **rdflib** (v7.0.0+): RDF graph library

### Verify Installation

```bash
python3 -c "import pyshacl; print(pyshacl.__version__)"
# Expected output: 0.25.0 (or later)
```

---

## Usage

### Command Line Validation

**Basic Usage**:

```bash
python scripts/validate_with_shacl.py data.ttl
```

**With Custom Shapes**:

```bash
python scripts/validate_with_shacl.py data.ttl --shapes custom_shapes.ttl
```

**Different RDF Formats**:

```bash
# JSON-LD data
python scripts/validate_with_shacl.py data.jsonld --format jsonld

# N-Triples data
python scripts/validate_with_shacl.py data.nt --format nt
```

**Save Validation Report**:

```bash
python scripts/validate_with_shacl.py data.ttl --output report.ttl
```

**Verbose Output**:

```bash
python scripts/validate_with_shacl.py data.ttl --verbose
```

### Python Library Usage

```python
from scripts.validate_with_shacl import validate_file

# Validate with default shapes
if validate_file("data.ttl"):
    print("✅ Data is valid")
else:
    print("❌ Data has violations")

# Validate with custom shapes
if validate_file("data.ttl", shapes_file="custom_shapes.ttl"):
    print("✅ Valid")
```

### Triple Store Integration

**Apache Jena Fuseki**:

```bash
# Load shapes into Fuseki dataset
tdbloader2 --loc=/path/to/tdb custodian_validation_shapes.ttl

# Validation is not triggered automatically by SPARQL UPDATE.
# Enable Fuseki's SHACL service on the dataset and send the shapes
# to that endpoint to validate a graph.
```

**GraphDB**:

1. Create repository with SHACL validation enabled
2. Import shapes file into dedicated context: `http://shacl/shapes`
3. GraphDB validates all data changes automatically

---

## Validation Rules

This SHACL shapes file implements **5 core validation rules** from Phase 5:

| Rule ID | Name | Severity | Description |
|---------|------|----------|-------------|
| **Rule 1** | Collection-Unit Temporal Consistency | ERROR | Collection custody dates must fall within managing unit's validity period |
| **Rule 2** | Collection-Unit Bidirectional | ERROR | Collection → unit must have inverse unit → collection |
| **Rule 3** | Custody Transfer Continuity | WARNING | Custody transfers must be continuous (no gaps/overlaps) |
| **Rule 4** | Staff-Unit Temporal Consistency | ERROR | Staff employment dates must fall within unit's validity period |
| **Rule 5** | Staff-Unit Bidirectional | ERROR | Person → unit must have inverse unit → person |

Plus **3 additional shapes** for type and format constraints.
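Rules 2 and 5 in the table share one idea: every forward link (collection → unit, person → unit) needs a matching inverse link on the unit. As a rough illustration of that check outside of SHACL, here is a minimal plain-Python sketch. It is not taken from the shapes file or the validation script; the function name `missing_inverse_links` and the identifiers `col-1`, `col-2`, and `unit-a` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def missing_inverse_links(
    forward: Dict[str, str],
    inverse: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (child, unit) pairs where the unit does not link back to the child.

    forward: child id -> unit id (e.g. collection -> managing_unit)
    inverse: unit id -> set of child ids (e.g. unit -> managed_collections)
    """
    return sorted(
        (child, unit)
        for child, unit in forward.items()
        if child not in inverse.get(unit, set())
    )


# Hypothetical data: two collections point at unit-a,
# but unit-a only lists col-1 among its managed collections.
managing_unit = {"col-1": "unit-a", "col-2": "unit-a"}
managed_collections = {"unit-a": {"col-1"}}

violations = missing_inverse_links(managing_unit, managed_collections)
print(violations)  # [('col-2', 'unit-a')]
```

The same function covers Rule 5 if `forward` holds person → unit affiliations and `inverse` holds each unit's staff members.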
---

## Shape Definitions

### Rule 1: Collection-Unit Temporal Consistency

**Shape ID**: `custodian:CollectionUnitTemporalConsistencyShape`

**Target**: All instances of `custodian:CustodianCollection`

**Constraints**:

#### Constraint 1.1: Collection Starts After Unit Founding

```turtle
sh:sparql [
    sh:message "Collection valid_from ({?collectionStart}) must be >= managing unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?collectionStart ?unitStart ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_from ?collectionStart .

            ?managingUnit custodian:valid_from ?unitStart .

            # VIOLATION: Collection starts before unit exists
            FILTER(?collectionStart < ?unitStart)
        }
    """ ;
] .
```

**Example Violation**:

```turtle
# Unit founded 2010
a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date .

# Collection started 2005 (INVALID!)
a custodian:CustodianCollection ;
    custodian:managing_unit ;
    custodian:valid_from "2005-01-01"^^xsd:date .
```

**Violation Report**:

```
❌ Validation Result [Constraint Component: sh:SPARQLConstraintComponent]
   Severity: sh:Violation
   Message: Collection valid_from (2005-01-01) must be >= managing unit valid_from (2010-01-01)
   Focus Node: https://example.org/collection/col-1
```

---

#### Constraint 1.2: Collection Ends Before Unit Dissolution

```turtle
sh:sparql [
    sh:message "Collection valid_to ({?collectionEnd}) must be <= managing unit valid_to ({?unitEnd})" ;
    sh:select """
        SELECT $this ?collectionEnd ?unitEnd ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_to ?collectionEnd .

            ?managingUnit custodian:valid_to ?unitEnd .

            # Unit is dissolved
            FILTER(BOUND(?unitEnd))

            # VIOLATION: Collection custody ends after unit dissolution
            FILTER(?collectionEnd > ?unitEnd)
        }
    """ ;
] .
```

**Example Violation**:

```turtle
# Unit dissolved 2020
a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date ;
    custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ended 2023 (INVALID!)
a custodian:CustodianCollection ;
    custodian:managing_unit ;
    custodian:valid_from "2015-01-01"^^xsd:date ;
    custodian:valid_to "2023-06-01"^^xsd:date .
```

---

#### Warning: Ongoing Custody After Unit Dissolution

```turtle
sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Collection has ongoing custody but managing unit was dissolved" ;
    sh:select """
        SELECT $this ?managingUnit ?unitEnd
        WHERE {
            $this custodian:managing_unit ?managingUnit .

            # Collection has no end date (ongoing)
            FILTER NOT EXISTS { $this custodian:valid_to ?collectionEnd }

            # But unit is dissolved
            ?managingUnit custodian:valid_to ?unitEnd .
        }
    """ ;
] .
```

**Example Warning**:

```turtle
# Unit dissolved 2020
custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ongoing (WARNING!)
custodian:managing_unit ;
    custodian:valid_from "2015-01-01"^^xsd:date .
# No valid_to → custody still active
```

**Interpretation**: Collection likely transferred to another unit but custody history not updated.

---

### Rule 2: Collection-Unit Bidirectional Relationships

**Shape ID**: `custodian:CollectionUnitBidirectionalShape`

**Target**: All instances of `custodian:CustodianCollection`

**Constraint**: If collection references `managing_unit`, unit must reference collection in `managed_collections`.

```turtle
sh:sparql [
    sh:message "Collection references managing_unit {?unit} but unit does not list collection in managed_collections" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:managing_unit ?unit .

            # VIOLATION: Unit does not reference collection back
            FILTER NOT EXISTS {
                ?unit custodian:managed_collections $this
            }
        }
    """ ;
] .
```

**Example Violation**:

```turtle
# Collection references unit
custodian:managing_unit .

# But unit does NOT reference collection (INVALID!)
a custodian:OrganizationalStructure .
# Missing: custodian:managed_collections
```

**Fix**:

```turtle
# Add inverse relationship
custodian:managed_collections .
```

---

### Rule 3: Custody Transfer Continuity

**Shape ID**: `custodian:CustodyTransferContinuityShape`

**Target**: All instances of `custodian:CustodianCollection`

**Constraints**:

#### Check for Gaps in Custody Chain

```turtle
sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Custody gap detected: previous custody ended on {?prevEnd} but next custody started on {?nextStart}" ;
    sh:select """
        SELECT $this ?prevEnd ?nextStart ?gapDays
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .

            ?event1 custodian:transfer_date ?prevEnd .
            ?event2 custodian:transfer_date ?nextStart .

            FILTER(?nextStart > ?prevEnd)

            BIND((xsd:date(?nextStart) - xsd:date(?prevEnd)) AS ?gapDays)

            # WARNING: Gap > 1 day
            FILTER(?gapDays > 1)
        }
    """ ;
] .
```

**Example Warning**:

```turtle
custodian:custody_history ;
    custodian:custody_history .

custodian:transfer_date "2010-01-01"^^xsd:date .

custodian:transfer_date "2010-02-01"^^xsd:date .

# Gap of 31 days between transfers
```

---

#### Check for Overlaps in Custody Chain

```turtle
sh:sparql [
    sh:message "Custody overlap detected: collection managed by {?custodian1} until {?end1} and simultaneously by {?custodian2} from {?start2}" ;
    sh:select """
        SELECT $this ?custodian1 ?end1 ?custodian2 ?start2
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .

            ?event1 custodian:new_custodian ?custodian1 ;
                    custodian:custody_end_date ?end1 .

            ?event2 custodian:new_custodian ?custodian2 ;
                    custodian:transfer_date ?start2 .

            FILTER(?custodian1 != ?custodian2)
            FILTER(?start2 < ?end1)  # Overlap!
        }
    """ ;
] .
```

---

### Rule 4: Staff-Unit Temporal Consistency

**Shape ID**: `custodian:StaffUnitTemporalConsistencyShape`

**Target**: All instances of `custodian:PersonObservation`

**Constraints**: Same as Rule 1, but for staff employment dates vs. unit validity period.
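Because Rule 4 applies Rule 1's logic to staff, both rules can be read as a single interval check: the start date must not precede the unit's `valid_from`, and the end date must not follow the unit's `valid_to`. The plain-Python sketch below only illustrates that shared logic; it is not code from the shapes file or the validation script, and the function name `temporal_violations` and its message strings are hypothetical. It also assumes Rule 4 has an end-date counterpart to Constraint 1.2, which "Same as Rule 1" suggests but this document does not show. Missing dates skip the corresponding check, mirroring how the SPARQL patterns only match when both dates are present.

```python
from datetime import date
from typing import List, Optional


def temporal_violations(
    start: Optional[date],
    end: Optional[date],
    unit_start: Optional[date],
    unit_end: Optional[date],
) -> List[str]:
    """Check that [start, end] lies within the unit's [unit_start, unit_end].

    A comparison is only made when both dates involved are known.
    """
    problems: List[str] = []
    if start is not None and unit_start is not None and start < unit_start:
        problems.append("starts before unit valid_from")
    if end is not None and unit_end is not None and end > unit_end:
        problems.append("ends after unit valid_to")
    return problems


# Staff member employed in 2010, unit founded in 2015 (the Rule 4 case).
print(temporal_violations(date(2010, 1, 1), None, date(2015, 1, 1), None))

# Collection custody 2015 to 2023, unit valid 2010 to 2020 (the Constraint 1.2 case).
print(temporal_violations(date(2015, 1, 1), date(2023, 6, 1),
                          date(2010, 1, 1), date(2020, 12, 31)))
```

The first call reports a start-date problem and the second an end-date problem; a record that lies fully inside the unit's validity period yields an empty list.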
#### Constraint 4.1: Employment Starts After Unit Founding

```turtle
sh:sparql [
    sh:message "Staff employment_start_date ({?employmentStart}) must be >= unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?employmentStart ?unitStart ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit ;
                  custodian:employment_start_date ?employmentStart .

            ?unit custodian:valid_from ?unitStart .

            FILTER(?employmentStart < ?unitStart)
        }
    """ ;
] .
```

**Example Violation**:

```turtle
# Unit founded 2015
custodian:valid_from "2015-01-01"^^xsd:date .

# Staff employed 2010 (INVALID!)
custodian:unit_affiliation ;
    custodian:employment_start_date "2010-01-01"^^xsd:date .
```

---

### Rule 5: Staff-Unit Bidirectional Relationships

**Shape ID**: `custodian:StaffUnitBidirectionalShape`

**Target**: All instances of `custodian:PersonObservation`

**Constraint**: If person references `unit_affiliation`, unit must reference person in `staff_members` or `org:hasMember`.

```turtle
sh:sparql [
    sh:message "Person references unit_affiliation {?unit} but unit does not list person in staff_members" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit .

            # VIOLATION: Unit does not reference person back
            FILTER NOT EXISTS {
                { ?unit custodian:staff_members $this }
                UNION
                { ?unit org:hasMember $this }
            }
        }
    """ ;
] .
```

---

### Additional Shapes: Type and Format Constraints

#### Type Constraint: managing_unit Must Be OrganizationalStructure

```turtle
custodian:CollectionManagingUnitTypeShape
    sh:property [
        sh:path custodian:managing_unit ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "managing_unit must be an instance of OrganizationalStructure" ;
    ] .
```

#### Type Constraint: unit_affiliation Must Be OrganizationalStructure

```turtle
custodian:PersonUnitAffiliationTypeShape
    sh:property [
        sh:path custodian:unit_affiliation ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "unit_affiliation must be an instance of OrganizationalStructure" ;
    ] .
```

#### Format Constraint: Dates Must Be xsd:date or xsd:dateTime

```turtle
custodian:DatetimeFormatShape
    sh:property [
        sh:path custodian:valid_from ;
        sh:or (
            [ sh:datatype xsd:date ]
            [ sh:datatype xsd:dateTime ]
        ) ;
    ] .
```

---

## Examples

### Example 1: Valid Collection-Unit Relationship

**Valid RDF Data**:

```turtle
@prefix custodian: .
@prefix xsd: .

a custodian:OrganizationalStructure ;
    custodian:unit_name "Paintings Department" ;
    custodian:valid_from "1985-01-01"^^xsd:date ;
    custodian:managed_collections .

a custodian:CustodianCollection ;
    custodian:collection_name "Dutch Paintings" ;
    custodian:managing_unit ;
    custodian:valid_from "1995-01-01"^^xsd:date .
```

**Validation**:

```bash
python scripts/validate_with_shacl.py valid_data.ttl

# ✅ VALIDATION PASSED
# No constraint violations found.
```

---

### Example 2: Invalid - Temporal Violation

**Invalid RDF Data**:

```turtle
custodian:valid_from "1985-01-01"^^xsd:date .

custodian:managing_unit ;
    custodian:valid_from "1970-01-01"^^xsd:date .  # Before unit exists!
```

**Validation**:

```bash
python scripts/validate_with_shacl.py invalid_data.ttl

# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result [Constraint Component: sh:SPARQLConstraintComponent]:
#   Severity: sh:Violation
#   Message: Collection valid_from (1970-01-01) must be >= managing unit valid_from (1985-01-01)
#   Focus Node: https://example.org/collection/dutch-paintings
#   Result Path: -
#   Source Shape: custodian:CollectionUnitTemporalConsistencyShape
```

---

### Example 3: Invalid - Missing Bidirectional Relationship

**Invalid RDF Data**:

```turtle
custodian:managing_unit .

a custodian:OrganizationalStructure .
# Missing: custodian:managed_collections
```

**Validation**:

```bash
python scripts/validate_with_shacl.py invalid_data.ttl

# ❌ VALIDATION FAILED
#
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result:
#   Severity: sh:Violation
#   Message: Collection references managing_unit https://example.org/unit/paintings-dept
#            but unit does not list collection in managed_collections
#   Focus Node: https://example.org/collection/dutch-paintings
```

---

## Integration

### CI/CD Pipeline Integration

**GitHub Actions Example**:

```yaml
name: SHACL Validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install pyshacl rdflib

      - name: Validate RDF data
        run: |
          # Validate one file per call, as in the usage examples, and stop at
          # the first failure so validation_report.ttl describes the failing file.
          for file in data/instances/*.ttl; do
            python scripts/validate_with_shacl.py "$file" --output validation_report.ttl || exit 1
          done

      - name: Upload validation report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: validation_report.ttl
```

---

### Pre-commit Hook

**`.git/hooks/pre-commit`**:

```bash
#!/bin/bash
# Validate RDF files before commit

echo "Running SHACL validation..."

for file in data/instances/*.ttl; do
    python scripts/validate_with_shacl.py "$file" --quiet

    if [ $? -ne 0 ]; then
        echo "❌ SHACL validation failed for $file"
        echo "Fix violations before committing."
        exit 1
    fi
done

echo "✅ All files pass SHACL validation"
exit 0
```

---

## Comparison with Python Validator

### Phase 5 Python Validator vs. Phase 7 SHACL Shapes

| Aspect | Python Validator (Phase 5) | SHACL Shapes (Phase 7) |
|--------|---------------------------|------------------------|
| **Input Format** | YAML (LinkML instances) | RDF (Turtle, JSON-LD, etc.) |
| **Execution** | Standalone script | Triple store integrated OR pyshacl |
| **Performance** | Fast for <1,000 records | Optimized for >10,000 records |
| **Deployment** | Python runtime required | RDF triple store native |
| **Error Messages** | Custom CLI output | Standardized SHACL reports |
| **CI/CD** | Exit codes (0/1/2) | Exit codes (0/1/2) + RDF report |
| **Use Case** | Development validation | Production runtime validation |

### When to Use Which?

**Use Python Validator** (`validate_temporal_consistency.py`):

- ✅ During schema development (fast feedback on YAML instances)
- ✅ Pre-commit hooks for LinkML files
- ✅ Unit testing LinkML examples
- ✅ Before RDF conversion

**Use SHACL Shapes** (`validate_with_shacl.py`):

- ✅ Production RDF triple stores (GraphDB, Fuseki)
- ✅ Data ingestion pipelines
- ✅ Continuous monitoring (real-time validation)
- ✅ After RDF conversion (final quality gate)

**Best Practice**: Use **both**:

1. Python validator during development (YAML → validate → RDF)
2. SHACL shapes in production (RDF → validate → store)

---

## Advanced Usage

### Generate Validation Report

```bash
python scripts/validate_with_shacl.py data.ttl --output report.ttl
```

**Report Format** (Turtle):

```turtle
@prefix sh: .

[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode ;
        sh:resultMessage "Collection valid_from (1970-01-01) must be >= ..." ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
        sh:sourceShape custodian:CollectionUnitTemporalConsistencyShape
    ]
] .
```

---

### Custom Severity Levels

SHACL supports three severity levels:

```turtle
sh:severity sh:Violation ;  # ERROR (blocks data loading)
sh:severity sh:Warning ;    # WARNING (logged but allowed)
sh:severity sh:Info ;       # INFO (informational only)
```

**Example**: Custody gap is a **warning** (data quality issue but not invalid):

```turtle
custodian:CustodyTransferContinuityShape
    sh:sparql [
        sh:severity sh:Warning ;  # Allow data but log warning
        sh:message "Custody gap detected..." ;
        ...
    ] .
```

---

### Extending Shapes

Add custom validation rules by creating new shapes:

```turtle
# Custom rule: Collection name must not be empty
custodian:CollectionNameNotEmptyShape
    a sh:NodeShape ;
    sh:targetClass custodian:CustodianCollection ;
    sh:property [
        sh:path custodian:collection_name ;
        sh:minLength 1 ;
        sh:message "Collection name must not be empty" ;
    ] .
```

---

## Troubleshooting

### Common Issues

#### Issue 1: "pyshacl not found"

**Solution**:

```bash
pip install pyshacl rdflib
```

#### Issue 2: "Parse error: Invalid Turtle syntax"

**Solution**: Validate RDF syntax first:

```bash
rdfpipe -i turtle data.ttl > /dev/null
# If errors, fix syntax before SHACL validation
```

#### Issue 3: "No violations found but data is clearly invalid"

**Solution**: Check that the namespace prefixes in the shapes file and the data file resolve to the same namespace:

```turtle
# Shapes file uses:
@prefix custodian: .

# Data file must use same namespace:
```

---

## References

- **SHACL Specification**: https://www.w3.org/TR/shacl/
- **pyshacl Documentation**: https://github.com/RDFLib/pySHACL
- **SHACL Advanced Features**: https://www.w3.org/TR/shacl-af/
- **Python Validator (Phase 5)**: `scripts/validate_temporal_consistency.py`
- **SPARQL Queries (Phase 6)**: `docs/SPARQL_QUERIES_ORGANIZATIONAL.md`
- **Schema (v0.7.0)**: `schemas/20251121/linkml/01_custodian_name_modular.yaml`

---

## Next Steps

### Phase 8: LinkML Schema Constraints

Embed validation rules directly into LinkML schema using:

- `minimum_value` / `maximum_value` for date comparisons
- `pattern` for format validation
- Custom validators with Python functions
- Slot-level constraints

**Goal**: Validate at **schema definition** level, not just RDF level.

---

**Document Version**: 1.0.0  
**Schema Version**: v0.7.0  
**Last Updated**: 2025-11-22  
**SHACL Shapes File**: `schemas/20251121/shacl/custodian_validation_shapes.ttl` (474 lines)  
**Validation Script**: `scripts/validate_with_shacl.py` (289 lines)