glam/docs/SHACL_VALIDATION_SHAPES.md
kempersc 6eb18700f0 Add SHACL validation shapes and validation script for Heritage Custodian Ontology
- Created SHACL shapes for validating temporal consistency and bidirectional relationships in custodial collections and staff observations.
- Implemented a Python script to validate RDF data against the defined SHACL shapes using the pyshacl library.
- Added command-line interface for validation with options for specifying data formats and output reports.
- Included detailed error handling and reporting for validation results.
2025-11-22 23:22:10 +01:00

SHACL Validation Shapes for Heritage Custodian Ontology

Version: 1.0.0
Schema Version: v0.7.0
Created: 2025-11-22
SHACL Spec: https://www.w3.org/TR/shacl/


Table of Contents

  1. Overview
  2. Installation
  3. Usage
  4. Validation Rules
  5. Shape Definitions
  6. Examples
  7. Integration
  8. Comparison with Python Validator
  9. Advanced Usage
 10. Troubleshooting
 11. Next Steps

Overview

This document describes the SHACL (Shapes Constraint Language) validation shapes for the Heritage Custodian Ontology. SHACL shapes enforce data quality constraints at RDF ingestion time, preventing invalid data from entering triple stores.

What is SHACL?

SHACL is a W3C Recommendation for validating RDF graphs against a set of conditions (shapes). Ad hoc SPARQL queries are typically run to find violations after data is stored, whereas SHACL shapes can be evaluated before or during loading, so invalid data can be rejected up front.

Benefits of SHACL Validation

  • Prevention over Detection: Reject invalid data before storage
  • Standardized Reports: Machine-readable validation results
  • Triple Store Integration: SHACL validators are available in GraphDB and Apache Jena, among others (check your store's documentation for its level of support)
  • Declarative Constraints: Express rules in RDF (no external scripts)
  • Detailed Error Messages: Each result identifies the focus node, source shape, and message for the failing data


Installation

Prerequisites

Install Python dependencies:

pip install pyshacl rdflib

Libraries:

  • pyshacl (v0.25.0+): SHACL validator for Python
  • rdflib (v7.0.0+): RDF graph library

Verify Installation

python3 -c "import pyshacl; print(pyshacl.__version__)"
# Expected output: 0.25.0 (or later)

Usage

Command Line Validation

Basic Usage:

python scripts/validate_with_shacl.py data.ttl

With Custom Shapes:

python scripts/validate_with_shacl.py data.ttl --shapes custom_shapes.ttl

Different RDF Formats:

# JSON-LD data
python scripts/validate_with_shacl.py data.jsonld --format jsonld

# N-Triples data
python scripts/validate_with_shacl.py data.nt --format nt

Save Validation Report:

python scripts/validate_with_shacl.py data.ttl --output report.ttl

Verbose Output:

python scripts/validate_with_shacl.py data.ttl --verbose

Python Library Usage

from scripts.validate_with_shacl import validate_file

# Validate with default shapes
if validate_file("data.ttl"):
    print("✅ Data is valid")
else:
    print("❌ Data has violations")

# Validate with custom shapes
if validate_file("data.ttl", shapes_file="custom_shapes.ttl"):
    print("✅ Valid")

Triple Store Integration

Apache Jena Fuseki:

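# Fuseki does not validate automatically on SPARQL UPDATE.
# Instead, expose a SHACL endpoint in the service configuration, e.g.
#   fuseki:endpoint [ fuseki:operation fuseki:shacl ; fuseki:name "shacl" ]
# then POST the shapes file to validate a graph of the dataset
# (dataset name "ds" and port 3030 are placeholders):
curl -XPOST --data-binary @custodian_validation_shapes.ttl \
     --header 'Content-type: text/turtle' \
     'http://localhost:3030/ds/shacl?graph=default'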

GraphDB:

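  1. Create repository with SHACL validation enabled
  2. Import the shapes file into the repository's SHACL shape graph (by default the reserved graph http://rdf4j.org/schema/rdf4j#SHACLShapeGraph; confirm this against your GraphDB version)
  3. GraphDB then validates data changes at commit time and rejects transactions that violate the shapes

Most rules below use SHACL-SPARQL (sh:sparql) constraints, so check that your GraphDB version supports them before relying on in-store validation.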

Validation Rules

This SHACL shapes file implements 5 core validation rules from Phase 5:

| Rule ID | Name | Severity | Description |
|---------|------|----------|-------------|
| Rule 1 | Collection-Unit Temporal Consistency | ERROR | Collection custody dates must fall within managing unit's validity period |
| Rule 2 | Collection-Unit Bidirectional | ERROR | Collection → unit must have inverse unit → collection |
| Rule 3 | Custody Transfer Continuity | WARNING | Custody transfers must be continuous (no gaps/overlaps) |
| Rule 4 | Staff-Unit Temporal Consistency | ERROR | Staff employment dates must fall within unit's validity period |
| Rule 5 | Staff-Unit Bidirectional | ERROR | Person → unit must have inverse unit → person |

Plus 3 additional shapes for type and format constraints.


Shape Definitions

Rule 1: Collection-Unit Temporal Consistency

Shape ID: custodian:CollectionUnitTemporalConsistencyShape

Target: All instances of custodian:CustodianCollection

Constraints:

Constraint 1.1: Collection Starts After Unit Founding

sh:sparql [
    sh:message "Collection valid_from ({?collectionStart}) must be >= managing unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?collectionStart ?unitStart ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_from ?collectionStart .
            
            ?managingUnit custodian:valid_from ?unitStart .
            
            # VIOLATION: Collection starts before unit exists
            FILTER(?collectionStart < ?unitStart)
        }
    """ ;
] .

Example Violation:

# Unit founded 2010
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date .

# Collection started 2005 (INVALID!)
<https://example.org/collection/col-1>
    a custodian:CustodianCollection ;
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2005-01-01"^^xsd:date .

Violation Report:

❌ Validation Result [Constraint Component: sh:SPARQLConstraintComponent]
    Severity: sh:Violation
    Message: Collection valid_from (2005-01-01) must be >= managing unit valid_from (2010-01-01)
    Focus Node: https://example.org/collection/col-1

Constraint 1.2: Collection Ends Before Unit Dissolution

sh:sparql [
    sh:message "Collection valid_to ({?collectionEnd}) must be <= managing unit valid_to ({?unitEnd})" ;
    sh:select """
        SELECT $this ?collectionEnd ?unitEnd ?managingUnit
        WHERE {
            $this custodian:managing_unit ?managingUnit ;
                  custodian:valid_to ?collectionEnd .
            
            ?managingUnit custodian:valid_to ?unitEnd .
            
            # Unit is dissolved
            FILTER(BOUND(?unitEnd))
            
            # VIOLATION: Collection custody ends after unit dissolution
            FILTER(?collectionEnd > ?unitEnd)
        }
    """ ;
] .

Example Violation:

# Unit dissolved 2020
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure ;
    custodian:valid_from "2010-01-01"^^xsd:date ;
    custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ended 2023 (INVALID!)
<https://example.org/collection/col-1>
    a custodian:CustodianCollection ;
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2015-01-01"^^xsd:date ;
    custodian:valid_to "2023-06-01"^^xsd:date .

Warning: Ongoing Custody After Unit Dissolution

sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Collection has ongoing custody but managing unit was dissolved" ;
    sh:select """
        SELECT $this ?managingUnit ?unitEnd
        WHERE {
            $this custodian:managing_unit ?managingUnit .
            
            # Collection has no end date (ongoing)
            FILTER NOT EXISTS { $this custodian:valid_to ?collectionEnd }
            
            # But unit is dissolved
            ?managingUnit custodian:valid_to ?unitEnd .
        }
    """ ;
] .

Example Warning:

# Unit dissolved 2020
<https://example.org/unit/dept-1>
    custodian:valid_to "2020-12-31"^^xsd:date .

# Collection custody ongoing (WARNING!)
<https://example.org/collection/col-1>
    custodian:managing_unit <https://example.org/unit/dept-1> ;
    custodian:valid_from "2015-01-01"^^xsd:date .
    # No valid_to → custody still active

Interpretation: The collection was likely transferred to another unit, but its custody history was not updated.
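
The three Rule 1 checks above can be restated in plain Python, which may help when reasoning about edge cases before writing SPARQL. This is only a sketch of the logic, not the validator itself; the Period class and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Period:
    """Validity period; valid_to=None means "still active"."""
    valid_from: date
    valid_to: Optional[date] = None


def check_collection_against_unit(collection: Period, unit: Period) -> list[str]:
    """Return the Rule 1 findings for one collection and its managing unit."""
    findings = []
    # Constraint 1.1: custody may not start before the unit exists
    if collection.valid_from < unit.valid_from:
        findings.append("ERROR: collection starts before unit exists")
    # The remaining checks only apply once the unit has been dissolved
    if unit.valid_to is not None:
        if collection.valid_to is None:
            findings.append("WARNING: ongoing custody but unit was dissolved")
        elif collection.valid_to > unit.valid_to:
            # Constraint 1.2: custody may not outlive the unit
            findings.append("ERROR: collection custody ends after unit dissolution")
    return findings


unit = Period(date(2010, 1, 1), date(2020, 12, 31))
print(check_collection_against_unit(Period(date(2015, 1, 1), date(2023, 6, 1)), unit))
```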


Rule 2: Collection-Unit Bidirectional Relationships

Shape ID: custodian:CollectionUnitBidirectionalShape

Target: All instances of custodian:CustodianCollection

Constraint: If collection references managing_unit, unit must reference collection in managed_collections.

sh:sparql [
    sh:message "Collection references managing_unit {?unit} but unit does not list collection in managed_collections" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:managing_unit ?unit .
            
            # VIOLATION: Unit does not reference collection back
            FILTER NOT EXISTS {
                ?unit custodian:managed_collections $this
            }
        }
    """ ;
] .

Example Violation:

# Collection references unit
<https://example.org/collection/col-1>
    custodian:managing_unit <https://example.org/unit/dept-1> .

# But unit does NOT reference collection (INVALID!)
<https://example.org/unit/dept-1>
    a custodian:OrganizationalStructure .
    # Missing: custodian:managed_collections <https://example.org/collection/col-1>

Fix:

# Add inverse relationship
<https://example.org/unit/dept-1>
    custodian:managed_collections <https://example.org/collection/col-1> .
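
The bidirectional check amounts to looking for forward links without a matching inverse link. A minimal Python sketch of that idea follows; the dictionary layout and the function name are assumptions made for illustration, and each collection is assumed to have a single managing unit.

```python
def find_missing_inverses(
    managing_unit: dict[str, str],
    managed_collections: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (collection, unit) pairs where the unit does not link back.

    managing_unit maps collection IRI -> unit IRI (forward link).
    managed_collections maps unit IRI -> set of collection IRIs (inverse link).
    """
    missing = []
    for collection, unit in managing_unit.items():
        if collection not in managed_collections.get(unit, set()):
            missing.append((collection, unit))
    return missing


forward = {"https://example.org/collection/col-1": "https://example.org/unit/dept-1"}
inverse = {"https://example.org/unit/dept-1": set()}  # inverse link is missing
print(find_missing_inverses(forward, inverse))
```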

Rule 3: Custody Transfer Continuity

Shape ID: custodian:CustodyTransferContinuityShape

Target: All instances of custodian:CustodianCollection

Constraints:

Check for Gaps in Custody Chain

sh:sparql [
    sh:severity sh:Warning ;
    sh:message "Custody gap detected: previous custody ended on {?prevEnd} but next custody started on {?nextStart}" ;
    sh:select """
        SELECT $this ?prevEnd ?nextStart ?gapDays
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .
            
            ?event1 custodian:transfer_date ?prevEnd .
            ?event2 custodian:transfer_date ?nextStart .
            
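            FILTER(?nextStart > ?prevEnd)
            # NOTE: subtracting dates is not defined by SPARQL 1.1; it is an
            # engine extension that usually yields a duration, so verify that
            # the comparison below behaves as intended in your engine.
            BIND((xsd:date(?nextStart) - xsd:date(?prevEnd)) AS ?gapDays)
            
            # WARNING: Gap > 1 day
            FILTER(?gapDays > 1)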
        }
    """ ;
] .

Example Warning:

<https://example.org/collection/col-1>
    custodian:custody_history <https://example.org/event/transfer-1> ;
    custodian:custody_history <https://example.org/event/transfer-2> .

<https://example.org/event/transfer-1>
    custodian:transfer_date "2010-01-01"^^xsd:date .

<https://example.org/event/transfer-2>
    custodian:transfer_date "2010-02-01"^^xsd:date .
    # Gap of 31 days between transfers
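
To make the gap arithmetic concrete, here is a small Python sketch that computes gaps in days between transfer dates. Note one deliberate difference from the SPARQL query above: the query compares every pair of events, while this sketch sorts the dates and compares only consecutive ones. The function name and the max_gap_days parameter are hypothetical.

```python
from datetime import date


def custody_gaps(
    transfer_dates: list[date], max_gap_days: int = 1
) -> list[tuple[date, date, int]]:
    """Return (previous, next, gap_in_days) for consecutive dates further apart than max_gap_days."""
    ordered = sorted(transfer_dates)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap_days = (nxt - prev).days  # date subtraction yields a timedelta
        if gap_days > max_gap_days:
            gaps.append((prev, nxt, gap_days))
    return gaps


# Same dates as the example warning above
print(custody_gaps([date(2010, 1, 1), date(2010, 2, 1)]))
```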

Check for Overlaps in Custody Chain

sh:sparql [
    sh:message "Custody overlap detected: collection managed by {?custodian1} until {?end1} and simultaneously by {?custodian2} from {?start2}" ;
    sh:select """
        SELECT $this ?custodian1 ?end1 ?custodian2 ?start2
        WHERE {
            $this custodian:custody_history ?event1 ;
                  custodian:custody_history ?event2 .
            
            ?event1 custodian:new_custodian ?custodian1 ;
                    custodian:custody_end_date ?end1 .
            
            ?event2 custodian:new_custodian ?custodian2 ;
                    custodian:transfer_date ?start2 .
            
            FILTER(?custodian1 != ?custodian2)
            FILTER(?start2 < ?end1)  # Overlap!
        }
    """ ;
] .
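
The overlap condition can be illustrated with the standard interval-overlap test in Python. This sketch is stricter than the SPARQL filter above, which only checks that the second custody starts before the first ends; here both directions are checked so that a custody period that ended before the other began is not flagged. The Custody type and field names are hypothetical.

```python
from datetime import date
from typing import NamedTuple, Optional


class Custody(NamedTuple):
    custodian: str
    start: date
    end: Optional[date]  # None means custody is ongoing


def periods_overlap(a: Custody, b: Custody) -> bool:
    """True if two different custodians hold the collection at the same time."""
    a_end = a.end or date.max  # treat ongoing custody as open-ended
    b_end = b.end or date.max
    # Half-open intervals: a handover on the same day does not count as overlap
    return a.custodian != b.custodian and a.start < b_end and b.start < a_end


first = Custody("unit-a", date(2010, 1, 1), date(2015, 1, 1))
second = Custody("unit-b", date(2014, 6, 1), None)
print(periods_overlap(first, second))
```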

Rule 4: Staff-Unit Temporal Consistency

Shape ID: custodian:StaffUnitTemporalConsistencyShape

Target: All instances of custodian:PersonObservation

Constraints: Same as Rule 1, but for staff employment dates vs. unit validity period.

Constraint 4.1: Employment Starts After Unit Founding

sh:sparql [
    sh:message "Staff employment_start_date ({?employmentStart}) must be >= unit valid_from ({?unitStart})" ;
    sh:select """
        SELECT $this ?employmentStart ?unitStart ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit ;
                  custodian:employment_start_date ?employmentStart .
            
            ?unit custodian:valid_from ?unitStart .
            
            FILTER(?employmentStart < ?unitStart)
        }
    """ ;
] .

Example Violation:

# Unit founded 2015
<https://example.org/unit/dept-1>
    custodian:valid_from "2015-01-01"^^xsd:date .

# Staff employed 2010 (INVALID!)
<https://example.org/person/john-doe>
    custodian:unit_affiliation <https://example.org/unit/dept-1> ;
    custodian:employment_start_date "2010-01-01"^^xsd:date .

Rule 5: Staff-Unit Bidirectional Relationships

Shape ID: custodian:StaffUnitBidirectionalShape

Target: All instances of custodian:PersonObservation

Constraint: If person references unit_affiliation, unit must reference person in staff_members or org:hasMember.

sh:sparql [
    sh:message "Person references unit_affiliation {?unit} but unit does not list person in staff_members" ;
    sh:select """
        SELECT $this ?unit
        WHERE {
            $this custodian:unit_affiliation ?unit .
            
            # VIOLATION: Unit does not reference person back
            FILTER NOT EXISTS {
                { ?unit custodian:staff_members $this }
                UNION
                { ?unit org:hasMember $this }
            }
        }
    """ ;
] .

Additional Shapes: Type and Format Constraints

Type Constraint: managing_unit Must Be OrganizationalStructure

custodian:CollectionManagingUnitTypeShape
    sh:property [
        sh:path custodian:managing_unit ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "managing_unit must be an instance of OrganizationalStructure" ;
    ] .

Type Constraint: unit_affiliation Must Be OrganizationalStructure

custodian:PersonUnitAffiliationTypeShape
    sh:property [
        sh:path custodian:unit_affiliation ;
        sh:class custodian:OrganizationalStructure ;
        sh:message "unit_affiliation must be an instance of OrganizationalStructure" ;
    ] .

Format Constraint: Dates Must Be xsd:date or xsd:dateTime

custodian:DatetimeFormatShape
    sh:property [
        sh:path custodian:valid_from ;
        sh:or (
            [ sh:datatype xsd:date ]
            [ sh:datatype xsd:dateTime ]
        ) ;
    ] .
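
When producing RDF from raw strings, it can help to check the lexical form before assigning a datatype. The Python sketch below is such a pre-check, under simplifying assumptions: it ignores time zones on plain dates and negative years, and it is not a full XSD implementation. Note that sh:datatype also checks the literal's declared datatype, which this sketch does not model. The function name is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Simplified lexical patterns (assumption: no negative years, no zone on dates)
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
DATETIME_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
)


def guess_xsd_type(lexical: str) -> Optional[str]:
    """Return "xsd:date", "xsd:dateTime", or None if the string fits neither."""
    try:
        if DATE_RE.fullmatch(lexical):
            date.fromisoformat(lexical)  # rejects impossible dates such as month 13
            return "xsd:date"
        if DATETIME_RE.fullmatch(lexical):
            # Validate only the date and time part; fraction and zone were matched above
            datetime.fromisoformat(lexical[:19])
            return "xsd:dateTime"
    except ValueError:
        pass
    return None


for value in ["2010-01-01", "2010-01-01T12:00:00Z", "01/01/2010"]:
    print(value, "->", guess_xsd_type(value))
```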

Examples

Example 1: Valid Collection-Unit Relationship

Valid RDF Data:

@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/unit/paintings-dept>
    a custodian:OrganizationalStructure ;
    custodian:unit_name "Paintings Department" ;
    custodian:valid_from "1985-01-01"^^xsd:date ;
    custodian:managed_collections <https://example.org/collection/dutch-paintings> .

<https://example.org/collection/dutch-paintings>
    a custodian:CustodianCollection ;
    custodian:collection_name "Dutch Paintings" ;
    custodian:managing_unit <https://example.org/unit/paintings-dept> ;
    custodian:valid_from "1995-01-01"^^xsd:date .

Validation:

python scripts/validate_with_shacl.py valid_data.ttl
# ✅ VALIDATION PASSED
# No constraint violations found.

Example 2: Invalid - Temporal Violation

Invalid RDF Data:

<https://example.org/unit/paintings-dept>
    custodian:valid_from "1985-01-01"^^xsd:date .

<https://example.org/collection/dutch-paintings>
    custodian:managing_unit <https://example.org/unit/paintings-dept> ;
    custodian:valid_from "1970-01-01"^^xsd:date .  # Before unit exists!

Validation:

python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
# 
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result [Constraint Component: sh:SPARQLConstraintComponent]:
#     Severity: sh:Violation
#     Message: Collection valid_from (1970-01-01) must be >= managing unit valid_from (1985-01-01)
#     Focus Node: https://example.org/collection/dutch-paintings
#     Result Path: -
#     Source Shape: custodian:CollectionUnitTemporalConsistencyShape

Example 3: Invalid - Missing Bidirectional Relationship

Invalid RDF Data:

<https://example.org/collection/dutch-paintings>
    custodian:managing_unit <https://example.org/unit/paintings-dept> .

<https://example.org/unit/paintings-dept>
    a custodian:OrganizationalStructure .
    # Missing: custodian:managed_collections <https://example.org/collection/dutch-paintings>

Validation:

python scripts/validate_with_shacl.py invalid_data.ttl
# ❌ VALIDATION FAILED
# 
# Constraint Violations:
# --------------------------------------------------------------------------------
# Validation Result:
#     Severity: sh:Violation
#     Message: Collection references managing_unit https://example.org/unit/paintings-dept
#              but unit does not list collection in managed_collections
#     Focus Node: https://example.org/collection/dutch-paintings

Integration

CI/CD Pipeline Integration

GitHub Actions Example:

name: SHACL Validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install pyshacl rdflib

      - name: Validate RDF data
        run: |
          # The usage above passes one data file per invocation, so loop over the instances
          mkdir -p reports
          for file in data/instances/*.ttl; do
            python scripts/validate_with_shacl.py "$file" \
              --output "reports/$(basename "$file" .ttl)_report.ttl"
          done

      - name: Upload validation reports
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: validation-reports
          path: reports/

Pre-commit Hook

.git/hooks/pre-commit:

#!/bin/bash
# Validate RDF files before commit

echo "Running SHACL validation..."

for file in data/instances/*.ttl; do
    python scripts/validate_with_shacl.py "$file" --quiet
    if [ $? -ne 0 ]; then
        echo "❌ SHACL validation failed for $file"
        echo "Fix violations before committing."
        exit 1
    fi
done

echo "✅ All files pass SHACL validation"
exit 0

Comparison with Python Validator

Phase 5 Python Validator vs. Phase 7 SHACL Shapes

| Aspect | Python Validator (Phase 5) | SHACL Shapes (Phase 7) |
|--------|----------------------------|------------------------|
| Input Format | YAML (LinkML instances) | RDF (Turtle, JSON-LD, etc.) |
| Execution | Standalone script | Triple store integrated OR pyshacl |
| Performance | Fast for <1,000 records | Optimized for >10,000 records |
| Deployment | Python runtime required | RDF triple store native |
| Error Messages | Custom CLI output | Standardized SHACL reports |
| CI/CD | Exit codes (0/1/2) | Exit codes (0/1/2) + RDF report |
| Use Case | Development validation | Production runtime validation |

When to Use Which?

Use Python Validator (validate_temporal_consistency.py):

  • During schema development (fast feedback on YAML instances)
  • Pre-commit hooks for LinkML files
  • Unit testing LinkML examples
  • Before RDF conversion

Use SHACL Shapes (validate_with_shacl.py):

  • Production RDF triple stores (GraphDB, Fuseki)
  • Data ingestion pipelines
  • Continuous monitoring (real-time validation)
  • After RDF conversion (final quality gate)

Best Practice: Use both:

  1. Python validator during development (YAML → validate → RDF)
  2. SHACL shapes in production (RDF → validate → store)

Advanced Usage

Generate Validation Report

python scripts/validate_with_shacl.py data.ttl --output report.ttl

Report Format (Turtle):

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .

[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode <https://example.org/collection/col-1> ;
        sh:resultMessage "Collection valid_from (1970-01-01) must be >= ..." ;
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
        sh:sourceShape custodian:CollectionUnitTemporalConsistencyShape
    ]
] .

Custom Severity Levels

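SHACL defines three severity levels. Per the SHACL specification, a report has sh:conforms false as soon as it contains any validation result, whatever its severity; whether warnings and info results actually block loading is decided by the tool (pyshacl, for example, offers allow_warnings and allow_infos options).

sh:severity sh:Violation ;  # ERROR (the default when sh:severity is omitted)
sh:severity sh:Warning ;    # WARNING (non-blocking only if the tool is configured to allow it)
sh:severity sh:Info ;       # INFO (informational only)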

Example: Custody gap is a warning (data quality issue but not invalid):

custodian:CustodyTransferContinuityShape
    sh:sparql [
        sh:severity sh:Warning ;  # Allow data but log warning
        sh:message "Custody gap detected..." ;
        ...
    ] .
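
How severities translate into a pass/fail decision is a policy choice made by the caller. A minimal Python sketch of one such policy is shown below; the function name, the block_at parameter, and the use of prefixed names as plain strings are assumptions for illustration, not part of any library.

```python
# Rank the three SHACL severities from least to most severe
SEVERITY_RANK = {"sh:Info": 0, "sh:Warning": 1, "sh:Violation": 2}


def should_block(result_severities: list[str], block_at: str = "sh:Violation") -> bool:
    """Return True if any result is at least as severe as the blocking threshold."""
    threshold = SEVERITY_RANK[block_at]
    return any(SEVERITY_RANK[s] >= threshold for s in result_severities)


# A custody gap alone is logged but does not block under the default policy
print(should_block(["sh:Warning"]))
# A stricter pipeline can choose to block on warnings as well
print(should_block(["sh:Warning"], block_at="sh:Warning"))
```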

Extending Shapes

Add custom validation rules by creating new shapes:

# Custom rule: Collection name must not be empty
custodian:CollectionNameNotEmptyShape
    a sh:NodeShape ;
    sh:targetClass custodian:CustodianCollection ;
    sh:property [
        sh:path custodian:collection_name ;
        sh:minLength 1 ;
        sh:message "Collection name must not be empty" ;
    ] .

Troubleshooting

Common Issues

Issue 1: "pyshacl not found"

Solution:

pip install pyshacl rdflib

Issue 2: "Parse error: Invalid Turtle syntax"

Solution: Validate RDF syntax first:

rdfpipe -i turtle data.ttl > /dev/null
# If errors, fix syntax before SHACL validation

Issue 3: "No violations found but data is clearly invalid"

Solution: Check namespace prefixes match between shapes and data:

# Shapes file uses:
@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .

# Data file must use same namespace:
<https://nde.nl/ontology/hc/custodian/CustodianCollection>
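
A quick way to spot this kind of mismatch is to compare the prefix declarations of the two files. The Python sketch below does so with a regular expression rather than a full Turtle parser, so it is only a heuristic; the function names are hypothetical. A prefix label bound to different IRIs is not an error in RDF itself, but it is a common symptom of shapes and data using different namespaces.

```python
import re

# Matches "@prefix name: <iri>" and SPARQL-style "PREFIX name: <iri>" declarations
PREFIX_RE = re.compile(
    r"^\s*(?:@prefix|PREFIX)\s+([A-Za-z][\w.-]*)?:\s*<([^>]*)>",
    re.IGNORECASE | re.MULTILINE,
)


def declared_prefixes(turtle_text: str) -> dict[str, str]:
    """Map each declared prefix label to its namespace IRI."""
    return {name: iri for name, iri in PREFIX_RE.findall(turtle_text)}


def mismatched_prefixes(shapes_text: str, data_text: str) -> dict[str, tuple[str, str]]:
    """Return prefixes declared in both texts but bound to different IRIs."""
    shapes = declared_prefixes(shapes_text)
    data = declared_prefixes(data_text)
    return {
        prefix: (shapes[prefix], data[prefix])
        for prefix in shapes.keys() & data.keys()
        if shapes[prefix] != data[prefix]
    }


shapes_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian/> .\n"
data_text = "@prefix custodian: <https://nde.nl/ontology/hc/custodian#> .\n"
print(mismatched_prefixes(shapes_text, data_text))
```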


Next Steps

Phase 8: LinkML Schema Constraints

Embed validation rules directly into LinkML schema using:

  • minimum_value / maximum_value for date comparisons
  • pattern for format validation
  • Custom validators with Python functions
  • Slot-level constraints

Goal: Validate at schema definition level, not just RDF level.


Document Version: 1.0.0
Schema Version: v0.7.0
Last Updated: 2025-11-22
SHACL Shapes File: schemas/20251121/shacl/custodian_validation_shapes.ttl (474 lines)
Validation Script: scripts/validate_with_shacl.py (289 lines)