glam/.opencode/SCHEMA_GENERATION_RULES.md

Schema Generation Rules for AI Agents

Date: 2025-11-22
Purpose: Standard rules for generating derived artifacts from LinkML schemas


Rule 1: Always Use Full Timestamps in Generated File Names

MANDATORY: When generating derived artifacts (RDF, UML, etc.) from LinkML schemas, ALWAYS include a full timestamp (date AND time) in the filename.

Format

{base_name}_{YYYYMMDD}_{HHMMSS}.{extension}

Examples

# ✅ CORRECT - Full timestamp (date + time)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-yuml schemas/linkml/schema.yaml > schemas/uml/mermaid/schema_${TIMESTAMP}.mmd
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/schema_${TIMESTAMP}.owl.ttl

# Examples of correct filenames:
custodian_multi_aspect_20251122_154136.mmd
custodian_multi_aspect_20251122_154430.owl.ttl
custodian_multi_aspect_20251122_154430.nt
custodian_multi_aspect_20251122_154430.jsonld
custodian_multi_aspect_20251122_154430.rdf

# ❌ WRONG - No timestamp
schema.mmd
01_custodian_name.owl.ttl

# ❌ WRONG - Date only (MISSING TIME!)
schema_20251122.mmd
custodian_multi_aspect_20251122.owl.ttl

# ❌ WRONG - Time only (missing date)
schema_154430.mmd

Rationale

  1. Version tracking: A full timestamp identifies each generated version precisely
  2. No overwrites: Repeated generation runs on the same day, by one person or a whole team, never replace earlier output
  3. Debugging: The filename shows exactly when an artifact was generated
  4. Rollback: Reverting to a specific version is straightforward
  5. Audit trail: Filenames document schema evolution in chronological order
  6. Git-friendly: Successive versions are separate files that are easy to diff
  7. Reproducibility: Generated artifacts can be correlated with git commits by time

Critical Note

The timestamp must include BOTH date and time (YYYYMMDD_HHMMSS), not just date. This allows multiple generation runs per day without filename conflicts.
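
The naming rule can also be checked programmatically. The following is a minimal sketch, not part of the project tooling: the helper names `timestamped_name` and `has_full_timestamp` are hypothetical, and the regular expression assumes the `{base_name}_{YYYYMMDD}_{HHMMSS}.{extension}` format given above.

```python
import re
from datetime import datetime
from typing import Optional

# Matches {base_name}_{YYYYMMDD}_{HHMMSS}.{extension}; the extension may have
# several parts (for example "owl.ttl").
TIMESTAMPED = re.compile(
    r"^(?P<base>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.(?P<ext>[A-Za-z0-9.]+)$"
)


def timestamped_name(base_name: str, extension: str,
                     now: Optional[datetime] = None) -> str:
    """Build a filename that carries both the date and the time."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{base_name}_{stamp}.{extension}"


def has_full_timestamp(filename: str) -> bool:
    """Return True only if the name contains a valid date AND time."""
    match = TIMESTAMPED.match(filename)
    if match is None:
        return False
    try:
        # Reject digit runs that are not real dates or times.
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True
```

A check like this could run before committing to catch date-only or time-only names.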


Rule 2: LinkML is the Single Source of Truth

NEVER manually create or edit derived files. Always generate from LinkML.

Correct Workflow

1. Edit LinkML schema (.yaml)
2. Generate RDF formats (gen-owl + rdfpipe)
3. Generate UML diagrams (gen-yuml)
4. Generate TypeDB schema (manual translation, but documented)
5. Validate examples (linkml-validate)

Incorrect Workflow

❌ Editing .ttl files directly
❌ Creating .jsonld manually
❌ Drawing UML diagrams by hand
❌ Modifying TypeDB schema without updating LinkML

Rule 3: Generate All RDF Serialization Formats

When generating RDF from LinkML, produce all standard serialization formats:

Required Formats

  1. OWL/Turtle (.owl.ttl) - Primary, human-readable
  2. N-Triples (.nt) - Simple, line-based
  3. JSON-LD (.jsonld) - Web-friendly, JSON-based
  4. RDF/XML (.rdf) - XML-based, traditional

Generation Commands

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="schema_${TIMESTAMP}"

# 1. Generate OWL/Turtle (primary)
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/${BASE_NAME}.owl.ttl

# 2. Convert to other formats using rdfpipe
rdfpipe --input-format turtle --output-format nt schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.nt
rdfpipe --input-format turtle --output-format json-ld schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.jsonld
rdfpipe --input-format turtle --output-format xml schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.rdf
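
For scripted use, the same four-format pipeline can be described as data. This is a sketch under assumptions: `build_rdf_commands` and `RDFPIPE_FORMATS` are hypothetical names, and the function only builds argument lists using the flags shown above; it does not run anything.

```python
from pathlib import Path

# rdfpipe output-format name -> file extension, following "Required Formats"
RDFPIPE_FORMATS = {"nt": "nt", "json-ld": "jsonld", "xml": "rdf"}


def build_rdf_commands(schema: str, out_dir: str, base_name: str):
    """Return (argv, output_path) pairs: gen-owl first, then one rdfpipe call per format."""
    out = Path(out_dir)
    turtle = out / f"{base_name}.owl.ttl"
    # Step 1: the primary OWL/Turtle file, generated straight from LinkML.
    commands = [(["gen-owl", "-f", "ttl", schema], turtle)]
    # Step 2: every other serialization is converted from that Turtle file.
    for fmt, ext in RDFPIPE_FORMATS.items():
        argv = ["rdfpipe", "--input-format", "turtle",
                "--output-format", fmt, str(turtle)]
        commands.append((argv, out / f"{base_name}.{ext}"))
    return commands
```

Each `argv` could then be passed to `subprocess.run` with stdout redirected to the paired output path, mirroring the shell redirections above.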

Rule 4: Validate Before Committing

Before committing schema changes, ALWAYS:

  1. Validate LinkML schema:

    gen-owl -f ttl schemas/linkml/schema.yaml > /tmp/test_validation.ttl
    # Check for errors in output
    
  2. Validate example instances:

    linkml-validate -s schemas/linkml/schema.yaml schemas/examples/instance.yaml
    
  3. Check the RDF triple count:

    wc -l schemas/rdf/*.nt  # N-Triples are easy to count
    
  4. Verify class presence:

    grep -c "ClassName" schemas/rdf/*.owl.ttl
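
The triple count in step 3 relies on N-Triples holding one statement per line. `wc -l` also counts blank lines and comment lines if a file happens to contain any, so a slightly stricter count may be useful. A minimal sketch (the function name `count_ntriples` is hypothetical):

```python
def count_ntriples(text: str) -> int:
    """Count statements in N-Triples text, skipping blank lines and # comments."""
    return sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
```

Comparing this number between two generations gives a quick signal that a schema change added or removed roughly the expected amount of RDF.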
    

Rule 5: Document Schema Changes

Every schema change requires:

  1. Quick status document: QUICK_STATUS_{TOPIC}_{YYYYMMDD}.md
  2. Session summary: SESSION_SUMMARY_{YYYYMMDD}_{TOPIC}.md
  3. Updated examples: Add/update instance files demonstrating changes
  4. Commit message: Reference quick status document

Template: Quick Status Document

# Quick Status: {Topic}
Date: YYYY-MM-DD  
Status: ✅ COMPLETE / ⏳ IN PROGRESS  
Priority: HIGH / MEDIUM / LOW

## What We Did
...

## Key Changes
...

## Files Modified
...

## Validation Results
...

## Next Steps
...

Rule 6: Example Instances Are Required

For every new class or major schema change:

  1. Create at least ONE complete example instance
  2. Place in schemas/{version}/examples/
  3. Use descriptive filenames: {class_name}_{use_case}_{timestamp}.yaml
  4. Include all required slots and at least 2-3 optional slots
  5. Add inline comments explaining non-obvious fields

Example Instance Template

---
# Complete Example: {ClassName}
# Date: YYYY-MM-DD
# Use Case: {Description}
# Status: Valid instance conforming to schema version {X.Y.Z}

instances:
  - id: https://example.org/id
    required_field_1: "value"
    required_field_2: "value"
    optional_field: "value"  # Explanation of when to use this field
    # ... more fields

Rule 7: UML Diagram Conventions

When generating UML diagrams:

File Naming

{schema_name}_{diagram_type}_{YYYYMMDD}_{HHMMSS}.mmd

Examples:

  • custodian_class_diagram_20251122_154136.mmd
  • prov_flow_sequence_20251122_154200.mmd

Diagram Types

  • class_diagram - Class hierarchies and relationships
  • sequence - PROV-O temporal flows
  • state - State transitions (e.g., organizational change events)
  • er - Entity-relationship (database perspective)

Storage Location

schemas/{version}/uml/mermaid/{timestamp_files}.mmd
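
The naming and storage conventions for diagrams can be combined in one helper. This is an illustrative sketch: `uml_diagram_path` and `DIAGRAM_TYPES` are hypothetical names, and the set of allowed types is taken from the list above.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Diagram types listed under "Diagram Types"
DIAGRAM_TYPES = {"class_diagram", "sequence", "state", "er"}


def uml_diagram_path(version: str, schema_name: str, diagram_type: str,
                     now: datetime) -> PurePosixPath:
    """Build schemas/{version}/uml/mermaid/{schema}_{type}_{YYYYMMDD}_{HHMMSS}.mmd."""
    if diagram_type not in DIAGRAM_TYPES:
        raise ValueError(f"unknown diagram type: {diagram_type!r}")
    stamp = now.strftime("%Y%m%d_%H%M%S")
    filename = f"{schema_name}_{diagram_type}_{stamp}.mmd"
    return PurePosixPath("schemas") / version / "uml" / "mermaid" / filename
```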

Rule 8: TypeDB Schema Updates

TypeDB schemas are manually translated from LinkML (not auto-generated).

Required Steps

  1. Update LinkML schema first
  2. Regenerate RDF to verify OWL alignment
  3. Manually update TypeDB schema (.tql)
  4. Document translation decisions
  5. Test TypeDB queries

Translation Documentation

Create TYPEDB_TRANSLATION_NOTES.md documenting:

  • LinkML class → TypeDB entity/relation mapping
  • Slot → attribute mapping
  • Constraints and rules
  • Query examples

Rule 9: Version Control for Generated Files

What to Commit

DO commit:

  • LinkML schema files (.yaml)
  • Example instances (.yaml)
  • Documentation (.md)
  • Latest timestamped RDF (keep last 3 versions)
  • Latest timestamped UML (keep last 3 versions)

DO NOT commit:

  • Temporary validation files (/tmp/*)
  • Old versions (>3 generations old)
  • Duplicate non-timestamped files

Cleanup Script

# Keep only the 3 newest timestamped versions of each RDF serialization
cd schemas/rdf
for ext in owl.ttl nt jsonld rdf; do
  ls -t schema_*."$ext" 2>/dev/null | tail -n +4 | xargs rm -f
done
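
A retention policy based on the timestamp in the filename, rather than on file modification time, is less sensitive to checkouts and copies that reset mtimes. The sketch below is one way to express the "keep last 3" rule; `prune_old_versions` is a hypothetical helper, and it defaults to a dry run that deletes nothing.

```python
import re
from collections import defaultdict
from pathlib import Path

# {base}_{YYYYMMDD}_{HHMMSS}.{ext}, where ext may be multi-part ("owl.ttl")
STAMPED = re.compile(r"^(?P<base>.+)_(?P<stamp>\d{8}_\d{6})\.(?P<ext>.+)$")


def prune_old_versions(directory: str, keep: int = 3, dry_run: bool = True):
    """Return files beyond the `keep` newest per (base, ext); delete them unless dry_run."""
    groups = defaultdict(list)
    for path in Path(directory).iterdir():
        match = STAMPED.match(path.name)
        if path.is_file() and match:
            groups[(match["base"], match["ext"])].append((match["stamp"], path))

    removed = []
    for versions in groups.values():
        versions.sort(reverse=True)  # newest timestamp first
        for _, path in versions[keep:]:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return sorted(removed)
```

Files without a timestamp in their name are ignored, so documentation and hand-named files in the same directory are left alone.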

Rule 10: Generation Workflow Template

Standard workflow for schema changes:

#!/bin/bash
# Schema Generation Workflow
# Usage: ./generate_schema_artifacts.sh

set -e  # Exit on error

SCHEMA_FILE="schemas/20251121/linkml/01_custodian_name_modular.yaml"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="custodian_${TIMESTAMP}"

echo "=== Schema Generation Workflow ==="
echo "Timestamp: $TIMESTAMP"
echo ""

# Step 1: Validate LinkML
echo "Step 1: Validating LinkML schema..."
gen-owl -f ttl "$SCHEMA_FILE" > /tmp/validation_test.ttl  # errors stay visible on stderr; set -e aborts on failure
echo "✅ Schema valid"

# Step 2: Generate RDF formats
echo "Step 2: Generating RDF formats..."
gen-owl -f ttl "$SCHEMA_FILE" > "schemas/20251121/rdf/${BASE_NAME}.owl.ttl"
rdfpipe --input-format turtle --output-format nt "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.nt"
rdfpipe --input-format turtle --output-format json-ld "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.jsonld"
rdfpipe --input-format turtle --output-format xml "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.rdf"
echo "✅ RDF formats generated"

# Step 3: Generate UML
echo "Step 3: Generating UML diagrams..."
gen-yuml "$SCHEMA_FILE" > "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd"
echo "✅ UML diagram generated"

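# Step 4: Validate examples
echo "Step 4: Validating example instances..."
failed=0
for example in schemas/20251121/examples/*.yaml; do
    linkml-validate -s "$SCHEMA_FILE" "$example" || { echo "⚠️  Warning: $example failed validation"; failed=$((failed + 1)); }
done
# Report success only when every example passed
if [ "$failed" -eq 0 ]; then echo "✅ Examples validated"; else echo "⚠️  $failed example(s) failed validation"; fi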

# Step 5: Report
echo ""
echo "=== Generation Complete ==="
ls -lh "schemas/20251121/rdf/${BASE_NAME}".* | awk '{print $9, "("$5")"}'
ls -lh "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd" | awk '{print $9, "("$5")"}'
echo ""
echo "Next: Update documentation and commit"

Quick Reference Commands

Generate All Artifacts

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-owl -f ttl schema.yaml > schema_${TIMESTAMP}.owl.ttl
gen-yuml schema.yaml > schema_${TIMESTAMP}.mmd

Validate

gen-owl -f ttl schema.yaml > /tmp/test.ttl  # Check for errors
linkml-validate -s schema.yaml instance.yaml

Convert RDF Formats

rdfpipe -i turtle -o nt file.ttl > file.nt
rdfpipe -i turtle -o json-ld file.ttl > file.jsonld
rdfpipe -i turtle -o xml file.ttl > file.rdf

Check RDF Content

grep -c "ClassName" file.owl.ttl  # Count class references
wc -l file.nt  # Count triples

Rule 11: SHACL Generation for Modular Schemas

The Problem

gen-shacl (LinkML's built-in SHACL generator) fails on modular schemas with errors like:

KeyError: 'contributing_agency'

This is a LinkML bug where:

  1. schema_map is keyed by import paths (e.g., modules/classes/ContributingAgency)
  2. But lookups use schema names (e.g., contributing_agency)
  3. This mismatch causes KeyError during from_schema resolution

Other LinkML tools (gen-owl, gen-yaml, linkml-lint) work fine because they don't perform this specific lookup.
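
The key mismatch can be reproduced with plain dictionaries. This is a toy model only: the structures and function names below are hypothetical and do not mirror LinkML's internal code; they just show why a lookup by schema name fails when the map is keyed by import path.

```python
# Toy stand-in for the map described above: keyed by IMPORT PATH.
schema_map = {
    "modules/classes/ContributingAgency": {"name": "contributing_agency"},
}


def lookup_by_name(mapping: dict, schema_name: str) -> dict:
    """Mimic the failing lookup: uses the schema NAME as the key."""
    return mapping[schema_name]  # raises KeyError: 'contributing_agency'


def lookup_tolerant(mapping: dict, schema_name: str):
    """Search the values instead of trusting the keys (illustration only)."""
    for schema in mapping.values():
        if schema["name"] == schema_name:
            return schema
    return None
```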

The Solution: Use scripts/generate_shacl.py

We provide a workaround script that:

  1. Loads schema via SchemaView (correctly resolves all imports)
  2. Merges all classes/slots/enums from imports into main schema
  3. Clears from_schema references that cause the bug
  4. Excludes built-in types (to avoid conflicts with linkml:types)
  5. Writes to temp file and runs ShaclGenerator
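
The merge-and-clear idea behind steps 2 to 4 can be sketched on plain dictionaries. This is not the code of scripts/generate_shacl.py: `flatten_schema` and `BUILTIN_TYPES` are hypothetical, the built-in list is a small made-up subset, and real schemas would be loaded through SchemaView rather than passed in as dicts.

```python
import copy

# Hypothetical subset of built-in type names, for illustration only
BUILTIN_TYPES = {"string", "integer", "boolean"}
SECTIONS = ("classes", "slots", "enums", "types")


def flatten_schema(main: dict, imported: list):
    """Merge imported definitions into a copy of `main` and drop from_schema."""
    merged = copy.deepcopy(main)
    for section in SECTIONS:
        target = merged.setdefault(section, {})
        for module in imported:
            for name, definition in module.get(section, {}).items():
                if section == "types" and name in BUILTIN_TYPES:
                    continue  # built-ins come from linkml:types instead
                target.setdefault(name, copy.deepcopy(definition))

    cleared = 0
    for section in SECTIONS:
        for definition in merged[section].values():
            if definition.pop("from_schema", None) is not None:
                cleared += 1

    # Everything is now inline, so only the built-in types import remains.
    merged["imports"] = ["linkml:types"]
    return merged, cleared
```

The returned count corresponds in spirit to the "Cleared N from_schema references" line that the script prints.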

Usage

# Generate SHACL with default settings (timestamped output)
python scripts/generate_shacl.py

# Generate SHACL with verbose output (shows all steps)
python scripts/generate_shacl.py --verbose

# Generate SHACL to specific file
python scripts/generate_shacl.py --output schemas/20251121/shacl/custom_shapes.ttl

# Use custom schema file
python scripts/generate_shacl.py --schema path/to/schema.yaml

# Write to stdout (for piping)
python scripts/generate_shacl.py --stdout

Output Location

By default, generates timestamped files:

schemas/20251121/shacl/custodian_shacl_{YYYYMMDD}_{HHMMSS}.ttl

What You'll See

The script produces warnings that are expected and safe to ignore:

# Inverse slot warnings (schema design issue, doesn't affect SHACL)
Range of slot 'collections_under_responsibility' (LegalResponsibilityCollection) 
does not line with the domain of its inverse (responsible_legal_entity)

# Unrecognized prefix warnings (prefixes defined in modules, not merged)
File "linkml_shacl_xxx.yaml", line 147, col 10: Unrecognized prefix: geosparql

These warnings don't prevent SHACL generation from succeeding.

Example Output

================================================================================
SHACL GENERATION (with modular schema workaround)
================================================================================
Schema: schemas/20251121/linkml/01_custodian_name_modular.yaml
Output format: turtle
Output file: schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl
================================================================================

Step 1: Loading schema via SchemaView...
  Loaded schema: heritage-custodian-observation-reconstruction
  Classes (via imports): 93
  Slots (via imports): 857
  Enums (via imports): 51

Step 2: Merging imported definitions into main schema...
  Merged 93 classes
  Merged 857 slots
  Merged 51 enums
  Merged 1 types (excluding 19 builtins)

Step 3: Clearing from_schema references...
  Cleared 1002 from_schema references

Step 4: Simplifying imports...
  Original imports: 253
  New imports: ['linkml:types']

Step 5: Writing cleaned schema to temp file...
Step 6: Running ShaclGenerator...
  Generated 14924 lines of SHACL

Step 7: Writing output...

================================================================================
✅ SHACL GENERATION COMPLETE
================================================================================

Validating SHACL Output

After generation, validate the SHACL file:

# Load into rdflib and count shapes
python3 -c "
from rdflib import Graph
from rdflib.namespace import RDF, SH

g = Graph()
g.parse('schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl', format='turtle')
print(f'Triples: {len(g)}')
# NodeShapes are the distinct subjects typed as sh:NodeShape
shapes = set(g.subjects(RDF.type, SH.NodeShape))
print(f'NodeShapes: {len(shapes)}')
"

Using SHACL for Validation

Use scripts/validate_with_shacl.py to validate RDF data:

# Validate Turtle file
python scripts/validate_with_shacl.py data.ttl --shapes schemas/20251121/shacl/custodian_shacl_*.ttl

# Validate JSON-LD file
python scripts/validate_with_shacl.py data.jsonld --format jsonld

Why Not Just Use gen-shacl Directly?

DON'T DO THIS (it will fail):

# ❌ FAILS with KeyError on modular schemas
gen-shacl schemas/20251121/linkml/01_custodian_name_modular.yaml

DO THIS INSTEAD:

# ✅ Works via workaround script
python scripts/generate_shacl.py

Rule 12: Inverse Slot RDFS Compliance

The Problem

In RDFS/OWL, inverse properties have strict domain/range requirements:

  • If property A has range: ClassX and inverse: B
  • Then property B MUST have domain: ClassX

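Violating this makes reasoners infer unintended class memberships (an outright inconsistency if the classes involved are disjoint) and triggers the inverse-slot warnings that LinkML emits during generation.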

The Solution: Always Declare Domain for Inverse Slots

Every slot with an inverse: declaration MUST have an explicit domain:

# ✅ CORRECT - Both slots have domain/range aligned with inverse
slots:
  collections_under_responsibility:
    domain: CustodianLegalStatus      # ← Domain explicitly declared
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity

  responsible_legal_entity:
    domain: LegalResponsibilityCollection  # ← Must match range of inverse
    range: CustodianLegalStatus            # ← Must match domain of inverse
    inverse: collections_under_responsibility

# ❌ WRONG - Missing domain violates RDFS
slots:
  collections_under_responsibility:
    # domain: ???                    # ← Missing! RDFS non-compliant
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity

Polymorphic Inverse Slots

For slots used by multiple classes (polymorphic), create an abstract base class:

# ✅ CORRECT - Abstract base class for RDFS compliance
classes:
  ReconstructedEntity:
    abstract: true
    class_uri: prov:Entity
    description: "Abstract base for all entities generated by ReconstructionActivity"

  CustodianLegalStatus:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...

  CustodianName:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...

slots:
  generates:
    domain: ReconstructionActivity
    range: ReconstructedEntity  # ← Abstract base class
    inverse: was_generated_by

  was_generated_by:
    domain: ReconstructedEntity  # ← Abstract base class
    range: ReconstructionActivity
    inverse: generates
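
Whether every class that uses a polymorphic slot really descends from the abstract base can be checked by walking `is_a` links. The sketch below assumes class definitions are available as a plain dict (for example, loaded from the YAML modules); the function names are hypothetical and mixins are deliberately ignored.

```python
def ancestors(class_name: str, classes: dict):
    """Yield class_name and each ancestor reachable through is_a."""
    seen = set()
    current = class_name
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        yield current
        current = (classes.get(current) or {}).get("is_a")


def classes_missing_base(using_classes, base: str, classes: dict):
    """Return the classes that do NOT inherit from `base`."""
    return [c for c in using_classes if base not in ancestors(c, classes)]
```

An empty result means the range or domain on the abstract base class covers every class that uses the slot.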

Validation Checklist

Before committing schema changes with inverse slots:

  1. Every inverse slot pair has explicit domain/range
  2. Domain of slot A = Range of slot B (its inverse)
  3. Range of slot A = Domain of slot B (its inverse)
  4. For polymorphic slots: abstract base class exists and all using classes inherit from it
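
Items 1 to 3 of the checklist are mechanical and can be automated. A minimal sketch, assuming the slot definitions have been collected into one dict keyed by slot name; `check_inverse_pairs` is a hypothetical helper, not an existing project script.

```python
def check_inverse_pairs(slots: dict) -> list:
    """Return human-readable problems for slots that declare an inverse."""
    problems = []
    for name, slot in slots.items():
        inverse_name = slot.get("inverse")
        if inverse_name is None:
            continue  # only slots with an inverse are checked
        inverse = slots.get(inverse_name)
        if inverse is None:
            problems.append(f"{name}: inverse '{inverse_name}' is not defined")
            continue
        if slot.get("domain") is None or slot.get("range") is None:
            problems.append(f"{name}: domain and range must both be explicit")
            continue
        if inverse.get("inverse") != name:
            problems.append(f"{name}: '{inverse_name}' does not point back to it")
        if slot["domain"] != inverse.get("range"):
            problems.append(f"{name}: domain differs from range of '{inverse_name}'")
        if slot["range"] != inverse.get("domain"):
            problems.append(f"{name}: range differs from domain of '{inverse_name}'")
    return problems
```

An empty list means every declared inverse pair is aligned; item 4 (the abstract base class for polymorphic slots) still needs a separate inheritance check.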

Quick Reference: Fixed Inverse Pairs

| Slot A | Domain A | Range A | Slot B | Domain B | Range B |
| --- | --- | --- | --- | --- | --- |
| collections_under_responsibility | CustodianLegalStatus | LegalResponsibilityCollection | responsible_legal_entity | LegalResponsibilityCollection | CustodianLegalStatus |
| staff_members | OrganizationalStructure | PersonObservation | unit_affiliation | PersonObservation | OrganizationalStructure |
| portal_data_sources | WebPortal | CollectionManagementSystem | feeds_portal | CollectionManagementSystem | WebPortal |
| exposed_via_portal | CustodianCollection | WebPortal | exposes_collections | WebPortal | CustodianCollection |
| has_observation | Custodian | CustodianObservation | refers_to_custodian | CustodianObservation | Custodian |
| identified_by | Custodian | CustodianIdentifier | identifies_custodian | CustodianIdentifier | Custodian |
| encompasses | EncompassingBody | Custodian | encompassing_body | Custodian | EncompassingBody |
| generates | ReconstructionActivity | ReconstructedEntity | was_generated_by | ReconstructedEntity | ReconstructionActivity |
| used | ReconstructionActivity | CustodianObservation | used_by | CustodianObservation | ReconstructionActivity |
| affects_organization | OrganizationalChangeEvent | Custodian | organizational_change_events | Custodian | OrganizationalChangeEvent |
| platform_of | DigitalPlatform | Custodian | digital_platform | Custodian | DigitalPlatform |
| identifies | CustodianIdentifier | Custodian | identifiers | Custodian | CustodianIdentifier |
| allocates | AllocationAgency | CustodianIdentifier | allocated_by | CustodianIdentifier | AllocationAgency |
| is_legal_status_of | CustodianLegalStatus | Custodian | legal_status | Custodian | CustodianLegalStatus |

Abstract Base Class: ReconstructedEntity

Created to ensure RDFS compliance for the generates/was_generated_by inverse pair:

File: schemas/20251121/linkml/modules/classes/ReconstructedEntity.yaml

Subclasses (20 classes inherit from ReconstructedEntity):

  • ArticlesOfAssociation
  • AuxiliaryDigitalPlatform
  • AuxiliaryPlace
  • Budget
  • CollectionManagementSystem
  • CustodianAdministration
  • CustodianArchive
  • CustodianCollection (and its subclass LegalResponsibilityCollection)
  • CustodianLegalStatus
  • CustodianName
  • CustodianPlace
  • DigitalPlatform
  • FeaturePlace
  • FinancialStatement
  • GiftShop
  • InternetOfThings
  • OrganizationBranch
  • SocialMediaProfile
  • WebPortal

Status: ACTIVE RULES
Version: 1.2
Last Updated: 2025-12-01
Applies To: All LinkML schema work in this project

See Also:

  • .opencode/HYPER_MODULAR_STRUCTURE.md - Module organization
  • .opencode/SLOT_NAMING_CONVENTIONS.md - Slot naming patterns
  • AGENTS.md - AI agent instructions