Schema Generation Rules for AI Agents
Date: 2025-11-22
Purpose: Standard rules for generating derived artifacts from LinkML schemas
Rule 1: Always Use Full Timestamps in Generated File Names
MANDATORY: When generating derived artifacts (RDF, UML, etc.) from LinkML schemas, ALWAYS include a full timestamp (date AND time) in the filename.
Format
{base_name}_{YYYYMMDD}_{HHMMSS}.{extension}
Examples
# ✅ CORRECT - Full timestamp (date + time)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-yuml schemas/linkml/schema.yaml > schemas/uml/mermaid/schema_${TIMESTAMP}.mmd
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/schema_${TIMESTAMP}.owl.ttl
# Examples of correct filenames:
custodian_multi_aspect_20251122_154136.mmd
custodian_multi_aspect_20251122_154430.owl.ttl
custodian_multi_aspect_20251122_154430.nt
custodian_multi_aspect_20251122_154430.jsonld
custodian_multi_aspect_20251122_154430.rdf
# ❌ WRONG - No timestamp
schema.mmd
01_custodian_name.owl.ttl
# ❌ WRONG - Date only (MISSING TIME!)
schema_20251122.mmd
custodian_multi_aspect_20251122.owl.ttl
# ❌ WRONG - Time only (missing date)
schema_154430.mmd
Rationale
- Version tracking: Full timestamps identify each generated version precisely
- No overwrites: Multiple generations on the same day never conflict, so previous versions are not lost
- Multiple sessions per day: Teams may generate artifacts several times daily
- Debugging: Can identify the exact time when changes were made
- Rollback: Easy to revert to a specific version
- Audit trail: Documents schema evolution with chronological precision
- Git-friendly: Easy to diff between versions
- Reproducibility: Can correlate generated artifacts with git commits
Critical Note
The timestamp must include BOTH date and time (YYYYMMDD_HHMMSS), not just date. This allows multiple generation runs per day without filename conflicts.
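The naming rule can also be checked mechanically. The following is a minimal sketch in Python; the helper names `timestamped_name` and `has_full_timestamp` are hypothetical and not part of this project's scripts.

```python
import re
from datetime import datetime
from typing import Optional

# Matches {base_name}_{YYYYMMDD}_{HHMMSS}.{extension}; the extension may have
# several parts (for example "owl.ttl").
FULL_TIMESTAMP = re.compile(
    r"^(?P<base>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.(?P<ext>[A-Za-z0-9.]+)$"
)


def timestamped_name(base_name: str, extension: str,
                     now: Optional[datetime] = None) -> str:
    """Build a filename such as schema_20251122_154430.owl.ttl."""
    now = now or datetime.now()
    return f"{base_name}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


def has_full_timestamp(filename: str) -> bool:
    """Return True only if the name carries a valid date AND time."""
    match = FULL_TIMESTAMP.match(filename)
    if match is None:
        return False
    try:
        # Reject digit runs that are not a real calendar date or clock time.
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True
```

With this sketch, `has_full_timestamp("schema_20251122.mmd")` is rejected because the time part is missing, while the correct filenames listed above are accepted.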
Rule 2: LinkML is the Single Source of Truth
NEVER manually create or edit derived files. Always generate from LinkML.
Correct Workflow ✅
1. Edit LinkML schema (.yaml)
2. Generate RDF formats (gen-owl + rdfpipe)
3. Generate UML diagrams (gen-yuml)
4. Generate TypeDB schema (manual translation, but documented)
5. Validate examples (linkml-validate)
Incorrect Workflow ❌
❌ Editing .ttl files directly
❌ Creating .jsonld manually
❌ Drawing UML diagrams by hand
❌ Modifying TypeDB schema without updating LinkML
Rule 3: Generate All RDF Serialization Formats
When generating RDF from LinkML, produce all standard serialization formats:
Required Formats
- OWL/Turtle (.owl.ttl) - Primary, human-readable
- N-Triples (.nt) - Simple, line-based
- JSON-LD (.jsonld) - Web-friendly, JSON-based
- RDF/XML (.rdf) - XML-based, traditional
Generation Commands
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="schema_${TIMESTAMP}"
# 1. Generate OWL/Turtle (primary)
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/${BASE_NAME}.owl.ttl
# 2. Convert to other formats using rdfpipe
rdfpipe --input-format turtle --output-format nt schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.nt
rdfpipe --input-format turtle --output-format json-ld schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.jsonld
rdfpipe --input-format turtle --output-format xml schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.rdf
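When the conversions are driven from Python rather than the shell, the three rdfpipe calls can be derived from the primary file. This is a sketch only: `conversion_commands` and `run_conversions` are hypothetical names, and it assumes `rdfpipe` is on the PATH and accepts the same flags used above.

```python
import subprocess
from typing import List, Tuple

# rdfpipe output format name -> file extension, as listed under Required Formats.
DERIVED_FORMATS = {"nt": ".nt", "json-ld": ".jsonld", "xml": ".rdf"}
PRIMARY_SUFFIX = ".owl.ttl"


def conversion_commands(turtle_file: str) -> List[Tuple[List[str], str]]:
    """Return (argv, output_path) pairs for converting one .owl.ttl file."""
    if not turtle_file.endswith(PRIMARY_SUFFIX):
        raise ValueError(f"expected a {PRIMARY_SUFFIX} file, got {turtle_file}")
    stem = turtle_file[: -len(PRIMARY_SUFFIX)]
    commands = []
    for fmt, extension in DERIVED_FORMATS.items():
        argv = ["rdfpipe", "--input-format", "turtle",
                "--output-format", fmt, turtle_file]
        commands.append((argv, stem + extension))
    return commands


def run_conversions(turtle_file: str) -> None:
    """Run each conversion, redirecting stdout to the derived file."""
    for argv, output_path in conversion_commands(turtle_file):
        with open(output_path, "w", encoding="utf-8") as out:
            subprocess.run(argv, stdout=out, check=True)
```

Because every derived name comes from the same stem, all four files share one timestamp, which matches the BASE_NAME convention above.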
Rule 4: Validate Before Committing
Before committing schema changes, ALWAYS:
- Validate LinkML schema:
  gen-owl -f ttl schemas/linkml/schema.yaml > /tmp/test_validation.ttl # Check for errors in output
- Validate example instances:
  linkml-validate -s schemas/linkml/schema.yaml schemas/examples/instance.yaml
- Check RDF triples count:
  wc -l schemas/rdf/*.nt # N-Triples are easy to count
- Verify class presence:
  grep -c "ClassName" schemas/rdf/*.owl.ttl
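The last two checks can be scripted so that several classes are verified at once. Below is a minimal sketch using only the standard library; `count_triples` and `missing_classes` are hypothetical helpers. Unlike `wc -l`, the count skips blank and comment lines, and unlike `grep -c`, the class check reports presence rather than a number of matching lines.

```python
from typing import List


def count_triples(nt_path: str) -> int:
    """Count statements in an N-Triples file (one per non-blank, non-comment line)."""
    with open(nt_path, encoding="utf-8") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.lstrip().startswith("#")
        )


def missing_classes(ttl_path: str, class_names: List[str]) -> List[str]:
    """Return the class names that never appear in the Turtle file."""
    with open(ttl_path, encoding="utf-8") as handle:
        text = handle.read()
    # Plain substring search, the same level of rigor as the grep check.
    return [name for name in class_names if name not in text]
```

An empty list from `missing_classes` means every expected class name appears at least once in the generated Turtle.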
Rule 5: Document Schema Changes
Every schema change requires:
- Quick status document: QUICK_STATUS_{TOPIC}_{YYYYMMDD}.md
- Session summary: SESSION_SUMMARY_{YYYYMMDD}_{TOPIC}.md
- Updated examples: Add/update instance files demonstrating changes
- Commit message: Reference quick status document
Template: Quick Status Document
# Quick Status: {Topic}
Date: YYYY-MM-DD
Status: ✅ COMPLETE / ⏳ IN PROGRESS
Priority: HIGH / MEDIUM / LOW
## What We Did
...
## Key Changes
...
## Files Modified
...
## Validation Results
...
## Next Steps
...
Rule 6: Example Instances Are Required
For every new class or major schema change:
- Create at least ONE complete example instance
- Place in schemas/{version}/examples/
- Use descriptive filenames: {class_name}_{use_case}_{timestamp}.yaml
- Include all required slots and at least 2-3 optional slots
- Add inline comments explaining non-obvious fields
Example Instance Template
---
# Complete Example: {ClassName}
# Date: YYYY-MM-DD
# Use Case: {Description}
# Status: Valid instance conforming to schema version {X.Y.Z}
instances:
  - id: https://example.org/id
    required_field_1: "value"
    required_field_2: "value"
    optional_field: "value" # Explanation of when to use this field
    # ... more fields
Rule 7: UML Diagram Conventions
When generating UML diagrams:
File Naming
{schema_name}_{diagram_type}_{YYYYMMDD}_{HHMMSS}.mmd
Examples:
custodian_class_diagram_20251122_154136.mmd
prov_flow_sequence_20251122_154200.mmd
Diagram Types
- class_diagram - Class hierarchies and relationships
- sequence - PROV-O temporal flows
- state - State transitions (e.g., organizational change events)
- er - Entity-relationship (database perspective)
Storage Location
schemas/{version}/uml/mermaid/{timestamp_files}.mmd
Rule 8: TypeDB Schema Updates
TypeDB schemas are manually translated from LinkML (not auto-generated).
Required Steps
- Update LinkML schema first
- Regenerate RDF to verify OWL alignment
- Manually update TypeDB schema (.tql)
- Document translation decisions
- Test TypeDB queries
Translation Documentation
Create TYPEDB_TRANSLATION_NOTES.md documenting:
- LinkML class → TypeDB entity/relation mapping
- Slot → attribute mapping
- Constraints and rules
- Query examples
Rule 9: Version Control for Generated Files
What to Commit
✅ DO commit:
- LinkML schema files (.yaml)
- Example instances (.yaml)
- Documentation (.md)
- Latest timestamped RDF (keep last 3 versions)
- Latest timestamped UML (keep last 3 versions)
❌ DO NOT commit:
- Temporary validation files (/tmp/*)
- Old versions (>3 generations old)
- Duplicate non-timestamped files
Cleanup Script
# Keep only last 3 timestamped versions of each schema
cd schemas/rdf
ls -t schema_*.owl.ttl | tail -n +4 | xargs rm -f
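The same cleanup can be done without parsing `ls` output. The sketch below is an alternative, not the project's script: `stale_versions` is a hypothetical helper, and it orders files by the timestamp embedded in the filename rather than by modification time as `ls -t` does.

```python
import re
from pathlib import Path
from typing import List

# Captures the YYYYMMDD_HHMMSS part of a generated filename.
STAMP = re.compile(r"_(\d{8}_\d{6})\.")


def stale_versions(directory: str, prefix: str = "schema_",
                   suffix: str = ".owl.ttl", keep: int = 3) -> List[Path]:
    """Return timestamped files older than the newest `keep` versions."""
    stamped = []
    for path in Path(directory).iterdir():
        name = path.name
        match = STAMP.search(name)
        if match and name.startswith(prefix) and name.endswith(suffix):
            stamped.append((match.group(1), path))
    # Fixed-width digit timestamps sort chronologically as plain strings.
    stamped.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in stamped[keep:]]
```

Calling `path.unlink()` on each returned path performs the deletion. Sorting by the embedded timestamp keeps the result stable after a fresh checkout, where modification times no longer reflect generation order.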
Rule 10: Generation Workflow Template
Standard workflow for schema changes:
#!/bin/bash
# Schema Generation Workflow
# Usage: ./generate_schema_artifacts.sh
set -e # Exit on error
SCHEMA_FILE="schemas/20251121/linkml/01_custodian_name_modular.yaml"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="custodian_${TIMESTAMP}"
echo "=== Schema Generation Workflow ==="
echo "Timestamp: $TIMESTAMP"
echo ""
# Step 1: Validate LinkML
echo "Step 1: Validating LinkML schema..."
gen-owl -f ttl "$SCHEMA_FILE" > /tmp/validation_test.ttl 2>&1
echo "✅ Schema valid"
# Step 2: Generate RDF formats
echo "Step 2: Generating RDF formats..."
gen-owl -f ttl "$SCHEMA_FILE" > "schemas/20251121/rdf/${BASE_NAME}.owl.ttl"
rdfpipe --input-format turtle --output-format nt "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.nt"
rdfpipe --input-format turtle --output-format json-ld "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.jsonld"
rdfpipe --input-format turtle --output-format xml "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.rdf"
echo "✅ RDF formats generated"
# Step 3: Generate UML
echo "Step 3: Generating UML diagrams..."
gen-yuml "$SCHEMA_FILE" > "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd"
echo "✅ UML diagram generated"
# Step 4: Validate examples
echo "Step 4: Validating example instances..."
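failed=0 # Track failures so the summary line reflects the real outcome
for example in schemas/20251121/examples/*.yaml; do
  linkml-validate -s "$SCHEMA_FILE" "$example" || { echo "⚠️ Warning: $example failed validation"; failed=1; }
done
[ "$failed" -eq 0 ] && echo "✅ Examples validated" || echo "⚠️ Some examples failed validation"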
# Step 5: Report
echo ""
echo "=== Generation Complete ==="
ls -lh "schemas/20251121/rdf/${BASE_NAME}".* | awk '{print $9, "("$5")"}'
ls -lh "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd" | awk '{print $9, "("$5")"}'
echo ""
echo "Next: Update documentation and commit"
Quick Reference Commands
Generate All Artifacts
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-owl -f ttl schema.yaml > schema_${TIMESTAMP}.owl.ttl
gen-yuml schema.yaml > schema_${TIMESTAMP}.mmd
Validate
gen-owl -f ttl schema.yaml > /tmp/test.ttl # Check for errors
linkml-validate -s schema.yaml instance.yaml
Convert RDF Formats
rdfpipe -i turtle -o nt file.ttl > file.nt
rdfpipe -i turtle -o json-ld file.ttl > file.jsonld
rdfpipe -i turtle -o xml file.ttl > file.rdf
Check RDF Content
grep -c "ClassName" file.owl.ttl # Count class references
wc -l file.nt # Count triples
Rule 11: SHACL Generation for Modular Schemas
The Problem
gen-shacl (LinkML's built-in SHACL generator) fails on modular schemas with errors like:
KeyError: 'contributing_agency'
This is a LinkML bug where:
- schema_map is keyed by import paths (e.g., modules/classes/ContributingAgency)
- But lookups use schema names (e.g., contributing_agency)
- This mismatch causes KeyError during from_schema resolution
Other generators (gen-owl, gen-yaml, linkml-lint) work fine because they don't perform this specific lookup.
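The mismatch can be reproduced in miniature. The snippet below is a toy model only: it does not use LinkML's real data structures, and `resolve_by_key` and `resolve_by_name` are hypothetical functions written for illustration.

```python
# Toy registry keyed by import path, as described above for schema_map.
schema_map = {
    "modules/classes/ContributingAgency": {"name": "contributing_agency"},
}


def resolve_by_key(registry: dict, from_schema: str) -> dict:
    """Direct lookup; raises KeyError when given a schema name, not a path."""
    return registry[from_schema]


def resolve_by_name(registry: dict, from_schema: str) -> dict:
    """Lookup that compares schema names instead of registry keys."""
    for schema in registry.values():
        if schema["name"] == from_schema:
            return schema
    raise KeyError(from_schema)
```

Here `resolve_by_key(schema_map, "contributing_agency")` raises `KeyError: 'contributing_agency'`, the same shape of failure as the one reported above, while `resolve_by_name` finds the entry. The name-based lookup is shown only to make the mismatch visible; it is not how the project works around the bug.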
The Solution: Use scripts/generate_shacl.py
We provide a workaround script that:
- Loads schema via SchemaView (correctly resolves all imports)
- Merges all classes/slots/enums from imports into main schema
- Clears from_schema references that cause the bug
- Excludes built-in types (avoid conflicts with linkml:types)
- Writes to temp file and runs ShaclGenerator
Usage
# Generate SHACL with default settings (timestamped output)
python scripts/generate_shacl.py
# Generate SHACL with verbose output (shows all steps)
python scripts/generate_shacl.py --verbose
# Generate SHACL to specific file
python scripts/generate_shacl.py --output schemas/20251121/shacl/custom_shapes.ttl
# Use custom schema file
python scripts/generate_shacl.py --schema path/to/schema.yaml
# Write to stdout (for piping)
python scripts/generate_shacl.py --stdout
Output Location
By default, generates timestamped files:
schemas/20251121/shacl/custodian_shacl_{YYYYMMDD}_{HHMMSS}.ttl
What You'll See
The script produces warnings that are expected and safe to ignore:
# Inverse slot warnings (schema design issue, doesn't affect SHACL)
Range of slot 'collections_under_responsibility' (LegalResponsibilityCollection)
does not line with the domain of its inverse (responsible_legal_entity)
# Unrecognized prefix warnings (prefixes defined in modules, not merged)
File "linkml_shacl_xxx.yaml", line 147, col 10: Unrecognized prefix: geosparql
These warnings don't prevent SHACL generation from succeeding.
Example Output
================================================================================
SHACL GENERATION (with modular schema workaround)
================================================================================
Schema: schemas/20251121/linkml/01_custodian_name_modular.yaml
Output format: turtle
Output file: schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl
================================================================================
Step 1: Loading schema via SchemaView...
Loaded schema: heritage-custodian-observation-reconstruction
Classes (via imports): 93
Slots (via imports): 857
Enums (via imports): 51
Step 2: Merging imported definitions into main schema...
Merged 93 classes
Merged 857 slots
Merged 51 enums
Merged 1 types (excluding 19 builtins)
Step 3: Clearing from_schema references...
Cleared 1002 from_schema references
Step 4: Simplifying imports...
Original imports: 253
New imports: ['linkml:types']
Step 5: Writing cleaned schema to temp file...
Step 6: Running ShaclGenerator...
Generated 14924 lines of SHACL
Step 7: Writing output...
================================================================================
✅ SHACL GENERATION COMPLETE
================================================================================
Validating SHACL Output
After generation, validate the SHACL file:
# Load into rdflib and count shapes
python3 -c "
from rdflib import Graph
from rdflib.namespace import SH
g = Graph()
g.parse('schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl', format='turtle')
print(f'Triples: {len(g)}')
shapes = list(g.subjects(predicate=None, object=SH.NodeShape))
print(f'NodeShapes: {len(shapes)}')
"
Using SHACL for Validation
Use scripts/validate_with_shacl.py to validate RDF data:
# Validate Turtle file
python scripts/validate_with_shacl.py data.ttl --shapes schemas/20251121/shacl/custodian_shacl_*.ttl
# Validate JSON-LD file
python scripts/validate_with_shacl.py data.jsonld --format jsonld
Why Not Just Use gen-shacl Directly?
DON'T DO THIS (it will fail):
# ❌ FAILS with KeyError on modular schemas
gen-shacl schemas/20251121/linkml/01_custodian_name_modular.yaml
DO THIS INSTEAD:
# ✅ Works via workaround script
python scripts/generate_shacl.py
Rule 12: Inverse Slot RDFS Compliance
The Problem
In RDFS/OWL, inverse properties have strict domain/range requirements:
- If property A has range: ClassX and inverse: B
- Then property B MUST have domain: ClassX
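Violating this makes reasoners infer unintended types for the inverse property's subjects (or an inconsistency when the classes involved are disjoint) and triggers LinkML's inverse-slot warning during generation.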
The Solution: Always Declare Domain for Inverse Slots
Every slot with an inverse: declaration MUST have an explicit domain:
# ✅ CORRECT - Both slots have domain/range aligned with inverse
slots:
  collections_under_responsibility:
    domain: CustodianLegalStatus # ← Domain explicitly declared
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity
  responsible_legal_entity:
    domain: LegalResponsibilityCollection # ← Must match range of inverse
    range: CustodianLegalStatus # ← Must match domain of inverse
    inverse: collections_under_responsibility
# ❌ WRONG - Missing domain violates RDFS
slots:
  collections_under_responsibility:
    # domain: ??? # ← Missing! RDFS non-compliant
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity
Polymorphic Inverse Slots
For slots used by multiple classes (polymorphic), create an abstract base class:
# ✅ CORRECT - Abstract base class for RDFS compliance
classes:
  ReconstructedEntity:
    abstract: true
    class_uri: prov:Entity
    description: "Abstract base for all entities generated by ReconstructionActivity"
  CustodianLegalStatus:
    is_a: ReconstructedEntity # ← Inherits from abstract base
    # ...
  CustodianName:
    is_a: ReconstructedEntity # ← Inherits from abstract base
    # ...
slots:
  generates:
    domain: ReconstructionActivity
    range: ReconstructedEntity # ← Abstract base class
    inverse: was_generated_by
  was_generated_by:
    domain: ReconstructedEntity # ← Abstract base class
    range: ReconstructionActivity
    inverse: generates
Validation Checklist
Before committing schema changes with inverse slots:
- Every inverse slot pair has explicit domain/range
- Domain of slot A = Range of slot B (its inverse)
- Range of slot A = Domain of slot B (its inverse)
- For polymorphic slots: abstract base class exists and all using classes inherit from it
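The first three checklist items lend themselves to automation. The function below is a sketch under stated assumptions: slot definitions are passed as a plain dict (for instance the `slots:` mapping after loading the YAML), both slots of a pair declare `inverse:`, and `inverse_slot_problems` is a hypothetical name rather than an existing project script.

```python
from typing import Dict, List


def inverse_slot_problems(slots: Dict[str, dict]) -> List[str]:
    """Report inverse slots with missing or misaligned domain/range."""
    problems = []
    for name, slot in slots.items():
        inverse_name = slot.get("inverse")
        if not inverse_name:
            continue  # Only slots that declare an inverse are checked.
        inverse = slots.get(inverse_name)
        if inverse is None:
            problems.append(f"{name}: inverse '{inverse_name}' is not defined")
            continue
        for key in ("domain", "range"):
            if not slot.get(key):
                problems.append(f"{name}: missing explicit {key}")
        # Domain of this slot must equal the range of its inverse. The
        # opposite direction is covered when the inverse slot is visited.
        domain, inverse_range = slot.get("domain"), inverse.get("range")
        if domain and inverse_range and domain != inverse_range:
            problems.append(
                f"{name}: domain {domain} does not match "
                f"range {inverse_range} of {inverse_name}"
            )
    return problems
```

An empty list means every declared pair is aligned. The polymorphic case (abstract base class and inheritance) is not covered by this sketch and still needs a manual check.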
Quick Reference: Fixed Inverse Pairs
| Slot A | Domain A | Range A | ↔ | Slot B | Domain B | Range B |
|---|---|---|---|---|---|---|
| collections_under_responsibility | CustodianLegalStatus | LegalResponsibilityCollection | ↔ | responsible_legal_entity | LegalResponsibilityCollection | CustodianLegalStatus |
| staff_members | OrganizationalStructure | PersonObservation | ↔ | unit_affiliation | PersonObservation | OrganizationalStructure |
| portal_data_sources | WebPortal | CollectionManagementSystem | ↔ | feeds_portal | CollectionManagementSystem | WebPortal |
| exposed_via_portal | CustodianCollection | WebPortal | ↔ | exposes_collections | WebPortal | CustodianCollection |
| has_observation | Custodian | CustodianObservation | ↔ | refers_to_custodian | CustodianObservation | Custodian |
| identified_by | Custodian | CustodianIdentifier | ↔ | identifies_custodian | CustodianIdentifier | Custodian |
| encompasses | EncompassingBody | Custodian | ↔ | encompassing_body | Custodian | EncompassingBody |
| generates | ReconstructionActivity | ReconstructedEntity | ↔ | was_generated_by | ReconstructedEntity | ReconstructionActivity |
| used | ReconstructionActivity | CustodianObservation | ↔ | used_by | CustodianObservation | ReconstructionActivity |
| affects_organization | OrganizationalChangeEvent | Custodian | ↔ | organizational_change_events | Custodian | OrganizationalChangeEvent |
| platform_of | DigitalPlatform | Custodian | ↔ | digital_platform | Custodian | DigitalPlatform |
| identifies | CustodianIdentifier | Custodian | ↔ | identifiers | Custodian | CustodianIdentifier |
| allocates | AllocationAgency | CustodianIdentifier | ↔ | allocated_by | CustodianIdentifier | AllocationAgency |
| is_legal_status_of | CustodianLegalStatus | Custodian | ↔ | legal_status | Custodian | CustodianLegalStatus |
Abstract Base Class: ReconstructedEntity
Created to ensure RDFS compliance for generates/was_generated_by inverse pair:
File: schemas/20251121/linkml/modules/classes/ReconstructedEntity.yaml
Subclasses (20 classes inherit from ReconstructedEntity):
- ArticlesOfAssociation
- AuxiliaryDigitalPlatform
- AuxiliaryPlace
- Budget
- CollectionManagementSystem
- CustodianAdministration
- CustodianArchive
- CustodianCollection (and its subclass LegalResponsibilityCollection)
- CustodianLegalStatus
- CustodianName
- CustodianPlace
- DigitalPlatform
- FeaturePlace
- FinancialStatement
- GiftShop
- InternetOfThings
- OrganizationBranch
- SocialMediaProfile
- WebPortal
Status: ✅ ACTIVE RULES
Version: 1.2
Last Updated: 2025-12-01
Applies To: All LinkML schema work in this project
See Also:
- .opencode/HYPER_MODULAR_STRUCTURE.md - Module organization
- .opencode/SLOT_NAMING_CONVENTIONS.md - Slot naming patterns
- AGENTS.md - AI agent instructions