# Schema Generation Rules for AI Agents

**Date**: 2025-11-22
**Purpose**: Standard rules for generating derived artifacts from LinkML schemas

---

## Rule 1: Always Use Full Timestamps in Generated File Names

**MANDATORY**: When generating derived artifacts (RDF, UML, etc.) from LinkML schemas, **ALWAYS** include a full timestamp (date AND time) in the filename.

### Format

```
{base_name}_{YYYYMMDD}_{HHMMSS}.{extension}
```

### Examples

```bash
# ✅ CORRECT - Full timestamp (date + time)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-yuml schemas/linkml/schema.yaml > schemas/uml/mermaid/schema_${TIMESTAMP}.mmd
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/schema_${TIMESTAMP}.owl.ttl

# Examples of correct filenames:
custodian_multi_aspect_20251122_154136.mmd
custodian_multi_aspect_20251122_154430.owl.ttl
custodian_multi_aspect_20251122_154430.nt
custodian_multi_aspect_20251122_154430.jsonld
custodian_multi_aspect_20251122_154430.rdf

# ❌ WRONG - No timestamp
schema.mmd
01_custodian_name.owl.ttl

# ❌ WRONG - Date only (MISSING TIME!)
schema_20251122.mmd
custodian_multi_aspect_20251122.owl.ttl

# ❌ WRONG - Time only (missing date)
schema_154430.mmd
```

### Rationale

1. **Version tracking**: Full timestamps identify each generated version precisely
2. **No overwrites**: Teams may generate artifacts several times a day; runs on the same day never conflict, and previous versions are never lost
3. **Debugging**: Can identify the exact time when changes were made
4. **Rollback**: Easy to revert to a specific version
5. **Audit trail**: Documents schema evolution with chronological precision
6. **Git-friendly**: Easy to diff between versions
7. **Reproducibility**: Can correlate generated artifacts with git commits

### Critical Note

The timestamp must include BOTH date and time (YYYYMMDD_HHMMSS), not just the date. This allows multiple generation runs per day without filename conflicts.

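
The naming rule above can also be checked programmatically. The following is a minimal sketch, not part of the project's tooling: the helper names `timestamped_name` and `has_full_timestamp` are hypothetical, and the regular expression assumes that extensions contain only letters, digits, and dots (as in `owl.ttl`).

```python
import re
from datetime import datetime
from typing import Optional

# Matches {base_name}_{YYYYMMDD}_{HHMMSS}.{extension}.
# The extension may be compound, e.g. "owl.ttl" (assumption: letters, digits, dots only).
TIMESTAMPED_NAME = re.compile(
    r"^(?P<base>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.(?P<ext>[A-Za-z0-9.]+)$"
)


def timestamped_name(base_name: str, extension: str,
                     now: Optional[datetime] = None) -> str:
    """Build a filename that follows the Rule 1 format."""
    now = now or datetime.now()
    return f"{base_name}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


def has_full_timestamp(filename: str) -> bool:
    """Return True only if the name carries a valid date AND time."""
    match = TIMESTAMPED_NAME.match(filename)
    if match is None:
        return False
    try:
        # Reject digit runs that are not a real calendar date/time.
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for name in ("custodian_multi_aspect_20251122_154430.owl.ttl",
                 "schema_20251122.mmd",
                 "schema_154430.mmd",
                 "schema.mmd"):
        print(name, has_full_timestamp(name))
```

A check like this could run before committing, to flag date-only or untimestamped artifacts early.
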

---

## Rule 2: LinkML is the Single Source of Truth

**NEVER** manually create or edit derived files. Always generate from LinkML. The TypeDB schema is the one documented exception: it is translated by hand, but only after the LinkML schema has been updated (see Rule 8).

### Correct Workflow ✅

```
1. Edit LinkML schema (.yaml)
2. Generate RDF formats (gen-owl + rdfpipe)
3. Generate UML diagrams (gen-yuml)
4. Generate TypeDB schema (manual translation, but documented)
5. Validate examples (linkml-validate)
```

### Incorrect Workflow ❌

```
❌ Editing .ttl files directly
❌ Creating .jsonld manually
❌ Drawing UML diagrams by hand
❌ Modifying TypeDB schema without updating LinkML
```

---

## Rule 3: Generate All RDF Serialization Formats

When generating RDF from LinkML, produce all standard serialization formats:

### Required Formats

1. **OWL/Turtle** (.owl.ttl) - Primary, human-readable
2. **N-Triples** (.nt) - Simple, line-based
3. **JSON-LD** (.jsonld) - Web-friendly, JSON-based
4. **RDF/XML** (.rdf) - XML-based, traditional

### Generation Commands

```bash
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="schema_${TIMESTAMP}"

# 1. Generate OWL/Turtle (primary)
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/${BASE_NAME}.owl.ttl

# 2. Convert to other formats using rdfpipe
rdfpipe --input-format turtle --output-format nt schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.nt
rdfpipe --input-format turtle --output-format json-ld schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.jsonld
rdfpipe --input-format turtle --output-format xml schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.rdf
```

---

## Rule 4: Validate Before Committing

Before committing schema changes, **ALWAYS**:

1. **Validate LinkML schema**:
   ```bash
   gen-owl -f ttl schemas/linkml/schema.yaml > /tmp/test_validation.ttl
   # Check for errors in output
   ```

2. **Validate example instances**:
   ```bash
   linkml-validate -s schemas/linkml/schema.yaml schemas/examples/instance.yaml
   ```

3. **Check RDF triples count**:
   ```bash
   wc -l schemas/rdf/*.nt  # N-Triples are easy to count
   ```

4. **Verify class presence**:
   ```bash
   grep -c "ClassName" schemas/rdf/*.owl.ttl
   ```

---

## Rule 5: Document Schema Changes

Every schema change requires:

1. **Quick status document**: `QUICK_STATUS_{TOPIC}_{YYYYMMDD}.md`
2. **Session summary**: `SESSION_SUMMARY_{YYYYMMDD}_{TOPIC}.md`
3. **Updated examples**: Add/update instance files demonstrating changes
4. **Commit message**: Reference quick status document

### Template: Quick Status Document

```markdown
# Quick Status: {Topic}

Date: YYYY-MM-DD
Status: ✅ COMPLETE / ⏳ IN PROGRESS
Priority: HIGH / MEDIUM / LOW

## What We Did
...

## Key Changes
...

## Files Modified
...

## Validation Results
...

## Next Steps
...
```

---

## Rule 6: Example Instances Are Required

For every new class or major schema change:

1. Create at least ONE complete example instance
2. Place in `schemas/{version}/examples/`
3. Use descriptive filenames: `{class_name}_{use_case}_{timestamp}.yaml`
4. Include all required slots and at least 2-3 optional slots
5. Add inline comments explaining non-obvious fields

### Example Instance Template

```yaml
---
# Complete Example: {ClassName}
# Date: YYYY-MM-DD
# Use Case: {Description}
# Status: Valid instance conforming to schema version {X.Y.Z}

instances:
  - id: https://example.org/id
    required_field_1: "value"
    required_field_2: "value"
    optional_field: "value"  # Explanation of when to use this field
    # ... more fields
```

---

## Rule 7: UML Diagram Conventions

When generating UML diagrams:

### File Naming

```
{schema_name}_{diagram_type}_{YYYYMMDD}_{HHMMSS}.mmd
```

Examples:
- `custodian_class_diagram_20251122_154136.mmd`
- `prov_flow_sequence_20251122_154200.mmd`

### Diagram Types

- `class_diagram` - Class hierarchies and relationships
- `sequence` - PROV-O temporal flows
- `state` - State transitions (e.g., organizational change events)
- `er` - Entity-relationship (database perspective)

### Storage Location

```
schemas/{version}/uml/mermaid/{timestamp_files}.mmd
```

---

## Rule 8: TypeDB Schema Updates

TypeDB schemas are **manually translated** from LinkML (not auto-generated).

### Required Steps

1. Update LinkML schema first
2. Regenerate RDF to verify OWL alignment
3. Manually update TypeDB schema (.tql)
4. Document translation decisions
5. Test TypeDB queries

### Translation Documentation

Create `TYPEDB_TRANSLATION_NOTES.md` documenting:
- LinkML class → TypeDB entity/relation mapping
- Slot → attribute mapping
- Constraints and rules
- Query examples

---

## Rule 9: Version Control for Generated Files

### What to Commit

✅ **DO commit**:
- LinkML schema files (.yaml)
- Example instances (.yaml)
- Documentation (.md)
- Latest timestamped RDF (keep last 3 versions)
- Latest timestamped UML (keep last 3 versions)

❌ **DO NOT commit**:
- Temporary validation files (/tmp/*)
- Old versions (>3 generations old)
- Duplicate non-timestamped files

### Cleanup Script

```bash
# Keep only last 3 timestamped versions of each schema
cd schemas/rdf
ls -t schema_*.owl.ttl | tail -n +4 | xargs rm -f
```

---

## Rule 10: Generation Workflow Template

Standard workflow for schema changes:

```bash
#!/bin/bash
# Schema Generation Workflow
# Usage: ./generate_schema_artifacts.sh

set -e  # Exit on error

SCHEMA_FILE="schemas/20251121/linkml/01_custodian_name_modular.yaml"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="custodian_${TIMESTAMP}"

echo "=== Schema Generation Workflow ==="
echo "Timestamp: $TIMESTAMP"
echo ""

# Step 1: Validate LinkML (set -e aborts the script if gen-owl fails)
echo "Step 1: Validating LinkML schema..."
gen-owl -f ttl "$SCHEMA_FILE" > /tmp/validation_test.ttl 2>&1
echo "✅ Schema valid"

# Step 2: Generate RDF formats
echo "Step 2: Generating RDF formats..."
gen-owl -f ttl "$SCHEMA_FILE" > "schemas/20251121/rdf/${BASE_NAME}.owl.ttl"
rdfpipe --input-format turtle --output-format nt "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.nt"
rdfpipe --input-format turtle --output-format json-ld "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.jsonld"
rdfpipe --input-format turtle --output-format xml "schemas/20251121/rdf/${BASE_NAME}.owl.ttl" > "schemas/20251121/rdf/${BASE_NAME}.rdf"
echo "✅ RDF formats generated"

# Step 3: Generate UML
echo "Step 3: Generating UML diagrams..."
gen-yuml "$SCHEMA_FILE" > "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd"
echo "✅ UML diagram generated"

# Step 4: Validate examples (failures are counted, not fatal)
echo "Step 4: Validating example instances..."
FAILED=0
for example in schemas/20251121/examples/*.yaml; do
  if ! linkml-validate -s "$SCHEMA_FILE" "$example"; then
    echo "⚠️  Warning: $example failed validation"
    FAILED=$((FAILED + 1))
  fi
done
if [ "$FAILED" -eq 0 ]; then
  echo "✅ Examples validated"
else
  echo "⚠️  $FAILED example(s) failed validation"
fi

# Step 5: Report
echo ""
echo "=== Generation Complete ==="
ls -lh "schemas/20251121/rdf/${BASE_NAME}".* | awk '{print $9, "("$5")"}'
ls -lh "schemas/20251121/uml/mermaid/${BASE_NAME}.mmd" | awk '{print $9, "("$5")"}'
echo ""
echo "Next: Update documentation and commit"
```

---

## Quick Reference Commands

### Generate All Artifacts

```bash
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-owl -f ttl schema.yaml > schema_${TIMESTAMP}.owl.ttl
gen-yuml schema.yaml > schema_${TIMESTAMP}.mmd
```

### Validate

```bash
gen-owl -f ttl schema.yaml > /tmp/test.ttl  # Check for errors
linkml-validate -s schema.yaml instance.yaml
```

### Convert RDF Formats

```bash
rdfpipe -i turtle -o nt file.ttl > file.nt
rdfpipe -i turtle -o json-ld file.ttl > file.jsonld
rdfpipe -i turtle -o xml file.ttl > file.rdf
```

### Check RDF Content

```bash
grep -c "ClassName" file.owl.ttl  # Count class references
wc -l file.nt  # Count triples
```

---

## Rule 11: SHACL Generation for Modular Schemas

### The Problem

`gen-shacl` (LinkML's built-in SHACL generator) **fails on modular schemas** with errors like:

```
KeyError: 'contributing_agency'
```

This is a LinkML bug where:
1. `schema_map` is keyed by import paths (e.g., `modules/classes/ContributingAgency`)
2. But lookups use schema names (e.g., `contributing_agency`)
3. This mismatch causes `KeyError` during `from_schema` resolution

Other generators (`gen-owl`, `gen-yaml`, `linkml-lint`) work fine because they don't perform this specific lookup.

### The Solution: Use `scripts/generate_shacl.py`

We provide a workaround script that:
1. Loads schema via `SchemaView` (correctly resolves all imports)
2. Merges all classes/slots/enums from imports into main schema
3. Clears `from_schema` references that cause the bug
4. Excludes built-in types (avoid conflicts with `linkml:types`)
5. Writes to temp file and runs `ShaclGenerator`

### Usage

```bash
# Generate SHACL with default settings (timestamped output)
python scripts/generate_shacl.py

# Generate SHACL with verbose output (shows all steps)
python scripts/generate_shacl.py --verbose

# Generate SHACL to specific file
python scripts/generate_shacl.py --output schemas/20251121/shacl/custom_shapes.ttl

# Use custom schema file
python scripts/generate_shacl.py --schema path/to/schema.yaml

# Write to stdout (for piping)
python scripts/generate_shacl.py --stdout
```

### Output Location

By default, generates timestamped files:

```
schemas/20251121/shacl/custodian_shacl_{YYYYMMDD}_{HHMMSS}.ttl
```

### What You'll See

The script produces warnings that are **expected and safe to ignore**:

```
# Inverse slot warnings (schema design issue, doesn't affect SHACL)
Range of slot 'collections_under_responsibility' (LegalResponsibilityCollection) does not line with the domain of its inverse (responsible_legal_entity)

# Unrecognized prefix warnings (prefixes defined in modules, not merged)
File "linkml_shacl_xxx.yaml", line 147, col 10: Unrecognized prefix: geosparql
```

These warnings don't prevent SHACL generation from succeeding.

### Example Output

```
================================================================================
SHACL GENERATION (with modular schema workaround)
================================================================================
Schema: schemas/20251121/linkml/01_custodian_name_modular.yaml
Output format: turtle
Output file: schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl
================================================================================

Step 1: Loading schema via SchemaView...
  Loaded schema: heritage-custodian-observation-reconstruction
  Classes (via imports): 93
  Slots (via imports): 857
  Enums (via imports): 51

Step 2: Merging imported definitions into main schema...
  Merged 93 classes
  Merged 857 slots
  Merged 51 enums
  Merged 1 types (excluding 19 builtins)

Step 3: Clearing from_schema references...
  Cleared 1002 from_schema references

Step 4: Simplifying imports...
  Original imports: 253
  New imports: ['linkml:types']

Step 5: Writing cleaned schema to temp file...

Step 6: Running ShaclGenerator...
  Generated 14924 lines of SHACL

Step 7: Writing output...

================================================================================
✅ SHACL GENERATION COMPLETE
================================================================================
```

### Validating SHACL Output

After generation, validate the SHACL file:

```bash
# Load into rdflib and count shapes
python3 -c "
from rdflib import Graph
from rdflib.namespace import RDF, SH

g = Graph()
g.parse('schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl', format='turtle')
print(f'Triples: {len(g)}')

# Count only subjects explicitly typed as sh:NodeShape
shapes = list(g.subjects(predicate=RDF.type, object=SH.NodeShape))
print(f'NodeShapes: {len(shapes)}')
"
```

### Using SHACL for Validation

Use `scripts/validate_with_shacl.py` to validate RDF data:

```bash
# Validate Turtle file
python scripts/validate_with_shacl.py data.ttl --shapes schemas/20251121/shacl/custodian_shacl_*.ttl

# Validate JSON-LD file
python scripts/validate_with_shacl.py data.jsonld --format jsonld
```

### Why Not Just Use `gen-shacl` Directly?

**DON'T DO THIS** (it will fail):

```bash
# ❌ FAILS with KeyError on modular schemas
gen-shacl schemas/20251121/linkml/01_custodian_name_modular.yaml
```

**DO THIS INSTEAD**:

```bash
# ✅ Works via workaround script
python scripts/generate_shacl.py
```

---

## Rule 12: Inverse Slot RDFS Compliance

### The Problem

In RDFS/OWL, inverse properties have strict domain/range requirements:
- If property A has `range: ClassX` and `inverse: B`
- Then property B **MUST** have `domain: ClassX`

Violating this creates logically inconsistent RDF graphs and fails RDFS validation.

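
The domain/range constraint above can be checked mechanically. The sketch below is illustrative only: `check_inverse_slots` is a hypothetical helper, and it assumes the `slots:` mapping of a schema has already been loaded into plain Python dicts. It does not resolve imports or inherited domains, which a real check over a modular schema would need.

```python
from typing import Dict, List, Optional


def check_inverse_slots(slots: Dict[str, Optional[dict]]) -> List[str]:
    """Return human-readable problems for slots that declare `inverse:`."""
    problems: List[str] = []
    for name, slot in slots.items():
        slot = slot or {}
        inverse_name = slot.get("inverse")
        if inverse_name is None:
            continue  # The rule only applies to slots with an inverse.
        if slot.get("domain") is None:
            problems.append(f"{name}: missing explicit domain")
        inverse = slots.get(inverse_name)
        if inverse is None:
            problems.append(f"{name}: inverse slot '{inverse_name}' is not defined")
            continue
        # Range of A must equal the domain of B (its inverse).
        if slot.get("range") != inverse.get("domain"):
            problems.append(
                f"{name}: range {slot.get('range')} does not match "
                f"domain of {inverse_name} ({inverse.get('domain')})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical input: the same shape as a LinkML `slots:` section.
    example = {
        "collections_under_responsibility": {
            "range": "LegalResponsibilityCollection",  # domain deliberately omitted
            "inverse": "responsible_legal_entity",
        },
        "responsible_legal_entity": {
            "domain": "LegalResponsibilityCollection",
            "range": "CustodianLegalStatus",
            "inverse": "collections_under_responsibility",
        },
    }
    for problem in check_inverse_slots(example):
        print(problem)
```

Because each slot in a pair is visited in turn, checking "range of A equals domain of B" for every slot also covers the reverse direction, provided both slots declare `inverse:`.
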

### The Solution: Always Declare Domain for Inverse Slots

**Every slot with an `inverse:` declaration MUST have an explicit `domain:`**

```yaml
# ✅ CORRECT - Both slots have domain/range aligned with inverse
slots:
  collections_under_responsibility:
    domain: CustodianLegalStatus  # ← Domain explicitly declared
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity

  responsible_legal_entity:
    domain: LegalResponsibilityCollection  # ← Must match range of inverse
    range: CustodianLegalStatus  # ← Must match domain of inverse
    inverse: collections_under_responsibility
```

```yaml
# ❌ WRONG - Missing domain violates RDFS
slots:
  collections_under_responsibility:
    # domain: ???  # ← Missing! RDFS non-compliant
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity
```

### Polymorphic Inverse Slots

For slots used by multiple classes (polymorphic), create an **abstract base class**:

```yaml
# ✅ CORRECT - Abstract base class for RDFS compliance
classes:
  ReconstructedEntity:
    abstract: true
    class_uri: prov:Entity
    description: "Abstract base for all entities generated by ReconstructionActivity"

  CustodianLegalStatus:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...

  CustodianName:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...

slots:
  generates:
    domain: ReconstructionActivity
    range: ReconstructedEntity  # ← Abstract base class
    inverse: was_generated_by

  was_generated_by:
    domain: ReconstructedEntity  # ← Abstract base class
    range: ReconstructionActivity
    inverse: generates
```

### Validation Checklist

Before committing schema changes with inverse slots:

1. **Every inverse slot pair has explicit domain/range**
2. **Domain of slot A = Range of slot B (its inverse)**
3. **Range of slot A = Domain of slot B (its inverse)**
4. **For polymorphic slots: abstract base class exists and all using classes inherit from it**

### Quick Reference: Fixed Inverse Pairs

| Slot A | Domain A | Range A | ↔ | Slot B | Domain B | Range B |
|--------|----------|---------|---|--------|----------|---------|
| `collections_under_responsibility` | CustodianLegalStatus | LegalResponsibilityCollection | ↔ | `responsible_legal_entity` | LegalResponsibilityCollection | CustodianLegalStatus |
| `staff_members` | OrganizationalStructure | PersonObservation | ↔ | `unit_affiliation` | PersonObservation | OrganizationalStructure |
| `portal_data_sources` | WebPortal | CollectionManagementSystem | ↔ | `feeds_portal` | CollectionManagementSystem | WebPortal |
| `exposed_via_portal` | CustodianCollection | WebPortal | ↔ | `exposes_collections` | WebPortal | CustodianCollection |
| `has_observation` | Custodian | CustodianObservation | ↔ | `refers_to_custodian` | CustodianObservation | Custodian |
| `identified_by` | Custodian | CustodianIdentifier | ↔ | `identifies_custodian` | CustodianIdentifier | Custodian |
| `encompasses` | EncompassingBody | Custodian | ↔ | `encompassing_body` | Custodian | EncompassingBody |
| `generates` | ReconstructionActivity | ReconstructedEntity | ↔ | `was_generated_by` | ReconstructedEntity | ReconstructionActivity |
| `used` | ReconstructionActivity | CustodianObservation | ↔ | `used_by` | CustodianObservation | ReconstructionActivity |
| `affects_organization` | OrganizationalChangeEvent | Custodian | ↔ | `organizational_change_events` | Custodian | OrganizationalChangeEvent |
| `platform_of` | DigitalPlatform | Custodian | ↔ | `digital_platform` | Custodian | DigitalPlatform |
| `identifies` | CustodianIdentifier | Custodian | ↔ | `identifiers` | Custodian | CustodianIdentifier |
| `allocates` | AllocationAgency | CustodianIdentifier | ↔ | `allocated_by` | CustodianIdentifier | AllocationAgency |
| `is_legal_status_of` | CustodianLegalStatus | Custodian | ↔ | `legal_status` | Custodian | CustodianLegalStatus |

### Abstract Base Class: ReconstructedEntity

Created to ensure RDFS compliance for `generates`/`was_generated_by` inverse pair:

**File**: `schemas/20251121/linkml/modules/classes/ReconstructedEntity.yaml`

**Subclasses** (20 classes inherit from ReconstructedEntity):
- ArticlesOfAssociation
- AuxiliaryDigitalPlatform
- AuxiliaryPlace
- Budget
- CollectionManagementSystem
- CustodianAdministration
- CustodianArchive
- CustodianCollection (and its subclass LegalResponsibilityCollection)
- CustodianLegalStatus
- CustodianName
- CustodianPlace
- DigitalPlatform
- FeaturePlace
- FinancialStatement
- GiftShop
- InternetOfThings
- OrganizationBranch
- SocialMediaProfile
- WebPortal

---

**Status**: ✅ ACTIVE RULES
**Version**: 1.2
**Last Updated**: 2025-12-01
**Applies To**: All LinkML schema work in this project

**See Also**:
- `.opencode/HYPER_MODULAR_STRUCTURE.md` - Module organization
- `.opencode/SLOT_NAMING_CONVENTIONS.md` - Slot naming patterns
- `AGENTS.md` - AI agent instructions