glam/.opencode/SCHEMA_GENERATION_RULES.md
2025-12-01 16:06:34 +01:00

# Schema Generation Rules for AI Agents
**Date**: 2025-11-22
**Purpose**: Standard rules for generating derived artifacts from LinkML schemas
---
## Rule 1: Always Use Full Timestamps in Generated File Names
**MANDATORY**: When generating derived artifacts (RDF, UML, etc.) from LinkML schemas, **ALWAYS** include a full timestamp (date AND time) in the filename.
### Format
```
{base_name}_{YYYYMMDD}_{HHMMSS}.{extension}
```
### Examples
```bash
# ✅ CORRECT - Full timestamp (date + time)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
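# NOTE: gen-yuml emits yUML syntax; if the .mmd file must render as Mermaid, use a Mermaid-capable generator such as gen-erdiagram
gen-yuml schemas/linkml/schema.yaml > "schemas/uml/mermaid/schema_${TIMESTAMP}.mmd"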
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/schema_${TIMESTAMP}.owl.ttl
# Examples of correct filenames:
custodian_multi_aspect_20251122_154136.mmd
custodian_multi_aspect_20251122_154430.owl.ttl
custodian_multi_aspect_20251122_154430.nt
custodian_multi_aspect_20251122_154430.jsonld
custodian_multi_aspect_20251122_154430.rdf
# ❌ WRONG - No timestamp
schema.mmd
01_custodian_name.owl.ttl
# ❌ WRONG - Date only (MISSING TIME!)
schema_20251122.mmd
custodian_multi_aspect_20251122.owl.ttl
# ❌ WRONG - Time only (missing date)
schema_154430.mmd
```
### Rationale
1. **Version tracking**: A full timestamp identifies each generated version precisely
2. **No overwrites**: Several generation runs on the same day, by one person or a whole team, never collide with or replace earlier output
3. **Debugging**: The filename shows exactly when an artifact was generated
4. **Rollback**: Any earlier version can be restored by name
5. **Audit trail**: Filenames record schema evolution in chronological order
6. **Git-friendly**: Successive versions are separate files that are easy to diff
7. **Reproducibility**: Generated artifacts can be correlated with git commits
### Critical Note
The timestamp must include BOTH date and time (YYYYMMDD_HHMMSS), not just date. This allows multiple generation runs per day without filename conflicts.
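The naming rule can also be enforced in code. The following is a minimal sketch, not part of the project's tooling: the helper names `timestamped_name` and `has_full_timestamp` are hypothetical, and the regular expression is an assumption about how strictly filenames should be checked.

```python
import re
from datetime import datetime
from typing import Optional

# Matches {base_name}_{YYYYMMDD}_{HHMMSS}.{extension}; the extension may be
# compound (for example "owl.ttl").
FULL_TIMESTAMP = re.compile(
    r"^(?P<base>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.(?P<ext>[A-Za-z0-9.]+)$"
)


def timestamped_name(base_name: str, extension: str, now: Optional[datetime] = None) -> str:
    """Build a filename such as custodian_20251122_154430.owl.ttl."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{base_name}_{stamp}.{extension.lstrip('.')}"


def has_full_timestamp(filename: str) -> bool:
    """Return True only if the name carries a valid date AND time."""
    match = FULL_TIMESTAMP.match(filename)
    if match is None:
        return False
    try:
        # Rejects impossible values such as month 13 or hour 25.
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True
```

A check like `has_full_timestamp` could run in a pre-commit hook to reject date-only or untimestamped artifacts.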
---
## Rule 2: LinkML is the Single Source of Truth
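**NEVER** manually create or edit derived files. Always generate from LinkML. The one exception is the TypeDB schema, which is translated by hand from the LinkML source and documented (see Rule 8).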
### Correct Workflow ✅
```
1. Edit LinkML schema (.yaml)
2. Generate RDF formats (gen-owl + rdfpipe)
3. Generate UML diagrams (gen-yuml)
4. Generate TypeDB schema (manual translation, but documented)
5. Validate examples (linkml-validate)
```
### Incorrect Workflow ❌
```
❌ Editing .ttl files directly
❌ Creating .jsonld manually
❌ Drawing UML diagrams by hand
❌ Modifying TypeDB schema without updating LinkML
```
---
## Rule 3: Generate All RDF Serialization Formats
When generating RDF from LinkML, produce all standard serialization formats:
### Required Formats
1. **OWL/Turtle** (.owl.ttl) - Primary, human-readable
2. **N-Triples** (.nt) - Simple, line-based
3. **JSON-LD** (.jsonld) - Web-friendly, JSON-based
4. **RDF/XML** (.rdf) - XML-based, traditional
### Generation Commands
```bash
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="schema_${TIMESTAMP}"
# 1. Generate OWL/Turtle (primary)
gen-owl -f ttl schemas/linkml/schema.yaml > schemas/rdf/${BASE_NAME}.owl.ttl
# 2. Convert to other formats using rdfpipe
rdfpipe --input-format turtle --output-format nt schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.nt
rdfpipe --input-format turtle --output-format json-ld schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.jsonld
rdfpipe --input-format turtle --output-format xml schemas/rdf/${BASE_NAME}.owl.ttl > schemas/rdf/${BASE_NAME}.rdf
```
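When the conversions are driven from Python rather than a shell, the same three `rdfpipe` calls can be planned as data first. This is a sketch under assumptions: `conversion_commands` and `DERIVED_FORMATS` are hypothetical names, and the function only builds argument lists and output paths without running anything.

```python
from pathlib import Path
from typing import List, Tuple

# rdfpipe output-format name -> file extension, as listed under "Required Formats".
DERIVED_FORMATS = {"nt": ".nt", "json-ld": ".jsonld", "xml": ".rdf"}


def conversion_commands(turtle_file: str) -> List[Tuple[List[str], Path]]:
    """Return (argv, output_path) pairs for converting one .owl.ttl file."""
    source = Path(turtle_file)
    if not source.name.endswith(".owl.ttl"):
        raise ValueError(f"expected an .owl.ttl file, got {source.name}")
    stem = source.name[: -len(".owl.ttl")]
    commands = []
    for fmt, ext in DERIVED_FORMATS.items():
        argv = ["rdfpipe", "--input-format", "turtle", "--output-format", fmt, str(source)]
        # Output lands next to the source file and shares its timestamped stem.
        commands.append((argv, source.with_name(stem + ext)))
    return commands
```

Each pair can then be executed with `subprocess.run(argv, stdout=handle, check=True)`, where `handle` is the output path opened for writing, so that a failed conversion stops the run.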
---
## Rule 4: Validate Before Committing
Before committing schema changes, **ALWAYS**:
1. **Validate LinkML schema**:
```bash
gen-owl -f ttl schemas/linkml/schema.yaml > /tmp/test_validation.ttl
# gen-owl exits non-zero and prints a traceback to stderr if the schema cannot be processed
```
2. **Validate example instances**:
```bash
linkml-validate -s schemas/linkml/schema.yaml schemas/examples/instance.yaml
```
3. **Check RDF triples count**:
```bash
wc -l schemas/rdf/*.nt # N-Triples are easy to count
```
4. **Verify class presence**:
```bash
grep -c "ClassName" schemas/rdf/*.owl.ttl
```
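Checks 3 and 4 can be scripted so the results are comparable between runs. The sketch below is an illustration only: `count_ntriples` and `missing_classes` are hypothetical helpers, and the class check is a plain substring search (like `grep`), not an RDF-aware query.

```python
from typing import List


def count_ntriples(text: str) -> int:
    """Count statements in N-Triples text, skipping blank lines and # comments."""
    return sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def missing_classes(turtle_text: str, class_names: List[str]) -> List[str]:
    """Return the class names that never appear in the generated Turtle."""
    return [name for name in class_names if name not in turtle_text]
```

Unlike `wc -l`, the count ignores blank and comment lines, so it matches the number of triples even if a tool adds a header comment.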
---
## Rule 5: Document Schema Changes
Every schema change requires:
1. **Quick status document**: `QUICK_STATUS_{TOPIC}_{YYYYMMDD}.md`
2. **Session summary**: `SESSION_SUMMARY_{YYYYMMDD}_{TOPIC}.md`
3. **Updated examples**: Add/update instance files demonstrating changes
4. **Commit message**: Reference quick status document
### Template: Quick Status Document
```markdown
# Quick Status: {Topic}
Date: YYYY-MM-DD
Status: ✅ COMPLETE / ⏳ IN PROGRESS
Priority: HIGH / MEDIUM / LOW
## What We Did
...
## Key Changes
...
## Files Modified
...
## Validation Results
...
## Next Steps
...
```
---
## Rule 6: Example Instances Are Required
For every new class or major schema change:
1. Create at least ONE complete example instance
2. Place in `schemas/{version}/examples/`
3. Use descriptive filenames: `{class_name}_{use_case}_{timestamp}.yaml`
4. Include all required slots and at least 2-3 optional slots
5. Add inline comments explaining non-obvious fields
### Example Instance Template
```yaml
---
# Complete Example: {ClassName}
# Date: YYYY-MM-DD
# Use Case: {Description}
# Status: Valid instance conforming to schema version {X.Y.Z}
instances:
  - id: https://example.org/id
    required_field_1: "value"
    required_field_2: "value"
    optional_field: "value"  # Explanation of when to use this field
    # ... more fields
```
---
## Rule 7: UML Diagram Conventions
When generating UML diagrams:
### File Naming
```
{schema_name}_{diagram_type}_{YYYYMMDD}_{HHMMSS}.mmd
```
Examples:
- `custodian_class_diagram_20251122_154136.mmd`
- `prov_flow_sequence_20251122_154200.mmd`
### Diagram Types
- `class_diagram` - Class hierarchies and relationships
- `sequence` - PROV-O temporal flows
- `state` - State transitions (e.g., organizational change events)
- `er` - Entity-relationship (database perspective)
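A diagram filename can be split back into its parts to confirm it follows the convention. This is a minimal sketch: `parse_diagram_name` and `DiagramName` are hypothetical names, and the accepted diagram types are simply the four listed above.

```python
import re
from typing import NamedTuple, Optional

DIAGRAM_TYPES = ("class_diagram", "sequence", "state", "er")


class DiagramName(NamedTuple):
    schema_name: str
    diagram_type: str
    timestamp: str


_PATTERN = re.compile(
    r"^(?P<schema>.+)_(?P<type>" + "|".join(DIAGRAM_TYPES) + r")_(?P<stamp>\d{8}_\d{6})\.mmd$"
)


def parse_diagram_name(filename: str) -> Optional[DiagramName]:
    """Return the parts of a conforming diagram filename, or None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return DiagramName(match["schema"], match["type"], match["stamp"])
```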
### Storage Location
```
schemas/{version}/uml/mermaid/{timestamp_files}.mmd
```
---
## Rule 8: TypeDB Schema Updates
TypeDB schemas are **manually translated** from LinkML (not auto-generated).
### Required Steps
1. Update LinkML schema first
2. Regenerate RDF to verify OWL alignment
3. Manually update TypeDB schema (.tql)
4. Document translation decisions
5. Test TypeDB queries
### Translation Documentation
Create `TYPEDB_TRANSLATION_NOTES.md` documenting:
- LinkML class → TypeDB entity/relation mapping
- Slot → attribute mapping
- Constraints and rules
- Query examples
---
## Rule 9: Version Control for Generated Files
### What to Commit
**DO commit**:
- LinkML schema files (.yaml)
- Example instances (.yaml)
- Documentation (.md)
- Latest timestamped RDF (keep last 3 versions)
- Latest timestamped UML (keep last 3 versions)
**DO NOT commit**:
- Temporary validation files (/tmp/*)
- Old versions (>3 generations old)
- Duplicate non-timestamped files
### Cleanup Script
```bash
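# Keep only the 3 newest timestamped versions of each RDF serialization.
# Sort on the timestamp in the filename; git does not preserve modification times.
cd schemas/rdf
for ext in owl.ttl nt jsonld rdf; do
  ls schema_*."$ext" 2>/dev/null | sort -r | tail -n +4 | xargs rm -f
done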
```
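The same retention policy can be computed without deleting anything, which makes it easy to review before cleanup. This is a sketch, not project tooling: `files_to_prune` is a hypothetical helper that groups files by base name and extension and relies on the `YYYYMMDD_HHMMSS` stamp sorting chronologically as text.

```python
import re
from collections import defaultdict
from typing import Iterable, List

STAMPED = re.compile(r"^(?P<base>.+)_(?P<stamp>\d{8}_\d{6})(?P<ext>\..+)$")


def files_to_prune(filenames: Iterable[str], keep: int = 3) -> List[str]:
    """Return timestamped files older than the newest `keep` per (base, extension)."""
    groups = defaultdict(list)
    for name in filenames:
        match = STAMPED.match(name)
        if match:  # files without a full timestamp are left alone
            groups[(match["base"], match["ext"])].append((match["stamp"], name))
    stale = []
    for versions in groups.values():
        versions.sort(reverse=True)  # newest first
        stale.extend(name for _, name in versions[keep:])
    return sorted(stale)
```

Printing the result first and deleting only after review avoids losing a version that is still referenced in documentation.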
---
## Rule 10: Generation Workflow Template
Standard workflow for schema changes:
```bash
#!/bin/bash
# Schema Generation Workflow
# Usage: ./generate_schema_artifacts.sh
set -euo pipefail # Exit on error, unset variable, or failed pipeline stage
SCHEMA_DIR="schemas/20251121"
SCHEMA_FILE="${SCHEMA_DIR}/linkml/01_custodian_name_modular.yaml"
RDF_DIR="${SCHEMA_DIR}/rdf"
UML_DIR="${SCHEMA_DIR}/uml/mermaid"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BASE_NAME="custodian_${TIMESTAMP}"
mkdir -p "$RDF_DIR" "$UML_DIR" # Output directories must exist before redirecting into them
echo "=== Schema Generation Workflow ==="
echo "Timestamp: $TIMESTAMP"
echo ""
# Step 1: Validate LinkML (a failing gen-owl stops the script; errors stay visible on stderr)
echo "Step 1: Validating LinkML schema..."
gen-owl -f ttl "$SCHEMA_FILE" > /tmp/validation_test.ttl
echo "✅ Schema valid"
# Step 2: Generate RDF formats
echo "Step 2: Generating RDF formats..."
gen-owl -f ttl "$SCHEMA_FILE" > "${RDF_DIR}/${BASE_NAME}.owl.ttl"
rdfpipe --input-format turtle --output-format nt "${RDF_DIR}/${BASE_NAME}.owl.ttl" > "${RDF_DIR}/${BASE_NAME}.nt"
rdfpipe --input-format turtle --output-format json-ld "${RDF_DIR}/${BASE_NAME}.owl.ttl" > "${RDF_DIR}/${BASE_NAME}.jsonld"
rdfpipe --input-format turtle --output-format xml "${RDF_DIR}/${BASE_NAME}.owl.ttl" > "${RDF_DIR}/${BASE_NAME}.rdf"
echo "✅ RDF formats generated"
# Step 3: Generate UML
echo "Step 3: Generating UML diagrams..."
gen-yuml "$SCHEMA_FILE" > "${UML_DIR}/${BASE_NAME}.mmd"
echo "✅ UML diagram generated"
# Step 4: Validate examples (count failures so the summary line is accurate)
echo "Step 4: Validating example instances..."
failed=0
for example in "${SCHEMA_DIR}"/examples/*.yaml; do
  linkml-validate -s "$SCHEMA_FILE" "$example" || { echo "⚠️ Warning: $example failed validation"; failed=$((failed + 1)); }
done
if [ "$failed" -eq 0 ]; then echo "✅ Examples validated"; else echo "⚠️ $failed example(s) failed validation"; fi
# Step 5: Report
echo ""
echo "=== Generation Complete ==="
ls -lh "${RDF_DIR}/${BASE_NAME}".* "${UML_DIR}/${BASE_NAME}.mmd" | awk '{print $9, "("$5")"}'
echo ""
echo "Next: Update documentation and commit"
```
---
## Quick Reference Commands
### Generate All Artifacts
```bash
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
gen-owl -f ttl schema.yaml > schema_${TIMESTAMP}.owl.ttl
gen-yuml schema.yaml > schema_${TIMESTAMP}.mmd
```
### Validate
```bash
gen-owl -f ttl schema.yaml > /tmp/test.ttl # Check for errors
linkml-validate -s schema.yaml instance.yaml
```
### Convert RDF Formats
```bash
rdfpipe -i turtle -o nt file.ttl > file.nt
rdfpipe -i turtle -o json-ld file.ttl > file.jsonld
rdfpipe -i turtle -o xml file.ttl > file.rdf
```
### Check RDF Content
```bash
grep -c "ClassName" file.owl.ttl # Count lines that mention the class
wc -l file.nt # Count triples
```
---
## Rule 11: SHACL Generation for Modular Schemas
### The Problem
`gen-shacl` (LinkML's built-in SHACL generator) **fails on modular schemas** with errors like:
```
KeyError: 'contributing_agency'
```
This is a LinkML bug where:
1. `schema_map` is keyed by import paths (e.g., `modules/classes/ContributingAgency`)
2. But lookups use schema names (e.g., `contributing_agency`)
3. This mismatch causes `KeyError` during `from_schema` resolution
Other tools (`gen-owl`, `gen-yaml`, `linkml-lint`) work fine because they don't perform this specific lookup.
### The Solution: Use `scripts/generate_shacl.py`
We provide a workaround script that:
1. Loads schema via `SchemaView` (correctly resolves all imports)
2. Merges all classes/slots/enums from imports into main schema
3. Clears `from_schema` references that cause the bug
4. Excludes built-in types (avoid conflicts with `linkml:types`)
5. Writes to temp file and runs `ShaclGenerator`
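The merge-and-clear idea behind steps 2 to 4 can be illustrated on plain dictionaries. This sketch is not the contents of `scripts/generate_shacl.py`, which works through `SchemaView`; `flatten_schema` is a hypothetical function and the dict layout is an assumption that mirrors LinkML YAML.

```python
import copy
from typing import Dict, List, Set


def flatten_schema(main: Dict, imported: List[Dict], builtin_types: Set[str]) -> Dict:
    """Merge imported definitions into one self-contained schema dict."""
    merged = copy.deepcopy(main)
    for section in ("classes", "slots", "enums", "types"):
        target = merged.setdefault(section, {})
        for module in imported:
            for name, definition in (module.get(section) or {}).items():
                if section == "types" and name in builtin_types:
                    continue  # built-ins are supplied by linkml:types
                # Definitions already present in the main schema take precedence.
                target.setdefault(name, copy.deepcopy(definition))
        for definition in target.values():
            if isinstance(definition, dict):
                # Drop the reference whose lookup triggers the KeyError.
                definition.pop("from_schema", None)
    # Everything is inlined now, so only the built-in types import remains.
    merged["imports"] = ["linkml:types"]
    return merged
```

The inputs are deep-copied, so the module definitions on disk or in memory are never modified by the workaround.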
### Usage
```bash
# Generate SHACL with default settings (timestamped output)
python scripts/generate_shacl.py
# Generate SHACL with verbose output (shows all steps)
python scripts/generate_shacl.py --verbose
# Generate SHACL to specific file
python scripts/generate_shacl.py --output schemas/20251121/shacl/custom_shapes.ttl
# Use custom schema file
python scripts/generate_shacl.py --schema path/to/schema.yaml
# Write to stdout (for piping)
python scripts/generate_shacl.py --stdout
```
### Output Location
By default, generates timestamped files:
```
schemas/20251121/shacl/custodian_shacl_{YYYYMMDD}_{HHMMSS}.ttl
```
### What You'll See
The script produces warnings that are **expected and safe to ignore**:
```
# Inverse slot warnings (schema design issue, doesn't affect SHACL)
Range of slot 'collections_under_responsibility' (LegalResponsibilityCollection)
does not line with the domain of its inverse (responsible_legal_entity)
# Unrecognized prefix warnings (prefixes defined in modules, not merged)
File "linkml_shacl_xxx.yaml", line 147, col 10: Unrecognized prefix: geosparql
```
These warnings don't prevent SHACL generation from succeeding.
### Example Output
```
================================================================================
SHACL GENERATION (with modular schema workaround)
================================================================================
Schema: schemas/20251121/linkml/01_custodian_name_modular.yaml
Output format: turtle
Output file: schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl
================================================================================
Step 1: Loading schema via SchemaView...
Loaded schema: heritage-custodian-observation-reconstruction
Classes (via imports): 93
Slots (via imports): 857
Enums (via imports): 51
Step 2: Merging imported definitions into main schema...
Merged 93 classes
Merged 857 slots
Merged 51 enums
Merged 1 types (excluding 19 builtins)
Step 3: Clearing from_schema references...
Cleared 1002 from_schema references
Step 4: Simplifying imports...
Original imports: 253
New imports: ['linkml:types']
Step 5: Writing cleaned schema to temp file...
Step 6: Running ShaclGenerator...
Generated 14924 lines of SHACL
Step 7: Writing output...
================================================================================
✅ SHACL GENERATION COMPLETE
================================================================================
```
### Validating SHACL Output
After generation, validate the SHACL file:
```bash
# Load into rdflib and count shapes
python3 -c "
from rdflib import Graph
from rdflib.namespace import RDF, SH
g = Graph()
g.parse('schemas/20251121/shacl/custodian_shacl_20251201_084946.ttl', format='turtle')
print(f'Triples: {len(g)}')
# Match only rdf:type statements so other uses of sh:NodeShape are not counted
shapes = set(g.subjects(RDF.type, SH.NodeShape))
print(f'NodeShapes: {len(shapes)}')
"
```
### Using SHACL for Validation
Use `scripts/validate_with_shacl.py` to validate RDF data:
```bash
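# Validate Turtle file against the newest shapes file (timestamped names sort chronologically)
SHAPES=$(ls schemas/20251121/shacl/custodian_shacl_*.ttl | sort | tail -n 1)
python scripts/validate_with_shacl.py data.ttl --shapes "$SHAPES"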
# Validate JSON-LD file
python scripts/validate_with_shacl.py data.jsonld --format jsonld
```
### Why Not Just Use `gen-shacl` Directly?
**DON'T DO THIS** (it will fail):
```bash
# ❌ FAILS with KeyError on modular schemas
gen-shacl schemas/20251121/linkml/01_custodian_name_modular.yaml
```
**DO THIS INSTEAD**:
```bash
# ✅ Works via workaround script
python scripts/generate_shacl.py
```
---
## Rule 12: Inverse Slot RDFS Compliance
### The Problem
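In RDFS/OWL, the domain and range of a property and its inverse mirror each other:
- If property A has `range: ClassX` and `inverse: B`
- Then property B **MUST** have `domain: ClassX`
A missing or mismatched domain leaves the declared axioms at odds with what the inverse relationship entails, which can lead to unintended inferences and triggers inverse-slot warnings during generation.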
### The Solution: Always Declare Domain for Inverse Slots
**Every slot with an `inverse:` declaration MUST have an explicit `domain:`**
```yaml
# ✅ CORRECT - Both slots have domain/range aligned with inverse
slots:
  collections_under_responsibility:
    domain: CustodianLegalStatus  # ← Domain explicitly declared
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity
  responsible_legal_entity:
    domain: LegalResponsibilityCollection  # ← Must match range of inverse
    range: CustodianLegalStatus  # ← Must match domain of inverse
    inverse: collections_under_responsibility
```
```yaml
# ❌ WRONG - Missing domain violates RDFS
slots:
  collections_under_responsibility:
    # domain: ???  # ← Missing! RDFS non-compliant
    range: LegalResponsibilityCollection
    inverse: responsible_legal_entity
```
### Polymorphic Inverse Slots
For slots used by multiple classes (polymorphic), create an **abstract base class**:
```yaml
# ✅ CORRECT - Abstract base class for RDFS compliance
classes:
  ReconstructedEntity:
    abstract: true
    class_uri: prov:Entity
    description: "Abstract base for all entities generated by ReconstructionActivity"
  CustodianLegalStatus:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...
  CustodianName:
    is_a: ReconstructedEntity  # ← Inherits from abstract base
    # ...
slots:
  generates:
    domain: ReconstructionActivity
    range: ReconstructedEntity  # ← Abstract base class
    inverse: was_generated_by
  was_generated_by:
    domain: ReconstructedEntity  # ← Abstract base class
    range: ReconstructionActivity
    inverse: generates
```
```
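Whether a class really descends from the abstract base can be checked by following its `is_a` chain. This is a minimal sketch on plain dictionaries shaped like the `classes:` section above; `inherits_from` is a hypothetical helper, and mixins are deliberately ignored.

```python
from typing import Dict


def inherits_from(classes: Dict[str, dict], name: str, base: str) -> bool:
    """Follow is_a links from `name` and report whether `base` is reached."""
    seen = set()
    current = name
    while current is not None and current not in seen:
        if current == base:
            return True
        seen.add(current)  # guards against accidental is_a cycles
        current = (classes.get(current) or {}).get("is_a")
    return False
```

Note that the function treats a class as inheriting from itself, so `inherits_from(classes, "ReconstructedEntity", "ReconstructedEntity")` is true.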
### Validation Checklist
Before committing schema changes with inverse slots:
1. **Every inverse slot pair has explicit domain/range**
2. **Domain of slot A = Range of slot B (its inverse)**
3. **Range of slot A = Domain of slot B (its inverse)**
4. **For polymorphic slots: abstract base class exists and all using classes inherit from it**
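Items 1 to 3 of the checklist are mechanical and can be verified by a short script. The sketch below assumes the slots are available as a plain dictionary shaped like the `slots:` section of the schema; `inverse_slot_problems` is a hypothetical helper, not an existing LinkML function.

```python
from typing import Dict, List


def inverse_slot_problems(slots: Dict[str, dict]) -> List[str]:
    """Report inverse pairs whose domain/range are missing or misaligned."""
    problems = []
    checked = set()
    for name, slot in slots.items():
        inverse_name = slot.get("inverse")
        if inverse_name is None:
            continue
        pair = frozenset((name, inverse_name))
        if pair in checked:
            continue  # each pair is examined once, from the first slot seen
        checked.add(pair)
        inverse = slots.get(inverse_name)
        if inverse is None:
            problems.append(f"{name}: inverse '{inverse_name}' is not defined")
            continue
        for slot_name, definition in ((name, slot), (inverse_name, inverse)):
            for key in ("domain", "range"):
                if not definition.get(key):
                    problems.append(f"{slot_name}: missing {key}")
        if slot.get("domain") and inverse.get("range") and slot["domain"] != inverse["range"]:
            problems.append(f"{name}.domain != {inverse_name}.range")
        if slot.get("range") and inverse.get("domain") and slot["range"] != inverse["domain"]:
            problems.append(f"{name}.range != {inverse_name}.domain")
    return problems
```

An empty list means every declared inverse pair satisfies the first three checklist items; item 4 (abstract base classes) still needs a separate inheritance check.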
### Quick Reference: Fixed Inverse Pairs
| Slot A | Domain A | Range A | ↔ | Slot B | Domain B | Range B |
|--------|----------|---------|---|--------|----------|---------|
| `collections_under_responsibility` | CustodianLegalStatus | LegalResponsibilityCollection | ↔ | `responsible_legal_entity` | LegalResponsibilityCollection | CustodianLegalStatus |
| `staff_members` | OrganizationalStructure | PersonObservation | ↔ | `unit_affiliation` | PersonObservation | OrganizationalStructure |
| `portal_data_sources` | WebPortal | CollectionManagementSystem | ↔ | `feeds_portal` | CollectionManagementSystem | WebPortal |
| `exposed_via_portal` | CustodianCollection | WebPortal | ↔ | `exposes_collections` | WebPortal | CustodianCollection |
| `has_observation` | Custodian | CustodianObservation | ↔ | `refers_to_custodian` | CustodianObservation | Custodian |
| `identified_by` | Custodian | CustodianIdentifier | ↔ | `identifies_custodian` | CustodianIdentifier | Custodian |
| `encompasses` | EncompassingBody | Custodian | ↔ | `encompassing_body` | Custodian | EncompassingBody |
| `generates` | ReconstructionActivity | ReconstructedEntity | ↔ | `was_generated_by` | ReconstructedEntity | ReconstructionActivity |
| `used` | ReconstructionActivity | CustodianObservation | ↔ | `used_by` | CustodianObservation | ReconstructionActivity |
| `affects_organization` | OrganizationalChangeEvent | Custodian | ↔ | `organizational_change_events` | Custodian | OrganizationalChangeEvent |
| `platform_of` | DigitalPlatform | Custodian | ↔ | `digital_platform` | Custodian | DigitalPlatform |
| `identifies` | CustodianIdentifier | Custodian | ↔ | `identifiers` | Custodian | CustodianIdentifier |
| `allocates` | AllocationAgency | CustodianIdentifier | ↔ | `allocated_by` | CustodianIdentifier | AllocationAgency |
| `is_legal_status_of` | CustodianLegalStatus | Custodian | ↔ | `legal_status` | Custodian | CustodianLegalStatus |
### Abstract Base Class: ReconstructedEntity
Created to ensure RDFS compliance for the `generates`/`was_generated_by` inverse pair:
**File**: `schemas/20251121/linkml/modules/classes/ReconstructedEntity.yaml`
**Subclasses** (20 classes inherit from ReconstructedEntity):
- ArticlesOfAssociation
- AuxiliaryDigitalPlatform
- AuxiliaryPlace
- Budget
- CollectionManagementSystem
- CustodianAdministration
- CustodianArchive
- CustodianCollection (and its subclass LegalResponsibilityCollection)
- CustodianLegalStatus
- CustodianName
- CustodianPlace
- DigitalPlatform
- FeaturePlace
- FinancialStatement
- GiftShop
- InternetOfThings
- OrganizationBranch
- SocialMediaProfile
- WebPortal
---
**Status**: ✅ ACTIVE RULES
**Version**: 1.2
**Last Updated**: 2025-12-01
**Applies To**: All LinkML schema work in this project
**See Also**:
- `.opencode/HYPER_MODULAR_STRUCTURE.md` - Module organization
- `.opencode/SLOT_NAMING_CONVENTIONS.md` - Slot naming patterns
- `AGENTS.md` - AI agent instructions