# LinkML Schema Validation Patterns

This document captures critical patterns discovered during the GLAM LinkML schema validation effort against 33,444+ custodian YAML data files.

## Overview

The GLAM project uses LinkML schemas to define the structure of heritage custodian data. Validation ensures data files conform to the schema before ingestion into databases.

**Key Files**:
- Main schema entry: `schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml`
- Classes: `schemas/20251121/linkml/modules/classes/`
- Slots: `schemas/20251121/linkml/modules/slots/`
- Enums: `schemas/20251121/linkml/modules/enums/`
- Data: `data/custodian/*.yaml` (33,444+ files)

## Validation Command

```bash
# Single file
linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml data/custodian/FILE.yaml

# Batch test
for f in data/custodian/A*.yaml; do
  linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml "$f" 2>&1 | grep -q ERROR && echo "FAIL: $f"
done
```

---

## Critical Patterns

### Pattern 1: Union Types Require `range: Any`

**Priority**: 🚨 CRITICAL

When using `any_of` to define union types (e.g., string OR integer), you MUST specify `range: Any`.

```yaml
# CORRECT
attributes:
  identifier_value:
    range: Any  # ← REQUIRED
    any_of:
      - range: string
      - range: integer

# WRONG (validation silently fails)
attributes:
  identifier_value:
    any_of:
      - range: string
      - range: integer
```

**Affected Fields**:
- `identifier_value` (string | integer)
- `geonames_id` (string | integer)
- `youtube_channel_id` (string | array)
- `facebook_id` (string | array)
- `identifiers` (dict | array)

**See**: Rule 59 in AGENTS.md, `.opencode/rules/linkml-union-type-range-any-rule.md`

---

### Pattern 2: Flexible Data Structures (dict OR array)

Some fields accept either dict or array format depending on data source.

```yaml
# Dict format (common in manual entry)
identifiers:
  isil: "NL-123"
  wikidata: "Q12345"

# Array format (common in API responses)
identifiers:
  - scheme: isil
    value: "NL-123"
  - scheme: wikidata
    value: "Q12345"
```

**Schema Solution**:

```yaml
attributes:
  identifiers:
    range: Any  # Accept both formats
    description: >-
      Identifiers from source. Accepts dict format {scheme: value}
      or array format [{scheme: X, value: Y}].
```

---

### Pattern 3: Adding Missing Fields to Classes

When data contains fields not in schema, add them to the appropriate class:

```yaml
# In the class YAML file (e.g., CustodianSourceFile.yaml)
attributes:
  new_field_name:
    range: string  # or appropriate type
    description: Description of the field
    # Optional properties:
    required: false
    multivalued: false
    inlined: true  # For complex types
```

**Validation Error Example**:

```
Additional properties are not allowed ('new_field' was unexpected)
```

**Fix**: Add the field to the class's `attributes` section.

---

### Pattern 4: Nested Class Inlining

When a field references another class, use `inlined: true`:

```yaml
attributes:
  location:
    range: Location  # References Location class
    inlined: true    # Embed the object, don't use reference

  locations:
    range: Location
    multivalued: true
    inlined_as_list: true  # For arrays of objects
```

---

### Pattern 5: Optional vs Required Fields

Most fields should be optional to handle incomplete data:

```yaml
attributes:
  field_name:
    range: string
    required: false  # Default, but explicit is clearer
```

Required fields cause validation failures when missing:

```yaml
attributes:
  name:
    range: string
    required: true  # Data MUST have this field
```

---

## Common Validation Errors and Solutions

### Error: Additional properties not allowed

**Cause**: Data has field not defined in schema

**Fix**: Add field to class's `attributes` section

### Error: None is not of type 'string'

**Cause**: Field defined as string but data has null

**Fix**: Make field optional or use `any_of` with null

### Error: 123 is not of type 'string'

**Cause**: Data has integer, schema expects string

**Fix**: Use union type with `range: Any` and `any_of`

### Error: [...] is not of type 'object'

**Cause**: Data has array, schema expects single object

**Fix**: Add `multivalued: true` or use `range: Any` for flexibility

---

## Validation Session Tracking

When performing validation sessions, track fixes systematically:

| Fix # | File Modified | Change Made | Error Fixed |
|-------|---------------|-------------|-------------|
| 1 | ClassName.yaml | Added field X | "X was unexpected" |
| 2 | ClassName.yaml | Added range: Any | Type mismatch |

This table format allows continuation across sessions.

---

## References

- [LinkML Documentation](https://linkml.io/linkml/)
- [Union Types](https://linkml.io/linkml/schemas/advanced.html#union-types)
- AGENTS.md Rules 48-59
- `.opencode/rules/` for detailed rule documentation
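
---

## Example: Triaging Validation Errors

The error-to-fix mapping in "Common Validation Errors and Solutions" lends itself to a small triage helper. The sketch below is illustrative only: the names `ERROR_FIXES`, `suggest_fix`, and `tally_fixes` are hypothetical, and the regular expressions assume validator messages shaped like the ones quoted in this document. Real `linkml-validate` output may wrap these messages in extra text, which is why the patterns use a substring search rather than a full match.

```python
import re
from collections import Counter

# Message patterns and suggested fixes, copied from the
# "Common Validation Errors and Solutions" section.
# Assumption: messages look like the examples quoted in this document.
ERROR_FIXES = [
    (re.compile(r"Additional properties are not allowed"),
     "Add field to class's `attributes` section"),
    (re.compile(r"None is not of type 'string'"),
     "Make field optional or use `any_of` with null"),
    (re.compile(r"\d+ is not of type 'string'"),
     "Use union type with `range: Any` and `any_of`"),
    (re.compile(r"\[.*\] is not of type 'object'"),
     "Add `multivalued: true` or use `range: Any` for flexibility"),
]

UNKNOWN_FIX = "No documented fix; inspect manually"


def suggest_fix(message: str) -> str:
    """Return the documented fix for the first pattern found in the message."""
    for pattern, fix in ERROR_FIXES:
        if pattern.search(message):
            return fix
    return UNKNOWN_FIX


def tally_fixes(messages) -> Counter:
    """Count how many messages map to each suggested fix."""
    return Counter(suggest_fix(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "Additional properties are not allowed ('new_field' was unexpected)",
        "123 is not of type 'string'",
        "None is not of type 'string'",
    ]
    for fix, count in tally_fixes(sample).items():
        print(f"{count} x {fix}")
```

Feeding the captured output of a batch run into `tally_fixes` gives a rough count of which schema change would clear the most failures, which can help decide the order of rows in the session tracking table.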