glam/docs/plan/linkml_schema/00-validation-patterns.md
2026-01-19 00:09:28 +01:00

# LinkML Schema Validation Patterns
This document captures critical patterns discovered during the GLAM LinkML schema validation effort against 33,444+ custodian YAML data files.
## Overview
The GLAM project uses LinkML schemas to define the structure of heritage custodian data. Validation ensures data files conform to the schema before ingestion into databases.

**Key Files**:
- Main schema entry: `schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml`
- Classes: `schemas/20251121/linkml/modules/classes/`
- Slots: `schemas/20251121/linkml/modules/slots/`
- Enums: `schemas/20251121/linkml/modules/enums/`
- Data: `data/custodian/*.yaml` (33,444+ files)
## Validation Command
```bash
# Single file
linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml data/custodian/FILE.yaml
# Batch test
for f in data/custodian/A*.yaml; do
  linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml "$f" 2>&1 | grep -q ERROR && echo "FAIL: $f"
done
```
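The batch loop above can also be driven from Python when the list of failing files is needed for further processing. The following is a minimal sketch, not part of the GLAM tooling: `run_linkml_validate` and `find_failures` are hypothetical helper names, and the only assumption about the CLI is the `linkml-validate -s SCHEMA FILE` invocation already shown above. As in the shell loop, a file counts as failed when the combined output contains `ERROR`. The `runner` parameter exists so the logic can be exercised without LinkML installed.

```python
import subprocess
from typing import Callable, Iterable

SCHEMA = "schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml"


def run_linkml_validate(schema: str, data_file: str) -> str:
    """Run the same command as the shell loop; return stdout and stderr combined."""
    proc = subprocess.run(
        ["linkml-validate", "-s", schema, data_file],
        capture_output=True,
        text=True,
        check=False,  # a failed validation is an expected outcome, not an exception
    )
    return proc.stdout + proc.stderr


def find_failures(
    files: Iterable[str],
    schema: str = SCHEMA,
    runner: Callable[[str, str], str] = run_linkml_validate,
) -> list[str]:
    """Return the files whose validator output mentions ERROR."""
    return [path for path in files if "ERROR" in runner(schema, path)]
```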
---
## Critical Patterns
### Pattern 1: Union Types Require `range: Any`
**Priority**: 🚨 CRITICAL
When using `any_of` to define union types (e.g., string OR integer), you MUST specify `range: Any`.
```yaml
# CORRECT
attributes:
  identifier_value:
    range: Any  # ← REQUIRED
    any_of:
      - range: string
      - range: integer

# WRONG (validation silently fails)
attributes:
  identifier_value:
    any_of:
      - range: string
      - range: integer
```
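Because the wrong form fails silently, it can help to scan class definitions for it mechanically. The sketch below is an illustration only: `find_unsafe_unions` is a hypothetical helper, and it assumes the class file has already been parsed into a plain dictionary (for example by a YAML loader) with the `attributes` layout shown above.

```python
from typing import Any


def find_unsafe_unions(class_def: dict[str, Any]) -> list[str]:
    """Return attribute names that declare any_of without range: Any."""
    unsafe = []
    for name, spec in (class_def.get("attributes") or {}).items():
        spec = spec or {}  # an attribute with no body parses as None
        if "any_of" in spec and spec.get("range") != "Any":
            unsafe.append(name)
    return unsafe


# Mirrors the CORRECT and WRONG examples above.
correct = {"attributes": {"identifier_value": {
    "range": "Any", "any_of": [{"range": "string"}, {"range": "integer"}]}}}
wrong = {"attributes": {"identifier_value": {
    "any_of": [{"range": "string"}, {"range": "integer"}]}}}
```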
**Affected Fields**:
- `identifier_value` (string | integer)
- `geonames_id` (string | integer)
- `youtube_channel_id` (string | array)
- `facebook_id` (string | array)
- `identifiers` (dict | array)
**See**: Rule 59 in AGENTS.md, `.opencode/rules/linkml-union-type-range-any-rule.md`

---
### Pattern 2: Flexible Data Structures (dict OR array)
Some fields accept either a dict or an array, depending on the data source.
```yaml
# Dict format (common in manual entry)
identifiers:
  isil: "NL-123"
  wikidata: "Q12345"

# Array format (common in API responses)
identifiers:
  - scheme: isil
    value: "NL-123"
  - scheme: wikidata
    value: "Q12345"
```
**Schema Solution**:
```yaml
attributes:
  identifiers:
    range: Any  # Accept both formats
    description: >-
      Identifiers from source. Accepts dict format {scheme: value}
      or array format [{scheme: X, value: Y}].
```
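With `range: Any` the schema no longer constrains the shape, so downstream code has to cope with both formats. One possible approach, sketched here under the assumption that consumers prefer the array form, is to normalize on read. `normalize_identifiers` is a hypothetical helper; the `scheme` and `value` keys are taken from the array example above.

```python
from typing import Any


def normalize_identifiers(raw: Any) -> list[dict[str, Any]]:
    """Convert either accepted identifiers format into the array format."""
    if raw is None:
        return []
    if isinstance(raw, dict):
        # Dict format: {scheme: value}
        return [{"scheme": scheme, "value": value} for scheme, value in raw.items()]
    if isinstance(raw, list):
        # Array format: already [{scheme: X, value: Y}]; copy each entry
        return [dict(item) for item in raw]
    raise TypeError(f"unsupported identifiers format: {type(raw).__name__}")


as_dict = {"isil": "NL-123", "wikidata": "Q12345"}
as_list = [
    {"scheme": "isil", "value": "NL-123"},
    {"scheme": "wikidata", "value": "Q12345"},
]
```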
---
### Pattern 3: Adding Missing Fields to Classes
When the data contains fields that are not in the schema, add them to the appropriate class:
```yaml
# In the class YAML file (e.g., CustodianSourceFile.yaml)
attributes:
  new_field_name:
    range: string  # or appropriate type
    description: Description of the field
    # Optional properties:
    required: false
    multivalued: false
    inlined: true  # For complex types
```
**Validation Error Example**:
```
Additional properties are not allowed ('new_field' was unexpected)
```
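When many files fail this way, the missing field names can be pulled out of the messages and counted before editing the schema. The sketch below is illustrative: `unexpected_fields` is a hypothetical helper, the singular message format is the one shown above, and the plural `were unexpected` variant is an assumption about how the underlying JSON Schema validator words multi-field errors.

```python
import re

# Captures the quoted field list inside "(... was unexpected)" or "(... were unexpected)".
_UNEXPECTED = re.compile(r"\((?P<names>.+?) (?:was|were) unexpected\)")


def unexpected_fields(message: str) -> list[str]:
    """Extract field names from an 'Additional properties' error message."""
    match = _UNEXPECTED.search(message)
    if match is None:
        return []
    return re.findall(r"'([^']+)'", match.group("names"))


example = "Additional properties are not allowed ('new_field' was unexpected)"
```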
**Fix**: Add the field to the class's `attributes` section.

---
### Pattern 4: Nested Class Inlining
When a field references another class, use `inlined: true`:
```yaml
attributes:
  location:
    range: Location  # References Location class
    inlined: true  # Embed the object, don't use reference
  locations:
    range: Location
    multivalued: true
    inlined_as_list: true  # For arrays of objects
```
---
### Pattern 5: Optional vs Required Fields
Most fields should be optional to handle incomplete data:
```yaml
attributes:
  field_name:
    range: string
    required: false  # Default, but explicit is clearer
```
Required fields cause validation failures when missing:
```yaml
attributes:
  name:
    range: string
    required: true  # Data MUST have this field
```
---
## Common Validation Errors and Solutions
### Error: Additional properties not allowed

**Cause**: The data contains a field that is not defined in the schema.

**Fix**: Add the field to the class's `attributes` section.

### Error: None is not of type 'string'

**Cause**: The field is defined as a string, but the data contains null.

**Fix**: Make the field optional or use `any_of` with null.

### Error: 123 is not of type 'string'

**Cause**: The data contains an integer, but the schema expects a string.

**Fix**: Use a union type with `range: Any` and `any_of`.

### Error: [...] is not of type 'object'

**Cause**: The data contains an array, but the schema expects a single object.

**Fix**: Add `multivalued: true` or use `range: Any` for flexibility.

---
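The error-to-fix pairs listed under Common Validation Errors and Solutions can be encoded as a small lookup for triaging validator output in bulk. This is a sketch only: `FIXES` and `suggest_fix` are hypothetical names, and the patterns are derived solely from the four example messages in that section. Order matters, because the null case would otherwise be caught by the more general string-type pattern.

```python
import re
from typing import Optional

# (pattern, suggested fix), checked in order; more specific patterns come first.
FIXES = [
    (re.compile(r"Additional properties are not allowed"),
     "Add the field to the class's attributes section"),
    (re.compile(r"None is not of type"),
     "Make the field optional or use any_of with null"),
    (re.compile(r"\[.*\] is not of type 'object'"),
     "Add multivalued: true or use range: Any"),
    (re.compile(r"is not of type 'string'"),
     "Use a union type with range: Any and any_of"),
]


def suggest_fix(message: str) -> Optional[str]:
    """Return the documented fix for a validator message, or None if unknown."""
    for pattern, fix in FIXES:
        if pattern.search(message):
            return fix
    return None
```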
## Validation Session Tracking
When performing validation sessions, track fixes systematically:

| Fix # | File Modified | Change Made | Error Fixed |
|-------|---------------|-------------|-------------|
| 1 | ClassName.yaml | Added field X | "X was unexpected" |
| 2 | ClassName.yaml | Added range: Any | Type mismatch |

This table format allows continuation across sessions.

---
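The fix-tracking table from the Validation Session Tracking section can be generated rather than edited by hand, which keeps the numbering consistent when a later session appends rows. A minimal sketch, with `Fix` and `render_table` as hypothetical names and the column headers copied from that table:

```python
from dataclasses import dataclass


@dataclass
class Fix:
    number: int
    file_modified: str
    change_made: str
    error_fixed: str


HEADER = (
    "| Fix # | File Modified | Change Made | Error Fixed |\n"
    "|-------|---------------|-------------|-------------|"
)


def render_table(fixes: list[Fix]) -> str:
    """Render the tracked fixes as a Markdown table in the session-log format."""
    rows = [
        f"| {f.number} | {f.file_modified} | {f.change_made} | {f.error_fixed} |"
        for f in fixes
    ]
    return "\n".join([HEADER, *rows])


session = [
    Fix(1, "ClassName.yaml", "Added field X", '"X was unexpected"'),
    Fix(2, "ClassName.yaml", "Added range: Any", "Type mismatch"),
]
```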
## References
- [LinkML Documentation](https://linkml.io/linkml/)
- [Union Types](https://linkml.io/linkml/schemas/advanced.html#union-types)
- AGENTS.md Rules 48-59
- `.opencode/rules/` for detailed rule documentation