# LinkML Schema Validation Patterns

This document captures critical patterns discovered during the GLAM LinkML schema validation effort against 33,444+ custodian YAML data files.

## Overview

The GLAM project uses LinkML schemas to define the structure of heritage custodian data. Validation ensures data files conform to the schema before ingestion into databases.

**Key Files**:
- Main schema entry: `schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml`
- Classes: `schemas/20251121/linkml/modules/classes/`
- Slots: `schemas/20251121/linkml/modules/slots/`
- Enums: `schemas/20251121/linkml/modules/enums/`
- Data: `data/custodian/*.yaml` (33,444+ files)

## Validation Command

```bash
# Single file
linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml data/custodian/FILE.yaml

# Batch test
for f in data/custodian/A*.yaml; do
  linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml "$f" 2>&1 | grep -q ERROR && echo "FAIL: $f"
done
```
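
For larger batches it can help to keep the error lines instead of only printing `FAIL`. The following is a minimal Python sketch of the same loop, assuming (as the `grep` above does) that failing runs print lines containing `ERROR`. The helper names `validate_file` and `find_failures` and the injectable `run` parameter are hypothetical, not part of LinkML. A typical call would be `find_failures(sorted(Path("data/custodian").glob("A*.yaml")))`.

```python
import subprocess
from pathlib import Path

SCHEMA = "schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml"


def validate_file(path, schema=SCHEMA, run=subprocess.run):
    """Run linkml-validate on one file and return its combined output."""
    result = run(
        ["linkml-validate", "-s", schema, str(path)],
        capture_output=True,
        text=True,
    )
    return (result.stdout or "") + (result.stderr or "")


def find_failures(paths, schema=SCHEMA, run=subprocess.run):
    """Map each failing path to the output lines that contain ERROR."""
    failures = {}
    for path in paths:
        output = validate_file(path, schema, run)
        # Assumption: same criterion as `grep -q ERROR` in the shell loop
        errors = [line for line in output.splitlines() if "ERROR" in line]
        if errors:
            failures[str(path)] = errors
    return failures
```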

---

## Critical Patterns

### Pattern 1: Union Types Require `range: Any`

**Priority**: 🚨 CRITICAL

When you use `any_of` to define a union type (e.g., string OR integer), you MUST also specify `range: Any` on the same attribute.

```yaml
# CORRECT
attributes:
  identifier_value:
    range: Any  # ← REQUIRED
    any_of:
      - range: string
      - range: integer

# WRONG (validation silently fails)
attributes:
  identifier_value:
    any_of:
      - range: string
      - range: integer
```
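
One way to find fields that need a union type is to scan already-parsed records for fields whose values appear with more than one type. The sketch below is illustrative only: `type_name` and `mixed_type_fields` are hypothetical helpers, the sample records are made up, and only top-level fields are inspected.

```python
from collections import defaultdict


def type_name(value):
    """Name a parsed YAML value using the vocabulary of this document."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "dict"
    if value is None:
        return "null"
    return type(value).__name__


def mixed_type_fields(records):
    """Return {field: sorted type names} for fields seen with several types."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type_name(value))
    return {f: sorted(t) for f, t in seen.items() if len(t) > 1}


# Made-up sample records, for illustration only
records = [
    {"geonames_id": 2759794, "name": "Example Archive"},
    {"geonames_id": "2759794", "name": "Example Museum"},
]
print(mixed_type_fields(records))  # {'geonames_id': ['integer', 'string']}
```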

**Affected Fields**:
- `identifier_value` (string | integer)
- `geonames_id` (string | integer)
- `youtube_channel_id` (string | array)
- `facebook_id` (string | array)
- `identifiers` (dict | array)

**See**: Rule 59 in AGENTS.md, `.opencode/rules/linkml-union-type-range-any-rule.md`

---

### Pattern 2: Flexible Data Structures (dict OR array)

Some fields accept either dict or array format, depending on the data source.

```yaml
# Dict format (common in manual entry)
identifiers:
  isil: "NL-123"
  wikidata: "Q12345"

# Array format (common in API responses)
identifiers:
  - scheme: isil
    value: "NL-123"
  - scheme: wikidata
    value: "Q12345"
```

**Schema Solution**:
```yaml
attributes:
  identifiers:
    range: Any  # Accept both formats
    description: >-
      Identifiers from source. Accepts dict format {scheme: value}
      or array format [{scheme: X, value: Y}].
```
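
Because `range: Any` lets both shapes through validation, code that consumes the data still has to handle both. A minimal sketch of one way to do that is shown here; `normalize_identifiers` is a hypothetical helper, not something defined by the GLAM project or LinkML, and it assumes the two formats shown above are the only ones in use.

```python
def normalize_identifiers(identifiers):
    """Return identifiers as a list of {"scheme": ..., "value": ...} dicts."""
    if identifiers is None:
        return []
    if isinstance(identifiers, dict):
        # Dict format: {scheme: value}
        return [{"scheme": s, "value": v} for s, v in identifiers.items()]
    if isinstance(identifiers, list):
        # Array format: already [{scheme: X, value: Y}]; copy each entry
        return [dict(item) for item in identifiers]
    raise TypeError(f"unsupported identifiers type: {type(identifiers).__name__}")


as_dict = {"isil": "NL-123", "wikidata": "Q12345"}
as_list = [
    {"scheme": "isil", "value": "NL-123"},
    {"scheme": "wikidata", "value": "Q12345"},
]
# Both input shapes end up in the same array form
assert normalize_identifiers(as_dict) == normalize_identifiers(as_list)
```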

---

### Pattern 3: Adding Missing Fields to Classes

When the data contains fields that are not in the schema, add them to the appropriate class:

```yaml
# In the class YAML file (e.g., CustodianSourceFile.yaml)
attributes:
  new_field_name:
    range: string  # or appropriate type
    description: Description of the field
    # Optional properties:
    required: false
    multivalued: false
    inlined: true  # For complex types
```

**Validation Error Example**:
```
Additional properties are not allowed ('new_field' was unexpected)
```
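
When many files fail with this message, tallying the unexpected field names shows which attributes to add first. The sketch below parses messages of the form shown above. `unexpected_fields` and `count_unexpected` are hypothetical helpers, and the plural wording (`'a', 'b' were unexpected`) is an assumption about how the validator reports several fields at once.

```python
import re
from collections import Counter

# Matches the parenthesised tail of the message, e.g.
# ('a' was unexpected) or, assumed, ('a', 'b' were unexpected)
UNEXPECTED = re.compile(r"\((.+?) (?:was|were) unexpected\)")


def unexpected_fields(message):
    """Return the field names quoted in an 'Additional properties' message."""
    match = UNEXPECTED.search(message)
    if not match:
        return []
    return re.findall(r"'([^']+)'", match.group(1))


def count_unexpected(messages):
    """Count how often each unexpected field appears across many messages."""
    counts = Counter()
    for message in messages:
        counts.update(unexpected_fields(message))
    return counts


msg = "Additional properties are not allowed ('new_field' was unexpected)"
print(unexpected_fields(msg))  # ['new_field']
```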

**Fix**: Add the field to the class's `attributes` section.

---

### Pattern 4: Nested Class Inlining

When a field references another class, use `inlined: true` to embed the object in place, and `inlined_as_list: true` for multivalued fields:

```yaml
attributes:
  location:
    range: Location  # References Location class
    inlined: true  # Embed the object, don't use reference

  locations:
    range: Location
    multivalued: true
    inlined_as_list: true  # For arrays of objects
```

---

### Pattern 5: Optional vs Required Fields

Most fields should be optional to handle incomplete data:

```yaml
attributes:
  field_name:
    range: string
    required: false  # Default, but explicit is clearer
```

Required fields cause validation failures when missing:
```yaml
attributes:
  name:
    range: string
    required: true  # Data MUST have this field
```
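
Before marking a field `required: true`, it may be worth checking how often it is actually populated. The following sketch computes that over parsed records; `field_coverage` and `safe_to_require` are hypothetical helpers, the sample records are made up, and treating `null` as "missing" is an assumption rather than documented LinkML behaviour.

```python
from collections import Counter


def field_coverage(records):
    """Return {field: share of records where the field is present and not None}."""
    total = len(records)
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value is not None:  # assumption: null counts as missing
                counts[field] += 1
    return {field: n / total for field, n in counts.items()} if total else {}


def safe_to_require(records):
    """List fields populated in every record (candidates for required: true)."""
    return sorted(f for f, share in field_coverage(records).items() if share == 1.0)


# Made-up sample records, for illustration only
records = [
    {"name": "Example Archive", "wikidata": "Q12345"},
    {"name": "Example Museum", "wikidata": None},
]
print(safe_to_require(records))  # ['name']
```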

---

## Common Validation Errors and Solutions

### Error: Additional properties not allowed

**Cause**: The data has a field that is not defined in the schema.
**Fix**: Add the field to the class's `attributes` section.

### Error: None is not of type 'string'

**Cause**: The field is defined as a string, but the data has null.
**Fix**: Make the field optional or use `any_of` with null.

### Error: 123 is not of type 'string'

**Cause**: The data has an integer, but the schema expects a string.
**Fix**: Use a union type with `range: Any` and `any_of`.

### Error: [...] is not of type 'object'

**Cause**: The data has an array, but the schema expects a single object.
**Fix**: Add `multivalued: true` or use `range: Any` for flexibility.
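
The error list above can be turned into a small lookup for triaging validator output. This is a sketch under the assumption that messages are worded as in the headings and examples of this document; `ERROR_FIXES` and `suggest_fix` are hypothetical names. More specific patterns are listed first so that the generic type-mismatch rule only catches what is left.

```python
import re

# (pattern, suggested fix) pairs taken from the error list above;
# order matters: specific patterns come before the generic one
ERROR_FIXES = [
    (re.compile(r"Additional properties are not allowed"),
     "Add the field to the class's `attributes` section"),
    (re.compile(r"None is not of type '\w+'"),
     "Make the field optional or use `any_of` with null"),
    (re.compile(r"\[.*\] is not of type 'object'", re.DOTALL),
     "Add `multivalued: true` or use `range: Any`"),
    (re.compile(r"is not of type '\w+'"),
     "Use a union type with `range: Any` and `any_of`"),
]


def suggest_fix(message):
    """Return the first matching fix for a validator message, or None."""
    for pattern, fix in ERROR_FIXES:
        if pattern.search(message):
            return fix
    return None


for msg in ["123 is not of type 'string'", "None is not of type 'string'"]:
    print(f"{msg} -> {suggest_fix(msg)}")
```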

---

## Validation Session Tracking

When performing validation sessions, track fixes systematically:

| Fix # | File Modified | Change Made | Error Fixed |
|-------|---------------|-------------|-------------|
| 1 | ClassName.yaml | Added field X | "X was unexpected" |
| 2 | ClassName.yaml | Added range: Any | Type mismatch |

This table format allows continuation across sessions.
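
If the log is kept as a Markdown table, appending rows can be scripted so that numbering stays consistent between sessions. The sketch below is one possible approach; `append_fix` is a hypothetical helper, and it assumes the table always starts with the header and divider rows shown above.

```python
HEADER = "| Fix # | File Modified | Change Made | Error Fixed |"
DIVIDER = "|-------|---------------|-------------|-------------|"


def append_fix(table_lines, file_modified, change_made, error_fixed):
    """Return the table with one more row, numbered after the last fix."""
    lines = list(table_lines) or [HEADER, DIVIDER]
    next_number = len(lines) - 2 + 1  # existing rows, excluding header and divider
    lines.append(f"| {next_number} | {file_modified} | {change_made} | {error_fixed} |")
    return lines


table = append_fix([], "ClassName.yaml", "Added field X", '"X was unexpected"')
table = append_fix(table, "ClassName.yaml", "Added range: Any", "Type mismatch")
print("\n".join(table))
```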

---

## References

- [LinkML Documentation](https://linkml.io/linkml/)
- [Union Types](https://linkml.io/linkml/schemas/advanced.html#union-types)
- AGENTS.md Rules 48-59
- `.opencode/rules/` for detailed rule documentation