glam/docs/plan/linkml_schema/00-validation-patterns.md
2026-01-19 00:09:28 +01:00

LinkML Schema Validation Patterns

This document captures critical patterns discovered during the GLAM LinkML schema validation effort against 33,444+ custodian YAML data files.

Overview

The GLAM project uses LinkML schemas to define the structure of heritage custodian data. Validation ensures data files conform to the schema before ingestion into databases.

Key Files:

  • Main schema entry: schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml
  • Classes: schemas/20251121/linkml/modules/classes/
  • Slots: schemas/20251121/linkml/modules/slots/
  • Enums: schemas/20251121/linkml/modules/enums/
  • Data: data/custodian/*.yaml (33,444+ files)

Validation Command

# Single file
linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml data/custodian/FILE.yaml

# Batch test
for f in data/custodian/A*.yaml; do
  linkml-validate -s schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml "$f" 2>&1 | grep -q ERROR && echo "FAIL: $f"
done
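
The batch loop above can also be driven from Python when the list of failing files needs to be collected rather than printed. The following is a minimal sketch, not part of the GLAM tooling: the names `run_linkml_validate` and `failing_files` are hypothetical, and the check for the substring `ERROR` simply mirrors the `grep` in the shell loop.

```python
import subprocess
from typing import Callable, Iterable

SCHEMA = "schemas/20251121/linkml/modules/classes/CustodianSourceFile.yaml"


def run_linkml_validate(path: str) -> str:
    """Run linkml-validate on one file and return stdout and stderr combined."""
    proc = subprocess.run(
        ["linkml-validate", "-s", SCHEMA, path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is expected for invalid files
    )
    return proc.stdout + proc.stderr


def failing_files(
    paths: Iterable[str],
    run: Callable[[str], str] = run_linkml_validate,
) -> list[str]:
    """Return the paths whose validator output contains ERROR."""
    return [p for p in paths if "ERROR" in run(p)]
```

A call such as `failing_files(sorted(str(p) for p in pathlib.Path("data/custodian").glob("A*.yaml")))` would reproduce the batch test; the `run` parameter exists only so the logic can be exercised without invoking the CLI.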

Critical Patterns

Pattern 1: Union Types Require range: Any

Priority: 🚨 CRITICAL

When using any_of to define union types (e.g., string OR integer), you MUST specify range: Any.

# CORRECT
attributes:
  identifier_value:
    range: Any  # ← REQUIRED
    any_of:
      - range: string
      - range: integer

# WRONG (validation silently fails)
attributes:
  identifier_value:
    any_of:
      - range: string
      - range: integer

Affected Fields:

  • identifier_value (string | integer)
  • geonames_id (string | integer)
  • youtube_channel_id (string | array)
  • facebook_id (string | array)
  • identifiers (dict | array)

See: Rule 59 in AGENTS.md, .opencode/rules/linkml-union-type-range-any-rule.md
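
Because the missing `range: Any` is easy to overlook, it can help to scan class definitions for it. The sketch below assumes the `attributes` mapping of a class file has already been loaded into a plain dict (for example with a YAML parser); the helper name `slots_missing_range_any` is hypothetical.

```python
from typing import Any


def slots_missing_range_any(attributes: dict[str, dict[str, Any]]) -> list[str]:
    """Return names of attributes that use any_of without range: Any."""
    missing = []
    for name, slot in attributes.items():
        # An attribute may be declared with an empty body, which loads as None.
        if slot and "any_of" in slot and slot.get("range") != "Any":
            missing.append(name)
    return missing


correct = {
    "identifier_value": {
        "range": "Any",
        "any_of": [{"range": "string"}, {"range": "integer"}],
    }
}
wrong = {
    "identifier_value": {
        "any_of": [{"range": "string"}, {"range": "integer"}],
    }
}
```

Running the helper on `wrong` reports `identifier_value`, while `correct` yields an empty list.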


Pattern 2: Flexible Data Structures (dict OR array)

Some fields accept either dict or array format, depending on the data source.

# Dict format (common in manual entry)
identifiers:
  isil: "NL-123"
  wikidata: "Q12345"

# Array format (common in API responses)
identifiers:
  - scheme: isil
    value: "NL-123"
  - scheme: wikidata
    value: "Q12345"

Schema Solution:

attributes:
  identifiers:
    range: Any  # Accept both formats
    description: >-
      Identifiers from source. Accepts dict format {scheme: value}
      or array format [{scheme: X, value: Y}].      
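
Since `range: Any` leaves the shape unconstrained, code that consumes validated files has to cope with both formats. One possible approach, sketched here under the assumption that array items always carry `scheme` and `value` keys as in the example above, is to normalize everything to the array format. The function name `normalize_identifiers` is hypothetical.

```python
from typing import Any


def normalize_identifiers(raw: Any) -> list[dict[str, Any]]:
    """Convert dict-format or array-format identifiers to a list of
    {"scheme": ..., "value": ...} dicts."""
    if raw is None:
        return []
    if isinstance(raw, dict):
        return [{"scheme": scheme, "value": value} for scheme, value in raw.items()]
    if isinstance(raw, list):
        return [{"scheme": item["scheme"], "value": item["value"]} for item in raw]
    raise TypeError(f"Unsupported identifiers format: {type(raw).__name__}")


dict_form = {"isil": "NL-123", "wikidata": "Q12345"}
array_form = [
    {"scheme": "isil", "value": "NL-123"},
    {"scheme": "wikidata", "value": "Q12345"},
]
```

Both `dict_form` and `array_form` normalize to the same list, so downstream code only needs to handle one shape.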

Pattern 3: Adding Missing Fields to Classes

When data contains fields that are not in the schema, add them to the appropriate class:

# In the class YAML file (e.g., CustodianSourceFile.yaml)
attributes:
  new_field_name:
    range: string  # or appropriate type
    description: Description of the field
    # Optional properties:
    required: false
    multivalued: false
    inlined: true  # Only relevant when range is a class (see Pattern 4)

Validation Error Example:

Additional properties are not allowed ('new_field' was unexpected)

Fix: Add the field to the class's attributes section.
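
When many files fail with this error, extracting the unexpected field names from the messages gives a quick list of attributes to add. The sketch below assumes the message wording shown above, including the plural form `('a', 'b' were unexpected)`; the helper name `unexpected_fields` is hypothetical.

```python
import re

# Captures the quoted name list inside the parentheses of the message.
_UNEXPECTED = re.compile(
    r"Additional properties are not allowed \((.+?) (?:was|were) unexpected\)"
)


def unexpected_fields(message: str) -> list[str]:
    """Return the field names reported as unexpected in a validator message."""
    match = _UNEXPECTED.search(message)
    if match is None:
        return []
    return re.findall(r"'([^']+)'", match.group(1))
```

For the example message above, the helper returns a one-element list containing `new_field`.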


Pattern 4: Nested Class Inlining

When a field references another class, use inlined: true:

attributes:
  location:
    range: Location  # References Location class
    inlined: true    # Embed the object, don't use reference
    
  locations:
    range: Location
    multivalued: true
    inlined_as_list: true  # For arrays of objects

Pattern 5: Optional vs Required Fields

Most fields should be optional to handle incomplete data:

attributes:
  field_name:
    range: string
    required: false  # Default, but explicit is clearer

Required fields cause validation failures when missing:

attributes:
  name:
    range: string
    required: true  # Data MUST have this field
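
Before marking a field `required: true`, it may be worth checking how complete the data actually is. The following sketch assumes the data files have already been loaded into a list of dicts; the helper name `fields_safe_to_require` is hypothetical, and it treats a null value as missing.

```python
from typing import Any, Iterable


def fields_safe_to_require(records: Iterable[dict[str, Any]]) -> set[str]:
    """Return field names that are present and non-null in every record."""
    common = None
    for record in records:
        present = {key for key, value in record.items() if value is not None}
        common = present if common is None else common & present
    return common or set()


records = [
    {"name": "Archive A", "isil": "NL-123"},
    {"name": "Museum B", "isil": None},
    {"name": "Library C"},
]
```

Here only `name` appears with a value in every record, so it is the only candidate for `required: true`.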

Common Validation Errors and Solutions

Error: Additional properties not allowed

Cause: Data has field not defined in schema
Fix: Add field to class's attributes section

Error: None is not of type 'string'

Cause: Field defined as string but data has null
Fix: Make field optional or use any_of with null

Error: 123 is not of type 'string'

Cause: Data has integer, schema expects string
Fix: Use union type with range: Any and any_of

Error: [...] is not of type 'object'

Cause: Data has array, schema expects single object
Fix: Add multivalued: true or use range: Any for flexibility
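
The four cases above can be turned into a small lookup for triaging validator output. This is only a sketch: the function name `suggest_fix` is hypothetical, and the substring checks assume the message wording listed in this section.

```python
def suggest_fix(message: str) -> str:
    """Map a validator error message to the fix suggested in this document."""
    if "Additional properties are not allowed" in message:
        return "Add field to class's attributes section"
    # Check the null case first: it also contains "is not of type 'string'".
    if "None is not of type" in message:
        return "Make field optional or use any_of with null"
    if "is not of type 'string'" in message:
        return "Use union type with range: Any and any_of"
    if "is not of type 'object'" in message:
        return "Add multivalued: true or use range: Any for flexibility"
    return "Unrecognized error; inspect manually"
```

The order of the checks matters, because the null error would otherwise be misread as a plain type mismatch.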


Validation Session Tracking

When performing validation sessions, track fixes systematically:

| Fix # | File Modified  | Change Made      | Error Fixed        |
|-------|----------------|------------------|--------------------|
| 1     | ClassName.yaml | Added field X    | "X was unexpected" |
| 2     | ClassName.yaml | Added range: Any | Type mismatch      |

This table format allows continuation across sessions.
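
If the log is kept programmatically, a small record type is enough to regenerate the table at the end of each session. The sketch below is illustrative only; the `Fix` class and `render_fix_log` function are hypothetical names, and the columns follow the table above.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    file_modified: str
    change_made: str
    error_fixed: str


def render_fix_log(fixes: list[Fix]) -> str:
    """Render fixes as pipe-separated rows, numbered from 1."""
    header = "Fix # | File Modified | Change Made | Error Fixed"
    rows = [
        f"{number} | {fix.file_modified} | {fix.change_made} | {fix.error_fixed}"
        for number, fix in enumerate(fixes, start=1)
    ]
    return "\n".join([header, *rows])


log = [
    Fix("ClassName.yaml", "Added field X", '"X was unexpected"'),
    Fix("ClassName.yaml", "Added range: Any", "Type mismatch"),
]
```

Appending new `Fix` entries to `log` in a later session continues the numbering automatically.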


References