glam/.opencode/rules/types-classes-as-template-variables.md

Rule: LinkML "Types" Classes Define SPARQL Template Variables

Created: 2025-01-08
Status: Active
Applies to: SPARQL template design, RAG pipeline slot extraction

Core Principle

LinkML classes following the *Type / *Types naming pattern (Rule 0b) serve as the single source of truth for valid values in SPARQL template slot variables.

When designing SPARQL templates, extract variables from the schema rather than hardcoding values. This enables:

  • Flexibility: Same template works across all institution types
  • Extensibility: Adding new types to schema automatically extends templates
  • Consistency: Variable values always align with ontology
  • Multilingual support: Type labels in multiple languages available from schema

Template Variable Sources

1. Institution Type Variable (institution_type)

Schema Source: CustodianType abstract class and its 19 subclasses

Subclass                     Code  Description
ArchiveOrganizationType      A     Archives
BioCustodianType             B     Botanical gardens, zoos
CommercialOrganizationType   C     Corporations
DigitalPlatformType          D     Digital platforms
EducationProviderType        E     Universities, schools
FeatureCustodianType         F     Geographic features
GalleryType                  G     Art galleries
HolySacredSiteType           H     Religious sites
IntangibleHeritageGroupType  I     Folklore organizations
LibraryType                  L     Libraries
MuseumType                   M     Museums
NonProfitType                N     NGOs
OfficialInstitutionType      O     Government agencies
PersonalCollectionType       P     Private collectors
ResearchOrganizationType     R     Research centers
HeritageSocietyType          S     Historical societies
TasteScentHeritageType       T     Culinary heritage
UnspecifiedType              U     Unknown
MixedCustodianType           X     Multiple types

Template Slot Definition:

slots:
  institution_type:
    type: institution_type
    required: true
    schema_source: "modules/classes/CustodianType.yaml"
    # Valid values derived from CustodianType subclasses

2. Geographic Scope Variable (location)

Geographic scope is a hierarchical variable with three levels:

Level       Schema Source       SPARQL Property    Example
Country     ISO 3166-1 alpha-2  hc:countryCode     NL, DE, BE
Subregion   ISO 3166-2          hc:subregionCode   NL-NH, DE-BY
Settlement  GeoNames            hc:settlementName  Amsterdam, Berlin

Template Slot Definition:

slots:
  location:
    type: location
    required: true
    schema_source:
      - "modules/enums/CountryCodeEnum.yaml"  # if it exists
      - "data/reference/geonames.db"
    resolution_order: [settlement, subregion, country]
    # SlotExtractor detects which level user specified
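
The resolution_order above can be read as a lookup cascade: try the most specific level first and fall back to broader ones. The following is a minimal sketch under stated assumptions, not the project's implementation. The tables SETTLEMENTS, SUBREGIONS and COUNTRIES are hypothetical in-memory stand-ins for geonames.db and the ISO 3166 reference files, and resolve_location is an illustrative name.

```python
from typing import NamedTuple, Optional

# Hypothetical stand-ins for data/reference/geonames.db and the ISO 3166 files.
SETTLEMENTS = {"amsterdam": "Amsterdam", "berlin": "Berlin"}
SUBREGIONS = {"noord-holland": "NL-NH", "bayern": "DE-BY"}
COUNTRIES = {"nederland": "NL", "deutschland": "DE"}


class ResolvedLocation(NamedTuple):
    level: str  # "settlement", "subregion" or "country"
    value: str  # value to bind in the SPARQL template


def resolve_location(text: str) -> Optional[ResolvedLocation]:
    """Try each level in resolution_order and return the first match."""
    key = text.strip().lower()
    lookups = [
        ("settlement", SETTLEMENTS),
        ("subregion", SUBREGIONS),
        ("country", COUNTRIES),
    ]
    for level, table in lookups:
        if key in table:
            return ResolvedLocation(level, table[key])
    return None  # caller decides how to handle an unknown place name
```

Returning the level together with the normalized value lets the caller choose the matching SPARQL variant without inspecting the value itself.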

3. Digital Platform Type Variable (platform_type)

Schema Source: DigitalPlatformType abstract class and 69+ subclasses in DigitalPlatformTypes.yaml

Categories include:

  • REPOSITORY: DigitalLibrary, DigitalArchivePlatform, OpenAccessRepository
  • AGGREGATOR: Europeana-type aggregators, BibliographicDatabasePlatform
  • DISCOVERY: WebPortal, OnlineDatabase, OpenDataPortal
  • VIRTUAL_HERITAGE: VirtualMuseum, VirtualLibrary, OnlineArtGallery
  • RESEARCH: DisciplinaryRepository, PrePrintServer, GenealogyDatabase
  • ...and many more

Template Slot Definition:

slots:
  platform_type:
    type: platform_type
    required: false
    schema_source: "modules/classes/DigitalPlatformTypes.yaml"

Template Design Pattern

Before (Hardcoded - WRONG)

# Separate templates for each institution type - DO NOT DO THIS
templates:
  count_museums_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "M" ;
           hc:subregionCode "{{ region }}" .
      }
            
  count_archives_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "A" ;
           hc:subregionCode "{{ region }}" .
      }      

After (Parameterized - CORRECT)

# Single template with institution_type as variable
templates:
  count_institutions_by_type_location:
    description: "Count heritage institutions by type and location"
    slots:
      institution_type:
        type: institution_type
        required: true
        schema_source: "modules/classes/CustodianType.yaml"
      location:
        type: location
        required: true
        resolution_order: [settlement, subregion, country]
        
    # Multiple SPARQL variants based on location resolution
    sparql_template: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:settlementName "{{ location }}" .
      }
            
    sparql_template_region: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:subregionCode "{{ location }}" .
      }
            
    sparql_template_country: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:countryCode "{{ location }}" .
      }      
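
To illustrate how one parameterized template replaces the per-type templates, the sketch below picks a variant by location level and fills the placeholders. It is an assumption-laden example: VARIANT_BY_LEVEL, escape_literal and render are hypothetical names, the SPARQL bodies are shortened, and a small regular expression stands in for whatever template engine the pipeline actually uses.

```python
import re

# Variant keys taken from the template above; the mapping itself is an assumption.
VARIANT_BY_LEVEL = {
    "settlement": "sparql_template",
    "subregion": "sparql_template_region",
    "country": "sparql_template_country",
}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render(template: dict, level: str, slots: dict) -> str:
    """Pick the variant for the location level and fill its placeholders."""
    sparql = template[VARIANT_BY_LEVEL[level]]
    return PLACEHOLDER.sub(lambda m: escape_literal(slots[m.group(1)]), sparql)


# Shortened SPARQL bodies, for illustration only.
template = {
    "sparql_template": '?i hc:institutionType "{{ institution_type }}" ; hc:settlementName "{{ location }}" .',
    "sparql_template_region": '?i hc:institutionType "{{ institution_type }}" ; hc:subregionCode "{{ location }}" .',
    "sparql_template_country": '?i hc:institutionType "{{ institution_type }}" ; hc:countryCode "{{ location }}" .',
}

query = render(template, "subregion", {"institution_type": "M", "location": "NL-NH"})
```

Escaping slot values before substitution matters because they originate from user queries; validating them against the schema first is still the stronger safeguard.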

SlotExtractor Responsibilities

The SlotExtractor module must:

  1. Detect institution type from user query:

    • "musea" → M (Dutch plural)
    • "archives" → A (English)
    • "bibliotheken" → L (Dutch)
    • Use synonyms from _slot_types.institution_type.synonyms
  2. Detect location level from user query:

    • "Amsterdam" → settlement level → use sparql_template
    • "Noord-Holland" → subregion level → use sparql_template_region
    • "Nederland" → country level → use sparql_template_country
  3. Normalize values to schema-compliant codes:

    • "Noord-Holland" → "NL-NH"
    • "museum" → "M"
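
The type-detection step can be sketched as a synonym lookup over query tokens. This is only an illustration: the SYNONYMS table is a hypothetical excerpt of _slot_types.institution_type.synonyms, whose real structure is not shown in this rule, and detect_institution_type is an assumed name.

```python
from typing import Optional

# Hypothetical excerpt of _slot_types.institution_type.synonyms (synonym -> code).
SYNONYMS = {
    "museum": "M", "musea": "M", "museums": "M",
    "archive": "A", "archives": "A", "archieven": "A",
    "library": "L", "libraries": "L", "bibliotheken": "L",
}


def detect_institution_type(query: str) -> Optional[str]:
    """Return the code of the first query token that matches a synonym."""
    for token in query.lower().split():
        # Strip surrounding punctuation so "musea?" still matches "musea".
        code = SYNONYMS.get(token.strip(".,;:!?\"'()"))
        if code is not None:
            return code
    return None
```

For example, detect_institution_type("Hoeveel musea zijn er in Amsterdam?") yields "M". A production extractor would likely need multi-word synonyms and fuzzy matching, which this sketch leaves out.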

Dynamic Label Resolution (NO HARDCODING)

CRITICAL: Labels MUST be resolved at runtime from schema/reference files, NOT hardcoded in templates or code.

Institution Type Labels

The CustodianType classes contain multilingual labels via type_label slot:

MuseumType:
  type_label:
    - "Museum"@en
    - "museum"@nl
    - "Museum"@de
    - "museo"@es

Label Resolution Chain:

  1. Load CustodianType.yaml and subclass files
  2. Parse type_label slot for each type code (M, L, A, etc.)
  3. Build runtime label dictionary keyed by code + language
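
Step 3 of the chain can be sketched as follows. The input shape (type code mapped to a list of language-tagged strings) is an assumption about what parsing the schema files yields, and build_label_index is a hypothetical helper name.

```python
import re

# Language-tagged labels as shown for MuseumType above; the input shape
# (code -> list of tagged strings) is an assumption about the parsed schema.
RAW_LABELS = {
    "M": ['"Museum"@en', '"museum"@nl', '"Museum"@de', '"museo"@es'],
}

TAGGED = re.compile(r'^"(?P<text>.*)"@(?P<lang>[A-Za-z-]+)$')


def build_label_index(raw: dict) -> dict:
    """Return {code: {lang: label}} from language-tagged label strings."""
    index = {}
    for code, entries in raw.items():
        for entry in entries:
            match = TAGGED.match(entry.strip())
            if match is None:
                raise ValueError(f"Unrecognized label format: {entry!r}")
            index.setdefault(code, {})[match["lang"]] = match["text"]
    return index


labels = build_label_index(RAW_LABELS)
```

Keying by code first and language second matches the lookups used later for UI interpolation, such as labels["M"]["nl"].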

Geographic Labels

Subregion/settlement names come from reference data files, not hardcoded:

label_sources:
  - "data/reference/iso_3166_2_{country}.json"  # e.g., iso_3166_2_nl.json
  - "data/reference/geonames.db"                 # GeoNames database
  - "data/reference/admin1CodesASCII.txt"        # GeoNames fallback

Example: iso_3166_2_nl.json maps province names, including English synonyms such as "North Holland", to subdivision codes:

{
  "provinces": {
    "Noord-Holland": "NH",
    "Zuid-Holland": "ZH",
    "North Holland": "NH"
  }
}
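
A name-to-code lookup for normalization could be derived from such a file roughly as shown here. This is a sketch, assuming the "provinces" layout from the example; build_subregion_lookup is a hypothetical helper, and the JSON is inlined so the snippet runs without the reference file.

```python
import json

# Inline copy of the example content; real code would read the file instead.
ISO_3166_2_NL = """
{
  "provinces": {
    "Noord-Holland": "NH",
    "Zuid-Holland": "ZH",
    "North Holland": "NH"
  }
}
"""


def build_subregion_lookup(country: str, raw_json: str) -> dict:
    """Map lowercased province names to full ISO 3166-2 codes."""
    provinces = json.loads(raw_json)["provinces"]
    return {name.lower(): f"{country}-{code}" for name, code in provinces.items()}


lookup = build_subregion_lookup("NL", ISO_3166_2_NL)
```

Because synonyms share a code, both "noord-holland" and "north holland" resolve to "NL-NH".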

SlotExtractor Label Loading

class SlotExtractor:
    def __init__(self, schema_path: str, reference_path: str):
        # Load institution type labels from schema
        self.type_labels = self._load_custodian_type_labels(schema_path)
        
        # Load geographic labels from reference files
        self.subregion_labels = self._load_subregion_labels(reference_path)
    
    def _load_custodian_type_labels(self, schema_path: str) -> dict:
        """Load multilingual labels from CustodianType schema files."""
        # Parse YAML, extract type_label slots
        # Return: {"M": {"nl": "musea", "en": "museums"}, ...}
        
    def _load_subregion_labels(self, reference_path: str) -> dict:
        """Load subregion labels from ISO 3166-2 JSON files."""
        # Load iso_3166_2_nl.json, iso_3166_2_de.json, etc.
        # Return: {"NL-NH": {"nl": "Noord-Holland", "en": "North Holland"}, ...}

UI Template Interpolation

ui_template:
  nl: "Er zijn {{ count }} {{ institution_type_nl }} in {{ location }}."
  en: "There are {{ count }} {{ institution_type_en }} in {{ location }}."

The RAG pipeline populates institution_type_nl / institution_type_en from dynamically loaded labels:

# At runtime, NOT hardcoded
template_context["institution_type_nl"] = slot_extractor.type_labels[type_code]["nl"]
template_context["institution_type_en"] = slot_extractor.type_labels[type_code]["en"]
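
Rather than assigning one context key per language by hand, the context can be built for every language the schema provides. The sketch below assumes the {code: {lang: label}} shape of type_labels described above; build_context and render_ui are hypothetical names, and the label values are sample data.

```python
import re

# Sample label data; in the pipeline this comes from slot_extractor.type_labels.
type_labels = {"M": {"nl": "musea", "en": "museums"}}

ui_template = {
    "nl": "Er zijn {{ count }} {{ institution_type_nl }} in {{ location }}.",
    "en": "There are {{ count }} {{ institution_type_en }} in {{ location }}.",
}


def build_context(type_code: str, count: int, location: str) -> dict:
    """Add one institution_type_<lang> entry per available label language."""
    context = {"count": str(count), "location": location}
    for lang, label in type_labels[type_code].items():
        context[f"institution_type_{lang}"] = label
    return context


def render_ui(lang: str, context: dict) -> str:
    """Fill {{ name }} placeholders in the UI template for one language."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: context[m.group(1)],
        ui_template[lang],
    )


message = render_ui("nl", build_context("M", 12, "Noord-Holland"))
```

With this approach, adding a language to the schema labels makes a new institution_type_<lang> key available without touching the code.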

Adding New Types

When the schema gains new institution types:

  1. No template changes needed - parameterized templates automatically support new types
  2. Update synonyms in _slot_types.institution_type.synonyms for NLP recognition
  3. Labels auto-discovered from schema files - no code changes needed

Anti-Patterns (FORBIDDEN)

Hardcoded Labels in Templates

# WRONG - Hardcoded labels
labels:
  NL-NH: {nl: "Noord-Holland", en: "North Holland"}
  NL-ZH: {nl: "Zuid-Holland", en: "South Holland"}

# WRONG - Hardcoded labels in code
INSTITUTION_TYPE_LABELS_NL = {
    "M": "musea", "L": "bibliotheken", ...
}

Correct Approach

# CORRECT - Reference to schema/data source
label_sources:
  - "schemas/20251121/linkml/modules/classes/CustodianType.yaml"
  - "data/reference/iso_3166_2_{country}.json"

# CORRECT - Load labels at runtime
type_labels = load_labels_from_schema("CustodianType.yaml")
region_labels = load_labels_from_reference("iso_3166_2_nl.json")

Why?

  1. Single source of truth - Labels defined once in schema/reference files
  2. Automatic sync - Schema changes automatically propagate to UI
  3. Extensibility - Adding new countries/types doesn't require code changes
  4. Multilingual - All language variants come from same source

Validation

Templates MUST validate slot values against schema:

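def validate_institution_type(value: str) -> bool:
    """Validate institution_type against CustodianType schema."""
    # Illustrative only: these codes mirror the 19 CustodianType subclasses.
    # Per the no-hardcoding rule, derive them from the schema at runtime.
    valid_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
                   'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'X']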
    return value in valid_codes
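
To keep validation consistent with the no-hardcoding rule, the valid codes could be read from the enum file listed under References. The sketch below assumes the standard LinkML layout (enums, then the enum name, then permissible_values) and works on an already parsed dictionary so it runs without a YAML library. The three values shown are a shortened sample, and load_valid_codes is a hypothetical helper.

```python
# Parsed content of a LinkML enum file, shown inline. The enum name follows
# the References section; the listed values are a shortened sample.
ENUM_SCHEMA = {
    "enums": {
        "InstitutionTypeCodeEnum": {
            "permissible_values": {
                "A": {"description": "Archives"},
                "L": {"description": "Libraries"},
                "M": {"description": "Museums"},
            }
        }
    }
}


def load_valid_codes(schema: dict, enum_name: str) -> frozenset:
    """Collect the permissible values of a LinkML enum as a frozenset."""
    return frozenset(schema["enums"][enum_name]["permissible_values"])


def validate_institution_type(value: str, valid_codes: frozenset) -> bool:
    """Validate a slot value against codes read from the schema."""
    return value in valid_codes


VALID_CODES = load_valid_codes(ENUM_SCHEMA, "InstitutionTypeCodeEnum")
```

Loading the set once at startup keeps each validation call cheap while still picking up new codes whenever the schema changes.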

Related Rules

  • Rule 0b: Type/Types file naming convention
  • Rule 13: Custodian type annotations on LinkML schema elements
  • Rule 37: Specificity score annotations for template filtering

References

  • Schema: schemas/20251121/linkml/modules/classes/CustodianType.yaml
  • Types: schemas/20251121/linkml/modules/classes/*Types.yaml
  • Enums: schemas/20251121/linkml/modules/enums/InstitutionTypeCodeEnum.yaml
  • Templates: data/sparql_templates.yaml