# Rule: LinkML "Types" Classes Define SPARQL Template Variables

**Created**: 2025-01-08

**Status**: Active

**Applies to**: SPARQL template design, RAG pipeline slot extraction

## Core Principle

LinkML classes following the `*Type` / `*Types` naming pattern (Rule 0b) serve as the **single source of truth** for valid values in SPARQL template slot variables.

When designing SPARQL templates, **extract variables from the schema** rather than hardcoding values. This enables:

- **Flexibility**: Same template works across all institution types
- **Extensibility**: Adding new types to schema automatically extends templates
- **Consistency**: Variable values always align with ontology
- **Multilingual support**: Type labels in multiple languages available from schema

## Template Variable Sources

### 1. Institution Type Variable (`institution_type`)

**Schema Source**: `CustodianType` abstract class and its 19 subclasses

| Subclass | Code | Description |
|----------|------|-------------|
| `ArchiveOrganizationType` | A | Archives |
| `BioCustodianType` | B | Botanical gardens, zoos |
| `CommercialOrganizationType` | C | Corporations |
| `DigitalPlatformType` | D | Digital platforms |
| `EducationProviderType` | E | Universities, schools |
| `FeatureCustodianType` | F | Geographic features |
| `GalleryType` | G | Art galleries |
| `HolySacredSiteType` | H | Religious sites |
| `IntangibleHeritageGroupType` | I | Folklore organizations |
| `LibraryType` | L | Libraries |
| `MuseumType` | M | Museums |
| `NonProfitType` | N | NGOs |
| `OfficialInstitutionType` | O | Government agencies |
| `PersonalCollectionType` | P | Private collectors |
| `ResearchOrganizationType` | R | Research centers |
| `HeritageSocietyType` | S | Historical societies |
| `TasteScentHeritageType` | T | Culinary heritage |
| `UnspecifiedType` | U | Unknown |
| `MixedCustodianType` | X | Multiple types |

**Template Slot Definition**:

```yaml
slots:
  institution_type:
    type: institution_type
    required: true
    schema_source: "modules/classes/CustodianType.yaml"
    # Valid values derived from CustodianType subclasses
```

### 2. Geographic Scope Variable (`location`)

Geographic scope is a **hierarchical variable** with three levels:

| Level | Schema Source | SPARQL Property | Example |
|-------|---------------|-----------------|---------|
| Country | ISO 3166-1 alpha-2 | `hc:countryCode` | NL, DE, BE |
| Subregion | ISO 3166-2 | `hc:subregionCode` | NL-NH, DE-BY |
| Settlement | GeoNames | `hc:settlementName` | Amsterdam, Berlin |

**Template Slot Definition**:

```yaml
slots:
  location:
    type: location
    required: true
    schema_source:
      - "modules/enums/CountryCodeEnum.yaml"  # if it exists
      - "data/reference/geonames.db"
    resolution_order: [settlement, subregion, country]
    # SlotExtractor detects which level the user specified
```

### 3. Digital Platform Type Variable (`platform_type`)

**Schema Source**: `DigitalPlatformType` abstract class and 69+ subclasses in `DigitalPlatformTypes.yaml`

Categories include:

- REPOSITORY: DigitalLibrary, DigitalArchivePlatform, OpenAccessRepository
- AGGREGATOR: Europeana-type aggregators, BibliographicDatabasePlatform
- DISCOVERY: WebPortal, OnlineDatabase, OpenDataPortal
- VIRTUAL_HERITAGE: VirtualMuseum, VirtualLibrary, OnlineArtGallery
- RESEARCH: DisciplinaryRepository, PrePrintServer, GenealogyDatabase
- ...and many more

**Template Slot Definition**:

```yaml
slots:
  platform_type:
    type: platform_type
    required: false
    schema_source: "modules/classes/DigitalPlatformTypes.yaml"
```

## Template Design Pattern

### Before (Hardcoded - WRONG)

```yaml
# Separate templates for each institution type - DO NOT DO THIS
templates:
  count_museums_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "M" ;
           hc:subregionCode "{{ region }}" .
      }
  count_archives_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "A" ;
           hc:subregionCode "{{ region }}" .
      }
```

### After (Parameterized - CORRECT)

```yaml
# Single template with institution_type as variable
templates:
  count_institutions_by_type_location:
    description: "Count heritage institutions by type and location"
    slots:
      institution_type:
        type: institution_type
        required: true
        schema_source: "modules/classes/CustodianType.yaml"
      location:
        type: location
        required: true
        resolution_order: [settlement, subregion, country]
    # Multiple SPARQL variants based on location resolution
    sparql_template: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
          hc:institutionType "{{ institution_type }}" ;
          hc:settlementName "{{ location }}" .
      }
    sparql_template_region: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
          hc:institutionType "{{ institution_type }}" ;
          hc:subregionCode "{{ location }}" .
      }
    sparql_template_country: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
          hc:institutionType "{{ institution_type }}" ;
          hc:countryCode "{{ location }}" .
      }
```

## SlotExtractor Responsibilities

The SlotExtractor module must:

1. **Detect institution type** from user query:
   - "musea" → M (Dutch plural)
   - "archives" → A (English)
   - "bibliotheken" → L (Dutch)
   - Use synonyms from `_slot_types.institution_type.synonyms`

2. **Detect location level** from user query:
   - "Amsterdam" → settlement level → use `sparql_template`
   - "Noord-Holland" → subregion level → use `sparql_template_region`
   - "Nederland" → country level → use `sparql_template_country`

3. **Normalize values** to schema-compliant codes:
   - "Noord-Holland" → "NL-NH"
   - "museum" → "M"

## Dynamic Label Resolution (NO HARDCODING)

**CRITICAL**: Labels MUST be resolved at runtime from schema/reference files, NOT hardcoded in templates or code.
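The location-level detection and normalization described under SlotExtractor Responsibilities could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `resolve_location`, `ResolvedLocation`, and the `demo_*` lookup tables are hypothetical names, and the tiny inline tables only stand in for data that, per this rule, would be loaded at runtime from the reference files.

```python
from typing import Mapping, NamedTuple, Optional, Sequence


class ResolvedLocation(NamedTuple):
    level: str         # "settlement", "subregion" or "country"
    value: str         # normalized value substituted into the template
    template_key: str  # which SPARQL variant of the template to use


# Template keys as named in the parameterized template example.
TEMPLATE_BY_LEVEL = {
    "settlement": "sparql_template",
    "subregion": "sparql_template_region",
    "country": "sparql_template_country",
}


def resolve_location(
    text: str,
    settlements: Mapping[str, str],
    subregions: Mapping[str, str],
    countries: Mapping[str, str],
    order: Sequence[str] = ("settlement", "subregion", "country"),
) -> Optional[ResolvedLocation]:
    """Return the first level (in `order`) whose lookup table knows `text`."""
    lookups = {
        "settlement": settlements,
        "subregion": subregions,
        "country": countries,
    }
    key = text.strip().casefold()  # case-insensitive matching
    for level in order:
        value = lookups[level].get(key)
        if value is not None:
            return ResolvedLocation(level, value, TEMPLATE_BY_LEVEL[level])
    return None  # unknown location: the caller decides how to handle it


# ASSUMPTION: inline demo tables for illustration only. In the real pipeline
# these would come from the GeoNames and ISO 3166-2 reference files.
demo_settlements = {"amsterdam": "Amsterdam"}
demo_subregions = {"noord-holland": "NL-NH"}
demo_countries = {"nederland": "NL"}

result = resolve_location(
    "Noord-Holland", demo_settlements, demo_subregions, demo_countries
)
print(result)
```

Because the lookup tables are passed in rather than defined inside the function, the resolver itself contains no labels, and trying the levels in `resolution_order` means an ambiguous name is resolved at the most specific level first.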
### Institution Type Labels

The `CustodianType` classes contain multilingual labels via the `type_label` slot:

```yaml
MuseumType:
  type_label:
    - "Museum"@en
    - "museum"@nl
    - "Museum"@de
    - "museo"@es
```

**Label Resolution Chain**:

1. Load `CustodianType.yaml` and subclass files
2. Parse `type_label` slot for each type code (M, L, A, etc.)
3. Build runtime label dictionary keyed by code + language

### Geographic Labels

Subregion/settlement names come from **reference data files**, not hardcoded:

```yaml
label_sources:
  - "data/reference/iso_3166_2_{country}.json"  # e.g., iso_3166_2_nl.json
  - "data/reference/geonames.db"                # GeoNames database
  - "data/reference/admin1CodesASCII.txt"       # GeoNames fallback
```

**Example**: `iso_3166_2_nl.json` contains the following, where `North Holland` is an English synonym for `Noord-Holland`:

```json
{
  "provinces": {
    "Noord-Holland": "NH",
    "Zuid-Holland": "ZH",
    "North Holland": "NH"
  }
}
```

### SlotExtractor Label Loading

```python
class SlotExtractor:
    def __init__(self, schema_path: str, reference_path: str):
        # Load institution type labels from schema
        self.type_labels = self._load_custodian_type_labels(schema_path)
        # Load geographic labels from reference files
        self.subregion_labels = self._load_subregion_labels(reference_path)

    def _load_custodian_type_labels(self, schema_path: str) -> dict:
        """Load multilingual labels from CustodianType schema files."""
        # Parse YAML, extract type_label slots
        # Return: {"M": {"nl": "musea", "en": "museums"}, ...}

    def _load_subregion_labels(self, reference_path: str) -> dict:
        """Load subregion labels from ISO 3166-2 JSON files."""
        # Load iso_3166_2_nl.json, iso_3166_2_de.json, etc.
        # Return: {"NL-NH": {"nl": "Noord-Holland", "en": "North Holland"}, ...}
```

### UI Template Interpolation

```yaml
ui_template:
  nl: "Er zijn {{ count }} {{ institution_type_nl }} in {{ location }}."
  en: "There are {{ count }} {{ institution_type_en }} in {{ location }}."
```

The RAG pipeline populates `institution_type_nl` / `institution_type_en` from dynamically loaded labels:

```python
# At runtime, NOT hardcoded
template_context["institution_type_nl"] = slot_extractor.type_labels[type_code]["nl"]
template_context["institution_type_en"] = slot_extractor.type_labels[type_code]["en"]
```

## Adding New Types

When the schema gains new institution types:

1. **No template changes needed** - parameterized templates automatically support new types
2. **Update synonyms** in `_slot_types.institution_type.synonyms` for NLP recognition
3. **Labels auto-discovered** from schema files - no code changes needed

## Anti-Patterns (FORBIDDEN)

### Hardcoded Labels in Templates

```yaml
# WRONG - Hardcoded labels
labels:
  NL-NH: {nl: "Noord-Holland", en: "North Holland"}
  NL-ZH: {nl: "Zuid-Holland", en: "South Holland"}
```

```python
# WRONG - Hardcoded labels in code
INSTITUTION_TYPE_LABELS_NL = {
    "M": "musea",
    "L": "bibliotheken",
    # ...
}
```

### Correct Approach

```yaml
# CORRECT - Reference to schema/data source
label_sources:
  - "schemas/20251121/linkml/modules/classes/CustodianType.yaml"
  - "data/reference/iso_3166_2_{country}.json"
```

```python
# CORRECT - Load labels at runtime
type_labels = load_labels_from_schema("CustodianType.yaml")
region_labels = load_labels_from_reference("iso_3166_2_nl.json")
```

**Why?**

1. **Single source of truth** - Labels defined once in schema/reference files
2. **Automatic sync** - Schema changes automatically propagate to UI
3. **Extensibility** - Adding new countries/types doesn't require code changes
4. **Multilingual** - All language variants come from same source

## Validation

Templates MUST validate slot values against the schema:

```python
def validate_institution_type(value: str) -> bool:
    """Validate institution_type against CustodianType schema."""
    # The codes below mirror the 19 CustodianType subclasses. In keeping with
    # the no-hardcoding rule, a production implementation should derive this
    # list from the schema at runtime rather than maintain it by hand.
    valid_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'L',
                   'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'X']
    return value in valid_codes
```

## Related Rules

- **Rule 0b**: Type/Types file naming convention
- **Rule 13**: Custodian type annotations on LinkML schema elements
- **Rule 37**: Specificity score annotations for template filtering

## References

- Schema: `schemas/20251121/linkml/modules/classes/CustodianType.yaml`
- Types: `schemas/20251121/linkml/modules/classes/*Types.yaml`
- Enums: `schemas/20251121/linkml/modules/enums/InstitutionTypeCodeEnum.yaml`
- Templates: `data/sparql_templates.yaml`