kempersc c88fd3af70 Refactor code structure for improved readability and maintainability

2026-01-09 11:05:26 +01:00

Rule: LinkML "Types" Classes Define SPARQL Template Variables

Created: 2025-01-08
Status: Active
Applies to: SPARQL template design, RAG pipeline slot extraction

Core Principle

LinkML classes following the *Type / *Types naming pattern (Rule 0b) serve as the single source of truth for valid values in SPARQL template slot variables.

When designing SPARQL templates, extract variables from the schema rather than hardcoding values. This enables:

Flexibility: Same template works across all institution types
Extensibility: Adding new types to schema automatically extends templates
Consistency: Variable values always align with ontology
Multilingual support: Type labels in multiple languages available from schema

Template Variable Sources

1. Institution Type Variable (`institution_type`)

Schema Source: CustodianType abstract class and its 19 subclasses

Subclass	Code	Description
`ArchiveOrganizationType`	A	Archives
`BioCustodianType`	B	Botanical gardens, zoos
`CommercialOrganizationType`	C	Corporations
`DigitalPlatformType`	D	Digital platforms
`EducationProviderType`	E	Universities, schools
`FeatureCustodianType`	F	Geographic features
`GalleryType`	G	Art galleries
`HolySacredSiteType`	H	Religious sites
`IntangibleHeritageGroupType`	I	Folklore organizations
`LibraryType`	L	Libraries
`MuseumType`	M	Museums
`NonProfitType`	N	NGOs
`OfficialInstitutionType`	O	Government agencies
`PersonalCollectionType`	P	Private collectors
`ResearchOrganizationType`	R	Research centers
`HeritageSocietyType`	S	Historical societies
`TasteScentHeritageType`	T	Culinary heritage
`UnspecifiedType`	U	Unknown
`MixedCustodianType`	X	Multiple types

Template Slot Definition:

slots:
  institution_type:
    type: institution_type
    required: true
    schema_source: "modules/classes/CustodianType.yaml"
    # Valid values derived from CustodianType subclasses
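Deriving the valid values from the subclasses can be sketched as a walk over LinkML `is_a` links. The example assumes the schema has already been parsed into a plain dictionary of class definitions; the `subclasses_of` helper and the sample fragment are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def subclasses_of(classes: Dict[str, dict], root: str) -> List[str]:
    """Return all direct and indirect subclasses of root, following is_a."""
    found: List[str] = []
    frontier = [root]
    while frontier:
        parent = frontier.pop()
        for name, definition in classes.items():
            # The "not in found" guard keeps a malformed is_a cycle from looping forever.
            if definition.get("is_a") == parent and name not in found:
                found.append(name)
                frontier.append(name)
    return sorted(found)


# Hypothetical parsed schema fragment; class names are taken from the table above.
classes = {
    "CustodianType": {"abstract": True},
    "MuseumType": {"is_a": "CustodianType"},
    "LibraryType": {"is_a": "CustodianType"},
    "Unrelated": {"is_a": "SomethingElse"},
}
custodian_subclasses = subclasses_of(classes, "CustodianType")
```

Because the list is computed from the schema, a newly added subclass shows up without any change to the template.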

2. Geographic Scope Variable (`location`)

Geographic scope is a hierarchical variable with three levels:

Level	Schema Source	SPARQL Property	Example
Country	ISO 3166-1 alpha-2	`hc:countryCode`	NL, DE, BE
Subregion	ISO 3166-2	`hc:subregionCode`	NL-NH, DE-BY
Settlement	GeoNames	`hc:settlementName`	Amsterdam, Berlin

Template Slot Definition:

slots:
  location:
    type: location
    required: true
    schema_source:
      - "modules/enums/CountryCodeEnum.yaml"  # if it exists
      - "data/reference/geonames.db"
    resolution_order: [settlement, subregion, country]
    # SlotExtractor detects which level user specified
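One way to read `resolution_order` is as a most-specific-first lookup. The sketch below assumes three small in-memory tables as stand-ins for the reference data, and a hypothetical `resolve_location` helper; it is not the project's SlotExtractor.

```python
from typing import Optional, Tuple

# Hypothetical stand-ins for the reference data; real values would come
# from the files named in schema_source.
SETTLEMENTS = {"amsterdam": "Amsterdam", "berlin": "Berlin"}
SUBREGIONS = {"noord-holland": "NL-NH", "bayern": "DE-BY"}
COUNTRIES = {"nederland": "NL", "deutschland": "DE"}

LOOKUPS = {"settlement": SETTLEMENTS, "subregion": SUBREGIONS, "country": COUNTRIES}
RESOLUTION_ORDER = ["settlement", "subregion", "country"]


def resolve_location(text: str) -> Optional[Tuple[str, str]]:
    """Return (level, normalized value) for the first level that matches."""
    key = text.strip().lower()
    for level in RESOLUTION_ORDER:
        value = LOOKUPS[level].get(key)
        if value is not None:
            return level, value
    return None  # unknown place name: the caller decides how to handle it
```

Checking the levels in a fixed order makes the result predictable when a name exists at more than one level.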

3. Digital Platform Type Variable (`platform_type`)

Schema Source: DigitalPlatformType abstract class and 69+ subclasses in DigitalPlatformTypes.yaml

Categories include:

REPOSITORY: DigitalLibrary, DigitalArchivePlatform, OpenAccessRepository
AGGREGATOR: Europeana-type aggregators, BibliographicDatabasePlatform
DISCOVERY: WebPortal, OnlineDatabase, OpenDataPortal
VIRTUAL_HERITAGE: VirtualMuseum, VirtualLibrary, OnlineArtGallery
RESEARCH: DisciplinaryRepository, PrePrintServer, GenealogyDatabase
...and many more

Template Slot Definition:

slots:
  platform_type:
    type: platform_type
    required: false
    schema_source: "modules/classes/DigitalPlatformTypes.yaml"

Template Design Pattern

Before (Hardcoded - WRONG)

# Separate templates for each institution type - DO NOT DO THIS
templates:
  count_museums_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "M" ;
           hc:subregionCode "{{ region }}" .
      }
            
  count_archives_in_region:
    sparql: |
      SELECT (COUNT(?s) AS ?count) WHERE {
        ?s hc:institutionType "A" ;
           hc:subregionCode "{{ region }}" .
      }

After (Parameterized - CORRECT)

# Single template with institution_type as variable
templates:
  count_institutions_by_type_location:
    description: "Count heritage institutions by type and location"
    slots:
      institution_type:
        type: institution_type
        required: true
        schema_source: "modules/classes/CustodianType.yaml"
      location:
        type: location
        required: true
        resolution_order: [settlement, subregion, country]
        
    # Multiple SPARQL variants based on location resolution
    sparql_template: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:settlementName "{{ location }}" .
      }
            
    sparql_template_region: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:subregionCode "{{ location }}" .
      }
            
    sparql_template_country: |
      SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {
        ?institution a hcc:Custodian ;
                     hc:institutionType "{{ institution_type }}" ;
                     hc:countryCode "{{ location }}" .
      }
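Selecting a variant and filling its slots might look like the following sketch. The level-to-variant mapping, the `render` helper and the minimal literal escaping are assumptions made for illustration; the document does not specify the actual template engine.

```python
import re

# Variant names follow the template above; the mapping itself is an assumption.
VARIANT_BY_LEVEL = {
    "settlement": "sparql_template",
    "subregion": "sparql_template_region",
    "country": "sparql_template_country",
}

# Matches {{ name }} placeholders but not the single braces of SPARQL.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render(template: dict, level: str, slots: dict) -> str:
    """Pick the variant for the location level and fill its placeholders."""
    body = template[VARIANT_BY_LEVEL[level]]
    return PLACEHOLDER.sub(lambda m: escape_literal(slots[m.group(1)]), body)


template = {
    "sparql_template_region": (
        "SELECT (COUNT(DISTINCT ?institution) AS ?count) WHERE {\n"
        "  ?institution a hcc:Custodian ;\n"
        '               hc:institutionType "{{ institution_type }}" ;\n'
        '               hc:subregionCode "{{ location }}" .\n'
        "}"
    ),
}
query = render(template, "subregion", {"institution_type": "M", "location": "NL-NH"})
```

Escaping slot values before substitution keeps a stray quote in user input from changing the structure of the query.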

SlotExtractor Responsibilities

The SlotExtractor module must:

1. Detect institution type from user query:
   - "musea" → M (Dutch plural)
   - "archives" → A (English)
   - "bibliotheken" → L (Dutch)
   - Use synonyms from _slot_types.institution_type.synonyms
2. Detect location level from user query:
   - "Amsterdam" → settlement level → use sparql_template
   - "Noord-Holland" → subregion level → use sparql_template_region
   - "Nederland" → country level → use sparql_template_country
3. Normalize values to schema-compliant codes:
   - "Noord-Holland" → "NL-NH"
   - "museum" → "M"
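Institution type detection can be sketched as a token lookup against a synonym table. The table below is a hypothetical stand-in for `_slot_types.institution_type.synonyms`, whose exact layout is not shown here, and `detect_institution_type` is an illustrative helper name.

```python
import re
from typing import Optional

# Hypothetical synonym table keyed by type code.
SYNONYMS = {
    "M": ["museum", "museums", "musea"],
    "A": ["archive", "archives", "archief", "archieven"],
    "L": ["library", "libraries", "bibliotheek", "bibliotheken"],
}

# Invert to a word -> code lookup once, at load time.
WORD_TO_CODE = {word: code for code, words in SYNONYMS.items() for word in words}


def detect_institution_type(query: str) -> Optional[str]:
    """Return the type code for the first known synonym found in the query."""
    for token in re.findall(r"\w+", query.lower()):
        code = WORD_TO_CODE.get(token)
        if code is not None:
            return code
    return None
```

Since the function returns the code rather than the matched word, detection and normalization happen in one step.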

Dynamic Label Resolution (NO HARDCODING)

CRITICAL: Labels MUST be resolved at runtime from schema/reference files, NOT hardcoded in templates or code.

Institution Type Labels

The CustodianType classes carry multilingual labels in the type_label slot:

MuseumType:
  type_label:
    - "Museum"@en
    - "museum"@nl
    - "Museum"@de
    - "museo"@es

Label Resolution Chain:

Load CustodianType.yaml and subclass files
Parse type_label slot for each type code (M, L, A, etc.)
Build runtime label dictionary keyed by code + language
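The last two steps of the chain can be sketched as follows. The example assumes the type_label values have already been read from the schema files and grouped by type code; the `build_label_index` helper and its input shape are hypothetical.

```python
import re
from typing import Dict, List

# Matches the "text"@lang notation used in the type_label example.
LANG_LITERAL = re.compile(r'^"(?P<text>.*)"@(?P<lang>[A-Za-z-]+)$')


def build_label_index(type_labels: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Turn {code: ['"Museum"@en', ...]} into {code: {lang: text}}."""
    index: Dict[str, Dict[str, str]] = {}
    for code, literals in type_labels.items():
        for literal in literals:
            match = LANG_LITERAL.match(literal.strip())
            if match:  # entries without a language tag are skipped
                index.setdefault(code, {})[match["lang"]] = match["text"]
    return index


# Hypothetical input keyed by type code.
raw = {"M": ['"Museum"@en', '"museum"@nl', '"Museum"@de', '"museo"@es']}
labels = build_label_index(raw)
```

A lookup such as `labels["M"]["nl"]` then serves as the runtime dictionary keyed by code and language.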

Geographic Labels

Subregion and settlement names come from reference data files and are not hardcoded:

label_sources:
  - "data/reference/iso_3166_2_{country}.json"  # e.g., iso_3166_2_nl.json
  - "data/reference/geonames.db"                 # GeoNames database
  - "data/reference/admin1CodesASCII.txt"        # GeoNames fallback

Example: iso_3166_2_nl.json contains the following ("North Holland" is an English synonym for "Noord-Holland"):

{
  "provinces": {
    "Noord-Holland": "NH",
    "Zuid-Holland": "ZH",
    "North Holland": "NH"
  }
}
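Given a file of this shape, normalizing a province name to its full ISO 3166-2 code could be sketched like this. The `load_subregion_codes` helper is hypothetical, and an inline string stands in for reading the file from disk.

```python
import json
from typing import Dict


def load_subregion_codes(json_text: str, country: str) -> Dict[str, str]:
    """Map lowercased province names to ISO 3166-2 codes such as 'NL-NH'."""
    data = json.loads(json_text)
    return {
        name.lower(): f"{country.upper()}-{suffix}"
        for name, suffix in data["provinces"].items()
    }


# Inline sample in place of data/reference/iso_3166_2_nl.json.
sample = '{"provinces": {"Noord-Holland": "NH", "Zuid-Holland": "ZH", "North Holland": "NH"}}'
codes = load_subregion_codes(sample, "nl")
```

Lowercasing the keys lets the Dutch name and the English synonym resolve to the same code regardless of how the user capitalizes them.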

SlotExtractor Label Loading

class SlotExtractor:
    def __init__(self, schema_path: str, reference_path: str):
        # Load institution type labels from schema
        self.type_labels = self._load_custodian_type_labels(schema_path)
        
        # Load geographic labels from reference files
        self.subregion_labels = self._load_subregion_labels(reference_path)
    
    def _load_custodian_type_labels(self, schema_path: str) -> dict:
        """Load multilingual labels from CustodianType schema files."""
        # Parse YAML, extract type_label slots
        # Return: {"M": {"nl": "musea", "en": "museums"}, ...}
        
    def _load_subregion_labels(self, reference_path: str) -> dict:
        """Load subregion labels from ISO 3166-2 JSON files."""
        # Load iso_3166_2_nl.json, iso_3166_2_de.json, etc.
        # Return: {"NL-NH": {"nl": "Noord-Holland", "en": "North Holland"}, ...}

UI Template Interpolation

ui_template:
  nl: "Er zijn {{ count }} {{ institution_type_nl }} in {{ location }}."
  en: "There are {{ count }} {{ institution_type_en }} in {{ location }}."

The RAG pipeline populates institution_type_nl / institution_type_en from dynamically loaded labels:

# At runtime, NOT hardcoded
template_context["institution_type_nl"] = slot_extractor.type_labels[type_code]["nl"]
template_context["institution_type_en"] = slot_extractor.type_labels[type_code]["en"]

Adding New Types

When the schema gains new institution types:

No template changes needed - parameterized templates automatically support new types
Update synonyms in _slot_types.institution_type.synonyms for NLP recognition
Labels auto-discovered from schema files - no code changes needed

Anti-Patterns (FORBIDDEN)

Hardcoded Labels in Templates

# WRONG - Hardcoded labels
labels:
  NL-NH: {nl: "Noord-Holland", en: "North Holland"}
  NL-ZH: {nl: "Zuid-Holland", en: "South Holland"}

# WRONG - Hardcoded labels in code
INSTITUTION_TYPE_LABELS_NL = {
    "M": "musea", "L": "bibliotheken", ...
}

Correct Approach

# CORRECT - Reference to schema/data source
label_sources:
  - "schemas/20251121/linkml/modules/classes/CustodianType.yaml"
  - "data/reference/iso_3166_2_{country}.json"

# CORRECT - Load labels at runtime
type_labels = load_labels_from_schema("CustodianType.yaml")
region_labels = load_labels_from_reference("iso_3166_2_nl.json")

Why?

Single source of truth - Labels defined once in schema/reference files
Automatic sync - Schema changes automatically propagate to UI
Extensibility - Adding new countries/types doesn't require code changes
Multilingual - All language variants come from same source

Validation

Templates MUST validate slot values against schema:

def validate_institution_type(value: str) -> bool:
    """Validate institution_type against CustodianType schema."""
    valid_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 
                   'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'X']
    return value in valid_codes
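The function above lists the codes inline. In keeping with the rule, the valid set could instead be derived from schema data, as in this sketch. The subclass-to-code mapping is a hypothetical excerpt of the table in section 1, and both helper names are illustrative.

```python
from typing import Dict, FrozenSet

# Hypothetical excerpt of the subclass -> code mapping; in practice this
# would be read from the schema files rather than written out here.
CUSTODIAN_SUBCLASS_CODES: Dict[str, str] = {
    "ArchiveOrganizationType": "A",
    "LibraryType": "L",
    "MuseumType": "M",
}


def valid_codes_from_schema(subclass_codes: Dict[str, str]) -> FrozenSet[str]:
    """Collect the set of type codes defined by the schema."""
    return frozenset(subclass_codes.values())


def validate_against_schema(value: str, valid_codes: FrozenSet[str]) -> bool:
    """Check a slot value against codes loaded from the schema."""
    return value in valid_codes


schema_codes = valid_codes_from_schema(CUSTODIAN_SUBCLASS_CODES)
```

With this shape, adding a type to the schema extends validation without touching the validator.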

Related Rules

Rule 0b: Type/Types file naming convention
Rule 13: Custodian type annotations on LinkML schema elements
Rule 37: Specificity score annotations for template filtering

References

Schema: schemas/20251121/linkml/modules/classes/CustodianType.yaml
Types: schemas/20251121/linkml/modules/classes/*Types.yaml
Enums: schemas/20251121/linkml/modules/enums/InstitutionTypeCodeEnum.yaml
Templates: data/sparql_templates.yaml
