glam/.opencode/rules/ontology-driven-cache-segmentation.md
kempersc 01b9d77566 feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation
Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46):

- types-vocab.json: 10,142 lines of institution type vocabulary from LinkML
  - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels
  - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.)
  - Extracted from CustodianType.yaml and CustodianTypes.yaml

- types-vocabulary.ts: TypeScript module for entity extraction
  - Exports INSTITUTION_TYPES with regex patterns per type code
  - Replaces hardcoded patterns with schema-derived vocabulary
  - Supports multilingual matching

- Rule 46 documentation (.opencode/rules/)
  - Specifies vocabulary extraction workflow
  - Defines cache key generation algorithm
  - Migration path from hardcoded patterns
2026-01-10 12:57:03 +01:00


Rule 46: Ontology-Driven Cache Segmentation

🚨 CRITICAL: The semantic cache MUST use vocabulary derived from LinkML *Type.yaml and *Types.yaml schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.

Problem Statement

The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:

Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅

However, the current implementation uses hardcoded regex patterns:

// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};

Problems with hardcoded patterns:

  1. Maintenance burden - Every new institution type requires code changes
  2. Missing subtypes - "kunstmuseum" vs "museum" should cache separately
  3. No multilingual support - Only Dutch/English, misses German/French labels
  4. Duplication - Same vocabulary exists in LinkML schemas
  5. No record type awareness - "burgerlijke stand" queries mixed with general archive queries

Solution: Schema-Derived Vocabulary

The LinkML schema already contains rich vocabulary:

| Schema File | Content | Cache Utility |
|---|---|---|
| CustodianType.yaml | 19 top-level types | Primary segmentation (M/A/L/G...) |
| MuseumType.yaml | 187+ museum subtypes | Subtype segmentation |
| ArchiveOrganizationType.yaml | 144+ archive subtypes | Subtype segmentation |
| *RecordSetTypes.yaml | Record type taxonomies | Finding aids specificity |

Vocabulary Sources in Schema

  1. type_label - Multilingual labels via skos:prefLabel
  2. structured_aliases - Language-tagged alternative names
  3. keywords - Search terms for entity recognition
  4. wikidata_entity - Linked Data identifiers

Architecture

Overview: Two-Tier Embedding Hierarchy

The system uses a hierarchical embedding approach for fast semantic routing:

  1. Tier 1: Types File Embeddings - Which category? (Museum vs Archive vs Library)
  2. Tier 2: Individual Type Embeddings - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)
┌─────────────────────────────────────────────────────────────────────────┐
│  BUILD TIME: Extract vocabulary + generate embeddings                  │
│                                                                         │
│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
│                          ↓                                              │
│  scripts/extract-types-vocab.ts                                         │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  types-vocab.json                                                 │  │
│  │  ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }   │  │
│  │  ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│  │
│  │  └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│  RUNTIME: Two-Tier Semantic Routing                                    │
│                                                                         │
│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 1: Types File Selection                                    │    │
│  │ Query embedding vs Tier1 embeddings (19 categories)             │    │
│  │ Result: ArchiveOrganizationType (similarity: 0.89)              │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 2: Specific Type Selection                                 │    │
│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)      │    │
│  │ Result: MunicipalArchive (similarity: 0.94)                     │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  Structured cache key: "count:a.municipal_archive:amsterdam"            │
└─────────────────────────────────────────────────────────────────────────┘

Tier 1: Types File Embeddings

Each Types file (e.g., MuseumType.yaml, ArchiveOrganizationType.yaml) gets ONE embedding representing the accumulated vocabulary of all types within that file.

Embedding Text Construction:

MuseumType: museum musea kunstmuseum art museum natural history museum 
            science museum open-air museum ecomuseum virtual museum 
            heritage farm national museum regional museum university museum
            [... all keywords from all 187 subtypes ...]

Purpose: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.

| Types File | Code | Accumulated Terms Count |
|---|---|---|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |

Tier 2: Individual Type Embeddings

Each specific type within a Types file gets its own embedding from its accumulated terms.

Embedding Text Construction:

MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
                  town archive local government records burgerlijke stand
                  bevolkingsregister council minutes building permits
                  [... all keywords + structured_aliases + labels ...]

Purpose: Precise subtype identification after Tier 1 narrows the category.
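The embedding text for both tiers can be assembled from the per-language keyword lists. The following is a minimal sketch, assuming the `keywords` and `subtypes` shapes shown in the vocabulary JSON; `SubtypeVocab`, `buildTier2Text`, and `buildTier1Text` are hypothetical names, not part of the existing code.

```typescript
// Hypothetical shape mirroring the `keywords` field of the vocabulary JSON.
interface SubtypeVocab {
  keywords: Record<string, string[]>; // language code -> terms
}

// Tier 2: one text per subtype, built from all of its terms
// (lowercased, deduplicated, first occurrence kept).
function buildTier2Text(subtype: SubtypeVocab): string {
  const seen = new Set<string>();
  for (const terms of Object.values(subtype.keywords)) {
    for (const term of terms) seen.add(term.toLowerCase());
  }
  return Array.from(seen).join(" ");
}

// Tier 1: one text per Types file, accumulating the type's own terms
// followed by the text of every subtype.
function buildTier1Text(
  ownKeywords: Record<string, string[]>,
  subtypes: Record<string, SubtypeVocab>,
): string {
  const parts = [buildTier2Text({ keywords: ownKeywords })];
  for (const subtype of Object.values(subtypes)) {
    parts.push(buildTier2Text(subtype));
  }
  return parts.filter((part) => part.length > 0).join(" ");
}
```

The resulting strings are what would be passed to the embedding model at build time and stored as `accumulatedTerms` for debugging.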

Term Log Structure

A lookup table mapping every extracted term to its type/subtype:

{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType", 
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM", 
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}

Purpose:

  1. Fast O(1) keyword lookup (no embedding needed for exact matches)
  2. Audit trail of which terms map to which types
  3. Debugging which queries match which types
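The constant-time lookup holds when the query is split into candidate phrases and each phrase is looked up by key, rather than scanning every term. The following is a minimal sketch, assuming lowercase `termLog` keys and vocabulary terms of at most four words; `TermMapping` and `lookupLongestTerm` are hypothetical names.

```typescript
// Hypothetical entry shape, following the termLog examples in this rule.
interface TermMapping {
  typeCode: string;
  subtypeName?: string;
  recordSetType?: string;
  wikidata?: string;
}

// Return the longest word n-gram of the query that is a termLog key.
// Each candidate costs one hash lookup, so the work depends on the
// query length, not on the vocabulary size.
function lookupLongestTerm(
  query: string,
  termLog: Record<string, TermMapping>,
  maxWords = 4, // assumption: no term is longer than four words
): { term: string; mapping: TermMapping } | null {
  const words = query
    .toLowerCase()
    .split(/[\s?.,!;:"'()]+/)
    .filter((word) => word.length > 0);
  for (let n = Math.min(maxWords, words.length); n >= 1; n--) {
    for (let i = 0; i + n <= words.length; i++) {
      const term = words.slice(i, i + n).join(" ");
      // hasOwnProperty guards against inherited keys such as "constructor".
      if (Object.prototype.hasOwnProperty.call(termLog, term)) {
        const mapping = termLog[term];
        if (mapping) return { term, mapping };
      }
    }
  }
  return null;
}
```

Because this matches whole words only, plural and inflected forms (for example "kunstmusea") must themselves be present as keys.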

Runtime Lookup Strategy

async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();

  // FAST PATH: Check termLog for keyword matches (termLog keys are assumed lowercase).
  // Keep the longest matching term so that "kunstmuseum" wins over "museum"
  // regardless of key order.
  let bestTerm: string | null = null;
  for (const term of Object.keys(vocab.termLog)) {
    if (normalized.includes(term) && (bestTerm === null || term.length > bestTerm.length)) {
      bestTerm = term;
    }
  }
  if (bestTerm !== null) {
    const mapping = vocab.termLog[bestTerm];
    return {
      institutionType: mapping.typeCode,
      institutionSubtype: mapping.subtypeName ?? null,
      recordSetType: mapping.recordSetType ?? null, // set for record-type terms such as "burgerlijke stand"
      subtypeWikidata: mapping.wikidata ?? null,
      // ... location and intent extraction
    };
  }

  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);

  // Tier 1: Find best matching Types file (minimum similarity 0.7)
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }

  if (!bestType) return {}; // No type matched

  // Tier 2: Find best matching subtype within the Types file (minimum similarity 0.75).
  // institutionTypes is keyed by type code ("M", "A", ...), so resolve the code
  // from the class name that Tier 1 returned.
  const typeEntry = Object.values(vocab.institutionTypes).find((t) => t.className === bestType);
  if (!typeEntry) return {}; // Tier 1 name has no institutionTypes entry
  const typeCode = typeEntry.code;
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;

  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }

  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype, // null when no subtype clears the threshold
    // ... location and intent extraction
  };
}
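The listing calls `cosineSimilarity` without defining it. The following is a minimal sketch of such a helper, assuming embeddings are plain `number[]` arrays as stored in `types-vocab.json`; the helper actually used in the codebase may differ.

```typescript
// Cosine similarity between two vectors of equal length.
// Returns 0 for empty, mismatched, or all-zero input instead of NaN.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

Returning 0 for degenerate input keeps the threshold comparisons (`> 0.7`, `> 0.75`) well defined, since any comparison against `NaN` would silently be false.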

Embedding Model Choice

For build-time embedding generation, use the same model as the semantic cache:

| Option | Model | Dimensions | Quality |
|---|---|---|---|
| Primary | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Good multilingual |
| Fallback | all-MiniLM-L6-v2 | 384 | English-focused |
| High Quality | multilingual-e5-large | 1024 | Best multilingual |

Build-time generation: Type and subtype embeddings are generated ONCE at build time and stored in JSON, so the vocabulary is never re-embedded at runtime. Only the query itself is embedded, and only when the termLog fast path finds no match.

TypesVocabulary JSON Structure

Generated at build time with pre-computed embeddings:

{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,
  
  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },
  
  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },
  
  "termLog": {
    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
  },
  
  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },
  
  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}

Key Additions for Embedding Support

| Field | Purpose |
|---|---|
| tier1Embeddings | Pre-computed embeddings for each Types file (19 categories) |
| tier2Embeddings | Pre-computed embeddings for each subtype (500+ types) |
| termLog | Fast O(1) lookup table for exact keyword matches |
| accumulatedTerms | Raw text used to generate embeddings (for debugging/regeneration) |
| embeddingModel | Model used to generate embeddings (for reproducibility) |
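Since stored vectors are only comparable with query vectors of the same dimensionality, a loader could sanity-check the file before use. The following is a minimal sketch, assuming the `embeddingDimensions`, `tier1Embeddings`, and `tier2Embeddings` fields shown above; `EmbeddingSection` and `findDimensionMismatches` are hypothetical names.

```typescript
// Hypothetical view of the embedding-related fields of types-vocab.json.
interface EmbeddingSection {
  embeddingDimensions: number;
  tier1Embeddings: Record<string, number[]>;
  tier2Embeddings: Record<string, Record<string, number[]>>;
}

// Collect a message for every stored vector whose length differs from
// the declared embeddingDimensions. An empty result means the file is consistent.
function findDimensionMismatches(vocab: EmbeddingSection): string[] {
  const problems: string[] = [];
  const check = (label: string, vector: number[]): void => {
    if (vector.length !== vocab.embeddingDimensions) {
      problems.push(`${label}: expected ${vocab.embeddingDimensions}, got ${vector.length}`);
    }
  };
  for (const [typeName, vector] of Object.entries(vocab.tier1Embeddings)) {
    check(`tier1.${typeName}`, vector);
  }
  for (const [typeCode, subtypes] of Object.entries(vocab.tier2Embeddings)) {
    for (const [subtypeName, vector] of Object.entries(subtypes)) {
      check(`tier2.${typeCode}.${subtypeName}`, vector);
    }
  }
  return problems;
}
```

A mismatch usually means the file was generated with a different model than the one named in `embeddingModel`, in which case regenerating the vocabulary is safer than continuing.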

Enhanced ExtractedEntities Interface

export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;
  
  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;        // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;      // e.g., 'Q8362876' for LOD integration
}

Enhanced Cache Key Format

{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}

Examples:
- "count:m:amsterdam"                        # Basic museum count
- "count:m.art_museum:amsterdam"             # Art museum count (subtype)
- "list:a.municipal_archive:nh"              # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht"           # Civil registry in Utrecht
- "info:a.national_archive::nl"              # National archive info (empty record set segment, country-level location "nl")
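Key construction from extracted entities can be sketched as follows. This is an illustration under assumptions, not the existing implementation: `CacheKeyParts` and `buildCacheKey` are hypothetical names, and the sketch omits the record set segment entirely when no record set type was extracted.

```typescript
// Hypothetical input shape, using the entity fields named in this rule.
interface CacheKeyParts {
  intent: string;
  institutionType: string;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

// Build "{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}",
// lowercased so that "M" and "m" produce the same key.
function buildCacheKey(parts: CacheKeyParts): string {
  let typeSegment = parts.institutionType;
  if (parts.institutionSubtype) {
    typeSegment += `.${parts.institutionSubtype}`;
  }
  const segments = [parts.intent, typeSegment];
  if (parts.recordSetType) {
    segments.push(parts.recordSetType);
  }
  segments.push(parts.location);
  return segments.join(":").toLowerCase();
}
```

For example, `buildCacheKey({ intent: "count", institutionType: "M", institutionSubtype: "ART_MUSEUM", location: "amsterdam" })` yields `"count:m.art_museum:amsterdam"`.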

Implementation Files

| File | Purpose |
|---|---|
| scripts/extract-types-vocab.ts | Build-time vocabulary extraction from LinkML |
| apps/archief-assistent/public/types-vocab.json | Generated vocabulary file |
| apps/archief-assistent/src/lib/types-vocabulary.ts | Runtime vocabulary loader |
| apps/archief-assistent/src/lib/semantic-cache.ts | Updated entity extraction |

Build Integration

Add to apps/archief-assistent/package.json:

{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}

Keyword Extraction Priority

When extracting keywords from schema files:

  1. keywords array (highest priority) - Explicit search terms
  2. structured_aliases.literal_form - Multilingual alternative names
  3. type_label - Preferred labels per language
  4. Class name conversion - MunicipalArchive → "municipal archive"
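The priority order above can be sketched as a single collection step. This is a minimal sketch, assuming the schema class has already been parsed into a plain object; `SchemaClass`, `classNameToTerm`, and `collectTerms` are hypothetical names, and the field names are simplified stand-ins for the LinkML slots.

```typescript
// Hypothetical, simplified view of one parsed LinkML class.
interface SchemaClass {
  name: string;                 // e.g. "MunicipalArchive"
  keywords?: string[];          // priority 1
  structuredAliases?: string[]; // priority 2: literal_form values
  typeLabels?: string[];        // priority 3: labels per language
}

// Priority 4: "MunicipalArchive" -> "municipal archive"
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1 $2").toLowerCase();
}

// Collect terms in priority order, lowercased and deduplicated so that
// the highest-priority occurrence of a term determines its position.
function collectTerms(cls: SchemaClass): string[] {
  const ordered = [
    ...(cls.keywords ?? []),
    ...(cls.structuredAliases ?? []),
    ...(cls.typeLabels ?? []),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of ordered) {
    const term = raw.trim().toLowerCase();
    if (term.length > 0 && !seen.has(term)) {
      seen.add(term);
      result.push(term);
    }
  }
  return result;
}
```

The class name conversion acts as a last-resort fallback, so a class with no explicit keywords still contributes at least one English term.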

Cache Segmentation Rules

Rule 1: Subtype Specificity

Queries with specific subtypes should NOT match generic type cache entries:

Query: "kunstmusea in Amsterdam"     → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam"         → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅

Rule 2: Record Set Type Isolation

Queries about specific record types should cache separately:

Query: "burgerlijke stand Utrecht"   → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅

Rule 3: No Subtype-to-Type Fallback

Generic queries must NOT match subtype cache entries: a cached answer for a subtype covers only a subset of what the generic query asks for.

Query: "musea in Amsterdam"          → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam"    → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query)
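All three example results (each a MISS) follow from requiring exact equality on every extracted field before a cached entry is reused. The following is a minimal sketch, assuming entities have been normalized to lowercase; `SegmentEntities` and `entitiesMatch` are hypothetical names.

```typescript
// Hypothetical subset of the extracted entity fields used for segmentation.
interface SegmentEntities {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

// A cached entry may be reused only if every segmentation field is equal.
// Missing and null values are treated as the same "not specified" state.
function entitiesMatch(query: SegmentEntities, cached: SegmentEntities): boolean {
  const fields: (keyof SegmentEntities)[] = [
    "intent",
    "institutionType",
    "institutionSubtype",
    "recordSetType",
    "location",
  ];
  return fields.every((field) => (query[field] ?? null) === (cached[field] ?? null));
}
```

Treating a missing subtype as its own value is what blocks matches in both directions: a specific query never receives a generic answer, and a generic query never receives a subset answer.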

Migration Notes

  1. Backwards Compatible: Existing cache entries without institutionSubtype continue to work
  2. Gradual Rollout: New cache entries get subtype, old entries remain valid
  3. Cache Clear: Consider clearing cache after deployment to ensure consistency

Validation

Run E2E tests to verify:

cd apps/archief-assistent
npm run test:e2e

Key test cases:

  • Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
  • Subtype isolation (kunstmuseum ≠ museum)
  • Record set isolation (burgerlijke stand ≠ archive)
  • Intent isolation (count ≠ list ≠ info)

References

  • Rule 41: Types classes define SPARQL template variables
  • Rule 0b: Type/Types file naming convention
  • CustodianType.yaml: Base taxonomy definition
  • AGENTS.md: GLAMORCUBESFIXPHDNT taxonomy documentation

Created: 2026-01-10
Author: OpenCode Agent
Status: Implementing