feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation

Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46):

- types-vocab.json: 10,142 lines of institution type vocabulary from LinkML
  - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels
  - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.)
  - Extracted from CustodianType.yaml and CustodianTypes.yaml
- types-vocabulary.ts: TypeScript module for entity extraction
  - Exports INSTITUTION_TYPES with regex patterns per type code
  - Replaces hardcoded patterns with schema-derived vocabulary
  - Supports multilingual matching
- Rule 46 documentation (.opencode/rules/)
  - Specifies vocabulary extraction workflow
  - Defines cache key generation algorithm
  - Migration path from hardcoded patterns
2026-01-10 12:57:03 +01:00 · 2026-01-10 12:57:03 +01:00 · 01b9d77566
commit 01b9d77566
parent 30cd8842d9
3 changed files with 11077 additions and 0 deletions
--- a/.opencode/rules/ontology-driven-cache-segmentation.md
+++ b/.opencode/rules/ontology-driven-cache-segmentation.md
@@ -0,0 +1,503 @@
+# Rule 46: Ontology-Driven Cache Segmentation
+
+🚨 **CRITICAL**: The semantic cache MUST use vocabulary derived from LinkML `*Type.yaml` and `*Types.yaml` schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.
+
+## Problem Statement
+
+The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:
+
+```
+Query: "Hoeveel musea in Amsterdam?"
+Cached: "Hoeveel musea in Noord-Holland?"
+Result: BLOCKED (location mismatch) ✅
+```
+
+However, the current implementation uses **hardcoded regex patterns**:
+
+```typescript
+// DEPRECATED: Hardcoded patterns in semantic-cache.ts
+const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
+  M: /\b(muse(um|a|ums?)|musea)/i,
+  A: /\b(archie[fv]en?|archives?|archief)/i,
+  // ... 19 patterns to maintain manually
+};
+```
+
+**Problems with hardcoded patterns**:
+1. **Maintenance burden** - Every new institution type requires code changes
+2. **Missing subtypes** - "kunstmuseum" vs "museum" should cache separately
+3. **No multilingual support** - Only Dutch/English, misses German/French labels
+4. **Duplication** - Same vocabulary exists in LinkML schemas
+5. **No record type awareness** - "burgerlijke stand" queries mixed with general archive queries
+
+## Solution: Schema-Derived Vocabulary
+
+The LinkML schema already contains rich vocabulary:
+
+| Schema File | Content | Cache Utility |
+|-------------|---------|---------------|
+| `CustodianType.yaml` | 19 top-level types | Primary segmentation (M/A/L/G...) |
+| `MuseumType.yaml` | 187+ museum subtypes | Subtype segmentation |
+| `ArchiveOrganizationType.yaml` | 144+ archive subtypes | Subtype segmentation |
+| `*RecordSetTypes.yaml` | Record type taxonomies | Finding aids specificity |
+
+### Vocabulary Sources in Schema
+
+1. **`type_label`** - Multilingual labels via `skos:prefLabel`
+2. **`structured_aliases`** - Language-tagged alternative names
+3. **`keywords`** - Search terms for entity recognition
+4. **`wikidata_entity`** - Linked Data identifiers
+
+## Architecture
+
+### Overview: Two-Tier Embedding Hierarchy
+
+The system uses a **hierarchical embedding approach** for fast semantic routing:
+
+1. **Tier 1: Types File Embeddings** - Which category? (Museum vs Archive vs Library)
+2. **Tier 2: Individual Type Embeddings** - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│  BUILD TIME: Extract vocabulary + generate embeddings                  │
+│                                                                         │
+│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
+│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
+│                          ↓                                              │
+│  scripts/extract-types-vocab.ts                                         │
+│                          ↓                                              │
+│  ┌───────────────────────────────────────────────────────────────────┐  │
+│  │  types-vocab.json                                                 │  │
+│  │  ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }   │  │
+│  │  ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│  │
+│  │  └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│  │
+│  └───────────────────────────────────────────────────────────────────┘  │
+└─────────────────────────────────────────────────────────────────────────┘
+                                    │
+                                    ▼ (loaded at runtime)
+┌─────────────────────────────────────────────────────────────────────────┐
+│  RUNTIME: Two-Tier Semantic Routing                                    │
+│                                                                         │
+│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
+│         ↓                                                               │
+│  ┌─────────────────────────────────────────────────────────────────┐    │
+│  │ TIER 1: Types File Selection                                    │    │
+│  │ Query embedding vs Tier1 embeddings (19 categories)             │    │
+│  │ Result: ArchiveOrganizationType (similarity: 0.89)              │    │
+│  └─────────────────────────────────────────────────────────────────┘    │
+│         ↓                                                               │
+│  ┌─────────────────────────────────────────────────────────────────┐    │
+│  │ TIER 2: Specific Type Selection                                 │    │
+│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)      │    │
+│  │ Result: MunicipalArchive (similarity: 0.94)                     │    │
+│  └─────────────────────────────────────────────────────────────────┘    │
+│         ↓                                                               │
+│  Structured cache key: "count:a.municipal_archive:amsterdam"            │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### Tier 1: Types File Embeddings
+
+Each Types file (e.g., `MuseumType.yaml`, `ArchiveOrganizationType.yaml`) gets ONE embedding
+representing the **accumulated vocabulary** of all types within that file.
+
+**Embedding Text Construction**:
+```
+MuseumType: museum musea kunstmuseum art museum natural history museum 
+            science museum open-air museum ecomuseum virtual museum 
+            heritage farm national museum regional museum university museum
+            [... all keywords from all 187 subtypes ...]
+```
+
+**Purpose**: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.
+
+| Types File | Code | Accumulated Terms Count |
+|------------|------|------------------------|
+| MuseumType | M | ~500+ terms from 187 subtypes |
+| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
+| LibraryType | L | ~200+ terms from subtypes |
+| GalleryType | G | ~100+ terms from subtypes |
+| ... | ... | ... |
+
+### Tier 2: Individual Type Embeddings
+
+Each **specific type** within a Types file gets its own embedding from its accumulated terms.
+
+**Embedding Text Construction**:
+```
+MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
+                  town archive local government records burgerlijke stand
+                  bevolkingsregister council minutes building permits
+                  [... all keywords + structured_aliases + labels ...]
+```
+
+**Purpose**: Precise subtype identification after Tier 1 narrows the category.
+
+### Term Log Structure
+
+A lookup table mapping every extracted term to its type/subtype:
+
+```json
+{
+  "termLog": {
+    "kunstmuseum": {
+      "typeCode": "M",
+      "typeName": "MuseumType",
+      "subtypeName": "ART_MUSEUM",
+      "wikidata": "Q207694",
+      "lang": "nl"
+    },
+    "art museum": {
+      "typeCode": "M",
+      "typeName": "MuseumType",
+      "subtypeName": "ART_MUSEUM",
+      "wikidata": "Q207694",
+      "lang": "en"
+    },
+    "gemeentearchief": {
+      "typeCode": "A",
+      "typeName": "ArchiveOrganizationType",
+      "subtypeName": "MUNICIPAL_ARCHIVE",
+      "wikidata": "Q8362876",
+      "lang": "nl"
+    }
+  }
+}
+```
+
+**Purpose**:
+1. Fast keyword lookup with no embedding call: an exact term is an O(1) key lookup, while finding terms inside a free-text query is a linear scan over the term list
+2. Audit trail of which terms map to which types
+3. Debugging which queries match which types
+
+### Runtime Lookup Strategy
+
+```typescript
+async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
+  const vocab = await loadTypesVocabulary();
+  const normalized = query.toLowerCase();
+  
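+  // FAST PATH: Check termLog for exact keyword matches.
+  // Try longer terms first so "kunstmuseum" is not shadowed by "museum".
+  const terms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
+  for (const term of terms) {
+    if (normalized.includes(term)) {
+      const mapping = vocab.termLog[term];
+      return {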
+        institutionType: mapping.typeCode,
+        institutionSubtype: mapping.subtypeName,
+        subtypeWikidata: mapping.wikidata,
+        // ... location and intent extraction
+      };
+    }
+  }
+  
+  // SLOW PATH: Embedding-based semantic matching
+  const queryEmbedding = await generateEmbedding(query);
+  
+  // Tier 1: Find best matching Types file
+  let bestType: string | null = null;
+  let bestTypeSimilarity = 0;
+  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
+    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
+    if (similarity > bestTypeSimilarity && similarity > 0.7) {
+      bestTypeSimilarity = similarity;
+      bestType = typeName;
+    }
+  }
+  
+  if (!bestType) return {}; // No type matched
+  
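+  // Tier 2: Find best matching subtype within the Types file.
+  // institutionTypes is keyed by type code ("M", "A"), so resolve bestType via className.
+  const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === bestType);
+  if (!typeInfo) return {};
+  const typeCode = typeInfo.code;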
+  let bestSubtype: string | null = null;
+  let bestSubtypeSimilarity = 0;
+  
+  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
+    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
+    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
+      bestSubtypeSimilarity = similarity;
+      bestSubtype = subtypeName;
+    }
+  }
+  
+  return {
+    institutionType: typeCode,
+    institutionSubtype: bestSubtype,
+    // ... location and intent extraction
+  };
+}
+```
+
+### Embedding Model Choice
+
+For build-time embedding generation, use the same model as the semantic cache:
+
+| Option | Model | Dimensions | Quality |
+|--------|-------|------------|---------|
+| **Primary** | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Good multilingual |
+| Fallback | `all-MiniLM-L6-v2` | 384 | English-focused |
+| High Quality | `multilingual-e5-large` | 1024 | Best multilingual |
+
+**Build-time generation**: Vocabulary embeddings are generated ONCE at build time and stored in JSON.
+At runtime only the query itself has to be embedded, and only when the term log lookup finds no match.
+
+## TypesVocabulary JSON Structure
+
+Generated at build time with **pre-computed embeddings**:
+
+```json
+{
+  "version": "2026-01-10T12:00:00Z",
+  "schemaVersion": "20251121",
+  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
+  "embeddingDimensions": 384,
+  
+  "tier1Embeddings": {
+    "MuseumType": [0.023, -0.045, 0.087, ...],
+    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
+    "LibraryType": [-0.034, 0.089, 0.012, ...],
+    "GalleryType": [0.045, -0.023, 0.067, ...]
+  },
+  
+  "tier2Embeddings": {
+    "M": {
+      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
+      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
+      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
+    },
+    "A": {
+      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
+      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
+      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
+    }
+  },
+  
+  "termLog": {
+    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
+    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
+    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
+    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
+    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
+    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
+    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
+  },
+  
+  "institutionTypes": {
+    "M": {
+      "code": "M",
+      "className": "MuseumType",
+      "baseWikidata": "Q33506",
+      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
+      "keywords": {
+        "nl": ["museum", "musea"],
+        "en": ["museum", "museums"],
+        "de": ["Museum", "Museen"]
+      },
+      "subtypes": {
+        "ART_MUSEUM": {
+          "className": "ArtMuseum",
+          "wikidata": "Q207694",
+          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
+          "keywords": {
+            "nl": ["kunstmuseum", "kunstmusea"],
+            "en": ["art museum", "art museums"]
+          }
+        },
+        "NATURAL_HISTORY_MUSEUM": {
+          "className": "NaturalHistoryMuseum",
+          "wikidata": "Q559049",
+          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
+          "keywords": {
+            "nl": ["natuurhistorisch museum", "natuurmuseum"],
+            "en": ["natural history museum"]
+          }
+        }
+      }
+    },
+    "A": {
+      "code": "A",
+      "className": "ArchiveOrganizationType",
+      "baseWikidata": "Q166118",
+      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
+      "keywords": {
+        "nl": ["archief", "archieven"],
+        "en": ["archive", "archives"]
+      },
+      "subtypes": {
+        "MUNICIPAL_ARCHIVE": {
+          "className": "MunicipalArchive",
+          "wikidata": "Q8362876",
+          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
+          "keywords": {
+            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
+            "en": ["municipal archive", "city archive", "town archive"]
+          }
+        },
+        "NATIONAL_ARCHIVE": {
+          "className": "NationalArchive",
+          "wikidata": "Q1188452",
+          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
+          "keywords": {
+            "nl": ["nationaal archief", "rijksarchief"],
+            "en": ["national archive", "state archive"]
+          }
+        }
+      }
+    }
+  },
+  
+  "recordSetTypes": {
+    "CIVIL_REGISTRY": {
+      "className": "CivilRegistrySeries",
+      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
+      "keywords": {
+        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
+        "en": ["civil registry", "birth records", "marriage records", "death records"]
+      }
+    },
+    "COUNCIL_GOVERNANCE": {
+      "className": "CouncilGovernanceFonds",
+      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
+      "keywords": {
+        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
+        "en": ["council minutes", "ordinances", "resolutions"]
+      }
+    }
+  }
+}
+```
+
+### Key Additions for Embedding Support
+
+| Field | Purpose |
+|-------|---------|
+| `tier1Embeddings` | Pre-computed embeddings for each Types file (19 categories) |
+| `tier2Embeddings` | Pre-computed embeddings for each subtype (500+ types) |
+| `termLog` | Fast O(1) lookup table for exact keyword matches |
+| `accumulatedTerms` | Raw text used to generate embeddings (for debugging/regeneration) |
+| `embeddingModel` | Model used to generate embeddings (for reproducibility) |
+
+## Enhanced ExtractedEntities Interface
+
+```typescript
+export interface ExtractedEntities {
+  // Existing fields
+  institutionType?: InstitutionTypeCode | null;
+  location?: string | null;
+  locationType?: 'city' | 'province' | null;
+  intent?: 'count' | 'list' | 'info' | null;
+  
+  // NEW: Ontology-derived fields
+  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
+  recordSetType?: string | null;        // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
+  subtypeWikidata?: string | null;      // e.g., 'Q8362876' for LOD integration
+}
+```
+
+## Enhanced Cache Key Format
+
+```
+{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
+
+Examples:
+- "count:m:amsterdam"                        # Basic museum count
+- "count:m.art_museum:amsterdam"             # Art museum count (subtype)
+- "list:a.municipal_archive:nh"              # Municipal archives in Noord-Holland
+- "query:a:civil_registry:utrecht"           # Civil registry in Utrecht
+- "info:a.national_archive:nl"               # National archive info (country-level, no record set type)
+```
+
+## Implementation Files
+
+| File | Purpose |
+|------|---------|
+| `scripts/extract-types-vocab.ts` | Build-time vocabulary extraction from LinkML |
+| `apps/archief-assistent/public/types-vocab.json` | Generated vocabulary file |
+| `apps/archief-assistent/src/lib/types-vocabulary.ts` | Runtime vocabulary loader |
+| `apps/archief-assistent/src/lib/semantic-cache.ts` | Updated entity extraction |
+
+## Build Integration
+
+Add to `apps/archief-assistent/package.json`:
+
+```json
+{
+  "scripts": {
+    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
+    "build": "vite build"
+  }
+}
+```
+
+## Keyword Extraction Priority
+
+When extracting keywords from schema files:
+
+1. **`keywords`** array (highest priority) - Explicit search terms
+2. **`structured_aliases.literal_form`** - Multilingual alternative names
+3. **`type_label`** - Preferred labels per language
+4. **Class name conversion** - `MunicipalArchive` → "municipal archive"
+
+## Cache Segmentation Rules
+
+### Rule 1: Subtype Specificity
+
+Queries with **specific subtypes** should NOT match **generic type** cache entries:
+
+```
+Query: "kunstmusea in Amsterdam"     → key: "count:m.art_museum:amsterdam"
+Cached: "musea in Amsterdam"         → key: "count:m:amsterdam"
+Result: MISS (subtype mismatch) ✅
+```
+
+### Rule 2: Record Set Type Isolation
+
+Queries about **specific record types** should cache separately:
+
+```
+Query: "burgerlijke stand Utrecht"   → key: "query:a:civil_registry:utrecht"
+Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
+Result: MISS (record set type mismatch) ✅
+```
+
+### Rule 3: No Subtype-to-Type Fallback
+
+Generic queries must NOT be served from subtype cache entries, because a subtype answer covers only a subset of what the generic query asks for:
+
+```
+Query: "musea in Amsterdam"          → key: "count:m:amsterdam"
+Cached: "kunstmusea in Amsterdam"    → key: "count:m.art_museum:amsterdam"
+Result: MISS (don't return subset for superset query)
+```
+
+## Migration Notes
+
+1. **Backwards Compatible**: Existing cache entries without `institutionSubtype` continue to work
+2. **Gradual Rollout**: New cache entries get subtype, old entries remain valid
+3. **Cache Clear**: Consider clearing cache after deployment to ensure consistency
+
+## Validation
+
+Run E2E tests to verify:
+
+```bash
+cd apps/archief-assistent
+npm run test:e2e
+```
+
+Key test cases:
+- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
+- Subtype isolation (kunstmuseum ≠ museum)
+- Record set isolation (burgerlijke stand ≠ archive)
+- Intent isolation (count ≠ list ≠ info)
+
+## References
+
+- **Rule 41**: Types classes define SPARQL template variables
+- **Rule 0b**: Type/Types file naming convention
+- **CustodianType.yaml**: Base taxonomy definition
+- **AGENTS.md**: GLAMORCUBESFIXPHDNT taxonomy documentation
+
+---
+
+**Created**: 2026-01-10
+**Author**: OpenCode Agent
+**Status**: Implementing
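
The cache key format and segmentation rules in Rule 46 can be sketched as a small key builder. This is a minimal sketch, not the code in `semantic-cache.ts`: `buildCacheKey`, `KeyEntities`, and the `ANY` placeholder are hypothetical names. It also assumes that keys are lowercased, that absent optional segments are omitted, and that a missing intent falls back to `query` (as in the "burgerlijke stand Utrecht" example).

```typescript
// Hypothetical shape: a subset of the ExtractedEntities fields from Rule 46.
interface KeyEntities {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

// Assumed placeholder for a missing type or location; the rule does not specify one.
const ANY = 'any';

// Builds {intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
function buildCacheKey(e: KeyEntities): string {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, '_');
  let typePart = e.institutionType ? norm(e.institutionType) : ANY;
  if (e.institutionSubtype) typePart += '.' + norm(e.institutionSubtype);
  const parts = [e.intent ? norm(e.intent) : 'query', typePart];
  if (e.recordSetType) parts.push(norm(e.recordSetType));
  parts.push(e.location ? norm(e.location) : ANY);
  return parts.join(':');
}

// Comparing keys for exact equality keeps subtype and generic entries apart.
const generic = buildCacheKey({ intent: 'count', institutionType: 'M', location: 'Amsterdam' });
const art = buildCacheKey({
  intent: 'count',
  institutionType: 'M',
  institutionSubtype: 'ART_MUSEUM',
  location: 'Amsterdam',
});
console.log(generic); // count:m:amsterdam
console.log(art);     // count:m.art_museum:amsterdam
```

Because the two keys differ, a lookup by exact key equality yields a MISS in both directions, which matches the results shown for Rule 1 and Rule 3.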
--- a/apps/archief-assistent/public/types-vocab.json
+++ b/apps/archief-assistent/public/types-vocab.json
--- a/apps/archief-assistent/src/lib/types-vocabulary.ts
+++ b/apps/archief-assistent/src/lib/types-vocabulary.ts
@@ -0,0 +1,432 @@
+/**
+ * types-vocabulary.ts
+ * 
+ * Runtime loader for the TypesVocabulary extracted from LinkML schema files.
+ * Provides two-tier semantic routing for entity extraction:
+ * 
+ * 1. Fast Path: termLog lookup for exact keyword matches (no embeddings needed)
+ * 2. Slow Path: Embedding-based similarity for fuzzy semantic matching
+ * 
+ * See: .opencode/rules/ontology-driven-cache-segmentation.md (Rule 46)
+ */
+
+import type { InstitutionTypeCode } from './semantic-cache';
+
+// ============================================================================
+// Types
+// ============================================================================
+
+export interface TermLogEntry {
+  typeCode: string;
+  typeName: string;
+  subtypeName?: string;
+  recordSetType?: string;
+  wikidata?: string;
+  lang: string;
+}
+
+export interface SubtypeInfo {
+  className: string;
+  wikidata?: string;
+  accumulatedTerms: string;
+  keywords: Record<string, string[]>;
+}
+
+export interface TypeInfo {
+  code: string;
+  className: string;
+  baseWikidata?: string;
+  accumulatedTerms: string;
+  keywords: Record<string, string[]>;
+  subtypes: Record<string, SubtypeInfo>;
+}
+
+export interface RecordSetTypeInfo {
+  className: string;
+  accumulatedTerms: string;
+  keywords: Record<string, string[]>;
+}
+
+export interface TypesVocabulary {
+  version: string;
+  schemaVersion: string;
+  embeddingModel: string;
+  embeddingDimensions: number;
+  tier1Embeddings: Record<string, number[]>;
+  tier2Embeddings: Record<string, Record<string, number[]>>;
+  termLog: Record<string, TermLogEntry>;
+  institutionTypes: Record<string, TypeInfo>;
+  recordSetTypes: Record<string, RecordSetTypeInfo>;
+}
+
+export interface VocabularyMatch {
+  typeCode: InstitutionTypeCode;
+  typeName: string;
+  subtypeName?: string;
+  recordSetType?: string;
+  wikidata?: string;
+  matchedTerm: string;
+  matchMethod: 'exact' | 'embedding_tier1' | 'embedding_tier2';
+  confidence: number;
+}
+
+// ============================================================================
+// Vocabulary Singleton
+// ============================================================================
+
+let vocabularyCache: TypesVocabulary | null = null;
+let loadPromise: Promise<TypesVocabulary> | null = null;
+
+/**
+ * Load the TypesVocabulary from the static JSON file.
+ * Caches the result for subsequent calls.
+ */
+export async function loadTypesVocabulary(): Promise<TypesVocabulary> {
+  if (vocabularyCache) return vocabularyCache;
+  
+  if (loadPromise) return loadPromise;
+  
+  loadPromise = (async () => {
+    try {
+      const response = await fetch('/types-vocab.json');
+      if (!response.ok) {
+        console.warn('[TypesVocabulary] Failed to load vocabulary:', response.status);
+        return createEmptyVocabulary();
+      }
+      
+      vocabularyCache = await response.json();
+      console.log(
+        `[TypesVocabulary] Loaded: ${Object.keys(vocabularyCache!.institutionTypes).length} types, ` +
+        `${Object.keys(vocabularyCache!.termLog).length} terms`
+      );
+      return vocabularyCache!;
+    } catch (error) {
+      console.warn('[TypesVocabulary] Error loading vocabulary:', error);
+      return createEmptyVocabulary();
+    }
+  })();
+  
+  return loadPromise;
+}
+
+function createEmptyVocabulary(): TypesVocabulary {
+  return {
+    version: 'empty',
+    schemaVersion: '',
+    embeddingModel: '',
+    embeddingDimensions: 0,
+    tier1Embeddings: {},
+    tier2Embeddings: {},
+    termLog: {},
+    institutionTypes: {},
+    recordSetTypes: {},
+  };
+}
+
+// ============================================================================
+// Fast Path: Term Log Lookup
+// ============================================================================
+
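+/**
+ * Term log lookup for exact keyword matches.
+ * This is the preferred method - no embeddings needed. Note that it sorts and
+ * scans all terms on each call (longest first), so the cost grows with the
+ * vocabulary size; only a single-term check such as hasTerm() is O(1).
+ * 
+ * @param query - Query text (lowercased internally before matching)
+ * @returns Match info if a term is found, null otherwise
+ */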
+export async function lookupTermLog(query: string): Promise<VocabularyMatch | null> {
+  const vocab = await loadTypesVocabulary();
+  const normalized = query.toLowerCase();
+  
+  // Sort terms by length (longest first) to match most specific terms
+  const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
+  
+  for (const term of sortedTerms) {
+    if (normalized.includes(term)) {
+      const entry = vocab.termLog[term];
+      return {
+        typeCode: entry.typeCode as InstitutionTypeCode,
+        typeName: entry.typeName,
+        subtypeName: entry.subtypeName,
+        recordSetType: entry.recordSetType,
+        wikidata: entry.wikidata,
+        matchedTerm: term,
+        matchMethod: 'exact',
+        confidence: 1.0,
+      };
+    }
+  }
+  
+  return null;
+}
+
+/**
+ * Get all matching terms from the term log (for multi-entity queries).
+ * 
+ * @param query - Normalized query text (lowercase)
+ * @returns Array of all matching terms
+ */
+export async function lookupAllTerms(query: string): Promise<VocabularyMatch[]> {
+  const vocab = await loadTypesVocabulary();
+  const normalized = query.toLowerCase();
+  const matches: VocabularyMatch[] = [];
+  
+  // Sort terms by length (longest first)
+  const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
+  const matchedPositions = new Set<number>();
+  
+  for (const term of sortedTerms) {
+    // Check every occurrence, not just the first: an early occurrence may sit
+    // inside a longer match ("museum" in "kunstmuseum") while a later one stands alone.
+    let index = normalized.indexOf(term);
+    while (index !== -1) {
+      // Check if this position range is already matched by a longer term
+      let alreadyMatched = false;
+      for (let i = index; i < index + term.length; i++) {
+        if (matchedPositions.has(i)) {
+          alreadyMatched = true;
+          break;
+        }
+      }
+      
+      if (!alreadyMatched) {
+        // Mark positions as matched
+        for (let i = index; i < index + term.length; i++) {
+          matchedPositions.add(i);
+        }
+        
+        const entry = vocab.termLog[term];
+        matches.push({
+          typeCode: entry.typeCode as InstitutionTypeCode,
+          typeName: entry.typeName,
+          subtypeName: entry.subtypeName,
+          recordSetType: entry.recordSetType,
+          wikidata: entry.wikidata,
+          matchedTerm: term,
+          matchMethod: 'exact',
+          confidence: 1.0,
+        });
+        break; // Report each term at most once
+      }
+      
+      // This occurrence overlaps a longer match; look for the next one
+      index = normalized.indexOf(term, index + 1);
+    }
+  }
+  
+  return matches;
+}
+
+// ============================================================================
+// Slow Path: Embedding-Based Matching
+// ============================================================================
+
+/**
+ * Compute cosine similarity between two vectors.
+ */
+function cosineSimilarity(a: number[], b: number[]): number {
+  if (a.length !== b.length || a.length === 0) return 0;
+  
+  let dotProduct = 0;
+  let normA = 0;
+  let normB = 0;
+  
+  for (let i = 0; i < a.length; i++) {
+    dotProduct += a[i] * b[i];
+    normA += a[i] * a[i];
+    normB += b[i] * b[i];
+  }
+  
+  const magnitude = Math.sqrt(normA) * Math.sqrt(normB);
+  return magnitude === 0 ? 0 : dotProduct / magnitude;
+}
+
+/**
+ * Tier 1: Find the best matching institution type category.
+ * Uses pre-computed embeddings for each Types file.
+ * 
+ * @param queryEmbedding - Embedding of the user's query
+ * @param threshold - Minimum similarity threshold (default 0.7)
+ * @returns Best matching type info or null
+ */
+export async function matchTier1(
+  queryEmbedding: number[],
+  threshold: number = 0.7
+): Promise<{ typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null> {
+  const vocab = await loadTypesVocabulary();
+  
+  let bestMatch: { typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null = null;
+  
+  for (const [typeName, embedding] of Object.entries(vocab.tier1Embeddings)) {
+    if (embedding.length === 0) continue; // Skip empty embeddings
+    
+    const similarity = cosineSimilarity(queryEmbedding, embedding);
+    
+    if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
+      // Find the type code for this type name
+      const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === typeName);
+      if (typeInfo) {
+        bestMatch = {
+          typeName,
+          typeCode: typeInfo.code as InstitutionTypeCode,
+          similarity,
+        };
+      }
+    }
+  }
+  
+  return bestMatch;
+}
+
+/**
+ * Tier 2: Find the best matching subtype within a category.
+ * Uses pre-computed embeddings for each subtype.
+ * 
+ * @param queryEmbedding - Embedding of the user's query
+ * @param typeCode - The institution type code from Tier 1
+ * @param threshold - Minimum similarity threshold (default 0.75)
+ * @returns Best matching subtype info or null
+ */
+export async function matchTier2(
+  queryEmbedding: number[],
+  typeCode: InstitutionTypeCode,
+  threshold: number = 0.75
+): Promise<{ subtypeName: string; similarity: number } | null> {
+  const vocab = await loadTypesVocabulary();
+  
+  const subtypeEmbeddings = vocab.tier2Embeddings[typeCode];
+  if (!subtypeEmbeddings) return null;
+  
+  let bestMatch: { subtypeName: string; similarity: number } | null = null;
+  
+  for (const [subtypeName, embedding] of Object.entries(subtypeEmbeddings)) {
+    if (embedding.length === 0) continue; // Skip empty embeddings
+    
+    const similarity = cosineSimilarity(queryEmbedding, embedding);
+    
+    if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
+      bestMatch = { subtypeName, similarity };
+    }
+  }
+  
+  return bestMatch;
+}
+
+/**
+ * Full two-tier embedding-based matching.
+ * 
+ * @param queryEmbedding - Embedding of the user's query
+ * @returns Match result or null
+ */
+export async function matchWithEmbeddings(
+  queryEmbedding: number[]
+): Promise<VocabularyMatch | null> {
+  // Tier 1: Find best type category
+  const tier1Match = await matchTier1(queryEmbedding);
+  if (!tier1Match) return null;
+  
+  // Tier 2: Find best subtype within the category
+  const tier2Match = await matchTier2(queryEmbedding, tier1Match.typeCode);
+  
+  return {
+    typeCode: tier1Match.typeCode,
+    typeName: tier1Match.typeName,
+    subtypeName: tier2Match?.subtypeName,
+    matchedTerm: '',
+    matchMethod: tier2Match ? 'embedding_tier2' : 'embedding_tier1',
+    confidence: tier2Match?.similarity ?? tier1Match.similarity,
+  };
+}
+
+// ============================================================================
+// Combined Lookup (Fast Path + Slow Path)
+// ============================================================================
+
+/**
+ * Primary entry point for vocabulary-based entity extraction.
+ * 
+ * Strategy:
+ * 1. First try the fast term log lookup (no embeddings needed)
+ * 2. If no match and embeddings available, try two-tier semantic matching
+ * 
+ * @param query - The user's query text
+ * @param queryEmbedding - Optional embedding for semantic matching
+ * @returns Best match or null
+ */
+export async function extractEntityFromVocabulary(
+  query: string,
+  queryEmbedding?: number[]
+): Promise<VocabularyMatch | null> {
+  // Fast path: Try term log lookup first
+  const termMatch = await lookupTermLog(query);
+  if (termMatch) {
+    return termMatch;
+  }
+  
+  // Slow path: Try embedding-based matching if embeddings available
+  if (queryEmbedding && queryEmbedding.length > 0) {
+    return matchWithEmbeddings(queryEmbedding);
+  }
+  
+  return null;
+}
+
+// ============================================================================
+// Utility Functions
+// ============================================================================
+
+/**
+ * Get all keywords for a specific institution type.
+ */
+export async function getKeywordsForType(typeCode: InstitutionTypeCode): Promise<string[]> {
+  const vocab = await loadTypesVocabulary();
+  const typeInfo = vocab.institutionTypes[typeCode];
+  if (!typeInfo) return [];
+  
+  const keywords: string[] = [];
+  
+  // Add base type keywords
+  for (const terms of Object.values(typeInfo.keywords)) {
+    keywords.push(...terms);
+  }
+  
+  // Add subtype keywords
+  for (const subtype of Object.values(typeInfo.subtypes)) {
+    for (const terms of Object.values(subtype.keywords)) {
+      keywords.push(...terms);
+    }
+  }
+  
+  return [...new Set(keywords)];
+}
+
+/**
+ * Check if a term exists in the vocabulary.
+ */
+export async function hasTerm(term: string): Promise<boolean> {
+  const vocab = await loadTypesVocabulary();
+  return term.toLowerCase() in vocab.termLog;
+}
+
+/**
+ * Get vocabulary statistics.
+ */
+export async function getVocabularyStats(): Promise<{
+  version: string;
+  institutionTypes: number;
+  subtypes: number;
+  recordSetTypes: number;
+  terms: number;
+  hasEmbeddings: boolean;
+}> {
+  const vocab = await loadTypesVocabulary();
+  
+  const subtypeCount = Object.values(vocab.institutionTypes)
+    .reduce((sum, t) => sum + Object.keys(t.subtypes).length, 0);
+  
+  const hasEmbeddings = Object.values(vocab.tier1Embeddings)
+    .some(e => e.length > 0);
+  
+  return {
+    version: vocab.version,
+    institutionTypes: Object.keys(vocab.institutionTypes).length,
+    subtypes: subtypeCount,
+    recordSetTypes: Object.keys(vocab.recordSetTypes).length,
+    terms: Object.keys(vocab.termLog).length,
+    hasEmbeddings,
+  };
+}
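
The two-tier matching done by `matchTier1` and `matchTier2` can be tried in isolation. The following is a minimal sketch with made-up 3-dimensional vectors rather than the real 384-dimensional embeddings. The helpers `cosine` and `best` and all toy data are hypothetical; they only mirror the default thresholds (0.7 and 0.75) used in the module.

```typescript
// Toy embeddings keyed by type code; real vectors would come from types-vocab.json.
const tier1: Record<string, number[]> = {
  M: [1, 0, 0],
  A: [0, 1, 0],
};
const tier2: Record<string, Record<string, number[]>> = {
  A: { MUNICIPAL_ARCHIVE: [0, 1, 0.2], NATIONAL_ARCHIVE: [0.5, 1, -1] },
};

// Cosine similarity; returns 0 for mismatched or zero-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const mag = Math.sqrt(na) * Math.sqrt(nb);
  return mag === 0 ? 0 : dot / mag;
}

// Returns the candidate with the highest similarity above the threshold, or null.
function best(
  queryVec: number[],
  candidates: Record<string, number[]>,
  threshold: number
): string | null {
  let winner: string | null = null;
  let top = threshold;
  for (const [name, vec] of Object.entries(candidates)) {
    const sim = cosine(queryVec, vec);
    if (sim > top) {
      top = sim;
      winner = name;
    }
  }
  return winner;
}

// Tier 1 picks the category, Tier 2 searches only that category's subtypes.
const queryVec = [0.1, 1, 0.2];
const matchedType = best(queryVec, tier1, 0.7);
const matchedSubtype = matchedType ? best(queryVec, tier2[matchedType] ?? {}, 0.75) : null;
console.log(matchedType, matchedSubtype);
```

Restricting Tier 2 to the subtypes of the Tier 1 winner keeps the number of comparisons small, at the cost that a wrong Tier 1 choice cannot be corrected in Tier 2.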