feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation

Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46):

- types-vocab.json: 10,142 lines of institution type vocabulary from LinkML
  - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels
  - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.)
  - Extracted from CustodianType.yaml and CustodianTypes.yaml

- types-vocabulary.ts: TypeScript module for entity extraction
  - Exports INSTITUTION_TYPES with regex patterns per type code
  - Replaces hardcoded patterns with schema-derived vocabulary
  - Supports multilingual matching

- Rule 46 documentation (.opencode/rules/)
  - Specifies vocabulary extraction workflow
  - Defines cache key generation algorithm
  - Migration path from hardcoded patterns
kempersc 2026-01-10 12:57:03 +01:00
parent 30cd8842d9
commit 01b9d77566
3 changed files with 11077 additions and 0 deletions


@ -0,0 +1,503 @@
# Rule 46: Ontology-Driven Cache Segmentation
🚨 **CRITICAL**: The semantic cache MUST use vocabulary derived from LinkML `*Type.yaml` and `*Types.yaml` schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.
## Problem Statement
The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:
```
Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅
```
However, the current implementation uses **hardcoded regex patterns**:
```typescript
// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
M: /\b(muse(um|a|ums?)|musea)/i,
A: /\b(archie[fv]en?|archives?|archief)/i,
// ... 19 patterns to maintain manually
};
```
**Problems with hardcoded patterns**:
1. **Maintenance burden** - Every new institution type requires code changes
2. **Missing subtypes** - "kunstmuseum" vs "museum" should cache separately
3. **Limited multilingual support** - Only Dutch/English are covered; German/French labels are missed
4. **Duplication** - Same vocabulary exists in LinkML schemas
5. **No record type awareness** - "burgerlijke stand" queries mixed with general archive queries
## Solution: Schema-Derived Vocabulary
The LinkML schema already contains rich vocabulary:
| Schema File | Content | Cache Utility |
|-------------|---------|---------------|
| `CustodianType.yaml` | 19 top-level types | Primary segmentation (M/A/L/G...) |
| `MuseumType.yaml` | 187+ museum subtypes | Subtype segmentation |
| `ArchiveOrganizationType.yaml` | 144+ archive subtypes | Subtype segmentation |
| `*RecordSetTypes.yaml` | Record type taxonomies | Finding aids specificity |
### Vocabulary Sources in Schema
1. **`type_label`** - Multilingual labels via `skos:prefLabel`
2. **`structured_aliases`** - Language-tagged alternative names
3. **`keywords`** - Search terms for entity recognition
4. **`wikidata_entity`** - Linked Data identifiers
## Architecture
### Overview: Two-Tier Embedding Hierarchy
The system uses a **hierarchical embedding approach** for fast semantic routing:
1. **Tier 1: Types File Embeddings** - Which category? (Museum vs Archive vs Library)
2. **Tier 2: Individual Type Embeddings** - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)
```
┌─────────────────────────────────────────────────────────────────────────┐
│ BUILD TIME: Extract vocabulary + generate embeddings │
│ │
│ schemas/20251121/linkml/modules/classes/*Type.yaml │
│ schemas/20251121/linkml/modules/classes/*Types.yaml │
│ ↓ │
│ scripts/extract-types-vocab.ts │
│ ↓ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ types-vocab.json │ │
│ │ ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] } │ │
│ │ ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│ │
│ │ └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│ │
│ └───────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│ RUNTIME: Two-Tier Semantic Routing │
│ │
│ Query: "Hoeveel gemeentearchieven in Amsterdam?" │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TIER 1: Types File Selection │ │
│ │ Query embedding vs Tier1 embeddings (19 categories) │ │
│ │ Result: ArchiveOrganizationType (similarity: 0.89) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TIER 2: Specific Type Selection │ │
│ │ Query embedding vs Tier2 embeddings (144 archive subtypes) │ │
│ │ Result: MunicipalArchive (similarity: 0.94) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Structured cache key: "count:A.MUNICIPAL_ARCHIVE:amsterdam" │
└─────────────────────────────────────────────────────────────────────────┘
```
### Tier 1: Types File Embeddings
Each Types file (e.g., `MuseumType.yaml`, `ArchiveOrganizationType.yaml`) gets ONE embedding
representing the **accumulated vocabulary** of all types within that file.
**Embedding Text Construction**:
```
MuseumType: museum musea kunstmuseum art museum natural history museum
science museum open-air museum ecomuseum virtual museum
heritage farm national museum regional museum university museum
[... all keywords from all 187 subtypes ...]
```
**Purpose**: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.
| Types File | Code | Accumulated Terms Count |
|------------|------|------------------------|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |
### Tier 2: Individual Type Embeddings
Each **specific type** within a Types file gets its own embedding from its accumulated terms.
**Embedding Text Construction**:
```
MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
town archive local government records burgerlijke stand
bevolkingsregister council minutes building permits
[... all keywords + structured_aliases + labels ...]
```
**Purpose**: Precise subtype identification after Tier 1 narrows the category.
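The two embedding texts could be assembled along the following lines. This is a minimal sketch, not the extraction script itself: the `SubtypeTerms` and `TypeTerms` shapes and the function names are hypothetical, and it assumes each type and subtype already carries per-language keyword lists like those in `types-vocab.json`.

```typescript
// Hypothetical minimal shapes; the generated vocabulary carries more fields.
interface SubtypeTerms {
  keywords: Record<string, string[]>; // language code -> terms
}

interface TypeTerms extends SubtypeTerms {
  subtypes: Record<string, SubtypeTerms>;
}

// Flatten per-language keyword lists into one lowercased, de-duplicated list.
function flattenKeywords(keywords: Record<string, string[]>): string[] {
  const seen = new Set<string>();
  for (const terms of Object.values(keywords)) {
    for (const term of terms) seen.add(term.toLowerCase());
  }
  return Array.from(seen);
}

// Tier 2 text: the terms of one subtype.
function buildTier2Text(subtype: SubtypeTerms): string {
  return flattenKeywords(subtype.keywords).join(" ");
}

// Tier 1 text: the type's own terms followed by the terms of all its subtypes.
function buildTier1Text(type: TypeTerms): string {
  const seen = new Set<string>(flattenKeywords(type.keywords));
  for (const subtype of Object.values(type.subtypes)) {
    for (const term of flattenKeywords(subtype.keywords)) seen.add(term);
  }
  return Array.from(seen).join(" ");
}
```

De-duplicating before joining keeps a term shared across languages (for example "museum") from being repeated in the text that gets embedded.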
### Term Log Structure
A lookup table mapping every extracted term to its type/subtype:
```json
{
"termLog": {
"kunstmuseum": {
"typeCode": "M",
"typeName": "MuseumType",
"subtypeName": "ART_MUSEUM",
"wikidata": "Q207694",
"lang": "nl"
},
"art museum": {
"typeCode": "M",
"typeName": "MuseumType",
"subtypeName": "ART_MUSEUM",
"wikidata": "Q207694",
"lang": "en"
},
"gemeentearchief": {
"typeCode": "A",
"typeName": "ArchiveOrganizationType",
"subtypeName": "MUNICIPAL_ARCHIVE",
"wikidata": "Q8362876",
"lang": "nl"
}
}
}
```
**Purpose**:
1. Fast keyword lookup without embeddings: O(1) when a whole term is used as the key, a scan over the term list when matching terms inside a longer query
2. Audit trail of which terms map to which types
3. Debugging which queries match which types
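One way such a term log could be derived from the per-type keyword lists is sketched below. The interfaces are trimmed copies of those in `types-vocabulary.ts`; the `buildTermLog` name and the "first writer wins" collision policy are assumptions made for illustration, not something the rule specifies.

```typescript
interface TermLogEntry {
  typeCode: string;
  typeName: string;
  subtypeName?: string;
  wikidata?: string;
  lang: string;
}

interface SubtypeInfo {
  className: string;
  wikidata?: string;
  keywords: Record<string, string[]>;
}

interface TypeInfo {
  code: string;
  className: string;
  baseWikidata?: string;
  keywords: Record<string, string[]>;
  subtypes: Record<string, SubtypeInfo>;
}

function buildTermLog(types: Record<string, TypeInfo>): Record<string, TermLogEntry> {
  const termLog: Record<string, TermLogEntry> = {};

  // Assumption: the first mapping registered for a term is kept.
  const add = (term: string, entry: TermLogEntry): void => {
    const key = term.toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(termLog, key)) termLog[key] = entry;
  };

  for (const type of Object.values(types)) {
    // Generic type terms, e.g. "museum" -> M
    for (const [lang, terms] of Object.entries(type.keywords)) {
      for (const term of terms) {
        add(term, { typeCode: type.code, typeName: type.className, wikidata: type.baseWikidata, lang });
      }
    }
    // Subtype terms, e.g. "kunstmuseum" -> M / ART_MUSEUM
    for (const [subtypeName, subtype] of Object.entries(type.subtypes)) {
      for (const [lang, terms] of Object.entries(subtype.keywords)) {
        for (const term of terms) {
          add(term, { typeCode: type.code, typeName: type.className, subtypeName, wikidata: subtype.wikidata, lang });
        }
      }
    }
  }
  return termLog;
}
```

Because colliding terms are dropped silently here, a real extraction script would probably want to log them, which is where the audit-trail purpose listed above comes in.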
### Runtime Lookup Strategy
```typescript
async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
const vocab = await loadTypesVocabulary();
const normalized = query.toLowerCase();
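// FAST PATH: Check termLog for exact keyword matches.
// Longest terms first, so "kunstmuseum" wins over "museum".
const termEntries = Object.entries(vocab.termLog).sort(([a], [b]) => b.length - a.length);
for (const [term, mapping] of termEntries) {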
if (normalized.includes(term)) {
return {
institutionType: mapping.typeCode,
institutionSubtype: mapping.subtypeName,
subtypeWikidata: mapping.wikidata,
// ... location and intent extraction
};
}
}
// SLOW PATH: Embedding-based semantic matching
const queryEmbedding = await generateEmbedding(query);
// Tier 1: Find best matching Types file
let bestType: string | null = null;
let bestTypeSimilarity = 0;
for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
if (similarity > bestTypeSimilarity && similarity > 0.7) {
bestTypeSimilarity = similarity;
bestType = typeName;
}
}
if (!bestType) return {}; // No type matched
// Tier 2: Find best matching subtype within the Types file
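// tier1Embeddings is keyed by class name, institutionTypes by type code
const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === bestType);
if (!typeInfo) return {}; // No vocabulary entry for the matched class
const typeCode = typeInfo.code;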
let bestSubtype: string | null = null;
let bestSubtypeSimilarity = 0;
for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
bestSubtypeSimilarity = similarity;
bestSubtype = subtypeName;
}
}
return {
institutionType: typeCode,
institutionSubtype: bestSubtype,
// ... location and intent extraction
};
}
```
### Embedding Model Choice
For build-time embedding generation, use the same model as the semantic cache:
| Option | Model | Dimensions | Quality |
|--------|-------|------------|---------|
| **Primary** | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Good multilingual |
| Fallback | `all-MiniLM-L6-v2` | 384 | English-focused |
| High Quality | `multilingual-e5-large` | 1024 | Best multilingual |
**Build-time generation**: Embeddings are generated ONCE at build time and stored in JSON.
This avoids runtime embedding API calls for type classification.
## TypesVocabulary JSON Structure
Generated at build time with **pre-computed embeddings**:
```json
{
"version": "2026-01-10T12:00:00Z",
"schemaVersion": "20251121",
"embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
"embeddingDimensions": 384,
"tier1Embeddings": {
"MuseumType": [0.023, -0.045, 0.087, ...],
"ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
"LibraryType": [-0.034, 0.089, 0.012, ...],
"GalleryType": [0.045, -0.023, 0.067, ...]
},
"tier2Embeddings": {
"M": {
"ART_MUSEUM": [0.034, -0.056, 0.078, ...],
"NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
"SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
},
"A": {
"MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
"NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
"CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
}
},
"termLog": {
"kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
"art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
"gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
"stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
"city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
"burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
"geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
},
"institutionTypes": {
"M": {
"code": "M",
"className": "MuseumType",
"baseWikidata": "Q33506",
"accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
"keywords": {
"nl": ["museum", "musea"],
"en": ["museum", "museums"],
"de": ["Museum", "Museen"]
},
"subtypes": {
"ART_MUSEUM": {
"className": "ArtMuseum",
"wikidata": "Q207694",
"accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
"keywords": {
"nl": ["kunstmuseum", "kunstmusea"],
"en": ["art museum", "art museums"]
}
},
"NATURAL_HISTORY_MUSEUM": {
"className": "NaturalHistoryMuseum",
"wikidata": "Q559049",
"accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
"keywords": {
"nl": ["natuurhistorisch museum", "natuurmuseum"],
"en": ["natural history museum"]
}
}
}
},
"A": {
"code": "A",
"className": "ArchiveOrganizationType",
"baseWikidata": "Q166118",
"accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
"keywords": {
"nl": ["archief", "archieven"],
"en": ["archive", "archives"]
},
"subtypes": {
"MUNICIPAL_ARCHIVE": {
"className": "MunicipalArchive",
"wikidata": "Q8362876",
"accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
"keywords": {
"nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
"en": ["municipal archive", "city archive", "town archive"]
}
},
"NATIONAL_ARCHIVE": {
"className": "NationalArchive",
"wikidata": "Q1188452",
"accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
"keywords": {
"nl": ["nationaal archief", "rijksarchief"],
"en": ["national archive", "state archive"]
}
}
}
}
},
"recordSetTypes": {
"CIVIL_REGISTRY": {
"className": "CivilRegistrySeries",
"accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
"keywords": {
"nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
"en": ["civil registry", "birth records", "marriage records", "death records"]
}
},
"COUNCIL_GOVERNANCE": {
"className": "CouncilGovernanceFonds",
"accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
"keywords": {
"nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
"en": ["council minutes", "ordinances", "resolutions"]
}
}
}
}
```
### Key Additions for Embedding Support
| Field | Purpose |
|-------|---------|
| `tier1Embeddings` | Pre-computed embeddings for each Types file (19 categories) |
| `tier2Embeddings` | Pre-computed embeddings for each subtype (500+ types) |
| `termLog` | Fast O(1) lookup table for exact keyword matches |
| `accumulatedTerms` | Raw text used to generate embeddings (for debugging/regeneration) |
| `embeddingModel` | Model used to generate embeddings (for reproducibility) |
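Since `cosineSimilarity` in `types-vocabulary.ts` returns 0 for vectors of unequal length, a vocabulary file whose vectors disagree with `embeddingDimensions` would fail silently at runtime. A load-time sanity check along these lines could surface that early. It is a sketch only: `findDimensionMismatches` and the `EmbeddingTables` shape are hypothetical names, not part of the commit.

```typescript
// Hypothetical subset of the vocabulary file relevant to embeddings.
interface EmbeddingTables {
  embeddingDimensions: number;
  tier1Embeddings: Record<string, number[]>;
  tier2Embeddings: Record<string, Record<string, number[]>>;
}

// Return one message per vector whose length differs from embeddingDimensions.
function findDimensionMismatches(vocab: EmbeddingTables): string[] {
  const problems: string[] = [];
  const check = (label: string, vector: number[]): void => {
    if (vector.length !== vocab.embeddingDimensions) {
      problems.push(`${label}: expected ${vocab.embeddingDimensions}, got ${vector.length}`);
    }
  };

  for (const [typeName, vector] of Object.entries(vocab.tier1Embeddings)) {
    check(`tier1.${typeName}`, vector);
  }
  for (const [typeCode, subtypes] of Object.entries(vocab.tier2Embeddings)) {
    for (const [subtypeName, vector] of Object.entries(subtypes)) {
      check(`tier2.${typeCode}.${subtypeName}`, vector);
    }
  }
  return problems;
}
```

An empty result means every stored vector has the declared length. Whether a mismatch should abort the build or only warn is left open here.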
## Enhanced ExtractedEntities Interface
```typescript
export interface ExtractedEntities {
// Existing fields
institutionType?: InstitutionTypeCode | null;
location?: string | null;
locationType?: 'city' | 'province' | null;
intent?: 'count' | 'list' | 'info' | null;
// NEW: Ontology-derived fields
institutionSubtype?: string | null; // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
recordSetType?: string | null; // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
subtypeWikidata?: string | null; // e.g., 'Q8362876' for LOD integration
}
```
## Enhanced Cache Key Format
```
{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
Examples:
- "count:m:amsterdam" # Basic museum count
- "count:m.art_museum:amsterdam" # Art museum count (subtype)
- "list:a.municipal_archive:nh" # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht" # Civil registry in Utrecht
- "info:a.national_archive::nl" # National archive info (no location filter)
```
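The bracketed grammar above could be implemented roughly as follows. This is a sketch under stated assumptions: `CacheKeyParts` and `buildCacheKey` are hypothetical names, the key is lowercased as in the examples, and an absent record set type is omitted entirely, so the empty `::` segment of the last example is not reproduced.

```typescript
// Hypothetical input shape; intent is a plain string because the examples
// above also use "query".
interface CacheKeyParts {
  intent: string;
  institutionType: string;
  subtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

function buildCacheKey(parts: CacheKeyParts): string {
  // {institutionType}[.{subtype}]
  const typeSegment = parts.subtype
    ? `${parts.institutionType}.${parts.subtype}`
    : parts.institutionType;

  const segments = [parts.intent, typeSegment];
  // [:{recordSetType}] is only emitted when present (assumption, see above)
  if (parts.recordSetType) segments.push(parts.recordSetType);
  segments.push(parts.location);

  return segments.join(":").toLowerCase();
}
```

For example, `buildCacheKey({ intent: "count", institutionType: "M", subtype: "ART_MUSEUM", location: "Amsterdam" })` yields `count:m.art_museum:amsterdam`.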
## Implementation Files
| File | Purpose |
|------|---------|
| `scripts/extract-types-vocab.ts` | Build-time vocabulary extraction from LinkML |
| `apps/archief-assistent/public/types-vocab.json` | Generated vocabulary file |
| `apps/archief-assistent/src/lib/types-vocabulary.ts` | Runtime vocabulary loader |
| `apps/archief-assistent/src/lib/semantic-cache.ts` | Updated entity extraction |
## Build Integration
Add to `apps/archief-assistent/package.json`:
```json
{
"scripts": {
"prebuild": "tsx ../../scripts/extract-types-vocab.ts",
"build": "vite build"
}
}
```
## Keyword Extraction Priority
When extracting keywords from schema files:
1. **`keywords`** array (highest priority) - Explicit search terms
2. **`structured_aliases.literal_form`** - Multilingual alternative names
3. **`type_label`** - Preferred labels per language
4. **Class name conversion** - `MunicipalArchive` → "municipal archive"
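The priority order above can be sketched as a single collection function. The `ParsedTypeClass` shape is hypothetical: it assumes the YAML has already been parsed and that `type_label` has been reduced to a flat list of label strings, which may not match the actual schema layout.

```typescript
// Hypothetical, already-parsed view of one LinkML type class.
interface ParsedTypeClass {
  keywords?: string[];
  structured_aliases?: { literal_form: string }[];
  type_label?: string[]; // assumption: labels flattened to plain strings
}

// "MunicipalArchive" -> "municipal archive"
function classNameToTerm(className: string): string {
  return className.replace(/([a-z0-9])([A-Z])/g, "$1 $2").toLowerCase();
}

// Collect terms in priority order; the first occurrence of a term wins.
function collectTerms(className: string, cls: ParsedTypeClass): string[] {
  const ordered = [
    ...(cls.keywords ?? []),
    ...(cls.structured_aliases ?? []).map((alias) => alias.literal_form),
    ...(cls.type_label ?? []),
    classNameToTerm(className),
  ];
  const seen = new Set<string>();
  for (const term of ordered) seen.add(term.trim().toLowerCase());
  return Array.from(seen);
}
```

The class-name conversion acts as a last resort, so a class with no keywords, aliases, or labels still contributes one term.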
## Cache Segmentation Rules
### Rule 1: Subtype Specificity
Queries with **specific subtypes** should NOT match **generic type** cache entries:
```
Query: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam" → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅
```
### Rule 2: Record Set Type Isolation
Queries about **specific record types** should cache separately:
```
Query: "burgerlijke stand Utrecht" → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht" → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅
```
### Rule 3: No Subtype-to-Type Fallback
Generic queries must NOT match subtype cache entries, because the cached subtype result is only a subset of what the generic query asks for:
```
Query: "musea in Amsterdam" → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query)
```
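All three worked examples above end in a MISS, which suggests a cache hit requires every segmentation field to agree. A minimal sketch of that comparison follows. `SegmentFields` and `sameSegment` are hypothetical names, and treating a missing field the same as `null` is an assumption chosen so that older entries without `institutionSubtype` still match generic queries.

```typescript
// Hypothetical subset of ExtractedEntities used for segmentation.
interface SegmentFields {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

const SEGMENT_FIELDS = [
  "intent",
  "institutionType",
  "institutionSubtype",
  "recordSetType",
  "location",
] as const;

// A cached entry may be reused only if every field agrees;
// undefined and null are treated as the same "not set" value.
function sameSegment(query: SegmentFields, cached: SegmentFields): boolean {
  return SEGMENT_FIELDS.every(
    (field) => (query[field] ?? null) === (cached[field] ?? null)
  );
}
```

Comparing the structured cache keys as plain strings would give the same result, provided both keys were built by the same function.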
## Migration Notes
1. **Backwards Compatible**: Existing cache entries without `institutionSubtype` continue to work
2. **Gradual Rollout**: New cache entries get subtype, old entries remain valid
3. **Cache Clear**: Consider clearing cache after deployment to ensure consistency
## Validation
Run E2E tests to verify:
```bash
cd apps/archief-assistent
npm run test:e2e
```
Key test cases:
- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
- Subtype isolation (kunstmuseum ≠ museum)
- Record set isolation (burgerlijke stand ≠ archive)
- Intent isolation (count ≠ list ≠ info)
## References
- **Rule 41**: Types classes define SPARQL template variables
- **Rule 0b**: Type/Types file naming convention
- **CustodianType.yaml**: Base taxonomy definition
- **AGENTS.md**: GLAMORCUBESFIXPHDNT taxonomy documentation
---
**Created**: 2026-01-10
**Author**: OpenCode Agent
**Status**: Implementing

File diff suppressed because it is too large


@ -0,0 +1,432 @@
/**
* types-vocabulary.ts
*
* Runtime loader for the TypesVocabulary extracted from LinkML schema files.
* Provides two-tier semantic routing for entity extraction:
*
* 1. Fast Path: O(1) termLog lookup for exact keyword matches
* 2. Slow Path: Embedding-based similarity for fuzzy semantic matching
*
* See: .opencode/rules/ontology-driven-cache-segmentation.md (Rule 46)
*/
import type { InstitutionTypeCode } from './semantic-cache';
// ============================================================================
// Types
// ============================================================================
export interface TermLogEntry {
typeCode: string;
typeName: string;
subtypeName?: string;
recordSetType?: string;
wikidata?: string;
lang: string;
}
export interface SubtypeInfo {
className: string;
wikidata?: string;
accumulatedTerms: string;
keywords: Record<string, string[]>;
}
export interface TypeInfo {
code: string;
className: string;
baseWikidata?: string;
accumulatedTerms: string;
keywords: Record<string, string[]>;
subtypes: Record<string, SubtypeInfo>;
}
export interface RecordSetTypeInfo {
className: string;
accumulatedTerms: string;
keywords: Record<string, string[]>;
}
export interface TypesVocabulary {
version: string;
schemaVersion: string;
embeddingModel: string;
embeddingDimensions: number;
tier1Embeddings: Record<string, number[]>;
tier2Embeddings: Record<string, Record<string, number[]>>;
termLog: Record<string, TermLogEntry>;
institutionTypes: Record<string, TypeInfo>;
recordSetTypes: Record<string, RecordSetTypeInfo>;
}
export interface VocabularyMatch {
typeCode: InstitutionTypeCode;
typeName: string;
subtypeName?: string;
recordSetType?: string;
wikidata?: string;
matchedTerm: string;
matchMethod: 'exact' | 'embedding_tier1' | 'embedding_tier2';
confidence: number;
}
// ============================================================================
// Vocabulary Singleton
// ============================================================================
let vocabularyCache: TypesVocabulary | null = null;
let loadPromise: Promise<TypesVocabulary> | null = null;
/**
* Load the TypesVocabulary from the static JSON file.
* Caches the result for subsequent calls.
*/
export async function loadTypesVocabulary(): Promise<TypesVocabulary> {
if (vocabularyCache) return vocabularyCache;
if (loadPromise) return loadPromise;
loadPromise = (async () => {
try {
const response = await fetch('/types-vocab.json');
if (!response.ok) {
console.warn('[TypesVocabulary] Failed to load vocabulary:', response.status);
return createEmptyVocabulary();
}
vocabularyCache = await response.json();
console.log(
`[TypesVocabulary] Loaded: ${Object.keys(vocabularyCache!.institutionTypes).length} types, ` +
`${Object.keys(vocabularyCache!.termLog).length} terms`
);
return vocabularyCache!;
} catch (error) {
console.warn('[TypesVocabulary] Error loading vocabulary:', error);
return createEmptyVocabulary();
}
})();
return loadPromise;
}
function createEmptyVocabulary(): TypesVocabulary {
return {
version: 'empty',
schemaVersion: '',
embeddingModel: '',
embeddingDimensions: 0,
tier1Embeddings: {},
tier2Embeddings: {},
termLog: {},
institutionTypes: {},
recordSetTypes: {},
};
}
// ============================================================================
// Fast Path: Term Log Lookup
// ============================================================================
/**
* Term log lookup for exact keyword matches.
* Scans all terms (longest first) for a substring match, so the cost grows
* with the number of terms. This is the preferred method - no embeddings needed.
*
* @param query - Query text (lowercased internally)
* @returns Match info for the longest matching term, null otherwise
*/
export async function lookupTermLog(query: string): Promise<VocabularyMatch | null> {
const vocab = await loadTypesVocabulary();
const normalized = query.toLowerCase();
// Sort terms by length (longest first) to match most specific terms
const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
for (const term of sortedTerms) {
if (normalized.includes(term)) {
const entry = vocab.termLog[term];
return {
typeCode: entry.typeCode as InstitutionTypeCode,
typeName: entry.typeName,
subtypeName: entry.subtypeName,
recordSetType: entry.recordSetType,
wikidata: entry.wikidata,
matchedTerm: term,
matchMethod: 'exact',
confidence: 1.0,
};
}
}
return null;
}
/**
* Get all matching terms from the term log (for multi-entity queries).
*
* @param query - Normalized query text (lowercase)
* @returns Array of all matching terms
*/
export async function lookupAllTerms(query: string): Promise<VocabularyMatch[]> {
const vocab = await loadTypesVocabulary();
const normalized = query.toLowerCase();
const matches: VocabularyMatch[] = [];
// Sort terms by length (longest first)
const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
const matchedPositions = new Set<number>();
for (const term of sortedTerms) {
const index = normalized.indexOf(term);
if (index !== -1) {
// Check if this position range is already matched by a longer term
let alreadyMatched = false;
for (let i = index; i < index + term.length; i++) {
if (matchedPositions.has(i)) {
alreadyMatched = true;
break;
}
}
if (!alreadyMatched) {
// Mark positions as matched
for (let i = index; i < index + term.length; i++) {
matchedPositions.add(i);
}
const entry = vocab.termLog[term];
matches.push({
typeCode: entry.typeCode as InstitutionTypeCode,
typeName: entry.typeName,
subtypeName: entry.subtypeName,
recordSetType: entry.recordSetType,
wikidata: entry.wikidata,
matchedTerm: term,
matchMethod: 'exact',
confidence: 1.0,
});
}
}
}
return matches;
}
// ============================================================================
// Slow Path: Embedding-Based Matching
// ============================================================================
/**
* Compute cosine similarity between two vectors.
*/
function cosineSimilarity(a: number[], b: number[]): number {
if (a.length !== b.length || a.length === 0) return 0;
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
const magnitude = Math.sqrt(normA) * Math.sqrt(normB);
return magnitude === 0 ? 0 : dotProduct / magnitude;
}
/**
* Tier 1: Find the best matching institution type category.
* Uses pre-computed embeddings for each Types file.
*
* @param queryEmbedding - Embedding of the user's query
* @param threshold - Minimum similarity threshold (default 0.7)
* @returns Best matching type info or null
*/
export async function matchTier1(
queryEmbedding: number[],
threshold: number = 0.7
): Promise<{ typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null> {
const vocab = await loadTypesVocabulary();
let bestMatch: { typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null = null;
for (const [typeName, embedding] of Object.entries(vocab.tier1Embeddings)) {
if (embedding.length === 0) continue; // Skip empty embeddings
const similarity = cosineSimilarity(queryEmbedding, embedding);
if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
// Find the type code for this type name
const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === typeName);
if (typeInfo) {
bestMatch = {
typeName,
typeCode: typeInfo.code as InstitutionTypeCode,
similarity,
};
}
}
}
return bestMatch;
}
/**
* Tier 2: Find the best matching subtype within a category.
* Uses pre-computed embeddings for each subtype.
*
* @param queryEmbedding - Embedding of the user's query
* @param typeCode - The institution type code from Tier 1
* @param threshold - Minimum similarity threshold (default 0.75)
* @returns Best matching subtype info or null
*/
export async function matchTier2(
queryEmbedding: number[],
typeCode: InstitutionTypeCode,
threshold: number = 0.75
): Promise<{ subtypeName: string; similarity: number } | null> {
const vocab = await loadTypesVocabulary();
const subtypeEmbeddings = vocab.tier2Embeddings[typeCode];
if (!subtypeEmbeddings) return null;
let bestMatch: { subtypeName: string; similarity: number } | null = null;
for (const [subtypeName, embedding] of Object.entries(subtypeEmbeddings)) {
if (embedding.length === 0) continue; // Skip empty embeddings
const similarity = cosineSimilarity(queryEmbedding, embedding);
if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
bestMatch = { subtypeName, similarity };
}
}
return bestMatch;
}
/**
* Full two-tier embedding-based matching.
*
* @param queryEmbedding - Embedding of the user's query
* @returns Match result or null
*/
export async function matchWithEmbeddings(
queryEmbedding: number[]
): Promise<VocabularyMatch | null> {
// Tier 1: Find best type category
const tier1Match = await matchTier1(queryEmbedding);
if (!tier1Match) return null;
// Tier 2: Find best subtype within the category
const tier2Match = await matchTier2(queryEmbedding, tier1Match.typeCode);
return {
typeCode: tier1Match.typeCode,
typeName: tier1Match.typeName,
subtypeName: tier2Match?.subtypeName,
matchedTerm: '',
matchMethod: tier2Match ? 'embedding_tier2' : 'embedding_tier1',
confidence: tier2Match?.similarity || tier1Match.similarity,
};
}
// ============================================================================
// Combined Lookup (Fast Path + Slow Path)
// ============================================================================
/**
* Primary entry point for vocabulary-based entity extraction.
*
* Strategy:
* 1. First try fast O(1) term log lookup (no embeddings needed)
* 2. If no match and embeddings available, try two-tier semantic matching
*
* @param query - The user's query text
* @param queryEmbedding - Optional embedding for semantic matching
* @returns Best match or null
*/
export async function extractEntityFromVocabulary(
query: string,
queryEmbedding?: number[]
): Promise<VocabularyMatch | null> {
// Fast path: Try term log lookup first
const termMatch = await lookupTermLog(query);
if (termMatch) {
return termMatch;
}
// Slow path: Try embedding-based matching if embeddings available
if (queryEmbedding && queryEmbedding.length > 0) {
return matchWithEmbeddings(queryEmbedding);
}
return null;
}
// ============================================================================
// Utility Functions
// ============================================================================
/**
* Get all keywords for a specific institution type.
*/
export async function getKeywordsForType(typeCode: InstitutionTypeCode): Promise<string[]> {
const vocab = await loadTypesVocabulary();
const typeInfo = vocab.institutionTypes[typeCode];
if (!typeInfo) return [];
const keywords: string[] = [];
// Add base type keywords
for (const terms of Object.values(typeInfo.keywords)) {
keywords.push(...terms);
}
// Add subtype keywords
for (const subtype of Object.values(typeInfo.subtypes)) {
for (const terms of Object.values(subtype.keywords)) {
keywords.push(...terms);
}
}
return [...new Set(keywords)];
}
/**
* Check if a term exists in the vocabulary.
*/
export async function hasTerm(term: string): Promise<boolean> {
const vocab = await loadTypesVocabulary();
return term.toLowerCase() in vocab.termLog;
}
/**
* Get vocabulary statistics.
*/
export async function getVocabularyStats(): Promise<{
version: string;
institutionTypes: number;
subtypes: number;
recordSetTypes: number;
terms: number;
hasEmbeddings: boolean;
}> {
const vocab = await loadTypesVocabulary();
const subtypeCount = Object.values(vocab.institutionTypes)
.reduce((sum, t) => sum + Object.keys(t.subtypes).length, 0);
const hasEmbeddings = Object.values(vocab.tier1Embeddings)
.some(e => e.length > 0);
return {
version: vocab.version,
institutionTypes: Object.keys(vocab.institutionTypes).length,
subtypes: subtypeCount,
recordSetTypes: Object.keys(vocab.recordSetTypes).length,
terms: Object.keys(vocab.termLog).length,
hasEmbeddings,
};
}