feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation
Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46):

- types-vocab.json: 10,142 lines of institution type vocabulary from LinkML
  - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels
  - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.)
  - Extracted from CustodianType.yaml and CustodianTypes.yaml
- types-vocabulary.ts: TypeScript module for entity extraction
  - Exports INSTITUTION_TYPES with regex patterns per type code
  - Replaces hardcoded patterns with schema-derived vocabulary
  - Supports multilingual matching
- Rule 46 documentation (.opencode/rules/)
  - Specifies vocabulary extraction workflow
  - Defines cache key generation algorithm
  - Migration path from hardcoded patterns
parent 30cd8842d9
commit 01b9d77566
3 changed files with 11077 additions and 0 deletions
503
.opencode/rules/ontology-driven-cache-segmentation.md
Normal file
|
|
@ -0,0 +1,503 @@
|
|||
# Rule 46: Ontology-Driven Cache Segmentation
|
||||
|
||||
🚨 **CRITICAL**: The semantic cache MUST use vocabulary derived from LinkML `*Type.yaml` and `*Types.yaml` schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:
|
||||
|
||||
```
|
||||
Query: "Hoeveel musea in Amsterdam?"
|
||||
Cached: "Hoeveel musea in Noord-Holland?"
|
||||
Result: BLOCKED (location mismatch) ✅
|
||||
```
|
||||
|
||||
However, the current implementation uses **hardcoded regex patterns**:
|
||||
|
||||
```typescript
// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};
```
|
||||
|
||||
**Problems with hardcoded patterns**:
|
||||
1. **Maintenance burden** - Every new institution type requires code changes
|
||||
2. **Missing subtypes** - "kunstmuseum" vs "museum" should cache separately
|
||||
3. **Limited multilingual support** - Patterns cover only Dutch and English; German and French labels are missed
|
||||
4. **Duplication** - Same vocabulary exists in LinkML schemas
|
||||
5. **No record type awareness** - "burgerlijke stand" queries mixed with general archive queries
|
||||
|
||||
## Solution: Schema-Derived Vocabulary
|
||||
|
||||
The LinkML schema already contains rich vocabulary:
|
||||
|
||||
| Schema File | Content | Cache Utility |
|-------------|---------|---------------|
| `CustodianType.yaml` | 19 top-level types | Primary segmentation (M/A/L/G...) |
| `MuseumType.yaml` | 187+ museum subtypes | Subtype segmentation |
| `ArchiveOrganizationType.yaml` | 144+ archive subtypes | Subtype segmentation |
| `*RecordSetTypes.yaml` | Record type taxonomies | Finding aids specificity |
|
||||
|
||||
### Vocabulary Sources in Schema
|
||||
|
||||
1. **`type_label`** - Multilingual labels via `skos:prefLabel`
|
||||
2. **`structured_aliases`** - Language-tagged alternative names
|
||||
3. **`keywords`** - Search terms for entity recognition
|
||||
4. **`wikidata_entity`** - Linked Data identifiers
|
||||
|
||||
## Architecture
|
||||
|
||||
### Overview: Two-Tier Embedding Hierarchy
|
||||
|
||||
The system uses a **hierarchical embedding approach** for fast semantic routing:
|
||||
|
||||
1. **Tier 1: Types File Embeddings** - Which category? (Museum vs Archive vs Library)
|
||||
2. **Tier 2: Individual Type Embeddings** - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────┐
|
||||
│ BUILD TIME: Extract vocabulary + generate embeddings │
|
||||
│ │
|
||||
│ schemas/20251121/linkml/modules/classes/*Type.yaml │
|
||||
│ schemas/20251121/linkml/modules/classes/*Types.yaml │
|
||||
│ ↓ │
|
||||
│ scripts/extract-types-vocab.ts │
|
||||
│ ↓ │
|
||||
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ types-vocab.json │ │
|
||||
│ │ ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] } │ │
|
||||
│ │ ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│ │
|
||||
│ │ └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│ │
|
||||
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼ (loaded at runtime)
|
||||
┌─────────────────────────────────────────────────────────────────────────┐
|
||||
│ RUNTIME: Two-Tier Semantic Routing │
|
||||
│ │
|
||||
│ Query: "Hoeveel gemeentearchieven in Amsterdam?" │
|
||||
│ ↓ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ TIER 1: Types File Selection │ │
|
||||
│ │ Query embedding vs Tier1 embeddings (19 categories) │ │
|
||||
│ │ Result: ArchiveOrganizationType (similarity: 0.89) │ │
|
||||
│ └─────────────────────────────────────────────────────────────────┘ │
|
||||
│ ↓ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ TIER 2: Specific Type Selection │ │
|
||||
│ │ Query embedding vs Tier2 embeddings (144 archive subtypes) │ │
|
||||
│ │ Result: MunicipalArchive (similarity: 0.94) │ │
|
||||
│ └─────────────────────────────────────────────────────────────────┘ │
|
||||
│ ↓ │
|
||||
│ Structured cache key: "count:A.MUNICIPAL_ARCHIVE:amsterdam" │
|
||||
└─────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Tier 1: Types File Embeddings
|
||||
|
||||
Each Types file (e.g., `MuseumType.yaml`, `ArchiveOrganizationType.yaml`) gets ONE embedding
|
||||
representing the **accumulated vocabulary** of all types within that file.
|
||||
|
||||
**Embedding Text Construction**:
|
||||
```
|
||||
MuseumType: museum musea kunstmuseum art museum natural history museum
|
||||
science museum open-air museum ecomuseum virtual museum
|
||||
heritage farm national museum regional museum university museum
|
||||
[... all keywords from all 187 subtypes ...]
|
||||
```
|
||||
|
||||
**Purpose**: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.
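
As a rough illustration of how the accumulated text for one Types file might be assembled, the sketch below flattens the per-language keywords of a type and its subtypes into one deduplicated string. The `KeywordSource` shape and the `buildAccumulatedTerms` helper are assumptions made for this example; they are not taken from `scripts/extract-types-vocab.ts`.

```typescript
// Hypothetical shape: per-language keyword lists, as in the vocabulary JSON.
interface KeywordSource {
  keywords: Record<string, string[]>;
}

// Flatten the keywords of a type and all of its subtypes into a single
// space-separated string, lowercased and deduplicated in first-seen order.
function buildAccumulatedTerms(type: KeywordSource, subtypes: KeywordSource[]): string {
  const seen = new Set<string>();
  for (const source of [type, ...subtypes]) {
    for (const terms of Object.values(source.keywords)) {
      for (const term of terms) {
        seen.add(term.trim().toLowerCase());
      }
    }
  }
  return Array.from(seen).join(" ");
}

const museumText = buildAccumulatedTerms(
  { keywords: { nl: ["museum", "musea"], en: ["museum", "museums"] } },
  [{ keywords: { nl: ["kunstmuseum"], en: ["art museum"] } }],
);
console.log(museumText); // "museum musea museums kunstmuseum art museum"
```

The resulting string is what would be passed to the embedding model once per Types file; deduplicating first keeps frequently repeated labels from dominating the text.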
|
||||
|
||||
| Types File | Code | Accumulated Terms Count |
|------------|------|------------------------|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |
|
||||
|
||||
### Tier 2: Individual Type Embeddings
|
||||
|
||||
Each **specific type** within a Types file gets its own embedding from its accumulated terms.
|
||||
|
||||
**Embedding Text Construction**:
|
||||
```
|
||||
MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
|
||||
town archive local government records burgerlijke stand
|
||||
bevolkingsregister council minutes building permits
|
||||
[... all keywords + structured_aliases + labels ...]
|
||||
```
|
||||
|
||||
**Purpose**: Precise subtype identification after Tier 1 narrows the category.
|
||||
|
||||
### Term Log Structure
|
||||
|
||||
A lookup table mapping every extracted term to its type/subtype:
|
||||
|
||||
```json
{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "lang": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "lang": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "lang": "nl"
    }
  }
}
```
|
||||
|
||||
**Purpose**:
|
||||
1. Fast keyword lookup with no embedding needed for exact matches (a direct key lookup is O(1); scanning a query for contained terms is linear in the number of terms)
|
||||
2. Audit trail of which terms map to which types
|
||||
3. Debugging which queries match which types
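
One way such a lookup table could be derived is by inverting the type → subtype → keywords hierarchy. The sketch below is illustrative only: `buildTermLog` and the minimal interfaces are hypothetical names chosen for this example, and reporting conflicts (rather than overwriting) is an assumed policy, not something the rule specifies.

```typescript
// Hypothetical minimal shapes, mirroring the fields shown in the example above.
interface SubtypeKeywords {
  wikidata?: string;
  keywords: Record<string, string[]>; // language code -> terms
}
interface TypeKeywords {
  code: string;
  className: string;
  subtypes: Record<string, SubtypeKeywords>;
}
interface TermEntry {
  typeCode: string;
  typeName: string;
  subtypeName: string;
  wikidata?: string;
  lang: string;
}

// Invert the hierarchy into a term -> entry map. The first mapping seen for a
// term wins; later duplicates are collected so ambiguous vocabulary can be
// reviewed instead of being silently overwritten.
function buildTermLog(types: TypeKeywords[]): { termLog: Map<string, TermEntry>; conflicts: string[] } {
  const termLog = new Map<string, TermEntry>();
  const conflicts: string[] = [];
  for (const type of types) {
    for (const [subtypeName, subtype] of Object.entries(type.subtypes)) {
      for (const [lang, terms] of Object.entries(subtype.keywords)) {
        for (const raw of terms) {
          const term = raw.toLowerCase();
          if (termLog.has(term)) {
            conflicts.push(term);
            continue;
          }
          termLog.set(term, {
            typeCode: type.code,
            typeName: type.className,
            subtypeName,
            wikidata: subtype.wikidata,
            lang,
          });
        }
      }
    }
  }
  return { termLog, conflicts };
}

const built = buildTermLog([
  {
    code: "M",
    className: "MuseumType",
    subtypes: {
      ART_MUSEUM: { wikidata: "Q207694", keywords: { nl: ["kunstmuseum"], en: ["art museum"] } },
    },
  },
  {
    code: "A",
    className: "ArchiveOrganizationType",
    subtypes: {
      MUNICIPAL_ARCHIVE: { wikidata: "Q8362876", keywords: { nl: ["gemeentearchief", "stadsarchief"] } },
    },
  },
]);
console.log(built.termLog.get("kunstmuseum")?.subtypeName); // "ART_MUSEUM"
```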
|
||||
|
||||
### Runtime Lookup Strategy
|
||||
|
||||
```typescript
async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();

  // FAST PATH: check termLog for keyword matches, longest term first so that
  // a specific term ("kunstmuseum") wins over a generic one ("museum")
  const terms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
  for (const term of terms) {
    if (normalized.includes(term)) {
      const mapping = vocab.termLog[term];
      return {
        institutionType: mapping.typeCode as InstitutionTypeCode,
        institutionSubtype: mapping.subtypeName,
        recordSetType: mapping.recordSetType,
        subtypeWikidata: mapping.wikidata,
        // ... location and intent extraction
      };
    }
  }

  // SLOW PATH: embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);

  // Tier 1: find the best matching Types file (keyed by class name)
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }

  if (!bestType) return {}; // No type matched

  // institutionTypes is keyed by type code ("M", "A", ...), so resolve the
  // code by class name rather than indexing with the class name directly
  const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === bestType);
  if (!typeInfo) return {};
  const typeCode = typeInfo.code as InstitutionTypeCode;

  // Tier 2: find the best matching subtype within the Types file
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;

  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }

  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype,
    // ... location and intent extraction
  };
}
```
|
||||
|
||||
### Embedding Model Choice
|
||||
|
||||
For build-time embedding generation, use the same model as the semantic cache:
|
||||
|
||||
| Option | Model | Dimensions | Quality |
|--------|-------|------------|---------|
| **Primary** | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Good multilingual |
| Fallback | `all-MiniLM-L6-v2` | 384 | English-focused |
| High Quality | `multilingual-e5-large` | 1024 | Best multilingual |
|
||||
|
||||
**Build-time generation**: Embeddings are generated ONCE at build time and stored in JSON.
|
||||
This avoids re-embedding the vocabulary at runtime; only the query itself still needs an embedding, and only on the slow path.
|
||||
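
Because stored vectors are only comparable with query embeddings of the same dimensionality, a build step might sanity-check the generated file before it ships. The following is a minimal sketch under that assumption; `findDimensionMismatches` and `EmbeddingTables` are hypothetical names, not part of the extraction script.

```typescript
// Hypothetical subset of the vocabulary file relevant to embeddings.
interface EmbeddingTables {
  embeddingDimensions: number;
  tier1Embeddings: Record<string, number[]>;
  tier2Embeddings: Record<string, Record<string, number[]>>;
}

// Return the names of all stored vectors whose length differs from the
// declared dimensionality (an empty result means the file is consistent).
function findDimensionMismatches(vocab: EmbeddingTables): string[] {
  const bad: string[] = [];
  for (const [name, vector] of Object.entries(vocab.tier1Embeddings)) {
    if (vector.length !== vocab.embeddingDimensions) bad.push(`tier1:${name}`);
  }
  for (const [code, subtypes] of Object.entries(vocab.tier2Embeddings)) {
    for (const [name, vector] of Object.entries(subtypes)) {
      if (vector.length !== vocab.embeddingDimensions) bad.push(`tier2:${code}.${name}`);
    }
  }
  return bad;
}

const mismatches = findDimensionMismatches({
  embeddingDimensions: 3,
  tier1Embeddings: { MuseumType: [0.1, 0.2, 0.3] },
  tier2Embeddings: { M: { ART_MUSEUM: [0.1, 0.2] } },
});
console.log(mismatches); // ["tier2:M.ART_MUSEUM"]
```

A check like this would also catch the case where the fallback or high-quality model (with a different dimension count) was used for only part of the vocabulary.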
|
||||
## TypesVocabulary JSON Structure
|
||||
|
||||
Generated at build time with **pre-computed embeddings**:
|
||||
|
||||
```json
|
||||
{
|
||||
"version": "2026-01-10T12:00:00Z",
|
||||
"schemaVersion": "20251121",
|
||||
"embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
|
||||
"embeddingDimensions": 384,
|
||||
|
||||
"tier1Embeddings": {
|
||||
"MuseumType": [0.023, -0.045, 0.087, ...],
|
||||
"ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
|
||||
"LibraryType": [-0.034, 0.089, 0.012, ...],
|
||||
"GalleryType": [0.045, -0.023, 0.067, ...]
|
||||
},
|
||||
|
||||
"tier2Embeddings": {
|
||||
"M": {
|
||||
"ART_MUSEUM": [0.034, -0.056, 0.078, ...],
|
||||
"NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
|
||||
"SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
|
||||
},
|
||||
"A": {
|
||||
"MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
|
||||
"NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
|
||||
"CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
|
||||
}
|
||||
},
|
||||
|
||||
"termLog": {
|
||||
"kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
|
||||
"art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
|
||||
"gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
|
||||
"stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
|
||||
"city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
|
||||
"burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
|
||||
"geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
|
||||
},
|
||||
|
||||
"institutionTypes": {
|
||||
"M": {
|
||||
"code": "M",
|
||||
"className": "MuseumType",
|
||||
"baseWikidata": "Q33506",
|
||||
"accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
|
||||
"keywords": {
|
||||
"nl": ["museum", "musea"],
|
||||
"en": ["museum", "museums"],
|
||||
"de": ["Museum", "Museen"]
|
||||
},
|
||||
"subtypes": {
|
||||
"ART_MUSEUM": {
|
||||
"className": "ArtMuseum",
|
||||
"wikidata": "Q207694",
|
||||
"accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
|
||||
"keywords": {
|
||||
"nl": ["kunstmuseum", "kunstmusea"],
|
||||
"en": ["art museum", "art museums"]
|
||||
}
|
||||
},
|
||||
"NATURAL_HISTORY_MUSEUM": {
|
||||
"className": "NaturalHistoryMuseum",
|
||||
"wikidata": "Q559049",
|
||||
"accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
|
||||
"keywords": {
|
||||
"nl": ["natuurhistorisch museum", "natuurmuseum"],
|
||||
"en": ["natural history museum"]
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"A": {
|
||||
"code": "A",
|
||||
"className": "ArchiveOrganizationType",
|
||||
"baseWikidata": "Q166118",
|
||||
"accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
|
||||
"keywords": {
|
||||
"nl": ["archief", "archieven"],
|
||||
"en": ["archive", "archives"]
|
||||
},
|
||||
"subtypes": {
|
||||
"MUNICIPAL_ARCHIVE": {
|
||||
"className": "MunicipalArchive",
|
||||
"wikidata": "Q8362876",
|
||||
"accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
|
||||
"keywords": {
|
||||
"nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
|
||||
"en": ["municipal archive", "city archive", "town archive"]
|
||||
}
|
||||
},
|
||||
"NATIONAL_ARCHIVE": {
|
||||
"className": "NationalArchive",
|
||||
"wikidata": "Q1188452",
|
||||
"accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
|
||||
"keywords": {
|
||||
"nl": ["nationaal archief", "rijksarchief"],
|
||||
"en": ["national archive", "state archive"]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
|
||||
"recordSetTypes": {
|
||||
"CIVIL_REGISTRY": {
|
||||
"className": "CivilRegistrySeries",
|
||||
"accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
|
||||
"keywords": {
|
||||
"nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
|
||||
"en": ["civil registry", "birth records", "marriage records", "death records"]
|
||||
}
|
||||
},
|
||||
"COUNCIL_GOVERNANCE": {
|
||||
"className": "CouncilGovernanceFonds",
|
||||
"accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
|
||||
"keywords": {
|
||||
"nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
|
||||
"en": ["council minutes", "ordinances", "resolutions"]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Key Additions for Embedding Support
|
||||
|
||||
| Field | Purpose |
|-------|---------|
| `tier1Embeddings` | Pre-computed embeddings for each Types file (19 categories) |
| `tier2Embeddings` | Pre-computed embeddings for each subtype (500+ types) |
| `termLog` | Fast O(1) lookup table for exact keyword matches |
| `accumulatedTerms` | Raw text used to generate embeddings (for debugging/regeneration) |
| `embeddingModel` | Model used to generate embeddings (for reproducibility) |
|
||||
|
||||
## Enhanced ExtractedEntities Interface
|
||||
|
||||
```typescript
export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;

  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;       // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;     // e.g., 'Q8362876' for LOD integration
}
```
|
||||
|
||||
## Enhanced Cache Key Format
|
||||
|
||||
```
|
||||
{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
|
||||
|
||||
Examples:
|
||||
- "count:m:amsterdam" # Basic museum count
|
||||
- "count:m.art_museum:amsterdam" # Art museum count (subtype)
|
||||
- "list:a.municipal_archive:nh" # Municipal archives in Noord-Holland
|
||||
- "query:a:civil_registry:utrecht" # Civil registry in Utrecht
|
||||
- "info:a.national_archive::nl" # National archive info (no location filter)
|
||||
```
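
To make the key template concrete, here is a minimal sketch of a key builder. It follows the bracketed template literally (optional segments are omitted, not left empty) and makes several assumptions that the rule does not state: segments are lowercased, a missing intent falls back to `"query"` as in the fourth example, and a missing type or location yields an empty segment. `buildCacheKey` and `KeyParts` are hypothetical names.

```typescript
// Hypothetical subset of the extracted entities used for key generation.
interface KeyParts {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

// Build "{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}".
// Assumptions: lowercase segments, "query" as the default intent, and an
// empty segment for a missing type or location.
function buildCacheKey(parts: KeyParts): string {
  const norm = (value?: string | null): string => (value ?? "").trim().toLowerCase();
  let typeSegment = norm(parts.institutionType);
  if (parts.institutionSubtype) typeSegment += `.${norm(parts.institutionSubtype)}`;
  const segments = [norm(parts.intent) || "query", typeSegment];
  if (parts.recordSetType) segments.push(norm(parts.recordSetType));
  segments.push(norm(parts.location));
  return segments.join(":");
}

const artKey = buildCacheKey({
  intent: "count",
  institutionType: "M",
  institutionSubtype: "ART_MUSEUM",
  location: "Amsterdam",
});
const registryKey = buildCacheKey({
  institutionType: "A",
  recordSetType: "CIVIL_REGISTRY",
  location: "Utrecht",
});
console.log(artKey);      // "count:m.art_museum:amsterdam"
console.log(registryKey); // "query:a:civil_registry:utrecht"
```

Location normalization (for example mapping "Noord-Holland" to `nh`) is out of scope for this sketch and would need its own lookup.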
|
||||
|
||||
## Implementation Files
|
||||
|
||||
| File | Purpose |
|------|---------|
| `scripts/extract-types-vocab.ts` | Build-time vocabulary extraction from LinkML |
| `apps/archief-assistent/public/types-vocab.json` | Generated vocabulary file |
| `apps/archief-assistent/src/lib/types-vocabulary.ts` | Runtime vocabulary loader |
| `apps/archief-assistent/src/lib/semantic-cache.ts` | Updated entity extraction |
|
||||
|
||||
## Build Integration
|
||||
|
||||
Add to `apps/archief-assistent/package.json`:
|
||||
|
||||
```json
{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}
```
|
||||
|
||||
## Keyword Extraction Priority
|
||||
|
||||
When extracting keywords from schema files:
|
||||
|
||||
1. **`keywords`** array (highest priority) - Explicit search terms
|
||||
2. **`structured_aliases.literal_form`** - Multilingual alternative names
|
||||
3. **`type_label`** - Preferred labels per language
|
||||
4. **Class name conversion** - `MunicipalArchive` → "municipal archive"
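
The priority order above could be applied roughly as follows. This is a sketch under simplifying assumptions: `SchemaClass` is a hypothetical, flattened view of one parsed class (the real LinkML YAML carries language tags and more structure), and `collectTerms` and `classNameToTerm` are illustrative helper names.

```typescript
// Hypothetical, simplified view of one parsed schema class. Only the four
// sources listed above are modeled; language tags are omitted.
interface SchemaClass {
  name: string;                 // e.g. "MunicipalArchive"
  keywords?: string[];
  structuredAliases?: string[]; // literal_form values
  typeLabels?: string[];        // preferred labels
}

// "MunicipalArchive" -> "municipal archive"
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1 $2").toLowerCase();
}

// Collect terms in priority order; a term keeps the position of the first
// (highest-priority) source that mentions it.
function collectTerms(cls: SchemaClass): string[] {
  const ordered = [
    ...(cls.keywords ?? []),
    ...(cls.structuredAliases ?? []),
    ...(cls.typeLabels ?? []),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of ordered) {
    const term = raw.trim().toLowerCase();
    if (term && !seen.has(term)) {
      seen.add(term);
      result.push(term);
    }
  }
  return result;
}

const municipalTerms = collectTerms({
  name: "MunicipalArchive",
  keywords: ["gemeentearchief", "stadsarchief"],
  structuredAliases: ["Stadsarchief", "city archive"],
  typeLabels: ["Gemeentearchief"],
});
console.log(municipalTerms);
// ["gemeentearchief", "stadsarchief", "city archive", "municipal archive"]
```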
|
||||
|
||||
## Cache Segmentation Rules
|
||||
|
||||
### Rule 1: Subtype Specificity
|
||||
|
||||
Queries with **specific subtypes** should NOT match **generic type** cache entries:
|
||||
|
||||
```
|
||||
Query: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
|
||||
Cached: "musea in Amsterdam" → key: "count:m:amsterdam"
|
||||
Result: MISS (subtype mismatch) ✅
|
||||
```
|
||||
|
||||
### Rule 2: Record Set Type Isolation
|
||||
|
||||
Queries about **specific record types** should cache separately:
|
||||
|
||||
```
|
||||
Query: "burgerlijke stand Utrecht" → key: "query:a:civil_registry:utrecht"
|
||||
Cached: "archieven in Utrecht" → key: "list:a:utrecht"
|
||||
Result: MISS (record set type mismatch) ✅
|
||||
```
|
||||
|
||||
### Rule 3: Subtype-to-Type Fallback
|
||||
|
||||
Generic queries must NOT be served from subtype cache entries, because a subtype result covers only a subset of what the generic query asks for:
|
||||
|
||||
```
|
||||
Query: "musea in Amsterdam" → key: "count:m:amsterdam"
|
||||
Cached: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
|
||||
Result: MISS (don't return subset for superset query)
|
||||
```
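
Taken together, the three examples above all resolve to a MISS whenever any segment differs, which suggests a strict equality check across segments. The sketch below illustrates that reading; `segmentsMatch` and `Segments` are hypothetical names, and treating a null subtype as distinct from every concrete subtype is an assumption drawn from the examples.

```typescript
interface Segments {
  intent: string | null;
  institutionType: string | null;
  institutionSubtype: string | null;
  recordSetType: string | null;
  location: string | null;
}

// A cached entry is reusable only if every segment is identical. A null
// subtype (generic query) therefore never matches a specific subtype, in
// either direction.
function segmentsMatch(query: Segments, cached: Segments): boolean {
  return (
    query.intent === cached.intent &&
    query.institutionType === cached.institutionType &&
    query.institutionSubtype === cached.institutionSubtype &&
    query.recordSetType === cached.recordSetType &&
    query.location === cached.location
  );
}

const generic: Segments = {
  intent: "count",
  institutionType: "M",
  institutionSubtype: null,
  recordSetType: null,
  location: "amsterdam",
};
const artOnly: Segments = { ...generic, institutionSubtype: "ART_MUSEUM" };

console.log(segmentsMatch(artOnly, generic));        // false (Rule 1)
console.log(segmentsMatch(generic, artOnly));        // false (Rule 3)
console.log(segmentsMatch(generic, { ...generic })); // true
```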
|
||||
|
||||
## Migration Notes
|
||||
|
||||
1. **Backwards Compatible**: Existing cache entries without `institutionSubtype` continue to work
|
||||
2. **Gradual Rollout**: New cache entries get subtype, old entries remain valid
|
||||
3. **Cache Clear**: Consider clearing cache after deployment to ensure consistency
|
||||
|
||||
## Validation
|
||||
|
||||
Run E2E tests to verify:
|
||||
|
||||
```bash
|
||||
cd apps/archief-assistent
|
||||
npm run test:e2e
|
||||
```
|
||||
|
||||
Key test cases:
|
||||
- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
|
||||
- Subtype isolation (kunstmuseum ≠ museum)
|
||||
- Record set isolation (burgerlijke stand ≠ archive)
|
||||
- Intent isolation (count ≠ list ≠ info)
|
||||
|
||||
## References
|
||||
|
||||
- **Rule 41**: Types classes define SPARQL template variables
|
||||
- **Rule 0b**: Type/Types file naming convention
|
||||
- **CustodianType.yaml**: Base taxonomy definition
|
||||
- **AGENTS.md**: GLAMORCUBESFIXPHDNT taxonomy documentation
|
||||
|
||||
---
|
||||
|
||||
**Created**: 2026-01-10
|
||||
**Author**: OpenCode Agent
|
||||
**Status**: Implementing
|
||||
10142
apps/archief-assistent/public/types-vocab.json
Normal file
File diff suppressed because it is too large
432
apps/archief-assistent/src/lib/types-vocabulary.ts
Normal file
|
|
@ -0,0 +1,432 @@
|
|||
/**
|
||||
* types-vocabulary.ts
|
||||
*
|
||||
* Runtime loader for the TypesVocabulary extracted from LinkML schema files.
|
||||
* Provides two-tier semantic routing for entity extraction:
|
||||
*
|
||||
* 1. Fast Path: O(1) termLog lookup for exact keyword matches
|
||||
* 2. Slow Path: Embedding-based similarity for fuzzy semantic matching
|
||||
*
|
||||
* See: .opencode/rules/ontology-driven-cache-segmentation.md (Rule 46)
|
||||
*/
|
||||
|
||||
import type { InstitutionTypeCode } from './semantic-cache';
|
||||
|
||||
// ============================================================================
|
||||
// Types
|
||||
// ============================================================================
|
||||
|
||||
export interface TermLogEntry {
|
||||
typeCode: string;
|
||||
typeName: string;
|
||||
subtypeName?: string;
|
||||
recordSetType?: string;
|
||||
wikidata?: string;
|
||||
lang: string;
|
||||
}
|
||||
|
||||
export interface SubtypeInfo {
|
||||
className: string;
|
||||
wikidata?: string;
|
||||
accumulatedTerms: string;
|
||||
keywords: Record<string, string[]>;
|
||||
}
|
||||
|
||||
export interface TypeInfo {
|
||||
code: string;
|
||||
className: string;
|
||||
baseWikidata?: string;
|
||||
accumulatedTerms: string;
|
||||
keywords: Record<string, string[]>;
|
||||
subtypes: Record<string, SubtypeInfo>;
|
||||
}
|
||||
|
||||
export interface RecordSetTypeInfo {
|
||||
className: string;
|
||||
accumulatedTerms: string;
|
||||
keywords: Record<string, string[]>;
|
||||
}
|
||||
|
||||
export interface TypesVocabulary {
|
||||
version: string;
|
||||
schemaVersion: string;
|
||||
embeddingModel: string;
|
||||
embeddingDimensions: number;
|
||||
tier1Embeddings: Record<string, number[]>;
|
||||
tier2Embeddings: Record<string, Record<string, number[]>>;
|
||||
termLog: Record<string, TermLogEntry>;
|
||||
institutionTypes: Record<string, TypeInfo>;
|
||||
recordSetTypes: Record<string, RecordSetTypeInfo>;
|
||||
}
|
||||
|
||||
export interface VocabularyMatch {
|
||||
typeCode: InstitutionTypeCode;
|
||||
typeName: string;
|
||||
subtypeName?: string;
|
||||
recordSetType?: string;
|
||||
wikidata?: string;
|
||||
matchedTerm: string;
|
||||
matchMethod: 'exact' | 'embedding_tier1' | 'embedding_tier2';
|
||||
confidence: number;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Vocabulary Singleton
|
||||
// ============================================================================
|
||||
|
||||
let vocabularyCache: TypesVocabulary | null = null;
|
||||
let loadPromise: Promise<TypesVocabulary> | null = null;
|
||||
|
||||
/**
|
||||
* Load the TypesVocabulary from the static JSON file.
|
||||
* Caches the result for subsequent calls.
|
||||
*/
|
||||
export async function loadTypesVocabulary(): Promise<TypesVocabulary> {
|
||||
if (vocabularyCache) return vocabularyCache;
|
||||
|
||||
if (loadPromise) return loadPromise;
|
||||
|
||||
loadPromise = (async () => {
|
||||
try {
|
||||
const response = await fetch('/types-vocab.json');
|
||||
if (!response.ok) {
|
||||
console.warn('[TypesVocabulary] Failed to load vocabulary:', response.status);
|
||||
return createEmptyVocabulary();
|
||||
}
|
||||
|
||||
vocabularyCache = await response.json();
|
||||
console.log(
|
||||
`[TypesVocabulary] Loaded: ${Object.keys(vocabularyCache!.institutionTypes).length} types, ` +
|
||||
`${Object.keys(vocabularyCache!.termLog).length} terms`
|
||||
);
|
||||
return vocabularyCache!;
|
||||
} catch (error) {
|
||||
console.warn('[TypesVocabulary] Error loading vocabulary:', error);
|
||||
return createEmptyVocabulary();
|
||||
}
|
||||
})();
|
||||
|
||||
return loadPromise;
|
||||
}
|
||||
|
||||
function createEmptyVocabulary(): TypesVocabulary {
|
||||
return {
|
||||
version: 'empty',
|
||||
schemaVersion: '',
|
||||
embeddingModel: '',
|
||||
embeddingDimensions: 0,
|
||||
tier1Embeddings: {},
|
||||
tier2Embeddings: {},
|
||||
termLog: {},
|
||||
institutionTypes: {},
|
||||
recordSetTypes: {},
|
||||
};
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Fast Path: Term Log Lookup
|
||||
// ============================================================================
|
||||
|
||||
/**
|
||||
 * Fast term log lookup for keyword matches (substring scan, longest term first).
|
||||
* This is the preferred method - no embeddings needed.
|
||||
*
|
||||
 * @param query - Query text (lowercased internally before matching)
|
||||
* @returns Match info if a term is found, null otherwise
|
||||
*/
|
||||
export async function lookupTermLog(query: string): Promise<VocabularyMatch | null> {
|
||||
const vocab = await loadTypesVocabulary();
|
||||
const normalized = query.toLowerCase();
|
||||
|
||||
// Sort terms by length (longest first) to match most specific terms
|
||||
const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
|
||||
|
||||
for (const term of sortedTerms) {
|
||||
if (normalized.includes(term)) {
|
||||
const entry = vocab.termLog[term];
|
||||
return {
|
||||
typeCode: entry.typeCode as InstitutionTypeCode,
|
||||
typeName: entry.typeName,
|
||||
subtypeName: entry.subtypeName,
|
||||
recordSetType: entry.recordSetType,
|
||||
wikidata: entry.wikidata,
|
||||
matchedTerm: term,
|
||||
matchMethod: 'exact',
|
||||
confidence: 1.0,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get all matching terms from the term log (for multi-entity queries).
|
||||
*
|
||||
* @param query - Normalized query text (lowercase)
|
||||
* @returns Array of all matching terms
|
||||
*/
|
||||
export async function lookupAllTerms(query: string): Promise<VocabularyMatch[]> {
|
||||
const vocab = await loadTypesVocabulary();
|
||||
const normalized = query.toLowerCase();
|
||||
const matches: VocabularyMatch[] = [];
|
||||
|
||||
// Sort terms by length (longest first)
|
||||
const sortedTerms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
|
||||
const matchedPositions = new Set<number>();
|
||||
|
||||
for (const term of sortedTerms) {
|
||||
const index = normalized.indexOf(term);
|
||||
if (index !== -1) {
|
||||
// Check if this position range is already matched by a longer term
|
||||
let alreadyMatched = false;
|
||||
for (let i = index; i < index + term.length; i++) {
|
||||
if (matchedPositions.has(i)) {
|
||||
alreadyMatched = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (!alreadyMatched) {
|
||||
// Mark positions as matched
|
||||
for (let i = index; i < index + term.length; i++) {
|
||||
matchedPositions.add(i);
|
||||
}
|
||||
|
||||
const entry = vocab.termLog[term];
|
||||
matches.push({
|
||||
typeCode: entry.typeCode as InstitutionTypeCode,
|
||||
typeName: entry.typeName,
|
||||
subtypeName: entry.subtypeName,
|
||||
recordSetType: entry.recordSetType,
|
||||
wikidata: entry.wikidata,
|
||||
matchedTerm: term,
|
||||
matchMethod: 'exact',
|
||||
confidence: 1.0,
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return matches;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Slow Path: Embedding-Based Matching
|
||||
// ============================================================================
|
||||
|
||||
/**
|
||||
* Compute cosine similarity between two vectors.
|
||||
*/
|
||||
function cosineSimilarity(a: number[], b: number[]): number {
|
||||
if (a.length !== b.length || a.length === 0) return 0;
|
||||
|
||||
let dotProduct = 0;
|
||||
let normA = 0;
|
||||
let normB = 0;
|
||||
|
||||
for (let i = 0; i < a.length; i++) {
|
||||
dotProduct += a[i] * b[i];
|
||||
normA += a[i] * a[i];
|
||||
normB += b[i] * b[i];
|
||||
}
|
||||
|
||||
const magnitude = Math.sqrt(normA) * Math.sqrt(normB);
|
||||
return magnitude === 0 ? 0 : dotProduct / magnitude;
|
||||
}
|
||||
|
||||
/**
|
||||
* Tier 1: Find the best matching institution type category.
|
||||
* Uses pre-computed embeddings for each Types file.
|
||||
*
|
||||
* @param queryEmbedding - Embedding of the user's query
|
||||
* @param threshold - Minimum similarity threshold (default 0.7)
|
||||
* @returns Best matching type info or null
|
||||
*/
|
||||
export async function matchTier1(
|
||||
queryEmbedding: number[],
|
||||
threshold: number = 0.7
|
||||
): Promise<{ typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null> {
|
||||
const vocab = await loadTypesVocabulary();
|
||||
|
||||
let bestMatch: { typeName: string; typeCode: InstitutionTypeCode; similarity: number } | null = null;
|
||||
|
||||
for (const [typeName, embedding] of Object.entries(vocab.tier1Embeddings)) {
|
||||
if (embedding.length === 0) continue; // Skip empty embeddings
|
||||
|
||||
const similarity = cosineSimilarity(queryEmbedding, embedding);
|
||||
|
||||
if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
|
||||
// Find the type code for this type name
|
||||
const typeInfo = Object.values(vocab.institutionTypes).find(t => t.className === typeName);
|
||||
if (typeInfo) {
|
||||
bestMatch = {
|
||||
typeName,
|
||||
typeCode: typeInfo.code as InstitutionTypeCode,
|
||||
similarity,
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return bestMatch;
|
||||
}
|
||||
|
||||
/**
 * Tier 2: Find the best matching subtype within a category.
 * Uses pre-computed embeddings for each subtype.
 *
 * @param queryEmbedding - Embedding of the user's query
 * @param typeCode - The institution type code from Tier 1
 * @param threshold - Minimum similarity threshold (default 0.75)
 * @returns Best matching subtype info or null
 */
export async function matchTier2(
  queryEmbedding: number[],
  typeCode: InstitutionTypeCode,
  threshold: number = 0.75
): Promise<{ subtypeName: string; similarity: number } | null> {
  const vocab = await loadTypesVocabulary();

  const subtypeEmbeddings = vocab.tier2Embeddings[typeCode];
  if (!subtypeEmbeddings) return null;

  let bestMatch: { subtypeName: string; similarity: number } | null = null;

  for (const [subtypeName, embedding] of Object.entries(subtypeEmbeddings)) {
    if (embedding.length === 0) continue; // Skip empty embeddings

    const similarity = cosineSimilarity(queryEmbedding, embedding);

    if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
      bestMatch = { subtypeName, similarity };
    }
  }

  return bestMatch;
}

/**
 * Full two-tier embedding-based matching.
 *
 * @param queryEmbedding - Embedding of the user's query
 * @returns Match result or null
 */
export async function matchWithEmbeddings(
  queryEmbedding: number[]
): Promise<VocabularyMatch | null> {
  // Tier 1: Find best type category
  const tier1Match = await matchTier1(queryEmbedding);
  if (!tier1Match) return null;

  // Tier 2: Find best subtype within the category
  const tier2Match = await matchTier2(queryEmbedding, tier1Match.typeCode);

  return {
    typeCode: tier1Match.typeCode,
    typeName: tier1Match.typeName,
    subtypeName: tier2Match?.subtypeName,
    matchedTerm: '', // Embedding matches have no literal matched term
    matchMethod: tier2Match ? 'embedding_tier2' : 'embedding_tier1',
    // Prefer the more specific Tier 2 score; fall back to Tier 1 only when no subtype matched
    confidence: tier2Match?.similarity ?? tier1Match.similarity,
  };
}

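The two-tier matching above can be tried out in isolation with a small sketch. `cosineSimilarity` is defined elsewhere in the module and is not shown here, so `cosineSketch` below is an assumption (the standard dot product divided by the product of the norms). `bestAbove` and the toy 2-D vectors are hypothetical and exist only for illustration.

```typescript
// Assumed stand-in for the module's cosineSimilarity (not shown in this file view).
function cosineSketch(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical helper: the best-scoring label whose similarity exceeds the threshold.
function bestAbove(
  query: number[],
  candidates: Record<string, number[]>,
  threshold: number
): { name: string; similarity: number } | null {
  let best: { name: string; similarity: number } | null = null;
  for (const [name, embedding] of Object.entries(candidates)) {
    if (embedding.length === 0) continue; // Skip empty embeddings
    const similarity = cosineSketch(query, embedding);
    if (similarity > threshold && (!best || similarity > best.similarity)) {
      best = { name, similarity };
    }
  }
  return best;
}

// Toy "embeddings", made up for this example.
const tier1Toy: Record<string, number[]> = { Museum: [1, 0], Archive: [0, 1] };
const tier2Toy: Record<string, Record<string, number[]>> = {
  Museum: { ArtMuseum: [0.9, 0.1], EmptySubtype: [] },
};

const queryToy = [0.95, 0.05];
// Tier 1 picks the category; Tier 2 only searches inside that category.
const t1 = bestAbove(queryToy, tier1Toy, 0.7);
const t2 = t1 ? bestAbove(queryToy, tier2Toy[t1.name] ?? {}, 0.75) : null;
```

Restricting Tier 2 to the subtypes of the Tier 1 winner keeps the second pass small, at the cost of never recovering from a wrong Tier 1 choice.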
// ============================================================================
// Combined Lookup (Fast Path + Slow Path)
// ============================================================================

/**
 * Primary entry point for vocabulary-based entity extraction.
 *
 * Strategy:
 * 1. First try fast O(1) term log lookup (no embeddings needed)
 * 2. If no match and embeddings available, try two-tier semantic matching
 *
 * @param query - The user's query text
 * @param queryEmbedding - Optional embedding for semantic matching
 * @returns Best match or null
 */
export async function extractEntityFromVocabulary(
  query: string,
  queryEmbedding?: number[]
): Promise<VocabularyMatch | null> {
  // Fast path: Try term log lookup first
  const termMatch = await lookupTermLog(query);
  if (termMatch) {
    return termMatch;
  }

  // Slow path: Try embedding-based matching if embeddings available
  if (queryEmbedding && queryEmbedding.length > 0) {
    return matchWithEmbeddings(queryEmbedding);
  }

  return null;
}

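The fast-path/slow-path strategy can be sketched without the real vocabulary. Everything below is hypothetical: `termLogSketch` stands in for the loaded term log, the whitespace-and-punctuation tokenization is an assumption about how `lookupTermLog` might split a query (its implementation is not shown here), and the `'term_log'` method label is invented for the example.

```typescript
type SketchMatch = { typeCode: string; matchedTerm: string; matchMethod: string };

// Hypothetical term log: lowercase term -> institution type code.
const termLogSketch: Record<string, string> = {
  museum: "M",
  musea: "M",
  archief: "A",
};

// Fast path: exact per-token lookup, using an own-property check so that
// inherited keys such as "constructor" are never treated as vocabulary terms.
function lookupSketch(query: string): SketchMatch | null {
  for (const token of query.toLowerCase().split(/[\s.,;:!?]+/)) {
    if (token && Object.prototype.hasOwnProperty.call(termLogSketch, token)) {
      return { typeCode: termLogSketch[token], matchedTerm: token, matchMethod: "term_log" };
    }
  }
  return null;
}

// Combined lookup: the semantic fallback runs only when the term lookup fails
// and a non-empty embedding was supplied.
function extractSketch(
  query: string,
  queryEmbedding?: number[],
  semanticFallback?: (embedding: number[]) => SketchMatch | null
): SketchMatch | null {
  const termMatch = lookupSketch(query);
  if (termMatch) return termMatch;

  if (queryEmbedding && queryEmbedding.length > 0 && semanticFallback) {
    return semanticFallback(queryEmbedding);
  }
  return null;
}

// Usage: a literal vocabulary hit never touches the embedding path.
const fastHit = extractSketch("Hoeveel musea in Amsterdam?");
```

Ordering the paths this way means a query that contains a known term costs only dictionary lookups, and an embedding is needed only for paraphrases the vocabulary does not cover.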
// ============================================================================
// Utility Functions
// ============================================================================

/**
 * Get all keywords for a specific institution type.
 */
export async function getKeywordsForType(typeCode: InstitutionTypeCode): Promise<string[]> {
  const vocab = await loadTypesVocabulary();
  const typeInfo = vocab.institutionTypes[typeCode];
  if (!typeInfo) return [];

  const keywords: string[] = [];

  // Add base type keywords
  for (const terms of Object.values(typeInfo.keywords)) {
    keywords.push(...terms);
  }

  // Add subtype keywords
  for (const subtype of Object.values(typeInfo.subtypes)) {
    for (const terms of Object.values(subtype.keywords)) {
      keywords.push(...terms);
    }
  }

  return [...new Set(keywords)];
}

/**
 * Check if a term exists in the vocabulary.
 */
export async function hasTerm(term: string): Promise<boolean> {
  const vocab = await loadTypesVocabulary();
  // Own-property check: the `in` operator would also match inherited keys such as "constructor"
  return Object.prototype.hasOwnProperty.call(vocab.termLog, term.toLowerCase());
}

/**
 * Get vocabulary statistics.
 */
export async function getVocabularyStats(): Promise<{
  version: string;
  institutionTypes: number;
  subtypes: number;
  recordSetTypes: number;
  terms: number;
  hasEmbeddings: boolean;
}> {
  const vocab = await loadTypesVocabulary();

  const subtypeCount = Object.values(vocab.institutionTypes)
    .reduce((sum, t) => sum + Object.keys(t.subtypes).length, 0);

  const hasEmbeddings = Object.values(vocab.tier1Embeddings)
    .some(e => e.length > 0);

  return {
    version: vocab.version,
    institutionTypes: Object.keys(vocab.institutionTypes).length,
    subtypes: subtypeCount,
    recordSetTypes: Object.keys(vocab.recordSetTypes).length,
    terms: Object.keys(vocab.termLog).length,
    hasEmbeddings,
  };
}