# Rule 46: Ontology-Driven Cache Segmentation

🚨 **CRITICAL**: The semantic cache MUST use vocabulary derived from LinkML `*Type.yaml` and `*Types.yaml` schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.

**Status**: Implemented (Evolved v2.0)

**Version**: 2.0 (Epistemological Evolution)

**Updated**: 2026-01-10

## Evolution Overview

Rule 46 v2.0 incorporates insights from Volodymyr Pavlyshyn's work on agentic memory systems:

1. **Epistemic Provenance** (Phase 1) - Track WHERE, WHEN, HOW data originated
2. **Topological Distance** (Phase 2) - Use ontology structure, not just embeddings
3. **Holarchic Cache** (Phase 3) - Entries as holons with up/down links
4. **Message Passing** (Phase 4, planned) - Smalltalk-style introspectable cache
5. **Clarity Trading** (Phase 5, planned) - Block ambiguous queries from cache

## Epistemic Provenance

Every cached response carries epistemological metadata:

```typescript
interface EpistemicProvenance {
  dataSource: 'ISIL_REGISTRY' | 'WIKIDATA' | 'CUSTODIAN_YAML' | 'LLM_INFERENCE' /* | ... */;
  dataTier: 1 | 2 | 3 | 4;  // TIER_1_AUTHORITATIVE → TIER_4_INFERRED
  sourceTimestamp: string;
  derivationChain: string[];  // ["SPARQL:Qdrant", "RAG:retrieve", "LLM:generate"]
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}
```

**Benefit**: Users see "This answer is from TIER_1 ISIL registry data, captured 2025-01-08".

## Topological Distance

Beyond embedding similarity, cache matching considers **structural distance** in the type hierarchy:

```
                   HeritageCustodian (*)
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
   MuseumType (M)     ArchiveType (A)    LibraryType (L)
          │                  │                  │
     ┌────┴────┐        ┌────┴────┐        ┌────┴────┐
     ▼         ▼        ▼         ▼        ▼         ▼
 ArtMuseum  History Municipal   State   Public   Academic
```

**Combined Similarity Formula**:

```typescript
finalScore = 0.7 * embeddingSimilarity + 0.3 * (1 - topologicalDistance)
```

**Benefit**: "Art museum" is much less likely to match "natural history museum", even at 95% embedding similarity, because the structural distance between the two types lowers the combined score.

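
The combined score can be illustrated with a minimal sketch. The rule does not define how `topologicalDistance` is computed or normalized, so everything here beyond the 0.7/0.3 formula is an assumption: the `PARENT` map is a hand-written fragment of the hierarchy above, and distance is taken as the number of edges between two types divided by the longest possible path in that fragment.

```typescript
// Hypothetical parent map for a fragment of the hierarchy shown above.
const PARENT: Record<string, string | null> = {
  HeritageCustodian: null,
  MuseumType: 'HeritageCustodian',
  ArchiveType: 'HeritageCustodian',
  ArtMuseum: 'MuseumType',
  History: 'MuseumType',
  Municipal: 'ArchiveType',
  State: 'ArchiveType',
};

// Path from a node up to the root, starting with the node itself.
function ancestors(node: string): string[] {
  const path: string[] = [];
  let current: string | null = node;
  while (current !== null) {
    path.push(current);
    current = PARENT[current] ?? null;
  }
  return path;
}

// Assumed normalization: a leaf-to-root-to-leaf path has 4 edges in this fragment.
const MAX_PATH_EDGES = 4;

// Assumed definition: edges on the path through the lowest common ancestor,
// scaled to the range 0..1.
function topologicalDistance(a: string, b: string): number {
  const pathA = ancestors(a);
  const pathB = ancestors(b);
  for (let i = 0; i < pathA.length; i++) {
    const j = pathB.indexOf(pathA[i]);
    if (j !== -1) return (i + j) / MAX_PATH_EDGES;
  }
  return 1; // no common ancestor: treat as maximally distant
}

// The combined similarity formula from this rule.
function finalScore(embeddingSimilarity: number, distance: number): number {
  return 0.7 * embeddingSimilarity + 0.3 * (1 - distance);
}
```

Under these assumptions, two queries that both resolve to `ArtMuseum` with 0.95 embedding similarity score 0.965, while `ArtMuseum` against its sibling `History` scores 0.815 and against `Municipal` scores 0.665. Whether the lower scores count as a miss depends on the hit threshold, which this rule does not specify.
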
## Holarchic Cache Structure

Cache entries are **holons** - simultaneously complete AND parts of aggregates:

| Level | Example | Aggregates |
|-------|---------|------------|
| Micro | "Rijksmuseum details" | None |
| Meso | "Museums in Amsterdam" | List of micro holons |
| Macro | "Heritage in Noord-Holland" | All meso holons in region |

```typescript
interface CachedQuery {
  // ... existing fields ...
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[];  // Higher-level cache keys
  aggregates?: string[];      // Lower-level entries
}
```

## Problem Statement

The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:

```
Query:  "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅
```

However, the current implementation uses **hardcoded regex patterns**:

```typescript
// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<string, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};
```

**Problems with hardcoded patterns**:

1. **Maintenance burden** - Every new institution type requires code changes
2. **Missing subtypes** - "kunstmuseum" vs "museum" should cache separately
3. **No multilingual support** - Only Dutch/English, misses German/French labels
4. **Duplication** - Same vocabulary exists in LinkML schemas
5. **No record type awareness** - "burgerlijke stand" queries mixed with general archive queries

## Solution: Schema-Derived Vocabulary

The LinkML schema already contains rich vocabulary:

| Schema File | Content | Cache Utility |
|-------------|---------|---------------|
| `CustodianType.yaml` | 19 top-level types | Primary segmentation (M/A/L/G...) |
| `MuseumType.yaml` | 187+ museum subtypes | Subtype segmentation |
| `ArchiveOrganizationType.yaml` | 144+ archive subtypes | Subtype segmentation |
| `*RecordSetTypes.yaml` | Record type taxonomies | Finding aids specificity |

### Vocabulary Sources in Schema

1. **`type_label`** - Multilingual labels via `skos:prefLabel`
2. **`structured_aliases`** - Language-tagged alternative names
3. **`keywords`** - Search terms for entity recognition
4. **`wikidata_entity`** - Linked Data identifiers

## Architecture

### Overview: Two-Tier Embedding Hierarchy

The system uses a **hierarchical embedding approach** for fast semantic routing:

1. **Tier 1: Types File Embeddings** - Which category? (Museum vs Archive vs Library)
2. **Tier 2: Individual Type Embeddings** - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)

```
┌─────────────────────────────────────────────────────────────────────────┐
│  BUILD TIME: Extract vocabulary + generate embeddings                   │
│                                                                         │
│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
│                          ↓                                              │
│  scripts/extract-types-vocab.ts                                         │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ types-vocab.json                                                  │  │
│  │ ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }    │  │
│  │ ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...] }│  │
│  │ └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"} │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│  RUNTIME: Two-Tier Semantic Routing                                     │
│                                                                         │
│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ TIER 1: Types File Selection                                      │  │
│  │ Query embedding vs Tier1 embeddings (19 categories)               │  │
│  │ Result: ArchiveOrganizationType (similarity: 0.89)                │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ TIER 2: Specific Type Selection                                   │  │
│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)        │  │
│  │ Result: MunicipalArchive (similarity: 0.94)                       │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                          ↓                                              │
│  Structured cache key: "count:A.MUNICIPAL_ARCHIVE:amsterdam"            │
└─────────────────────────────────────────────────────────────────────────┘
```

### Tier 1: Types File Embeddings

Each Types file (e.g., `MuseumType.yaml`, `ArchiveOrganizationType.yaml`) gets ONE embedding representing the **accumulated vocabulary** of all types within that file.

**Embedding Text Construction**:

```
MuseumType: museum musea kunstmuseum art museum natural history museum
science museum open-air museum ecomuseum virtual museum heritage farm
national museum regional museum university museum
[... all keywords from all 187 subtypes ...]
```

**Purpose**: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.

| Types File | Code | Accumulated Terms Count |
|------------|------|------------------------|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |

### Tier 2: Individual Type Embeddings

Each **specific type** within a Types file gets its own embedding from its accumulated terms.

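
How the embedding text for both tiers might be assembled can be sketched as follows. This is only an illustration: the `ParsedType` shape, the function names, and the lowercasing and de-duplication steps are assumptions, not taken from `scripts/extract-types-vocab.ts`.

```typescript
// Hypothetical shape of one type definition after parsing a LinkML schema file.
interface ParsedType {
  name: string;        // e.g. "MunicipalArchive"
  keywords: string[];  // from `keywords`
  aliases: string[];   // from `structured_aliases`
  labels: string[];    // from `type_label`
}

// Lowercase, trim, and de-duplicate terms while preserving first-seen order.
function uniqueTerms(terms: string[]): string[] {
  const seen: Record<string, boolean> = {};
  const out: string[] = [];
  for (const term of terms) {
    const key = term.trim().toLowerCase();
    if (key.length > 0 && !seen[key]) {
      seen[key] = true;
      out.push(key);
    }
  }
  return out;
}

// All vocabulary of a single type, in source order.
function termsOf(type: ParsedType): string[] {
  return type.keywords.concat(type.aliases, type.labels);
}

// Tier 2 text: the vocabulary of one specific type.
function tier2Text(type: ParsedType): string {
  return `${type.name}: ${uniqueTerms(termsOf(type)).join(' ')}`;
}

// Tier 1 text: the accumulated vocabulary of every type in one Types file.
function tier1Text(fileName: string, types: ParsedType[]): string {
  let all: string[] = [];
  for (const type of types) {
    all = all.concat(termsOf(type));
  }
  return `${fileName}: ${uniqueTerms(all).join(' ')}`;
}
```

Each returned string would then be passed once to the embedding model at build time.
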
**Embedding Text Construction**:

```
MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
town archive local government records burgerlijke stand bevolkingsregister
council minutes building permits
[... all keywords + structured_aliases + labels ...]
```

**Purpose**: Precise subtype identification after Tier 1 narrows the category.

### Term Log Structure

A lookup table mapping every extracted term to its type/subtype:

```json
{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}
```

**Purpose**:

1. Fast keyword matching with no embedding call needed for exact matches (each term resolves in O(1); scanning a query for substrings is linear in the number of terms)
2. Audit trail of which terms map to which types
3. Debugging which queries match which types

### Runtime Lookup Strategy

```typescript
async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();

  // FAST PATH: Check termLog for exact keyword matches.
  // Prefer the longest matching term so "kunstmuseum" wins over "museum"
  // regardless of the order of entries in termLog.
  let matchedTerm: string | null = null;
  for (const term of Object.keys(vocab.termLog)) {
    if (normalized.includes(term) && (!matchedTerm || term.length > matchedTerm.length)) {
      matchedTerm = term;
    }
  }
  if (matchedTerm) {
    const mapping = vocab.termLog[matchedTerm];
    return {
      institutionType: mapping.typeCode,
      institutionSubtype: mapping.subtypeName,
      recordSetType: mapping.recordSetType,
      subtypeWikidata: mapping.wikidata,
      // ... location and intent extraction
    };
  }

  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);

  // Tier 1: Find best matching Types file
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }
  if (!bestType) return {};  // No type matched

  // tier1Embeddings is keyed by class name ("MuseumType"), while institutionTypes
  // and tier2Embeddings are keyed by type code ("M"), so resolve the code first.
  const typeEntry = Object.values(vocab.institutionTypes).find(t => t.className === bestType);
  if (!typeEntry) return {};
  const typeCode = typeEntry.code;

  // Tier 2: Find best matching subtype within the Types file
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;
  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }

  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype,
    // ... location and intent extraction
  };
}
```

### Embedding Model Choice

For build-time embedding generation, use the same model as the semantic cache:

| Option | Model | Dimensions | Quality |
|--------|-------|------------|---------|
| **Primary** | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Good multilingual |
| Fallback | `all-MiniLM-L6-v2` | 384 | English-focused |
| High Quality | `multilingual-e5-large` | 1024 | Best multilingual |

**Build-time generation**: Type and subtype embeddings are generated ONCE at build time and stored in JSON, so they are never recomputed at runtime. Only the query itself needs an embedding, and only when the termLog fast path finds no match.

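
The runtime lookup calls a `cosineSimilarity` helper that this rule does not show. A minimal sketch of such a helper follows, together with a hypothetical `bestMatch` function that captures the "highest similarity above a threshold" pattern used in both tiers. The dimension check is an added assumption; it guards against comparing a query embedding from one model (for example 1024 dimensions) with stored embeddings from another (384 dimensions).

```typescript
// Cosine similarity between two dense vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Embedding dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Match {
  name: string;
  similarity: number;
}

// Hypothetical helper: the candidate with the highest similarity strictly
// above `threshold`, or null when no candidate qualifies.
function bestMatch(
  query: number[],
  candidates: Record<string, number[]>,
  threshold: number
): Match | null {
  let best: Match | null = null;
  for (const name of Object.keys(candidates)) {
    const similarity = cosineSimilarity(query, candidates[name]);
    if (similarity > threshold && (best === null || similarity > best.similarity)) {
      best = { name, similarity };
    }
  }
  return best;
}
```

With such a helper, Tier 1 would reduce to `bestMatch(queryEmbedding, vocab.tier1Embeddings, 0.7)` and Tier 2 to `bestMatch(queryEmbedding, vocab.tier2Embeddings[typeCode] || {}, 0.75)`.
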
## TypesVocabulary JSON Structure

Generated at build time with **pre-computed embeddings**:

```json
{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,

  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },

  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },

  "termLog": {
    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
  },

  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },

  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}
```

### Key Additions for Embedding Support

| Field | Purpose |
|-------|---------|
| `tier1Embeddings` | Pre-computed embeddings for each Types file (19 categories) |
| `tier2Embeddings` | Pre-computed embeddings for each subtype (500+ types) |
| `termLog` | Lookup table for exact keyword matches (O(1) per term) |
| `accumulatedTerms` | Raw text used to generate embeddings (for debugging/regeneration) |
| `embeddingModel` | Model used to generate embeddings (for reproducibility) |

## Enhanced ExtractedEntities Interface

```typescript
export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;

  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;       // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;     // e.g., 'Q8362876' for LOD integration
}
```

## Enhanced Cache Key Format

```
{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}

Examples:
- "count:m:amsterdam"               # Basic museum count
- "count:m.art_museum:amsterdam"    # Art museum count (subtype)
- "list:a.municipal_archive:nh"     # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht"  # Civil registry in Utrecht
- "info:a.national_archive::nl"     # National archive info (no location filter)
```

## Implementation Files

| File | Purpose |
|------|---------|
| `scripts/extract-types-vocab.ts` | Build-time vocabulary extraction from LinkML |
| `apps/archief-assistent/public/types-vocab.json` | Generated vocabulary file |
| `apps/archief-assistent/src/lib/types-vocabulary.ts` | Runtime vocabulary loader |
| `apps/archief-assistent/src/lib/semantic-cache.ts` | Updated entity extraction |

## Build Integration

Add to `apps/archief-assistent/package.json`:

```json
{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}
```

## Keyword Extraction Priority

When extracting keywords from schema files:

1. **`keywords`** array (highest priority) - Explicit search terms
2. **`structured_aliases.literal_form`** - Multilingual alternative names
3. **`type_label`** - Preferred labels per language
4. **Class name conversion** - `MunicipalArchive` → "municipal archive"

## Cache Segmentation Rules

### Rule 1: Subtype Specificity

Queries with **specific subtypes** should NOT match **generic type** cache entries:

```
Query:  "kunstmusea in Amsterdam"  → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam"       → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅
```

### Rule 2: Record Set Type Isolation

Queries about **specific record types** should cache separately:

```
Query:  "burgerlijke stand Utrecht"  → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅
```

### Rule 3: No Subtype-to-Type Fallback

Generic queries must NOT be answered from subtype cache entries, because a subtype entry covers only a subset of what the generic query asks for:

```
Query:  "musea in Amsterdam"       → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam"  → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query)
```

## Migration Notes

1. **Backwards Compatible**: Existing cache entries without `institutionSubtype` continue to work
2. **Gradual Rollout**: New cache entries get subtype, old entries remain valid
3. **Cache Clear**: Consider clearing cache after deployment to ensure consistency

## Validation

Run E2E tests to verify:

```bash
cd apps/archief-assistent
npm run test:e2e
```

Key test cases:

- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
- Subtype isolation (kunstmuseum ≠ museum)
- Record set isolation (burgerlijke stand ≠ archive)
- Intent isolation (count ≠ list ≠ info)

## References

- **Rule 41**: Types classes define SPARQL template variables
- **Rule 0b**: Type/Types file naming convention
- **CustodianType.yaml**: Base taxonomy definition
- **AGENTS.md**: GLAMORCUBESFIXPHDNT taxonomy documentation

---

**Created**: 2026-01-10

**Author**: OpenCode Agent

**Status**: Implemented (v2.0)

## References

- Pavlyshyn, V. "Context Graphs and Data Traces: Building Epistemology Layers for Agentic Memory"
- Pavlyshyn, V. "The Shape of Knowledge: Topology Theory for Knowledge Graphs"
- Pavlyshyn, V. "Beyond Hierarchy: Why Agentic AI Systems Need Holarchies"
- Pavlyshyn, V. "Smalltalk: The Language That Changed Everything"
- Pavlyshyn, V. "Clarity Traders: Beyond Vibe Coding"