Rule 46: Ontology-Driven Cache Segmentation
🚨 CRITICAL: The semantic cache MUST use vocabulary derived from LinkML *Type.yaml and *Types.yaml schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.
Status: Implemented (Evolved v2.0)
Version: 2.0 (Epistemological Evolution)
Updated: 2026-01-10
Evolution Overview
Rule 46 v2.0 incorporates insights from Volodymyr Pavlyshyn's work on agentic memory systems:
- Epistemic Provenance (Phase 1) - Track WHERE, WHEN, HOW data originated
- Topological Distance (Phase 2) - Use ontology structure, not just embeddings
- Holarchic Cache (Phase 3) - Entries as holons with up/down links
- Message Passing (Phase 4, planned) - Smalltalk-style introspectable cache
- Clarity Trading (Phase 5, planned) - Block ambiguous queries from cache
Epistemic Provenance
Every cached response carries epistemological metadata:
interface EpistemicProvenance {
  dataSource: 'ISIL_REGISTRY' | 'WIKIDATA' | 'CUSTODIAN_YAML' | 'LLM_INFERENCE' | ...;
  dataTier: 1 | 2 | 3 | 4; // TIER_1_AUTHORITATIVE → TIER_4_INFERRED
  sourceTimestamp: string;
  derivationChain: string[]; // ["SPARQL:Qdrant", "RAG:retrieve", "LLM:generate"]
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}
Benefit: Users see "This answer is from TIER_1 ISIL registry data, captured 2025-01-08".
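A minimal sketch of how this metadata might be consumed, assuming the EpistemicProvenance shape above. The helpers describeProvenance and needsRevalidation are hypothetical, the dataSource union is widened to string because the rule leaves it open-ended, and the one-day and seven-day windows are illustrative assumptions rather than values stated in this rule.

```typescript
// Hypothetical helpers. The rule's open-ended dataSource union is widened to string here.
interface EpistemicProvenance {
  dataSource: string;
  dataTier: 1 | 2 | 3 | 4;
  sourceTimestamp: string; // assumed to be ISO 8601
  derivationChain: string[];
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Builds a user-facing provenance sentence in the spirit of the one quoted above.
function describeProvenance(p: EpistemicProvenance): string {
  const capturedOn = p.sourceTimestamp.slice(0, 10); // YYYY-MM-DD part of the timestamp
  return `This answer is from TIER_${p.dataTier} ${p.dataSource} data, captured ${capturedOn}`;
}

// Decides whether a cached entry should be re-checked against its source.
// The one-day and seven-day windows are assumptions for illustration.
function needsRevalidation(p: EpistemicProvenance, now: Date): boolean {
  if (p.revalidationPolicy === 'static') return false;
  if (p.revalidationPolicy === 'on_access') return true;
  const ageMs = now.getTime() - new Date(p.sourceTimestamp).getTime();
  const maxAgeMs = p.revalidationPolicy === 'daily' ? DAY_MS : 7 * DAY_MS;
  return ageMs > maxAgeMs;
}
```

Passing the current time in as a parameter keeps the staleness check deterministic in tests.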
Topological Distance
Beyond embedding similarity, cache matching considers structural distance in the type hierarchy:
                     HeritageCustodian (*)
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
   MuseumType (M)       ArchiveType (A)      LibraryType (L)
          │                    │                    │
     ┌────┴────┐          ┌────┴────┐          ┌────┴────┐
     ▼         ▼          ▼         ▼          ▼         ▼
 ArtMuseum  History   Municipal   State     Public   Academic
Combined Similarity Formula:
finalScore = 0.7 * embeddingSimilarity + 0.3 * (1 - topologicalDistance)
Benefit: "Art museum" won't match "natural history museum" even with 95% embedding similarity.
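The formula above can be sketched as follows. This rule does not define how topologicalDistance is normalised, so the sketch assumes it is the number of edges on the path between two types (through their lowest common ancestor) divided by the longest possible path in the tree. The PARENT map mirrors the diagram above, and all function names are hypothetical.

```typescript
// Parent links taken from the diagram above (null marks the root).
const PARENT: Record<string, string | null> = {
  HeritageCustodian: null,
  MuseumType: 'HeritageCustodian',
  ArchiveType: 'HeritageCustodian',
  LibraryType: 'HeritageCustodian',
  ArtMuseum: 'MuseumType',
  History: 'MuseumType',
  Municipal: 'ArchiveType',
  State: 'ArchiveType',
  Public: 'LibraryType',
  Academic: 'LibraryType',
};
const MAX_DEPTH = 2; // leaf-to-root edge count in this small example tree

// Returns the node followed by all of its ancestors up to the root.
function ancestors(node: string): string[] {
  const path: string[] = [];
  let current: string | null = node;
  while (current !== null) {
    path.push(current);
    current = PARENT[current] ?? null;
  }
  return path;
}

// Assumed normalisation: path length via the lowest common ancestor,
// divided by the longest possible path (leaf to root to leaf).
function topologicalDistance(a: string, b: string): number {
  const pathA = ancestors(a);
  const pathB = ancestors(b);
  for (let i = 0; i < pathA.length; i++) {
    const j = pathB.indexOf(pathA[i]);
    if (j !== -1) return (i + j) / (2 * MAX_DEPTH);
  }
  return 1; // no shared ancestor: treat as maximally distant
}

// Combined similarity with the 0.7 / 0.3 weights from the formula above.
function finalScore(embeddingSimilarity: number, a: string, b: string): number {
  return 0.7 * embeddingSimilarity + 0.3 * (1 - topologicalDistance(a, b));
}
```

Under these assumptions, two sibling museum subtypes with 0.95 embedding similarity score about 0.815, and types in different branches score about 0.665. Whether either value counts as a miss depends on the match threshold, which this rule does not state.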
Holarchic Cache Structure
Cache entries are holons - simultaneously complete AND parts of aggregates:
| Level | Example | Aggregates |
|---|---|---|
| Micro | "Rijksmuseum details" | None |
| Meso | "Museums in Amsterdam" | List of micro holons |
| Macro | "Heritage in Noord-Holland" | All meso holons in region |
interface CachedQuery {
  // ... existing fields ...
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[]; // Higher-level cache keys
  aggregates?: string[]; // Lower-level entries
}
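One way the up/down links could be maintained is sketched below. It assumes cache entries live in a Map keyed by cache key. The HolonEntry type carries only the three holon fields from CachedQuery, and linkHolons and collectAncestors are hypothetical names, not functions from semantic-cache.ts.

```typescript
// Only the holon-related fields of CachedQuery are modelled here.
interface HolonEntry {
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[];
  aggregates?: string[];
}

// Records a parent/child relation on both entries so the links stay symmetric.
function linkHolons(store: Map<string, HolonEntry>, parentKey: string, childKey: string): void {
  const parent = store.get(parentKey);
  const child = store.get(childKey);
  if (!parent || !child) return; // both entries must already be cached
  parent.aggregates = Array.from(new Set([...(parent.aggregates ?? []), childKey]));
  child.participatesIn = Array.from(new Set([...(child.participatesIn ?? []), parentKey]));
}

// Walks participatesIn links upward, e.g. to find aggregates affected when an entry changes.
function collectAncestors(store: Map<string, HolonEntry>, key: string): string[] {
  const found: string[] = [];
  const queue: string[] = [key];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const parentKey of store.get(current)?.participatesIn ?? []) {
      if (!found.includes(parentKey)) {
        found.push(parentKey);
        queue.push(parentKey);
      }
    }
  }
  return found;
}
```

Keeping both directions in sync lets a change to a micro entry be traced to every meso and macro entry that aggregates it.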
Problem Statement
The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:
Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅
However, the current implementation uses hardcoded regex patterns:
// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};
Problems with hardcoded patterns:
- Maintenance burden - Every new institution type requires code changes
- Missing subtypes - "kunstmuseum" vs "museum" should cache separately
- No multilingual support - Only Dutch/English, misses German/French labels
- Duplication - Same vocabulary exists in LinkML schemas
- No record type awareness - "burgerlijke stand" queries mixed with general archive queries
Solution: Schema-Derived Vocabulary
The LinkML schema already contains rich vocabulary:
| Schema File | Content | Cache Utility |
|---|---|---|
| CustodianType.yaml | 19 top-level types | Primary segmentation (M/A/L/G...) |
| MuseumType.yaml | 187+ museum subtypes | Subtype segmentation |
| ArchiveOrganizationType.yaml | 144+ archive subtypes | Subtype segmentation |
| *RecordSetTypes.yaml | Record type taxonomies | Finding aids specificity |
Vocabulary Sources in Schema
- type_label - Multilingual labels via skos:prefLabel
- structured_aliases - Language-tagged alternative names
- keywords - Search terms for entity recognition
- wikidata_entity - Linked Data identifiers
Architecture
Overview: Two-Tier Embedding Hierarchy
The system uses a hierarchical embedding approach for fast semantic routing:
- Tier 1: Types File Embeddings - Which category? (Museum vs Archive vs Library)
- Tier 2: Individual Type Embeddings - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)
┌─────────────────────────────────────────────────────────────────────────┐
│ BUILD TIME: Extract vocabulary + generate embeddings │
│ │
│ schemas/20251121/linkml/modules/classes/*Type.yaml │
│ schemas/20251121/linkml/modules/classes/*Types.yaml │
│ ↓ │
│ scripts/extract-types-vocab.ts │
│ ↓ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ types-vocab.json │ │
│ │ ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] } │ │
│ │ ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│ │
│ │ └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│ │
│ └───────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│ RUNTIME: Two-Tier Semantic Routing │
│ │
│ Query: "Hoeveel gemeentearchieven in Amsterdam?" │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TIER 1: Types File Selection │ │
│ │ Query embedding vs Tier1 embeddings (19 categories) │ │
│ │ Result: ArchiveOrganizationType (similarity: 0.89) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TIER 2: Specific Type Selection │ │
│ │ Query embedding vs Tier2 embeddings (144 archive subtypes) │ │
│ │ Result: MunicipalArchive (similarity: 0.94) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Structured cache key: "count:a.municipal_archive:amsterdam" │
└─────────────────────────────────────────────────────────────────────────┘
Tier 1: Types File Embeddings
Each Types file (e.g., MuseumType.yaml, ArchiveOrganizationType.yaml) gets ONE embedding
representing the accumulated vocabulary of all types within that file.
Embedding Text Construction:
MuseumType: museum musea kunstmuseum art museum natural history museum
science museum open-air museum ecomuseum virtual museum
heritage farm national museum regional museum university museum
[... all keywords from all 187 subtypes ...]
Purpose: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.
| Types File | Code | Accumulated Terms Count |
|---|---|---|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |
Tier 2: Individual Type Embeddings
Each specific type within a Types file gets its own embedding from its accumulated terms.
Embedding Text Construction:
MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
town archive local government records burgerlijke stand
bevolkingsregister council minutes building permits
[... all keywords + structured_aliases + labels ...]
Purpose: Precise subtype identification after Tier 1 narrows the category.
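The embedding text construction for both tiers can be sketched as below. This is a simplified illustration, assuming each type and subtype exposes a keywords object keyed by language as in the vocabulary JSON. The functions uniqueTerms, tier2Text and tier1Text are hypothetical, and lowercasing every term is an assumption (it also lowercases German nouns).

```typescript
type KeywordsByLang = Record<string, string[]>;

interface SubtypeVocab {
  keywords: KeywordsByLang;
}

interface TypeVocab {
  keywords: KeywordsByLang;
  subtypes: Record<string, SubtypeVocab>;
}

// Flattens per-language keywords into one lowercase, de-duplicated list (first occurrence wins).
function uniqueTerms(keywords: KeywordsByLang): string[] {
  const seen = new Set<string>();
  for (const terms of Object.values(keywords)) {
    for (const term of terms) seen.add(term.toLowerCase());
  }
  return Array.from(seen);
}

// Tier 2: one text per subtype, built from that subtype's own terms.
function tier2Text(subtype: SubtypeVocab): string {
  return uniqueTerms(subtype.keywords).join(' ');
}

// Tier 1: one text per Types file, accumulating the type's terms and all subtype terms.
function tier1Text(type: TypeVocab): string {
  const all = new Set<string>(uniqueTerms(type.keywords));
  for (const subtype of Object.values(type.subtypes)) {
    for (const term of uniqueTerms(subtype.keywords)) all.add(term);
  }
  return Array.from(all).join(' ');
}
```

The resulting strings are what would be passed to the embedding model at build time. Aliases and labels could be merged in the same way as keywords.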
Term Log Structure
A lookup table mapping every extracted term to its type/subtype:
{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}
Purpose:
- Fast O(1) keyword lookup (no embedding needed for exact matches)
- Audit trail of which terms map to which types
- Debugging which queries match which types
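A constant-time lookup per candidate requires exact keys rather than substring scans. The sketch below is one possible approach, not the project's implementation: it splits the query into words, builds word n-grams from longest to shortest, and checks each against the term log. The lookupTerm function, the TermMapping shape, the punctuation list and the three-word limit are all assumptions.

```typescript
interface TermMapping {
  typeCode: string;
  subtypeName?: string;
  recordSetType?: string;
}

// Looks up word n-grams of the query in the term log, longest n-gram first,
// so multi-word terms such as "burgerlijke stand" win over single words.
function lookupTerm(
  query: string,
  termLog: Record<string, TermMapping>,
  maxWords: number = 3,
): TermMapping | null {
  const words = query
    .toLowerCase()
    .split(/[\s?!.,;:"']+/)
    .filter((w) => w.length > 0);
  for (let size = Math.min(maxWords, words.length); size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      const candidate = words.slice(start, start + size).join(' ');
      if (Object.prototype.hasOwnProperty.call(termLog, candidate)) {
        return termLog[candidate];
      }
    }
  }
  return null; // no exact term found: fall back to embedding-based matching
}
```

Each probe is a single hash lookup, and a word like "kunstmuseum" only matches its own entry rather than the shorter "museum". Inflected forms that are not in the term log still need the embedding path.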
Runtime Lookup Strategy
async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();
  // FAST PATH: Check termLog for exact keyword matches.
  // Keep the longest matching term so "kunstmuseum" wins over "museum".
  let matchedTerm: string | null = null;
  for (const term of Object.keys(vocab.termLog)) {
    if (normalized.includes(term) && (matchedTerm === null || term.length > matchedTerm.length)) {
      matchedTerm = term;
    }
  }
  if (matchedTerm !== null) {
    const mapping = vocab.termLog[matchedTerm];
    return {
      institutionType: mapping.typeCode,
      institutionSubtype: mapping.subtypeName ?? null,
      recordSetType: mapping.recordSetType ?? null, // set for record-type terms such as "burgerlijke stand"
      subtypeWikidata: mapping.wikidata ?? null,
      // ... location and intent extraction
    };
  }
  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);
  // Tier 1: Find best matching Types file
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }
  if (!bestType) return {}; // No type matched
  // Tier 2: Find best matching subtype within the Types file.
  // tier1Embeddings is keyed by class name ("MuseumType"), while institutionTypes and
  // tier2Embeddings are keyed by type code ("M"), so resolve the code via className.
  const typeEntry = Object.values(vocab.institutionTypes).find((t) => t.className === bestType);
  if (!typeEntry) return {}; // Vocabulary is inconsistent for this type
  const typeCode = typeEntry.code;
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;
  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }
  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype,
    // ... location and intent extraction
  };
}
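The lookup above calls cosineSimilarity without showing it. A minimal sketch of that helper follows, together with a hypothetical bestMatch function that captures the "highest similarity above a threshold" pattern used in both tiers. Neither is taken from semantic-cache.ts.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the candidate name with the highest similarity strictly above the threshold, or null.
function bestMatch(
  query: number[],
  candidates: Record<string, number[]>,
  threshold: number,
): string | null {
  let best: string | null = null;
  let bestSimilarity = threshold;
  for (const [name, embedding] of Object.entries(candidates)) {
    const similarity = cosineSimilarity(query, embedding);
    if (similarity > bestSimilarity) {
      bestSimilarity = similarity;
      best = name;
    }
  }
  return best;
}
```

With such a helper, the Tier 1 and Tier 2 loops reduce to two bestMatch calls using thresholds 0.7 and 0.75.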
Embedding Model Choice
For build-time embedding generation, use the same model as the semantic cache:
| Option | Model | Dimensions | Quality |
|---|---|---|---|
| Primary | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Good multilingual |
| Fallback | all-MiniLM-L6-v2 | 384 | English-focused |
| High Quality | multilingual-e5-large | 1024 | Best multilingual |
Build-time generation: Embeddings are generated ONCE at build time and stored in JSON. This avoids runtime embedding API calls for type classification.
TypesVocabulary JSON Structure
Generated at build time with pre-computed embeddings:
{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,
  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },
  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },
  "termLog": {
    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
  },
  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },
  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}
Key Additions for Embedding Support
| Field | Purpose |
|---|---|
| tier1Embeddings | Pre-computed embeddings for each Types file (19 categories) |
| tier2Embeddings | Pre-computed embeddings for each subtype (500+ types) |
| termLog | Fast O(1) lookup table for exact keyword matches |
| accumulatedTerms | Raw text used to generate embeddings (for debugging/regeneration) |
| embeddingModel | Model used to generate embeddings (for reproducibility) |
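Because embeddingModel and embeddingDimensions are stored alongside the vectors, a loader can check that every pre-computed embedding has the declared length before using it. The sketch below illustrates such a check. The function name findDimensionMismatches is hypothetical and nothing in this rule says the runtime loader performs this validation.

```typescript
// Only the embedding-related fields of the vocabulary file are modelled here.
interface EmbeddingVocabulary {
  embeddingDimensions: number;
  tier1Embeddings: Record<string, number[]>;
  tier2Embeddings: Record<string, Record<string, number[]>>;
}

// Returns the path of every embedding whose length differs from embeddingDimensions.
function findDimensionMismatches(vocab: EmbeddingVocabulary): string[] {
  const problems: string[] = [];
  for (const [typeName, vector] of Object.entries(vocab.tier1Embeddings)) {
    if (vector.length !== vocab.embeddingDimensions) {
      problems.push(`tier1Embeddings.${typeName}`);
    }
  }
  for (const [typeCode, subtypes] of Object.entries(vocab.tier2Embeddings)) {
    for (const [subtypeName, vector] of Object.entries(subtypes)) {
      if (vector.length !== vocab.embeddingDimensions) {
        problems.push(`tier2Embeddings.${typeCode}.${subtypeName}`);
      }
    }
  }
  return problems;
}
```

An empty result means query embeddings from the same model can be compared against every stored vector. A non-empty result suggests the file was generated with a different model than the one declared.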
Enhanced ExtractedEntities Interface
export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;
  // NEW: Ontology-derived fields
  institutionSubtype?: string | null; // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null; // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null; // e.g., 'Q8362876' for LOD integration
}
Enhanced Cache Key Format
{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
Examples:
- "count:m:amsterdam" # Basic museum count
- "count:m.art_museum:amsterdam" # Art museum count (subtype)
- "list:a.municipal_archive:nh" # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht" # Civil registry in Utrecht
- "info:a.national_archive::nl" # National archive info (empty record set segment, location nl)
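A key builder following the format above might look like this sketch. The CacheKeyParts type and the buildCacheKey function are hypothetical, and lowercasing the whole key is an assumption based on the lowercase examples.

```typescript
interface CacheKeyParts {
  intent: string;
  institutionType: string;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

// Assembles {intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}.
function buildCacheKey(parts: CacheKeyParts): string {
  const typeSegment = parts.institutionSubtype
    ? `${parts.institutionType}.${parts.institutionSubtype}`
    : parts.institutionType;
  const segments: string[] = [parts.intent, typeSegment];
  if (parts.recordSetType) segments.push(parts.recordSetType); // optional segment
  segments.push(parts.location);
  return segments.join(':').toLowerCase();
}
```

This sketch drops the record set segment entirely when it is absent, following the bracketed optional in the format, so it would not produce the empty `::` segment shown in the last example.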
Implementation Files
| File | Purpose |
|---|---|
| scripts/extract-types-vocab.ts | Build-time vocabulary extraction from LinkML |
| apps/archief-assistent/public/types-vocab.json | Generated vocabulary file |
| apps/archief-assistent/src/lib/types-vocabulary.ts | Runtime vocabulary loader |
| apps/archief-assistent/src/lib/semantic-cache.ts | Updated entity extraction |
Build Integration
Add to apps/archief-assistent/package.json:
{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}
Keyword Extraction Priority
When extracting keywords from schema files:
1. keywords array (highest priority) - Explicit search terms
2. structured_aliases.literal_form - Multilingual alternative names
3. type_label - Preferred labels per language
4. Class name conversion - MunicipalArchive → "municipal archive"
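The priority order above can be sketched as a single extraction function. The SchemaClass shape is a simplified assumption about the parsed YAML (in particular, type_label is flattened to a plain string array here), and classNameToTerm and extractTerms are hypothetical names rather than functions from extract-types-vocab.ts.

```typescript
// Simplified, assumed view of one parsed LinkML class.
interface SchemaClass {
  name: string;
  keywords?: string[];
  structured_aliases?: { literal_form: string }[];
  type_label?: string[];
}

// Converts a PascalCase class name to a lowercase phrase, e.g. MunicipalArchive -> "municipal archive".
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1 $2').toLowerCase();
}

// Collects terms in priority order and drops duplicates, keeping the highest-priority occurrence.
function extractTerms(cls: SchemaClass): string[] {
  const ordered: string[] = [
    ...(cls.keywords ?? []),
    ...(cls.structured_aliases ?? []).map((alias) => alias.literal_form),
    ...(cls.type_label ?? []),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of ordered) {
    const term = raw.trim().toLowerCase();
    if (term.length > 0 && !seen.has(term)) {
      seen.add(term);
      result.push(term);
    }
  }
  return result;
}
```

The class name conversion acts as a last resort, so a class with no keywords, aliases or labels still contributes at least one term.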
Cache Segmentation Rules
Rule 1: Subtype Specificity
Queries with specific subtypes should NOT match generic type cache entries:
Query: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam" → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅
Rule 2: Record Set Type Isolation
Queries about specific record types should cache separately:
Query: "burgerlijke stand Utrecht" → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht" → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅
Rule 3: No Subtype-to-Type Fallback
Generic queries must NOT match subtype cache entries, because a subtype entry covers only a subset of what the generic query asks for:
Query: "musea in Amsterdam" → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam" → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query)
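Each example above ends in a MISS as soon as one key segment differs, which amounts to requiring equality on every extracted field. A minimal sketch of that comparison follows. The SegmentEntities type and the canServeFromCache function are hypothetical, and treating a missing field as equal only to another missing field is an assumption.

```typescript
interface SegmentEntities {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

const SEGMENT_FIELDS = [
  'intent',
  'institutionType',
  'institutionSubtype',
  'recordSetType',
  'location',
] as const;

// A cached entry may serve a query only when every segment matches exactly.
// Missing and null values are treated alike, so "no subtype" only matches "no subtype".
function canServeFromCache(query: SegmentEntities, cached: SegmentEntities): boolean {
  return SEGMENT_FIELDS.every(
    (field) => (query[field] ?? '').toLowerCase() === (cached[field] ?? '').toLowerCase(),
  );
}
```

Under this check, a generic museum query and an art museum entry miss in both directions, and a civil registry query misses a general archive entry.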
Migration Notes
- Backwards Compatible: Existing cache entries without institutionSubtype continue to work
- Gradual Rollout: New cache entries get subtype, old entries remain valid
- Cache Clear: Consider clearing cache after deployment to ensure consistency
Validation
Run E2E tests to verify:
cd apps/archief-assistent
npm run test:e2e
Key test cases:
- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
- Subtype isolation (kunstmuseum ≠ museum)
- Record set isolation (burgerlijke stand ≠ archive)
- Intent isolation (count ≠ list ≠ info)
References
- Rule 41: Types classes define SPARQL template variables
- Rule 0b: Type/Types file naming convention
- CustodianType.yaml: Base taxonomy definition
- AGENTS.md: GLAMORCUBESFIXPHDNT taxonomy documentation
Created: 2026-01-10
Author: OpenCode Agent
Status: Implemented (v2.0)
References
- Pavlyshyn, V. "Context Graphs and Data Traces: Building Epistemology Layers for Agentic Memory"
- Pavlyshyn, V. "The Shape of Knowledge: Topology Theory for Knowledge Graphs"
- Pavlyshyn, V. "Beyond Hierarchy: Why Agentic AI Systems Need Holarchies"
- Pavlyshyn, V. "Smalltalk: The Language That Changed Everything"
- Pavlyshyn, V. "Clarity Traders: Beyond Vibe Coding"