glam/.opencode/rules/context/ontology-driven-cache-segmentation.md

# Rule 46: Ontology-Driven Cache Segmentation

🚨 **CRITICAL**: The semantic cache MUST use vocabulary derived from LinkML `*Type.yaml` and `*Types.yaml` schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.

**Status**: Implemented (Evolved v2.0)
**Version**: 2.0 (Epistemological Evolution)
**Updated**: 2026-01-10

## Evolution Overview

Rule 46 v2.0 incorporates insights from Volodymyr Pavlyshyn's work on agentic memory systems:

1. **Epistemic Provenance** (Phase 1) - Track WHERE, WHEN, HOW data originated
2. **Topological Distance** (Phase 2) - Use ontology structure, not just embeddings
3. **Holarchic Cache** (Phase 3) - Entries as holons with up/down links
4. **Message Passing** (Phase 4, planned) - Smalltalk-style introspectable cache
5. **Clarity Trading** (Phase 5, planned) - Block ambiguous queries from cache

## Epistemic Provenance

Every cached response carries epistemological metadata:

```typescript
interface EpistemicProvenance {
  dataSource: 'ISIL_REGISTRY' | 'WIKIDATA' | 'CUSTODIAN_YAML' | 'LLM_INFERENCE' | ...;
  dataTier: 1 | 2 | 3 | 4;  // TIER_1_AUTHORITATIVE → TIER_4_INFERRED
  sourceTimestamp: string;
  derivationChain: string[];  // ["SPARQL:Qdrant", "RAG:retrieve", "LLM:generate"]
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}
```

**Benefit**: Users see "This answer is from TIER_1 ISIL registry data, captured 2025-01-08".
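
One way the `revalidationPolicy` field could drive cache freshness is sketched below. Only the four policy names come from the interface above; the helper name `needsRevalidation`, the interval lengths, and the assumption that `sourceTimestamp` is an ISO 8601 string are illustrative.

```typescript
// Hypothetical helper: decides whether a cached answer should be re-checked.
type RevalidationPolicy = 'static' | 'daily' | 'weekly' | 'on_access';

interface ProvenanceStamp {
  sourceTimestamp: string; // ASSUMPTION: ISO 8601 timestamp
  revalidationPolicy: RevalidationPolicy;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// ASSUMPTION: maximum age per policy. 'static' never expires,
// 'on_access' is treated as always due for a re-check.
const MAX_AGE_MS: Record<RevalidationPolicy, number> = {
  static: Infinity,
  daily: DAY_MS,
  weekly: 7 * DAY_MS,
  on_access: 0,
};

function needsRevalidation(stamp: ProvenanceStamp, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - new Date(stamp.sourceTimestamp).getTime();
  return ageMs >= MAX_AGE_MS[stamp.revalidationPolicy];
}
```

An unparseable timestamp yields `NaN` and therefore `false` here; a real implementation would probably want to treat that case as stale instead.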

## Topological Distance

Beyond embedding similarity, cache matching considers **structural distance** in the type hierarchy:

```
                    HeritageCustodian (*)
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
   MuseumType (M)    ArchiveType (A)    LibraryType (L)
        │                  │                  │
   ┌────┴────┐        ┌────┴────┐        ┌────┴────┐
   ▼         ▼        ▼         ▼        ▼         ▼
ArtMuseum  History  Municipal  State   Public  Academic
```

**Combined Similarity Formula**:
```typescript
finalScore = 0.7 * embeddingSimilarity + 0.3 * (1 - topologicalDistance)
```

**Benefit**: Structural distance lowers the score for "art museum" versus "natural history museum" even at 95% embedding similarity, so the pair is far less likely to produce a false cache hit.
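
The rule does not define how `topologicalDistance` is computed. The sketch below assumes one plausible definition: the number of edges between two nodes via their lowest common ancestor, divided by the longest possible path in the tree. The `PARENT` map, the class names in it, and the normalization are assumptions made for illustration; only the 0.7/0.3 weighting comes from the formula above.

```typescript
// Parent links for the example hierarchy above; class names are illustrative.
const PARENT: Record<string, string | null> = {
  HeritageCustodian: null,
  MuseumType: 'HeritageCustodian',
  ArchiveType: 'HeritageCustodian',
  LibraryType: 'HeritageCustodian',
  ArtMuseum: 'MuseumType',
  HistoryMuseum: 'MuseumType',
  MunicipalArchive: 'ArchiveType',
  StateArchive: 'ArchiveType',
};

// Path from a node up to the root, starting with the node itself.
function pathToRoot(node: string): string[] {
  const path: string[] = [];
  let current: string | null = node;
  while (current !== null) {
    path.push(current);
    current = PARENT[current] ?? null;
  }
  return path;
}

// ASSUMPTION: distance = edges via the lowest common ancestor,
// normalized by the longest possible path (2 * tree depth) into [0, 1].
function topologicalDistance(a: string, b: string, maxDepth: number = 2): number {
  const pathB = pathToRoot(b);
  let stepsFromA = 0;
  for (const ancestor of pathToRoot(a)) {
    const stepsFromB = pathB.indexOf(ancestor);
    if (stepsFromB !== -1) return (stepsFromA + stepsFromB) / (2 * maxDepth);
    stepsFromA++;
  }
  return 1; // no common ancestor: treat as maximally distant
}

// Combined similarity, using the weights from the formula above.
function finalScore(embeddingSimilarity: number, distance: number): number {
  return 0.7 * embeddingSimilarity + 0.3 * (1 - distance);
}
```

Under these assumptions, two sibling subtypes at 0.95 embedding similarity score about 0.815, and subtypes from different branches about 0.665. Whether either value clears the cache hit threshold depends on a threshold this rule does not state.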

## Holarchic Cache Structure

Cache entries are **holons** - simultaneously complete AND parts of aggregates:

| Level | Example | Aggregates |
|-------|---------|------------|
| Micro | "Rijksmuseum details" | None |
| Meso | "Museums in Amsterdam" | List of micro holons |
| Macro | "Heritage in Noord-Holland" | All meso holons in region |

```typescript
interface CachedQuery {
  // ... existing fields ...
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[];  // Higher-level cache keys
  aggregates?: string[];       // Lower-level entries
}
```
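
The `participatesIn` links make it possible to walk from a micro entry to every aggregate that contains it, for example to find the meso and macro entries that might need refreshing when a micro entry changes. The sketch below is a hypothetical helper, not part of the existing cache; `HolonEntry`, `collectAncestors`, and the use of a `Map` as the store are assumptions.

```typescript
// Minimal slice of CachedQuery: only the holarchy fields from the interface above.
interface HolonEntry {
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[];
  aggregates?: string[];
}

// Collect every higher-level cache key reachable from `key` by following
// `participatesIn` links upward (breadth-first, safe against cycles).
function collectAncestors(cache: Map<string, HolonEntry>, key: string): string[] {
  const seen = new Set<string>([key]);
  const queue: string[] = [key];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const parent of cache.get(current)?.participatesIn ?? []) {
      if (!seen.has(parent)) {
        seen.add(parent);
        queue.push(parent);
      }
    }
  }
  // The starting key itself is not an ancestor.
  return Array.from(seen).filter((k) => k !== key);
}
```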

## Problem Statement

The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:

```
Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅
```

However, the current implementation uses **hardcoded regex patterns**:

```typescript
// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};
```

**Problems with hardcoded patterns**:
1. **Maintenance burden** - Every new institution type requires code changes
2. **Missing subtypes** - "kunstmuseum" vs "museum" should cache separately
3. **No multilingual support** - Only Dutch/English, misses German/French labels
4. **Duplication** - Same vocabulary exists in LinkML schemas
5. **No record type awareness** - "burgerlijke stand" queries mixed with general archive queries

## Solution: Schema-Derived Vocabulary

The LinkML schema already contains rich vocabulary:

| Schema File | Content | Cache Utility |
|-------------|---------|---------------|
| `CustodianType.yaml` | 19 top-level types | Primary segmentation (M/A/L/G...) |
| `MuseumType.yaml` | 187+ museum subtypes | Subtype segmentation |
| `ArchiveOrganizationType.yaml` | 144+ archive subtypes | Subtype segmentation |
| `*RecordSetTypes.yaml` | Record type taxonomies | Finding aids specificity |

### Vocabulary Sources in Schema

1. **`type_label`** - Multilingual labels via `skos:prefLabel`
2. **`structured_aliases`** - Language-tagged alternative names
3. **`keywords`** - Search terms for entity recognition
4. **`wikidata_entity`** - Linked Data identifiers

## Architecture

### Overview: Two-Tier Embedding Hierarchy

The system uses a **hierarchical embedding approach** for fast semantic routing:

1. **Tier 1: Types File Embeddings** - Which category? (Museum vs Archive vs Library)
2. **Tier 2: Individual Type Embeddings** - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)

```
┌─────────────────────────────────────────────────────────────────────────┐
│  BUILD TIME: Extract vocabulary + generate embeddings                  │
│                                                                         │
│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
│                          ↓                                              │
│  scripts/extract-types-vocab.ts                                         │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  types-vocab.json                                                 │  │
│  │  ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }   │  │
│  │  ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│  │
│  │  └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│  RUNTIME: Two-Tier Semantic Routing                                    │
│                                                                         │
│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 1: Types File Selection                                    │    │
│  │ Query embedding vs Tier1 embeddings (19 categories)             │    │
│  │ Result: ArchiveOrganizationType (similarity: 0.89)              │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 2: Specific Type Selection                                 │    │
│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)      │    │
│  │ Result: MunicipalArchive (similarity: 0.94)                     │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  Structured cache key: "count:a.municipal_archive:amsterdam"            │
└─────────────────────────────────────────────────────────────────────────┘
```

### Tier 1: Types File Embeddings

Each Types file (e.g., `MuseumType.yaml`, `ArchiveOrganizationType.yaml`) gets ONE embedding
representing the **accumulated vocabulary** of all types within that file.

**Embedding Text Construction**:
```
MuseumType: museum musea kunstmuseum art museum natural history museum
            science museum open-air museum ecomuseum virtual museum
            heritage farm national museum regional museum university museum
            [... all keywords from all 187 subtypes ...]
```

**Purpose**: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.

| Types File | Code | Accumulated Terms Count |
|------------|------|------------------------|
| MuseumType | M | ~500+ terms from 187 subtypes |
| ArchiveOrganizationType | A | ~400+ terms from 144 subtypes |
| LibraryType | L | ~200+ terms from subtypes |
| GalleryType | G | ~100+ terms from subtypes |
| ... | ... | ... |
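
A minimal sketch of how the accumulated text for one Types file might be assembled before it is embedded. The `SubtypeTerms` shape and the `buildTier1Text` name are hypothetical, and lower-casing plus de-duplication are assumptions rather than documented behavior of `extract-types-vocab.ts`.

```typescript
// Assumed shape of one subtype after schema parsing (field names are illustrative).
interface SubtypeTerms {
  className: string;
  terms: string[]; // keywords, aliases and labels in any language
}

// Build the single text that is embedded for a whole Types file:
// "<TypesFileName>: term term term ...", lower-cased and de-duplicated,
// keeping the order in which terms are first seen.
function buildTier1Text(typesFileName: string, subtypes: SubtypeTerms[]): string {
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const subtype of subtypes) {
    for (const raw of subtype.terms) {
      const term = raw.trim().toLowerCase();
      if (term.length > 0 && !seen.has(term)) {
        seen.add(term);
        ordered.push(term);
      }
    }
  }
  return `${typesFileName}: ${ordered.join(' ')}`;
}
```

The same function could serve Tier 2 by passing a single subtype and its class name as the prefix.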

### Tier 2: Individual Type Embeddings

Each **specific type** within a Types file gets its own embedding from its accumulated terms.

**Embedding Text Construction**:
```
MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
                  town archive local government records burgerlijke stand
                  bevolkingsregister council minutes building permits
                  [... all keywords + structured_aliases + labels ...]
```

**Purpose**: Precise subtype identification after Tier 1 narrows the category.

### Term Log Structure

A lookup table mapping every extracted term to its type/subtype:

```json
{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}
```

**Purpose**:
1. Fast keyword lookup for exact matches with no embedding call (O(1) per term when looked up by key)
2. Audit trail of which terms map to which types
3. Debugging which queries match which types
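
The sketch below shows how such a lookup table could be built from per-subtype keyword lists. The `SubtypeSource` input shape and the `buildTermLog` name are hypothetical; the "first writer wins" rule for colliding terms is an assumption, chosen so that a term never silently changes its mapping.

```typescript
interface TermMapping {
  typeCode: string;
  typeName: string;
  subtypeName: string;
  wikidata: string;
  language: string;
}

// Assumed input: one record per subtype with language-keyed keyword lists.
interface SubtypeSource {
  typeCode: string;
  typeName: string;
  subtypeName: string;
  wikidata: string;
  keywords: Record<string, string[]>;
}

function buildTermLog(sources: SubtypeSource[]): Record<string, TermMapping> {
  const termLog: Record<string, TermMapping> = {};
  for (const source of sources) {
    for (const [language, terms] of Object.entries(source.keywords)) {
      for (const term of terms) {
        const key = term.toLowerCase(); // queries are lower-cased before lookup
        if (!(key in termLog)) {
          termLog[key] = {
            typeCode: source.typeCode,
            typeName: source.typeName,
            subtypeName: source.subtypeName,
            wikidata: source.wikidata,
            language,
          };
        }
      }
    }
  }
  return termLog;
}
```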

### Runtime Lookup Strategy

```typescript
async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();

  // FAST PATH: Check termLog for exact keyword matches.
  // Try the longest terms first so "kunstmuseum" is not shadowed by "museum".
  // (The sorted list can be computed once when the vocabulary is loaded.)
  const terms = Object.keys(vocab.termLog).sort((a, b) => b.length - a.length);
  for (const term of terms) {
    if (normalized.includes(term)) {
      const mapping = vocab.termLog[term];
      return {
        institutionType: mapping.typeCode,
        institutionSubtype: mapping.subtypeName ?? null,
        recordSetType: mapping.recordSetType ?? null,
        subtypeWikidata: mapping.wikidata ?? null,
        // ... location and intent extraction
      };
    }
  }

  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);

  // Tier 1: Find best matching Types file (keys are class names, e.g. "MuseumType")
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }

  if (!bestType) return {}; // No type matched

  // tier1Embeddings is keyed by class name, while institutionTypes and
  // tier2Embeddings are keyed by type code, so resolve the code via className.
  const typeEntry = Object.values(vocab.institutionTypes).find(
    (entry) => entry.className === bestType,
  );
  if (!typeEntry) return {}; // Vocabulary is inconsistent; treat as no match
  const typeCode = typeEntry.code;

  // Tier 2: Find best matching subtype within the Types file
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;

  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }

  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype,
    // ... location and intent extraction
  };
}
```
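
The lookup above calls `cosineSimilarity` without defining it. A minimal sketch follows, assuming embeddings are plain `number[]` arrays of equal length; returning 0 for a zero vector is a choice made here to avoid dividing by zero, not something the rule specifies.

```typescript
// Cosine similarity of two equal-length vectors.
// Returns 0 when either vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Throwing on a dimension mismatch surfaces the case where the query was embedded with a different model than the vocabulary.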

### Embedding Model Choice

For build-time embedding generation, use the same model as the semantic cache:

| Option | Model | Dimensions | Quality |
|--------|-------|------------|---------|
| **Primary** | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Good multilingual |
| Fallback | `all-MiniLM-L6-v2` | 384 | English-focused |
| High Quality | `multilingual-e5-large` | 1024 | Best multilingual |

**Build-time generation**: Type and subtype embeddings are generated ONCE at build time and stored in JSON.
This avoids re-embedding the vocabulary at runtime; only the query itself is embedded, and only when the fast path finds no keyword match.

## TypesVocabulary JSON Structure

Generated at build time with **pre-computed embeddings**:

```json
{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,

  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },

  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },

  "termLog": {
    "kunstmuseum": {"typeCode": "M", "typeName": "MuseumType", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "language": "nl"},
    "art museum": {"typeCode": "M", "typeName": "MuseumType", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "language": "en"},
    "gemeentearchief": {"typeCode": "A", "typeName": "ArchiveOrganizationType", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "nl"},
    "stadsarchief": {"typeCode": "A", "typeName": "ArchiveOrganizationType", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "nl"},
    "city archive": {"typeCode": "A", "typeName": "ArchiveOrganizationType", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "en"},
    "burgerlijke stand": {"typeCode": "A", "typeName": "ArchiveOrganizationType", "recordSetType": "CIVIL_REGISTRY", "language": "nl"},
    "geboorteakte": {"typeCode": "A", "typeName": "ArchiveOrganizationType", "recordSetType": "CIVIL_REGISTRY", "language": "nl"}
  },

  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },

  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}
```

### Key Additions for Embedding Support

| Field | Purpose |
|-------|---------|
| `tier1Embeddings` | Pre-computed embeddings for each Types file (19 categories) |
| `tier2Embeddings` | Pre-computed embeddings for each subtype (500+ types) |
| `termLog` | Fast O(1) lookup table for exact keyword matches |
| `accumulatedTerms` | Raw text used to generate embeddings (for debugging/regeneration) |
| `embeddingModel` | Model used to generate embeddings (for reproducibility) |
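
Cosine similarity is only meaningful between vectors produced by the same model, so the `embeddingModel` and `embeddingDimensions` fields can double as a runtime guard. The sketch below is a hypothetical check; `VocabHeader` and `isEmbeddingCompatible` are illustrative names, not existing code.

```typescript
// The two header fields of types-vocab.json that describe the embeddings.
interface VocabHeader {
  embeddingModel: string;
  embeddingDimensions: number;
}

// True when a query embedding can be compared against the stored embeddings:
// same model name and same vector length.
function isEmbeddingCompatible(
  vocab: VocabHeader,
  runtimeModel: string,
  queryEmbedding: number[],
): boolean {
  return (
    vocab.embeddingModel === runtimeModel &&
    vocab.embeddingDimensions === queryEmbedding.length
  );
}
```

When the check fails, skipping the slow path and relying on `termLog` alone is one conservative fallback.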

## Enhanced ExtractedEntities Interface

```typescript
export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;

  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;        // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;      // e.g., 'Q8362876' for LOD integration
}
```

## Enhanced Cache Key Format

```
{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}

Examples:
- "count:m:amsterdam"                        # Basic museum count
- "count:m.art_museum:amsterdam"             # Art museum count (subtype)
- "list:a.municipal_archive:nh"              # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht"           # Civil registry in Utrecht
- "info:a.national_archive:nl"               # National archive info (country-level location)
```
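
A minimal sketch of a key builder for this format. `KeyParts` and `buildCacheKey` are hypothetical names. Lower-casing the whole key is inferred from the examples, and absent optional segments are omitted entirely, following the bracketed template rather than emitting empty segments.

```typescript
interface KeyParts {
  intent: string;
  institutionType: string;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

// Assemble "{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}".
function buildCacheKey(parts: KeyParts): string {
  // The subtype, when present, is attached to the type with a dot.
  const type = parts.institutionSubtype
    ? `${parts.institutionType}.${parts.institutionSubtype}`
    : parts.institutionType;

  const segments: string[] = [parts.intent, type];
  // The record set type, when present, becomes its own segment.
  if (parts.recordSetType) segments.push(parts.recordSetType);
  segments.push(parts.location);

  return segments.join(':').toLowerCase();
}
```

For example, `buildCacheKey({ intent: 'count', institutionType: 'M', institutionSubtype: 'ART_MUSEUM', location: 'amsterdam' })` produces the second key in the list above.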

## Implementation Files

| File | Purpose |
|------|---------|
| `scripts/extract-types-vocab.ts` | Build-time vocabulary extraction from LinkML |
| `apps/archief-assistent/public/types-vocab.json` | Generated vocabulary file |
| `apps/archief-assistent/src/lib/types-vocabulary.ts` | Runtime vocabulary loader |
| `apps/archief-assistent/src/lib/semantic-cache.ts` | Updated entity extraction |

## Build Integration

Add to `apps/archief-assistent/package.json`:

```json
{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}
```

## Keyword Extraction Priority

When extracting keywords from schema files:

1. **`keywords`** array (highest priority) - Explicit search terms
2. **`structured_aliases.literal_form`** - Multilingual alternative names
3. **`type_label`** - Preferred labels per language
4. **Class name conversion** - `MunicipalArchive` → "municipal archive"
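
The priority order above can be sketched as a single pass that keeps the first occurrence of each term. The `SchemaClass` shape is an assumption about what a parsed LinkML class looks like (in particular, `type_label` is modelled here as a flat list of strings), and `classNameToTerm` and `extractTerms` are hypothetical names.

```typescript
// Assumed shape of a parsed LinkML class; only the four vocabulary sources are modelled.
interface SchemaClass {
  name: string; // e.g. "MunicipalArchive"
  keywords?: string[];
  structured_aliases?: { literal_form: string }[];
  type_label?: string[];
}

// "MunicipalArchive" -> "municipal archive"
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1 $2').toLowerCase();
}

// Collect terms in priority order: keywords, aliases, labels, then the
// converted class name. A term seen earlier is not repeated.
function extractTerms(cls: SchemaClass): string[] {
  const candidates: string[] = [
    ...(cls.keywords ?? []),
    ...(cls.structured_aliases ?? []).map((alias) => alias.literal_form),
    ...(cls.type_label ?? []),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const terms: string[] = [];
  for (const candidate of candidates) {
    const term = candidate.trim().toLowerCase();
    if (term.length > 0 && !seen.has(term)) {
      seen.add(term);
      terms.push(term);
    }
  }
  return terms;
}
```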

## Cache Segmentation Rules

### Rule 1: Subtype Specificity

Queries with **specific subtypes** should NOT match **generic type** cache entries:

```
Query: "kunstmusea in Amsterdam"     → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam"         → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅
```

### Rule 2: Record Set Type Isolation

Queries about **specific record types** should cache separately:

```
Query: "burgerlijke stand Utrecht"   → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅
```

### Rule 3: No Subtype-to-Type Fallback

Generic queries must NOT be served from subtype cache entries, because a subtype result covers only a subset of what the generic query asks for:

```
Query: "musea in Amsterdam"          → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam"    → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query) ✅
```
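
Following the Result lines of the three examples, a cached entry is served only when every segment is identical. The sketch below expresses that as a single comparison; `SegmentEntities` and `isCacheCompatible` are hypothetical names, and treating a missing field as `null` is an assumption that also lets legacy entries without `institutionSubtype` match generic queries.

```typescript
interface SegmentEntities {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

const SEGMENT_FIELDS = [
  'intent',
  'institutionType',
  'institutionSubtype',
  'recordSetType',
  'location',
] as const;

// A cached entry may be served only when every segment matches exactly.
// Missing values are normalized to null, so "no subtype" never equals "ART_MUSEUM".
function isCacheCompatible(query: SegmentEntities, cached: SegmentEntities): boolean {
  return SEGMENT_FIELDS.every(
    (field) => (query[field] ?? null) === (cached[field] ?? null),
  );
}
```

Values are assumed to be normalized (same casing and spelling) before comparison.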

## Migration Notes

1. **Backwards Compatible**: Existing cache entries without `institutionSubtype` continue to work
2. **Gradual Rollout**: New cache entries get subtype, old entries remain valid
3. **Cache Clear**: Consider clearing cache after deployment to ensure consistency

## Validation

Run E2E tests to verify:

```bash
cd apps/archief-assistent
npm run test:e2e
```

Key test cases:
- Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
- Subtype isolation (kunstmuseum ≠ museum)
- Record set isolation (burgerlijke stand ≠ archive)
- Intent isolation (count ≠ list ≠ info)

## References

- **Rule 41**: Types classes define SPARQL template variables
- **Rule 0b**: Type/Types file naming convention
- **CustodianType.yaml**: Base taxonomy definition
- **AGENTS.md**: GLAMORCUBESFIXPHDNT taxonomy documentation

---

**Created**: 2026-01-10
**Author**: OpenCode Agent
**Status**: Implemented (v2.0)

## External References

- Pavlyshyn, V. "Context Graphs and Data Traces: Building Epistemology Layers for Agentic Memory"
- Pavlyshyn, V. "The Shape of Knowledge: Topology Theory for Knowledge Graphs"
- Pavlyshyn, V. "Beyond Hierarchy: Why Agentic AI Systems Need Holarchies"
- Pavlyshyn, V. "Smalltalk: The Language That Changed Everything"
- Pavlyshyn, V. "Clarity Traders: Beyond Vibe Coding"