kempersc 01b9d77566 feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation

Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46):

- types-vocab.json: 10,142 lines of institution type vocabulary from LinkML
  - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels
  - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.)
  - Extracted from CustodianType.yaml and CustodianTypes.yaml

- types-vocabulary.ts: TypeScript module for entity extraction
  - Exports INSTITUTION_TYPES with regex patterns per type code
  - Replaces hardcoded patterns with schema-derived vocabulary
  - Supports multilingual matching

- Rule 46 documentation (.opencode/rules/)
  - Specifies vocabulary extraction workflow
  - Defines cache key generation algorithm
  - Migration path from hardcoded patterns

2026-01-10 12:57:03 +01:00

Rule 46: Ontology-Driven Cache Segmentation

🚨 CRITICAL: The semantic cache MUST use vocabulary derived from LinkML *Type.yaml and *Types.yaml schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.

Problem Statement

The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:

Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅

However, the current implementation uses hardcoded regex patterns:

// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};

Problems with hardcoded patterns:

Maintenance burden - Every new institution type requires code changes
Missing subtypes - "kunstmuseum" vs "museum" should cache separately
No multilingual support - Only Dutch/English, misses German/French labels
Duplication - Same vocabulary exists in LinkML schemas
No record type awareness - "burgerlijke stand" queries mixed with general archive queries

Solution: Schema-Derived Vocabulary

The LinkML schema already contains rich vocabulary:

Schema File	Content	Cache Utility
`CustodianType.yaml`	19 top-level types	Primary segmentation (M/A/L/G...)
`MuseumType.yaml`	187+ museum subtypes	Subtype segmentation
`ArchiveOrganizationType.yaml`	144+ archive subtypes	Subtype segmentation
`*RecordSetTypes.yaml`	Record type taxonomies	Finding aids specificity

Vocabulary Sources in Schema

type_label - Multilingual labels via skos:prefLabel
structured_aliases - Language-tagged alternative names
keywords - Search terms for entity recognition
wikidata_entity - Linked Data identifiers

Architecture

Overview: Two-Tier Embedding Hierarchy

The system uses a hierarchical embedding approach for fast semantic routing:

Tier 1: Types File Embeddings - Which category? (Museum vs Archive vs Library)
Tier 2: Individual Type Embeddings - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)

┌─────────────────────────────────────────────────────────────────────────┐
│  BUILD TIME: Extract vocabulary + generate embeddings                  │
│                                                                         │
│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
│                          ↓                                              │
│  scripts/extract-types-vocab.ts                                         │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  types-vocab.json                                                 │  │
│  │  ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }   │  │
│  │  ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│  │
│  │  └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│  RUNTIME: Two-Tier Semantic Routing                                    │
│                                                                         │
│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 1: Types File Selection                                    │    │
│  │ Query embedding vs Tier1 embeddings (19 categories)             │    │
│  │ Result: ArchiveOrganizationType (similarity: 0.89)              │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 2: Specific Type Selection                                 │    │
│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)      │    │
│  │ Result: MunicipalArchive (similarity: 0.94)                     │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  Structured cache key: "count:a.municipal_archive:amsterdam"            │
└─────────────────────────────────────────────────────────────────────────┘

Tier 1: Types File Embeddings

Each Types file (e.g., MuseumType.yaml, ArchiveOrganizationType.yaml) gets ONE embedding representing the accumulated vocabulary of all types within that file.

Embedding Text Construction:

MuseumType: museum musea kunstmuseum art museum natural history museum 
            science museum open-air museum ecomuseum virtual museum 
            heritage farm national museum regional museum university museum
            [... all keywords from all 187 subtypes ...]

Purpose: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.

Types File	Code	Accumulated Terms Count
MuseumType	M	~500+ terms from 187 subtypes
ArchiveOrganizationType	A	~400+ terms from 144 subtypes
LibraryType	L	~200+ terms from subtypes
GalleryType	G	~100+ terms from subtypes
...	...	...
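The accumulation step described above could be sketched roughly as follows. This is only an illustration: `buildTier1Text` and its parameter shapes are hypothetical and not taken from `scripts/extract-types-vocab.ts`. It assumes terms are lowercased and deduplicated before being joined into the text that gets embedded.

```typescript
// Hypothetical helper (not part of the documented script): joins the base
// terms of a Types file with the terms of all its subtypes into one string.
function buildTier1Text(
  baseTerms: string[],
  subtypeTerms: Record<string, string[]>,
): string {
  // A Set keeps first-seen order and drops duplicates across subtypes.
  const seen = new Set<string>();
  const add = (term: string): void => {
    const normalized = term.trim().toLowerCase();
    if (normalized) seen.add(normalized);
  };
  baseTerms.forEach(add);
  for (const terms of Object.values(subtypeTerms)) terms.forEach(add);
  return Array.from(seen).join(" ");
}

const museumText = buildTier1Text(["museum", "musea"], {
  ART_MUSEUM: ["kunstmuseum", "art museum"],
  SCIENCE_MUSEUM: ["science museum", "Museum"],
});
console.log(museumText);
// prints "museum musea kunstmuseum art museum science museum"
```

Deduplicating keeps very common words such as "museum" from dominating the embedding text merely because every subtype repeats them; whether that is desirable for the chosen embedding model is an open design choice.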

Tier 2: Individual Type Embeddings

Each specific type within a Types file gets its own embedding from its accumulated terms.

Embedding Text Construction:

MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
                  town archive local government records burgerlijke stand
                  bevolkingsregister council minutes building permits
                  [... all keywords + structured_aliases + labels ...]

Purpose: Precise subtype identification after Tier 1 narrows the category.

Term Log Structure

A lookup table mapping every extracted term to its type/subtype:

{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType", 
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM", 
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}

Purpose:

Fast O(1) keyword lookup (no embedding needed for exact matches)
Audit trail of which terms map to which types
Debugging which queries match which types
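One way to get a constant-time lookup per candidate is to split the query into word n-grams and probe the term log directly, longest phrase first. The sketch below assumes that approach; `lookupTerm`, the `Map` representation, and the three-word limit are illustrative choices, not the documented implementation. It only finds terms that appear as whole words, so inflected forms must be present in the term log themselves.

```typescript
// Shape of one termLog entry, following the example above.
interface TermMapping {
  typeCode: string;
  typeName: string;
  subtypeName?: string;
  wikidata?: string;
  language: string;
}

// Hypothetical helper: probes the term log with word n-grams from the query,
// longest first, so "art museum" is preferred over the shorter "museum".
function lookupTerm(
  query: string,
  termLog: Map<string, TermMapping>,
  maxWords = 3,
): TermMapping | null {
  const words = query
    .toLowerCase()
    .split(/[\s?!.,;:()"']+/)
    .filter((w) => w.length > 0);
  for (let n = Math.min(maxWords, words.length); n >= 1; n--) {
    for (let start = 0; start + n <= words.length; start++) {
      const hit = termLog.get(words.slice(start, start + n).join(" "));
      if (hit) return hit; // each probe is a single hash lookup
    }
  }
  return null;
}
```

The number of probes grows with the query length but not with the vocabulary size, which is the property the fast path relies on.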

Runtime Lookup Strategy

async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();
  
  // FAST PATH: Check termLog for keyword matches.
  // Keep the LONGEST matching term so that "kunstmuseum" wins over its
  // substring "museum" regardless of iteration order.
  let bestTerm: string | null = null;
  for (const term of Object.keys(vocab.termLog)) {
    if (normalized.includes(term) && (bestTerm === null || term.length > bestTerm.length)) {
      bestTerm = term;
    }
  }
  if (bestTerm !== null) {
    const mapping = vocab.termLog[bestTerm];
    return {
      institutionType: mapping.typeCode,
      institutionSubtype: mapping.subtypeName ?? null,
      recordSetType: mapping.recordSetType ?? null,
      subtypeWikidata: mapping.wikidata ?? null,
      // ... location and intent extraction
    };
  }
  
  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);
  
  // Tier 1: Find best matching Types file
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }
  
  if (!bestType) return {}; // No type matched
  
  // institutionTypes is keyed by type code ("M", "A"), while tier1Embeddings is
  // keyed by class name ("MuseumType"), so resolve the code via className.
  const typeEntry = Object.values(vocab.institutionTypes).find((t) => t.className === bestType);
  if (!typeEntry) return {}; // Tier 1 name has no matching type definition
  const typeCode = typeEntry.code;
  
  // Tier 2: Find best matching subtype within the Types file
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;
  
  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }
  
  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype, // null when no subtype clears the 0.75 threshold
    // ... location and intent extraction
  };
}
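The lookup above calls `cosineSimilarity` without showing it. A minimal sketch of such a helper is given below, assuming embeddings are plain number arrays of equal length; the zero-vector handling is an assumption made here, not something the rule specifies.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 for a zero-magnitude vector instead of NaN.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

Throwing on a dimension mismatch surfaces the case where the vocabulary was built with a different model than the one used for the query, rather than silently producing a meaningless score.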

Embedding Model Choice

For build-time embedding generation, use the same model as the semantic cache:

Option	Model	Dimensions	Quality
Primary	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384	Good multilingual
Fallback	`all-MiniLM-L6-v2`	384	English-focused
High Quality	`multilingual-e5-large`	1024	Best multilingual

Build-time generation: Type and subtype embeddings are generated ONCE at build time and stored in JSON, so the vocabulary is never re-embedded at runtime. Only the incoming query needs an embedding, and only when the termLog fast path finds no match.
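Because stored vectors are only comparable with query vectors from the same model, a guard along these lines may be worth running before any similarity comparison. This is a sketch: `assertEmbeddingCompatible` and the `runtimeModel` argument are hypothetical, while `embeddingModel` and `embeddingDimensions` are the fields recorded in types-vocab.json.

```typescript
// Subset of the types-vocab.json header that describes the stored embeddings.
interface VocabEmbeddingInfo {
  embeddingModel: string;
  embeddingDimensions: number;
}

// Hypothetical guard: throws when the query embedding cannot be meaningfully
// compared with the pre-computed vocabulary embeddings.
function assertEmbeddingCompatible(
  vocab: VocabEmbeddingInfo,
  runtimeModel: string,
  queryEmbedding: readonly number[],
): void {
  if (vocab.embeddingModel !== runtimeModel) {
    throw new Error(
      `Vocabulary built with "${vocab.embeddingModel}" but runtime uses "${runtimeModel}"`,
    );
  }
  if (queryEmbedding.length !== vocab.embeddingDimensions) {
    throw new Error(
      `Expected ${vocab.embeddingDimensions} dimensions, got ${queryEmbedding.length}`,
    );
  }
}
```

Two models with the same dimension count (for example the two 384-dimensional options in the table) still produce incompatible vector spaces, which is why the model name is checked in addition to the length.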

TypesVocabulary JSON Structure

Generated at build time with pre-computed embeddings:

{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,
  
  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },
  
  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },
  
  "termLog": {
    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "nl"},
    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "lang": "en"},
    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "nl"},
    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "lang": "en"},
    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"},
    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "lang": "nl"}
  },
  
  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },
  
  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}

Key Additions for Embedding Support

Field	Purpose
`tier1Embeddings`	Pre-computed embeddings for each Types file (19 categories)
`tier2Embeddings`	Pre-computed embeddings for each subtype (500+ types)
`termLog`	Fast O(1) lookup table for exact keyword matches
`accumulatedTerms`	Raw text used to generate embeddings (for debugging/regeneration)
`embeddingModel`	Model used to generate embeddings (for reproducibility)

Enhanced ExtractedEntities Interface

export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | null;
  
  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;        // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;      // e.g., 'Q8362876' for LOD integration
}

Enhanced Cache Key Format

{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}

Examples:
- "count:m:amsterdam"                        # Basic museum count
- "count:m.art_museum:amsterdam"             # Art museum count (subtype)
- "list:a.municipal_archive:nh"              # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht"           # Civil registry in Utrecht
- "info:a.national_archive:nl"               # National archive info (country-level location)
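A key in this format could be assembled as sketched below. `buildCacheKey` and `CacheKeyParts` are hypothetical names; the sketch assumes that optional segments are omitted together with their separator and that the whole key is lowercased, as in the examples.

```typescript
interface CacheKeyParts {
  intent: string;
  institutionType: string;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

// Hypothetical helper: builds
// {intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}
function buildCacheKey(parts: CacheKeyParts): string {
  // The subtype is attached to the type code with a dot.
  const type = parts.institutionSubtype
    ? `${parts.institutionType}.${parts.institutionSubtype}`
    : parts.institutionType;
  const segments = [parts.intent, type];
  // The record set type gets its own segment only when present.
  if (parts.recordSetType) segments.push(parts.recordSetType);
  segments.push(parts.location);
  return segments.join(":").toLowerCase();
}

console.log(buildCacheKey({ intent: "count", institutionType: "M", location: "Amsterdam" }));
// prints "count:m:amsterdam"
console.log(
  buildCacheKey({
    intent: "count",
    institutionType: "M",
    institutionSubtype: "ART_MUSEUM",
    location: "amsterdam",
  }),
);
// prints "count:m.art_museum:amsterdam"
```

Since the record set segment is optional, a consumer that parses keys has to distinguish three-segment from four-segment keys by counting separators; comparing whole key strings avoids that ambiguity.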

Implementation Files

File	Purpose
`scripts/extract-types-vocab.ts`	Build-time vocabulary extraction from LinkML
`apps/archief-assistent/public/types-vocab.json`	Generated vocabulary file
`apps/archief-assistent/src/lib/types-vocabulary.ts`	Runtime vocabulary loader
`apps/archief-assistent/src/lib/semantic-cache.ts`	Updated entity extraction

Build Integration

Add to apps/archief-assistent/package.json:

{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}

Keyword Extraction Priority

When extracting keywords from schema files:

1. keywords array (highest priority) - Explicit search terms
2. structured_aliases.literal_form - Multilingual alternative names
3. type_label - Preferred labels per language
4. Class name conversion (lowest priority) - MunicipalArchive → "municipal archive"
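The priority order above might translate into code roughly like this. The `SchemaClass` shape is a simplified, hypothetical view of a parsed schema class (real LinkML files carry more structure, including language tags), and `collectTerms` and `classNameToTerm` are illustrative names.

```typescript
// Simplified, hypothetical view of one parsed schema class.
interface SchemaClass {
  name: string; // e.g. "MunicipalArchive"
  keywords?: string[];
  structured_aliases?: { literal_form: string }[];
  type_label?: string[];
}

// Converts a PascalCase class name to a lowercase phrase.
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1 $2").toLowerCase();
}

// Collects terms in priority order; the first occurrence of a term wins.
function collectTerms(cls: SchemaClass): string[] {
  const ordered = [
    ...(cls.keywords ?? []),
    ...(cls.structured_aliases ?? []).map((alias) => alias.literal_form),
    ...(cls.type_label ?? []),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of ordered) {
    const term = raw.trim().toLowerCase();
    if (term && !seen.has(term)) {
      seen.add(term);
      result.push(term);
    }
  }
  return result;
}

console.log(classNameToTerm("MunicipalArchive"));
// prints "municipal archive"
```

Keeping the first occurrence means a term that appears both as an explicit keyword and as a derived class name is attributed to the higher-priority source.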

Cache Segmentation Rules

Rule 1: Subtype Specificity

Queries with specific subtypes should NOT match generic type cache entries:

Query: "kunstmusea in Amsterdam"     → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam"         → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅

Rule 2: Record Set Type Isolation

Queries about specific record types should cache separately:

Query: "burgerlijke stand Utrecht"   → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅

Rule 3: No Subtype-to-Type Fallback

Generic queries must NOT be served from subtype cache entries, because the cached answer covers only a subset of what the generic query asks for:

Query: "musea in Amsterdam"          → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam"    → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query) ✅
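Taken together, the example results above amount to strict equality on every key segment. A sketch of such a check follows; `segmentMismatches` is a hypothetical helper, and it assumes that a missing field and an explicit null are treated as the same value.

```typescript
interface SegmentEntities {
  intent?: string | null;
  institutionType?: string | null;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location?: string | null;
}

const SEGMENT_FIELDS = [
  "intent",
  "institutionType",
  "institutionSubtype",
  "recordSetType",
  "location",
] as const;

// Hypothetical helper: lists the segments on which a query and a cached entry
// differ. An empty result means the cached entry may be reused.
function segmentMismatches(query: SegmentEntities, cached: SegmentEntities): string[] {
  return SEGMENT_FIELDS.filter(
    (field) => (query[field] ?? null) !== (cached[field] ?? null),
  );
}

const generic = { intent: "count", institutionType: "M", location: "amsterdam" };
const artMuseum = { ...generic, institutionSubtype: "ART_MUSEUM" };
console.log(segmentMismatches(artMuseum, generic)); // prints [ 'institutionSubtype' ]
console.log(segmentMismatches(generic, artMuseum)); // prints [ 'institutionSubtype' ]
```

Returning the list of differing segments, rather than a boolean, makes it straightforward to log why a cache entry was rejected.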

Migration Notes

Backwards Compatible: Existing cache entries without institutionSubtype continue to work
Gradual Rollout: New cache entries get subtype, old entries remain valid
Cache Clear: Consider clearing cache after deployment to ensure consistency

Validation

Run E2E tests to verify:

cd apps/archief-assistent
npm run test:e2e

Key test cases:

Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
Subtype isolation (kunstmuseum ≠ museum)
Record set isolation (burgerlijke stand ≠ archive)
Intent isolation (count ≠ list ≠ info)

References

Rule 41: Types classes define SPARQL template variables
Rule 0b: Type/Types file naming convention
CustodianType.yaml: Base taxonomy definition
AGENTS.md: GLAMORCUBESFIXPHDNT taxonomy documentation

Created: 2026-01-10
Author: OpenCode Agent
Status: Implementing
