kempersc 554fe520ea Add comprehensive rules for LinkML schema management and ontology mapping

- Introduced Rule 42: No Ontology Prefixes in Slot Names to enforce clean naming conventions.
- Established Rule: No Rough Edits in Schema Files to ensure structural integrity during modifications.
- Implemented Rule: No Version Indicators in Names to maintain stable semantic naming.
- Created Rule: Ontology Detection vs Heuristics to emphasize the importance of verifying ontology definitions.
- Defined Rule 50: Ontology-to-LinkML Mapping Convention to standardize mapping practices.
- Added Rule: Polished Slot Storage Location to specify directory structure for polished slot files.
- Enforced Rule: Preserve Bespoke Slots Until Refactoring to prevent unintended migrations during slot updates.
- Instituted Rule 56: Semantic Consistency Over Simplicity to mandate execution of revisions in slot_fixes.yaml.
- Added new Genealogy Archives Registry Enrichment class with multilingual support and structured aliases.

2026-02-15 19:20:09 +01:00


Rule 46: Ontology-Driven Cache Segmentation

🚨 CRITICAL: The semantic cache MUST use vocabulary derived from LinkML *Type.yaml and *Types.yaml schema files to extract entities for cache key generation. Hardcoded regex patterns are deprecated.

Status: Implemented (Evolved v2.0)
Version: 2.0 (Epistemological Evolution)
Updated: 2026-01-10

Evolution Overview

Rule 46 v2.0 incorporates insights from Volodymyr Pavlyshyn's work on agentic memory systems:

Epistemic Provenance (Phase 1) - Track WHERE, WHEN, HOW data originated
Topological Distance (Phase 2) - Use ontology structure, not just embeddings
Holarchic Cache (Phase 3) - Entries as holons with up/down links
Message Passing (Phase 4, planned) - Smalltalk-style introspectable cache
Clarity Trading (Phase 5, planned) - Block ambiguous queries from cache

Epistemic Provenance

Every cached response carries epistemological metadata:

interface EpistemicProvenance {
  dataSource: 'ISIL_REGISTRY' | 'WIKIDATA' | 'CUSTODIAN_YAML' | 'LLM_INFERENCE';  // ... further sources elided
  dataTier: 1 | 2 | 3 | 4;  // TIER_1_AUTHORITATIVE → TIER_4_INFERRED
  sourceTimestamp: string;
  derivationChain: string[];  // ["SPARQL:Qdrant", "RAG:retrieve", "LLM:generate"]
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}

Benefit: Users see "This answer is from TIER_1 ISIL registry data, captured 2025-01-08".
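As a minimal sketch of how such metadata could drive both the user-facing notice and a staleness check: the interface below is a hypothetical subset of `EpistemicProvenance`, and `describeProvenance`, `needsRevalidation`, and the maximum ages per policy are assumptions for illustration, not values defined by this rule.

```typescript
// Hypothetical subset of the EpistemicProvenance fields shown above.
interface ProvenanceSketch {
  dataSource: string;
  dataTier: 1 | 2 | 3 | 4;
  sourceTimestamp: string; // assumed to be an ISO 8601 timestamp
  revalidationPolicy: 'static' | 'daily' | 'weekly' | 'on_access';
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed maximum age per policy; the rule itself does not define these values.
const MAX_AGE_MS: Record<ProvenanceSketch['revalidationPolicy'], number> = {
  static: Infinity,
  daily: DAY_MS,
  weekly: 7 * DAY_MS,
  on_access: 0,
};

// True when the entry is at least as old as its policy allows.
function needsRevalidation(p: ProvenanceSketch, now: Date): boolean {
  const ageMs = now.getTime() - new Date(p.sourceTimestamp).getTime();
  return ageMs >= MAX_AGE_MS[p.revalidationPolicy];
}

// Builds the user-facing provenance notice from tier, source, and capture date.
function describeProvenance(p: ProvenanceSketch): string {
  const capturedOn = p.sourceTimestamp.slice(0, 10); // YYYY-MM-DD
  return `This answer is from TIER_${p.dataTier} ${p.dataSource} data, captured ${capturedOn}`;
}
```

With these assumed ages, a `static` entry is never revalidated and an `on_access` entry is revalidated on every read.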

Topological Distance

Beyond embedding similarity, cache matching considers structural distance in the type hierarchy:

                    HeritageCustodian (*)
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
   MuseumType (M)    ArchiveType (A)    LibraryType (L)
        │                  │                  │
   ┌────┴────┐        ┌────┴────┐        ┌────┴────┐
   ▼         ▼        ▼         ▼        ▼         ▼
ArtMuseum  History  Municipal  State   Public  Academic

Combined Similarity Formula:

finalScore = 0.7 * embeddingSimilarity + 0.3 * (1 - topologicalDistance)

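Benefit: Structurally distant types are penalised, so "art museum" is far less likely to match "natural history museum" even at 95% embedding similarity; whether the match is rejected depends on the score threshold.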
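The formula can be made concrete with a small sketch. It assumes that topological distance is the number of edges between two types via their lowest common ancestor, normalised to [0, 1] by a maximum path length of 4 (two levels below the root). The `PARENT` map, `topologicalDistance`, and that normalisation are illustrative assumptions; only the 0.7/0.3 weighting comes from the rule.

```typescript
// Hypothetical child -> parent map for a fragment of the type hierarchy.
const PARENT: Record<string, string | null> = {
  HeritageCustodian: null,
  MuseumType: 'HeritageCustodian',
  ArchiveType: 'HeritageCustodian',
  ArtMuseum: 'MuseumType',
  HistoryMuseum: 'MuseumType',
  MunicipalArchive: 'ArchiveType',
};

// Returns the node followed by all of its ancestors up to the root.
function ancestors(node: string): string[] {
  const chain: string[] = [];
  let current: string | null = node;
  while (current !== null) {
    chain.push(current);
    current = PARENT[current] ?? null;
  }
  return chain;
}

// Edges between a and b via their lowest common ancestor, divided by maxPath
// (assumed normalisation) and clamped to 1.
function topologicalDistance(a: string, b: string, maxPath = 4): number {
  const other = ancestors(b);
  let stepsUp = 0;
  for (const node of ancestors(a)) {
    const stepsDown = other.indexOf(node);
    if (stepsDown !== -1) return Math.min(1, (stepsUp + stepsDown) / maxPath);
    stepsUp++;
  }
  return 1; // no common ancestor
}

// Weighted combination from the rule: 70% embedding, 30% structural closeness.
function finalScore(embeddingSimilarity: number, distance: number): number {
  return 0.7 * embeddingSimilarity + 0.3 * (1 - distance);
}
```

Under these assumptions, sibling types such as `ArtMuseum` and `HistoryMuseum` have distance 0.5, so an embedding similarity of 0.95 gives a final score of about 0.815, compared with about 0.965 for identical types and about 0.665 for types in different branches.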

Holarchic Cache Structure

Cache entries are holons - simultaneously complete AND parts of aggregates:

Level	Example	Aggregates
Micro	"Rijksmuseum details"	None
Meso	"Museums in Amsterdam"	List of micro holons
Macro	"Heritage in Noord-Holland"	All meso holons in region

interface CachedQuery {
  // ... existing fields ...
  holonLevel?: 'micro' | 'meso' | 'macro';
  participatesIn?: string[];  // Higher-level cache keys
  aggregates?: string[];       // Lower-level entries
}
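One way to use the up/down links is to find every aggregate that needs attention when a lower-level entry changes. The sketch below assumes a standalone `HolonEntry` shape and an in-memory `Map` as the store; `linkHolons` and `affectedAggregates` are hypothetical helper names, not part of the existing cache.

```typescript
// Hypothetical standalone shape; in the rule these fields live on CachedQuery.
interface HolonEntry {
  key: string;
  holonLevel: 'micro' | 'meso' | 'macro';
  participatesIn: string[]; // higher-level cache keys
  aggregates: string[];     // lower-level cache keys
}

// Record the part/whole relation on both sides, without duplicates.
function linkHolons(part: HolonEntry, whole: HolonEntry): void {
  if (part.participatesIn.indexOf(whole.key) === -1) part.participatesIn.push(whole.key);
  if (whole.aggregates.indexOf(part.key) === -1) whole.aggregates.push(part.key);
}

// Walk participatesIn links upward to find every aggregate that contains `key`.
function affectedAggregates(key: string, store: Map<string, HolonEntry>): string[] {
  const seen = new Set<string>();
  const pending: string[] = [key];
  while (pending.length > 0) {
    const entry = store.get(pending.pop() as string);
    if (!entry) continue;
    for (const parent of entry.participatesIn) {
      if (!seen.has(parent)) {
        seen.add(parent);
        pending.push(parent);
      }
    }
  }
  return Array.from(seen);
}
```

Linking "Rijksmuseum details" into "Museums in Amsterdam", and that entry into "Heritage in Noord-Holland", means a change to the micro entry surfaces both the meso and the macro entry as candidates for revalidation.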

Problem Statement

The ArchiefAssistent semantic cache prevents geographic false positives using entity extraction:

Query: "Hoeveel musea in Amsterdam?"
Cached: "Hoeveel musea in Noord-Holland?"
Result: BLOCKED (location mismatch) ✅

However, the current implementation uses hardcoded regex patterns:

// DEPRECATED: Hardcoded patterns in semantic-cache.ts
const INSTITUTION_PATTERNS: Record<InstitutionTypeCode, RegExp> = {
  M: /\b(muse(um|a|ums?)|musea)/i,
  A: /\b(archie[fv]en?|archives?|archief)/i,
  // ... 19 patterns to maintain manually
};

Problems with hardcoded patterns:

Maintenance burden - Every new institution type requires code changes
Missing subtypes - "kunstmuseum" vs "museum" should cache separately
No multilingual support - Only Dutch/English, misses German/French labels
Duplication - Same vocabulary exists in LinkML schemas
No record type awareness - "burgerlijke stand" queries mixed with general archive queries

Solution: Schema-Derived Vocabulary

The LinkML schema already contains rich vocabulary:

Schema File	Content	Cache Utility
`CustodianType.yaml`	19 top-level types	Primary segmentation (M/A/L/G...)
`MuseumType.yaml`	187+ museum subtypes	Subtype segmentation
`ArchiveOrganizationType.yaml`	144+ archive subtypes	Subtype segmentation
`*RecordSetTypes.yaml`	Record type taxonomies	Finding aids specificity

Vocabulary Sources in Schema

type_label - Multilingual labels via skos:prefLabel
structured_aliases - Language-tagged alternative names
keywords - Search terms for entity recognition
wikidata_entity - Linked Data identifiers

Architecture

Overview: Two-Tier Embedding Hierarchy

The system uses a hierarchical embedding approach for fast semantic routing:

Tier 1: Types File Embeddings - Which category? (Museum vs Archive vs Library)
Tier 2: Individual Type Embeddings - Which specific type? (ArtMuseum vs NaturalHistoryMuseum)

┌─────────────────────────────────────────────────────────────────────────┐
│  BUILD TIME: Extract vocabulary + generate embeddings                  │
│                                                                         │
│  schemas/20251121/linkml/modules/classes/*Type.yaml                     │
│  schemas/20251121/linkml/modules/classes/*Types.yaml                    │
│                          ↓                                              │
│  scripts/extract-types-vocab.ts                                         │
│                          ↓                                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  types-vocab.json                                                 │  │
│  │  ├── tier1Embeddings: { MuseumType: [...], ArchiveType: [...] }   │  │
│  │  ├── tier2Embeddings: { ArtMuseum: [...], MunicipalArchive: [...]}│  │
│  │  └── termLog: { "kunstmuseum": { type: "M", subtype: "ART_MUSEUM"}│  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ (loaded at runtime)
┌─────────────────────────────────────────────────────────────────────────┐
│  RUNTIME: Two-Tier Semantic Routing                                    │
│                                                                         │
│  Query: "Hoeveel gemeentearchieven in Amsterdam?"                       │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 1: Types File Selection                                    │    │
│  │ Query embedding vs Tier1 embeddings (19 categories)             │    │
│  │ Result: ArchiveOrganizationType (similarity: 0.89)              │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ TIER 2: Specific Type Selection                                 │    │
│  │ Query embedding vs Tier2 embeddings (144 archive subtypes)      │    │
│  │ Result: MunicipalArchive (similarity: 0.94)                     │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│         ↓                                                               │
│  Structured cache key: "count:a.municipal_archive:amsterdam"            │
└─────────────────────────────────────────────────────────────────────────┘

Tier 1: Types File Embeddings

Each Types file (e.g., MuseumType.yaml, ArchiveOrganizationType.yaml) gets ONE embedding representing the accumulated vocabulary of all types within that file.

Embedding Text Construction:

MuseumType: museum musea kunstmuseum art museum natural history museum 
            science museum open-air museum ecomuseum virtual museum 
            heritage farm national museum regional museum university museum
            [... all keywords from all 187 subtypes ...]

Purpose: Fast first-pass filter to identify which GLAMORCUBESFIXPHDNT category the query relates to.

Types File	Code	Accumulated Terms Count
MuseumType	M	~500+ terms from 187 subtypes
ArchiveOrganizationType	A	~400+ terms from 144 subtypes
LibraryType	L	~200+ terms from subtypes
GalleryType	G	~100+ terms from subtypes
...	...	...

Tier 2: Individual Type Embeddings

Each specific type within a Types file gets its own embedding from its accumulated terms.

Embedding Text Construction:

MunicipalArchive: gemeentearchief stadsarchief city archive municipal archive
                  town archive local government records burgerlijke stand
                  bevolkingsregister council minutes building permits
                  [... all keywords + structured_aliases + labels ...]

Purpose: Precise subtype identification after Tier 1 narrows the category.
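The two embedding texts can be derived from the same per-subtype term lists. The following is a rough sketch under the assumption that terms have already been extracted per subtype; `buildEmbeddingTexts` is a hypothetical helper, and de-duplicating the Tier 1 text is a choice made here, not something the rule prescribes.

```typescript
// Hypothetical input: extracted terms per subtype of one Types file.
type TermsBySubtype = Record<string, string[]>;

interface EmbeddingTexts {
  tier1Text: string;                  // one text for the whole Types file
  tier2Texts: Record<string, string>; // one text per subtype
}

function buildEmbeddingTexts(termsBySubtype: TermsBySubtype): EmbeddingTexts {
  const tier2Texts: Record<string, string> = {};
  const allTerms = new Set<string>(); // keeps first-seen order, drops repeats
  for (const subtype of Object.keys(termsBySubtype)) {
    const terms = termsBySubtype[subtype] ?? [];
    tier2Texts[subtype] = terms.join(' ');
    for (const term of terms) allTerms.add(term);
  }
  return { tier1Text: Array.from(allTerms).join(' '), tier2Texts };
}
```

Each resulting string would then be passed to the embedding model once at build time.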

Term Log Structure

A lookup table mapping every extracted term to its type/subtype:

{
  "termLog": {
    "kunstmuseum": {
      "typeCode": "M",
      "typeName": "MuseumType", 
      "subtypeName": "ART_MUSEUM",
      "wikidata": "Q207694",
      "language": "nl"
    },
    "art museum": {
      "typeCode": "M",
      "typeName": "MuseumType",
      "subtypeName": "ART_MUSEUM", 
      "wikidata": "Q207694",
      "language": "en"
    },
    "gemeentearchief": {
      "typeCode": "A",
      "typeName": "ArchiveOrganizationType",
      "subtypeName": "MUNICIPAL_ARCHIVE",
      "wikidata": "Q8362876",
      "language": "nl"
    }
  }
}

Purpose:

Fast O(1) keyword lookup (no embedding needed for exact matches)
Audit trail of which terms map to which types
Debugging which queries match which types
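To keep each lookup at constant time, the query can be split into word n-grams that are probed against the term log, longest first. This is a sketch of one possible approach, assuming the term log is loaded into a `Map`; `lookupTerm`, `TermMapping`, and the three-word limit are illustrative assumptions.

```typescript
// Hypothetical subset of a termLog entry.
interface TermMapping {
  typeCode: string;
  subtypeName?: string;
  recordSetType?: string;
}

// Try the longest word n-grams first so multi-word terms win over single words.
function lookupTerm(
  query: string,
  termLog: Map<string, TermMapping>,
  maxWords = 3, // assumed upper bound on term length in words
): TermMapping | null {
  const words = query.toLowerCase().split(/[\s.,;:!?()"']+/).filter((w) => w.length > 0);
  for (let n = Math.min(maxWords, words.length); n >= 1; n--) {
    for (let start = 0; start + n <= words.length; start++) {
      const hit = termLog.get(words.slice(start, start + n).join(' '));
      if (hit !== undefined) return hit;
    }
  }
  return null;
}
```

Because whole words are compared, "kunstmusea" resolves to its own entry rather than to the generic "musea" entry; the trade-off is that inflected or compound forms must be present in the term log to match.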

Runtime Lookup Strategy

async function extractEntitiesWithEmbeddings(query: string): Promise<ExtractedEntities> {
  const vocab = await loadTypesVocabulary();
  const normalized = query.toLowerCase();

  // FAST PATH: Check termLog for keyword matches.
  // Prefer the longest matching term so that "kunstmuseum" wins over its
  // substring "museum", regardless of the order of entries in termLog.
  let bestTerm: string | null = null;
  for (const term of Object.keys(vocab.termLog)) {
    if (normalized.includes(term) && (bestTerm === null || term.length > bestTerm.length)) {
      bestTerm = term;
    }
  }
  if (bestTerm !== null) {
    const mapping = vocab.termLog[bestTerm];
    return {
      institutionType: mapping.typeCode,
      institutionSubtype: mapping.subtypeName ?? null,
      recordSetType: mapping.recordSetType ?? null, // set for record-type terms such as "burgerlijke stand"
      subtypeWikidata: mapping.wikidata ?? null,
      // ... location and intent extraction
    };
  }

  // SLOW PATH: Embedding-based semantic matching
  const queryEmbedding = await generateEmbedding(query);

  // Tier 1: Find best matching Types file (minimum similarity 0.7)
  let bestType: string | null = null;
  let bestTypeSimilarity = 0;
  for (const [typeName, typeEmbedding] of Object.entries(vocab.tier1Embeddings)) {
    const similarity = cosineSimilarity(queryEmbedding, typeEmbedding);
    if (similarity > bestTypeSimilarity && similarity > 0.7) {
      bestTypeSimilarity = similarity;
      bestType = typeName;
    }
  }

  if (!bestType) return {}; // No type matched

  // tier1Embeddings is keyed by class name ("MuseumType"), whereas
  // institutionTypes is keyed by type code ("M"), so resolve via className.
  const typeEntry = Object.values(vocab.institutionTypes).find((t) => t.className === bestType);
  if (!typeEntry) return {}; // Vocabulary file is inconsistent

  // Tier 2: Find best matching subtype within the Types file (minimum similarity 0.75)
  const typeCode = typeEntry.code;
  let bestSubtype: string | null = null;
  let bestSubtypeSimilarity = 0;

  for (const [subtypeName, subtypeEmbedding] of Object.entries(vocab.tier2Embeddings[typeCode] || {})) {
    const similarity = cosineSimilarity(queryEmbedding, subtypeEmbedding);
    if (similarity > bestSubtypeSimilarity && similarity > 0.75) {
      bestSubtypeSimilarity = similarity;
      bestSubtype = subtypeName;
    }
  }

  return {
    institutionType: typeCode,
    institutionSubtype: bestSubtype, // null when no subtype clears the threshold
    // ... location and intent extraction
  };
}
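The lookup above relies on a `cosineSimilarity` helper that is not shown in this rule. A minimal sketch of such a helper follows; returning 0 for a zero vector and throwing on a dimension mismatch are conventions assumed here.

```typescript
// Cosine similarity of two equal-length vectors, in the range [-1, 1].
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // assumed convention for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Throwing on a length mismatch makes it obvious when the query was embedded with a different model than the one used at build time.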

Embedding Model Choice

For build-time embedding generation, use the same model as the semantic cache:

Option	Model	Dimensions	Quality
Primary	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384	Good multilingual
Fallback	`all-MiniLM-L6-v2`	384	English-focused
High Quality	`multilingual-e5-large`	1024	Best multilingual

Build-time generation: Type and subtype embeddings are generated ONCE at build time and stored in JSON, so the vocabulary is never re-embedded at runtime. Only the query itself needs an embedding, and only when the termLog fast path finds no match.

TypesVocabulary JSON Structure

Generated at build time with pre-computed embeddings:

{
  "version": "2026-01-10T12:00:00Z",
  "schemaVersion": "20251121",
  "embeddingModel": "paraphrase-multilingual-MiniLM-L12-v2",
  "embeddingDimensions": 384,
  
  "tier1Embeddings": {
    "MuseumType": [0.023, -0.045, 0.087, ...],
    "ArchiveOrganizationType": [0.012, 0.056, -0.034, ...],
    "LibraryType": [-0.034, 0.089, 0.012, ...],
    "GalleryType": [0.045, -0.023, 0.067, ...]
  },
  
  "tier2Embeddings": {
    "M": {
      "ART_MUSEUM": [0.034, -0.056, 0.078, ...],
      "NATURAL_HISTORY_MUSEUM": [0.045, 0.023, -0.089, ...],
      "SCIENCE_MUSEUM": [0.067, -0.012, 0.045, ...]
    },
    "A": {
      "MUNICIPAL_ARCHIVE": [0.089, 0.034, -0.056, ...],
      "NATIONAL_ARCHIVE": [0.012, -0.078, 0.045, ...],
      "CHURCH_ARCHIVE": [-0.023, 0.067, 0.034, ...]
    }
  },
  
  "termLog": {
    "kunstmuseum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "language": "nl"},
    "art museum": {"typeCode": "M", "subtypeName": "ART_MUSEUM", "wikidata": "Q207694", "language": "en"},
    "gemeentearchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "nl"},
    "stadsarchief": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "nl"},
    "city archive": {"typeCode": "A", "subtypeName": "MUNICIPAL_ARCHIVE", "wikidata": "Q8362876", "language": "en"},
    "burgerlijke stand": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "language": "nl"},
    "geboorteakte": {"typeCode": "A", "recordSetType": "CIVIL_REGISTRY", "language": "nl"}
  },
  
  "institutionTypes": {
    "M": {
      "code": "M",
      "className": "MuseumType",
      "baseWikidata": "Q33506",
      "accumulatedTerms": "museum musea kunstmuseum art museum natural history museum science museum open-air museum ecomuseum virtual museum heritage farm national museum regional museum university museum...",
      "keywords": {
        "nl": ["museum", "musea"],
        "en": ["museum", "museums"],
        "de": ["Museum", "Museen"]
      },
      "subtypes": {
        "ART_MUSEUM": {
          "className": "ArtMuseum",
          "wikidata": "Q207694",
          "accumulatedTerms": "kunstmuseum art museum kunstmusea art museums fine art museum visual arts museum painting gallery sculpture museum",
          "keywords": {
            "nl": ["kunstmuseum", "kunstmusea"],
            "en": ["art museum", "art museums"]
          }
        },
        "NATURAL_HISTORY_MUSEUM": {
          "className": "NaturalHistoryMuseum",
          "wikidata": "Q559049",
          "accumulatedTerms": "natuurhistorisch museum natuurmuseum natural history museum science museum fossils taxidermy specimens geology biology",
          "keywords": {
            "nl": ["natuurhistorisch museum", "natuurmuseum"],
            "en": ["natural history museum"]
          }
        }
      }
    },
    "A": {
      "code": "A",
      "className": "ArchiveOrganizationType",
      "baseWikidata": "Q166118",
      "accumulatedTerms": "archief archieven archive archives gemeentearchief stadsarchief nationaal archief rijksarchief church archive company archive film archive...",
      "keywords": {
        "nl": ["archief", "archieven"],
        "en": ["archive", "archives"]
      },
      "subtypes": {
        "MUNICIPAL_ARCHIVE": {
          "className": "MunicipalArchive",
          "wikidata": "Q8362876",
          "accumulatedTerms": "gemeentearchief stadsarchief municipal archive city archive town archive local government records civil registry population register building permits council minutes",
          "keywords": {
            "nl": ["gemeentearchief", "stadsarchief", "gemeentelijke archiefdienst"],
            "en": ["municipal archive", "city archive", "town archive"]
          }
        },
        "NATIONAL_ARCHIVE": {
          "className": "NationalArchive",
          "wikidata": "Q1188452",
          "accumulatedTerms": "nationaal archief rijksarchief national archive state archive government records national records federal archive",
          "keywords": {
            "nl": ["nationaal archief", "rijksarchief"],
            "en": ["national archive", "state archive"]
          }
        }
      }
    }
  },
  
  "recordSetTypes": {
    "CIVIL_REGISTRY": {
      "className": "CivilRegistrySeries",
      "accumulatedTerms": "burgerlijke stand geboorteakte huwelijksakte overlijdensakte bevolkingsregister civil registry birth records marriage records death records population register vital records genealogy",
      "keywords": {
        "nl": ["burgerlijke stand", "geboorteakte", "huwelijksakte", "overlijdensakte", "bevolkingsregister"],
        "en": ["civil registry", "birth records", "marriage records", "death records"]
      }
    },
    "COUNCIL_GOVERNANCE": {
      "className": "CouncilGovernanceFonds",
      "accumulatedTerms": "gemeenteraad raadsnotulen raadsbesluit verordening council minutes ordinances resolutions bylaws municipal council town council city council",
      "keywords": {
        "nl": ["gemeenteraad", "raadsnotulen", "raadsbesluit", "verordening"],
        "en": ["council minutes", "ordinances", "resolutions"]
      }
    }
  }
}

Key Additions for Embedding Support

Field	Purpose
`tier1Embeddings`	Pre-computed embeddings for each Types file (19 categories)
`tier2Embeddings`	Pre-computed embeddings for each subtype (500+ types)
`termLog`	Fast O(1) lookup table for exact keyword matches
`accumulatedTerms`	Raw text used to generate embeddings (for debugging/regeneration)
`embeddingModel`	Model used to generate embeddings (for reproducibility)
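Because stored vectors are only comparable with query vectors from the same model, a loader could check the metadata fields before use. The sketch below assumes a reduced `VocabSketch` shape and a hypothetical `validateVocabulary` function; it is not the existing loader in `types-vocabulary.ts`.

```typescript
// Hypothetical subset of the generated types-vocab.json structure.
interface VocabSketch {
  embeddingModel: string;
  embeddingDimensions: number;
  tier1Embeddings: Record<string, number[]>;
  tier2Embeddings: Record<string, Record<string, number[]>>;
}

// Returns a list of problems; an empty list means the file is usable with `runtimeModel`.
function validateVocabulary(vocab: VocabSketch, runtimeModel: string): string[] {
  const problems: string[] = [];
  if (vocab.embeddingModel !== runtimeModel) {
    problems.push(`model mismatch: ${vocab.embeddingModel} vs ${runtimeModel}`);
  }
  const check = (label: string, vector: number[]): void => {
    if (vector.length !== vocab.embeddingDimensions) {
      problems.push(`${label}: expected ${vocab.embeddingDimensions} dimensions, got ${vector.length}`);
    }
  };
  for (const [name, vector] of Object.entries(vocab.tier1Embeddings)) {
    check(`tier1.${name}`, vector);
  }
  for (const [code, subtypes] of Object.entries(vocab.tier2Embeddings)) {
    for (const [name, vector] of Object.entries(subtypes)) {
      check(`tier2.${code}.${name}`, vector);
    }
  }
  return problems;
}
```

Collecting all problems instead of failing on the first one gives a complete report when the vocabulary file is regenerated with a different model.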

Enhanced ExtractedEntities Interface

export interface ExtractedEntities {
  // Existing fields
  institutionType?: InstitutionTypeCode | null;
  location?: string | null;
  locationType?: 'city' | 'province' | null;
  intent?: 'count' | 'list' | 'info' | 'query' | null;  // 'query' is used for record set lookups in the cache key examples
  
  // NEW: Ontology-derived fields
  institutionSubtype?: string | null;  // e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM'
  recordSetType?: string | null;        // e.g., 'CIVIL_REGISTRY', 'COUNCIL_GOVERNANCE'
  subtypeWikidata?: string | null;      // e.g., 'Q8362876' for LOD integration
}

Enhanced Cache Key Format

{intent}:{institutionType}[.{subtype}][:{recordSetType}]:{location}

Examples:
- "count:m:amsterdam"                        # Basic museum count
- "count:m.art_museum:amsterdam"             # Art museum count (subtype)
- "list:a.municipal_archive:nh"              # Municipal archives in Noord-Holland
- "query:a:civil_registry:utrecht"           # Civil registry in Utrecht
- "info:a.national_archive:nl"               # National archive info (country-level, no record set type)
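A key in this format can be assembled from the extracted entities. The following sketch assumes that intent, institution type, and location are always present and that optional segments are simply omitted; `buildCacheKey` and `CacheKeyParts` are hypothetical names, and location normalisation (for example mapping "Noord-Holland" to "nh") is out of scope here.

```typescript
// Hypothetical input shape mirroring the segments of the key format.
interface CacheKeyParts {
  intent: string;
  institutionType: string;
  institutionSubtype?: string | null;
  recordSetType?: string | null;
  location: string;
}

// Joins the segments as {intent}:{type}[.{subtype}][:{recordSetType}]:{location}, lowercased.
function buildCacheKey(parts: CacheKeyParts): string {
  const type = parts.institutionSubtype
    ? `${parts.institutionType}.${parts.institutionSubtype}`
    : parts.institutionType;
  const segments = [parts.intent, type];
  if (parts.recordSetType) segments.push(parts.recordSetType);
  segments.push(parts.location);
  return segments.join(':').toLowerCase();
}
```

For example, a count intent for type `M` with subtype `ART_MUSEUM` in Amsterdam produces `count:m.art_museum:amsterdam`.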

Implementation Files

File	Purpose
`scripts/extract-types-vocab.ts`	Build-time vocabulary extraction from LinkML
`apps/archief-assistent/public/types-vocab.json`	Generated vocabulary file
`apps/archief-assistent/src/lib/types-vocabulary.ts`	Runtime vocabulary loader
`apps/archief-assistent/src/lib/semantic-cache.ts`	Updated entity extraction

Build Integration

Add to apps/archief-assistent/package.json:

{
  "scripts": {
    "prebuild": "tsx ../../scripts/extract-types-vocab.ts",
    "build": "vite build"
  }
}

Keyword Extraction Priority

When extracting keywords from schema files:

keywords array (highest priority) - Explicit search terms
structured_aliases.literal_form - Multilingual alternative names
type_label - Preferred labels per language
Class name conversion - MunicipalArchive → "municipal archive"
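The priority order can be sketched as a single function that collects terms from one parsed class definition. The `SchemaClassSketch` shape, in particular `type_label` as a language-to-label map, is an assumption about how the YAML is parsed; `collectTerms` and `classNameToTerm` are hypothetical helpers.

```typescript
// Hypothetical parsed form of one class from a *Type.yaml / *Types.yaml file.
interface SchemaClassSketch {
  name: string;
  keywords?: string[];
  structured_aliases?: { literal_form: string }[];
  type_label?: Record<string, string>; // assumed: language code -> label
}

// "MunicipalArchive" -> "municipal archive"
function classNameToTerm(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1 $2').toLowerCase();
}

// Collect terms in priority order, lowercased and de-duplicated.
function collectTerms(cls: SchemaClassSketch): string[] {
  const ordered: string[] = [
    ...(cls.keywords ?? []),
    ...(cls.structured_aliases ?? []).map((alias) => alias.literal_form),
    ...Object.values(cls.type_label ?? {}),
    classNameToTerm(cls.name),
  ];
  const seen = new Set<string>();
  const terms: string[] = [];
  for (const raw of ordered) {
    const term = raw.trim().toLowerCase();
    if (term.length > 0 && !seen.has(term)) {
      seen.add(term);
      terms.push(term);
    }
  }
  return terms;
}
```

Because duplicates are dropped after the first occurrence, a term keeps the position of its highest-priority source.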

Cache Segmentation Rules

Rule 1: Subtype Specificity

Queries with specific subtypes should NOT match generic type cache entries:

Query: "kunstmusea in Amsterdam"     → key: "count:m.art_museum:amsterdam"
Cached: "musea in Amsterdam"         → key: "count:m:amsterdam"
Result: MISS (subtype mismatch) ✅

Rule 2: Record Set Type Isolation

Queries about specific record types should cache separately:

Query: "burgerlijke stand Utrecht"   → key: "query:a:civil_registry:utrecht"
Cached: "archieven in Utrecht"       → key: "list:a:utrecht"
Result: MISS (record set type mismatch) ✅

Rule 3: Subtype-to-Type Fallback

Generic queries must NOT be served from subtype cache entries: a cached subtype answer covers only a subset of what the generic query asks for.

Query: "musea in Amsterdam"          → key: "count:m:amsterdam"
Cached: "kunstmusea in Amsterdam"    → key: "count:m.art_museum:amsterdam"
Result: MISS (don't return subset for superset query)
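All three examples above resolve to a miss as soon as one segment differs, which can be sketched as a segment-by-segment comparison that also reports why a cached entry was rejected. `SegmentedQuery` and `mismatchReason` are hypothetical names, and the order in which segments are compared is an assumption.

```typescript
// Hypothetical structured form of a cache key.
interface SegmentedQuery {
  intent: string;
  institutionType: string;
  institutionSubtype: string | null;
  recordSetType: string | null;
  location: string;
}

// Returns null on a hit, otherwise the first segment that differs.
function mismatchReason(query: SegmentedQuery, cached: SegmentedQuery): string | null {
  if (query.intent !== cached.intent) return 'intent mismatch';
  if (query.institutionType !== cached.institutionType) return 'type mismatch';
  if (query.institutionSubtype !== cached.institutionSubtype) return 'subtype mismatch';
  if (query.recordSetType !== cached.recordSetType) return 'record set type mismatch';
  if (query.location !== cached.location) return 'location mismatch';
  return null;
}
```

Since a missing subtype is represented as `null`, the comparison is symmetric: a subtype query does not hit a generic entry, and a generic query does not hit a subtype entry.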

Migration Notes

Backwards Compatible: Existing cache entries without institutionSubtype continue to work
Gradual Rollout: New cache entries get subtype, old entries remain valid
Cache Clear: Consider clearing cache after deployment to ensure consistency

Validation

Run E2E tests to verify:

cd apps/archief-assistent
npm run test:e2e

Key test cases:

Geographic isolation (Amsterdam ≠ Rotterdam ≠ Noord-Holland)
Subtype isolation (kunstmuseum ≠ museum)
Record set isolation (burgerlijke stand ≠ archive)
Intent isolation (count ≠ list ≠ info)

References

Rule 41: Types classes define SPARQL template variables
Rule 0b: Type/Types file naming convention
CustodianType.yaml: Base taxonomy definition
AGENTS.md: GLAMORCUBESFIXPHDNT taxonomy documentation

Created: 2026-01-10
Author: OpenCode Agent
Status: Implemented (v2.0)

References

Pavlyshyn, V. "Context Graphs and Data Traces: Building Epistemology Layers for Agentic Memory"
Pavlyshyn, V. "The Shape of Knowledge: Topology Theory for Knowledge Graphs"
Pavlyshyn, V. "Beyond Hierarchy: Why Agentic AI Systems Need Holarchies"
Pavlyshyn, V. "Smalltalk: The Language That Changed Everything"
Pavlyshyn, V. "Clarity Traders: Beyond Vibe Coding"
