Specificity Score System for Heritage Custodian Ontology
Overview
This plan documents the implementation of a specificity scoring system for all classes in the GLAM Heritage Custodian Ontology. The system assigns numerical scores (0.0-1.0) to each class indicating:
- General Specificity Score: Context-free relevance indicating whether a class is highly specific or of general relevance
- Template-Based Specificity Scores: Multiple scores keyed by prompt/conversation template IDs indicating the likelihood of a class being relevant for follow-up questions
Problem Statement
The Heritage Custodian Ontology contains 304+ classes across multiple modules. When users interact with the RAG system or view UML visualizations, they face information overload:
- UML Overwhelm: Visualizations showing all classes are too dense to comprehend
- RAG Retrieval Noise: Follow-up questions retrieve irrelevant classes
- No Context Sensitivity: Same classes shown regardless of conversation topic
- Missing Relevance Signals: No way to filter or highlight based on topic
Example Scenario
Initial Question: "What archives are in Drenthe?"
Current RAG Response: Returns 50+ classes including:
- CustodianName (highly relevant for follow-up)
- Location (highly relevant)
- WebObservation (low relevance for this context)
- PersonProfileExtraction (low relevance for this context)
Desired Behavior: Use specificity scores to prioritize Archive, Location, GHCID, Collection for follow-up questions.
Solution: Dual-Layer Specificity Scoring
Layer 1: General Specificity Score
A single context-free score (0.0-1.0) stored as a LinkML annotation on each class:
| Score Range | Interpretation | Examples |
|---|---|---|
| 0.9-1.0 | Highly specific, rarely needed | LinkedInConnectionExtraction, GHCIDHistoryEntry |
| 0.7-0.9 | Domain-specific | Archive, Museum, Collection |
| 0.5-0.7 | Moderately general | DigitalPlatform, ChangeEvent |
| 0.3-0.5 | General utility | Location, Identifier, Provenance |
| 0.0-0.3 | Core/foundational | HeritageCustodian, CustodianName |
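- Lower scores = more generally relevant (always useful)
- Higher scores = more specific (only useful in specialized contexts)

This direction applies to the general score only. For the template-based scores in Layer 2, a higher value means the class is more relevant to that template.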
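As an illustration of how the bands in the table could be applied, here is a minimal sketch that maps a general score to its interpretation. The function name `interpret_specificity` is hypothetical, and because the table's ranges share their boundary values, the sketch assumes a boundary score (for example 0.7) belongs to the higher band.

```python
from typing import List, Tuple

# Bands copied from the table above as (lower bound, label), highest first.
# Assumption: a boundary value such as 0.7 falls into the higher band.
BANDS: List[Tuple[float, str]] = [
    (0.9, "Highly specific, rarely needed"),
    (0.7, "Domain-specific"),
    (0.5, "Moderately general"),
    (0.3, "General utility"),
    (0.0, "Core/foundational"),
]


def interpret_specificity(score: float) -> str:
    """Return the interpretation band for a general specificity score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within 0.0-1.0, got {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


print(interpret_specificity(0.75))  # Domain-specific
```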
Layer 2: Template-Based Specificity Scores
Multiple scores per class, keyed by conversation template IDs:
# Example: Archive class
annotations:
  specificity_score: 0.75         # General score
  template_specificity:
    archive_search: 0.95          # Highly relevant for archive queries
    museum_search: 0.10           # Not relevant for museum queries
    collection_discovery: 0.70    # Moderately relevant
    person_research: 0.20         # Low relevance
    location_browse: 0.60         # Somewhat relevant
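A minimal sketch of reading a template-based score from these annotations once they are loaded into a dictionary. The helper name `template_score` is hypothetical, and the fallback for templates without a score is an assumption, since the plan does not define one.

```python
from typing import Any, Dict

# Annotations for the Archive class, copied from the example above.
ARCHIVE_ANNOTATIONS: Dict[str, Any] = {
    "specificity_score": 0.75,
    "template_specificity": {
        "archive_search": 0.95,
        "museum_search": 0.10,
        "collection_discovery": 0.70,
        "person_research": 0.20,
        "location_browse": 0.60,
    },
}


def template_score(
    annotations: Dict[str, Any], template_id: str, default: float = 0.0
) -> float:
    """Return the score for template_id, or default if the class has none.

    Assumption: an unscored template is treated as not relevant (0.0).
    """
    per_template = annotations.get("template_specificity", {})
    return float(per_template.get(template_id, default))


print(template_score(ARCHIVE_ANNOTATIONS, "archive_search"))  # 0.95
```

Whether a missing template score should default to 0.0, or to some value derived from the general score, is a design decision the scoring guidelines would need to settle.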
Architecture
INTEGRATION NOTE: This system integrates with the existing TemplateClassifier infrastructure in backend/rag/template_sparql.py. The existing SPARQL template system handles query generation; the specificity system extends it for context-aware class filtering.
User Question
|
v
+----------------------------------+
| ConversationContextResolver | <-- EXISTING: Resolves elliptical questions
+----------------------------------+
|
v
+----------------------------------+
| TemplateClassifier (EXISTING) | <-- backend/rag/template_sparql.py:1104
+----------------------------------+ Returns SPARQL template_id
|
v
+----------------------------------+
| SPARQL → Context Mapper (NEW) | <-- Maps SPARQL template to context template
+----------------------------------+ e.g., list_institutions_by_type_city → location_browse
|
v
+----------------------------------+
| Specificity Lookup (NEW) | <-- Retrieves template-specific scores for all classes
+----------------------------------+
|
v
+----------------------------------+
| Class Filter/Ranker (NEW) | <-- Filters classes below threshold, ranks by score
+----------------------------------+
|
v
+----------------------------------+
| RAG Context Builder | <-- Builds context from high-specificity classes
+----------------------------------+
|
v
+----------------------------------+
| UML View Renderer | <-- Filters/highlights UML based on specificity
+----------------------------------+
Existing Infrastructure Reference
| Component | Location | Description |
|---|---|---|
| TemplateClassifier | backend/rag/template_sparql.py:1104 | DSPy Module classifying questions to SPARQL templates |
| TemplateClassifierSignature | backend/rag/template_sparql.py:634 | DSPy Signature defining template IDs |
| ConversationContextResolver | backend/rag/template_sparql.py:745 | Resolves elliptical follow-ups |
| sparql_templates.yaml | data/sparql_templates.yaml | SPARQL template definitions |
SPARQL Template → Context Template Mapping
The existing SPARQL templates serve query generation. Context templates serve class filtering:
| SPARQL Template (existing) | Context Template (specificity) |
|---|---|
| list_institutions_by_type_city | location_browse |
| list_institutions_by_type_region | location_browse |
| list_institutions_by_type_country | location_browse |
| count_institutions_by_type_location | location_browse |
| find_institution_by_name | general_heritage |
| find_institution_by_identifier | identifier_lookup |
| find_institutions_by_founding_date | organizational_change |
| compare_locations | location_browse |
| find_custodians_by_budget_threshold | general_heritage |
| none | general_heritage |
Institution-type refinement: When the SPARQL template extracts an institution_type slot, the context template is refined:
- institution_type = A → archive_search
- institution_type = M → museum_search
- institution_type = L → library_search
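The mapping table and the institution-type refinement can be sketched as a pair of dictionary lookups. This is a sketch under stated assumptions, not the project's mapper: the function name `map_to_context_template` is hypothetical, unknown SPARQL template IDs are assumed to fall back to general_heritage (the same value the table gives for none), and institution types other than A, M, and L are assumed to leave the base mapping unchanged.

```python
from typing import Dict, Optional

# Base mapping, copied from the table above.
SPARQL_TO_CONTEXT: Dict[str, str] = {
    "list_institutions_by_type_city": "location_browse",
    "list_institutions_by_type_region": "location_browse",
    "list_institutions_by_type_country": "location_browse",
    "count_institutions_by_type_location": "location_browse",
    "find_institution_by_name": "general_heritage",
    "find_institution_by_identifier": "identifier_lookup",
    "find_institutions_by_founding_date": "organizational_change",
    "compare_locations": "location_browse",
    "find_custodians_by_budget_threshold": "general_heritage",
    "none": "general_heritage",
}

# Refinement by the extracted institution_type slot, copied from the list above.
INSTITUTION_TYPE_TO_CONTEXT: Dict[str, str] = {
    "A": "archive_search",
    "M": "museum_search",
    "L": "library_search",
}


def map_to_context_template(
    sparql_template_id: str, institution_type: Optional[str] = None
) -> str:
    """Map a SPARQL template ID (plus optional slot) to a context template."""
    # Assumption: a recognised institution type overrides the base mapping.
    if institution_type in INSTITUTION_TYPE_TO_CONTEXT:
        return INSTITUTION_TYPE_TO_CONTEXT[institution_type]
    # Assumption: unknown template IDs are treated like "none".
    return SPARQL_TO_CONTEXT.get(sparql_template_id, "general_heritage")


print(map_to_context_template("list_institutions_by_type_city", "A"))  # archive_search
```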
Documentation Index
| Document | Description |
|---|---|
| 00-master-checklist.md | Implementation checklist with phases and tasks |
| 01-design-patterns.md | Software patterns (Strategy, Decorator, Observer) |
| 02-tdd.md | Test-driven development approach with test cases |
| 03-rag-dspy-integration.md | DSPy integration for template classification |
| 04-prompt-conversation-templates.md | Template definitions and scoring guidelines |
| 05-dependencies.md | Required libraries and services |
| 06-uml-visualization.md | UML filtering and highlighting based on scores |
Quick Start
1. Schema Annotation Format
# schemas/20251121/linkml/modules/classes/Archive.yaml
classes:
  Archive:
    is_a: HeritageCustodian
    class_uri: hc:Archive
    description: An archive holding historical records and documents
    annotations:
      # General specificity score (context-free)
      specificity_score: 0.75
      specificity_rationale: >-
        Domain-specific class for archival institutions. High relevance
        for record management, genealogy, and historical research queries.
      # Template-based specificity scores
      template_specificity:
        archive_search: 0.95
        museum_search: 0.10
        library_search: 0.30
        collection_discovery: 0.70
        person_research: 0.40
        location_browse: 0.60
        identifier_lookup: 0.50
        organizational_change: 0.65
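Since validation tooling is in scope, a basic range check over these annotations might look like the sketch below. It assumes the annotations have already been loaded into a plain dictionary (YAML parsing is left out to avoid extra dependencies), and the function name `validate_annotations` is hypothetical.

```python
from typing import Any, Dict, List


def validate_annotations(class_name: str, annotations: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one class's specificity annotations."""
    errors: List[str] = []

    def check(label: str, value: Any) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{class_name}: {label} is not a number")
        elif not 0.0 <= value <= 1.0:
            errors.append(f"{class_name}: {label}={value} is outside 0.0-1.0")

    if "specificity_score" not in annotations:
        errors.append(f"{class_name}: missing specificity_score")
    else:
        check("specificity_score", annotations["specificity_score"])

    for template_id, value in annotations.get("template_specificity", {}).items():
        check(f"template_specificity.{template_id}", value)

    return errors


archive = {
    "specificity_score": 0.75,
    "template_specificity": {"archive_search": 0.95, "museum_search": 0.10},
}
print(validate_annotations("Archive", archive))  # []
```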
2. Usage in RAG Pipeline
# 1. Classify the user question into a context template
#    (the SPARQL template ID, mapped and refined by institution type)
template_id = classifier.classify("Which archives in Drenthe have photo collections?")
# -> "archive_search"
# 2. Retrieve template-specific scores for all classes
scores = specificity_lookup.get_scores(template_id)
# -> {"Archive": 0.95, "Collection": 0.85, "Location": 0.80, ...}
# 3. Filter classes above threshold
relevant_classes = [cls for cls, score in scores.items() if score > 0.5]
# -> ["Archive", "Collection", "Location", "GHCID", "CustodianName"]
# 4. Build RAG context with relevant classes only
context = build_context(relevant_classes)
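Step 3 above filters but does not order the result. A minimal self-contained sketch of the Class Filter/Ranker step, which also sorts by descending score, is shown here. The function name `filter_and_rank` is hypothetical, the threshold of 0.5 is taken from the snippet above, and the WebObservation score is an invented value for illustration only.

```python
from typing import Dict, List


def filter_and_rank(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep classes scoring above threshold, highest score first."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    # Sort by descending score; break ties alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in kept]


# The first three scores come from the example above; WebObservation is illustrative.
scores = {
    "Location": 0.80,
    "WebObservation": 0.15,
    "Archive": 0.95,
    "Collection": 0.85,
}
print(filter_and_rank(scores))  # ['Archive', 'Collection', 'Location']
```

Note that this comparison is strict (`>`), while the UML filter uses `>=`; the two should probably agree on how a score exactly at the threshold is treated.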
3. Usage in UML Visualization
// Filter nodes by specificity for cleaner visualization
const visibleNodes = nodes.filter(node => {
  const score = getSpecificityScore(node.class, currentTemplate);
  return score >= specificityThreshold;
});

// Or highlight by specificity (opacity/size based on score)
nodes.forEach(node => {
  const score = getSpecificityScore(node.class, currentTemplate);
  node.opacity = 0.3 + (score * 0.7);  // 0.3-1.0 opacity range
  node.radius = 10 + (score * 20);     // 10-30 radius range
});
Scope
In Scope
- 304 class files in schemas/20251121/linkml/modules/classes/
- General specificity score (0.0-1.0) for each class
- Template-based scores for 10-15 conversation templates
- RAG integration for class filtering
- UML visualization filtering/highlighting
- Validation tooling
Out of Scope
- Slot-level specificity scores (future enhancement)
- Dynamic score learning (future ML enhancement)
- User preference customization (future feature)
Key Metrics
| Metric | Current | Target |
|---|---|---|
| Classes with specificity scores | 0 | 304 |
| Conversation templates defined | 0 | 10-15 |
| RAG retrieval precision | Unknown | +20% improvement |
| UML node count (filtered view) | 304 | <50 per template |
| Follow-up question relevance | Unknown | >80% precision |
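The precision targets in the table presuppose a way to measure precision. One common definition, assumed here because the plan does not specify one, is the fraction of retrieved classes that a reviewer judged relevant. The function name and the sample data below are illustrative only.

```python
from typing import Iterable, Set


def precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved classes that are in the relevant set."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0  # Assumption: an empty retrieval counts as zero precision.
    return len(retrieved_set & relevant) / len(retrieved_set)


# Illustrative data, not measured results.
retrieved = ["Archive", "Location", "WebObservation", "Collection"]
relevant = {"Archive", "Location", "GHCID", "Collection"}
print(precision(retrieved, relevant))  # 0.75
```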
Next Steps After Planning
- Define conversation templates (Task 4) - Identify 10-15 common query patterns
- Score foundational classes - Start with core classes (HeritageCustodian, Location, etc.)
- Build scoring tool - Create script to add annotations to all 304 classes
- Integrate with RAG - Modify DSPy pipeline to use scores
- Integrate with UML - Add filtering/highlighting to frontend
- Validate with users - Test retrieval quality improvements
References
- AGENTS.md - Rule 37: Specificity Score Convention
- .opencode/rules/specificity-score-convention.md - Full scoring rules
- schemas/20251121/linkml/ - Target schema files
- backend/rag/template_sparql.py - EXISTING TemplateClassifier infrastructure
- data/sparql_templates.yaml - EXISTING SPARQL template definitions
- docs/plan/prompt-query_template_mapping/ - Related template-based query system
Version: 0.1.0
Last Updated: 2025-01-04
Status: Planning Phase