kempersc 11983014bb Enhance specificity scoring system integration with existing infrastructure

- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.

2026-01-05 17:37:49 +01:00

Specificity Score System - UML Visualization Integration

Overview

This document describes how specificity scores integrate with UML diagram generation to create filtered, readable visualizations of the Heritage Custodian Ontology's 304+ classes.

INTEGRATION NOTE: This document references context templates (e.g., archive_search, museum_search) used for UML filtering. These context templates are mapped from the existing SPARQL templates in backend/rag/template_sparql.py. See the mapping table below.

SPARQL → Context Template Mapping (for UML Views)

The existing TemplateClassifier in backend/rag/template_sparql.py:1104 classifies questions to SPARQL template IDs. For UML visualization, these are mapped to context templates:

SPARQL Template ID	Context Template	UML View Focus
`list_institutions_by_type_city`	`location_browse`	Institution + Location classes
`list_institutions_by_type_region`	`location_browse`	Institution + Region classes
`find_institution_by_identifier`	`identifier_lookup`	Identifier + GHCID classes
`find_institutions_by_founding_date`	`organizational_change`	ChangeEvent + Timeline classes
`list_institutions_by_collection_type`	`collection_discovery`	Collection + Subject classes
`find_person_by_role`	`person_research`	Person + Staff + Role classes
`none` (default)	`general_heritage`	Core heritage classes

Institution Type Refinement: When a SPARQL template extracts the institution_type slot, the context template is refined further:

institution_type = A → refine to archive_search
institution_type = M → refine to museum_search
institution_type = L → refine to library_search
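
The mapping table and the refinement rule above can be sketched as a small lookup. This is an illustration only: the function name `to_context_template` and the dictionary names are hypothetical and are not taken from backend/rag/template_sparql.py. It also assumes that a recognised institution_type overrides the table lookup, which the text implies but does not state outright.

```python
from typing import Optional

# SPARQL template ID -> context template, copied from the table above.
SPARQL_TO_CONTEXT = {
    "list_institutions_by_type_city": "location_browse",
    "list_institutions_by_type_region": "location_browse",
    "find_institution_by_identifier": "identifier_lookup",
    "find_institutions_by_founding_date": "organizational_change",
    "list_institutions_by_collection_type": "collection_discovery",
    "find_person_by_role": "person_research",
}

# institution_type slot value -> refined context template.
INSTITUTION_TYPE_REFINEMENT = {
    "A": "archive_search",
    "M": "museum_search",
    "L": "library_search",
}

DEFAULT_CONTEXT = "general_heritage"


def to_context_template(
    sparql_template_id: Optional[str],
    institution_type: Optional[str] = None,
) -> str:
    """Hypothetical helper: resolve the context template for a UML view."""
    # Assumption: a recognised institution type wins over the table lookup.
    if institution_type in INSTITUTION_TYPE_REFINEMENT:
        return INSTITUTION_TYPE_REFINEMENT[institution_type]
    # Unknown or missing template IDs fall back to the default view.
    return SPARQL_TO_CONTEXT.get(sparql_template_id, DEFAULT_CONTEXT)
```

For example, `to_context_template("list_institutions_by_type_city", "A")` resolves to `archive_search`, while the same template without an institution type resolves to `location_browse`.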

Problem Statement

Current UML Challenges

The Heritage Custodian Ontology contains 304+ classes, making full UML diagrams:

Visually overwhelming - Too many nodes to comprehend
Difficult to navigate - No clear entry point or hierarchy
Context-blind - Shows everything regardless of current task
Slow to render - Large graphs take seconds to generate

Solution: Specificity-Based Filtering

Use specificity scores to:

Filter classes by relevance threshold
Adjust visual prominence (opacity, size, position)
Create template-specific focused views
Generate progressive disclosure diagrams

Filtering Strategies

Strategy 1: Threshold Filtering

Show only classes above a specificity threshold:

def filter_classes_by_threshold(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5
) -> list[str]:
    """Return class names meeting specificity threshold."""
    included = []
    
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        
        if score >= threshold:
            included.append(class_name)
    
    return included


# Example: archive_search with threshold 0.6
# Returns: ["Archive", "RecordSet", "Fonds", "FindingAid", 
#           "Collection", "Location", "HeritageCustodian", ...]

Threshold Guidelines:

Threshold	Result	Use Case
0.8+	5-10 classes	Focused overview, quick reference
0.6+	15-30 classes	Detailed task view
0.4+	40-60 classes	Comprehensive view
0.2+	80-150 classes	Near-complete (excludes technical)
0.0+	304+ classes	Full ontology (overwhelming)
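
The class counts in the guideline table depend on how scores are actually distributed, so it may be worth checking them against real data. A minimal sketch, assuming the scores for one template have already been collected into a plain dictionary (the helper name `count_classes_per_threshold` and the sample scores are made up for illustration):

```python
def count_classes_per_threshold(
    scores: dict[str, float],
    thresholds: tuple[float, ...] = (0.8, 0.6, 0.4, 0.2, 0.0),
) -> dict[float, int]:
    """Count how many classes would survive each threshold."""
    return {
        t: sum(1 for score in scores.values() if score >= t)
        for t in thresholds
    }


# Illustrative scores only; real values come from the schema annotations.
example_scores = {
    "Archive": 0.95,
    "Fonds": 0.85,
    "Collection": 0.65,
    "Location": 0.45,
    "ApiEndpoint": 0.10,
}
counts = count_classes_per_threshold(example_scores)
for threshold, count in counts.items():
    print(f"{threshold:.1f}+ -> {count} classes")
```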

Strategy 2: Top-N Filtering

Show the N most relevant classes for a template:

def top_n_classes(
    schema: SchemaView,
    template_id: str,
    n: int = 20
) -> list[str]:
    """Return top N classes by specificity for template."""
    class_scores = []
    
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        class_scores.append((class_name, score))
    
    # Sort by score descending
    class_scores.sort(key=lambda x: x[1], reverse=True)
    
    return [name for name, _ in class_scores[:n]]


# Example: top_n_classes(schema, "person_research", n=10)
# Returns: ["PersonProfile", "Staff", "Role", "Affiliation", 
#           "Director", "HeritageCustodian", "ContactPoint", ...]
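
Because the sort above is stable, classes with equal scores keep the schema's iteration order, so the cut-off at N can shift when the schema is reordered. A possible variant that breaks ties alphabetically, sketched over a plain score dictionary (`top_n_from_scores` is a hypothetical name, not part of the existing code):

```python
def top_n_from_scores(scores: dict[str, float], n: int = 20) -> list[str]:
    """Return the n highest-scoring class names, best first."""
    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


# Illustrative scores only.
sample = {"Staff": 0.9, "Role": 0.9, "PersonProfile": 0.95, "Location": 0.3}
top_three = top_n_from_scores(sample, n=3)
```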

Strategy 3: Tier-Based Grouping

Group classes into visual tiers based on score ranges:

from dataclasses import dataclass
from enum import Enum

class VisualTier(Enum):
    PRIMARY = "primary"        # score >= 0.8
    SECONDARY = "secondary"    # 0.5 <= score < 0.8
    TERTIARY = "tertiary"      # 0.3 <= score < 0.5
    BACKGROUND = "background"  # score < 0.3

@dataclass
class TieredClass:
    name: str
    score: float
    tier: VisualTier

def tier_classes(
    schema: SchemaView,
    template_id: str
) -> dict[VisualTier, list[TieredClass]]:
    """Group classes into visual tiers."""
    tiers = {tier: [] for tier in VisualTier}
    
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        
        if score >= 0.8:
            tier = VisualTier.PRIMARY
        elif score >= 0.5:
            tier = VisualTier.SECONDARY
        elif score >= 0.3:
            tier = VisualTier.TERTIARY
        else:
            tier = VisualTier.BACKGROUND
        
        tiers[tier].append(TieredClass(class_name, score, tier))
    
    return tiers
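
The tier boundaries can also be kept in one table so they are not repeated in both comments and branches. A small sketch using `bisect` from the standard library; the names `TIER_BOUNDS`, `TIER_NAMES` and `score_to_tier_name` are hypothetical, and the strings match the `VisualTier` values above:

```python
from bisect import bisect_right

# Lower bounds of the tertiary, secondary and primary tiers.
TIER_BOUNDS = (0.3, 0.5, 0.8)
# One more name than bounds: everything below 0.3 is "background".
TIER_NAMES = ("background", "tertiary", "secondary", "primary")


def score_to_tier_name(score: float) -> str:
    """Map a score to its tier name; each lower bound is inclusive."""
    # bisect_right counts how many bounds are <= score.
    return TIER_NAMES[bisect_right(TIER_BOUNDS, score)]
```

Changing a boundary then means editing a single tuple rather than an if/elif chain.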

Visual Styling

Style 1: Opacity Mapping

Map specificity score to node opacity:

def score_to_opacity(score: float) -> str:
    """Convert score to a two-digit hex alpha value (20-FF)."""
    # Score 1.0 -> fully opaque (FF)
    # Score 0.0 -> nearly transparent (20)
    score = min(1.0, max(0.0, score))  # Clamp so the result stays two hex digits
    opacity = int(32 + (score * 223))  # Range: 32-255
    return f"{opacity:02X}"

def style_node_by_score(
    class_name: str,
    score: float,
    base_color: str = "#4A90D9"
) -> dict:
    """Generate node styling based on specificity score."""
    opacity = score_to_opacity(score)
    
    return {
        "fillcolor": f"{base_color}{opacity}",
        "style": "filled",
        "fontcolor": "#333333" if score > 0.5 else "#888888",
        "penwidth": str(1 + score * 2),  # 1-3px border
    }

Visual Result:

Score	Opacity	Appearance
0.95	~95%	Solid, prominent
0.75	~78%	Clear, visible
0.50	~56%	Semi-transparent
0.25	~34%	Faded, background
0.10	~21%	Very faint
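
Since the alpha byte produced by the formula `32 + score * 223` never drops below 0x20, the effective opacity has a floor of roughly 13% rather than 0%. The percentages that the formula actually yields can be checked with a short sketch (`opacity_percent` is a hypothetical helper that mirrors the formula in `score_to_opacity`):

```python
def opacity_percent(score: float) -> int:
    """Effective opacity in percent for the alpha formula used above."""
    clamped = min(1.0, max(0.0, score))  # keep the alpha byte within 32-255
    alpha = int(32 + clamped * 223)
    return round(alpha / 255 * 100)


for score in (0.95, 0.75, 0.50, 0.25, 0.10):
    print(f"{score:.2f} -> {opacity_percent(score)}%")
```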

Style 2: Size Mapping

Adjust node size based on importance:

def score_to_size(score: float) -> tuple[float, float]:
    """Convert score to width and height."""
    # Base size 1.0, max size 2.5
    size = 1.0 + (score * 1.5)
    return (size, size * 0.6)  # Width, height

def style_node_with_size(
    class_name: str,
    score: float
) -> dict:
    """Generate node styling with size variation."""
    width, height = score_to_size(score)
    
    return {
        "width": str(width),
        "height": str(height),
        "fontsize": str(int(10 + score * 6)),  # 10-16pt
    }

Style 3: Color Gradients

Use color to indicate relevance:

from colorsys import hsv_to_rgb

def score_to_color(score: float) -> str:
    """Map score to color gradient (red -> yellow -> green)."""
    # Hue: 0.0 (red) -> 0.33 (green)
    hue = score * 0.33
    saturation = 0.6
    value = 0.9
    
    r, g, b = hsv_to_rgb(hue, saturation, value)
    return f"#{int(r*255):02X}{int(g*255):02X}{int(b*255):02X}"


def style_node_with_color(class_name: str, score: float) -> dict:
    """Generate node styling with score-based color."""
    return {
        "fillcolor": score_to_color(score),
        "style": "filled",
    }
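
One detail worth guarding against: hue is cyclic, so a score outside 0.0-1.0 would wrap around the colour wheel (a score of 3.0 gives a hue near 0.99, which is red again). A minimal sketch that clamps first, assuming the same saturation and value as above; `score_to_rgb` is a hypothetical name:

```python
from colorsys import hsv_to_rgb


def score_to_rgb(score: float) -> tuple[int, int, int]:
    """Map a score to an (r, g, b) triple on the red-to-green gradient."""
    clamped = min(1.0, max(0.0, score))  # avoid hue wrap-around
    r, g, b = hsv_to_rgb(clamped * 0.33, 0.6, 0.9)
    return (int(r * 255), int(g * 255), int(b * 255))


low = score_to_rgb(0.0)   # red dominates
mid = score_to_rgb(0.5)   # red and green both high, i.e. yellow
high = score_to_rgb(1.0)  # green dominates
```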

Color Mapping:

Score	Color	Meaning
0.9+	Green	Highly relevant
0.6-0.9	Yellow-Green	Relevant
0.3-0.6	Yellow	Somewhat relevant
0.1-0.3	Orange	Low relevance
<0.1	Red	Not relevant

Diagram Generation

Graphviz DOT Generation

from graphviz import Digraph

def generate_filtered_uml(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5,
    show_edges: bool = True
) -> Digraph:
    """Generate UML diagram filtered by specificity."""
    
    dot = Digraph(
        name=f"Heritage Ontology - {template_id}",
        comment=f"Classes with specificity >= {threshold}",
    )
    
    # Graph attributes
    dot.attr(
        rankdir="TB",  # Top to bottom
        splines="ortho",
        nodesep="0.5",
        ranksep="0.8",
    )
    
    # Node defaults
    dot.attr("node", shape="record", fontname="Helvetica")
    
    # Get filtered classes
    included_classes = set()
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        
        if score >= threshold:
            included_classes.add(class_name)
            
            # Add node with styling
            style = style_node_by_score(class_name, score)
            label = create_uml_label(cls)
            dot.node(class_name, label=label, **style)
    
    # Add edges (inheritance, relationships)
    if show_edges:
        for class_name in included_classes:
            cls = schema.get_class(class_name)
            
            # Inheritance (is_a)
            if cls.is_a and cls.is_a in included_classes:
                dot.edge(cls.is_a, class_name, arrowhead="empty")
            
            # Associations (slot ranges)
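            for slot in cls.slots or []:
                slot_def = schema.get_slot(slot)
                # get_slot may return None for unknown slots; skip those
                if slot_def and slot_def.range in included_classes:
                    dot.edge(
                        class_name,
                        slot_def.range,
                        # splines="ortho" does not place regular edge labels,
                        # so use an external label instead
                        xlabel=slot,
                        arrowhead="open"
                    )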
    
    return dot


def create_uml_label(cls) -> str:
    """Create UML class label with attributes."""
    slots = []
    for slot_name in cls.slots or []:
        slots.append(f"+ {slot_name}")
    
    slot_str = "\\l".join(slots) if slots else ""
    
    return f"{{{cls.name}|{slot_str}\\l}}"

PlantUML Generation

def generate_plantuml(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5
) -> str:
    """Generate PlantUML diagram filtered by specificity."""
    
    lines = [
        "@startuml",
        f"title Heritage Ontology - {template_id}",
        "skinparam classAttributeIconSize 0",
        "skinparam shadowing false",
        "",
    ]
    
    # Define color scale
    lines.append("skinparam class {")
    lines.append("  BackgroundColor<<high>> #90EE90")
    lines.append("  BackgroundColor<<medium>> #FFFACD")
    lines.append("  BackgroundColor<<low>> #FFB6C1")
    lines.append("}")
    lines.append("")
    
    # Get filtered classes
    included_classes = set()
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        
        if score >= threshold:
            included_classes.add(class_name)
            
            # Determine stereotype
            if score >= 0.8:
                stereotype = "<<high>>"
            elif score >= 0.5:
                stereotype = "<<medium>>"
            else:
                stereotype = "<<low>>"
            
            # Class definition
            lines.append(f"class {class_name} {stereotype} {{")
            for slot in cls.slots or []:
                lines.append(f"  +{slot}")
            lines.append("}")
    
    lines.append("")
    
    # Add relationships
    for class_name in included_classes:
        cls = schema.get_class(class_name)
        
        # Inheritance
        if cls.is_a and cls.is_a in included_classes:
            lines.append(f"{cls.is_a} <|-- {class_name}")
    
    lines.append("@enduml")
    
    return "\n".join(lines)
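
The PlantUML generator above emits inheritance only, whereas the Graphviz version also draws slot-range associations. One way to add them is sketched below. It assumes the slot ranges have already been extracted into a nested dictionary (class name -> slot name -> range class); that structure and the helper name `plantuml_association_lines` are illustrative, not part of the existing code.

```python
def plantuml_association_lines(
    included_classes: set[str],
    slot_ranges: dict[str, dict[str, str]],
) -> list[str]:
    """Build 'A --> B : slot' lines for slots whose range is also shown."""
    lines = []
    # Sort for stable output, which keeps generated diagrams diff-friendly.
    for class_name in sorted(included_classes):
        for slot_name, range_name in sorted(slot_ranges.get(class_name, {}).items()):
            if range_name in included_classes:
                lines.append(f"{class_name} --> {range_name} : {slot_name}")
    return lines


# Illustrative data only.
shown = {"Archive", "Collection"}
ranges = {"Archive": {"holds": "Collection", "located_in": "Location"}}
association_lines = plantuml_association_lines(shown, ranges)
```

The resulting lines could be appended just before "@enduml".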

Interactive Features

Feature 1: Progressive Disclosure

Start with high-threshold view, allow drilling down:

class ProgressiveUMLViewer:
    """UML viewer with progressive disclosure based on specificity."""
    
    def __init__(self, schema: SchemaView, template_id: str):
        self.schema = schema
        self.template_id = template_id
        self.current_threshold = 0.8  # Start focused
    
    def render(self) -> Digraph:
        """Render current view."""
        return generate_filtered_uml(
            self.schema,
            self.template_id,
            self.current_threshold
        )
    
    def expand(self, step: float = 0.1):
        """Show more classes by lowering threshold."""
        # Round to avoid float drift (0.8 - 0.1 == 0.7000000000000001)
        self.current_threshold = max(0.1, round(self.current_threshold - step, 2))
        return self.render()
    
    def focus(self, step: float = 0.1):
        """Show fewer classes by raising threshold."""
        self.current_threshold = min(0.95, round(self.current_threshold + step, 2))
        return self.render()
    
    def expand_around(self, class_name: str, depth: int = 1):
        """Expand to show neighbors of a specific class."""
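        # Find classes connected to the given class.
        # NOTE: _find_neighbors is not defined in this sketch and must be
        # provided by the implementation.
        neighbors = self._find_neighbors(class_name, depth)
        
        # Temporarily lower threshold for neighbors
        # Implementation depends on visualization framework
        raise NotImplementedError(
            "expand_around depends on the visualization framework"
        )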
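
The neighbor lookup that `expand_around` relies on is left undefined above. One plausible approach is a breadth-first search limited to `depth` hops. The sketch below assumes the class graph is available as an undirected adjacency dictionary (inheritance and slot-range links recorded in both directions); both that structure and the function name `find_neighbors` are assumptions for illustration.

```python
from collections import deque


def find_neighbors(
    adjacency: dict[str, set[str]],
    start: str,
    depth: int = 1,
) -> set[str]:
    """Return classes within `depth` hops of `start`, excluding `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return seen - {start}


# Illustrative graph only.
graph = {
    "Archive": {"Fonds", "Location"},
    "Fonds": {"Archive", "RecordSet"},
    "Location": {"Archive"},
    "RecordSet": {"Fonds"},
}
```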

Feature 2: Template Switching

Quick view switching between conversation templates:

class MultiTemplateViewer:
    """UML viewer supporting multiple template perspectives."""
    
    TEMPLATES = [
        "archive_search",
        "museum_search", 
        "library_search",
        "collection_discovery",
        "person_research",
        "location_browse",
        "identifier_lookup",
        "organizational_change",
        "digital_platform",
        "general_heritage",
    ]
    
    def __init__(self, schema: SchemaView):
        self.schema = schema
        self.current_template = "general_heritage"
        self.threshold = 0.5
    
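    def render(self) -> Digraph:
        """Render the current template at the current threshold."""
        return generate_filtered_uml(
            self.schema,
            self.current_template,
            self.threshold
        )
    
    def switch_template(self, template_id: str) -> Digraph:
        """Switch to a different template perspective."""
        if template_id not in self.TEMPLATES:
            raise ValueError(f"Unknown template: {template_id}")
        
        self.current_template = template_id
        return self.render()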
    
    def compare_templates(
        self,
        template_a: str,
        template_b: str
    ) -> tuple[Digraph, Digraph]:
        """Generate side-by-side comparison of two templates."""
        return (
            generate_filtered_uml(self.schema, template_a, self.threshold),
            generate_filtered_uml(self.schema, template_b, self.threshold),
        )
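
Two side-by-side diagrams can be hard to compare by eye, so a textual summary of the overlap may be a useful companion. A minimal sketch, assuming the class names of each view have already been collected (for instance with `filter_classes_by_threshold`); `compare_class_sets` is a hypothetical helper:

```python
def compare_class_sets(
    classes_a: set[str],
    classes_b: set[str],
) -> dict[str, list[str]]:
    """Summarise which classes two template views share or differ on."""
    return {
        "only_a": sorted(classes_a - classes_b),
        "shared": sorted(classes_a & classes_b),
        "only_b": sorted(classes_b - classes_a),
    }


# Illustrative class sets only.
archive_view = {"Archive", "Fonds", "HeritageCustodian"}
museum_view = {"Museum", "Exhibition", "HeritageCustodian"}
summary = compare_class_sets(archive_view, museum_view)
```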

Feature 3: Hover Information

Add score metadata to tooltips:

def add_tooltip_info(
    dot: Digraph,
    class_name: str,
    cls,
    template_id: str
) -> None:
    """Add tooltip with specificity information."""
    
    general_score = get_general_score(cls)
    template_score = get_template_score(cls, template_id)
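    # The annotation may be absent; avoid calling .value on a missing entry
    annotation = cls.annotations.get("specificity_rationale")
    rationale = annotation.value if annotation else "(none recorded)"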
    
    tooltip = f"""Class: {class_name}
General Specificity: {general_score:.2f}
{template_id} Specificity: {template_score:.2f}

Rationale: {rationale}"""
    
    dot.node(
        class_name,
        tooltip=tooltip,
        URL=f"#class-{class_name}",  # Link to documentation
    )

Pre-generated Views

Standard View Set

Generate a set of pre-computed views for common use cases:

from pathlib import Path

STANDARD_VIEWS = {
    # Overview views
    "overview_core": {
        "template": "general_heritage",
        "threshold": 0.7,
        "description": "Core classes for heritage custodian modeling"
    },
    "overview_full": {
        "template": "general_heritage", 
        "threshold": 0.3,
        "description": "Comprehensive view of all semantic classes"
    },
    
    # Task-specific views
    "task_archive_research": {
        "template": "archive_search",
        "threshold": 0.5,
        "description": "Classes relevant for archive research"
    },
    "task_person_lookup": {
        "template": "person_research",
        "threshold": 0.5,
        "description": "Classes for finding people in heritage institutions"
    },
    "task_collection_discovery": {
        "template": "collection_discovery",
        "threshold": 0.5,
        "description": "Classes for exploring collections"
    },
    
    # Technical views
    "technical_identifiers": {
        "template": "identifier_lookup",
        "threshold": 0.6,
        "description": "Identifier and linking classes"
    },
    "technical_platforms": {
        "template": "digital_platform",
        "threshold": 0.6,
        "description": "Digital platform and API classes"
    },
}


def generate_all_standard_views(schema: SchemaView, output_dir: Path):
    """Generate all standard views as SVG files."""
    
    for view_id, config in STANDARD_VIEWS.items():
        dot = generate_filtered_uml(
            schema,
            config["template"],
            config["threshold"]
        )
        
        # Add title and description
        dot.attr(label=config["description"])
        
        # Render to SVG
        output_path = output_dir / f"uml_{view_id}"
        dot.render(output_path, format="svg", cleanup=True)
        
        print(f"Generated: {output_path}.svg")
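
Because the view definitions are plain dictionaries, a typo in a template name or an out-of-range threshold would only surface at render time. A small pre-flight check could catch this earlier. The sketch below is an assumption about how such a check might look; `validate_views` is a hypothetical name, and the set of known templates has to be supplied by the caller.

```python
def validate_views(
    views: dict[str, dict],
    known_templates: set[str],
) -> list[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for view_id, config in views.items():
        template = config.get("template")
        threshold = config.get("threshold")
        if template not in known_templates:
            problems.append(f"{view_id}: unknown template {template!r}")
        if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
            problems.append(f"{view_id}: threshold {threshold!r} outside 0.0-1.0")
    return problems


# Illustrative configuration only.
sample_views = {
    "ok_view": {"template": "archive_search", "threshold": 0.5},
    "bad_template": {"template": "archive_serach", "threshold": 0.5},
    "bad_threshold": {"template": "archive_search", "threshold": 1.5},
}
issues = validate_views(sample_views, {"archive_search", "general_heritage"})
```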

Integration with Frontend

API Endpoint for Dynamic Diagrams

from fastapi import FastAPI, Query
from fastapi.responses import Response

app = FastAPI()

@app.get("/api/uml/filtered")
async def get_filtered_uml(
    template: str = Query("general_heritage"),
    threshold: float = Query(0.5, ge=0.0, le=1.0),
    format: str = Query("svg", pattern="^(svg|png|dot)$"),  # use regex= on FastAPI < 0.100
):
    """Generate filtered UML diagram."""
    
    schema = load_schema()
    dot = generate_filtered_uml(schema, template, threshold)
    
    if format == "dot":
        return Response(
            content=dot.source,
            media_type="text/plain"
        )
    else:
        rendered = dot.pipe(format=format)
        media_type = "image/svg+xml" if format == "svg" else "image/png"
        return Response(content=rendered, media_type=media_type)


@app.get("/api/uml/templates")
async def list_available_templates():
    """List available template perspectives."""
    return {
        "templates": [
            {"id": "archive_search", "name": "Archive Search"},
            {"id": "museum_search", "name": "Museum Search"},
            {"id": "library_search", "name": "Library Search"},
            {"id": "collection_discovery", "name": "Collection Discovery"},
            {"id": "person_research", "name": "Person Research"},
            {"id": "location_browse", "name": "Location Browse"},
            {"id": "identifier_lookup", "name": "Identifier Lookup"},
            {"id": "organizational_change", "name": "Organizational Change"},
            {"id": "digital_platform", "name": "Digital Platform"},
            {"id": "general_heritage", "name": "General (Default)"},
        ]
    }

React Component Example

// components/FilteredUMLViewer.tsx
import { useState } from 'react';

interface Props {
  initialTemplate?: string;
  initialThreshold?: number;
}

export function FilteredUMLViewer({
  initialTemplate = 'general_heritage',
  initialThreshold = 0.5,
}: Props) {
  const [template, setTemplate] = useState(initialTemplate);
  const [threshold, setThreshold] = useState(initialThreshold);
  
  const umlUrl = `/api/uml/filtered?template=${encodeURIComponent(template)}&threshold=${threshold}&format=svg`;
  
  return (
    <div className="uml-viewer">
      <div className="controls">
        <select
          value={template}
          onChange={(e) => setTemplate(e.target.value)}
        >
          <option value="archive_search">Archive Search</option>
          <option value="museum_search">Museum Search</option>
          <option value="person_research">Person Research</option>
          <option value="general_heritage">General</option>
          {/* ... more templates */}
        </select>
        
        <input
          type="range"
          min="0"
          max="1"
          step="0.1"
          value={threshold}
          onChange={(e) => setThreshold(parseFloat(e.target.value))}
        />
        <span>Threshold: {threshold.toFixed(1)}</span>
      </div>
      
      <div className="diagram">
        <img src={umlUrl} alt={`UML diagram for ${template}`} />
      </div>
    </div>
  );
}

Performance Considerations

Caching Strategy

from functools import lru_cache

@lru_cache(maxsize=100)
def get_cached_uml(
    template_id: str,
    threshold: float,
    format: str
) -> bytes:
    """Cache rendered UML diagrams."""
    schema = load_schema()
    dot = generate_filtered_uml(schema, template_id, threshold)
    return dot.pipe(format=format)


def invalidate_uml_cache():
    """Clear cache when schema changes."""
    get_cached_uml.cache_clear()
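
Note that `lru_cache` keys on the exact float value, so thresholds that differ only by rounding noise (for example `0.8 - 0.1`, which evaluates to `0.7000000000000001`) would occupy separate cache entries and trigger separate renders. One possible mitigation is to normalise the threshold before the cached call. The sketch below uses a stand-in renderer so it runs on its own; `get_uml`, `_render_cached` and `render_calls` are hypothetical names.

```python
from functools import lru_cache

render_calls = []  # records real renders, for demonstration only


@lru_cache(maxsize=100)
def _render_cached(template_id: str, threshold: float, fmt: str) -> bytes:
    # Stand-in for the real rendering call.
    render_calls.append((template_id, threshold, fmt))
    return f"{template_id}@{threshold}.{fmt}".encode()


def get_uml(template_id: str, threshold: float, fmt: str = "svg") -> bytes:
    """Round the threshold so near-identical requests share a cache entry."""
    return _render_cached(template_id, round(threshold, 1), fmt)


first = get_uml("archive_search", 0.7)
second = get_uml("archive_search", 0.8 - 0.1)  # 0.7000000000000001
```

Rounding to one decimal matches the 0.1 slider step used by the frontend; a finer step would need a matching precision.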

Pre-rendering for Production

#!/bin/bash
# scripts/prerender_uml_views.sh
# Pre-render all standard UML views for production

TEMPLATES="archive_search museum_search library_search collection_discovery person_research location_browse identifier_lookup organizational_change digital_platform general_heritage"
THRESHOLDS="0.3 0.5 0.7 0.9"

for template in $TEMPLATES; do
  for threshold in $THRESHOLDS; do
    echo "Rendering: $template @ $threshold"
    python scripts/render_uml.py \
      --template "$template" \
      --threshold "$threshold" \
      --output "static/uml/${template}_${threshold}.svg"
  done
done

echo "Pre-rendering complete!"

Validation Checklist

All templates have corresponding view generation
Threshold range validated (0.0-1.0)
Edge cases handled (no classes meet threshold)
Large diagrams render within timeout
SVG output is valid XML
Interactive features work in target browsers
Cache invalidation triggers on schema change
Standard views regenerated on deployment
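
The "SVG output is valid XML" item can be checked with the standard library alone. A minimal sketch, assuming the rendered bytes are available (for instance from `dot.pipe(format="svg")`); it only verifies well-formedness and the root element, not the SVG schema, and `is_well_formed_svg` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_well_formed_svg(data: bytes) -> bool:
    """Return True if data parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    # Accept both namespaced and bare <svg> roots.
    return root.tag in (f"{SVG_NS}svg", "svg")
```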

References

docs/plan/specificity_score/04-prompt-conversation-templates.md - Template definitions
docs/plan/specificity_score/05-dependencies.md - Visualization dependencies
frontend/src/components/ - Existing frontend components
scripts/generate_uml.py - Current UML generation script
