Specificity Score System - Prompt/Conversation Templates
Overview
This document defines the conversation templates used for template-based specificity scoring. Each template represents a common question pattern about heritage institutions.
INTEGRATION NOTE: These "context templates" are derived from the existing SPARQL templates in data/sparql_templates.yaml. The SPARQL templates serve query generation; context templates serve class filtering for RAG retrieval.
SPARQL Template → Context Template Mapping
The existing SPARQL template system already classifies questions. We map SPARQL templates to context templates:
| SPARQL Template (existing) | Context Template | Trigger Condition |
|---|---|---|
| list_institutions_by_type_city | location_browse | Default |
| list_institutions_by_type_city | archive_search | If institution_type = A |
| list_institutions_by_type_city | museum_search | If institution_type = M |
| list_institutions_by_type_city | library_search | If institution_type = L |
| list_institutions_by_type_region | location_browse | Default |
| list_institutions_by_type_region | archive_search | If institution_type = A |
| list_institutions_by_type_region | museum_search | If institution_type = M |
| list_institutions_by_type_region | library_search | If institution_type = L |
| list_institutions_by_type_country | location_browse | Default |
| count_institutions_by_type_location | location_browse | Always |
| count_institutions_by_type | general_heritage | Always |
| find_institution_by_name | general_heritage | Always |
| list_all_institutions_in_city | location_browse | Always |
| find_institutions_by_founding_date | organizational_change | Always |
| find_institution_by_identifier | identifier_lookup | Always |
| compare_locations | location_browse | Always |
| find_custodians_by_budget_threshold | general_heritage | Always |
| none | general_heritage | Fallback |
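The mapping table can be read as a two-step lookup: apply an institution-type override where the table defines one, otherwise use the default row. The following is a minimal sketch of that reading, not the actual mapper implementation. The function name `map_to_context_template` and the way `institution_type` is passed in are assumptions made for illustration.

```python
from typing import Optional

# Default context template per SPARQL template (rows marked Default, Always, or Fallback).
DEFAULT_CONTEXT = {
    "list_institutions_by_type_city": "location_browse",
    "list_institutions_by_type_region": "location_browse",
    "list_institutions_by_type_country": "location_browse",
    "count_institutions_by_type_location": "location_browse",
    "count_institutions_by_type": "general_heritage",
    "find_institution_by_name": "general_heritage",
    "list_all_institutions_in_city": "location_browse",
    "find_institutions_by_founding_date": "organizational_change",
    "find_institution_by_identifier": "identifier_lookup",
    "compare_locations": "location_browse",
    "find_custodians_by_budget_threshold": "general_heritage",
    "none": "general_heritage",
}

# Only these SPARQL templates have institution_type rows in the table.
TYPE_SENSITIVE = {
    "list_institutions_by_type_city",
    "list_institutions_by_type_region",
}

# Institution type code -> context template.
TYPE_OVERRIDES = {
    "A": "archive_search",
    "M": "museum_search",
    "L": "library_search",
}


def map_to_context_template(
    sparql_template: str, institution_type: Optional[str] = None
) -> str:
    """Hypothetical helper: resolve a context template from the mapping table."""
    if sparql_template in TYPE_SENSITIVE and institution_type in TYPE_OVERRIDES:
        return TYPE_OVERRIDES[institution_type]
    # Unknown SPARQL templates are treated like "none" (an assumption).
    return DEFAULT_CONTEXT.get(sparql_template, "general_heritage")
```

Keeping the overrides separate from the defaults makes it explicit that list_institutions_by_type_country ignores the institution type, as the table specifies.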
Context Templates NOT Derived from SPARQL
These context templates have no direct SPARQL equivalent; they cover questions that require RAG rather than structured queries:
| Context Template | When Used |
|---|---|
| collection_discovery | Questions about collections, holdings, accessions |
| person_research | Questions about staff, directors, curators |
| digital_platform | Questions about websites, databases, APIs |
Template Catalog
1. archive_search - Archive Discovery
Description: Questions focused on finding or learning about archival institutions and their holdings.
Question Patterns:
- "What archives are in {location}?"
- "Which archives have {collection_type} collections?"
- "Find archives with records from {time_period}"
- "Are there any {archive_type} archives in {region}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Archive | 0.95 | Primary focus |
| RecordSet | 0.90 | Archival holdings |
| Fonds | 0.90 | Archival organization |
| FindingAid | 0.85 | Access to archives |
| Collection | 0.85 | Archive contents |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| Location | 0.70 | Where archives are |
| GHCID | 0.65 | Institution identifier |
| HeritageCustodian | 0.65 | Parent class |
| CustodianName | 0.60 | Institution naming |
| DigitalPlatform | 0.55 | Online access |
Low Relevance Classes (score < 0.5):
| Class | Score | Rationale |
|---|---|---|
| Museum | 0.20 | Different institution type |
| Gallery | 0.15 | Different institution type |
| PersonProfile | 0.30 | Not primary focus |
| LinkedInConnectionExtraction | 0.05 | Technical extraction |
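To illustrate how the three tables relate to a single score map, here is a small sketch that groups the archive_search scores into the High / Medium / Low bands used above. The names `relevance_band` and `group_by_band` are hypothetical, and a score of exactly 0.8 is assumed to count as high.

```python
from typing import Dict, List

# archive_search scores, copied from the tables above.
ARCHIVE_SEARCH_SCORES: Dict[str, float] = {
    "Archive": 0.95,
    "RecordSet": 0.90,
    "Fonds": 0.90,
    "FindingAid": 0.85,
    "Collection": 0.85,
    "Location": 0.70,
    "GHCID": 0.65,
    "HeritageCustodian": 0.65,
    "CustodianName": 0.60,
    "DigitalPlatform": 0.55,
    "Museum": 0.20,
    "Gallery": 0.15,
    "PersonProfile": 0.30,
    "LinkedInConnectionExtraction": 0.05,
}


def relevance_band(score: float) -> str:
    """Map a score to the band names used in the template tables."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def group_by_band(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group class names by band, highest score first within each band."""
    bands: Dict[str, List[str]] = {"high": [], "medium": [], "low": []}
    for class_name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        bands[relevance_band(score)].append(class_name)
    return bands
```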
2. museum_search - Museum Discovery
Description: Questions focused on finding museums, exhibitions, and museum collections.
Question Patterns:
- "What museums are in {location}?"
- "Find {museum_type} museums"
- "Which museums have {artifact_type} collections?"
- "Are there any open-air museums in {region}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Museum | 0.95 | Primary focus |
| Gallery | 0.85 | Often co-located |
| Exhibition | 0.85 | Museum activity |
| Collection | 0.85 | Museum holdings |
| Artifact | 0.80 | Museum objects |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| Location | 0.70 | Where museums are |
| GHCID | 0.65 | Institution identifier |
| HeritageCustodian | 0.65 | Parent class |
| CustodianName | 0.60 | Institution naming |
| OpeningHours | 0.55 | Visitor information |
Low Relevance Classes (score < 0.5):
| Class | Score | Rationale |
|---|---|---|
| Archive | 0.20 | Different institution type |
| RecordSet | 0.15 | Archival concept |
| PersonProfileExtraction | 0.05 | Technical extraction |
3. library_search - Library Discovery
Description: Questions about libraries, catalogs, and bibliographic collections.
Question Patterns:
- "What libraries are in {location}?"
- "Find {library_type} libraries"
- "Which libraries have {subject} collections?"
- "Are there any special collections libraries in {region}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Library | 0.95 | Primary focus |
| Catalog | 0.90 | Library resource |
| BibliographicCollection | 0.85 | Library holdings |
| Collection | 0.85 | Library contents |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| Location | 0.70 | Where libraries are |
| GHCID | 0.65 | Institution identifier |
| HeritageCustodian | 0.65 | Parent class |
| DigitalPlatform | 0.60 | Online catalogs |
| Identifier | 0.55 | ISIL codes |
4. collection_discovery - Collection Research
Description: Questions about collections, holdings, and what institutions contain.
Question Patterns:
- "What collections does {institution} have?"
- "Find institutions with {subject} collections"
- "Which archives have photo collections?"
- "What is in the {collection_name}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Collection | 0.95 | Primary focus |
| RecordSet | 0.85 | Collection type |
| Fonds | 0.85 | Archival collection |
| Accession | 0.80 | How items entered |
| Extent | 0.80 | Collection size |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| Archive | 0.75 | Collection holder |
| Museum | 0.75 | Collection holder |
| Library | 0.75 | Collection holder |
| HeritageCustodian | 0.70 | Collection holder |
| DigitalPlatform | 0.55 | Collection access |
5. person_research - People & Staff
Description: Questions about people associated with heritage institutions.
Question Patterns:
- "Who is the director of {institution}?"
- "What curators work at {institution}?"
- "Find archivists in {region}"
- "Who founded {institution}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| PersonProfile | 0.95 | Primary focus |
| Staff | 0.95 | Institution employees |
| Role | 0.90 | Person's position |
| Affiliation | 0.85 | Person-institution link |
| Director | 0.85 | Leadership role |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.70 | Where people work |
| CustodianName | 0.60 | Institution context |
| ContactPoint | 0.55 | How to reach people |
Low Relevance Classes (score < 0.5):
| Class | Score | Rationale |
|---|---|---|
| Collection | 0.30 | Not person-focused |
| Location | 0.35 | Secondary context |
| LinkedInConnectionExtraction | 0.40 | Source data (higher here) |
6. location_browse - Geographic Exploration
Description: Questions focused on location and geographic distribution.
Question Patterns:
- "Where is {institution} located?"
- "What heritage institutions are in {city}?"
- "Show me archives in {province}"
- "Find museums near {coordinates}"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Location | 0.95 | Primary focus |
| Address | 0.90 | Physical location |
| GeoCoordinates | 0.90 | Map positioning |
| City | 0.85 | Location level |
| Region | 0.85 | Location level |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.75 | What's at location |
| Archive | 0.65 | Institution type |
| Museum | 0.65 | Institution type |
| Library | 0.65 | Institution type |
| GHCID | 0.60 | Contains location code |
7. identifier_lookup - Identifier Search
Description: Questions about specific identifiers (ISIL, Wikidata, GHCID, etc.).
Question Patterns:
- "What is the ISIL code for {institution}?"
- "Find institution with Wikidata ID {id}"
- "Look up GHCID {ghcid}"
- "What VIAF ID does {institution} have?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| Identifier | 0.95 | Primary focus |
| GHCID | 0.95 | Heritage identifier |
| ISIL | 0.90 | Library/archive ID |
| WikidataIdentifier | 0.90 | Linked data ID |
| VIAF | 0.85 | Authority ID |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.70 | Identifier owner |
| CustodianName | 0.65 | Name resolution |
| ExternalLink | 0.60 | Related URLs |
8. organizational_change - History & Changes
Description: Questions about institutional history, mergers, closures, and changes.
Question Patterns:
- "When was {institution} founded?"
- "What happened to {institution} after {year}?"
- "Did {institution_a} merge with {institution_b}?"
- "Why did {institution} close?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| ChangeEvent | 0.95 | Primary focus |
| Founding | 0.90 | Origin event |
| Merger | 0.90 | Combination event |
| Closure | 0.90 | End event |
| GHCIDHistoryEntry | 0.85 | ID changes |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.75 | Subject of change |
| CustodianName | 0.70 | Name changes |
| Provenance | 0.65 | Data history |
| TemporalExtent | 0.60 | Time periods |
9. digital_platform - Online Resources
Description: Questions about digital platforms, websites, and online access.
Question Patterns:
- "What is the website for {institution}?"
- "Does {institution} have an online catalog?"
- "Find archives with SPARQL endpoints"
- "Which institutions use {platform_name}?"
High Relevance Classes (score ≥ 0.8):
| Class | Score | Rationale |
|---|---|---|
| DigitalPlatform | 0.95 | Primary focus |
| Website | 0.90 | Online presence |
| API | 0.85 | Technical access |
| OnlineCatalog | 0.85 | Digital collection |
| DiscoveryPortal | 0.80 | Aggregator |
Medium Relevance Classes (score 0.5-0.8):
| Class | Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.70 | Platform owner |
| Collection | 0.60 | Online content |
| ContactPoint | 0.55 | Digital contact |
10. general_heritage - General Questions
Description: Broad questions without specific focus. Used as fallback.
Question Patterns:
- "Tell me about heritage in {region}"
- "What cultural institutions exist?"
- "How is heritage preserved?"
- General exploratory questions
Scoring Strategy: Use general specificity scores (not template-specific). All classes get their default context-free scores.
Threshold: Lower threshold (0.3) to be more inclusive for exploratory queries.
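As a rough sketch of how the lower threshold changes which classes are kept, the snippet below filters a score map by a per-template threshold. Only the 0.3 value for general_heritage comes from this document; the default threshold of 0.5, the function name `select_classes`, and the sample scores are assumptions for illustration.

```python
from typing import Dict, List

DEFAULT_THRESHOLD = 0.5  # assumption: not specified in this document
TEMPLATE_THRESHOLDS = {"general_heritage": 0.3}  # from the text above


def select_classes(scores: Dict[str, float], template_id: str) -> List[str]:
    """Return the class names whose score meets the template's threshold."""
    threshold = TEMPLATE_THRESHOLDS.get(template_id, DEFAULT_THRESHOLD)
    return sorted(name for name, score in scores.items() if score >= threshold)


# Hypothetical general (context-free) scores, for illustration only.
sample_scores = {
    "HeritageCustodian": 0.60,
    "Provenance": 0.40,
    "LinkedInConnectionExtraction": 0.05,
}

inclusive = select_classes(sample_scores, "general_heritage")
strict = select_classes(sample_scores, "archive_search")
```

With these sample values, the general_heritage threshold keeps Provenance while the assumed default threshold drops it.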
Template Selection Algorithm
import re


def select_template(question: str, classifier_result: dict) -> str:
    """Select the most appropriate template for a question."""
    template_id = classifier_result["template_id"]
    confidence = classifier_result["confidence"]

    # High confidence: use the classified template as-is
    if confidence >= 0.8:
        return template_id

    # Medium confidence: require keyword confirmation
    if confidence >= 0.5 and confirms_template(question, template_id):
        return template_id

    # Low confidence or unconfirmed: fall back to general
    return "general_heritage"


def confirms_template(question: str, template_id: str) -> bool:
    """Check if question keywords confirm the template."""
    confirmations = {
        "archive_search": ["archive", "records", "fonds", "finding aid"],
        "museum_search": ["museum", "exhibit", "gallery", "artifact"],
        "library_search": ["library", "catalog", "book", "bibliographic"],
        "collection_discovery": ["collection", "holding", "accession"],
        "person_research": ["who", "director", "curator", "staff", "archivist"],
        "location_browse": ["where", "located", "address", "city", "region"],
        "identifier_lookup": ["isil", "wikidata", "ghcid", "viaf", "code", "id"],
        "organizational_change": ["when", "founded", "merged", "closed", "history"],
        "digital_platform": ["website", "online", "api", "portal", "digital"],
    }
    keywords = confirmations.get(template_id, [])
    question_lower = question.lower()
    # Match at word starts so short keywords such as "id" do not fire inside "did"
    return any(re.search(r"\b" + re.escape(kw), question_lower) for kw in keywords)
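Keyword confirmation is sensitive to how keywords are matched. The sketch below compares plain substring containment with matching at the start of a word; both helper names are hypothetical. Short keywords such as "id" are the main risk with substring matching, because they also occur inside unrelated words.

```python
import re


def contains_substring(question: str, keyword: str) -> bool:
    """Plain substring test: matches the keyword anywhere, even inside a word."""
    return keyword in question.lower()


def contains_word_start(question: str, keyword: str) -> bool:
    """Match the keyword only where a word begins (plural forms still match)."""
    return re.search(r"\b" + re.escape(keyword), question.lower()) is not None


merger_question = "Did the two museums merge?"
archive_question = "Which archives hold parish records?"

# "id" is a substring of "did", so the substring test would confirm
# identifier_lookup for a question that is really about a merger.
false_positive = contains_substring(merger_question, "id")
fixed = contains_word_start(merger_question, "id")

# Word-start matching still accepts "archives" for the keyword "archive".
plural_ok = contains_word_start(archive_question, "archive")
```

Word-start matching is one possible compromise: it is stricter than substring matching but does not require listing every inflected form of a keyword.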
Score Assignment Guidelines
General Principles
- Primary Focus Classes: Score 0.90-0.95
  - Classes directly answering the template's question type
- Supporting Classes: Score 0.70-0.85
  - Classes providing essential context
- Contextual Classes: Score 0.50-0.70
  - Classes that may be useful but not central
- Low Relevance Classes: Score 0.20-0.45
  - Classes rarely needed for this template
- Irrelevant Classes: Score 0.05-0.15
  - Technical/extraction classes not useful for answers
Cross-Template Consistency
Some classes should stay at or above a minimum score in every template, so that their relative standing is consistent across templates:
| Class | Minimum Score | Rationale |
|---|---|---|
| HeritageCustodian | 0.60 | Always relevant as base class |
| CustodianName | 0.55 | Names needed everywhere |
| Location | 0.50 | Geography usually relevant |
| GHCID | 0.50 | Identifiers useful for linking |
| Provenance | 0.40 | Data quality context |
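The minimum scores in this table lend themselves to an automated check. The following is a minimal sketch, assuming each template's scores are available as a plain dictionary and that classes a template does not list are simply skipped. The function name `find_minimum_violations` is hypothetical.

```python
from typing import Dict, List, Tuple

# Minimum scores from the cross-template consistency table.
MINIMUM_SCORES: Dict[str, float] = {
    "HeritageCustodian": 0.60,
    "CustodianName": 0.55,
    "Location": 0.50,
    "GHCID": 0.50,
    "Provenance": 0.40,
}


def find_minimum_violations(
    template_scores: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Return (class, score, minimum) for every listed class below its minimum."""
    violations = []
    for class_name, minimum in MINIMUM_SCORES.items():
        score = template_scores.get(class_name)
        # Classes the template does not list are skipped (an assumption).
        if score is not None and score < minimum:
            violations.append((class_name, score, minimum))
    return violations


# Scores for the universal classes as listed under person_research.
person_research_scores = {
    "HeritageCustodian": 0.70,
    "CustodianName": 0.60,
    "Location": 0.35,
}

violations = find_minimum_violations(person_research_scores)
```

Run against the person_research values in this document, the check reports Location (0.35 against a minimum of 0.50), which is the kind of discrepancy it is meant to surface for review.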
Template-Specific Adjustments
When a class is the primary focus of a template, boost its general score by +0.30 (max 0.95). When a class is a competing institution type for the template, reduce it by 0.40 (min 0.10):
def calculate_template_score(class_name: str, template_id: str, general_score: float) -> float:
    """Calculate template-specific score from general score."""
    primary_classes = {
        "archive_search": ["Archive", "RecordSet", "Fonds"],
        "museum_search": ["Museum", "Gallery", "Exhibition"],
        "library_search": ["Library", "Catalog"],
        "person_research": ["PersonProfile", "Staff", "Role"],
        "location_browse": ["Location", "Address", "GeoCoordinates"],
        "identifier_lookup": ["Identifier", "GHCID", "ISIL"],
        "organizational_change": ["ChangeEvent", "Founding", "Merger"],
        "digital_platform": ["DigitalPlatform", "Website", "API"],
    }

    # Primary focus: boost, capped at 0.95
    if class_name in primary_classes.get(template_id, []):
        return min(0.95, general_score + 0.30)

    # Anti-correlation (wrong institution type): penalize, floored at 0.10.
    # is_competing_type is not defined in this document and must be provided elsewhere.
    if is_competing_type(class_name, template_id):
        return max(0.10, general_score - 0.40)

    # Everything else keeps its general score
    return general_score
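The helper is_competing_type is referenced above but not defined in this document. One possible reading, offered only as a sketch, is that a class competes with a template when it is an institution-type class other than the type(s) the template is about. The two lookup tables below are assumptions inferred from the "Different institution type" rationales in the template catalog.

```python
# Assumed set of institution-type classes (from the catalog rationales).
INSTITUTION_TYPE_CLASSES = {"Archive", "Museum", "Library", "Gallery"}

# Assumed institution types each type-specific template is about.
# Gallery is grouped with Museum because museum_search lists it as high relevance.
TEMPLATE_OWN_TYPES = {
    "archive_search": {"Archive"},
    "museum_search": {"Museum", "Gallery"},
    "library_search": {"Library"},
}


def is_competing_type(class_name: str, template_id: str) -> bool:
    """Hypothetical sketch: True if class_name is a different institution type."""
    own_types = TEMPLATE_OWN_TYPES.get(template_id)
    if own_types is None:
        # Templates not tied to one institution type have no competitors.
        return False
    return class_name in INSTITUTION_TYPE_CLASSES and class_name not in own_types
```

Under this reading, a hypothetical general score of 0.60 for Museum would be reduced by 0.40 to roughly 0.20 for archive_search, which is in line with the archive_search table, while location_browse leaves all institution types unpenalized.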
Validation Checklist
For each template definition, verify:
- At least 3 question patterns defined
- High relevance classes identified (≥3 classes with score ≥0.8)
- Low relevance classes identified (≥3 classes with score <0.3)
- Scores form a reasonable distribution (not all high or all low)
- Cross-template consistency maintained for universal classes
- Competing institution types have reduced scores
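The three quantitative items in this checklist can be checked mechanically; the remaining items need cross-template data or human judgment. The sketch below assumes a template definition is a dictionary with `question_patterns` and `scores` keys. That layout and the function name `validate_template` are hypothetical, not the schema used by the project.

```python
from typing import Dict, List


def validate_template(definition: Dict) -> List[str]:
    """Return a list of checklist problems; an empty list means all checks pass."""
    problems: List[str] = []
    patterns = definition.get("question_patterns", [])
    scores: Dict[str, float] = definition.get("scores", {})

    if len(patterns) < 3:
        problems.append("fewer than 3 question patterns")

    high = [name for name, score in scores.items() if score >= 0.8]
    if len(high) < 3:
        problems.append("fewer than 3 high relevance classes (score >= 0.8)")

    low = [name for name, score in scores.items() if score < 0.3]
    if len(low) < 3:
        problems.append("fewer than 3 low relevance classes (score < 0.3)")

    return problems


# Example input built from the archive_search entry in this document.
archive_search_definition = {
    "question_patterns": [
        "What archives are in {location}?",
        "Which archives have {collection_type} collections?",
        "Find archives with records from {time_period}",
        "Are there any {archive_type} archives in {region}?",
    ],
    "scores": {
        "Archive": 0.95, "RecordSet": 0.90, "Fonds": 0.90,
        "FindingAid": 0.85, "Collection": 0.85,
        "Museum": 0.20, "Gallery": 0.15, "PersonProfile": 0.30,
        "LinkedInConnectionExtraction": 0.05,
    },
}

archive_problems = validate_template(archive_search_definition)
```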
References
- docs/plan/prompt-query_template_mapping/competency-questions.md
- docs/plan/prompt-query_template_mapping/templates-schema.md
- schemas/20251121/linkml/modules/classes/ - All class definitions