# Rule: Specificity Score Convention for LinkML Schema Annotations

**Version**: 1.0.0
**Created**: 2025-01-04
**Status**: Active
**Applies to**: `schemas/20251121/linkml/modules/classes/*.yaml`

---

## Rule Statement

Every class in the Heritage Custodian Ontology MUST have specificity score annotations to enable intelligent filtering for RAG retrieval and UML visualization.

---

## Annotation Schema

### Required Annotations

Every class YAML file MUST include these annotations:

```yaml
classes:
  ClassName:
    annotations:
      specificity_score: 0.75  # Required: General specificity (0.0-1.0)
      specificity_rationale: "..."  # Required: Why this score was assigned
```

### Optional Annotations

Template-specific scores for context-aware filtering:

```yaml
classes:
  ClassName:
    annotations:
      specificity_score: 0.75
      specificity_rationale: "..."
      template_specificity:  # Optional: Template-specific scores
        archive_search: 0.95
        museum_search: 0.20
        person_research: 0.30
```

---

## Score Semantics

### General Specificity Score

The `specificity_score` measures how **context-dependent** a class is:

| Score Range | Meaning | Example Classes |
|-------------|---------|-----------------|
| 0.00-0.20 | **Universal** - relevant in almost all contexts | `HeritageCustodian`, `CustodianName`, `Location` |
| 0.20-0.40 | **Broadly useful** - relevant in most contexts | `Collection`, `Identifier`, `GHCID` |
| 0.40-0.60 | **Moderately specific** - relevant in several contexts | `ChangeEvent`, `PersonProfile`, `DigitalPlatform` |
| 0.60-0.80 | **Fairly specific** - relevant in limited contexts | `Archive`, `Museum`, `Library`, `FindingAid` |
| 0.80-1.00 | **Highly specific** - relevant only in specialized contexts | `LinkedInConnectionExtraction`, `GHCIDHistoryEntry` |

**Key Insight**: Lower scores = MORE generally relevant (always useful in RAG); Higher scores = MORE specific (only useful in specialized queries).

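
The band table and the key insight above can be sketched in code. The snippet below is a minimal illustration, not part of the rule: `specificity_band` and `filter_classes` are hypothetical helper names, the threshold-based filter is only one possible way a consumer might use the score, and the handling of boundary values is an assumption (the table's ranges share their endpoints, so a score such as 0.20 is placed in the higher band here).

```python
# Bands from the table above as (upper_bound, label) pairs.
# Assumption: a score on a shared boundary (e.g. 0.20) falls in the higher band.
BANDS = [
    (0.20, "Universal"),
    (0.40, "Broadly useful"),
    (0.60, "Moderately specific"),
    (0.80, "Fairly specific"),
    (1.00, "Highly specific"),
]


def specificity_band(score: float) -> str:
    """Return the band label for a specificity score in [0.0, 1.0]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score {score} out of range [0.0, 1.0]")
    for upper, label in BANDS:
        if score < upper:
            return label
    return BANDS[-1][1]  # score == 1.0 belongs to the last band


def filter_classes(scores: dict[str, float], max_specificity: float) -> list[str]:
    """Hypothetical filter: keep classes at or below a specificity threshold.

    Results are ordered from most general (lowest score) to most specific.
    """
    kept = [(score, name) for name, score in scores.items() if score <= max_specificity]
    return [name for score, name in sorted(kept)]


# Scores taken from the examples in this rule.
example_scores = {
    "HeritageCustodian": 0.15,
    "Archive": 0.70,
    "LinkedInConnectionExtraction": 0.95,
}

print(specificity_band(example_scores["Archive"]))
print(filter_classes(example_scores, max_specificity=0.70))
```

With a threshold of 0.70, the technical class `LinkedInConnectionExtraction` is dropped while the base class and `Archive` are kept, which matches the intent that higher scores surface only in specialized queries.
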

---

### Template Specificity Scores

The `template_specificity` maps class relevance to 10 conversation templates:

| Template ID | Focus Area | Example High-Score Classes |
|-------------|------------|---------------------------|
| `archive_search` | Archives and archival holdings | `Archive`, `RecordSet`, `Fonds` |
| `museum_search` | Museums and exhibitions | `Museum`, `Gallery`, `Exhibition` |
| `library_search` | Libraries and catalogs | `Library`, `Catalog`, `BibliographicCollection` |
| `collection_discovery` | Collections and holdings | `Collection`, `Accession`, `Extent` |
| `person_research` | People and staff | `PersonProfile`, `Staff`, `Role` |
| `location_browse` | Geographic information | `Location`, `Address`, `GeoCoordinates` |
| `identifier_lookup` | Identifiers (ISIL, Wikidata) | `Identifier`, `GHCID`, `ISIL` |
| `organizational_change` | History and changes | `ChangeEvent`, `Founding`, `Merger` |
| `digital_platform` | Online resources | `DigitalPlatform`, `Website`, `API` |
| `general_heritage` | Fallback/general | Uses `specificity_score` directly |

---

## Examples

### Example 1: Universal Class (Low Specificity)

```yaml
# modules/classes/HeritageCustodian.yaml
classes:
  HeritageCustodian:
    description: >-
      Base class for all heritage custodian institutions.
    annotations:
      specificity_score: 0.15
      specificity_rationale: >-
        Universal base class relevant in virtually all heritage contexts.
        Every query about heritage institutions implicitly involves this class.
      template_specificity:
        archive_search: 0.65
        museum_search: 0.65
        library_search: 0.65
        collection_discovery: 0.70
        person_research: 0.70
        location_browse: 0.75
        identifier_lookup: 0.70
        organizational_change: 0.75
        digital_platform: 0.70
        general_heritage: 0.15
```

### Example 2: Domain-Specific Class (High Specificity)

```yaml
# modules/classes/Archive.yaml
classes:
  Archive:
    is_a: HeritageCustodian
    description: >-
      An archive institution holding historical records and documents.
    annotations:
      specificity_score: 0.70
      specificity_rationale: >-
        Domain-specific institution type. Highly relevant for archival research
        but not needed for museum or library queries.
      template_specificity:
        archive_search: 0.95
        museum_search: 0.20
        library_search: 0.25
        collection_discovery: 0.75
        person_research: 0.40
        location_browse: 0.65
        identifier_lookup: 0.50
        organizational_change: 0.60
        digital_platform: 0.45
        general_heritage: 0.70
```

### Example 3: Technical Class (Very High Specificity)

```yaml
# modules/classes/LinkedInConnectionExtraction.yaml
classes:
  LinkedInConnectionExtraction:
    description: >-
      Technical class for extracting LinkedIn connection data.
    annotations:
      specificity_score: 0.95
      specificity_rationale: >-
        Internal extraction class with no semantic significance for end users.
        Only relevant when specifically researching data extraction processes.
      template_specificity:
        archive_search: 0.05
        museum_search: 0.05
        library_search: 0.05
        collection_discovery: 0.05
        person_research: 0.40
        location_browse: 0.05
        identifier_lookup: 0.10
        organizational_change: 0.05
        digital_platform: 0.15
        general_heritage: 0.95
```

---

## Score Assignment Guidelines

### Factors That LOWER Specificity Score

| Factor | Impact | Example |
|--------|--------|---------|
| Base/parent class | -0.20 to -0.30 | `HeritageCustodian` is parent of all |
| Used in identifiers | -0.10 to -0.15 | `CustodianName` used in GHCID |
| Geographic component | -0.10 to -0.15 | `Location` needed for all institutions |
| Universal attribute | -0.10 to -0.15 | `Provenance` applies to all data |

### Factors That RAISE Specificity Score

| Factor | Impact | Example |
|--------|--------|---------|
| Institution type | +0.30 to +0.40 | `Archive`, `Museum`, `Library` |
| Technical/extraction | +0.30 to +0.40 | `LinkedInConnectionExtraction` |
| Event subtype | +0.20 to +0.30 | `Merger`, `Founding`, `Closure` |
| Domain terminology | +0.15 to +0.25 | `Fonds`, `FindingAid`, `RecordSet` |

### Cross-Class Consistency Rules

1. **Inheritance**: Child classes should have equal or higher specificity than parents
2. **Siblings**: Classes at the same hierarchy level should have similar base scores
3. **Competing types**: Institution types should reduce each other's template scores

```yaml
# CORRECT: Archive (0.70) inherits from HeritageCustodian (0.15)
Archive:
  is_a: HeritageCustodian  # Parent: 0.15
  annotations:
    specificity_score: 0.70  # Child: 0.70 >= 0.15 ✓

# WRONG: Child less specific than parent
Archive:
  is_a: HeritageCustodian  # Parent: 0.15
  annotations:
    specificity_score: 0.10  # Child: 0.10 < 0.15 ✗
```

---

## Validation Rules

### Required Validations

1. **Range Check**: `0.0 <= specificity_score <= 1.0`
2. **Rationale Present**: `specificity_rationale` must not be empty
3. **Inheritance Consistency**: Child score >= parent score
4. **Template Score Range**: All template scores must be 0.0-1.0

### Recommended Validations

1. **No Orphan Scores**: Every class should have annotations (warn if missing)
2. **Score Distribution**: Flag if >50% of classes have the same score (lack of differentiation)
3. **Template Coverage**: Warn if `template_specificity` omits common templates

### Validation Script

```python
# scripts/validate_specificity_scores.py
import sys
from pathlib import Path

from linkml_runtime import SchemaView

REQUIRED_TEMPLATES = [
    "archive_search", "museum_search", "library_search",
    "collection_discovery", "person_research", "location_browse",
    "identifier_lookup", "organizational_change", "digital_platform",
    "general_heritage"
]


def validate_specificity_scores(schema_path: Path) -> list[str]:
    """Validate all specificity score annotations."""
    errors = []
    schema = SchemaView(str(schema_path))

    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)

        # Check required annotations
        score = cls.annotations.get("specificity_score")
        rationale = cls.annotations.get("specificity_rationale")

        if score is None:
            errors.append(f"{class_name}: Missing specificity_score")
            continue

        # Validate score range; score_val stays None if the value is not numeric
        score_val = None
        try:
            score_val = float(score.value)
            if not 0.0 <= score_val <= 1.0:
                errors.append(f"{class_name}: Score {score_val} out of range [0.0, 1.0]")
        except (ValueError, TypeError):
            errors.append(f"{class_name}: Invalid score value: {score.value}")

        # Check rationale (a null value counts as empty)
        if rationale is None or not str(rationale.value or "").strip():
            errors.append(f"{class_name}: Missing or empty specificity_rationale")

        # Check inheritance consistency (skipped when the child score is not numeric)
        if cls.is_a and score_val is not None:
            parent = schema.get_class(cls.is_a)
            parent_score = parent.annotations.get("specificity_score")
            try:
                parent_val = float(parent_score.value) if parent_score else None
            except (ValueError, TypeError):
                parent_val = None  # reported when the parent itself is validated
            if parent_val is not None and score_val < parent_val:
                errors.append(
                    f"{class_name}: Score {score.value} < parent {cls.is_a} score {parent_score.value}"
                )

    return errors


if __name__ == "__main__":
    schema_path = Path("schemas/20251121/linkml/01_custodian_name.yaml")
    errors = validate_specificity_scores(schema_path)

    if errors:
        print("Validation errors:")
        for error in errors:
            print(f"  - {error}")
        sys.exit(1)
    else:
        print("All specificity scores valid!")
        sys.exit(0)
```

---

## Anti-Patterns

### What NOT to Do

| Anti-Pattern | Why It's Wrong | Correct Approach |
|--------------|----------------|------------------|
| Score without rationale | No audit trail for decisions | Always include rationale |
| All scores = 0.5 | No differentiation, useless for filtering | Differentiate based on semantics |
| Child < parent score | Violates specificity inheritance | Child should be equal or more specific |
| Template score > 1.0 | Invalid score value | Keep all scores in [0.0, 1.0] |
| Empty rationale | Fails validation, no documentation | Write meaningful rationale |

### Example of Incorrect Annotation

```yaml
# WRONG - Multiple issues
classes:
  Archive:
    annotations:
      specificity_score: 1.5  # Out of range!
      specificity_rationale: ""  # Empty rationale!
      template_specificity:
        archive_search: 0.95
        # Missing other templates - incomplete coverage
```

### Example of Correct Annotation

```yaml
# CORRECT
classes:
  Archive:
    annotations:
      specificity_score: 0.70
      specificity_rationale: >-
        Domain-specific institution type for archives. Highly relevant for
        archival research queries but less useful for museum or library-focused
        questions.
      template_specificity:
        archive_search: 0.95
        museum_search: 0.20
        library_search: 0.25
        collection_discovery: 0.75
        person_research: 0.40
        location_browse: 0.65
        identifier_lookup: 0.50
        organizational_change: 0.60
        digital_platform: 0.45
        general_heritage: 0.70
```

---

## Migration Checklist

When adding specificity scores to existing classes:

### Phase 1: Assessment

- [ ] Count classes without annotations
- [ ] Identify class hierarchy (parents → children order)
- [ ] Review existing descriptions for scoring hints

### Phase 2: Annotation

- [ ] Start with root classes (lowest specificity)
- [ ] Work down hierarchy (increasing specificity)
- [ ] Assign template scores based on domain alignment
- [ ] Write rationale explaining score decisions

### Phase 3: Validation

- [ ] Run validation script
- [ ] Check inheritance consistency
- [ ] Verify score distribution (not all same value)
- [ ] Review edge cases (technical classes, mixins)

### Phase 4: Documentation

- [ ] Update class count in plan documents
- [ ] Document any scoring decisions that were difficult
- [ ] Create PR with all changes

---

## Related Rules

- **Rule 0**: LinkML Schemas Are the Single Source of Truth
- **Rule 4**: Technical Classes Are Excluded from Visualizations
- **Rule 13**: Custodian Type Annotations on LinkML Schema Elements

---

## References

- `docs/plan/specificity_score/README.md` - System overview
- `docs/plan/specificity_score/04-prompt-conversation-templates.md` - Template definitions
- `docs/plan/specificity_score/06-uml-visualization.md` - UML filtering integration

---

## Changelog

| Date | Version | Change |
|------|---------|--------|
| 2025-01-04 | 1.0.0 | Initial rule created for specificity score system |