# Hybrid GLiNER2 + LLM Annotator Architecture

This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.

## Overview

The hybrid annotator addresses a fundamental trade-off in NLP annotation:

| Approach | Speed | Accuracy | Relationships | Domain Knowledge |
|----------|-------|----------|---------------|------------------|
| GLiNER2 (encoder) | ~100x faster | Good recall | Limited | Generic |
| LLM (decoder) | Slower | High precision | Excellent | Rich |
| **Hybrid** | Fast + thorough | Best of both | Full support | Domain-aware |

## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │    INPUT    │    │   STAGE 1   │    │   STAGE 2   │    │   STAGE 3   │   │
│  │    TEXT     │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │   │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                 AnnotationCandidate  AnnotationCandidate    EntityClaim     │
│                      (DETECTED)           (REFINED)         (VALIDATED)     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Stage 1: Fast-Pass (GLiNER2)

**Purpose**: High-recall entity mention detection at roughly 100x the speed of LLM-only annotation.

**Technology**: GLiNER2 encoder model (`urchade/gliner_multi-v2.1`)

**Input**: Raw text document

**Output**: `List[AnnotationCandidate]` with status `DETECTED`

### Process

1. Tokenize the input text
2. Run GLiNER2 span prediction with a configurable threshold (default 0.5)
3. Map GLiNER2 generic labels to GLAM-NER hyponyms using `GLINER2_TO_GLAM_MAPPING`
4. Create an `AnnotationCandidate` for each detected span
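Steps 2-4 above can be sketched as follows. The span-dict shape (`text`, `label`, `start`, `end`, `score`) mirrors what GLiNER-style span predictors return, and `Candidate` is a simplified stand-in for the real `AnnotationCandidate`; both are illustrative assumptions, not the module's actual code.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Excerpt of the Stage 1 label mapping (see GLINER2_TO_GLAM_MAPPING below)
GLINER2_TO_GLAM_MAPPING = {"person": "AGT.PER", "museum": "GRP.HER", "city": "TOP.SET"}

@dataclass
class Candidate:  # simplified stand-in for AnnotationCandidate
    candidate_id: str
    text: str
    start_offset: int
    end_offset: int
    hyponym: Optional[str]
    detection_confidence: float
    status: str = "DETECTED"

def fast_pass(spans: list[dict], threshold: float = 0.5) -> list[Candidate]:
    """Map raw GLiNER2 spans to DETECTED candidates (steps 2-4)."""
    candidates = []
    for span in spans:
        if span["score"] < threshold:
            continue  # below the detection confidence threshold
        candidates.append(Candidate(
            candidate_id=str(uuid.uuid4()),
            text=span["text"],
            start_offset=span["start"],
            end_offset=span["end"],
            hyponym=GLINER2_TO_GLAM_MAPPING.get(span["label"]),  # None if unmapped
            detection_confidence=span["score"],
        ))
    return candidates

spans = [
    {"text": "Rijksmuseum", "label": "museum", "start": 4, "end": 15, "score": 0.83},
    {"text": "Amsterdam", "label": "city", "start": 19, "end": 28, "score": 0.41},
]
result = fast_pass(spans)  # the 0.41 span falls below the default threshold
```

Note that unmapped labels simply yield `hyponym=None` here; the real pipeline may handle them differently.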
### GLiNER2 to GLAM-NER Type Mapping

```python
GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",
    # Organization types
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",
    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",
    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",
    # ... (see hybrid_annotator.py for the complete mapping)
}
```

### Configuration

```python
HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                      # Detection confidence threshold
    gliner_entity_labels=None,                 # Custom labels (or use defaults)
    gliner_device="cpu",                       # Device (cpu/cuda)
    enable_fast_pass=True,                     # Enable/disable this stage
)
```

## Stage 2: Refinement (LLM)

**Purpose**: Entity type refinement, relationship extraction, and domain knowledge injection.

**Technology**: Z.AI GLM-4 (default), Claude, or GPT-4

**Input**: Original text + `List[AnnotationCandidate]` from Stage 1

**Output**: Refined `List[AnnotationCandidate]` + `List[RelationshipCandidate]`

### Process

1. Construct a prompt containing:
   - the original text
   - the GLiNER2 candidate spans (as hints)
   - the GLAM-NER type definitions
   - relationship extraction instructions
2. The LLM performs:
   - **Type refinement**: upgrade generic types (e.g., `GRP` → `GRP.HER`)
   - **New entity detection**: find entities GLiNER2 missed
   - **Relationship extraction**: identify semantic relationships
   - **Entity linking hints**: suggest Wikidata/VIAF IDs
   - **Temporal/spatial scoping**: add context
3. Parse the LLM response and update the candidates
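A minimal sketch of step 1, the prompt construction. The function name, the candidate dict shape, and the prompt wording are illustrative assumptions, not the annotator's actual template.

```python
def build_refinement_prompt(text: str, candidates: list[dict], type_defs: str) -> str:
    """Assemble a Stage 2 prompt from the text, GLiNER2 candidate hints,
    GLAM-NER type definitions, and relationship-extraction instructions."""
    hints = "\n".join(
        f'- "{c["text"]}" ({c["hyponym"]}, chars {c["start"]}-{c["end"]})'
        for c in candidates
    )
    return (
        "Refine the entity candidates below and extract relationships.\n\n"
        f"TEXT:\n{text}\n\n"
        f"CANDIDATE SPANS (from GLiNER2; may be incomplete or too generic):\n{hints}\n\n"
        f"GLAM-NER TYPE DEFINITIONS:\n{type_defs}\n\n"
        "Return JSON with refined types, any new entities, and relationships."
    )

prompt = build_refinement_prompt(
    "The Rijksmuseum in Amsterdam, founded in 1800.",
    [{"text": "Rijksmuseum", "hyponym": "GRP", "start": 4, "end": 15}],
    "GRP.HER = heritage institution (museum, library, archive)",
)
```

Passing the candidates as explicit hints is what lets the LLM upgrade a generic `GRP` to `GRP.HER` rather than re-detecting every entity from scratch.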
### Relationship Extraction

The LLM extracts relationships following the GLAM-NER relationship hyponyms:

```python
RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2",
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)
```

### Configuration

```python
HybridConfig(
    llm_model="glm-4",          # Model name (auto-detects provider)
    llm_api_key=None,           # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,     # Enable/disable this stage
    enable_relationships=True,  # Extract relationships
)
```

## Stage 3: Validation (Cross-Check)

**Purpose**: Cross-validate outputs, detect hallucinations, and ensure consistency.

**Input**: Candidates from both Stage 1 and Stage 2

**Output**: Final `List[EntityClaim]` with status `VALIDATED` or `REJECTED`

### Process

1. **Merge candidates** from GLiNER2 and the LLM:
   - match by span overlap (configurable threshold)
   - prefer LLM types on conflict (configurable)
   - create `MERGED` candidates from both sources
2. **Hallucination detection**:
   - verify that LLM-only entities exist in the source text
   - check for fabricated relationships
   - flag suspicious confidence scores
3. **Consistency checking**:
   - validate relationship domain/range constraints
   - check temporal coherence
   - verify entity type compatibility
4. **Final filtering**:
   - apply the minimum confidence threshold
   - remove rejected candidates (or keep them with a flag)
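The grounding check in step 2 can be sketched like this. The dict shape mirrors the `AnnotationCandidate` offset fields, and the fallback substring search is an assumed heuristic, not necessarily the pipeline's exact rule.

```python
def check_hallucination(candidate: dict, source_text: str) -> bool:
    """Return True if an LLM-proposed entity is grounded in the source text:
    either its claimed offsets match exactly, or its text occurs somewhere."""
    start, end = candidate["start_offset"], candidate["end_offset"]
    if source_text[start:end] == candidate["text"]:
        return True  # offsets line up exactly with the source
    # Fall back to a plain substring search: the LLM may have shifted offsets
    return candidate["text"] in source_text

text = "The Rijksmuseum in Amsterdam, founded in 1800."
grounded = check_hallucination(
    {"text": "Rijksmuseum", "start_offset": 4, "end_offset": 15}, text
)
fabricated = check_hallucination(
    {"text": "The Louvre", "start_offset": 0, "end_offset": 10}, text
)
```

Candidates that fail a check like this would be marked `REJECTED` rather than silently dropped, so the final filtering step can still report them when `include_rejected` is set.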
### Merge Strategy

```
GLiNER2 Candidate: "Van Gogh"         (AGT.PER, confidence=0.7)
LLM Candidate:     "Vincent van Gogh" (AGT.PER, confidence=0.95)

Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
```

### Configuration

```python
HybridConfig(
    enable_validation=True,       # Enable/disable this stage
    merge_threshold=0.3,          # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,  # LLM types take precedence
    minimum_confidence=0.3,       # Filter low-confidence results
    include_rejected=False,       # Include rejected in output
)
```

## Data Structures

### AnnotationCandidate

Shared intermediate representation used across all pipeline stages:

```python
@dataclass
class AnnotationCandidate:
    candidate_id: str        # Unique identifier
    text: str                # Extracted text span
    start_offset: int        # Character start position
    end_offset: int          # Character end position
    hypernym: Optional[str]  # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]   # Fine-grained type (AGT.PER, GRP.HER)

    # Confidence scores
    detection_confidence: float       # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float         # Combined confidence

    # Source tracking
    source: CandidateSource  # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus  # DETECTED, REFINED, VALIDATED, REJECTED

    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]

    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]

    # Provenance
    provenance: Optional[Provenance]
```

### RelationshipCandidate

Intermediate representation for relationships:

```python
@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label
    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str
    object_id: str
    object_text: str
    object_type: str
    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"
    confidence: float
    is_valid: bool                 # Domain/range validation result
```

### HybridAnnotationResult

Final output structure:

```python
@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str

    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False

    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0
```

## Usage

### Basic Usage

```python
from glam_extractor.annotators import HybridAnnotator, HybridConfig

# Default configuration
annotator = HybridAnnotator()

# Annotate text
result = await annotator.annotate("""
The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
Vincent van Gogh's works are among the most famous in the collection.
""")

# Access results
for entity in result.entities:
    print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")

for rel in result.relationships:
    print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")
```

### Custom Configuration

```python
config = HybridConfig(
    # Use a smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,

    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",

    # Disable relationship extraction for speed
    enable_relationships=False,

    # Stricter filtering
    minimum_confidence=0.5,
)

annotator = HybridAnnotator(config=config)
```

### GLiNER2-Only Mode

For maximum speed when relationships aren't needed:

```python
config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip the LLM
    enable_validation=True,
)
annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```

### LLM-Only Mode

When GLiNER2 isn't available, or for maximum accuracy:

```python
config = HybridConfig(
    enable_fast_pass=False,   # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,  # No cross-validation without GLiNER2
)
annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```

## Performance Characteristics

| Configuration | Speed | Entity Recall | Entity Precision | Relationships |
|---------------|-------|---------------|------------------|---------------|
| GLiNER2-only | ~100x | High | Medium | None |
| LLM-only | 1x | Medium | High | Full |
| Hybrid (default) | ~10x | High | High | Full |
| Hybrid (no-rel) | ~20x | High | High | None |

## Dependencies

### Required

- Python 3.10+
- `dataclasses`
- `typing`

### Optional

- `gliner` - for the GLiNER2 fast-pass (gracefully degrades if not installed)
- `httpx` or `aiohttp` - for LLM API calls
- `torch` - for GLiNER2 GPU acceleration

### Install GLiNER2

```bash
pip install gliner

# For GPU support
pip install gliner torch
```

## File Structure

```
src/glam_extractor/annotators/
├── __init__.py              # Module exports
├── base.py                  # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py      # HybridAnnotator, candidates, pipeline
├── llm_annotator.py         # LLMAnnotator, provider configs
└── schema_builder.py        # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py # 24 unit tests
```

## See Also

- [GLAM-NER v1.7.0 Entity Annotation Convention](./GLAM_NER_CONVENTION.md)
- [LLM Annotator Documentation](./LLM_ANNOTATOR.md)
- [Schema Builder Guide](./SCHEMA_BUILDER.md)