glam/docs/HYBRID_ANNOTATOR_ARCHITECTURE.md
2025-12-05 16:25:39 +01:00

# Hybrid GLiNER2 + LLM Annotator Architecture
This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.
## Overview
The hybrid annotator addresses a fundamental trade-off in NLP annotation:
| Approach | Speed | Accuracy | Relationships | Domain Knowledge |
|----------|-------|----------|---------------|------------------|
| GLiNER2 (encoder) | ~100x faster | Good recall | Limited | Generic |
| LLM (decoder) | Slower | High precision | Excellent | Rich |
| **Hybrid** | Fast + thorough | Best of both | Full support | Domain-aware |
## Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │    INPUT    │    │   STAGE 1   │    │   STAGE 2   │    │   STAGE 3   │   │
│  │    TEXT     │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │   │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                  AnnotationCandidate AnnotationCandidate    EntityClaim     │
│                       (DETECTED)          (REFINED)         (VALIDATED)     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Stage 1: Fast-Pass (GLiNER2)
**Purpose**: High-recall entity mention detection, roughly 100x faster than an LLM-only pass.

**Technology**: GLiNER2 encoder model (`urchade/gliner_multi-v2.1`)

**Input**: Raw text document

**Output**: `List[AnnotationCandidate]` with status `DETECTED`

### Process
1. Tokenize input text
2. Run GLiNER2 span prediction with configurable threshold (default 0.5)
3. Map GLiNER2 generic labels to GLAM-NER hyponyms using `GLINER2_TO_GLAM_MAPPING`
4. Create `AnnotationCandidate` for each detected span
### GLiNER2 to GLAM-NER Type Mapping
```python
GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",
    # Organization types
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",
    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",
    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",
    # ... (see hybrid_annotator.py for complete mapping)
}
```
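Steps 3 and 4 of the process can be sketched as a pure function over span predictions. This is a minimal sketch, not the code in `hybrid_annotator.py`: it assumes each prediction is a dict with `text`, `label`, `start`, `end`, and `score` keys, and the helpers `map_label` and `to_candidates`, the reduced mapping, and the plain-dict candidate shape are all hypothetical. Treating a bare type such as `GRP` as having no hyponym yet is also an assumption.

```python
from typing import Any, Dict, List, Optional
from uuid import uuid4

# Reduced copy of the mapping above, for illustration only.
GLINER2_TO_GLAM_MAPPING: Dict[str, str] = {
    "person": "AGT.PER",
    "organization": "GRP",
    "museum": "GRP.HER",
    "city": "TOP.SET",
}


def map_label(label: str) -> Optional[str]:
    """Look up the GLAM-NER type for a generic label (hypothetical helper)."""
    return GLINER2_TO_GLAM_MAPPING.get(label.strip().lower())


def to_candidates(predictions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn raw span predictions into DETECTED candidate records."""
    candidates = []
    for pred in predictions:
        glam_type = map_label(pred["label"])
        candidates.append({
            "candidate_id": str(uuid4()),
            "text": pred["text"],
            "start_offset": pred["start"],
            "end_offset": pred["end"],
            # The hypernym is the part before the first dot (AGT in AGT.PER).
            "hypernym": glam_type.split(".")[0] if glam_type else None,
            # Assumption: only dotted types count as fine-grained hyponyms.
            "hyponym": glam_type if glam_type and "." in glam_type else None,
            "detection_confidence": pred["score"],
            "source": "GLINER2",
            "status": "DETECTED",
        })
    return candidates


predictions = [
    {"text": "Rijksmuseum", "label": "museum", "start": 4, "end": 15, "score": 0.91},
    {"text": "Amsterdam", "label": "city", "start": 19, "end": 28, "score": 0.88},
]
for cand in to_candidates(predictions):
    print(cand["text"], cand["hyponym"], cand["detection_confidence"])
```

Labels missing from the mapping produce a candidate with no type, which leaves the decision to the refinement stage rather than dropping the span.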
### Configuration
```python
HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                      # Detection confidence threshold
    gliner_entity_labels=None,                 # Custom labels (or use defaults)
    gliner_device="cpu",                       # Device (cpu/cuda)
    enable_fast_pass=True,                     # Enable/disable this stage
)
```
## Stage 2: Refinement (LLM)
**Purpose**: Entity type refinement, relationship extraction, and domain knowledge injection.

**Technology**: Z.AI GLM-4 (default), Claude, or GPT-4

**Input**: Original text + `List[AnnotationCandidate]` from Stage 1

**Output**: Refined `List[AnnotationCandidate]` + `List[RelationshipCandidate]`

### Process
1. Construct prompt with:
   - Original text
   - GLiNER2 candidate spans (as hints)
   - GLAM-NER type definitions
   - Relationship extraction instructions
2. LLM performs:
   - **Type refinement**: Upgrade generic types (e.g., `GRP` → `GRP.HER`)
   - **New entity detection**: Find entities GLiNER2 missed
   - **Relationship extraction**: Identify semantic relationships
   - **Entity linking hints**: Suggest Wikidata/VIAF IDs
   - **Temporal/spatial scoping**: Add context
3. Parse LLM response and update candidates
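
Step 1 can be illustrated with a small prompt builder. This is a sketch under assumptions: the function name `build_refinement_prompt`, the section headings, the JSON hint format, and the type descriptions in the usage example are hypothetical and are not taken from `llm_annotator.py`.

```python
import json
from typing import Any, Dict, List


def build_refinement_prompt(
    text: str,
    candidates: List[Dict[str, Any]],
    type_definitions: Dict[str, str],
    extract_relationships: bool = True,
) -> str:
    """Assemble the prompt parts listed in step 1 (hypothetical layout)."""
    # Candidate spans are passed as hints, not as ground truth.
    hints = [
        {
            "text": c["text"],
            "start": c["start_offset"],
            "end": c["end_offset"],
            "type": c.get("hyponym"),
        }
        for c in candidates
    ]
    sections = [
        "## Text\n" + text,
        "## Candidate spans (hints, may be incomplete or wrong)\n"
        + json.dumps(hints, ensure_ascii=False, indent=2),
        "## GLAM-NER types\n"
        + "\n".join(f"- {code}: {desc}" for code, desc in type_definitions.items()),
    ]
    if extract_relationships:
        sections.append(
            "## Relationships\n"
            "List relationships between the entities as subject, type, object."
        )
    sections.append("Respond with a single JSON object.")
    return "\n\n".join(sections)


prompt = build_refinement_prompt(
    text="The Rijksmuseum in Amsterdam.",
    candidates=[{"text": "Rijksmuseum", "start_offset": 4, "end_offset": 15, "hyponym": "GRP.HER"}],
    # Placeholder descriptions, for illustration only.
    type_definitions={"GRP.HER": "heritage institution", "TOP.SET": "settlement"},
)
print(prompt)
```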
### Relationship Extraction
The LLM extracts relationships following GLAM-NER relationship hyponyms:
```python
RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2",
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)
```
### Configuration
```python
HybridConfig(
    llm_model="glm-4",          # Model name (auto-detects provider)
    llm_api_key=None,           # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,     # Enable/disable this stage
    enable_relationships=True,  # Extract relationships
)
```
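How the provider is auto-detected from `llm_model` is not documented here. One plausible approach is a prefix lookup on the model name, sketched below. The prefix table, the provider labels, and the helper names `detect_provider` and `resolve_api_key` are assumptions for illustration. Only the `ZAI_API_TOKEN` variable name comes from the configuration comment above.

```python
import os
from typing import Dict, Optional

# Hypothetical prefix table; the real detection rules may differ.
PROVIDER_PREFIXES: Dict[str, str] = {
    "glm-": "zai",
    "claude-": "anthropic",
    "gpt-": "openai",
}


def detect_provider(model_name: str) -> str:
    """Infer a provider label from the model name prefix."""
    name = model_name.lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if name.startswith(prefix):
            return provider
    raise ValueError(f"Cannot infer provider from model name: {model_name!r}")


def resolve_api_key(explicit_key: Optional[str], env_var: str = "ZAI_API_TOKEN") -> Optional[str]:
    """Prefer an explicitly configured key, else fall back to the environment."""
    return explicit_key or os.environ.get(env_var)


print(detect_provider("glm-4"))
print(detect_provider("claude-3-sonnet-20240229"))
```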
## Stage 3: Validation (Cross-Check)
**Purpose**: Cross-validate outputs, detect hallucinations, ensure consistency.

**Input**: Candidates from both Stage 1 and Stage 2

**Output**: Final `List[EntityClaim]` with status `VALIDATED` or `REJECTED`

### Process
1. **Merge candidates** from GLiNER2 and LLM:
   - Match by span overlap (configurable threshold)
   - Prefer LLM types on conflict (configurable)
   - Create `MERGED` candidates from both sources
2. **Hallucination detection**:
   - Verify LLM-only entities exist in source text
   - Check for fabricated relationships
   - Flag suspicious confidence scores
3. **Consistency checking**:
   - Validate relationship domain/range constraints
   - Check temporal coherence
   - Verify entity type compatibility
4. **Final filtering**:
   - Apply minimum confidence threshold
   - Remove rejected candidates (or keep with flag)
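
The hallucination check and the final filtering (steps 2 and 4) can be sketched as follows. This is an illustration, not the pipeline's actual code: candidates are plain dicts, the helper names `span_is_grounded` and `validate` are hypothetical, and checking only `LLM`-sourced entities against the source text is an assumption based on the wording of step 2.

```python
from typing import Any, Dict, List


def span_is_grounded(source_text: str, candidate: Dict[str, Any]) -> bool:
    """True if the candidate text occurs at its claimed character offsets."""
    start, end = candidate["start_offset"], candidate["end_offset"]
    return source_text[start:end] == candidate["text"]


def validate(
    source_text: str,
    candidates: List[Dict[str, Any]],
    minimum_confidence: float = 0.3,
    include_rejected: bool = False,
) -> List[Dict[str, Any]]:
    """Mark each candidate VALIDATED or REJECTED, then filter."""
    results = []
    for original in candidates:
        cand = dict(original)  # copy so the caller's records stay untouched
        # Assumption: only LLM-only entities need the grounding check.
        grounded = cand["source"] != "LLM" or span_is_grounded(source_text, cand)
        confident = cand["overall_confidence"] >= minimum_confidence
        cand["status"] = "VALIDATED" if grounded and confident else "REJECTED"
        if cand["status"] == "VALIDATED" or include_rejected:
            results.append(cand)
    return results


source_text = "The Rijksmuseum in Amsterdam houses works by Vincent van Gogh."
candidates = [
    {"text": "Rijksmuseum", "start_offset": 4, "end_offset": 15,
     "source": "MERGED", "overall_confidence": 0.90},
    # Fabricated entity: the text at offsets 4..10 is not "Louvre".
    {"text": "Louvre", "start_offset": 4, "end_offset": 10,
     "source": "LLM", "overall_confidence": 0.80},
    # Real span, but below the confidence threshold.
    {"text": "Amsterdam", "start_offset": 19, "end_offset": 28,
     "source": "GLINER2", "overall_confidence": 0.20},
]
kept = validate(source_text, candidates)
print([c["text"] for c in kept])
```

Passing `include_rejected=True` keeps all three records and relies on the `status` field to tell them apart.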
### Merge Strategy
```
GLiNER2 Candidate: "Van Gogh" (AGT.PER, confidence=0.7)
LLM Candidate: "Vincent van Gogh" (AGT.PER, confidence=0.95)
Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
```
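The document does not spell out how the overlap ratio is computed. One definition that reproduces the 0.67 in the example is the Dice coefficient over character offsets, which this sketch assumes. The helpers `overlap_ratio` and `merge_candidates`, the dict-based candidates, and the offsets are hypothetical.

```python
from typing import Any, Dict, Optional


def overlap_ratio(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Dice coefficient over half-open character spans (assumed definition)."""
    intersection = max(0, min(a_end, b_end) - max(a_start, b_start))
    total = (a_end - a_start) + (b_end - b_start)
    return 2 * intersection / total if total else 0.0


def merge_candidates(
    gliner_cand: Dict[str, Any],
    llm_cand: Dict[str, Any],
    merge_threshold: float = 0.3,
    prefer_llm_on_conflict: bool = True,
) -> Optional[Dict[str, Any]]:
    """Return a MERGED candidate, or None if the spans overlap too little."""
    ratio = overlap_ratio(
        gliner_cand["start_offset"], gliner_cand["end_offset"],
        llm_cand["start_offset"], llm_cand["end_offset"],
    )
    if ratio < merge_threshold:
        return None
    # Span and confidence come from the LLM candidate, as in the example above.
    merged = dict(llm_cand)
    if not prefer_llm_on_conflict:
        merged["hyponym"] = gliner_cand["hyponym"]
    merged["source"] = "MERGED"
    return merged


# Offsets within the string "Vincent van Gogh".
gliner_cand = {"text": "van Gogh", "start_offset": 8, "end_offset": 16,
               "hyponym": "AGT.PER", "overall_confidence": 0.7}
llm_cand = {"text": "Vincent van Gogh", "start_offset": 0, "end_offset": 16,
            "hyponym": "AGT.PER", "overall_confidence": 0.95}

ratio = overlap_ratio(8, 16, 0, 16)
merged = merge_candidates(gliner_cand, llm_cand)
print(round(ratio, 2), merged["text"], merged["source"])
```

Here the spans share 8 characters out of lengths 8 and 16, so the ratio is 2 × 8 / 24 ≈ 0.67. Other definitions, such as intersection over union, would give a different value for the same spans.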
### Configuration
```python
HybridConfig(
    enable_validation=True,       # Enable/disable this stage
    merge_threshold=0.3,          # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,  # LLM types take precedence
    minimum_confidence=0.3,       # Filter low-confidence results
    include_rejected=False,       # Include rejected in output
)
```
## Data Structures
### AnnotationCandidate
Shared intermediate representation used across all pipeline stages:
```python
@dataclass
class AnnotationCandidate:
    candidate_id: str                 # Unique identifier
    text: str                         # Extracted text span
    start_offset: int                 # Character start position
    end_offset: int                   # Character end position
    hypernym: Optional[str]           # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]            # Fine-grained type (AGT.PER, GRP.HER)

    # Confidence scores
    detection_confidence: float       # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float         # Combined confidence

    # Source tracking
    source: CandidateSource           # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus           # DETECTED, REFINED, VALIDATED, REJECTED

    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]

    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]

    # Provenance
    provenance: Optional[Provenance]
```
### RelationshipCandidate
Intermediate representation for relationships:
```python
@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label
    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str
    object_id: str
    object_text: str
    object_type: str
    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"
    confidence: float
    is_valid: bool                 # Domain/range validation result
```
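The `is_valid` flag records the outcome of the domain/range check. A minimal sketch of such a check is shown here. The `DOMAIN_RANGE` table and its single rule are hypothetical, since the real constraints belong to the GLAM-NER convention and are not listed in this document, and the helper names are invented for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical constraint table:
# relationship type -> (allowed subject hypernyms, allowed object hypernyms).
# None means "no restriction".
DOMAIN_RANGE: Dict[str, Tuple[Optional[Set[str]], Optional[Set[str]]]] = {
    "REL.SPA.LOC": (None, {"TOP"}),  # the object of "located in" must be a place
}


def hypernym_of(type_code: str) -> str:
    """Return the top-level part of a type code, e.g. GRP for GRP.HER."""
    return type_code.split(".")[0]


def check_domain_range(relationship_type: str, subject_type: str, object_type: str) -> bool:
    """Check subject and object types against the constraint table."""
    constraint = DOMAIN_RANGE.get(relationship_type)
    if constraint is None:
        return True  # no rule recorded for this relationship type
    domain, range_ = constraint
    subject_ok = domain is None or hypernym_of(subject_type) in domain
    object_ok = range_ is None or hypernym_of(object_type) in range_
    return subject_ok and object_ok


print(check_domain_range("REL.SPA.LOC", "GRP.HER", "TOP.SET"))
print(check_domain_range("REL.SPA.LOC", "GRP.HER", "AGT.PER"))
```

Treating unknown relationship types as valid is a permissive design choice; a stricter pipeline could reject them instead.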
### HybridAnnotationResult
Final output structure:
```python
@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str

    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False

    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0
```
## Usage
### Basic Usage
```python
import asyncio

from glam_extractor.annotators import HybridAnnotator, HybridConfig


async def main() -> None:
    # Default configuration
    annotator = HybridAnnotator()

    # Annotate text (annotate() is awaited, so it must run inside a coroutine)
    result = await annotator.annotate("""
    The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
    Vincent van Gogh's works are among the most famous in the collection.
    """)

    # Access results
    for entity in result.entities:
        print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")
    for rel in result.relationships:
        print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")


asyncio.run(main())
```
### Custom Configuration
```python
config = HybridConfig(
    # Use smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,
    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",
    # Disable relationship extraction for speed
    enable_relationships=False,
    # Stricter filtering
    minimum_confidence=0.5,
)
annotator = HybridAnnotator(config=config)
```
### GLiNER2-Only Mode
For maximum speed when relationships aren't needed:
```python
config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip LLM
    enable_validation=True,
)
annotator = HybridAnnotator(config=config)
# Call from within a coroutine
result = await annotator.annotate(text)
```
### LLM-Only Mode
When GLiNER2 isn't available, or when LLM precision matters more than speed:
```python
config = HybridConfig(
    enable_fast_pass=False,   # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,  # No cross-validation without GLiNER2
)
annotator = HybridAnnotator(config=config)
# Call from within a coroutine
result = await annotator.annotate(text)
```
## Performance Characteristics
| Configuration | Speed (relative to LLM-only) | Entity Recall | Entity Precision | Relationships |
|---------------|-------|---------------|------------------|---------------|
| GLiNER2-only | ~100x | High | Medium | None |
| LLM-only | 1x | Medium | High | Full |
| Hybrid (default) | ~10x | High | High | Full |
| Hybrid (no-rel) | ~20x | High | High | None |
## Dependencies
### Required
- Python 3.10+
- `dataclasses` (standard library)
- `typing` (standard library)
### Optional
- `gliner` - For GLiNER2 fast-pass (gracefully degrades if not installed)
- `httpx` or `aiohttp` - For LLM API calls
- `torch` - For GLiNER2 GPU acceleration
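
Graceful degradation for the optional `gliner` dependency is commonly done with a guarded import. The sketch below shows that pattern. The `GLINER_AVAILABLE` flag and the `plan_stages` helper are hypothetical names, and the actual module may handle a missing package differently.

```python
from typing import List

try:
    from gliner import GLiNER  # optional dependency
    GLINER_AVAILABLE = True
except ImportError:
    GLiNER = None
    GLINER_AVAILABLE = False


def plan_stages(enable_fast_pass: bool = True, enable_refinement: bool = True) -> List[str]:
    """Return the stages that can actually run in this environment."""
    stages = []
    # Skip the fast pass silently when the gliner package is missing.
    if enable_fast_pass and GLINER_AVAILABLE:
        stages.append("fast_pass")
    if enable_refinement:
        stages.append("refinement")
    return stages


print(GLINER_AVAILABLE, plan_stages())
```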
### Install GLiNER2
```bash
pip install gliner
# For GPU support
pip install gliner torch
```
## File Structure
```
src/glam_extractor/annotators/
├── __init__.py               # Module exports
├── base.py                   # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py       # HybridAnnotator, candidates, pipeline
├── llm_annotator.py          # LLMAnnotator, provider configs
└── schema_builder.py         # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py  # 24 unit tests
```
## See Also
- [GLAM-NER v1.7.0 Entity Annotation Convention](./GLAM_NER_CONVENTION.md)
- [LLM Annotator Documentation](./LLM_ANNOTATOR.md)
- [Schema Builder Guide](./SCHEMA_BUILDER.md)