# Hybrid GLiNER2 + LLM Annotator Architecture

This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.

## Overview

The hybrid annotator addresses a fundamental trade-off in NLP annotation:

| Approach | Speed | Accuracy | Relationships | Domain Knowledge |
|----------|-------|----------|---------------|------------------|
| GLiNER2 (encoder) | ~100x faster | Good recall | Limited | Generic |
| LLM (decoder) | Slower | High precision | Excellent | Rich |
| **Hybrid** | Fast + thorough | Best of both | Full support | Domain-aware |
## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │    INPUT    │    │   STAGE 1   │    │   STAGE 2   │    │   STAGE 3   │   │
│  │    TEXT     │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │   │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                  AnnotationCandidate  AnnotationCandidate   EntityClaim     │
│                       (DETECTED)          (REFINED)         (VALIDATED)     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Stage 1: Fast-Pass (GLiNER2)

**Purpose**: High-recall entity mention detection at ~100x the speed of LLM-only annotation.

**Technology**: GLiNER2 encoder model (`urchade/gliner_multi-v2.1`)

**Input**: Raw text document

**Output**: `List[AnnotationCandidate]` with status `DETECTED`

### Process

1. Tokenize the input text
2. Run GLiNER2 span prediction with a configurable threshold (default 0.5)
3. Map GLiNER2 generic labels to GLAM-NER hyponyms using `GLINER2_TO_GLAM_MAPPING`
4. Create an `AnnotationCandidate` for each detected span
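The steps above can be sketched as follows. The prediction dict shape and the simplified candidate class here are illustrative assumptions, not GLiNER2's actual output format (see `hybrid_annotator.py` for the real implementation):

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass
class Candidate:
    """Simplified stand-in for AnnotationCandidate."""
    candidate_id: str
    text: str
    start_offset: int
    end_offset: int
    detection_confidence: float
    status: str = "DETECTED"

def fast_pass(predictions, threshold=0.5):
    """Filter raw span predictions by confidence and wrap them as candidates."""
    return [
        Candidate(str(uuid4()), p["text"], p["start"], p["end"], p["score"])
        for p in predictions
        if p["score"] >= threshold
    ]

# Mock GLiNER2-style output for illustration
preds = [
    {"text": "Rijksmuseum", "start": 4, "end": 15, "score": 0.91},
    {"text": "1800", "start": 41, "end": 45, "score": 0.42},  # below threshold
]
candidates = fast_pass(preds)
```

With the default 0.5 threshold, only the first mock span survives as a `DETECTED` candidate.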
### GLiNER2 to GLAM-NER Type Mapping

```python
GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",

    # Organization types
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",

    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",

    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",

    # ... (see hybrid_annotator.py for the complete mapping)
}
```
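Applying the mapping is a straightforward dictionary lookup. In this sketch, the normalization and the `None` fallback for unmapped labels are assumptions about Stage 1's behavior:

```python
from typing import Optional

# Excerpt of the mapping from above
GLINER2_TO_GLAM_MAPPING = {
    "person": "AGT.PER",
    "museum": "GRP.HER",
    "city": "TOP.SET",
}

def map_label(gliner_label: str) -> Optional[str]:
    """Normalize a GLiNER2 label, then look up its GLAM-NER hyponym.

    Returns None for labels with no mapping (assumed fallback behavior).
    """
    return GLINER2_TO_GLAM_MAPPING.get(gliner_label.strip().lower())
```

Unmapped labels can then be dropped or kept with only a detection confidence, depending on configuration.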
### Configuration

```python
HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                      # Detection confidence threshold
    gliner_entity_labels=None,                 # Custom labels (or use defaults)
    gliner_device="cpu",                       # Device (cpu/cuda)
    enable_fast_pass=True,                     # Enable/disable this stage
)
```
## Stage 2: Refinement (LLM)

**Purpose**: Entity type refinement, relationship extraction, and domain knowledge injection.

**Technology**: Z.AI GLM-4 (default), Claude, or GPT-4

**Input**: Original text + `List[AnnotationCandidate]` from Stage 1

**Output**: Refined `List[AnnotationCandidate]` + `List[RelationshipCandidate]`

### Process

1. Construct a prompt containing:
   - The original text
   - GLiNER2 candidate spans (as hints)
   - GLAM-NER type definitions
   - Relationship extraction instructions

2. The LLM performs:
   - **Type refinement**: Upgrade generic types (e.g., `GRP` → `GRP.HER`)
   - **New entity detection**: Find entities GLiNER2 missed
   - **Relationship extraction**: Identify semantic relationships
   - **Entity linking hints**: Suggest Wikidata/VIAF IDs
   - **Temporal/spatial scoping**: Add context

3. Parse the LLM response and update the candidates
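The prompt assembly in step 1 might look roughly like this. The `build_refinement_prompt` helper and its wording are hypothetical; the actual prompt template lives in `hybrid_annotator.py`:

```python
def build_refinement_prompt(text: str, candidates: list) -> str:
    """Assemble an LLM prompt from the source text and GLiNER2 hints."""
    hints = "\n".join(
        f'- "{c["text"]}" ({c["hyponym"]}, chars {c["start"]}-{c["end"]})'
        for c in candidates
    )
    return (
        "Refine the entity candidates below against the source text.\n"
        "Upgrade generic types, add missed entities, and extract "
        "relationships using GLAM-NER hyponyms.\n\n"
        f"TEXT:\n{text}\n\n"
        f"CANDIDATES:\n{hints}\n"
    )

prompt = build_refinement_prompt(
    "The Rijksmuseum in Amsterdam...",
    [{"text": "Rijksmuseum", "hyponym": "GRP", "start": 4, "end": 15}],
)
```

Passing the candidate spans as hints is what lets the LLM focus on refinement rather than detecting everything from scratch.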
### Relationship Extraction

The LLM extracts relationships following GLAM-NER relationship hyponyms:

```python
RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2",
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)
```
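A relationship like this can be checked against domain/range constraints (this is what Stage 3's consistency checking does). A minimal sketch, assuming a constraint table keyed by relationship type — the single entry shown is illustrative, not the actual GLAM-NER constraint set:

```python
# Illustrative constraint: which hypernyms may appear as subject (domain)
# and object (range) of a relationship type. Assumed entry, for the sketch only.
CONSTRAINTS = {
    "REL.SPA.LOC": ({"AGT", "GRP"}, {"TOP"}),  # "located in": thing -> place
}

def validate_relationship(rel_type: str, subject_type: str, object_type: str) -> bool:
    """Check subject/object hypernyms against the relationship's domain/range."""
    if rel_type not in CONSTRAINTS:
        return False
    domain, range_ = CONSTRAINTS[rel_type]
    # "GRP.HER" -> hypernym "GRP", etc.
    return subject_type.split(".")[0] in domain and object_type.split(".")[0] in range_
```

The `Rijksmuseum → located in → Amsterdam` example passes (`GRP` in domain, `TOP` in range); a person as the object of `located in` would not.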
### Configuration

```python
HybridConfig(
    llm_model="glm-4",          # Model name (auto-detects provider)
    llm_api_key=None,           # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,     # Enable/disable this stage
    enable_relationships=True,  # Extract relationships
)
```
## Stage 3: Validation (Cross-Check)

**Purpose**: Cross-validate outputs, detect hallucinations, and ensure consistency.

**Input**: Candidates from both Stage 1 and Stage 2

**Output**: Final `List[EntityClaim]` with status `VALIDATED` or `REJECTED`

### Process

1. **Merge candidates** from GLiNER2 and the LLM:
   - Match by span overlap (configurable threshold)
   - Prefer LLM types on conflict (configurable)
   - Create `MERGED` candidates from both sources

2. **Hallucination detection**:
   - Verify that LLM-only entities exist in the source text
   - Check for fabricated relationships
   - Flag suspicious confidence scores

3. **Consistency checking**:
   - Validate relationship domain/range constraints
   - Check temporal coherence
   - Verify entity type compatibility

4. **Final filtering**:
   - Apply the minimum confidence threshold
   - Remove rejected candidates (or keep them with a flag)
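The core of the hallucination check in step 2 can be as simple as verifying that an LLM-proposed span occurs verbatim in the source text; the case-insensitive matching shown here is an assumption about the implementation:

```python
def grounded_in_source(entity_text: str, source: str) -> bool:
    """An LLM-only entity must appear verbatim (case-insensitively) in the source."""
    return entity_text.casefold() in source.casefold()

source = "The Rijksmuseum in Amsterdam houses Vincent van Gogh's works."
```

Anything that fails this check (e.g., an entity the LLM invented from world knowledge) gets status `REJECTED` rather than `VALIDATED`.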
### Merge Strategy

```
GLiNER2 Candidate: "Van Gogh"         (AGT.PER, confidence=0.7)
LLM Candidate:     "Vincent van Gogh" (AGT.PER, confidence=0.95)

Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
```
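One way to reproduce the 0.67 ratio above is token-level overlap: shared tokens over the larger token count ("Van Gogh" shares 2 of "Vincent van Gogh"'s 3 tokens). The exact overlap definition used by the pipeline lives in `hybrid_annotator.py`, so treat this as a sketch:

```python
from typing import Optional

def token_overlap(a: str, b: str) -> float:
    """Shared tokens over the larger token count (one plausible definition)."""
    ta = set(a.casefold().split())
    tb = set(b.casefold().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))

def merge(gliner: dict, llm: dict, threshold: float = 0.3,
          prefer_llm: bool = True) -> Optional[dict]:
    """Merge two candidates if they overlap enough, preferring the LLM span."""
    if token_overlap(gliner["text"], llm["text"]) < threshold:
        return None  # no match: keep both as separate candidates
    winner = llm if prefer_llm else gliner
    return {**winner, "source": "MERGED"}

merged = merge(
    {"text": "Van Gogh", "type": "AGT.PER", "confidence": 0.7},
    {"text": "Vincent van Gogh", "type": "AGT.PER", "confidence": 0.95},
)
```

With `prefer_llm_on_conflict=True`, the merged candidate carries the LLM's longer span and higher confidence, tagged `source=MERGED`.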
### Configuration

```python
HybridConfig(
    enable_validation=True,       # Enable/disable this stage
    merge_threshold=0.3,          # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,  # LLM types take precedence
    minimum_confidence=0.3,       # Filter low-confidence results
    include_rejected=False,       # Include rejected in output
)
```
## Data Structures

### AnnotationCandidate

Shared intermediate representation used across all pipeline stages:

```python
@dataclass
class AnnotationCandidate:
    candidate_id: str                 # Unique identifier
    text: str                         # Extracted text span
    start_offset: int                 # Character start position
    end_offset: int                   # Character end position
    hypernym: Optional[str]           # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]            # Fine-grained type (AGT.PER, GRP.HER)

    # Confidence scores
    detection_confidence: float       # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float         # Combined confidence

    # Source tracking
    source: CandidateSource           # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus           # DETECTED, REFINED, VALIDATED, REJECTED

    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]

    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]

    # Provenance
    provenance: Optional[Provenance]
```
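How `overall_confidence` is derived from the two component scores is implementation-defined; a geometric mean is one plausible scheme (an assumption for illustration, not the implemented formula):

```python
def combine_confidence(detection: float, classification: float) -> float:
    """Combine detection and classification scores into an overall confidence.

    Geometric mean (assumed scheme): penalizes candidates where either
    component score is weak, unlike a plain average.
    """
    return (detection * classification) ** 0.5
```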
### RelationshipCandidate

Intermediate representation for relationships:

```python
@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label

    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str

    object_id: str
    object_text: str
    object_type: str

    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"

    confidence: float
    is_valid: bool                 # Domain/range validation result
```
### HybridAnnotationResult

Final output structure:

```python
@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str

    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False

    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0
```
## Usage

### Basic Usage

```python
from glam_extractor.annotators import HybridAnnotator, HybridConfig

# Default configuration
annotator = HybridAnnotator()

# Annotate text
result = await annotator.annotate("""
The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
Vincent van Gogh's works are among the most famous in the collection.
""")

# Access results
for entity in result.entities:
    print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")

for rel in result.relationships:
    print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")
```
### Custom Configuration

```python
config = HybridConfig(
    # Use a smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,

    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",

    # Disable relationship extraction for speed
    enable_relationships=False,

    # Stricter filtering
    minimum_confidence=0.5,
)

annotator = HybridAnnotator(config=config)
```
### GLiNER2-Only Mode

For maximum speed when relationships aren't needed:

```python
config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip LLM
    enable_validation=True,
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```
### LLM-Only Mode

When GLiNER2 isn't available, or for maximum accuracy:

```python
config = HybridConfig(
    enable_fast_pass=False,    # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,   # No cross-validation without GLiNER2
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```
## Performance Characteristics

| Configuration | Speed | Entity Recall | Entity Precision | Relationships |
|---------------|-------|---------------|------------------|---------------|
| GLiNER2-only | ~100x | High | Medium | None |
| LLM-only | 1x | Medium | High | Full |
| Hybrid (default) | ~10x | High | High | Full |
| Hybrid (no-rel) | ~20x | High | High | None |
## Dependencies

### Required

- Python 3.10+
- `dataclasses`
- `typing`

### Optional

- `gliner` - For the GLiNER2 fast-pass (gracefully degrades if not installed)
- `httpx` or `aiohttp` - For LLM API calls
- `torch` - For GLiNER2 GPU acceleration

### Install GLiNER2

```bash
pip install gliner

# For GPU support
pip install gliner torch
```
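The graceful degradation mentioned above typically follows the optional-import pattern (a sketch of the likely approach, not necessarily the exact code in `hybrid_annotator.py`):

```python
# Optional dependency: the fast-pass degrades gracefully when gliner is absent.
try:
    from gliner import GLiNER  # noqa: F401
    GLINER_AVAILABLE = True
except ImportError:
    GLINER_AVAILABLE = False

# Downstream code can then fall back to LLM-only mode automatically:
enable_fast_pass = GLINER_AVAILABLE
```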
## File Structure

```
src/glam_extractor/annotators/
├── __init__.py              # Module exports
├── base.py                  # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py      # HybridAnnotator, candidates, pipeline
├── llm_annotator.py         # LLMAnnotator, provider configs
└── schema_builder.py        # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py # 24 unit tests
```
## See Also

- [GLAM-NER v1.7.0 Entity Annotation Convention](./GLAM_NER_CONVENTION.md)
- [LLM Annotator Documentation](./LLM_ANNOTATOR.md)
- [Schema Builder Guide](./SCHEMA_BUILDER.md)