# Hybrid GLiNER2 + LLM Annotator Architecture

This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.

## Overview

The hybrid annotator addresses a fundamental trade-off in NLP annotation:

| Approach | Speed | Accuracy | Relationships | Domain Knowledge |
|----------|-------|----------|---------------|------------------|
| GLiNER2 (encoder) | ~100x faster | Good recall | Limited | Generic |
| LLM (decoder) | Slower | High precision | Excellent | Rich |
| **Hybrid** | Fast + thorough | Best of both | Full support | Domain-aware |
## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │    INPUT    │    │   STAGE 1   │    │   STAGE 2   │    │   STAGE 3   │   │
│  │    TEXT     │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │   │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                  AnnotationCandidate  AnnotationCandidate   EntityClaim     │
│                       (DETECTED)          (REFINED)         (VALIDATED)     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Stage 1: Fast-Pass (GLiNER2)

**Purpose**: High-recall entity mention detection at ~100x the speed of LLM-only annotation.

**Technology**: GLiNER2 encoder model (`urchade/gliner_multi-v2.1`)

**Input**: Raw text document

**Output**: `List[AnnotationCandidate]` with status `DETECTED`

### Process

1. Tokenize the input text
2. Run GLiNER2 span prediction with a configurable threshold (default 0.5)
3. Map GLiNER2 generic labels to GLAM-NER hyponyms using `GLINER2_TO_GLAM_MAPPING`
4. Create an `AnnotationCandidate` for each detected span
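The steps above can be sketched as follows. The prediction dict shape and the simplified candidate class here are illustrative assumptions, not GLiNER2's actual output format (see `hybrid_annotator.py` for the real implementation):

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass
class Candidate:
    """Simplified stand-in for AnnotationCandidate."""
    candidate_id: str
    text: str
    start_offset: int
    end_offset: int
    detection_confidence: float
    status: str = "DETECTED"

def fast_pass(predictions, threshold=0.5):
    """Filter raw span predictions by confidence and wrap them as candidates."""
    return [
        Candidate(str(uuid4()), p["text"], p["start"], p["end"], p["score"])
        for p in predictions
        if p["score"] >= threshold
    ]

# Mock GLiNER2-style output for illustration
preds = [
    {"text": "Rijksmuseum", "start": 4, "end": 15, "score": 0.91},
    {"text": "1800", "start": 41, "end": 45, "score": 0.42},  # below threshold
]
candidates = fast_pass(preds)
```

With the default 0.5 threshold, only the first mock span survives as a `DETECTED` candidate.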
### GLiNER2 to GLAM-NER Type Mapping

```python
GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",

    # Organization types
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",

    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",

    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",

    # ... (see hybrid_annotator.py for the complete mapping)
}
```
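Applying the mapping is a straightforward dictionary lookup. In this sketch, the normalization and the `None` fallback for unmapped labels are assumptions about Stage 1's behavior:

```python
from typing import Optional

# Excerpt of the mapping from above
GLINER2_TO_GLAM_MAPPING = {
    "person": "AGT.PER",
    "museum": "GRP.HER",
    "city": "TOP.SET",
}

def map_label(gliner_label: str) -> Optional[str]:
    """Normalize a GLiNER2 label, then look up its GLAM-NER hyponym.

    Returns None for labels with no mapping (assumed fallback behavior).
    """
    return GLINER2_TO_GLAM_MAPPING.get(gliner_label.strip().lower())
```

Unmapped labels can then be dropped or kept with only a detection confidence, depending on configuration.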
### Configuration

```python
HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                      # Detection confidence threshold
    gliner_entity_labels=None,                 # Custom labels (or use defaults)
    gliner_device="cpu",                       # Device (cpu/cuda)
    enable_fast_pass=True,                     # Enable/disable this stage
)
```
## Stage 2: Refinement (LLM)

**Purpose**: Entity type refinement, relationship extraction, and domain knowledge injection.

**Technology**: Z.AI GLM-4 (default), Claude, or GPT-4

**Input**: Original text + `List[AnnotationCandidate]` from Stage 1

**Output**: Refined `List[AnnotationCandidate]` + `List[RelationshipCandidate]`

### Process

1. Construct a prompt containing:
   - The original text
   - GLiNER2 candidate spans (as hints)
   - GLAM-NER type definitions
   - Relationship extraction instructions

2. The LLM performs:
   - **Type refinement**: Upgrade generic types (e.g., `GRP` → `GRP.HER`)
   - **New entity detection**: Find entities GLiNER2 missed
   - **Relationship extraction**: Identify semantic relationships
   - **Entity linking hints**: Suggest Wikidata/VIAF IDs
   - **Temporal/spatial scoping**: Add context

3. Parse the LLM response and update the candidates
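The prompt assembly in step 1 might look roughly like this. The `build_refinement_prompt` helper and its wording are hypothetical; the actual prompt template lives in `hybrid_annotator.py`:

```python
def build_refinement_prompt(text: str, candidates: list) -> str:
    """Assemble an LLM prompt from the source text and GLiNER2 hints."""
    hints = "\n".join(
        f'- "{c["text"]}" ({c["hyponym"]}, chars {c["start"]}-{c["end"]})'
        for c in candidates
    )
    return (
        "Refine the entity candidates below against the source text.\n"
        "Upgrade generic types, add missed entities, and extract "
        "relationships using GLAM-NER hyponyms.\n\n"
        f"TEXT:\n{text}\n\n"
        f"CANDIDATES:\n{hints}\n"
    )

prompt = build_refinement_prompt(
    "The Rijksmuseum in Amsterdam...",
    [{"text": "Rijksmuseum", "hyponym": "GRP", "start": 4, "end": 15}],
)
```

Passing the candidate spans as hints is what lets the LLM focus on refinement rather than detecting everything from scratch.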
### Relationship Extraction

The LLM extracts relationships following GLAM-NER relationship hyponyms:

```python
RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2",
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)
```
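A relationship like this can be checked against domain/range constraints (this is what Stage 3's consistency checking does). A minimal sketch, assuming a constraint table keyed by relationship type — the single entry shown is illustrative, not the actual GLAM-NER constraint set:

```python
# Illustrative constraint: which hypernyms may appear as subject (domain)
# and object (range) of a relationship type. Assumed entry, for the sketch only.
CONSTRAINTS = {
    "REL.SPA.LOC": ({"AGT", "GRP"}, {"TOP"}),  # "located in": thing -> place
}

def validate_relationship(rel_type: str, subject_type: str, object_type: str) -> bool:
    """Check subject/object hypernyms against the relationship's domain/range."""
    if rel_type not in CONSTRAINTS:
        return False
    domain, range_ = CONSTRAINTS[rel_type]
    # "GRP.HER" -> hypernym "GRP", etc.
    return subject_type.split(".")[0] in domain and object_type.split(".")[0] in range_
```

The `Rijksmuseum → located in → Amsterdam` example passes (`GRP` in domain, `TOP` in range); a person as the object of `located in` would not.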
### Configuration

```python
HybridConfig(
    llm_model="glm-4",          # Model name (auto-detects provider)
    llm_api_key=None,           # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,     # Enable/disable this stage
    enable_relationships=True,  # Extract relationships
)
```
## Stage 3: Validation (Cross-Check)

**Purpose**: Cross-validate outputs, detect hallucinations, and ensure consistency.

**Input**: Candidates from both Stage 1 and Stage 2

**Output**: Final `List[EntityClaim]` with status `VALIDATED` or `REJECTED`

### Process

1. **Merge candidates** from GLiNER2 and the LLM:
   - Match by span overlap (configurable threshold)
   - Prefer LLM types on conflict (configurable)
   - Create `MERGED` candidates from both sources

2. **Hallucination detection**:
   - Verify that LLM-only entities exist in the source text
   - Check for fabricated relationships
   - Flag suspicious confidence scores

3. **Consistency checking**:
   - Validate relationship domain/range constraints
   - Check temporal coherence
   - Verify entity type compatibility

4. **Final filtering**:
   - Apply the minimum confidence threshold
   - Remove rejected candidates (or keep them with a flag)
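The core of the hallucination check in step 2 can be as simple as verifying that an LLM-proposed span occurs verbatim in the source text; the case-insensitive matching shown here is an assumption about the implementation:

```python
def grounded_in_source(entity_text: str, source: str) -> bool:
    """An LLM-only entity must appear verbatim (case-insensitively) in the source."""
    return entity_text.casefold() in source.casefold()

source = "The Rijksmuseum in Amsterdam houses Vincent van Gogh's works."
```

Anything that fails this check (e.g., an entity the LLM invented from world knowledge) gets status `REJECTED` rather than `VALIDATED`.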
### Merge Strategy

```
GLiNER2 Candidate: "Van Gogh"         (AGT.PER, confidence=0.7)
LLM Candidate:     "Vincent van Gogh" (AGT.PER, confidence=0.95)

Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
```
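One way to reproduce the 0.67 ratio above is token-level overlap: shared tokens over the larger token count ("Van Gogh" shares 2 of "Vincent van Gogh"'s 3 tokens). The exact overlap definition used by the pipeline lives in `hybrid_annotator.py`, so treat this as a sketch:

```python
from typing import Optional

def token_overlap(a: str, b: str) -> float:
    """Shared tokens over the larger token count (one plausible definition)."""
    ta = set(a.casefold().split())
    tb = set(b.casefold().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))

def merge(gliner: dict, llm: dict, threshold: float = 0.3,
          prefer_llm: bool = True) -> Optional[dict]:
    """Merge two candidates if they overlap enough, preferring the LLM span."""
    if token_overlap(gliner["text"], llm["text"]) < threshold:
        return None  # no match: keep both as separate candidates
    winner = llm if prefer_llm else gliner
    return {**winner, "source": "MERGED"}

merged = merge(
    {"text": "Van Gogh", "type": "AGT.PER", "confidence": 0.7},
    {"text": "Vincent van Gogh", "type": "AGT.PER", "confidence": 0.95},
)
```

With `prefer_llm_on_conflict=True`, the merged candidate carries the LLM's longer span and higher confidence, tagged `source=MERGED`.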
### Configuration

```python
HybridConfig(
    enable_validation=True,       # Enable/disable this stage
    merge_threshold=0.3,          # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,  # LLM types take precedence
    minimum_confidence=0.3,       # Filter low-confidence results
    include_rejected=False,       # Include rejected in output
)
```
## Data Structures

### AnnotationCandidate

Shared intermediate representation used across all pipeline stages:

```python
@dataclass
class AnnotationCandidate:
    candidate_id: str                 # Unique identifier
    text: str                         # Extracted text span
    start_offset: int                 # Character start position
    end_offset: int                   # Character end position
    hypernym: Optional[str]           # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]            # Fine-grained type (AGT.PER, GRP.HER)

    # Confidence scores
    detection_confidence: float       # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float         # Combined confidence

    # Source tracking
    source: CandidateSource           # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus           # DETECTED, REFINED, VALIDATED, REJECTED

    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]

    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]

    # Provenance
    provenance: Optional[Provenance]
```
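How `overall_confidence` is derived from the two component scores is implementation-defined; a geometric mean is one plausible scheme (an assumption for illustration, not the implemented formula):

```python
def combine_confidence(detection: float, classification: float) -> float:
    """Combine detection and classification scores into an overall confidence.

    Geometric mean (assumed scheme): penalizes candidates where either
    component score is weak, unlike a plain average.
    """
    return (detection * classification) ** 0.5
```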
### RelationshipCandidate

Intermediate representation for relationships:

```python
@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label

    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str

    object_id: str
    object_text: str
    object_type: str

    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"

    confidence: float
    is_valid: bool                 # Domain/range validation result
```
### HybridAnnotationResult

Final output structure:

```python
@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str

    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False

    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0
```
## Usage

### Basic Usage

```python
from glam_extractor.annotators import HybridAnnotator, HybridConfig

# Default configuration
annotator = HybridAnnotator()

# Annotate text
result = await annotator.annotate("""
The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
Vincent van Gogh's works are among the most famous in the collection.
""")

# Access results
for entity in result.entities:
    print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")

for rel in result.relationships:
    print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")
```
### Custom Configuration

```python
config = HybridConfig(
    # Use a smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,

    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",

    # Disable relationship extraction for speed
    enable_relationships=False,

    # Stricter filtering
    minimum_confidence=0.5,
)

annotator = HybridAnnotator(config=config)
```
### GLiNER2-Only Mode

For maximum speed when relationships aren't needed:

```python
config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip LLM
    enable_validation=True,
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```
### LLM-Only Mode

When GLiNER2 isn't available, or for maximum accuracy:

```python
config = HybridConfig(
    enable_fast_pass=False,    # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,   # No cross-validation without GLiNER2
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)
```
## Performance Characteristics

| Configuration | Speed | Entity Recall | Entity Precision | Relationships |
|---------------|-------|---------------|------------------|---------------|
| GLiNER2-only | ~100x | High | Medium | None |
| LLM-only | 1x | Medium | High | Full |
| Hybrid (default) | ~10x | High | High | Full |
| Hybrid (no-rel) | ~20x | High | High | None |
## Dependencies

### Required

- Python 3.10+
- `dataclasses`
- `typing`

### Optional

- `gliner` - For the GLiNER2 fast-pass (gracefully degrades if not installed)
- `httpx` or `aiohttp` - For LLM API calls
- `torch` - For GLiNER2 GPU acceleration

### Install GLiNER2

```bash
pip install gliner

# For GPU support
pip install gliner torch
```
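The graceful degradation mentioned above typically follows the optional-import pattern (a sketch of the likely approach, not necessarily the exact code in `hybrid_annotator.py`):

```python
# Optional dependency: the fast-pass degrades gracefully when gliner is absent.
try:
    from gliner import GLiNER  # noqa: F401
    GLINER_AVAILABLE = True
except ImportError:
    GLINER_AVAILABLE = False

# Downstream code can then fall back to LLM-only mode automatically:
enable_fast_pass = GLINER_AVAILABLE
```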
## File Structure

```
src/glam_extractor/annotators/
├── __init__.py              # Module exports
├── base.py                  # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py      # HybridAnnotator, candidates, pipeline
├── llm_annotator.py         # LLMAnnotator, provider configs
└── schema_builder.py        # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py # 24 unit tests
```
## See Also

- [GLAM-NER v1.7.0 Entity Annotation Convention](./GLAM_NER_CONVENTION.md)
- [LLM Annotator Documentation](./LLM_ANNOTATOR.md)
- [Schema Builder Guide](./SCHEMA_BUILDER.md)