glam/docs/HYBRID_ANNOTATOR_ARCHITECTURE.md
2025-12-05 16:25:39 +01:00

Hybrid GLiNER2 + LLM Annotator Architecture

This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.

Overview

The hybrid annotator addresses a fundamental trade-off in NLP annotation:

| Approach | Speed | Accuracy | Relationships | Domain Knowledge |
|---|---|---|---|---|
| GLiNER2 (encoder) | ~100x faster | Good recall | Limited | Generic |
| LLM (decoder) | Slower | High precision | Excellent | Rich |
| Hybrid | Fast + thorough | Best of both | Full support | Domain-aware |

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │   INPUT     │    │  STAGE 1    │    │  STAGE 2    │    │  STAGE 3    │  │
│  │   TEXT      │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │  │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │  │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘  │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                     AnnotationCandidate  AnnotationCandidate  EntityClaim   │
│                     (DETECTED)           (REFINED)            (VALIDATED)   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Stage 1: Fast-Pass (GLiNER2)

Purpose: High-recall entity mention detection at roughly 100x the speed of an LLM-only pipeline.

Technology: GLiNER2 encoder model (urchade/gliner_multi-v2.1)

Input: Raw text document

Output: List[AnnotationCandidate] with status DETECTED

Process

  1. Tokenize input text
  2. Run GLiNER2 span prediction with configurable threshold (default 0.5)
  3. Map GLiNER2 generic labels to GLAM-NER hyponyms using GLINER2_TO_GLAM_MAPPING
  4. Create AnnotationCandidate for each detected span
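
The steps above could look roughly like the following sketch. It assumes the model returns one dict per span with `text`, `label`, `start`, `end`, and `score` keys (the shape the `gliner` package's `predict_entities` is understood to produce). `Candidate`, `to_candidates`, and the three-entry mapping are illustrative stand-ins, not the project's actual classes.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Candidate:
    """Simplified stand-in for AnnotationCandidate (hypothetical)."""
    text: str
    start_offset: int
    end_offset: int
    hyponym: str
    detection_confidence: float
    status: str = "DETECTED"
    candidate_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def to_candidates(
    predictions: List[dict],
    label_mapping: Dict[str, str],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep spans at or above the threshold and map their labels to GLAM-NER types."""
    candidates = []
    for span in predictions:
        if span["score"] < threshold:
            continue
        hyponym = label_mapping.get(span["label"])
        if hyponym is None:
            continue  # assumption: spans with unmapped labels are skipped
        candidates.append(
            Candidate(
                text=span["text"],
                start_offset=span["start"],
                end_offset=span["end"],
                hyponym=hyponym,
                detection_confidence=span["score"],
            )
        )
    return candidates


# Hand-written predictions standing in for real model output.
sample_text = "The Rijksmuseum in Amsterdam"
predictions = [
    {"text": "Rijksmuseum", "label": "museum", "start": 4, "end": 15, "score": 0.91},
    {"text": "Amsterdam", "label": "city", "start": 19, "end": 28, "score": 0.88},
    {"text": "The", "label": "person", "start": 0, "end": 3, "score": 0.12},
]
mapping = {"museum": "GRP.HER", "city": "TOP.SET", "person": "AGT.PER"}
candidates = to_candidates(predictions, mapping)
```

Passing the predictions in as plain dicts keeps the conversion testable without loading a model; the real pipeline presumably calls the model first and feeds its output through a step like this.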

GLiNER2 to GLAM-NER Type Mapping

GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",
    
    # Organization types  
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",
    
    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",
    
    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",
    
    # ... (see hybrid_annotator.py for complete mapping)
}
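
As a sketch of how this mapping might be consulted, the helper below normalizes a label and splits the mapped code into a top-level hypernym and a fine-grained hyponym. `resolve_glam_type` is a hypothetical name, and treating a bare code such as `GRP` as "no hyponym yet" is an assumption, not confirmed behavior of `hybrid_annotator.py`.

```python
from typing import Optional, Tuple

# Subset of the mapping shown above, repeated so the example is self-contained.
GLINER2_TO_GLAM_MAPPING = {
    "person": "AGT.PER",
    "organization": "GRP",
    "museum": "GRP.HER",
    "city": "TOP.SET",
}


def resolve_glam_type(label: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (hypernym, hyponym) for a model label, or (None, None) if unmapped."""
    glam_type = GLINER2_TO_GLAM_MAPPING.get(label.strip().lower())
    if glam_type is None:
        return None, None
    hypernym = glam_type.split(".", 1)[0]
    # Assumption: a bare top-level code carries no fine-grained hyponym.
    hyponym = glam_type if "." in glam_type else None
    return hypernym, hyponym
```

Under these assumptions, `resolve_glam_type("Museum")` yields `("GRP", "GRP.HER")`, while `resolve_glam_type("organization")` yields `("GRP", None)`, leaving the finer type for Stage 2 to fill in.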

Configuration

HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                       # Detection confidence threshold
    gliner_entity_labels=None,                  # Custom labels (or use defaults)
    gliner_device="cpu",                        # Device (cpu/cuda)
    enable_fast_pass=True,                      # Enable/disable this stage
)

Stage 2: Refinement (LLM)

Purpose: Entity type refinement, relationship extraction, and domain knowledge injection.

Technology: Z.AI GLM-4 (default), Claude, or GPT-4

Input: Original text + List[AnnotationCandidate] from Stage 1

Output: Refined List[AnnotationCandidate] + List[RelationshipCandidate]

Process

  1. Construct prompt with:

    • Original text
    • GLiNER2 candidate spans (as hints)
    • GLAM-NER type definitions
    • Relationship extraction instructions
  2. LLM performs:

    • Type refinement: Upgrade generic types (e.g., GRP → GRP.HER)
    • New entity detection: Find entities GLiNER2 missed
    • Relationship extraction: Identify semantic relationships
    • Entity linking hints: Suggest Wikidata/VIAF IDs
    • Temporal/spatial scoping: Add context
  3. Parse LLM response and update candidates
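
To make steps 1 and 3 concrete, here is a minimal sketch of prompt construction and response parsing. The prompt wording, the JSON reply format, and the function names `build_refinement_prompt` and `parse_refinement_response` are all assumptions for illustration; the actual prompt lives in the project code.

```python
import json
from typing import Dict, List


def build_refinement_prompt(
    text: str,
    candidates: List[dict],
    type_definitions: Dict[str, str],
) -> str:
    """Assemble the original text, candidate hints, and type definitions into one prompt."""
    hint_lines = [
        f'- "{c["text"]}" [{c["start"]}:{c["end"]}] -> {c["type"]}' for c in candidates
    ]
    type_lines = [f"- {code}: {desc}" for code, desc in type_definitions.items()]
    return "\n".join([
        "Refine the entity annotations for the text below.",
        "",
        "TEXT:",
        text,
        "",
        "CANDIDATE SPANS (hints, may be incomplete or wrong):",
        *hint_lines,
        "",
        "TYPE DEFINITIONS:",
        *type_lines,
        "",
        'Return JSON with the keys "entities" and "relationships".',
    ])


def parse_refinement_response(raw: str) -> Dict[str, list]:
    """Parse the model reply; malformed or unexpected JSON yields empty lists."""
    empty = {"entities": [], "relationships": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {
        "entities": data.get("entities") or [],
        "relationships": data.get("relationships") or [],
    }


prompt = build_refinement_prompt(
    "The Rijksmuseum in Amsterdam",
    [{"text": "Rijksmuseum", "start": 4, "end": 15, "type": "GRP"}],
    {"GRP.HER": "Heritage institution"},
)
```

Falling back to empty lists on a malformed reply lets the pipeline continue with the Stage 1 candidates instead of failing the whole document.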

Relationship Extraction

The LLM extracts relationships following GLAM-NER relationship hyponyms:

RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2", 
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)

Configuration

HybridConfig(
    llm_model="glm-4",           # Model name (auto-detects provider)
    llm_api_key=None,            # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,      # Enable/disable this stage
    enable_relationships=True,   # Extract relationships
)
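
The comments above mention provider auto-detection and an environment-variable fallback for the API key. One plausible way to implement both is sketched below; the prefix table, the provider names, and both function names are hypothetical, and only the `ZAI_API_TOKEN` variable name comes from this document.

```python
import os
from typing import Optional

# Hypothetical prefix table; the real detection rules live in the project code.
PROVIDER_PREFIXES = {
    "glm": "zai",
    "claude": "anthropic",
    "gpt": "openai",
}


def detect_provider(model_name: str) -> str:
    """Guess the provider from the model-name prefix."""
    name = model_name.lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if name.startswith(prefix):
            return provider
    raise ValueError(f"Cannot detect provider for model {model_name!r}")


def resolve_api_key(
    explicit_key: Optional[str],
    env_var: str = "ZAI_API_TOKEN",
) -> Optional[str]:
    """Prefer an explicitly configured key, else fall back to the environment."""
    return explicit_key or os.environ.get(env_var)
```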

Stage 3: Validation (Cross-Check)

Purpose: Cross-validate outputs, detect hallucinations, ensure consistency.

Input: Candidates from both Stage 1 and Stage 2

Output: Final List[EntityClaim] with status VALIDATED or REJECTED

Process

  1. Merge candidates from GLiNER2 and LLM:

    • Match by span overlap (configurable threshold)
    • Prefer LLM types on conflict (configurable)
    • Create MERGED candidates from both sources
  2. Hallucination detection:

    • Verify LLM-only entities exist in source text
    • Check for fabricated relationships
    • Flag suspicious confidence scores
  3. Consistency checking:

    • Validate relationship domain/range constraints
    • Check temporal coherence
    • Verify entity type compatibility
  4. Final filtering:

    • Apply minimum confidence threshold
    • Remove rejected candidates (or keep with flag)
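
Two of these checks are simple enough to sketch: verifying that a span really occurs in the source text at its claimed offsets, and applying the minimum confidence threshold. The dict-based candidates and the names `is_grounded` and `validate` are simplifications assumed for this example, not the project's API.

```python
from typing import List, Tuple


def is_grounded(source_text: str, text: str, start: int, end: int) -> bool:
    """True if the span text occurs in the source at exactly the claimed offsets."""
    return 0 <= start < end <= len(source_text) and source_text[start:end] == text


def validate(
    source_text: str,
    candidates: List[dict],
    minimum_confidence: float = 0.3,
) -> Tuple[List[dict], List[dict]]:
    """Split candidates into (validated, rejected)."""
    validated, rejected = [], []
    for cand in candidates:
        grounded = is_grounded(source_text, cand["text"], cand["start"], cand["end"])
        confident = cand["confidence"] >= minimum_confidence
        if grounded and confident:
            validated.append(cand)
        else:
            rejected.append(cand)
    return validated, rejected


source = "The Rijksmuseum in Amsterdam"
validated, rejected = validate(source, [
    {"text": "Rijksmuseum", "start": 4, "end": 15, "confidence": 0.90},
    {"text": "Louvre", "start": 4, "end": 10, "confidence": 0.95},    # not in the source
    {"text": "Amsterdam", "start": 19, "end": 28, "confidence": 0.20},  # below threshold
])
```

Keeping the rejected list separate mirrors the `include_rejected` option: callers can drop it or attach it to the output with a flag.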

Merge Strategy

GLiNER2 Candidate: "Van Gogh" (AGT.PER, confidence=0.7)
LLM Candidate: "Vincent van Gogh" (AGT.PER, confidence=0.95)

Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
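
The document does not state how the overlap ratio is computed. One definition that reproduces the 0.67 in this example is the Dice coefficient over character offsets, 2 × overlap / (len A + len B); the sketch below assumes that definition, and the function names and dict fields are illustrative.

```python
from typing import Optional


def overlap_ratio(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Dice coefficient of two character spans (assumed definition)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    total = (a_end - a_start) + (b_end - b_start)
    return 2 * overlap / total if total else 0.0


def merge(
    gliner: dict,
    llm: dict,
    merge_threshold: float = 0.3,
    prefer_llm_on_conflict: bool = True,
) -> Optional[dict]:
    """Merge two overlapping candidates, or return None if they overlap too little."""
    ratio = overlap_ratio(gliner["start"], gliner["end"], llm["start"], llm["end"])
    if ratio < merge_threshold:
        return None
    merged = dict(llm)  # LLM span and confidence, as in the example above
    if not prefer_llm_on_conflict:
        merged["type"] = gliner["type"]
    merged["source"] = "MERGED"
    return merged


# Offsets refer to the sentence "Works by Vincent van Gogh".
gliner_cand = {"text": "van Gogh", "start": 17, "end": 25, "type": "AGT.PER", "confidence": 0.7}
llm_cand = {"text": "Vincent van Gogh", "start": 9, "end": 25, "type": "AGT.PER", "confidence": 0.95}
merged = merge(gliner_cand, llm_cand)
```

Here the spans share 8 characters out of 8 + 16, giving 2 × 8 / 24 ≈ 0.67, which clears the 0.3 threshold.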

Configuration

HybridConfig(
    enable_validation=True,        # Enable/disable this stage
    merge_threshold=0.3,           # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,   # LLM types take precedence
    minimum_confidence=0.3,        # Filter low-confidence results
    include_rejected=False,        # Include rejected in output
)

Data Structures

AnnotationCandidate

Shared intermediate representation used across all pipeline stages:

@dataclass
class AnnotationCandidate:
    candidate_id: str              # Unique identifier
    text: str                      # Extracted text span
    start_offset: int              # Character start position
    end_offset: int                # Character end position
    hypernym: Optional[str]        # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]         # Fine-grained type (AGT.PER, GRP.HER)
    
    # Confidence scores
    detection_confidence: float    # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float      # Combined confidence
    
    # Source tracking
    source: CandidateSource        # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus        # DETECTED, REFINED, VALIDATED, REJECTED
    
    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]
    
    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]
    
    # Provenance
    provenance: Optional[Provenance]

RelationshipCandidate

Intermediate representation for relationships:

@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label
    
    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str
    
    object_id: str
    object_text: str
    object_type: str
    
    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"
    
    confidence: float
    is_valid: bool                 # Domain/range validation result
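
The `is_valid` flag records a domain/range check. A minimal sketch of such a check follows, assuming a hypothetical constraint table keyed by relationship type; the single `REL.SPA.LOC` rule shown is invented for illustration and is not taken from the GLAM-NER specification.

```python
from typing import Dict, Tuple

# Hypothetical constraints: relationship type -> (allowed subject hypernyms, allowed object hypernyms)
DOMAIN_RANGE: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {
    "REL.SPA.LOC": (("GRP", "AGT"), ("TOP",)),
}


def check_domain_range(relationship_type: str, subject_type: str, object_type: str) -> bool:
    """True if subject and object hypernyms satisfy the relationship's constraints."""
    constraint = DOMAIN_RANGE.get(relationship_type)
    if constraint is None:
        return True  # assumption: relationship types without a rule are unconstrained
    domain, range_ = constraint
    subject_hypernym = subject_type.split(".", 1)[0]
    object_hypernym = object_type.split(".", 1)[0]
    return subject_hypernym in domain and object_hypernym in range_
```

With this table, "Rijksmuseum (GRP.HER) located in Amsterdam (TOP.SET)" passes, while the reversed pair fails because a place is not an allowed subject.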

HybridAnnotationResult

Final output structure:

@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str
    
    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False
    
    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0

Usage

Basic Usage

import asyncio

from glam_extractor.annotators import HybridAnnotator, HybridConfig

TEXT = """
    The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
    Vincent van Gogh's works are among the most famous in the collection.
"""


async def main() -> None:
    # Default configuration
    annotator = HybridAnnotator()

    # Annotate text (annotate() is awaited, so it must run inside a coroutine)
    result = await annotator.annotate(TEXT)

    # Access results
    for entity in result.entities:
        print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")

    for rel in result.relationships:
        print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")


asyncio.run(main())

Custom Configuration

config = HybridConfig(
    # Use smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,
    
    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",
    
    # Disable relationship extraction for speed
    enable_relationships=False,
    
    # Stricter filtering
    minimum_confidence=0.5,
)

annotator = HybridAnnotator(config=config)

GLiNER2-Only Mode

For maximum speed when relationships aren't needed:

config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip LLM
    enable_validation=True,
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)  # call from within an async function

LLM-Only Mode

When GLiNER2 isn't available, or when precision matters more than speed:

config = HybridConfig(
    enable_fast_pass=False,   # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,  # No cross-validation without GLiNER2
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)  # call from within an async function

Performance Characteristics

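| Configuration | Speed (relative to LLM-only) | Entity Recall | Entity Precision | Relationships |
|---|---|---|---|---|
| GLiNER2-only | ~100x | High | Medium | None |
| LLM-only | 1x | Medium | High | Full |
| Hybrid (default) | ~10x | High | High | Full |
| Hybrid (no-rel) | ~20x | High | High | None |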

Dependencies

Required

  • Python 3.10+
  • dataclasses (standard library)
  • typing (standard library)

Optional

  • gliner - For GLiNER2 fast-pass (gracefully degrades if not installed)
  • httpx or aiohttp - For LLM API calls
  • torch - For GLiNER2 GPU acceleration
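
Graceful degradation for an optional dependency is commonly handled by probing for the package before enabling the stage that needs it. The sketch below shows one such pattern using the standard library; `gliner_available` and `select_stages` are hypothetical names, and how the project actually handles a missing `gliner` install may differ.

```python
import importlib.util
from typing import List


def gliner_available() -> bool:
    """True if the optional `gliner` package can be found on the import path."""
    return importlib.util.find_spec("gliner") is not None


def select_stages(enable_fast_pass: bool = True, enable_refinement: bool = True) -> List[str]:
    """Return the stages to run, dropping the fast pass when gliner is missing."""
    stages = []
    if enable_fast_pass and gliner_available():
        stages.append("fast_pass")
    if enable_refinement:
        stages.append("refinement")
    return stages
```

Probing with `find_spec` avoids importing the package (and its model dependencies) just to learn whether it is installed.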

Install GLiNER2

pip install gliner

# For GPU support
pip install gliner torch

File Structure

src/glam_extractor/annotators/
├── __init__.py              # Module exports
├── base.py                  # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py      # HybridAnnotator, candidates, pipeline
├── llm_annotator.py         # LLMAnnotator, provider configs
└── schema_builder.py        # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py # 24 unit tests

See Also