kempersc 4da64eeebf improve annotator

2025-12-05 16:25:39 +01:00

Hybrid GLiNER2 + LLM Annotator Architecture

This document describes the hybrid annotation pipeline that combines fast encoder-based NER (GLiNER2) with powerful LLM reasoning for comprehensive entity and relationship extraction.

Overview

The hybrid annotator addresses a fundamental trade-off in NLP annotation:

Approach	Speed (vs. LLM-only)	Accuracy	Relationships	Domain Knowledge
GLiNER2 (encoder)	~100x faster	Good recall	Limited	Generic
LLM (decoder)	1x (baseline)	High precision	Excellent	Rich
Hybrid	~10x faster	Best of both	Full support	Domain-aware

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HYBRID ANNOTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │   INPUT     │    │  STAGE 1    │    │  STAGE 2    │    │  STAGE 3    │  │
│  │   TEXT      │───▶│  FAST-PASS  │───▶│ REFINEMENT  │───▶│ VALIDATION  │  │
│  │             │    │  (GLiNER2)  │    │    (LLM)    │    │   (CROSS)   │  │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘  │
│                            │                  │                  │          │
│                            ▼                  ▼                  ▼          │
│                     AnnotationCandidate  AnnotationCandidate  EntityClaim   │
│                     (DETECTED)           (REFINED)            (VALIDATED)   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Stage 1: Fast-Pass (GLiNER2)

Purpose: High-recall entity mention detection, roughly 100x faster than an LLM-only pass.

Technology: GLiNER2 encoder model (urchade/gliner_multi-v2.1)

Input: Raw text document

Output: List[AnnotationCandidate] with status DETECTED

Process

1. Tokenize input text
2. Run GLiNER2 span prediction with configurable threshold (default 0.5)
3. Map GLiNER2 generic labels to GLAM-NER hyponyms using GLINER2_TO_GLAM_MAPPING
4. Create AnnotationCandidate for each detected span
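
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in hybrid_annotator.py: Candidate is a simplified, hypothetical stand-in for AnnotationCandidate, spans_to_candidates is a made-up helper name, and the input is assumed to be a list of dicts with text, start, end, label and score keys (the shape the gliner package is assumed to return from its entity prediction call). Skipping unmapped labels is also an assumption.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

# Excerpt only; the complete table lives in hybrid_annotator.py.
LABEL_TO_GLAM: Dict[str, str] = {
    "person": "AGT.PER",
    "museum": "GRP.HER",
    "city": "TOP.SET",
}


@dataclass
class Candidate:
    """Hypothetical, simplified stand-in for AnnotationCandidate."""
    text: str
    start_offset: int
    end_offset: int
    hyponym: str
    detection_confidence: float
    status: str = "DETECTED"
    candidate_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def spans_to_candidates(spans: List[dict], threshold: float = 0.5) -> List[Candidate]:
    """Filter predicted spans by score and map their labels to GLAM-NER types."""
    candidates: List[Candidate] = []
    for span in spans:
        if span["score"] < threshold:
            continue  # below the detection threshold
        hyponym = LABEL_TO_GLAM.get(span["label"].lower())
        if hyponym is None:
            continue  # unmapped label: skipped in this sketch (an assumption)
        candidates.append(
            Candidate(
                text=span["text"],
                start_offset=span["start"],
                end_offset=span["end"],
                hyponym=hyponym,
                detection_confidence=span["score"],
            )
        )
    return candidates
```

Raising the threshold trades recall for precision; since Stage 2 can still discard false positives, a permissive default such as 0.5 fits the high-recall purpose of this stage.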

GLiNER2 to GLAM-NER Type Mapping

GLINER2_TO_GLAM_MAPPING = {
    # Person types
    "person": "AGT.PER",
    "people": "AGT.PER",
    
    # Organization types  
    "organization": "GRP",
    "museum": "GRP.HER",
    "library": "GRP.HER",
    "archive": "GRP.HER",
    "university": "GRP.EDU",
    
    # Location types
    "location": "TOP",
    "city": "TOP.SET",
    "country": "TOP.CTY",
    "building": "TOP.BLD",
    
    # Temporal types
    "date": "TMP.DAB",
    "time": "TMP.TAB",
    "period": "TMP.ERA",
    
    # ... (see hybrid_annotator.py for complete mapping)
}
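
To show how a mapped code relates to the hypernym and hyponym fields of a candidate, here is a small sketch. It assumes that the hypernym is the part of the code before the first dot, and that a code without a dot (such as GRP) is a generic type with no hyponym yet. The function name resolve_glam_type and the handling of unknown labels are illustrative assumptions, not taken from the source.

```python
from typing import Dict, Optional, Tuple

# Excerpt of the mapping shown above.
GLINER2_TO_GLAM_MAPPING: Dict[str, str] = {
    "person": "AGT.PER",
    "organization": "GRP",
    "museum": "GRP.HER",
    "city": "TOP.SET",
    "date": "TMP.DAB",
}


def resolve_glam_type(label: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (hypernym, hyponym) for a GLiNER2 label, or (None, None) if unmapped."""
    code = GLINER2_TO_GLAM_MAPPING.get(label.strip().lower())
    if code is None:
        return None, None
    hypernym = code.split(".", 1)[0]
    # A bare top-level code (e.g. "GRP") carries no fine-grained type yet.
    hyponym = code if "." in code else None
    return hypernym, hyponym
```

Under this reading, a span labeled "organization" enters Stage 2 with only a hypernym, which is the kind of generic type the LLM is later asked to upgrade.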

Configuration

HybridConfig(
    gliner_model="urchade/gliner_multi-v2.1",  # Model to use
    gliner_threshold=0.5,                       # Detection confidence threshold
    gliner_entity_labels=None,                  # Custom labels (or use defaults)
    gliner_device="cpu",                        # Device (cpu/cuda)
    enable_fast_pass=True,                      # Enable/disable this stage
)

Stage 2: Refinement (LLM)

Purpose: Entity type refinement, relationship extraction, and domain knowledge injection.

Technology: Z.AI GLM-4 (default), Claude, or GPT-4

Input: Original text + List[AnnotationCandidate] from Stage 1

Output: Refined List[AnnotationCandidate] + List[RelationshipCandidate]

Process

1. Construct prompt with:
   - Original text
   - GLiNER2 candidate spans (as hints)
   - GLAM-NER type definitions
   - Relationship extraction instructions
2. LLM performs:
   - Type refinement: Upgrade generic types (e.g., GRP → GRP.HER)
   - New entity detection: Find entities GLiNER2 missed
   - Relationship extraction: Identify semantic relationships
   - Entity linking hints: Suggest Wikidata/VIAF IDs
   - Temporal/spatial scoping: Add context
3. Parse LLM response and update candidates
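
The prompt construction step might look something like the sketch below. The function name build_refinement_prompt, the section wording, and the dict-based candidate shape are all assumptions made for illustration; the actual prompt template is not shown in this document.

```python
import json
from typing import Dict, List


def build_refinement_prompt(
    text: str,
    candidates: List[Dict],
    type_definitions: Dict[str, str],
    extract_relationships: bool = True,
) -> str:
    """Assemble a refinement prompt from the four ingredients listed above."""
    # Pass GLiNER2 spans as hints only; the LLM may correct or extend them.
    hints = [
        {
            "text": c["text"],
            "start": c["start_offset"],
            "end": c["end_offset"],
            "type": c.get("hyponym"),
        }
        for c in candidates
    ]
    definitions = "\n".join(
        f"- {code}: {description}"
        for code, description in sorted(type_definitions.items())
    )
    sections = [
        "Annotate the text below with GLAM-NER entity types.",
        "TYPE DEFINITIONS:\n" + definitions,
        "CANDIDATE SPANS (hints, possibly incomplete or wrong):\n"
        + json.dumps(hints, ensure_ascii=False, indent=2),
    ]
    if extract_relationships:
        sections.append(
            "Also list relationships between entities as "
            "subject, relationship type, and object."
        )
    sections.append("TEXT:\n" + text)
    sections.append("Respond with JSON only.")
    return "\n\n".join(sections)
```

Serializing the hints as JSON keeps character offsets unambiguous, which helps later when LLM output is matched back against the Stage 1 spans.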

Relationship Extraction

The LLM extracts relationships following GLAM-NER relationship hyponyms:

RelationshipCandidate(
    subject_id="candidate-uuid-1",
    subject_text="Rijksmuseum",
    subject_type="GRP.HER",
    relationship_type="REL.SPA.LOC",  # Located at
    relationship_label="located in",
    object_id="candidate-uuid-2", 
    object_text="Amsterdam",
    object_type="TOP.SET",
    confidence=0.92,
)
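
As a rough illustration of how an LLM response could be turned into records like the one above, the sketch below parses a JSON payload and resolves subject and object text to candidate IDs. The response schema (a "relationships" list with subject, type, label, object, confidence keys) and the helper name parse_relationships are hypothetical; the real response format is not documented here.

```python
import json
from typing import Dict, List


def parse_relationships(llm_json: str, id_by_text: Dict[str, str]) -> List[dict]:
    """Keep only relationships whose endpoints both refer to known candidates."""
    payload = json.loads(llm_json)
    results: List[dict] = []
    for item in payload.get("relationships", []):
        subject, obj = item.get("subject"), item.get("object")
        if subject not in id_by_text or obj not in id_by_text:
            continue  # an endpoint that matches no candidate cannot be linked
        results.append(
            {
                "subject_id": id_by_text[subject],
                "subject_text": subject,
                "relationship_type": item.get("type", ""),
                "relationship_label": item.get("label", ""),
                "object_id": id_by_text[obj],
                "object_text": obj,
                "confidence": float(item.get("confidence", 0.0)),
            }
        )
    return results
```

Dropping relationships with unknown endpoints is a conservative choice; a real pipeline might instead create a new candidate for the missing entity.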

Configuration

HybridConfig(
    llm_model="glm-4",           # Model name (auto-detects provider)
    llm_api_key=None,            # API key (or use ZAI_API_TOKEN env var)
    enable_refinement=True,      # Enable/disable this stage
    enable_relationships=True,   # Extract relationships
)

Stage 3: Validation (Cross-Check)

Purpose: Cross-validate outputs, detect hallucinations, ensure consistency.

Input: Candidates from both Stage 1 and Stage 2

Output: Final List[EntityClaim] with status VALIDATED or REJECTED

Process

1. Merge candidates from GLiNER2 and LLM:
   - Match by span overlap (configurable threshold)
   - Prefer LLM types on conflict (configurable)
   - Create MERGED candidates from both sources
2. Hallucination detection:
   - Verify LLM-only entities exist in source text
   - Check for fabricated relationships
   - Flag suspicious confidence scores
3. Consistency checking:
   - Validate relationship domain/range constraints
   - Check temporal coherence
   - Verify entity type compatibility
4. Final filtering:
   - Apply minimum confidence threshold
   - Remove rejected candidates (or keep with flag)
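
The hallucination check and final filtering could be approximated as in the sketch below. It treats an entity as grounded if its text occurs in the source, repairing offsets when the LLM reported wrong positions. The helper names and the dict-based entity shape are assumptions; the repair-by-search behavior in particular is an illustrative choice, not something the source specifies.

```python
from typing import Dict, List


def ground_entity(source_text: str, entity: Dict) -> Dict:
    """Verify that the entity text exists in the source; reject it otherwise."""
    text = entity["text"]
    start, end = entity["start_offset"], entity["end_offset"]
    if source_text[start:end] == text:
        return entity  # offsets already point at the claimed text
    position = source_text.find(text)
    if position < 0:
        return {**entity, "status": "REJECTED"}  # text absent: likely fabricated
    # Text exists but offsets were wrong: repair them (an assumption).
    return {**entity, "start_offset": position, "end_offset": position + len(text)}


def validate_entities(
    source_text: str,
    entities: List[Dict],
    minimum_confidence: float = 0.3,
    include_rejected: bool = False,
) -> List[Dict]:
    """Apply grounding and the confidence threshold, then filter the output."""
    output: List[Dict] = []
    for entity in entities:
        checked = ground_entity(source_text, entity)
        rejected = (
            checked.get("status") == "REJECTED"
            or checked["overall_confidence"] < minimum_confidence
        )
        checked = {**checked, "status": "REJECTED" if rejected else "VALIDATED"}
        if not rejected or include_rejected:
            output.append(checked)
    return output
```

Note that a high confidence score does not rescue an ungrounded entity here: a span that is absent from the source is rejected regardless of what the LLM reported.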

Merge Strategy

GLiNER2 Candidate: "Van Gogh" (AGT.PER, confidence=0.7)
LLM Candidate: "Vincent van Gogh" (AGT.PER, confidence=0.95)

Overlap ratio: 0.67 > threshold (0.3)
→ MERGE: Use LLM span + confidence, mark as MERGED
→ Result: "Vincent van Gogh" (AGT.PER, confidence=0.95, source=MERGED)
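
The document does not state how the overlap ratio is computed. One formula that reproduces the 0.67 in the example above is a character-level Dice coefficient: the spans share 8 characters and have lengths 8 and 16, so 2·8 / (8 + 16) ≈ 0.67. The sketch below uses that formula as an assumption; the function names and dict-based candidates are likewise illustrative.

```python
from typing import Dict, Optional


def overlap_ratio(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Dice-style overlap of two character spans: 2*|A and B| / (|A| + |B|)."""
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    total = (a_end - a_start) + (b_end - b_start)
    return 2 * shared / total if total else 0.0


def merge_candidates(
    gliner: Dict,
    llm: Dict,
    merge_threshold: float = 0.3,
    prefer_llm_on_conflict: bool = True,
) -> Optional[Dict]:
    """Merge two overlapping candidates, or return None if they overlap too little."""
    ratio = overlap_ratio(
        gliner["start_offset"], gliner["end_offset"],
        llm["start_offset"], llm["end_offset"],
    )
    if ratio < merge_threshold:
        return None  # treated as two distinct entities
    winner = llm if prefer_llm_on_conflict else gliner
    return {**winner, "source": "MERGED"}
```

With "van Gogh" at offsets 8 to 16 and "Vincent van Gogh" at 0 to 16, the ratio exceeds the default threshold of 0.3 and the LLM span and confidence are kept, matching the result shown above.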

Configuration

HybridConfig(
    enable_validation=True,        # Enable/disable this stage
    merge_threshold=0.3,           # Minimum overlap ratio for merging
    prefer_llm_on_conflict=True,   # LLM types take precedence
    minimum_confidence=0.3,        # Filter low-confidence results
    include_rejected=False,        # Include rejected in output
)

Data Structures

AnnotationCandidate

Shared intermediate representation used across all pipeline stages:

from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class AnnotationCandidate:
    candidate_id: str              # Unique identifier
    text: str                      # Extracted text span
    start_offset: int              # Character start position
    end_offset: int                # Character end position
    hypernym: Optional[str]        # Top-level type (AGT, GRP, TOP, etc.)
    hyponym: Optional[str]         # Fine-grained type (AGT.PER, GRP.HER)
    
    # Confidence scores
    detection_confidence: float    # GLiNER2 detection score
    classification_confidence: float  # Type classification score
    overall_confidence: float      # Combined confidence
    
    # Source tracking
    source: CandidateSource        # GLINER2, LLM, HYBRID, MERGED
    status: CandidateStatus        # DETECTED, REFINED, VALIDATED, REJECTED
    
    # Entity linking
    wikidata_id: Optional[str]
    viaf_id: Optional[str]
    isil_id: Optional[str]
    
    # Relationships (populated during LLM refinement)
    relationships: List[Dict[str, Any]]
    
    # Provenance
    provenance: Optional[Provenance]

RelationshipCandidate

Intermediate representation for relationships:

@dataclass
class RelationshipCandidate:
    relationship_id: str
    relationship_type: str         # e.g., REL.CRE.AUT, REL.SPA.LOC
    relationship_label: str        # Human-readable label
    
    subject_id: str                # Reference to AnnotationCandidate
    subject_text: str
    subject_type: str
    
    object_id: str
    object_text: str
    object_type: str
    
    temporal_scope: Optional[str]  # e.g., "1885-1890"
    spatial_scope: Optional[str]   # e.g., "Amsterdam"
    
    confidence: float
    is_valid: bool                 # Domain/range validation result
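
The is_valid flag records the outcome of a domain/range check. The sketch below shows one way such a check could work, using prefix matching on type codes. The DOMAIN_RANGE table is entirely hypothetical (the actual constraints for REL.SPA.LOC or any other relationship are not given in this document), as are the function names and the decision to treat unknown relationship types as invalid.

```python
from typing import Dict, Tuple

# HYPOTHETICAL constraints for illustration only:
# relationship type -> (allowed subject types, allowed object types)
DOMAIN_RANGE: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {
    "REL.SPA.LOC": (("AGT", "GRP", "TOP"), ("TOP",)),
}


def type_matches(type_code: str, allowed: Tuple[str, ...]) -> bool:
    """True if type_code equals an allowed code or is a subtype of one."""
    return any(
        type_code == code or type_code.startswith(code + ".")
        for code in allowed
    )


def check_domain_range(
    relationship_type: str, subject_type: str, object_type: str
) -> bool:
    """Return the value that would be stored in is_valid."""
    constraint = DOMAIN_RANGE.get(relationship_type)
    if constraint is None:
        return False  # unknown relationship types are not validated (assumption)
    domain, range_ = constraint
    return type_matches(subject_type, domain) and type_matches(object_type, range_)
```

Matching on the code prefix lets a constraint written against a hypernym (TOP) accept any of its hyponyms (TOP.SET, TOP.CTY) without listing each one.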

HybridAnnotationResult

Final output structure:

@dataclass
class HybridAnnotationResult:
    entities: List[AnnotationCandidate]
    relationships: List[RelationshipCandidate]
    source_text: str
    
    # Pipeline stage flags
    gliner_pass: bool = False
    llm_pass: bool = False
    validation_pass: bool = False
    
    # Statistics
    total_candidates: int = 0
    merged_count: int = 0
    rejected_count: int = 0

Usage

Basic Usage

import asyncio

from glam_extractor.annotators import HybridAnnotator, HybridConfig


async def main() -> None:
    # Default configuration
    annotator = HybridAnnotator()

    # Annotate text (annotate() is awaited, so it must run inside a coroutine)
    result = await annotator.annotate("""
        The Rijksmuseum in Amsterdam, founded in 1800, houses over 8,000 objects.
        Vincent van Gogh's works are among the most famous in the collection.
    """)

    # Access results
    for entity in result.entities:
        print(f"{entity.text}: {entity.hyponym} ({entity.overall_confidence:.2f})")

    for rel in result.relationships:
        print(f"{rel.subject_text} --[{rel.relationship_label}]--> {rel.object_text}")


asyncio.run(main())

Custom Configuration

config = HybridConfig(
    # Use smaller, faster GLiNER2 model
    gliner_model="urchade/gliner_small",
    gliner_threshold=0.6,
    
    # Use Claude instead of Z.AI
    llm_model="claude-3-sonnet-20240229",
    
    # Disable relationship extraction for speed
    enable_relationships=False,
    
    # Stricter filtering
    minimum_confidence=0.5,
)

annotator = HybridAnnotator(config=config)

GLiNER2-Only Mode

For maximum speed when relationships aren't needed:

config = HybridConfig(
    enable_fast_pass=True,
    enable_refinement=False,  # Skip LLM
    enable_validation=True,
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)  # call from inside an async function; text is your input string

LLM-Only Mode

When GLiNER2 isn't available or for maximum accuracy:

config = HybridConfig(
    enable_fast_pass=False,   # Skip GLiNER2
    enable_refinement=True,
    enable_validation=False,  # No cross-validation without GLiNER2
)

annotator = HybridAnnotator(config=config)
result = await annotator.annotate(text)  # call from inside an async function; text is your input string

Performance Characteristics

Configuration	Speed (relative to LLM-only)	Entity Recall	Entity Precision	Relationships
GLiNER2-only	~100x	High	Medium	None
LLM-only	1x	Medium	High	Full
Hybrid (default)	~10x	High	High	Full
Hybrid (no-rel)	~20x	High	High	None

Dependencies

Required

Python 3.10+
dataclasses (standard library)
typing (standard library)

Optional

gliner - For GLiNER2 fast-pass (gracefully degrades if not installed)
httpx or aiohttp - For LLM API calls
torch - For GLiNER2 GPU acceleration
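
Graceful degradation for an optional dependency is commonly implemented by probing for the module before using it. The sketch below shows that general pattern; it is not the check used in hybrid_annotator.py, and the helper names module_available and effective_fast_pass are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def effective_fast_pass(requested: bool, module: str = "gliner") -> bool:
    """Run the fast-pass stage only if it is enabled and its dependency is installed."""
    return requested and module_available(module)
```

With this pattern, a configuration that sets enable_fast_pass=True on a machine without gliner would fall back to the LLM stage alone rather than raising an ImportError.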

Install GLiNER2

pip install gliner

# For GPU support
pip install gliner torch

File Structure

src/glam_extractor/annotators/
├── __init__.py              # Module exports
├── base.py                  # EntityClaim, Provenance, hypernyms
├── hybrid_annotator.py      # HybridAnnotator, candidates, pipeline
├── llm_annotator.py         # LLMAnnotator, provider configs
└── schema_builder.py        # GLAMSchema, field specs

tests/annotators/
├── __init__.py
└── test_hybrid_annotator.py # 24 unit tests
