glam/backend/rag
kempersc d1c9aebd84 feat(rag): Add hybrid language detection and enhanced ontology mapping
Implement Heritage RAG pipeline enhancements:

1. Ontology Mapping (new file: ontology_mapping.py)
   - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default
   - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy
   - FastText-based ML detection with 0.6 confidence threshold
   - Support for Dutch, French, German, Spanish, Italian, Portuguese, English
   - Dynamic synonym extraction from LinkML enum values
   - 93 comprehensive tests (all passing)
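
   The detection cascade listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of ontology_mapping.py: the vocabulary entries, the detect_language name, and the injected ml_detector callable (standing in for the fast-langdetect call) are all assumptions, as is treating a score of exactly 0.6 as confident.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical excerpt; the real HERITAGE_VOCABULARY (~40 terms) is not shown in the commit.
HERITAGE_VOCABULARY = {
    "erfgoed": "nl",
    "archief": "nl",
    "patrimoine": "fr",
    "denkmal": "de",
}

# Languages named in the commit message, as ISO 639-1 codes.
SUPPORTED_LANGUAGES = {"nl", "fr", "de", "es", "it", "pt", "en"}
CONFIDENCE_THRESHOLD = 0.6
DEFAULT_LANGUAGE = "en"

# Assumed detector shape: text -> (language code, confidence score).
MLDetector = Callable[[str], Tuple[str, float]]


def detect_language(text: str, ml_detector: Optional[MLDetector] = None) -> str:
    """Return a language code using vocabulary -> ML detector -> English default."""
    # Step 1: domain vocabulary wins outright, since short heritage queries
    # are where statistical detectors tend to be least reliable.
    for token in re.findall(r"\w+", text.lower()):
        lang = HERITAGE_VOCABULARY.get(token)
        if lang is not None:
            return lang

    # Step 2: accept the ML result only if it is a supported language
    # and meets the confidence threshold.
    if ml_detector is not None:
        lang, score = ml_detector(text)
        if lang in SUPPORTED_LANGUAGES and score >= CONFIDENCE_THRESHOLD:
            return lang

    # Step 3: fall back to English.
    return DEFAULT_LANGUAGE
```

   Passing the detector in as a callable keeps the sketch independent of any particular fast-langdetect version and makes the fallback path easy to test with a stub.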

2. Schema Loader Enhancements (schema_loader.py)
   - Language-tagged multilingual synonym extraction for DSPy signatures
   - Enhanced enum value parsing with annotations support
   - Better error handling for malformed schema files
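
   As a rough sketch of language-tagged synonym extraction from enum annotations: the example below walks a schema that has already been parsed into plain dicts (the shape a YAML loader would produce for a LinkML "enums" section). The enum name, the "synonyms_<lang>" annotation keys, and the extract_synonyms function are hypothetical; the commit does not show which annotation names schema_loader.py actually reads.

```python
from collections import defaultdict
from typing import Any, Dict

# Hypothetical parsed schema fragment; names and annotation keys are illustrative only.
SCHEMA: Dict[str, Any] = {
    "enums": {
        "InstitutionTypeEnum": {
            "permissible_values": {
                "MUSEUM": {
                    "annotations": {
                        "synonyms_nl": "museum, musea",
                        "synonyms_fr": "musée",
                    }
                },
                "ARCHIVE": {"annotations": {"synonyms_nl": "archief"}},
                # A value with no body parses to None in YAML.
                "LIBRARY": None,
            }
        }
    }
}


def extract_synonyms(
    schema: Dict[str, Any], prefix: str = "synonyms_"
) -> Dict[str, Dict[str, str]]:
    """Map language code -> {synonym: enum value name}, tolerating missing sections."""
    result: Dict[str, Dict[str, str]] = defaultdict(dict)
    for enum in (schema.get("enums") or {}).values():
        values = (enum or {}).get("permissible_values") or {}
        for value_name, body in values.items():
            annotations = (body or {}).get("annotations") or {}
            for key, raw in annotations.items():
                if not key.startswith(prefix):
                    continue
                lang = key[len(prefix):]
                # Annotation text is assumed to be a comma-separated list.
                for term in str(raw).split(","):
                    term = term.strip().lower()
                    if term:
                        result[lang][term] = value_name
    return dict(result)
```

   With the fragment above, extract_synonyms(SCHEMA)["nl"] maps "museum", "musea" and "archief" to their enum values, and values without annotations are skipped rather than raising.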

3. DSPy Heritage RAG (dspy_heritage_rag.py)
   - Fixed all 10 mypy type errors
   - Enhanced type annotations throughout
   - Improved query routing with multilingual support

4. Dependencies (pyproject.toml)
   - Added fast-langdetect ^1.0.0 (primary language detection)
   - Added types-pyyaml ^6.0.12 (mypy type stubs)

Tests: 93 new tests for ontology_mapping, all passing
Mypy: Clean (no type errors)
2025-12-14 15:55:18 +01:00
optimized_models enrich custodians 2025-12-11 22:32:09 +01:00
__init__.py enrich custodians 2025-12-11 22:32:09 +01:00
atomic_decomposer.py enrich custodians 2025-12-11 22:32:09 +01:00
cache_config.py enrich custodians 2025-12-11 22:32:09 +01:00
dspy_heritage_rag.py feat(rag): Add hybrid language detection and enhanced ontology mapping 2025-12-14 15:55:18 +01:00
gepa_training_extended.py enrich custodians 2025-12-11 22:32:09 +01:00
main.py Add test script for PiCo extraction from Arabic waqf documents 2025-12-12 17:50:17 +01:00
ontology_mapping.py feat(rag): Add hybrid language detection and enhanced ontology mapping 2025-12-14 15:55:18 +01:00
optimization_log.txt enrich custodians 2025-12-11 22:32:09 +01:00
requirements.txt enrich custodians 2025-12-11 22:32:09 +01:00
run_bootstrap_optimization.py enrich custodians 2025-12-11 22:32:09 +01:00
run_gepa_optimization.py enrich custodians 2025-12-11 22:32:09 +01:00
schema_loader.py feat(rag): Add hybrid language detection and enhanced ontology mapping 2025-12-14 15:55:18 +01:00
semantic_cache.py enrich person profiles 2025-12-12 12:51:10 +01:00
test_dspy_rag.py enrich custodians 2025-12-11 22:32:09 +01:00
test_live_rag.py enrich person profiles 2025-12-12 12:51:10 +01:00