Commit graph

14 commits

Author SHA1 Message Date
kempersc
0c1d19e98b enrich entries 2025-12-23 13:27:35 +01:00
kempersc
879cddc47e fix(rag): update HeritageSPARQLGenerator with correct ontology
- Use hc: <https://w3id.org/heritage/custodian/> prefix
- Use hc:institutionType with single-letter codes (M, L, A, etc.)
- Use Wikidata URIs for countries (Q55=NL, Q31=BE, etc.)
- Update all SPARQL examples to use correct ontology
- Align with actual RDF data in Oxigraph
2025-12-22 22:32:08 +01:00
kempersc
8e97a7beca fix(rag): correct SPARQL ontology prefixes for LinkML schema
- Update HeritageSPARQLGenerator docstring with correct prefixes
- Change main class from hc:Custodian to crm:E39_Actor
- Change type property from hcp:institutionType to org:classification
- Update type values from single letters to full names (MUSEUM, ARCHIVE, etc.)
- Add rate limit handling with exponential backoff for 429 errors
- Fix test_live_rag.py sample queries to use correct ontology
- Update optimized_models instructions with correct prefixes
2025-12-22 21:31:08 +01:00
kempersc
7a056fa746 enrich entries 2025-12-21 22:12:34 +01:00
kempersc
aca68ea47f remove a,bihguous web-claims 2025-12-21 00:01:54 +01:00
kempersc
23b1d8ee5f clean up GHCID 2025-12-17 11:58:40 +01:00
kempersc
99430c2a70 add new entries and semantic routing 2025-12-17 10:11:56 +01:00
kempersc
68c5aa2724 feat(api): Add heritage person classification and RAG retry logic
- Add GLAMORCUBESFIXPHDNT heritage type detection for person profiles
- Two-stage classification: blocklist non-heritage orgs, then match keywords
- Special handling for Digital (D) type: requires heritage org context
- Add career_history heritage_relevant and heritage_type fields
- Add exponential backoff retry for Anthropic API overload errors
- Fix DSPy 3.x async context with dspy.context() wrapper
2025-12-15 01:31:54 +01:00
kempersc
c6aee998db correct person labels 2025-12-14 17:29:39 +01:00
kempersc
c50c35fd3a enrich person custodian 2025-12-14 17:09:55 +01:00
kempersc
d1c9aebd84 feat(rag): Add hybrid language detection and enhanced ontology mapping
Implement Heritage RAG pipeline enhancements:

1. Ontology Mapping (new file: ontology_mapping.py)
   - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default
   - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy
   - FastText-based ML detection with 0.6 confidence threshold
   - Support for Dutch, French, German, Spanish, Italian, Portuguese, English
   - Dynamic synonym extraction from LinkML enum values
   - 93 comprehensive tests (all passing)

2. Schema Loader Enhancements (schema_loader.py)
   - Language-tagged multilingual synonym extraction for DSPy signatures
   - Enhanced enum value parsing with annotations support
   - Better error handling for malformed schema files

3. DSPy Heritage RAG (dspy_heritage_rag.py)
   - Fixed all 10 mypy type errors
   - Enhanced type annotations throughout
   - Improved query routing with multilingual support

4. Dependencies (pyproject.toml)
   - Added fast-langdetect ^1.0.0 (primary language detection)
   - Added types-pyyaml ^6.0.12 (mypy type stubs)

Tests: 93 new tests for ontology_mapping, all passing
Mypy: Clean (no type errors)
2025-12-14 15:55:18 +01:00
kempersc
505c12601a Add test script for PiCo extraction from Arabic waqf documents
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
2025-12-12 17:50:17 +01:00
kempersc
b1f93b6f22 enrich person profiles 2025-12-12 12:51:10 +01:00
kempersc
1b1cfbfca0 enrich custodians 2025-12-11 22:32:09 +01:00