Commit graph

16 commits

Author SHA1 Message Date
kempersc
fce186b649 enrich person profiles 2026-01-11 18:08:40 +01:00
kempersc
11983014bb Enhance specificity scoring system integration with existing infrastructure
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
2026-01-05 17:37:49 +01:00
kempersc
4f0cafe98a enrich HC profiles 2026-01-02 02:11:04 +01:00
kempersc
45e873ec0a enrich JP BE AR profiles 2025-12-30 23:07:03 +01:00
kempersc
d64f857aa9 add sparql validator and RAG injector 2025-12-30 03:43:31 +01:00
kempersc
84904e344b Make AGENTS more succinct by referring to opencode rules & enrich custodians 2025-12-28 14:56:35 +01:00
kempersc
99430c2a70 add new entries and semantic routing 2025-12-17 10:11:56 +01:00
kempersc
d1c9aebd84 feat(rag): Add hybrid language detection and enhanced ontology mapping
Implement Heritage RAG pipeline enhancements:

1. Ontology Mapping (new file: ontology_mapping.py)
   - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default
   - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy
   - FastText-based ML detection with 0.6 confidence threshold
   - Support for Dutch, French, German, Spanish, Italian, Portuguese, English
   - Dynamic synonym extraction from LinkML enum values
   - 93 comprehensive tests (all passing)

2. Schema Loader Enhancements (schema_loader.py)
   - Language-tagged multilingual synonym extraction for DSPy signatures
   - Enhanced enum value parsing with annotations support
   - Better error handling for malformed schema files

3. DSPy Heritage RAG (dspy_heritage_rag.py)
   - Fixed all 10 mypy type errors
   - Enhanced type annotations throughout
   - Improved query routing with multilingual support

4. Dependencies (pyproject.toml)
   - Added fast-langdetect ^1.0.0 (primary language detection)
   - Added types-pyyaml ^6.0.12 (mypy type stubs)

Tests: 93 new tests for ontology_mapping, all passing
Mypy: Clean (no type errors)
2025-12-14 15:55:18 +01:00
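Note on d1c9aebd84: the fallback chain described in the message (heritage vocabulary, then ML detection with a 0.6 confidence threshold, then English) could look roughly like the sketch below. This is an illustration, not the code in ontology_mapping.py. The vocabulary entries, the `detect_language` name, the `ml_detector` callable, and the use of `>=` for the threshold are all assumptions. The ML step is passed in as a plain function so the sketch runs without fast-langdetect installed.

```python
from typing import Callable, Optional, Tuple

# Hypothetical excerpt: the commit mentions ~40 terms, but not which ones.
HERITAGE_VOCABULARY = {
    "archief": "nl",
    "bibliotheek": "nl",
    "musée": "fr",
    "bibliothèque": "fr",
    "archiv": "de",
    "archivo": "es",
}

CONFIDENCE_THRESHOLD = 0.6  # value taken from the commit message
DEFAULT_LANGUAGE = "en"     # English default, per the commit message

# Assumed detector shape: text in, (language code, confidence) out.
Detector = Callable[[str], Tuple[str, float]]


def detect_language(text: str, ml_detector: Optional[Detector] = None) -> str:
    """Return a language code using vocabulary -> ML detector -> default."""
    # Step 1: domain vocabulary lookup on lowercased, punctuation-stripped tokens.
    for token in text.lower().split():
        lang = HERITAGE_VOCABULARY.get(token.strip(".,;:!?\"'()"))
        if lang is not None:
            return lang

    # Step 2: ML detection, accepted only at or above the threshold.
    if ml_detector is not None:
        lang, score = ml_detector(text)
        if score >= CONFIDENCE_THRESHOLD:
            return lang

    # Step 3: fall back to the default language.
    return DEFAULT_LANGUAGE
```

In this sketch a vocabulary hit short-circuits the ML step, which is one reading of "heritage vocabulary -> fast-langdetect -> English default"; the actual implementation may combine the signals differently.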
kempersc
891692a4d6 feat(ghcid): add diacritics normalization and transliteration scripts
- Add fix_ghcid_diacritics.py for normalizing non-ASCII in GHCIDs
- Add resolve_diacritics_collisions.py for collision handling
- Add transliterate_emic_names.py for non-Latin script handling
- Add transliteration tests
2025-12-08 14:59:28 +01:00
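Note on 891692a4d6: a minimal sketch of what diacritics normalization and collision detection can involve, using only the standard library. It is not taken from fix_ghcid_diacritics.py or resolve_diacritics_collisions.py. The function names and the sample identifiers are hypothetical, and the real scripts may apply different rules, for example for letters such as "Ø" or "ß" that have no Unicode decomposition.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def strip_diacritics(value: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(identifiers: Iterable[str]) -> Dict[str, List[str]]:
    """Group identifiers that become identical after normalization."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for identifier in identifiers:
        groups[strip_diacritics(identifier)].append(identifier)
    # Only groups with more than one member are collisions.
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical identifiers, for illustration only.
sample = ["XX-AA-MÜS", "XX-AA-MUS", "XX-BB-ARC"]
collisions = find_collisions(sample)
```

Normalizing first and grouping second shows why a separate collision-resolution step is needed: two identifiers that were distinct before normalization can map to the same ASCII form afterwards.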
kempersc
4da64eeebf improve annotator 2025-12-05 16:25:39 +01:00
kempersc
3a242370fc annotation standards added 2025-12-05 15:30:23 +01:00
kempersc
2761857b0d Add scripts for converting OWL/Turtle ontology to Mermaid and PlantUML diagrams
- Implemented `owl_to_mermaid.py` to convert OWL/Turtle files into Mermaid class diagrams.
- Implemented `owl_to_plantuml.py` to convert OWL/Turtle files into PlantUML class diagrams.
- Added two new PlantUML files for custodian multi-aspect diagrams.
2025-11-22 23:01:13 +01:00
kempersc
fa5680f0dd Add initial versions of custodian hub UML diagrams in Mermaid and PlantUML formats
- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.
2025-11-22 14:33:51 +01:00
kempersc
edb1e07941 updated schemata 2025-11-21 22:12:33 +01:00
kempersc
3c80de87e0 add isil entries 2025-11-19 23:25:22 +01:00
kempersc
e5a532a8bc Add comprehensive tests for NLP institution extraction and RDF partnership integration
- Introduced `test_nlp_extractor.py` with unit tests for the InstitutionExtractor, covering various extraction patterns (ISIL, Wikidata, VIAF, city names) and ensuring proper classification of institutions (museum, library, archive).
- Added tests for extracted entities and result handling to validate the extraction process.
- Created `test_partnership_rdf_integration.py` to validate the end-to-end process of extracting partnerships from a conversation and exporting them to RDF format.
- Implemented tests for temporal properties in partnerships and ensured compliance with W3C Organization Ontology patterns.
- Verified that extracted partnerships are correctly linked with PROV-O provenance metadata.
2025-11-19 23:20:47 +01:00