kempersc/glam - Forgejo

Author	SHA1	Message	Date
kempersc	c0d31b3905	fix(rag): add fallback imports for semantic_router and temporal_intent Support both relative and absolute imports for running as module or script.	2026-01-09 18:26:40 +01:00
kempersc	ce66a294e5	fix(rag): transform SPARQL results to match frontend metadata format for map coordinates - Convert flat SPARQL results {lat, lon} to nested {metadata: {latitude, longitude}} - Parse string coordinates to float values - Add city/country/institution_type from template slots - Enables ChatMapPanel to render map markers correctly	2026-01-09 15:49:18 +01:00
kempersc	787f4dacb0	feat(rag): implement database routing in query endpoint Log database routing decisions and add databases_used to response metadata. When template specifies databases: ["oxigraph"], Qdrant vector search is skipped.	2026-01-09 12:15:49 +01:00
kempersc	4d5641b6c5	feat(rag): add database routing configuration to templates - Add 'databases' field to TemplateDefinition and TemplateMatchResult - Support values: 'oxigraph' (SPARQL/KG), 'qdrant' (vector search) - Add helper methods use_oxigraph() and use_qdrant() - Default to both databases for backward compatibility - Allows templates to skip vector search for factual/geographic queries	2026-01-09 11:54:17 +01:00
kempersc	c88fd3af70	Refactor code structure for improved readability and maintainability	2026-01-09 11:05:26 +01:00
kempersc	6608a207d4	update frontend	2026-01-08 15:56:28 +01:00
kempersc	0b0ea75070	feat(rag): add factual query fast path - skip LLM for count/list queries - Add ontology cache warming at startup in lifespan() function - Add is_factual_query() detection in template_sparql.py (12 templates) - Add factual_result and sparql_query fields to DSPyQueryResponse - Skip LLM generation for factual templates (count, list, compare) - Execute SPARQL directly and return results as table (~15s → ~2s latency) - Update ConversationPanel.tsx to render factual results table - Add CSS styling for factual results with green theme For queries like 'hoeveel archieven zijn er in Den Haag' ("how many archives are there in The Hague"), the SPARQL results ARE the answer - no need for expensive LLM prose generation.	2026-01-08 13:34:23 +01:00
kempersc	99dc608826	Refactor RAG to template-based SPARQL generation Major architectural changes based on Formica et al. (2023) research: - Add TemplateClassifier for deterministic SPARQL template matching - Add SlotExtractor with synonym resolution for slot values - Add TemplateInstantiator using Jinja2 for query rendering - Refactor dspy_heritage_rag.py to use template system - Update main.py with streamlined pipeline - Fix semantic_router.py ordering issues - Add comprehensive metrics tracking Template-based approach achieves 65% precision vs 10% LLM-only per Formica et al. research on SPARQL generation.	2026-01-07 22:04:43 +01:00
kempersc	98c42bf272	Fix LinkML URI conflicts and generate RDF outputs - Fix scope_note → finding_aid_scope_note in FindingAid.yaml - Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead) - Remove duplicate rico_record_set_type from class_metadata_slots.yaml - Fix range types for equals_string compatibility (uriorcurie → string) - Move class names from close_mappings to see_also in 10 RecordSetTypes files - Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context - Sync schemas to frontend/public/schemas/ Files: 1,151 changed (includes prior CustodianType migration)	2026-01-07 12:32:59 +01:00
kempersc	11983014bb	Enhance specificity scoring system integration with existing infrastructure - Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework. - Added detailed mapping of SPARQL templates to context templates for improved specificity filtering. - Implemented wrapper patterns around existing classifiers to extend functionality without duplication. - Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality. - Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.	2026-01-05 17:37:49 +01:00
kempersc	2dca28d8c1	enrich CH entries with mission statements	2026-01-04 13:12:32 +01:00
kempersc	4f0cafe98a	enrich HC profiles	2026-01-02 02:11:04 +01:00
kempersc	349f31ae6f	enrich custodian profiles	2026-01-02 02:10:18 +01:00
kempersc	1d8fd68e3a	backup custodian web profiles	2025-12-30 23:53:16 +01:00
kempersc	30b701a5ec	backup HC data	2025-12-30 23:41:15 +01:00
kempersc	90b402dba6	enrich AR and Czech files	2025-12-30 23:01:01 +01:00
kempersc	d64f857aa9	add sparql validator and RAG injector	2025-12-30 03:43:31 +01:00
kempersc	84904e344b	Make AGENTS more succinct by referring to opencode rules & enrich custodians	2025-12-28 14:56:35 +01:00
kempersc	cdb633b0c9	enrich custodian entries with logo	2025-12-27 02:15:17 +01:00
kempersc	6af5009444	enrich entries	2025-12-26 21:41:18 +01:00
kempersc	59963c8d3f	Logo enrichment batch: JP+300, CZ-0 - 12,833 files (40.4%) - JP: 4,496 processed (37.2% of 12,096) ✅ COMPLETE - CZ: 2,820 processed (33.4% of 8,432) - batch completed, slight decrease - CH, NL, BE, AT, BR: 100% complete - Total: 12,833 of 31,772 files (40.4%) - Using crawl4ai favicon extraction	2025-12-26 13:42:21 +01:00
kempersc	fb7993e3af	fix: filter DSPy field markers from streaming output Implements a state machine to filter streaming tokens: - Only stream tokens from the 'answer' field to the frontend - Skip tokens from 'reasoning', 'citations', 'confidence', 'follow_up' fields - Remove DSPy field markers like '[[ ## answer ## ]]' from streamed content This fixes the issue where raw DSPy signature field markers were being displayed in the chat interface instead of clean answer text.	2025-12-26 03:11:44 +01:00
kempersc	6ab0b19ae2	Logo enrichment batch: CZ+260, JP+260 - 11,663 files (36.7%) - CZ: 2,810 processed (33.3% of 8,432) - JP: 3,336 processed (27.6% of 12,096) - Total: 11,663 of 31,772 (36.7%) - Using crawl4ai favicon extraction	2025-12-25 19:23:41 +01:00
kempersc	717ee3408a	Logo enrichment batch: JP+771, CZ+380 - 10,913 files (34%) - JP: 2,846 processed (24% of 12,096) - CZ: 2,550 processed (30% of 8,432) - CH, NL, BE, AT, BR: 100% complete - Total: 10,913 of 31,772 files (34%) - Using crawl4ai favicon extraction	2025-12-25 13:44:26 +01:00
kempersc	38292d1918	enrich: logo enrichment for JP custodians (1350 processed, 10746 remaining)	2025-12-23 20:56:21 +01:00
kempersc	5e8a432ef0	enrich Japanese and Dutch custodians	2025-12-23 18:08:45 +01:00
kempersc	0c1d19e98b	enrich entries	2025-12-23 13:27:35 +01:00
kempersc	879cddc47e	fix(rag): update HeritageSPARQLGenerator with correct ontology - Use hc: <https://w3id.org/heritage/custodian/> prefix - Use hc:institutionType with single-letter codes (M, L, A, etc.) - Use Wikidata URIs for countries (Q55=NL, Q31=BE, etc.) - Update all SPARQL examples to use correct ontology - Align with actual RDF data in Oxigraph	2025-12-22 22:32:08 +01:00
kempersc	8e97a7beca	fix(rag): correct SPARQL ontology prefixes for LinkML schema - Update HeritageSPARQLGenerator docstring with correct prefixes - Change main class from hc:Custodian to crm:E39_Actor - Change type property from hcp:institutionType to org:classification - Update type values from single letters to full names (MUSEUM, ARCHIVE, etc.) - Add rate limit handling with exponential backoff for 429 errors - Fix test_live_rag.py sample queries to use correct ontology - Update optimized_models instructions with correct prefixes	2025-12-22 21:31:08 +01:00
kempersc	7a056fa746	enrich entries	2025-12-21 22:12:34 +01:00
kempersc	aca68ea47f	remove ambiguous web-claims	2025-12-21 00:01:54 +01:00
kempersc	23b1d8ee5f	clean up GHCID	2025-12-17 11:58:40 +01:00
kempersc	99430c2a70	add new entries and semantic routing	2025-12-17 10:11:56 +01:00
kempersc	68c5aa2724	feat(api): Add heritage person classification and RAG retry logic - Add GLAMORCUBESFIXPHDNT heritage type detection for person profiles - Two-stage classification: blocklist non-heritage orgs, then match keywords - Special handling for Digital (D) type: requires heritage org context - Add career_history heritage_relevant and heritage_type fields - Add exponential backoff retry for Anthropic API overload errors - Fix DSPy 3.x async context with dspy.context() wrapper	2025-12-15 01:31:54 +01:00
kempersc	c6aee998db	correct person labels	2025-12-14 17:29:39 +01:00
kempersc	c50c35fd3a	enrich person custodian	2025-12-14 17:09:55 +01:00
kempersc	d1c9aebd84	feat(rag): Add hybrid language detection and enhanced ontology mapping Implement Heritage RAG pipeline enhancements: 1. Ontology Mapping (new file: ontology_mapping.py) - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy - FastText-based ML detection with 0.6 confidence threshold - Support for Dutch, French, German, Spanish, Italian, Portuguese, English - Dynamic synonym extraction from LinkML enum values - 93 comprehensive tests (all passing) 2. Schema Loader Enhancements (schema_loader.py) - Language-tagged multilingual synonym extraction for DSPy signatures - Enhanced enum value parsing with annotations support - Better error handling for malformed schema files 3. DSPy Heritage RAG (dspy_heritage_rag.py) - Fixed all 10 mypy type errors - Enhanced type annotations throughout - Improved query routing with multilingual support 4. Dependencies (pyproject.toml) - Added fast-langdetect ^1.0.0 (primary language detection) - Added types-pyyaml ^6.0.12 (mypy type stubs) Tests: 93 new tests for ontology_mapping, all passing Mypy: Clean (no type errors)	2025-12-14 15:55:18 +01:00
kempersc	505c12601a	Add test script for PiCo extraction from Arabic waqf documents - Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents. - The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results. - Added comprehensive logging for API responses, extraction results, and validation errors. - Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.	2025-12-12 17:50:17 +01:00
kempersc	b1f93b6f22	enrich person profiles	2025-12-12 12:51:10 +01:00
kempersc	1b1cfbfca0	enrich custodians	2025-12-11 22:32:09 +01:00
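
Commit ce66a294e5 describes reshaping flat SPARQL rows ({lat, lon}) into the nested {metadata: {latitude, longitude}} form the frontend map expects, with string coordinates parsed to floats. The following is a minimal sketch of that kind of transform, not the repository's code: the function names `to_frontend_result` and `_to_float`, the `slots` argument, and the choice to return `None` for unparseable coordinates are all assumptions made for illustration.

```python
from typing import Any, Optional


def _to_float(value: Any) -> Optional[float]:
    """Parse a coordinate that may arrive as a string; None if unusable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def to_frontend_result(row: dict[str, Any], slots: dict[str, str]) -> dict[str, Any]:
    """Nest a flat SPARQL row under a `metadata` key (hypothetical helper)."""
    metadata = {
        "latitude": _to_float(row.get("lat")),
        "longitude": _to_float(row.get("lon")),
        # City, country and institution type are taken from the template
        # slots, as the commit message describes.
        "city": slots.get("city"),
        "country": slots.get("country"),
        "institution_type": slots.get("institution_type"),
    }
    # Keep every other field of the row, dropping the flat coordinates.
    rest = {k: v for k, v in row.items() if k not in ("lat", "lon")}
    return {**rest, "metadata": metadata}
```

A call such as `to_frontend_result({"name": "Example Archive", "lat": "52.08", "lon": "4.31"}, {"city": "Den Haag"})` would yield a dict whose `metadata` holds float coordinates plus the slot values, which is the shape the commit says ChatMapPanel needs for map markers.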

40 commits
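
Commit fb7993e3af describes a state machine that streams only the 'answer' field to the frontend and removes DSPy field markers such as '[[ ## answer ## ]]'. The sketch below shows one way such a filter could work, assuming markers always have exactly that spacing and may be split across streamed tokens. The name `stream_answer` and the buffering strategy are illustrative assumptions, not the repository's implementation.

```python
import re
from typing import Iterable, Iterator, Optional

# Assumed marker shape, based on the example in the commit message.
MARKER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def _safe_length(buffer: str) -> int:
    """Length of the prefix that cannot belong to an unfinished marker."""
    start = buffer.rfind("[[")
    if start != -1:
        return start
    return len(buffer) - 1 if buffer.endswith("[") else len(buffer)


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    """Yield only text from the 'answer' field, with markers removed."""
    field: Optional[str] = None  # state: the field currently being streamed
    buffer = ""                  # pending text; may end in a partial marker
    for token in tokens:
        buffer += token
        # Consume every complete marker, switching state at each one.
        match = MARKER.search(buffer)
        while match is not None:
            before = buffer[: match.start()]
            if field == "answer" and before:
                yield before
            field = match.group(1)
            buffer = buffer[match.end():]
            match = MARKER.search(buffer)
        # Emit what is safe; hold back a possible marker prefix.
        keep = _safe_length(buffer)
        if field == "answer" and keep:
            yield buffer[:keep]
        buffer = buffer[keep:]
    # End of stream: whatever was held back was not a marker after all.
    if field == "answer" and buffer:
        yield buffer
```

One design consequence of this sketch: a literal '[[' inside the answer is held back until the next marker or the end of the stream, which delays that text slightly but never drops it.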