kempersc/glam - Forgejo

Author	SHA1	Message	Date
kempersc	3dd1f11059	chore: sync repository to self-hosted Forgejo	2026-01-11 00:12:16 +01:00
kempersc	a4184cb805	feat(infra): add webhook-based schema deployment pipeline - Add FastAPI webhook receiver for Forgejo push events - Add setup script for server deployment - Add Caddy snippet for webhook endpoint - Add local sync-schemas.sh helper script - Sync frontend schemas with source (archived deprecated slots) Infrastructure scripts staged for optional webhook deployment. Current deployment uses: ./infrastructure/deploy.sh --frontend	2026-01-10 21:45:02 +01:00
kempersc	f02cffe1e8	refactor(schema): migrate 5 deprecated slots to temporal naming convention Migrate slots to follow RiC-O-style temporal naming (Rule 39): - accepts_external_work → accepts_or_accepted_external_work - accepts_visiting_scholars → accepts_or_accepted_visiting_scholar - accepts_payment_methods → accepts_or_accepted_payment_method - access → has_or_had_access_condition - access_policy_ref → has_or_had_access_policy_reference Updated classes to use new slot names: - ConservationLab.yaml - ResearchCenter.yaml - GiftShop.yaml - ArchiveReference.yaml - FindingAid.yaml - Collection.yaml Archived deprecated slots to schemas/20251121/linkml/archive/slots/ with _archived_20260110 suffix per Rule 9 (enum-to-class principle).	2026-01-10 21:09:29 +01:00
kempersc	ac36b80476	feat(rag): add companion queries for count templates Add companion_query support to fetch full entity records alongside aggregate count queries. Enables displaying results on map/list when asking 'how many museums in Amsterdam?' Backend changes: - Add companion_query, companion_query_region, companion_query_country fields to TemplateDefinition and TemplateMatchResult - Add render_template_string() for raw companion query rendering Template changes: - Add companion queries to count_institutions_by_type_and_location for settlement, region, and country level queries - Returns institution URI, name, coordinates, city for visualization	2026-01-10 18:44:06 +01:00
kempersc	f8b4ecad7d	data(person): enrich 7 person profiles with detailed employment history Update heritage professional profiles with: - Separate role entries for different positions at same institution - Employment date ranges (start_date, end_date) - Updated observed_on timestamps - Direct LinkedIn profile URLs as source Profiles updated: - Antoinet Nijssen (Noord-Hollands Archief) - Anna Lakmaker - Annelies Reus - Marianne Hamersma - Marcel Auwers - Hans Felius - Nico Vriend	2026-01-10 18:43:27 +01:00
kempersc	6c19ef8661	feat(rag): add Rule 46 epistemic provenance tracking Track full lineage of RAG responses: WHERE data comes from, WHEN it was retrieved, HOW it was processed (SPARQL/vector/LLM). Backend changes: - Add provenance.py with EpistemicProvenance, DataTier, SourceAttribution - Integrate provenance into MultiSourceRetriever.merge_results() - Return epistemic_provenance in DSPyQueryResponse Frontend changes: - Pass EpistemicProvenance through useMultiDatabaseRAG hook - Display provenance in ConversationPage (for cache transparency) Schema fixes: - Fix truncated example in has_observation.yaml slot definition References: - Pavlyshyn's Context Graphs and Data Traces paper - LinkML ProvenanceBlock schema pattern	2026-01-10 18:42:43 +01:00
kempersc	54dd4a9803	docs(server): add SERVER_OPERATIONS.md for Hetzner cx32 deployment Document server disk architecture, PyTorch CPU-only setup, service management, and recovery procedures learned from disk space crisis. - Document dual-disk architecture (/: root 75GB, /mnt/data: 49GB) - PyTorch CPU-only installation via --index-url whl/cpu - Custodian data symlink: /mnt/data/custodian → /var/lib/glam/api/data/ - Service restart procedures for Oxigraph, GLAM API, Qdrant, etc. - Emergency recovery commands for disk space crises	2026-01-10 18:42:15 +01:00
kempersc	28c3aaf33f	enrich profiles	2026-01-10 17:31:02 +01:00
kempersc	bd257c52f4	data(person): update 2 additional profiles	2026-01-10 15:39:12 +01:00
kempersc	2f33e6a230	data(person): update DR-STAPEL profile	2026-01-10 15:38:37 +01:00
kempersc	cce484c6b8	feat(archief-assistent): enhance semantic cache with ontology-driven vocabulary - Integrate tier-2 embeddings from types-vocab.json - Add segment-based caching for improved retrieval - Update tests and documentation	2026-01-10 15:38:11 +01:00
kempersc	ad74d8379e	feat(scripts): improve types-vocab extraction to derive all vocabulary from schema - Remove hardcoded type mappings, derive dynamically from LinkML - Extract keywords from annotations, structured_aliases, and comments - Add rename_plural_slot.py utility for schema slot renaming	2026-01-10 15:37:52 +01:00
kempersc	ec18e1810d	data(person): enrich 7 profiles with detailed affiliations and GHCIDs - Add GHCID references to custodian affiliations - Add start dates for employment periods - Expand heritage type classifications (A→[A,F]) - Add detailed rationales based on career history - Add full_initials from archival publications	2026-01-10 15:36:49 +01:00
kempersc	626bd3a095	refactor(schemas): apply naming conventions to 261 class files - Apply Rule 39: RiC-O style hasOrHad/isOrWas for temporal slots - Apply Rule 43: Singular noun convention (keywords → keyword) - Update slot references to match renamed slot files - Maintain schema integrity across all class definitions	2026-01-10 15:36:33 +01:00
kempersc	94bfc9061e	refactor(schemas): consolidate slot definitions and remove 305 redundant files - Apply Rule 39: RiC-O style temporal naming (hasOrHad, isOrWas) - Apply Rule 43: Singular noun convention for slot names - Remove duplicate slot definitions consolidated into centralized files - Net reduction: 6,162 lines across 305 deleted files	2026-01-10 15:36:13 +01:00
kempersc	13938c92ca	chore(schemas): sync LinkML schemas to frontend apps Copies authoritative schemas from schemas/20251121/ to: - frontend/public/schemas/20251121/ - apps/archief-assistent/public/schemas/20251121/ This ensures slot definitions with corrected ontology property references (commit `2808dad6cd`) are available to frontend apps.	2026-01-10 15:02:25 +01:00
kempersc	e5a08a353d	enrich person profiles	2026-01-10 14:14:04 +01:00
kempersc	9339de2cfb	data(person): process 44,512 heritage-relevant profiles from entity extractions Processing Summary: - Scanned 94,716 LinkedIn entity files - Identified 44,512 heritage-relevant individuals (47%) - Created 1,430 new PPID-formatted profiles - Updated 43,070 existing profiles with entity data - Final count: 40,731 person profiles Profile updates include: - Merged web_claims with full provenance - Added/updated heritage_relevance scoring - Added affiliation data with custodian references - Added inferred birth decades with provenance chains (Rule 45) All data preserved per Rule 5 (additive only)	2026-01-10 14:01:29 +01:00
kempersc	3a15f2bdaa	feat(scripts): add entity-to-PPID processing script - Processes 94,716 LinkedIn entity files from data/custodian/person/entity/ - Identifies heritage-relevant profiles (47% of total) - Generates PPID-formatted filenames with inferred locations/dates - Merges with existing profiles, preserving all provenance data - Applies Rules 12, 20, 27, 44, 45 for person data architecture - Fixed edge case: handle null education/experience arrays	2026-01-10 13:58:06 +01:00
kempersc	57e77c8b19	chore(deps): add tsx, yaml, and @types/node for schema extraction script Dependencies for scripts/extract-types-vocab.ts: - tsx: TypeScript execution for Node.js scripts - yaml: Parse LinkML schema files - @types/node: TypeScript definitions for Node.js APIs	2026-01-10 13:33:12 +01:00
kempersc	0845d9f30e	feat(scripts): add person enrichment and slot mapping utilities Person Enrichment Scripts: - enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication) - enrich_ppids_linkup.py: Batch PPID enrichment pipeline - extract_persons_with_provenance.py: Extract person data from LinkedIn HTML with XPath provenance tracking LinkML Slot Management: - update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and semantic URI requirements (Rule 38) - update_class_slot_references.py: Update class files referencing renamed slots - validate_slot_mappings.py: Validate slot definitions against ontology rules All scripts follow established project conventions for provenance and ontology alignment.	2026-01-10 13:32:32 +01:00
kempersc	6f3cf95492	data(person): fix data quality issues and PPID corrections Data Quality Corrections: - TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016, not death). Set is_living=true. Reassess heritage_relevance=false (tourism ministry is not a GLAM institution) - ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam) based on verified birth location. Update birth year to 1980 Profile Enrichments (5 profiles with XX-XX-XXX placeholders): - Add web claims with proper provenance timestamps - Add LinkedIn-verified education and position claims - Document correction rationale in modification_reason Heritage Relevance Reassessments: - Government ministries (Tourism, etc.) marked as non-heritage - Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify	2026-01-10 13:31:39 +01:00
kempersc	f2bc2d54cb	feat(archief-assistent): integrate ontology-driven vocabulary into semantic cache Implements Rule 46: Ontology-Driven Cache Segmentation Semantic Cache Enhancements: - Add institutionSubtype, recordSetType, wikidataEntity to ExtractedEntities - Add extractionMethod field to track vocabulary vs regex extraction - Implement async extractEntitiesWithVocabulary() using term log - Maintain sync regex fallback for cache key generation (<5ms) Build Pipeline: - Add prebuild hook to regenerate types-vocab.json from LinkML schemas - Extract vocabulary from Type.yaml and Types.yaml schema files - Generate GLAMORCUBESFIXPHDNT code mappings automatically New Script: - scripts/extract-types-vocab.ts - Extracts vocabulary from LinkML schemas - Supports --skip-embeddings flag for faster builds - Outputs to apps/archief-assistent/public/types-vocab.json This enables richer cache segmentation using ontology-derived subtypes (e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM') instead of just top-level GLAMORCUBESFIXPHDNT codes.	2026-01-10 13:30:30 +01:00
kempersc	2808dad6cd	fix(linkml): correct invalid ontology property references in slot definitions - confidence_score: prov:confidence doesn't exist → hc:confidenceScore - deliverables: schema:result doesn't exist → hc:deliverables - circumstances_of_death: wikidata:P1196 is identifier, not predicate → hc:circumstancesOfDeath - deceased: schema:deathDate wrong semantics for boolean → hc:deceased - death_place: fix sdo prefix to schema, remove wd:P20 as exact mapping - date_of_death: wikidata:P570 is identifier, not predicate - martyred: correct prefix inconsistencies - given_name/literal_name: fix sdo→schema prefix - occupation/religion/status: standardize prefix declarations Add comments documenting why Wikidata properties (P-numbers) cannot be used as slot_uri (they are entity identifiers, not RDF predicates).	2026-01-10 13:29:55 +01:00
kempersc	49f4054802	data(person/entity): add 83,845 LinkedIn profile extractions from company pages Bulk extraction of heritage professional profiles from LinkedIn company pages using extract_persons_with_provenance.py script. Key characteristics: - Source: LinkedIn company 'People' pages for heritage institutions - File format: {linkedin-slug}_{timestamp}.json - Total size: ~3.6GB - Includes: profile_data, heritage_relevance, affiliations, web_claims - Provenance: Full XPath + archived HTML references (Rule 6 compliant) - Dual timestamps: statement_created_at + source_archived_at (Rule 35) Extraction metadata includes: - extraction_agent: extract_persons_with_provenance.py - source_file: Original archived HTML filename - source_archived_at: When LinkedIn page was captured - schema_version: 1.0.0 Note: URL-encoded filenames preserve international characters (Arabic, Hebrew, Chinese, Turkish, accented Latin, etc.)	2026-01-10 13:27:08 +01:00
kempersc	01b9d77566	feat(archief-assistent): add ontology-driven types vocabulary for cache segmentation Add LinkML-derived vocabulary for semantic cache entity extraction (Rule 46): - types-vocab.json: 10,142 lines of institution type vocabulary from LinkML - 19 GLAMORCUBESFIXPHDNT type codes with Dutch/English/German/French labels - Includes subtypes (kunstmuseum, rijksmuseum, streekarchief, etc.) - Extracted from CustodianType.yaml and CustodianTypes.yaml - types-vocabulary.ts: TypeScript module for entity extraction - Exports INSTITUTION_TYPES with regex patterns per type code - Replaces hardcoded patterns with schema-derived vocabulary - Supports multilingual matching - Rule 46 documentation (.opencode/rules/) - Specifies vocabulary extraction workflow - Defines cache key generation algorithm - Migration path from hardcoded patterns	2026-01-10 12:57:03 +01:00
kempersc	30cd8842d9	data(person): update profiles with web claims and PPID corrections - Rename SENNAY-GHEBREAB profile: NL-ZH-ROT → ET-XX-ADD (Ethiopian birth) - Enrich profiles with inferred birth decades and settlements - Add web claims provenance for enriched data - Update 16 profiles with improved location resolution Files: +1 new (renamed), 16 modified, 1 deleted	2026-01-10 12:56:28 +01:00
kempersc	095a3f949c	refactor(linkml): apply RiC-O slot naming conventions to /schemas/ (Rule 39) Apply same RiC-O-style slot naming refactor to /schemas/20251121/linkml/ that was previously applied to frontend/public/schemas/: - Add 'has_' prefix for possession predicates - Add 'is_or_was_' prefix for temporal inverse relationships - Add 'has_or_had_' for bidirectional temporal relations - Add new slots: is_or_was_aggregated_by, is_or_was_allocated_by, etc. - Update count slots with proper descriptions This ensures consistency between the source schema directory and the frontend-served schemas. 514 files changed, +6,325 insertions, -4,255 deletions	2026-01-10 12:55:45 +01:00
kempersc	3c4f7acf87	test(archief-assistent): update E2E tests for entity extraction cache - Simplify cache spec assertions after structured matching implementation - Refactor map-panel spec for better test isolation and reliability - Remove redundant geographic false positive tests (handled by entity extraction)	2026-01-10 12:55:22 +01:00
kempersc	5eaab2bd30	data(person): enrich heritage professional profiles with web claims Batch enrichment of 3,728 person profiles with additional data: - Birth decade inference from education/career history - Location resolution for inferred birth settlements - Web claims with full provenance (source_url, retrieved_on) - Organizational subdivision extraction - Heritage relevance scoring Also includes: - 14 profile renames for PPID format corrections - Updated _manifest.json with extraction statistics - New _extraction_log.txt and _extraction_summary.json Enrichment follows AGENTS.md rules: - Rule 44: EDTF unknown date notation (XXXX, 196X, etc.) - Rule 45: Inferred data with explicit provenance - Rule 30: Confidence scoring (0.50-0.95) - Rule 31: Organizational subdivision extraction 35,052 files changed, +4,507,411 insertions, -63,118 deletions	2026-01-10 10:35:20 +01:00
kempersc	8a475d5c02	refactor(linkml): apply RiC-O slot naming conventions (Rule 39) Rename slots to follow Records in Contexts (RiC-O) style naming: - Add 'has_' prefix for possession predicates (has_acquisition_method) - Add 'is_or_was_' prefix for temporal relationships - Add 'has_or_had_' for bidirectional temporal relations Key changes across 496 schema files: - acquisition_method → has_acquisition_method - acquisition_date → has_acquisition_date - acquisition_source → has_acquisition_source - access_policy_ref → has_access_policy_reference - arrangement → has_arrangement - parent_custodian → is_or_was_suborganization_of (hierarchy) - parent_custodian → associated_custodian (event association) Also adds new slots following RiC-O patterns: - is_or_was_aggregated_by - is_or_was_allocated_by - is_or_was_archive_department_of - was_approved_by, was_archived_at, was_asserted_by This aligns with AGENTS.md Rule 39: Slot Naming Convention (RiC-O Style) for accurate temporal semantics in heritage custodian ontology. Net change: +2,063 lines (new slots added, old patterns consolidated)	2026-01-10 10:33:51 +01:00
kempersc	7fbff2ff5f	feat(archief-assistent): add entity extraction to semantic cache Prevent geographic false positives in cache lookups. Queries like "musea in Amsterdam" vs "musea in Noord-Holland" have ~93% embedding similarity but completely different answers. Changes: - Add ExtractedEntities interface for structured cache keys - Implement fast entity extraction (<5ms, no LLM) with regex patterns - Extract institution types (GLAMORCUBESFIXPHDNT), locations, and intent - Generate structured cache keys (e.g., "count:M:amsterdam") - Raise similarity threshold from 0.85 to 0.97 to match backend DSPy - Add 'structured' match method to CacheLookupResult The entity extractor recognizes: - 19 institution types (Dutch + English patterns) - 12 Dutch provinces with ISO 3166-2:NL codes - Major Dutch cities with settlement codes - Query intents (count, list, info) This ensures geographic queries get different cache entries even when embeddings are highly similar.	2026-01-10 10:33:21 +01:00
kempersc	519b0b47a8	Add Playwright test results JSON file with initial test suite and failure details	2026-01-09 21:33:31 +01:00
kempersc	004d342935	chore: minor updates and evaluation results - auth.setup.ts: require env vars for test credentials (no hardcoded defaults) - manifest.json: update schema manifest - full_evaluation_results.json: add RAG evaluation results - petra-links.json: update birth date from web claim	2026-01-09 21:10:55 +01:00
kempersc	dd0ee2cf11	feat(scripts): expand university location mappings and add web enrichment - enrich_ppids.py: Add 40+ Dutch universities and hogescholen to location mapping - enrich_ppids_web.py: New script for web-based PPID enrichment - resolve_pending_known_orgs.py: Updates for pending org resolution	2026-01-09 21:10:14 +01:00
kempersc	ea35da02dc	test(archief-assistent): add Playwright E2E test suite - Add chat.spec.ts for RAG query testing - Add count-queries.spec.ts for aggregation validation - Add map-panel.spec.ts for geographic feature testing - Add cache.spec.ts for response caching verification - Add auth.setup.ts for authentication handling - Configure playwright.config.ts for multi-browser testing - Tests run against production archief.support	2026-01-09 21:09:56 +01:00
kempersc	855fff5962	data(person): resolve PPID locations and enrich profiles - Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations - Update 2,463 profiles with enriched data - Add 512 new person profiles (AU, international heritage professionals) - PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}	2026-01-09 21:09:28 +01:00
kempersc	eb122e2532	data(custodian): remove 380 PENDING files after collision merge PENDING files were merged into existing custodian records in commit `eaf80ec`. These temporary collision placeholder files are no longer needed.	2026-01-09 21:06:22 +01:00
kempersc	97f85e0050	deps(archief-assistent): add playwright for E2E testing - Add @playwright/test as dev dependency - Alphabetize dependencies list	2026-01-09 21:06:12 +01:00
kempersc	f7bd3e9edc	feat(linkml-viewer): add slot_usage side-by-side comparison view - Add 'Compare' toggle button next to slots with slot_usage overrides - Show generic slot definition vs class-specific override in 3-column grid - Highlight changed properties with green 'changed' badge - Display '(inherited)' when override matches generic definition - Display '(not defined)' when generic has no value for property - Compare: range, description, required, multivalued, slot_uri, pattern, identifier - Full i18n support (Dutch/English translations) - Responsive design: stacks vertically on mobile (<640px)	2026-01-09 21:02:14 +01:00
kempersc	9e67d0f967	enrich profiles	2026-01-09 20:35:19 +01:00
kempersc	12fed83d6e	fix(rag): preserve count value for COUNT queries in non-streaming endpoint - Detect COUNT queries by checking for 'count' key in SPARQL results - Skip institution transformation for COUNT queries to preserve count value - Fixes bug where 'Hoeveel archieven in Utrecht?' returned 1 instead of 10 - COUNT queries now correctly extract integer count from SPARQL response	2026-01-09 18:57:40 +01:00
kempersc	8a7ed757b8	fix(rag): use SPARQL results for COUNT queries in streaming fast-path - Fix bug where COUNT queries showed Qdrant result count (10) instead of actual SPARQL count (e.g., 204 musea in Noord-Holland) - Use sparql_results for count extraction in factual query fast-path - Also fix fallback COUNT/LIST handling to use sparql_results	2026-01-09 18:47:56 +01:00
kempersc	eaf80ec756	data(custodian): merge PENDING collision files into existing custodians Merge staff data from 7 PENDING files into their matching custodian records: - NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff) - NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO - NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS - NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP - NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM - NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ - NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV Originals archived in data/custodian/archive/pending_collisions_20250109/ Add scripts/merge_collision_files.py for reproducible merging	2026-01-09 18:33:00 +01:00
kempersc	e9c9aefc37	data(person): regenerate PPIDs with unidecode support for non-Latin scripts - Add display_name and name_romanized fields to all 7948 person profiles - Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names now properly romanize) - Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN - Collision count reduced from 82 to 81 groups Regenerated using generate_ppids.py with unidecode support (commit `abe30cb`)	2026-01-09 18:31:53 +01:00
kempersc	04791a7a91	fix(ppid): fix unidecode import reference typo	2026-01-09 18:29:36 +01:00
kempersc	c45367c60f	data(custodian): resolve more PENDING files with proper GHCIDs Additional batch of PENDING file resolutions: - DK: Aalborg Teater - FR: Airborne Museum, ALCA Nouvelle-Aquitaine - NL: 12 institutions (CODA Apeldoorn, Airborne Museum Arnhem, etc.) - SA: Saudi Arabia Ministry of Culture Files renamed from NL-XX-XXX-PENDING-* to proper country/region codes.	2026-01-09 18:29:09 +01:00
kempersc	abe30cb302	feat(ppid): add unidecode support for non-Latin script transliteration Add optional unidecode dependency to handle Hebrew, Arabic, Chinese, and other non-Latin scripts when generating Person Persistent IDs.	2026-01-09 18:28:41 +01:00
kempersc	932ec5438c	add person profiles with PPID	2026-01-09 18:26:58 +01:00
kempersc	c0d31b3905	fix(rag): add fallback imports for semantic_router and temporal_intent Support both relative and absolute imports for running as module or script.	2026-01-09 18:26:40 +01:00
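
Note on commit 12fed83d6e (COUNT queries): the message says COUNT queries are detected by checking for a 'count' key in the SPARQL results, and that the integer count is then extracted rather than passed through the institution transformation. A minimal Python sketch of that check follows. The function name `extract_count` and the accepted row shapes are assumptions for illustration and are not taken from the repository.

```python
from typing import Any, Optional


def extract_count(sparql_results: list[dict[str, Any]]) -> Optional[int]:
    """Return the aggregate count if the rows look like a COUNT result, else None.

    Hypothetical helper: the real backend code is not shown in the log.
    """
    if not sparql_results:
        return None
    first_row = sparql_results[0]
    if "count" not in first_row:
        # Not a COUNT query; the caller would fall back to the normal
        # institution transformation.
        return None
    raw = first_row["count"]
    # The SPARQL JSON results format wraps each value as
    # {"type": ..., "value": "10"}; also accept an already flattened value.
    if isinstance(raw, dict):
        raw = raw.get("value")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    rows = [{"count": {"type": "literal", "value": "10"}}]
    print(extract_count(rows))
```

Returning None for non-COUNT rows keeps the two code paths separate, which is the behaviour the commit message describes.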

296 commits
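
Note on commit 7fbff2ff5f (structured cache keys): the message describes regex-based entity extraction that turns a query into a key such as "count:M:amsterdam", so that "musea in Amsterdam" and "musea in Noord-Holland" get separate cache entries despite similar embeddings. The commit concerns the archief-assistent frontend; the Python sketch below only illustrates the idea. The pattern tables are heavily abbreviated and hypothetical, the "A" code for archives and the "*" placeholder for missing parts are assumptions, and only the `ExtractedEntities` name and the example key format come from the log.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, abbreviated patterns; the log mentions 19 institution types.
TYPE_PATTERNS = {
    "M": re.compile(r"\bmuse(?:um|ums|a)\b", re.IGNORECASE),
    "A": re.compile(r"\b(?:archie(?:f|ven)|archives?)\b", re.IGNORECASE),  # assumed code
}

# Hypothetical location list; longer names first so they win over substrings.
LOCATIONS = ("noord-holland", "amsterdam", "utrecht")

INTENT_PATTERNS = (
    ("count", re.compile(r"\b(?:hoeveel|how many)\b", re.IGNORECASE)),
    ("list", re.compile(r"\b(?:welke|which|list)\b", re.IGNORECASE)),
)


@dataclass(frozen=True)
class ExtractedEntities:
    intent: str
    institution_type: Optional[str]
    location: Optional[str]


def extract_entities(query: str) -> ExtractedEntities:
    """Extract intent, institution type, and location with regexes only (no LLM)."""
    intent = next(
        (name for name, pat in INTENT_PATTERNS if pat.search(query)), "info"
    )
    institution_type = next(
        (code for code, pat in TYPE_PATTERNS.items() if pat.search(query)), None
    )
    lowered = query.lower()
    location = next(
        (loc for loc in LOCATIONS if re.search(rf"\b{re.escape(loc)}\b", lowered)),
        None,
    )
    return ExtractedEntities(intent, institution_type, location)


def cache_key(entities: ExtractedEntities) -> str:
    """Join the parts into 'intent:type:location'; '*' marks a missing part (assumption)."""
    return ":".join(
        [entities.intent, entities.institution_type or "*", entities.location or "*"]
    )


if __name__ == "__main__":
    print(cache_key(extract_entities("How many museums in Amsterdam?")))
    print(cache_key(extract_entities("Hoeveel musea in Noord-Holland?")))
```

Because the key is built from extracted entities rather than from the embedding, two queries that differ only in location map to different keys.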