Commit graph

322 commits

Author SHA1 Message Date
kempersc
e94b58a289 fix(ci): install rsync in CI container
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 4m48s
The node:20-bookworm image doesn't include rsync which is needed
for the sync-schemas npm script.
2026-01-11 17:02:24 +01:00
kempersc
29ef609465 fix(ci): let pnpm version be read from package.json packageManager field
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 1m23s
The pnpm/action-setup detects version from package.json's packageManager
field automatically. Specifying version in workflow causes conflict.
2026-01-11 17:00:02 +01:00
kempersc
03a506382d fix(ci): include pnpm workspace files in sparse checkout
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 32s
The pnpm-lock.yaml and other workspace files are at the repository root,
not in frontend/. Add them to sparse checkout for pnpm install to work.
2026-01-11 16:58:22 +01:00
kempersc
5aeda1c195 fix(ci): disable pnpm caching due to path resolution issues
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 50s
The setup-node action fails to cache pnpm dependencies because the
store path /workspace/kempersc/glam/.pnpm-store/v3 can't be resolved.
Disabling caching for now to get the build working.
2026-01-11 16:43:49 +01:00
kempersc
b2b80cdad8 fix(ci): use pnpm instead of npm for workspace:* dependency support
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 59s
The frontend uses pnpm workspaces with 'workspace:*' protocol that npm
doesn't support. This updates the workflow to:
- Install pnpm using pnpm/action-setup
- Use pnpm for install, sync-schemas, generate-manifest, and build
- Cache pnpm dependencies using pnpm-lock.yaml
2026-01-11 16:41:57 +01:00
kempersc
b91be82af2 fix(ci): use sparse checkout to avoid large data/ directory
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 5m59s
The repository has 314K+ files including backup data that exceeds
the CI runner's disk space. This change uses sparse checkout to only
fetch frontend/ and schemas/ directories needed for the build.
2026-01-11 16:32:58 +01:00
kempersc
66ab2908d0 fix: remove deprecated AnnotationMotivationEnum, add European surname data
Some checks failed
Deploy Frontend / build-and-deploy (push) Failing after 3m21s
- Move deprecated AnnotationMotivationEnum to archive-deprecated/ (outside served paths)
- Add French, Italian, Polish, Spanish surname datasets for entity resolution
- Update name_commonality.py with expanded European surname detection
- Triggers GitOps workflow to test Forgejo Actions runner
2026-01-11 16:03:18 +01:00
kempersc
fd792fce2c Refactor code structure for improved readability and maintainability
Some checks failed
Deploy Frontend / build-and-deploy (push) Has been cancelled
2026-01-11 15:27:14 +01:00
kempersc
055fd890ff test: verify pre-commit hook regenerates manifest
Some checks are pending
Deploy Frontend / build-and-deploy (push) Waiting to run
2026-01-11 15:21:58 +01:00
kempersc
3a661d6013 fix(schema): regenerate manifest to remove stale AnnotationMotivationEnum reference
Some checks are pending
Deploy Frontend / build-and-deploy (push) Waiting to run
The old enum was properly archived to modules/enums/archive/ with .deprecated
suffix per Rule 9, but the manifest wasn't regenerated. Now correctly shows
only AnnotationMotivationType.yaml and AnnotationMotivationTypes.yaml.
2026-01-11 15:16:50 +01:00
kempersc
b6e069a1d5 chore(schema): bump version to 0.9.12 - test webhook deployment
Some checks are pending
Deploy Frontend / build-and-deploy (push) Waiting to run
2026-01-11 15:08:35 +01:00
kempersc
0f7fbf1ca0 feat(ci): add Forgejo Actions workflow for auto-deploy on LinkML schema changes
Some checks are pending
Deploy Frontend / build-and-deploy (push) Waiting to run
Infrastructure changes to enable automatic frontend deployment when schemas change:

- Add .forgejo/workflows/deploy-frontend.yml workflow triggered by:
  - Changes to frontend/** or schemas/20251121/linkml/**
  - Manual workflow dispatch

- Rewrite generate-schema-manifest.cjs to properly scan all schema directories
  - Recursively scans classes, enums, slots, modules directories
  - Uses singular category names (class, enum, slot) matching TypeScript types
  - Includes all 4 main schemas at root level
  - Skips archive directories and backup files

- Update schema-loader.ts to match new manifest format
  - Add SchemaCategory interface
  - Update SchemaManifest to use categories as array
  - Add flattenCategories() helper function
  - Add getSchemaCategories() and getSchemaCategoriesSync() functions

The workflow builds frontend with updated manifest and deploys to bronhouder.nl
2026-01-11 14:16:57 +01:00
kempersc
329b341bb1 refactor(schema): sync AnnotationMotivationType changes to frontend public schemas
- Update VideoAnnotation class with new motivation type references
- Add AnnotationMotivationType and AnnotationMotivationTypes class files
- Add motivation_type slots (description, id, name)
- Archive deprecated AnnotationMotivationEnum
- Update slot references for derived_from_entity, has_observation, has_person_observation
2026-01-11 14:16:39 +01:00
kempersc
9726cc7917 feat(frontend): Add AnnotationMotivationType to LinkML schema manifest
Add new AnnotationMotivationType and AnnotationMotivationTypes to the
SCHEMA_FILES array so they appear in the /linkml viewer.
2026-01-11 13:56:11 +01:00
kempersc
55ef2a831d feat(data): add Belgian surnames dataset with metadata and surname counts 2026-01-11 13:50:20 +01:00
kempersc
be8b14f6ac refactor: Convert AnnotationMotivationEnum to Type/Types class hierarchy
- Create AnnotationMotivationType abstract base class (oa:Motivation)
- Create 10 concrete motivation subclasses in AnnotationMotivationTypes.yaml:
  - 6 W3C Web Annotation standard: classifying, describing, identifying,
    tagging, linking, commenting
  - 4 heritage-specific: accessibility, discovery, preservation, research
- Update has_annotation_motivation slot to use AnnotationMotivationType range
- Update VideoAnnotation.yaml imports and remove inline enum
- Archive deprecated AnnotationMotivationEnum.yaml
- Add motivation_type_id, motivation_type_name, motivation_type_description slots

Follows Rule 0b (Type/Types naming convention) and Rule 9 (enum-to-class promotion)
2026-01-11 13:48:28 +01:00
kempersc
7d09e4179c Add US surnames dataset from 2010 Census with metadata and surname counts 2026-01-11 12:28:58 +01:00
kempersc
dfb4744dc7 Evaluate data enrichments of persons 2026-01-11 12:15:27 +01:00
kempersc
49a8c341b5 chore(data): update geonames database journal file 2026-01-11 02:51:52 +01:00
kempersc
170fd73c49 feat(agents): update critical rules section to include entity resolution guidelines 2026-01-11 02:51:18 +01:00
kempersc
556cc6c294 Add workspace configuration for Git and Gitea integration
- Set up GitHub integration to be disabled.
- Configure Git settings including path and autofetch options.
- Add Gitea instance URL and repository details.
- Enable YAML support for LinkML schemas with validation.
- Define file associations for YAML files.
- Recommend essential extensions for development and exclude unwanted ones.
2026-01-11 02:50:39 +01:00
kempersc
b3e57e709c Refactor code structure for improved readability and maintainability 2026-01-11 02:24:34 +01:00
kempersc
3c3be47e32 feat(infra): add fast push-based schema sync to production
- Replace slow Forgejo→Server git pull with direct local rsync
- Add git-push-schemas.sh wrapper script for manual pushes
- Add post-commit hook for automatic schema sync
- Fix YAML syntax errors in slot comment blocks
- Update deploy-webhook.py to use master branch
2026-01-11 01:22:47 +01:00
kempersc
0df26a6e44 data(person): additional person profile enrichments 2026-01-11 00:41:59 +01:00
kempersc
0a888ec682 chore: add node_modules to .gitignore and remove from tracking
- Add node_modules/ and .pnpm-store/ to .gitignore
- Remove 76k node_modules files from git tracking
- Update frontend manifest
2026-01-11 00:41:21 +01:00
kempersc
3eb097d92e data(person): enrich 64 person profiles with comprehensive metadata
- Add inferred birth dates using EDTF notation
- Add inferred birth/current settlements
- Enrich employment history with temporal data
- Add heritage sector relevance scores
- Improve PPID component tracking
- Update .gitignore with large file patterns (warc, nt, trix, geonames.db)
2026-01-11 00:38:09 +01:00
kempersc
3dd1f11059 chore: sync repository to self-hosted Forgejo 2026-01-11 00:12:16 +01:00
kempersc
a4184cb805 feat(infra): add webhook-based schema deployment pipeline
- Add FastAPI webhook receiver for Forgejo push events
- Add setup script for server deployment
- Add Caddy snippet for webhook endpoint
- Add local sync-schemas.sh helper script
- Sync frontend schemas with source (archived deprecated slots)

Infrastructure scripts staged for optional webhook deployment.
Current deployment uses: ./infrastructure/deploy.sh --frontend
2026-01-10 21:45:02 +01:00
kempersc
f02cffe1e8 refactor(schema): migrate 5 deprecated slots to temporal naming convention
Migrate slots to follow RiC-O-style temporal naming (Rule 39):
- accepts_external_work → accepts_or_accepted_external_work
- accepts_visiting_scholars → accepts_or_accepted_visiting_scholar
- accepts_payment_methods → accepts_or_accepted_payment_method
- access → has_or_had_access_condition
- access_policy_ref → has_or_had_access_policy_reference

Updated classes to use new slot names:
- ConservationLab.yaml
- ResearchCenter.yaml
- GiftShop.yaml
- ArchiveReference.yaml
- FindingAid.yaml
- Collection.yaml

Archived deprecated slots to schemas/20251121/linkml/archive/slots/
with _archived_20260110 suffix per Rule 9 (enum-to-class principle).
2026-01-10 21:09:29 +01:00
kempersc
ac36b80476 feat(rag): add companion queries for count templates
Add companion_query support to fetch full entity records alongside
aggregate count queries. Enables displaying results on map/list when
asking 'how many museums in Amsterdam?'

Backend changes:
- Add companion_query, companion_query_region, companion_query_country
  fields to TemplateDefinition and TemplateMatchResult
- Add render_template_string() for raw companion query rendering

Template changes:
- Add companion queries to count_institutions_by_type_and_location
  for settlement, region, and country level queries
- Returns institution URI, name, coordinates, city for visualization
2026-01-10 18:44:06 +01:00
kempersc
f8b4ecad7d data(person): enrich 7 person profiles with detailed employment history
Update heritage professional profiles with:
- Separate role entries for different positions at same institution
- Employment date ranges (start_date, end_date)
- Updated observed_on timestamps
- Direct LinkedIn profile URLs as source

Profiles updated:
- Antoinet Nijssen (Noord-Hollands Archief)
- Anna Lakmaker
- Annelies Reus
- Marianne Hamersma
- Marcel Auwers
- Hans Felius
- Nico Vriend
2026-01-10 18:43:27 +01:00
kempersc
6c19ef8661 feat(rag): add Rule 46 epistemic provenance tracking
Track full lineage of RAG responses: WHERE data comes from, WHEN it was
retrieved, HOW it was processed (SPARQL/vector/LLM).

Backend changes:
- Add provenance.py with EpistemicProvenance, DataTier, SourceAttribution
- Integrate provenance into MultiSourceRetriever.merge_results()
- Return epistemic_provenance in DSPyQueryResponse

Frontend changes:
- Pass EpistemicProvenance through useMultiDatabaseRAG hook
- Display provenance in ConversationPage (for cache transparency)

Schema fixes:
- Fix truncated example in has_observation.yaml slot definition

References:
- Pavlyshyn's Context Graphs and Data Traces paper
- LinkML ProvenanceBlock schema pattern
2026-01-10 18:42:43 +01:00
kempersc
54dd4a9803 docs(server): add SERVER_OPERATIONS.md for Hetzner cx32 deployment
Document server disk architecture, PyTorch CPU-only setup, service
management, and recovery procedures learned from disk space crisis.

- Document dual-disk architecture (/: root 75GB, /mnt/data: 49GB)
- PyTorch CPU-only installation via --index-url whl/cpu
- Custodian data symlink: /mnt/data/custodian → /var/lib/glam/api/data/
- Service restart procedures for Oxigraph, GLAM API, Qdrant, etc.
- Emergency recovery commands for disk space crises
2026-01-10 18:42:15 +01:00
kempersc
28c3aaf33f enrich profiles 2026-01-10 17:31:02 +01:00
kempersc
bd257c52f4 data(person): update 2 additional profiles 2026-01-10 15:39:12 +01:00
kempersc
2f33e6a230 data(person): update DR-STAPEL profile 2026-01-10 15:38:37 +01:00
kempersc
cce484c6b8 feat(archief-assistent): enhance semantic cache with ontology-driven vocabulary
- Integrate tier-2 embeddings from types-vocab.json
- Add segment-based caching for improved retrieval
- Update tests and documentation
2026-01-10 15:38:11 +01:00
kempersc
ad74d8379e feat(scripts): improve types-vocab extraction to derive all vocabulary from schema
- Remove hardcoded type mappings, derive dynamically from LinkML
- Extract keywords from annotations, structured_aliases, and comments
- Add rename_plural_slot.py utility for schema slot renaming
2026-01-10 15:37:52 +01:00
kempersc
ec18e1810d data(person): enrich 7 profiles with detailed affiliations and GHCIDs
- Add GHCID references to custodian affiliations
- Add start dates for employment periods
- Expand heritage type classifications (A→[A,F])
- Add detailed rationales based on career history
- Add full_initials from archival publications
2026-01-10 15:36:49 +01:00
kempersc
626bd3a095 refactor(schemas): apply naming conventions to 261 class files
- Apply Rule 39: RiC-O style hasOrHad*/isOrWas* for temporal slots
- Apply Rule 43: Singular noun convention (keywords → keyword)
- Update slot references to match renamed slot files
- Maintain schema integrity across all class definitions
2026-01-10 15:36:33 +01:00
kempersc
94bfc9061e refactor(schemas): consolidate slot definitions and remove 305 redundant files
- Apply Rule 39: RiC-O style temporal naming (hasOrHad*, isOrWas*)
- Apply Rule 43: Singular noun convention for slot names
- Remove duplicate slot definitions consolidated into centralized files
- Net reduction: 6,162 lines across 305 deleted files
2026-01-10 15:36:13 +01:00
kempersc
13938c92ca chore(schemas): sync LinkML schemas to frontend apps
Copies authoritative schemas from schemas/20251121/ to:
- frontend/public/schemas/20251121/
- apps/archief-assistent/public/schemas/20251121/

This ensures slot definitions with corrected ontology property
references (commit 2808dad6cd) are available to frontend apps.
2026-01-10 15:02:25 +01:00
kempersc
e5a08a353d enrich person profiles 2026-01-10 14:14:04 +01:00
kempersc
9339de2cfb data(person): process 44,512 heritage-relevant profiles from entity extractions
Processing Summary:
- Scanned 94,716 LinkedIn entity files
- Identified 44,512 heritage-relevant individuals (47%)
- Created 1,430 new PPID-formatted profiles
- Updated 43,070 existing profiles with entity data
- Final count: 40,731 person profiles

Profile updates include:
- Merged web_claims with full provenance
- Added/updated heritage_relevance scoring
- Added affiliation data with custodian references
- Added inferred birth decades with provenance chains (Rule 45)

All data preserved per Rule 5 (additive only)
2026-01-10 14:01:29 +01:00
kempersc
3a15f2bdaa feat(scripts): add entity-to-PPID processing script
- Processes 94,716 LinkedIn entity files from data/custodian/person/entity/
- Identifies heritage-relevant profiles (47% of total)
- Generates PPID-formatted filenames with inferred locations/dates
- Merges with existing profiles, preserving all provenance data
- Applies Rules 12, 20, 27, 44, 45 for person data architecture
- Fixed edge case: handle null education/experience arrays
2026-01-10 13:58:06 +01:00
kempersc
57e77c8b19 chore(deps): add tsx, yaml, and @types/node for schema extraction script
Dependencies for scripts/extract-types-vocab.ts:
- tsx: TypeScript execution for Node.js scripts
- yaml: Parse LinkML schema files
- @types/node: TypeScript definitions for Node.js APIs
2026-01-10 13:33:12 +01:00
kempersc
0845d9f30e feat(scripts): add person enrichment and slot mapping utilities
Person Enrichment Scripts:
- enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup
  with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication)
- enrich_ppids_linkup.py: Batch PPID enrichment pipeline
- extract_persons_with_provenance.py: Extract person data from LinkedIn HTML
  with XPath provenance tracking

LinkML Slot Management:
- update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and
  semantic URI requirements (Rule 38)
- update_class_slot_references.py: Update class files referencing renamed slots
- validate_slot_mappings.py: Validate slot definitions against ontology rules

All scripts follow established project conventions for provenance and
ontology alignment.
2026-01-10 13:32:32 +01:00
kempersc
6f3cf95492 data(person): fix data quality issues and PPID corrections
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016,
  not death). Set is_living=true. Reassess heritage_relevance=false (tourism
  ministry is not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
  based on verified birth location. Update birth year to 1980

Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason

Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
2026-01-10 13:31:39 +01:00
kempersc
f2bc2d54cb feat(archief-assistent): integrate ontology-driven vocabulary into semantic cache
Implements Rule 46: Ontology-Driven Cache Segmentation

Semantic Cache Enhancements:
- Add institutionSubtype, recordSetType, wikidataEntity to ExtractedEntities
- Add extractionMethod field to track vocabulary vs regex extraction
- Implement async extractEntitiesWithVocabulary() using term log
- Maintain sync regex fallback for cache key generation (<5ms)

Build Pipeline:
- Add prebuild hook to regenerate types-vocab.json from LinkML schemas
- Extract vocabulary from *Type.yaml and *Types.yaml schema files
- Generate GLAMORCUBESFIXPHDNT code mappings automatically

New Script:
- scripts/extract-types-vocab.ts - Extracts vocabulary from LinkML schemas
- Supports --skip-embeddings flag for faster builds
- Outputs to apps/archief-assistent/public/types-vocab.json

This enables richer cache segmentation using ontology-derived subtypes
(e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM') instead of just top-level
GLAMORCUBESFIXPHDNT codes.
2026-01-10 13:30:30 +01:00
kempersc
2808dad6cd fix(linkml): correct invalid ontology property references in slot definitions
- confidence_score: prov:confidence doesn't exist → hc:confidenceScore
- deliverables: schema:result doesn't exist → hc:deliverables
- circumstances_of_death: wikidata:P1196 is identifier, not predicate → hc:circumstancesOfDeath
- deceased: schema:deathDate wrong semantics for boolean → hc:deceased
- death_place: fix sdo prefix to schema, remove wd:P20 as exact mapping
- date_of_death: wikidata:P570 is identifier, not predicate
- martyred: correct prefix inconsistencies
- given_name/literal_name: fix sdo→schema prefix
- occupation/religion/status: standardize prefix declarations

Add comments documenting why Wikidata properties (P-numbers) cannot be
used as slot_uri (they are entity identifiers, not RDF predicates).
2026-01-10 13:29:55 +01:00