Add companion_query support to fetch full entity records alongside
aggregate count queries. Enables displaying results on map/list when
asking 'how many museums in Amsterdam?'
Backend changes:
- Add companion_query, companion_query_region, companion_query_country
fields to TemplateDefinition and TemplateMatchResult
- Add render_template_string() for raw companion query rendering
Template changes:
- Add companion queries to count_institutions_by_type_and_location
for settlement, region, and country level queries
- Returns institution URI, name, coordinates, city for visualization
Update heritage professional profiles with:
- Separate role entries for different positions at same institution
- Employment date ranges (start_date, end_date)
- Updated observed_on timestamps
- Direct LinkedIn profile URLs as source
Profiles updated:
- Antoinet Nijssen (Noord-Hollands Archief)
- Anna Lakmaker
- Annelies Reus
- Marianne Hamersma
- Marcel Auwers
- Hans Felius
- Nico Vriend
Track full lineage of RAG responses: WHERE data comes from, WHEN it was
retrieved, HOW it was processed (SPARQL/vector/LLM).
Backend changes:
- Add provenance.py with EpistemicProvenance, DataTier, SourceAttribution
- Integrate provenance into MultiSourceRetriever.merge_results()
- Return epistemic_provenance in DSPyQueryResponse
Frontend changes:
- Pass EpistemicProvenance through useMultiDatabaseRAG hook
- Display provenance in ConversationPage (for cache transparency)
Schema fixes:
- Fix truncated example in has_observation.yaml slot definition
References:
- Pavlyshyn's Context Graphs and Data Traces paper
- LinkML ProvenanceBlock schema pattern
Document server disk architecture, PyTorch CPU-only setup, service
management, and recovery procedures learned from disk space crisis.
- Document dual-disk architecture (/: root 75GB, /mnt/data: 49GB)
- PyTorch CPU-only installation via --index-url whl/cpu
- Custodian data symlink: /mnt/data/custodian → /var/lib/glam/api/data/
- Service restart procedures for Oxigraph, GLAM API, Qdrant, etc.
- Emergency recovery commands for disk space crises
- Remove hardcoded type mappings, derive dynamically from LinkML
- Extract keywords from annotations, structured_aliases, and comments
- Add rename_plural_slot.py utility for schema slot renaming
- Add GHCID references to custodian affiliations
- Add start dates for employment periods
- Expand heritage type classifications (A→[A,F])
- Add detailed rationales based on career history
- Add full_initials from archival publications
Copies authoritative schemas from schemas/20251121/ to:
- frontend/public/schemas/20251121/
- apps/archief-assistent/public/schemas/20251121/
This ensures slot definitions with corrected ontology property
references (commit 2808dad6cd) are available to frontend apps.
Person Enrichment Scripts:
- enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup
with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication)
- enrich_ppids_linkup.py: Batch PPID enrichment pipeline
- extract_persons_with_provenance.py: Extract person data from LinkedIn HTML
with XPath provenance tracking
LinkML Slot Management:
- update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and
semantic URI requirements (Rule 38)
- update_class_slot_references.py: Update class files referencing renamed slots
- validate_slot_mappings.py: Validate slot definitions against ontology rules
All scripts follow established project conventions for provenance and
ontology alignment.
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016,
not death). Set is_living=true. Reassess heritage_relevance=false (tourism
ministry is not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
based on verified birth location. Update birth year to 1980
Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason
Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
Apply same RiC-O-style slot naming refactor to /schemas/20251121/linkml/
that was previously applied to frontend/public/schemas/:
- Add 'has_' prefix for possession predicates
- Add 'is_or_was_' prefix for temporal inverse relationships
- Add 'has_or_had_' for bidirectional temporal relations
- Add new slots: is_or_was_aggregated_by, is_or_was_allocated_by, etc.
- Update count slots with proper descriptions
This ensures consistency between the source schema directory and the
frontend-served schemas.
514 files changed, +6,325 insertions, -4,255 deletions
Prevent geographic false positives in cache lookups. Queries like
"musea in Amsterdam" vs "musea in Noord-Holland" have ~93%
embedding similarity but completely different answers.
Changes:
- Add ExtractedEntities interface for structured cache keys
- Implement fast entity extraction (<5ms, no LLM) with regex patterns
- Extract institution types (GLAMORCUBESFIXPHDNT), locations, and intent
- Generate structured cache keys (e.g., "count:M:amsterdam")
- Raise similarity threshold from 0.85 to 0.97 to match backend DSPy
- Add 'structured' match method to CacheLookupResult
The entity extractor recognizes:
- 19 institution types (Dutch + English patterns)
- 12 Dutch provinces with ISO 3166-2:NL codes
- Major Dutch cities with settlement codes
- Query intents (count, list, info)
This ensures geographic queries get different cache entries even when
embeddings are highly similar.
- auth.setup.ts: require env vars for test credentials (no hardcoded defaults)
- manifest.json: update schema manifest
- full_evaluation_results.json: add RAG evaluation results
- petra-links.json: update birth date from web claim
- Add chat.spec.ts for RAG query testing
- Add count-queries.spec.ts for aggregation validation
- Add map-panel.spec.ts for geographic feature testing
- Add cache.spec.ts for response caching verification
- Add auth.setup.ts for authentication handling
- Configure playwright.config.ts for multi-browser testing
- Tests run against production archief.support
- Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations
- Update 2,463 profiles with enriched data
- Add 512 new person profiles (AU, international heritage professionals)
- PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}
- Add 'Compare' toggle button next to slots with slot_usage overrides
- Show generic slot definition vs class-specific override in 3-column grid
- Highlight changed properties with green 'changed' badge
- Display '(inherited)' when override matches generic definition
- Display '(not defined)' when generic has no value for property
- Compare: range, description, required, multivalued, slot_uri, pattern, identifier
- Full i18n support (Dutch/English translations)
- Responsive design: stacks vertically on mobile (<640px)
- Detect COUNT queries by checking for 'count' key in SPARQL results
- Skip institution transformation for COUNT queries to preserve count value
- Fixes bug where 'Hoeveel archieven in Utrecht?' returned 1 instead of 10
- COUNT queries now correctly extract integer count from SPARQL response
- Fix bug where COUNT queries showed Qdrant result count (10) instead of
actual SPARQL count (e.g., 204 musea in Noord-Holland)
- Use sparql_results for count extraction in factual query fast-path
- Also fix fallback COUNT/LIST handling to use sparql_results
- Add display_name and name_romanized fields to all 7948 person profiles
- Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names now properly romanize)
- Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN
- Collision count reduced from 82 to 81 groups
Regenerated using generate_ppids.py with unidecode support (commit abe30cb)