- Remove hardcoded type mappings, derive dynamically from LinkML
- Extract keywords from annotations, structured_aliases, and comments
- Add rename_plural_slot.py utility for schema slot renaming
- Add GHCID references to custodian affiliations
- Add start dates for employment periods
- Expand heritage type classifications (A→[A,F])
- Add detailed rationales based on career history
- Add full_initials from archival publications
Copy authoritative schemas from schemas/20251121/ to:
- frontend/public/schemas/20251121/
- apps/archief-assistent/public/schemas/20251121/
This ensures slot definitions with corrected ontology property
references (commit 2808dad6cd) are available to frontend apps.
Person Enrichment Scripts:
- enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup
with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication)
- enrich_ppids_linkup.py: Batch PPID enrichment pipeline
- extract_persons_with_provenance.py: Extract person data from LinkedIn HTML
with XPath provenance tracking
LinkML Slot Management:
- update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and
semantic URI requirements (Rule 38)
- update_class_slot_references.py: Update class files referencing renamed slots
- validate_slot_mappings.py: Validate slot definitions against ontology rules
All scripts follow established project conventions for provenance and
ontology alignment.
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (the value 2016 was an
education end date, not a year of death). Set is_living=true. Reassess
heritage relevance and set heritage_relevance=false (a tourism ministry is
not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
based on verified birth location. Update birth year to 1980
Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason
Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
Apply same RiC-O-style slot naming refactor to /schemas/20251121/linkml/
that was previously applied to frontend/public/schemas/:
- Add 'has_' prefix for possession predicates
- Add 'is_or_was_' prefix for temporal inverse relationships
- Add 'has_or_had_' prefix for bidirectional temporal relations
- Add new slots: is_or_was_aggregated_by, is_or_was_allocated_by, etc.
- Update count slots with proper descriptions
This ensures consistency between the source schema directory and the
frontend-served schemas.
514 files changed, +6,325 insertions, -4,255 deletions
Prevent geographic false positives in cache lookups. Queries such as
"musea in Amsterdam" and "musea in Noord-Holland" have ~93% embedding
similarity but completely different answers.
Changes:
- Add ExtractedEntities interface for structured cache keys
- Implement fast entity extraction (<5ms, no LLM) with regex patterns
- Extract institution types (GLAMORCUBESFIXPHDNT), locations, and intent
- Generate structured cache keys (e.g., "count:M:amsterdam")
- Raise similarity threshold from 0.85 to 0.97 to match backend DSPy
- Add 'structured' match method to CacheLookupResult
The entity extractor recognizes:
- 19 institution types (Dutch + English patterns)
- 12 Dutch provinces with ISO 3166-2:NL codes
- Major Dutch cities with settlement codes
- Query intents (count, list, info)
This ensures geographic queries get different cache entries even when
embeddings are highly similar.
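
A minimal Python sketch of the structured-key idea described above (the change itself targets the TypeScript frontend). The pattern tables, the `extract_entities` and `structured_cache_key` names, the `*` wildcard, and the default `info` intent are assumptions for illustration; only the `count:M:amsterdam` key shape and the NL-NH province code come from this message.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Pattern

# Hypothetical, heavily reduced pattern tables. The real extractor covers
# 19 institution types, 12 provinces, and major cities.
INTENT_PATTERNS: Dict[str, Pattern[str]] = {
    "count": re.compile(r"\b(hoeveel|how many|aantal)\b", re.IGNORECASE),
    "list": re.compile(r"\b(welke|which|list)\b", re.IGNORECASE),
}
TYPE_PATTERNS: Dict[str, Pattern[str]] = {
    "M": re.compile(r"\b(musea|museum|museums)\b", re.IGNORECASE),
    "A": re.compile(r"\b(archief|archieven|archive|archives)\b", re.IGNORECASE),
}
LOCATION_PATTERNS: Dict[str, Pattern[str]] = {
    "amsterdam": re.compile(r"\bamsterdam\b", re.IGNORECASE),
    "NL-NH": re.compile(r"\bnoord[- ]holland\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class ExtractedEntities:
    intent: str
    institution_type: Optional[str]
    location: Optional[str]


def _first_match(patterns: Dict[str, Pattern[str]], text: str) -> Optional[str]:
    """Return the code of the first pattern that matches, or None."""
    for code, pattern in patterns.items():
        if pattern.search(text):
            return code
    return None


def extract_entities(query: str) -> ExtractedEntities:
    """Regex-only extraction: no LLM call, so it stays cheap."""
    return ExtractedEntities(
        intent=_first_match(INTENT_PATTERNS, query) or "info",  # assumed default
        institution_type=_first_match(TYPE_PATTERNS, query),
        location=_first_match(LOCATION_PATTERNS, query),
    )


def structured_cache_key(entities: ExtractedEntities) -> str:
    """Build an intent:type:location key; '*' marks a missing part (assumed)."""
    return ":".join(
        [entities.intent, entities.institution_type or "*", entities.location or "*"]
    )
```

With keys like these, the two queries from the example land in different cache entries regardless of how close their embeddings are.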
- auth.setup.ts: require env vars for test credentials (no hardcoded defaults)
- manifest.json: update schema manifest
- full_evaluation_results.json: add RAG evaluation results
- petra-links.json: update birth date from web claim
- Add chat.spec.ts for RAG query testing
- Add count-queries.spec.ts for aggregation validation
- Add map-panel.spec.ts for geographic feature testing
- Add cache.spec.ts for response caching verification
- Add auth.setup.ts for authentication handling
- Configure playwright.config.ts for multi-browser testing
- Tests run against production archief.support
- Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations
- Update 2,463 profiles with enriched data
- Add 512 new person profiles (AU, international heritage professionals)
- PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}
- Add 'Compare' toggle button next to slots with slot_usage overrides
- Show generic slot definition vs class-specific override in 3-column grid
- Highlight changed properties with green 'changed' badge
- Display '(inherited)' when override matches generic definition
- Display '(not defined)' when generic has no value for property
- Compare: range, description, required, multivalued, slot_uri, pattern, identifier
- Full i18n support (Dutch/English translations)
- Responsive design: stacks vertically on mobile (<640px)
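
The comparison rules above can be sketched roughly as follows, in Python rather than the frontend's own code. The `compare_slot` name and the row shape are hypothetical, and treating a property that is absent from the override as inherited is an assumption; the compared property list and the display strings come from this message.

```python
from typing import Any, Dict, List

# Properties compared between the generic slot and its slot_usage override.
COMPARED_PROPERTIES = (
    "range", "description", "required", "multivalued",
    "slot_uri", "pattern", "identifier",
)


def compare_slot(generic: Dict[str, Any], override: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one row per property: generic value, override value, changed flag."""
    rows = []
    for prop in COMPARED_PROPERTIES:
        generic_value = generic.get(prop)
        override_value = override.get(prop)

        # Generic column: flag properties the generic slot never defined.
        generic_cell = "(not defined)" if generic_value is None else generic_value

        # Override column: an absent or identical value counts as inherited
        # (assumption: absence in slot_usage means "use the generic value").
        if override_value is None or override_value == generic_value:
            override_cell, changed = "(inherited)", False
        else:
            override_cell, changed = override_value, True

        rows.append({
            "property": prop,
            "generic": generic_cell,
            "override": override_cell,
            "changed": changed,  # drives the green 'changed' badge
        })
    return rows
```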
- Detect COUNT queries by checking for 'count' key in SPARQL results
- Skip institution transformation for COUNT queries to preserve count value
- Fix bug where 'Hoeveel archieven in Utrecht?' returned 1 instead of 10
- COUNT queries now correctly extract integer count from SPARQL response
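
A rough sketch of the detection step, assuming the SPARQL results arrive as a list of row dicts whose count is either a plain value or a SPARQL-JSON style `{"value": ...}` binding. The `extract_count` name and both row shapes are assumptions, not the project's actual code.

```python
from typing import Any, Dict, List, Optional


def extract_count(sparql_results: List[Dict[str, Any]]) -> Optional[int]:
    """Return the integer count for a COUNT query, or None for other queries."""
    # A COUNT query is recognised by a 'count' key in the first result row.
    if not sparql_results or "count" not in sparql_results[0]:
        return None

    raw = sparql_results[0]["count"]
    if isinstance(raw, dict):  # binding object such as {"value": "204"}
        raw = raw.get("value")

    try:
        return int(raw)
    except (TypeError, ValueError):
        return None  # malformed count: fall back to normal handling
```

A caller would skip the institution transformation whenever `extract_count` returns an integer, so the single result row is not mistaken for one institution.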
- Fix bug where COUNT queries showed Qdrant result count (10) instead of
actual SPARQL count (e.g., 204 musea in Noord-Holland)
- Use sparql_results for count extraction in factual query fast-path
- Also fix fallback COUNT/LIST handling to use sparql_results
- Add display_name and name_romanized fields to all 7948 person profiles
- Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names are now romanized properly)
- Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN
- Collision count reduced from 82 to 81 groups
Regenerated using generate_ppids.py with unidecode support (commit abe30cb)
Merge data from PENDING files (with XX-XXX placeholders) into their
corresponding enriched custodian records with proper GHCIDs.
Countries affected:
- DE: 4 institutions (Deutsche Stiftung, Jewish Museum Berlin, etc.)
- ES: 1 institution (Biblioteca Nacional de España)
- FR: 1 institution (NMO)
- ID: 18 Indonesian museums and archives
- NL: 111 Dutch institutions across all provinces
- US: 1 institution (ARCA)
The PENDING files are deleted after merge; originals archived in
data/custodian/archive/pending_merged_20250109/
- Add green 'slot_usage' badge for slots with class-specific overrides
- Add ✦ markers next to properties that are overridden vs inherited
- Add green left border styling for slots with slot_usage
- Add i18n translations (nl/en) for override indicators
- Merge generic slot definitions with class-specific slot_usage properties
This helps users understand which slot properties come from the generic
slot definition and which are overridden at the class level via slot_usage.
Identified 125 institutions from LinkedIn staff extraction that are NOT Dutch:
- FR: 45 (French museums, archives, libraries)
- ID: 14 (Indonesian institutions)
- GB: 14 (British institutions)
- DE: 13 (German museums, foundations)
- BE: 11 (Belgian museums)
- IT: 6 (Italian institutions)
- AU: 6 (Australian archives, museums)
- Plus smaller counts from IN, US, ES, CH, DK, AT, SA, NO, IL
These files have staff data from LinkedIn company pages but need
GHCID resolution (currently XX-XXX placeholders for region/city).
Dutch PENDING files remain: 1,283
Add final two chapters of the Person PID (PPID) design document:
- 08_implementation_guidelines.md: Database architecture, API design,
data ingestion pipeline, GHCID integration, security, performance,
technology stack, deployment, and monitoring specifications
- 09_governance_and_sustainability.md: Data governance policies,
quality assurance, sustainability planning, community engagement,
legal considerations, and long-term maintenance strategies
Merge LinkedIn-extracted staff sections from PENDING files into their
corresponding proper GHCID custodian files. This consolidates data from
two extraction sources:
- Existing enriched files: Google Maps, Museum Register, YouTube, etc.
- PENDING files: LinkedIn staff data extraction
Files modified:
- 28 custodian files enriched with staff data
- 35 PENDING files deleted (merged into proper locations)
- Originals archived to archive/pending_duplicates_20250109/
Key institutions enriched:
- Rijksmuseum (NL-NH-AMS-M-RM)
- Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA)
- Amsterdam Museum (NL-NH-AMS-M-AM)
- Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA)
- Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR)
- And 23 more museums/archives across NL
New scripts:
- scripts/merge_staff_data.py: Automated staff data merger
- scripts/categorize_pending_files.py: PENDING file analysis utility
Add databases: ["oxigraph"] to 5 more templates that don't benefit from vector search:
- count_institutions_by_type_location
- compare_locations
- find_by_founding
- find_custodians_by_budget_threshold
- find_institutions_by_founding_date
Total templates with Oxigraph-only routing: 10
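
As an illustration of how a `databases` field could drive routing, here is a small Python sketch. The dict-shaped template, the `qdrant` identifier for the vector store, and the default of querying both stores when the field is absent are all assumptions; only the `databases: ["oxigraph"]` value and the template names come from this message.

```python
from typing import Any, Dict, Tuple

# Assumed default: templates without a 'databases' field query both stores.
DEFAULT_DATABASES: Tuple[str, ...] = ("oxigraph", "qdrant")


def databases_for(template: Dict[str, Any]) -> Tuple[str, ...]:
    """Return the stores a template should query."""
    return tuple(template.get("databases", DEFAULT_DATABASES))


def needs_vector_search(template: Dict[str, Any]) -> bool:
    """Oxigraph-only templates skip the vector store entirely."""
    return "qdrant" in databases_for(template)


# Hypothetical template entries, reduced to the fields used here.
oxigraph_only = {"id": "compare_locations", "databases": ["oxigraph"]}
unrestricted = {"id": "some_other_template"}
```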