Commit graph

68 commits

Author SHA1 Message Date
kempersc
dfa667c90f Fix LinkML schema for valid RDF generation with proper slot_uri
Summary:
- Create 46 missing slot definition files with proper slot_uri values
- Add slot imports to main schema (01_custodian_name_modular.yaml)
- Fix YAML examples sections in 116+ class and slot files
- Fix PersonObservation.yaml examples section (nested objects → string literals)

Technical changes:
- All slots now have explicit slot_uri mapping to base ontologies (RiC-O, Schema.org, SKOS)
- Eliminates malformed URIs like 'custodian/:slot_name' in generated RDF
- gen-owl now produces valid Turtle with 153,166 triples

New slot files (46):
- RiC-O slots: rico_note, rico_organizational_principle, rico_has_or_had_holder, etc.
- Scope slots: scope_includes, scope_excludes, archive_scope
- Organization slots: organization_type, governance_authority, area_served
- Platform slots: platform_type_category, portal_type_category
- Social media slots: social_media_platform_category, post_type_*
- Type hierarchy slots: broader_type, narrower_types, custodian_type_broader
- Wikidata slots: wikidata_equivalent, wikidata_mapping

Generated output:
- schemas/20251121/rdf/01_custodian_name_modular_20260107_134534_clean.owl.ttl (6.9MB)
- Validated with rdflib: 153,166 triples, no malformed URIs
2026-01-07 13:48:03 +01:00
kempersc
98c42bf272 Fix LinkML URI conflicts and generate RDF outputs
- Fix scope_note → finding_aid_scope_note in FindingAid.yaml
- Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead)
- Remove duplicate rico_record_set_type from class_metadata_slots.yaml
- Fix range types for equals_string compatibility (uriorcurie → string)
- Move class names from close_mappings to see_also in 10 RecordSetTypes files
- Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context
- Sync schemas to frontend/public/schemas/

Files: 1,151 changed (includes prior CustodianType migration)
2026-01-07 12:32:59 +01:00
kempersc
6c6810fa43 Replace CustodianTypeCodeEnum with CustodianType class references
- Remove deprecated CustodianTypeCodeEnum from class_metadata_slots.yaml
- Update custodian_types slot to use uriorcurie range (references CustodianType subclasses)
- Update custodian_types_primary slot similarly
- Add migration note for legacy string format ['A'] vs new URI format

Per Rule 9: Enum-to-Class Promotion - Single Source of Truth
2026-01-06 12:37:40 +01:00
kempersc
b34992b1d3 Migrate all 293 class files to ontology-aligned slots
Extends migration to all class types (museums, libraries, galleries, etc.)

New slots added to class_metadata_slots.yaml:
- RiC-O: rico_record_set_type, rico_organizational_principle,
  rico_has_or_had_holder, rico_note
- Multilingual: label_de, label_es, label_fr, label_nl, label_it, label_pt
- Scope: scope_includes, scope_excludes, custodian_only,
  organizational_level, geographic_restriction
- Notes: privacy_note, preservation_note, legal_note

Migration script now handles 30+ annotation types.
All migrated schemas pass linkml-validate.

Total: 387 class files now use proper slots instead of annotations.
2026-01-06 12:24:54 +01:00
kempersc
aa763dab25 Migrate 94 archive class annotations to ontology-aligned slots
- Add migration script: scripts/migrate_annotations_to_slots.py
- Convert custodian_types, wikidata, skos_broader, specificity_* annotations
- Replace with proper slots mapped to SKOS, PROV-O, RiC-O predicates
- Add ../slots/class_metadata_slots import to all migrated files
- Remove AcademicArchive_refactored.yaml (main file now migrated)
- Sync changes to frontend/public/schemas/

Migration converts:
  - custodian_types → hc:custodianTypes slot
  - wikidata/wikidata_label → wikidata_alignment structured slot
  - skos_broader → skos:broader slot
  - specificity_* → specificity_annotation structured slot
  - dual_class_pattern → dual_class_link structured slot
  - template_specificity → template_specificity slot

All 94 migrated schemas pass linkml-validate.
2026-01-06 11:25:37 +01:00
kempersc
f37f5208ca Copy class metadata slots to frontend public folder for deployment 2026-01-06 11:17:12 +01:00
kempersc
11983014bb Enhance specificity scoring system integration with existing infrastructure
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
2026-01-05 17:37:49 +01:00
kempersc
41d8905661 Fix Turtle parser multi-line string handling for PiCo ontology
- Fixed bug where closing triple-quotes (""") would incorrectly re-trigger
  multi-line string detection, causing subsequent class definitions to be skipped
- Added lineToProcess variable to track which portion of line to process after
  closing a multi-line string, preventing re-detection of opening quotes
- Moved UML large diagram confirmation logic from OntologyViewerPage to
  UMLVisualization component for better encapsulation
- PiCo ontology now correctly shows all 8 classes instead of 2

Deployed and verified on https://bronhouder.nl/ontology?ontology=PiCo
2026-01-05 11:25:43 +01:00
kempersc
242bc8bb35 Add new slots for heritage custodian entities
- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
2026-01-05 00:49:05 +01:00
kempersc
2dca28d8c1 enrich CH entries with mission statements 2026-01-04 13:12:32 +01:00
kempersc
4f0cafe98a enrich HC profiles 2026-01-02 02:11:04 +01:00
kempersc
349f31ae6f enrich custodian profiles 2026-01-02 02:10:18 +01:00
kempersc
45e873ec0a enrich JP BE AR profiles 2025-12-30 23:07:03 +01:00
kempersc
d64f857aa9 add sparql validator and RAG injector 2025-12-30 03:43:31 +01:00
kempersc
ca219340f2 enrich entries 2025-12-26 14:30:31 +01:00
kempersc
5e8a432ef0 enrich japanese and dutch custodians 2025-12-23 18:08:45 +01:00
kempersc
a1fb6344e7 enriching custodian data 2025-12-23 17:26:29 +01:00
kempersc
0c1d19e98b enrich entries 2025-12-23 13:27:35 +01:00
kempersc
aca68ea47f remove a,bihguous web-claims 2025-12-21 00:01:54 +01:00
kempersc
23b1d8ee5f clean up GHCID 2025-12-17 11:58:40 +01:00
kempersc
99430c2a70 add new entries and semantic routing 2025-12-17 10:11:56 +01:00
kempersc
e0dd847491 extend ontology 2025-12-16 20:27:39 +01:00
kempersc
b0416efc7d enrich custodians and persons 2025-12-16 11:57:34 +01:00
kempersc
52ae711c56 add timespans 2025-12-16 09:02:52 +01:00
kempersc
b1340e30c8 add timespan 2025-12-15 22:35:35 +01:00
kempersc
cb56aa7e40 enrich all custodian timespan 2025-12-15 22:31:41 +01:00
kempersc
0a38225b36 feat(frontend): Add multi-select filters, URL params, and UI improvements
- Institution Browser: multi-select for types and countries
- URL query param sync for shareable filter URLs
- New utility: countryNames.ts with flag emoji support
- New utility: imageProxy.ts for image URL handling
- New component: SearchableMultiSelect dropdown
- Career timeline CSS and component updates
- Media gallery improvements
- Lazy load error boundary component
- Version check utility
2025-12-15 01:47:11 +01:00
kempersc
c50c35fd3a enrich person custodian 2025-12-14 17:09:55 +01:00
kempersc
505c12601a Add test script for PiCo extraction from Arabic waqf documents
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
2025-12-12 17:50:17 +01:00
kempersc
b1f93b6f22 enrich person profiles 2025-12-12 12:51:10 +01:00
kempersc
03263f67d6 moved web archives 2025-12-12 00:40:26 +01:00
kempersc
1b1cfbfca0 enrich custodians 2025-12-11 22:32:09 +01:00
kempersc
d4906abae4 update postgis data 2025-12-10 23:51:51 +01:00
kempersc
be3fbac601 enrich entries and persons 2025-12-10 18:04:25 +01:00
kempersc
41959f0766 correct HCID! 2025-12-10 13:01:13 +01:00
kempersc
3a6ead8fde feat: Add legal form filtering rule for CustodianName
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.

docs: Create README for value standardization rules

- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.

feat: Implement transliteration standards for non-Latin scripts

- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.

feat: Define XPath provenance rules for web observations

- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.

chore: Update records lifecycle diagram

- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
2025-12-09 16:58:41 +01:00
kempersc
a7321b1bb9 reconstruct location blocks 2025-12-09 12:25:16 +01:00
kempersc
cab712659d recover location blocks 2025-12-09 11:34:56 +01:00
kempersc
62fdd35321 Refactor code structure for improved readability and maintainability 2025-12-09 11:15:51 +01:00
kempersc
131e3ca259 normalise custodian entries 2025-12-09 07:56:35 +01:00
kempersc
13f67bed19 feat(frontend): add graph visualization and data explorer features
Database Panels:
- Add D3.js force-directed graph visualization to Oxigraph and TypeDB panels
- Add 'Explore' tab with class/entity browser, graph/table toggle, and search
- Add data explorer to PostgreSQL panel with table browser, pagination, search, export
- Fix SPARQL variable naming bug in Oxigraph getGraphData() function
- Add node details panel showing selected entity attributes
- Add zoom/pan controls and node coloring by entity type

Map Features:
- Add TimelineSlider component for temporal filtering of institutions
- Support dual-handle range slider with decade histogram
- Add quick presets (Ancient, Medieval, Modern, Contemporary)
- Show institution density visualization by founding decade

Hooks:
- Extend useOxigraph with getGraphData() for graph visualization
- Extend useTypeDB with getGraphData() for graph visualization
- Extend usePostgreSQL with getTableData() and exportTableData()
- Improve useDuckLakeInstitutions with temporal filtering support

Styles:
- Add HeritageDashboard.css with shared panel styling
- Add TimelineSlider.css for timeline component styling
2025-12-08 14:56:17 +01:00
kempersc
7e3559f7e5 add new entries 2025-12-07 23:08:02 +01:00
kempersc
1981dc28ed fix(frontend): normalize org_type names to letter codes in DuckLake hook
DuckLake stores full names like 'MUSEUM' but map expects single-letter
codes like 'M' for color styling. Also includes CSS fixes for Database page.
2025-12-07 19:21:58 +01:00
kempersc
810022d524 feat(frontend): add search filter and claim page filter to DuckLakePanel
- Add search bar to filter table data across all columns
- Filter web archive claims by selected page
- Include source_page in claim queries for filtering
- Fix TypeScript unused parameter warning
2025-12-07 19:20:40 +01:00
kempersc
d9325c0bb5 feat: add web archives integration and improve enrichment scripts
Backend:
- Attach web_archives.duckdb as read-only database in DuckLake
- Create views for web_archives, web_pages, web_claims in heritage schema

Scripts:
- enrich_cities_google.py: Add batch processing and retry logic
- migrate_web_archives.py: Improve schema handling and error recovery

Frontend:
- DuckLakePanel: Add web archives query support
- Database.css: Improve layout for query results display
2025-12-07 17:49:07 +01:00
kempersc
9d15cce65c docs: add enrichment reports and update manifest
Add enrichment reports from city resolution:
- Austrian, Belgian, Bulgarian, Czech, Swiss ISIL enrichment reports
- GeoNames update reports
- Custodian creation reports
- Entry-to-GHCID mapping file
2025-12-07 14:27:36 +01:00
kempersc
4825f57951 feat(frontend): improve werkgebied display and database UI
- Fix polygon rendering with static paint properties instead of data-driven
- Add ensureSourceAndLayers() helper for reliable layer management
- Use setPaintProperty() for historical vs modern styling distinction
- Improve Database page layout with back buttons and cleaner navigation
- Add ResizableNestedTable component for DuckLake data display
- Optimize spacing and layout in Database.css
- Update schema manifest
2025-12-07 14:26:37 +01:00
kempersc
ee4e57bc75 add new entries 2025-12-07 00:26:01 +01:00
kempersc
1635625032 added web annotations 2025-12-06 19:50:04 +01:00
kempersc
55e2cd2340 feat: implement LLM-based extraction for Archives Lab content
- Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0.
- Replaced regex-based extraction with generative LLM inference.
- Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics.
- Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons.
- Results and statistics are saved in JSON format for further analysis.
2025-12-05 23:16:21 +01:00