125 commits

Author SHA1 Message Date
kempersc
fd792fce2c Refactor code structure for improved readability and maintainability
Some checks failed: Deploy Frontend / build-and-deploy (push) has been cancelled
2026-01-11 15:27:14 +01:00
kempersc
55ef2a831d feat(data): add Belgian surnames dataset with metadata and surname counts 2026-01-11 13:50:20 +01:00
kempersc
7d09e4179c Add US surnames dataset from 2010 Census with metadata and surname counts 2026-01-11 12:28:58 +01:00
kempersc
dfb4744dc7 Evaluate data enrichments of persons 2026-01-11 12:15:27 +01:00
kempersc
556cc6c294 Add workspace configuration for Git and Gitea integration
- Set up GitHub integration to be disabled.
- Configure Git settings including path and autofetch options.
- Add Gitea instance URL and repository details.
- Enable YAML support for LinkML schemas with validation.
- Define file associations for YAML files.
- Recommend essential extensions for development and exclude unwanted ones.
2026-01-11 02:50:39 +01:00
kempersc
b3e57e709c Refactor code structure for improved readability and maintainability 2026-01-11 02:24:34 +01:00
kempersc
0df26a6e44 data(person): additional person profile enrichments 2026-01-11 00:41:59 +01:00
kempersc
3eb097d92e data(person): enrich 64 person profiles with comprehensive metadata
- Add inferred birth dates using EDTF notation
- Add inferred birth/current settlements
- Enrich employment history with temporal data
- Add heritage sector relevance scores
- Improve PPID component tracking
- Update .gitignore with large file patterns (warc, nt, trix, geonames.db)
2026-01-11 00:38:09 +01:00
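The EDTF notation mentioned in this commit can be illustrated with a minimal sketch. It assumes EDTF Level 1 qualifiers (`~` approximate, `?` uncertain, `%` both) and an unspecified final digit (`X`); the helper names `edtf_year` and `edtf_decade` are hypothetical and are not taken from the repository.

```python
def edtf_year(year: int, *, approximate: bool = False, uncertain: bool = False) -> str:
    """Render a year as an EDTF Level 1 string with an optional qualifier."""
    if approximate and uncertain:
        qualifier = "%"  # both uncertain and approximate
    elif approximate:
        qualifier = "~"
    elif uncertain:
        qualifier = "?"
    else:
        qualifier = ""
    return f"{year:04d}{qualifier}"


def edtf_decade(year: int) -> str:
    """Render only the decade, leaving the final digit unspecified (e.g. 197X)."""
    return f"{year // 10:03d}X"


if __name__ == "__main__":
    print(edtf_year(1972, approximate=True))
    print(edtf_decade(1972))
```

An inferred birth year would typically carry a qualifier so that downstream consumers can tell it apart from a documented date.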
kempersc
28c3aaf33f enrich profiles 2026-01-10 17:31:02 +01:00
kempersc
ad74d8379e feat(scripts): improve types-vocab extraction to derive all vocabulary from schema
- Remove hardcoded type mappings, derive dynamically from LinkML
- Extract keywords from annotations, structured_aliases, and comments
- Add rename_plural_slot.py utility for schema slot renaming
2026-01-10 15:37:52 +01:00
kempersc
e5a08a353d enrich person profiles 2026-01-10 14:14:04 +01:00
kempersc
3a15f2bdaa feat(scripts): add entity-to-PPID processing script
- Processes 94,716 LinkedIn entity files from data/custodian/person/entity/
- Identifies heritage-relevant profiles (47% of total)
- Generates PPID-formatted filenames with inferred locations/dates
- Merges with existing profiles, preserving all provenance data
- Applies Rules 12, 20, 27, 44, 45 for person data architecture
- Fixed edge case: handle null education/experience arrays
2026-01-10 13:58:06 +01:00
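The edge case fixed in this commit (null education/experience arrays) might be handled along these lines. This is only a sketch: the keys `education` and `experience` come from the commit message, while the helper names and the shape of the entity dict are assumptions.

```python
def list_field(entity: dict, key: str) -> list:
    """Return entity[key] if it is a list; treat a missing or null value as empty."""
    value = entity.get(key)
    return value if isinstance(value, list) else []


def count_history_entries(entity: dict) -> int:
    """Count education and experience entries without failing on null arrays."""
    return len(list_field(entity, "education")) + len(list_field(entity, "experience"))


if __name__ == "__main__":
    # "education" is explicitly null and "experience" is missing entirely.
    print(count_history_entries({"education": None}))
```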
kempersc
0845d9f30e feat(scripts): add person enrichment and slot mapping utilities
Person Enrichment Scripts:
- enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup
  with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication)
- enrich_ppids_linkup.py: Batch PPID enrichment pipeline
- extract_persons_with_provenance.py: Extract person data from LinkedIn HTML
  with XPath provenance tracking

LinkML Slot Management:
- update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and
  semantic URI requirements (Rule 38)
- update_class_slot_references.py: Update class files referencing renamed slots
- validate_slot_mappings.py: Validate slot definitions against ontology rules

All scripts follow established project conventions for provenance and
ontology alignment.
2026-01-10 13:32:32 +01:00
kempersc
f2bc2d54cb feat(archief-assistent): integrate ontology-driven vocabulary into semantic cache
Implements Rule 46: Ontology-Driven Cache Segmentation

Semantic Cache Enhancements:
- Add institutionSubtype, recordSetType, wikidataEntity to ExtractedEntities
- Add extractionMethod field to track vocabulary vs regex extraction
- Implement async extractEntitiesWithVocabulary() using term log
- Maintain sync regex fallback for cache key generation (<5ms)

Build Pipeline:
- Add prebuild hook to regenerate types-vocab.json from LinkML schemas
- Extract vocabulary from *Type.yaml and *Types.yaml schema files
- Generate GLAMORCUBESFIXPHDNT code mappings automatically

New Script:
- scripts/extract-types-vocab.ts - Extracts vocabulary from LinkML schemas
- Supports --skip-embeddings flag for faster builds
- Outputs to apps/archief-assistent/public/types-vocab.json

This enables richer cache segmentation using ontology-derived subtypes
(e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM') instead of just top-level
GLAMORCUBESFIXPHDNT codes.
2026-01-10 13:30:30 +01:00
kempersc
dd0ee2cf11 feat(scripts): expand university location mappings and add web enrichment
- enrich_ppids.py: Add 40+ Dutch universities and hogescholen to location mapping
- enrich_ppids_web.py: New script for web-based PPID enrichment
- resolve_pending_known_orgs.py: Updates for pending org resolution
2026-01-09 21:10:14 +01:00
kempersc
9e67d0f967 enrich profiles 2026-01-09 20:35:19 +01:00
kempersc
eaf80ec756 data(custodian): merge PENDING collision files into existing custodians
Merge staff data from 7 PENDING files into their matching custodian records:
- NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff)
- NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO
- NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS
- NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP
- NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM
- NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ
- NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV

Originals archived in data/custodian/archive/pending_collisions_20250109/
Add scripts/merge_collision_files.py for reproducible merging
2026-01-09 18:33:00 +01:00
kempersc
04791a7a91 fix(ppid): fix unidecode import reference typo 2026-01-09 18:29:36 +01:00
kempersc
abe30cb302 feat(ppid): add unidecode support for non-Latin script transliteration
Add optional unidecode dependency to handle Hebrew, Arabic, Chinese,
and other non-Latin scripts when generating Person Persistent IDs.
2026-01-09 18:28:41 +01:00
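An optional dependency of this kind is commonly wired in with a guarded import. The sketch below assumes a fallback based on the standard library's `unicodedata`; the function name `to_ascii` is hypothetical, and the repository's actual PPID code may differ.

```python
import unicodedata

try:
    from unidecode import unidecode  # optional third-party dependency
except ImportError:
    unidecode = None


def to_ascii(name: str) -> str:
    """Transliterate a name to ASCII, using unidecode when it is installed."""
    if unidecode is not None:
        return unidecode(name)
    # Fallback: strip diacritics only. Characters with no ASCII decomposition
    # (Hebrew, Arabic, Chinese, ...) are dropped, which is why unidecode helps.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("José Müller"))
```

Without unidecode, the fallback still handles accented Latin names but returns an empty string for purely non-Latin input, so callers may want to detect that case.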
kempersc
932ec5438c add person profiles with PPID 2026-01-09 18:26:58 +01:00
kempersc
7ec4e05dd4 feat(merge): add script to merge PENDING files by matching emic names with existing files 2026-01-09 16:42:55 +01:00
kempersc
1f723fd5d7 feat(data): merge staff data from 35 PENDING files into enriched custodians
Merged LinkedIn-extracted staff sections from PENDING files into their
corresponding proper GHCID custodian files. This consolidates data from
two extraction sources:
- Existing enriched files: Google Maps, Museum Register, YouTube, etc.
- PENDING files: LinkedIn staff data extraction

Files modified:
- 28 custodian files enriched with staff data
- 35 PENDING files deleted (merged into proper locations)
- Originals archived to archive/pending_duplicates_20250109/

Key institutions enriched:
- Rijksmuseum (NL-NH-AMS-M-RM)
- Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA)
- Amsterdam Museum (NL-NH-AMS-M-AM)
- Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA)
- Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR)
- And 23 more museums/archives across NL

New scripts:
- scripts/merge_staff_data.py: Automated staff data merger
- scripts/categorize_pending_files.py: PENDING file analysis utility
2026-01-09 14:51:17 +01:00
kempersc
e313744cf6 feat(scripts): add resolve_pending_locations.py for GHCID resolution
Script to resolve NL-XX-XXX-PENDING files that have city names in filename:
- Looks up city in GeoNames database
- Updates YAML with location data (city, region, country)
- Generates proper GHCID with UUID v5/v8
- Renames files to match new GHCID
- Archives original PENDING files for reference
2026-01-09 12:18:46 +01:00
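The UUID v5 step can be sketched with Python's standard `uuid` module. The namespace URL below is a placeholder, not the project's real namespace, and the UUID v8 variant mentioned in the commit is not covered here.

```python
import uuid

# Hypothetical namespace; the project's actual namespace is not shown in this log.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_uuid5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic UUID v5 from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


if __name__ == "__main__":
    print(ghcid_uuid5("NL-GR-GRO-M-SG"))
```

Because UUID v5 is a name-based hash, the same GHCID always yields the same UUID, which keeps identifiers stable when files are renamed or regenerated.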
kempersc
933deb337c refactor(scripts): generalize GHCID location fixer for all institution types
- Add --type/-t flag to specify institution type (A, G, H, I, L, M, N, O, R, S, T, U, X, ALL)
- Default still Type I (Intangible Heritage) for backward compatibility
- Skip PENDING files that have no location data
- Update help text with all supported types
2026-01-09 11:54:28 +01:00
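A flag like the one described here could be declared with `argparse` roughly as follows. The type codes and the default of `I` come from the commit message; the parser description, the `dest` name, and the function name are assumptions.

```python
import argparse

# Institution type codes listed in the commit message.
INSTITUTION_TYPES = ["A", "G", "H", "I", "L", "M", "N", "O", "R", "S", "T", "U", "X", "ALL"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a --type/-t flag that defaults to Type I."""
    parser = argparse.ArgumentParser(description="Fix GHCID locations (sketch).")
    parser.add_argument(
        "--type", "-t",
        dest="institution_type",
        choices=INSTITUTION_TYPES,
        default="I",  # Intangible Heritage, kept for backward compatibility
        help="Institution type to process, or ALL for every type.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.institution_type)
```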
kempersc
c88fd3af70 Refactor code structure for improved readability and maintainability 2026-01-09 11:05:26 +01:00
kempersc
6608a207d4 update frontend 2026-01-08 15:56:28 +01:00
kempersc
9d68ed8c2e fix: mark 15 more Google Maps false matches via comprehensive review
Manual review of remaining Type I custodian files without official websites
identified additional false matches in these categories:

Wrong organization type:
- Bird catchers vs bird watchers association
- Heritage org vs webshop
- Regional org vs specific local entity
- Federation vs single member association
- Bell ringers org vs church building

Wrong location:
- Amsterdam org matched to Den Haag
- Haarlem org matched to Apeldoorn
- Rotterdam org matched to Amstelveen
- Dutch org matched to Suriname (!)
- Giethoorn event matched to Belt-Schutsloot
- Duindorp bonfire matched to Scheveningen

Different event/entity:
- Horse racing org vs summer festival
- Street name vs organization
- Heritage foundation vs specific local fair

Total Type I false matches fixed: 62 of 188 files (33%)
2026-01-08 15:21:31 +01:00
kempersc
85d9cee82f fix: mark 8 more Google Maps false matches detected via name mismatch
Additional Type I custodian files with obvious name mismatches between
KIEN registry entries and Google Maps results. These couldn't be
auto-detected via domain mismatch because they lack official websites.

Fixes:
- Dick Timmerman (person) → carpentry business
- Ria Bos (cigar maker) → money transfer agent
- Stichting Kracom (Krampuslauf) → Happy Caps retail
- Fed. Nederlandse Vertelorganisaties → NET Foundation
- Stichting dodenherdenking Alphen → wrong memorial
- Sao Joao Rotterdam → Heemraadsplein (location not org)
- sport en spel (heritage) → equipment rental
- Eiertikken Ommen → restaurant

Also adds detection and fix scripts for Google Maps false matches.
2026-01-08 13:26:53 +01:00
kempersc
dfa667c90f Fix LinkML schema for valid RDF generation with proper slot_uri
Summary:
- Create 46 missing slot definition files with proper slot_uri values
- Add slot imports to main schema (01_custodian_name_modular.yaml)
- Fix YAML examples sections in 116+ class and slot files
- Fix PersonObservation.yaml examples section (nested objects → string literals)

Technical changes:
- All slots now have explicit slot_uri mapping to base ontologies (RiC-O, Schema.org, SKOS)
- Eliminates malformed URIs like 'custodian/:slot_name' in generated RDF
- gen-owl now produces valid Turtle with 153,166 triples

New slot files (46):
- RiC-O slots: rico_note, rico_organizational_principle, rico_has_or_had_holder, etc.
- Scope slots: scope_includes, scope_excludes, archive_scope
- Organization slots: organization_type, governance_authority, area_served
- Platform slots: platform_type_category, portal_type_category
- Social media slots: social_media_platform_category, post_type_*
- Type hierarchy slots: broader_type, narrower_types, custodian_type_broader
- Wikidata slots: wikidata_equivalent, wikidata_mapping

Generated output:
- schemas/20251121/rdf/01_custodian_name_modular_20260107_134534_clean.owl.ttl (6.9MB)
- Validated with rdflib: 153,166 triples, no malformed URIs
2026-01-07 13:48:03 +01:00
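The malformed-URI check can be approximated with a small standard-library scan. This is not the rdflib validation the commit refers to; it is a simplified sketch that flags any IRI containing `/:`, the pattern seen in `custodian/:slot_name`. The function name and the sample data are hypothetical.

```python
import re

# Matches IRIs written in angle brackets, as in Turtle or N-Triples.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def find_malformed_iris(turtle_text: str) -> list[str]:
    """Return IRIs whose path contains '/:', e.g. '.../custodian/:slot_name'."""
    return [iri for iri in IRI_PATTERN.findall(turtle_text) if "/:" in iri]


if __name__ == "__main__":
    sample = (
        "<https://example.org/custodian/Museum> "
        "<https://example.org/custodian/:slot_name> "
        "<https://example.org/custodian/Archive> .\n"
    )
    print(find_malformed_iris(sample))
```

A scan like this only catches one known failure pattern; parsing the file with a real RDF library remains the stronger check.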
kempersc
98c42bf272 Fix LinkML URI conflicts and generate RDF outputs
- Fix scope_note → finding_aid_scope_note in FindingAid.yaml
- Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead)
- Remove duplicate rico_record_set_type from class_metadata_slots.yaml
- Fix range types for equals_string compatibility (uriorcurie → string)
- Move class names from close_mappings to see_also in 10 RecordSetTypes files
- Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context
- Sync schemas to frontend/public/schemas/

Files: 1,151 changed (includes prior CustodianType migration)
2026-01-07 12:32:59 +01:00
kempersc
b34992b1d3 Migrate all 293 class files to ontology-aligned slots
Extends migration to all class types (museums, libraries, galleries, etc.)

New slots added to class_metadata_slots.yaml:
- RiC-O: rico_record_set_type, rico_organizational_principle,
  rico_has_or_had_holder, rico_note
- Multilingual: label_de, label_es, label_fr, label_nl, label_it, label_pt
- Scope: scope_includes, scope_excludes, custodian_only,
  organizational_level, geographic_restriction
- Notes: privacy_note, preservation_note, legal_note

Migration script now handles 30+ annotation types.
All migrated schemas pass linkml-validate.

Total: 387 class files now use proper slots instead of annotations.
2026-01-06 12:24:54 +01:00
kempersc
aa763dab25 Migrate 94 archive class annotations to ontology-aligned slots
- Add migration script: scripts/migrate_annotations_to_slots.py
- Convert custodian_types, wikidata, skos_broader, specificity_* annotations
- Replace with proper slots mapped to SKOS, PROV-O, RiC-O predicates
- Add ../slots/class_metadata_slots import to all migrated files
- Remove AcademicArchive_refactored.yaml (main file now migrated)
- Sync changes to frontend/public/schemas/

Migration converts:
  - custodian_types → hc:custodianTypes slot
  - wikidata/wikidata_label → wikidata_alignment structured slot
  - skos_broader → skos:broader slot
  - specificity_* → specificity_annotation structured slot
  - dual_class_pattern → dual_class_link structured slot
  - template_specificity → template_specificity slot

All 94 migrated schemas pass linkml-validate.
2026-01-06 11:25:37 +01:00
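The annotation-to-slot conversion listed above can be pictured as a key-renaming pass. The sketch below covers only the three one-to-one mappings named in the commit message and works on a plain dict; the real `migrate_annotations_to_slots.py` and the actual LinkML class layout are not shown in this log, so the structure here is an assumption.

```python
# One-to-one mappings taken from the commit message; structured slots
# (wikidata_alignment, specificity_annotation, dual_class_link) are omitted.
ANNOTATION_TO_SLOT = {
    "custodian_types": "hc:custodianTypes",
    "skos_broader": "skos:broader",
    "template_specificity": "template_specificity",
}


def migrate_class(class_def: dict) -> dict:
    """Move known annotations to slot keys; keep unknown annotations in place."""
    migrated = {k: v for k, v in class_def.items() if k != "annotations"}
    remaining = {}
    for name, value in (class_def.get("annotations") or {}).items():
        slot = ANNOTATION_TO_SLOT.get(name)
        if slot is None:
            remaining[name] = value
        else:
            migrated[slot] = value
    if remaining:
        migrated["annotations"] = remaining
    return migrated


if __name__ == "__main__":
    print(migrate_class({"annotations": {"skos_broader": "Archive", "other": 1}}))
```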
kempersc
11983014bb Enhance specificity scoring system integration with existing infrastructure
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
2026-01-05 17:37:49 +01:00
kempersc
242bc8bb35 Add new slots for heritage custodian entities
- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
2026-01-05 00:49:05 +01:00
kempersc
2dca28d8c1 enrich CH entries with mission statements 2026-01-04 13:12:32 +01:00
kempersc
4f0cafe98a enrich HC profiles 2026-01-02 02:11:04 +01:00
kempersc
349f31ae6f enrich custodian profiles 2026-01-02 02:10:18 +01:00
kempersc
45e873ec0a enrich JP BE AR profiles 2025-12-30 23:07:03 +01:00
kempersc
f753d7277f Add country code extraction for location validation in Google Places API 2025-12-30 03:45:29 +01:00
kempersc
d64f857aa9 add sparql validator and RAG injector 2025-12-30 03:43:31 +01:00
kempersc
84904e344b Make AGENTS more succinct by referring to opencode rules & enrich custodians 2025-12-28 14:56:35 +01:00
kempersc
cdb633b0c9 enrich custodian entries with logo 2025-12-27 02:15:17 +01:00
kempersc
6af5009444 enrich entries 2025-12-26 21:41:18 +01:00
kempersc
ca219340f2 enrich entries 2025-12-26 14:30:31 +01:00
kempersc
38292d1918 enrich: logo enrichment for JP custodians (1350 processed, 10746 remaining) 2025-12-23 20:56:21 +01:00
kempersc
5e8a432ef0 enrich Japanese and Dutch custodians 2025-12-23 18:08:45 +01:00
kempersc
a1fb6344e7 enriching custodian data 2025-12-23 17:26:29 +01:00
kempersc
0c1d19e98b enrich entries 2025-12-23 13:27:35 +01:00
kempersc
7a056fa746 enrich entries 2025-12-21 22:12:34 +01:00
kempersc
aca68ea47f remove ambiguous web-claims 2025-12-21 00:01:54 +01:00