kempersc/glam - Forgejo: Beyond coding. We Forge.

Author	SHA1	Message	Date
kempersc	c2629f6d29	Fix LinkML schema validation errors (0 errors, 30 warnings) Schema Migration Fixes: - Fix YAML import indentation in ~650 slot files (linkml:types and enum imports) - Rename slot reference: has_or_had_holds_record_set_type → hold_or_held_record_set_type (70+ archive class files, main schema, manifest.json) - Fix ProvenanceBlock.yaml: remove invalid any_of range, use string with multivalued - Fix has_or_had_provenance.yaml: remove nested template_specificity from annotations Validation Status: - 0 errors (was multiple import/reference errors) - 30 warnings (missing descriptions on inline slots, intentional SCREAMING_CASE names) Files changed: ~3,850 (slots, classes, main schema, manifest)	2026-01-15 23:21:38 +01:00
kempersc	0cc8c8ca8f	Add archived slot definitions for various attributes in the HC ontology - Introduced new YAML files for slots including typical_scope, typical_technical_feature, unit_affiliation, used, used_by, user_community, verified, web_observation, whatsapp_business_likelihood, wikidata_alignment, wikidata, wikidata_entity, wikidata_equivalent, wikidata_id, wikidata_mapping, stores_or_stored, and time_of_destruction. - Each slot includes detailed descriptions, mappings, and examples to enhance the ontology's semantic structure. - Migrated and centralized the 'stores_object' slot into 'stores_or_stored' to comply with RiC-O naming conventions. - Added comprehensive documentation for temporal-aware slots to support better data integration and querying capabilities.	2026-01-15 20:44:51 +01:00
kempersc	416aa407cc	Add new slots for financial and heritage documentation - Introduced total expense, total frames analyzed, total investment, total liability, total net asset, and traditional product slots to enhance financial reporting capabilities. - Added transition types detected, treatment description, type hypothesis, typical condition, typical HTTP methods, typical response formats, and typical scope slots for improved heritage documentation. - Implemented user community, verified, web observation, WhatsApp business likelihood, wikidata equivalent, and wikidata mapping slots to enrich institutional data representation. - Established has_or_had_asset, has_or_had_budget, has_or_had_expense, and is_or_was_threatened_by slots to capture asset, budget, expense relationships, and threats to heritage forms.	2026-01-15 19:35:39 +01:00
kempersc	3fb27c15e2	Refactor and archive deprecated slots; update migration records - Removed deprecated slots: storage_security_level, version_number, video_comment, visiting_hour, was_asserted_by, was_revision_of, writing_system. - Archived corresponding YAML files for deprecated slots with detailed migration notes. - Updated slot definitions for has_collection and encompassing_body to reflect new naming conventions and temporal patterns. - Enhanced metadata extraction in index_persons_qdrant.py to include WCMS registration and data sources. - Modified hybrid_retriever and multi_embedding_retriever to support filtering by WCMS registration status.	2026-01-15 13:16:59 +01:00
kempersc	043ea868b5	fix(schema): Resolve broken imports after slot migration All checks were successful Deploy Frontend / build-and-deploy (push) Successful in 4m31s Details - Fix empty import list elements (- # comment pattern) in Laptop, Expenses, FunctionType, Overview, WebLink, Photography classes - Replace valid_from/valid_to slots with temporal_extent in class slots lists - Update slot_usage to use temporal_extent with TimeSpan range - Update examples to use temporal_extent with begin_of_the_begin/end_of_the_end - Fix typo is_or_was_is_or_was_archived_at → is_or_was_archived_at in WebObservation - Add TimeSpan imports to classes using temporal_extent - Fix relative import paths for Timestamp in temporal slots - Fix CustodianIdentifier → Identifier imports in FundingAgenda, ReadingRoomAnnex Schema validates successfully with 902 classes and 2043 slots.	2026-01-15 12:25:27 +01:00
kempersc	6c3fa6b5a3	Remove deprecated slots and add new slot definitions for enhanced data modeling - Deleted obsolete slot definitions for work_location and workshop_space. - Introduced new TaxonName class to represent scientific taxonomic names with detailed attributes. - Archived existing slots related to surname_prefix, target_name, taxon_name, terminal_count, text_region_count, title, title_proper, total_chapter, total_characters_extracted, total_connections_extracted, track_name, transcript_format, traveling_venue, type_label, type_status, typical_responsibility, unesco_domain, unesco_inscription_year, unesco_list_status, uniform_title, unit_name, used_by_custodian, uv_filtered_required, valid_from_geo, valid_to_geo, validation_status, variant_of_name, verification_date, viability_status, within_auxiliary_place, and within_place. - Updated slot descriptions and structures to improve clarity and compliance with standards.	2026-01-15 11:42:35 +01:00
kempersc	b30711fcfb	update slots	2026-01-14 09:05:54 +01:00
kempersc	9a395f3dbe	fix: improve birth year extraction to avoid date suffix false positives - Skip YYYYMMDD and YYMMDD date patterns at end of email - Skip digit sequences longer than 4 characters - Require non-digit before 4-digit years at end - Add knid.nl/kabelnoord.nl to consumer domains (Friesland ISP) - Add 11 missing regional archive domains to HERITAGE_DOMAIN_MAP - Update recalculation script to re-extract email semantics Results: - 3,151 false birth years removed - 'Likely wrong person' reduced from 533 to 325 (-39%) - 2,944 candidates' scores boosted	2026-01-13 22:37:10 +01:00
kempersc	92b490d690	edit slots	2026-01-13 20:35:11 +01:00
kempersc	f74513e8ef	feat: Enhance entity resolution with email semantics and review merging - Updated `entity_review.py` to map email semantic fields from JSON. - Expanded `email_semantics.py` with additional museum mappings. - Introduced a new rule in `.opencode/rules/no-duplicate-ontology-mappings.md` to prevent duplicate ontology mappings. - Added a backup JSON file for entity resolution candidates. - Created `enrich_email_semantics.py` to enrich candidates with email semantic signals. - Developed `merge_entity_reviews.py` to merge reviewed decisions from a backup into new candidates.	2026-01-13 16:43:56 +01:00
kempersc	1fb924c412	feat: add ontology mappings to LinkML schema and enhance entity resolution Schema enhancements (443 files): - Add class_uri with proper ontology references (schema:, prov:, skos:, rico:) - Add close_mappings, related_mappings per Rule 50 convention - Replace stub hc: slot_uri with standard predicates (dcterms:identifier, skos:prefLabel) - Improve descriptions with ontology mapping rationale - Add prefixes blocks to all schema modules Entity Resolution improvements: - Add entity_resolution module with email semantics parsing - Enhance build_entity_resolution.py with email-based matching signals - Extend Entity Review API with filtering by signal types and count - Add candidates caching and indexing for performance - Add ReviewLoginPage component New rules and documentation: - Add Rule 51: No Hallucinated Ontology References - Add .opencode/rules/no-hallucinated-ontology-references.md - Add .opencode/rules/slot-ontology-mapping-reference.md - Add adms.ttl and dqv.ttl ontology files Frontend ontology support: - Add RiC-O_1-1.rdf and schemaorg.owl to public/ontology	2026-01-13 13:51:02 +01:00
kempersc	846a6cdcec	Add new Record Set Types for various archival collections - Introduced SoundArchiveRecordSetType, SpecialCollectionRecordSetType, SpecializedArchiveRecordSetType, SpecializedArchivesCzechiaRecordSetType, StateArchivesRecordSetType, StateArchivesSectionRecordSetType, StateDistrictArchiveRecordSetType, StateRegionalArchiveCzechiaRecordSetType, TelevisionArchiveRecordSetType, TradeUnionArchiveRecordSetType, UniversityArchiveRecordSetType, VereinsarchivRecordSetType, VerlagsarchivRecordSetType, VerwaltungsarchivRecordSetType, WebArchiveRecordSetType, and WomensArchivesRecordSetType. - Each new type includes appropriate metadata, slots, and relationships to existing classes. - Implemented a script to detect and fix Type class violations in LinkML files.	2026-01-12 15:20:29 +01:00
kempersc	355d8be51d	centralise slots	2026-01-12 14:33:56 +01:00
kempersc	070c87af7b	refactor(migrate_wcms_resume): use recursive glob to find user JSON files and skip macOS hidden files	2026-01-11 23:32:27 +01:00
kempersc	56c373bba8	Implement fast WCMS migration script with state file checkpointing and batch processing	2026-01-11 22:26:37 +01:00
kempersc	174a420c08	refactor(schema): centralize 1515 inline slot definitions per Rule 48 All checks were successful Deploy Frontend / build-and-deploy (push) Successful in 3m57s Details - Remove inline slot definitions from 144 class files - Create 7 new centralized slot files in modules/slots/: - custodian_type_broader.yaml - custodian_type_narrower.yaml - custodian_type_related.yaml - definition.yaml - finding_aid_access_restriction.yaml - finding_aid_description.yaml - finding_aid_temporal_coverage.yaml - Add centralize_inline_slots.py automation script - Update manifest with new timestamp Rule 48: Class files must NOT define inline slots - all slots must be imported from modules/slots/ directory. Note: Pre-existing IdentifierFormat duplicate class definition (in Standard.yaml and IdentifierFormat.yaml) not addressed in this commit - requires separate schema refactor.	2026-01-11 22:02:14 +01:00
kempersc	fce186b649	enrich person profiles	2026-01-11 18:08:40 +01:00
kempersc	fd792fce2c	Refactor code structure for improved readability and maintainability Some checks failed Deploy Frontend / build-and-deploy (push) Has been cancelled Details	2026-01-11 15:27:14 +01:00
kempersc	55ef2a831d	feat(data): add Belgian surnames dataset with metadata and surname counts	2026-01-11 13:50:20 +01:00
kempersc	7d09e4179c	Add US surnames dataset from 2010 Census with metadata and surname counts	2026-01-11 12:28:58 +01:00
kempersc	dfb4744dc7	Evaluate data enrichments of persons	2026-01-11 12:15:27 +01:00
kempersc	556cc6c294	Add workspace configuration for Git and Gitea integration - Set up GitHub integration to be disabled. - Configure Git settings including path and autofetch options. - Add Gitea instance URL and repository details. - Enable YAML support for LinkML schemas with validation. - Define file associations for YAML files. - Recommend essential extensions for development and exclude unwanted ones.	2026-01-11 02:50:39 +01:00
kempersc	b3e57e709c	Refactor code structure for improved readability and maintainability	2026-01-11 02:24:34 +01:00
kempersc	0df26a6e44	data(person): additional person profile enrichments	2026-01-11 00:41:59 +01:00
kempersc	3eb097d92e	data(person): enrich 64 person profiles with comprehensive metadata - Add inferred birth dates using EDTF notation - Add inferred birth/current settlements - Enrich employment history with temporal data - Add heritage sector relevance scores - Improve PPID component tracking - Update .gitignore with large file patterns (warc, nt, trix, geonames.db)	2026-01-11 00:38:09 +01:00
kempersc	28c3aaf33f	enrich profiles	2026-01-10 17:31:02 +01:00
kempersc	ad74d8379e	feat(scripts): improve types-vocab extraction to derive all vocabulary from schema - Remove hardcoded type mappings, derive dynamically from LinkML - Extract keywords from annotations, structured_aliases, and comments - Add rename_plural_slot.py utility for schema slot renaming	2026-01-10 15:37:52 +01:00
kempersc	e5a08a353d	enrich person profiles	2026-01-10 14:14:04 +01:00
kempersc	3a15f2bdaa	feat(scripts): add entity-to-PPID processing script - Processes 94,716 LinkedIn entity files from data/custodian/person/entity/ - Identifies heritage-relevant profiles (47% of total) - Generates PPID-formatted filenames with inferred locations/dates - Merges with existing profiles, preserving all provenance data - Applies Rules 12, 20, 27, 44, 45 for person data architecture - Fixed edge case: handle null education/experience arrays	2026-01-10 13:58:06 +01:00
kempersc	0845d9f30e	feat(scripts): add person enrichment and slot mapping utilities Person Enrichment Scripts: - enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication) - enrich_ppids_linkup.py: Batch PPID enrichment pipeline - extract_persons_with_provenance.py: Extract person data from LinkedIn HTML with XPath provenance tracking LinkML Slot Management: - update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and semantic URI requirements (Rule 38) - update_class_slot_references.py: Update class files referencing renamed slots - validate_slot_mappings.py: Validate slot definitions against ontology rules All scripts follow established project conventions for provenance and ontology alignment.	2026-01-10 13:32:32 +01:00
kempersc	f2bc2d54cb	feat(archief-assistent): integrate ontology-driven vocabulary into semantic cache Implements Rule 46: Ontology-Driven Cache Segmentation Semantic Cache Enhancements: - Add institutionSubtype, recordSetType, wikidataEntity to ExtractedEntities - Add extractionMethod field to track vocabulary vs regex extraction - Implement async extractEntitiesWithVocabulary() using term log - Maintain sync regex fallback for cache key generation (<5ms) Build Pipeline: - Add prebuild hook to regenerate types-vocab.json from LinkML schemas - Extract vocabulary from Type.yaml and Types.yaml schema files - Generate GLAMORCUBESFIXPHDNT code mappings automatically New Script: - scripts/extract-types-vocab.ts - Extracts vocabulary from LinkML schemas - Supports --skip-embeddings flag for faster builds - Outputs to apps/archief-assistent/public/types-vocab.json This enables richer cache segmentation using ontology-derived subtypes (e.g., 'MUNICIPAL_ARCHIVE', 'ART_MUSEUM') instead of just top-level GLAMORCUBESFIXPHDNT codes.	2026-01-10 13:30:30 +01:00
kempersc	dd0ee2cf11	feat(scripts): expand university location mappings and add web enrichment - enrich_ppids.py: Add 40+ Dutch universities and hogescholen to location mapping - enrich_ppids_web.py: New script for web-based PPID enrichment - resolve_pending_known_orgs.py: Updates for pending org resolution	2026-01-09 21:10:14 +01:00
kempersc	9e67d0f967	enrich profiles	2026-01-09 20:35:19 +01:00
kempersc	eaf80ec756	data(custodian): merge PENDING collision files into existing custodians Merge staff data from 7 PENDING files into their matching custodian records: - NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff) - NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO - NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS - NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP - NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM - NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ - NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV Originals archived in data/custodian/archive/pending_collisions_20250109/ Add scripts/merge_collision_files.py for reproducible merging	2026-01-09 18:33:00 +01:00
kempersc	04791a7a91	fix(ppid): fix unidecode import reference typo	2026-01-09 18:29:36 +01:00
kempersc	abe30cb302	feat(ppid): add unidecode support for non-Latin script transliteration Add optional unidecode dependency to handle Hebrew, Arabic, Chinese, and other non-Latin scripts when generating Person Persistent IDs.	2026-01-09 18:28:41 +01:00
kempersc	932ec5438c	add person profiles with PPID	2026-01-09 18:26:58 +01:00
kempersc	7ec4e05dd4	feat(merge): add script to merge PENDING files by matching emic names with existing files	2026-01-09 16:42:55 +01:00
kempersc	1f723fd5d7	feat(data): merge staff data from 35 PENDING files into enriched custodians Merged LinkedIn-extracted staff sections from PENDING files into their corresponding proper GHCID custodian files. This consolidates data from two extraction sources: - Existing enriched files: Google Maps, Museum Register, YouTube, etc. - PENDING files: LinkedIn staff data extraction Files modified: - 28 custodian files enriched with staff data - 35 PENDING files deleted (merged into proper locations) - Originals archived to archive/pending_duplicates_20250109/ Key institutions enriched: - Rijksmuseum (NL-NH-AMS-M-RM) - Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA) - Amsterdam Museum (NL-NH-AMS-M-AM) - Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA) - Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR) - And 23 more museums/archives across NL New scripts: - scripts/merge_staff_data.py: Automated staff data merger - scripts/categorize_pending_files.py: PENDING file analysis utility	2026-01-09 14:51:17 +01:00
kempersc	e313744cf6	feat(scripts): add resolve_pending_locations.py for GHCID resolution Script to resolve NL-XX-XXX-PENDING files that have city names in filename: - Looks up city in GeoNames database - Updates YAML with location data (city, region, country) - Generates proper GHCID with UUID v5/v8 - Renames files to match new GHCID - Archives original PENDING files for reference	2026-01-09 12:18:46 +01:00
kempersc	933deb337c	refactor(scripts): generalize GHCID location fixer for all institution types - Add --type/-t flag to specify institution type (A, G, H, I, L, M, N, O, R, S, T, U, X, ALL) - Default still Type I (Intangible Heritage) for backward compatibility - Skip PENDING files that have no location data - Update help text with all supported types	2026-01-09 11:54:28 +01:00
kempersc	c88fd3af70	Refactor code structure for improved readability and maintainability	2026-01-09 11:05:26 +01:00
kempersc	6608a207d4	update frontend	2026-01-08 15:56:28 +01:00
kempersc	9d68ed8c2e	fix: mark 15 more Google Maps false matches via comprehensive review Manual review of remaining Type I custodian files without official websites identified additional false matches in these categories: Wrong organization type: - Bird catchers vs bird watchers association - Heritage org vs webshop - Regional org vs specific local entity - Federation vs single member association - Bell ringers org vs church building Wrong location: - Amsterdam org matched to Den Haag - Haarlem org matched to Apeldoorn - Rotterdam org matched to Amstelveen - Dutch org matched to Suriname (!) - Giethoorn event matched to Belt-Schutsloot - Duindorp bonfire matched to Scheveningen Different event/entity: - Horse racing org vs summer festival - Street name vs organization - Heritage foundation vs specific local fair Total Type I false matches fixed: 62 of 188 files (33%)	2026-01-08 15:21:31 +01:00
kempersc	85d9cee82f	fix: mark 8 more Google Maps false matches detected via name mismatch Additional Type I custodian files with obvious name mismatches between KIEN registry entries and Google Maps results. These couldn't be auto-detected via domain mismatch because they lack official websites. Fixes: - Dick Timmerman (person) → carpentry business - Ria Bos (cigar maker) → money transfer agent - Stichting Kracom (Krampuslauf) → Happy Caps retail - Fed. Nederlandse Vertelorganisaties → NET Foundation - Stichting dodenherdenking Alphen → wrong memorial - Sao Joao Rotterdam → Heemraadsplein (location not org) - sport en spel (heritage) → equipment rental - Eiertikken Ommen → restaurant Also adds detection and fix scripts for Google Maps false matches.	2026-01-08 13:26:53 +01:00
kempersc	dfa667c90f	Fix LinkML schema for valid RDF generation with proper slot_uri Summary: - Create 46 missing slot definition files with proper slot_uri values - Add slot imports to main schema (01_custodian_name_modular.yaml) - Fix YAML examples sections in 116+ class and slot files - Fix PersonObservation.yaml examples section (nested objects → string literals) Technical changes: - All slots now have explicit slot_uri mapping to base ontologies (RiC-O, Schema.org, SKOS) - Eliminates malformed URIs like 'custodian/:slot_name' in generated RDF - gen-owl now produces valid Turtle with 153,166 triples New slot files (46): - RiC-O slots: rico_note, rico_organizational_principle, rico_has_or_had_holder, etc. - Scope slots: scope_includes, scope_excludes, archive_scope - Organization slots: organization_type, governance_authority, area_served - Platform slots: platform_type_category, portal_type_category - Social media slots: social_media_platform_category, post_type_* - Type hierarchy slots: broader_type, narrower_types, custodian_type_broader - Wikidata slots: wikidata_equivalent, wikidata_mapping Generated output: - schemas/20251121/rdf/01_custodian_name_modular_20260107_134534_clean.owl.ttl (6.9MB) - Validated with rdflib: 153,166 triples, no malformed URIs	2026-01-07 13:48:03 +01:00
kempersc	98c42bf272	Fix LinkML URI conflicts and generate RDF outputs - Fix scope_note → finding_aid_scope_note in FindingAid.yaml - Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead) - Remove duplicate rico_record_set_type from class_metadata_slots.yaml - Fix range types for equals_string compatibility (uriorcurie → string) - Move class names from close_mappings to see_also in 10 RecordSetTypes files - Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context - Sync schemas to frontend/public/schemas/ Files: 1,151 changed (includes prior CustodianType migration)	2026-01-07 12:32:59 +01:00
kempersc	b34992b1d3	Migrate all 293 class files to ontology-aligned slots Extends migration to all class types (museums, libraries, galleries, etc.) New slots added to class_metadata_slots.yaml: - RiC-O: rico_record_set_type, rico_organizational_principle, rico_has_or_had_holder, rico_note - Multilingual: label_de, label_es, label_fr, label_nl, label_it, label_pt - Scope: scope_includes, scope_excludes, custodian_only, organizational_level, geographic_restriction - Notes: privacy_note, preservation_note, legal_note Migration script now handles 30+ annotation types. All migrated schemas pass linkml-validate. Total: 387 class files now use proper slots instead of annotations.	2026-01-06 12:24:54 +01:00
kempersc	aa763dab25	Migrate 94 archive class annotations to ontology-aligned slots - Add migration script: scripts/migrate_annotations_to_slots.py - Convert custodian_types, wikidata, skos_broader, specificity_* annotations - Replace with proper slots mapped to SKOS, PROV-O, RiC-O predicates - Add ../slots/class_metadata_slots import to all migrated files - Remove AcademicArchive_refactored.yaml (main file now migrated) - Sync changes to frontend/public/schemas/ Migration converts: - custodian_types → hc:custodianTypes slot - wikidata/wikidata_label → wikidata_alignment structured slot - skos_broader → skos:broader slot - specificity_* → specificity_annotation structured slot - dual_class_pattern → dual_class_link structured slot - template_specificity → template_specificity slot All 94 migrated schemas pass linkml-validate.	2026-01-06 11:25:37 +01:00
kempersc	11983014bb	Enhance specificity scoring system integration with existing infrastructure - Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework. - Added detailed mapping of SPARQL templates to context templates for improved specificity filtering. - Implemented wrapper patterns around existing classifiers to extend functionality without duplication. - Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality. - Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.	2026-01-05 17:37:49 +01:00

1 2 3

142 commits