kempersc/glam

Author	SHA1	Message	Date
kempersc	92b490d690	edit slots	2026-01-13 20:35:11 +01:00
kempersc	f74513e8ef	feat: Enhance entity resolution with email semantics and review merging - Updated `entity_review.py` to map email semantic fields from JSON. - Expanded `email_semantics.py` with additional museum mappings. - Introduced a new rule in `.opencode/rules/no-duplicate-ontology-mappings.md` to prevent duplicate ontology mappings. - Added a backup JSON file for entity resolution candidates. - Created `enrich_email_semantics.py` to enrich candidates with email semantic signals. - Developed `merge_entity_reviews.py` to merge reviewed decisions from a backup into new candidates.	2026-01-13 16:43:56 +01:00
kempersc	1fb924c412	feat: add ontology mappings to LinkML schema and enhance entity resolution Schema enhancements (443 files): - Add class_uri with proper ontology references (schema:, prov:, skos:, rico:) - Add close_mappings, related_mappings per Rule 50 convention - Replace stub hc: slot_uri with standard predicates (dcterms:identifier, skos:prefLabel) - Improve descriptions with ontology mapping rationale - Add prefixes blocks to all schema modules Entity Resolution improvements: - Add entity_resolution module with email semantics parsing - Enhance build_entity_resolution.py with email-based matching signals - Extend Entity Review API with filtering by signal types and count - Add candidates caching and indexing for performance - Add ReviewLoginPage component New rules and documentation: - Add Rule 51: No Hallucinated Ontology References - Add .opencode/rules/no-hallucinated-ontology-references.md - Add .opencode/rules/slot-ontology-mapping-reference.md - Add adms.ttl and dqv.ttl ontology files Frontend ontology support: - Add RiC-O_1-1.rdf and schemaorg.owl to public/ontology	2026-01-13 13:51:02 +01:00
kempersc	846a6cdcec	Add new Record Set Types for various archival collections - Introduced SoundArchiveRecordSetType, SpecialCollectionRecordSetType, SpecializedArchiveRecordSetType, SpecializedArchivesCzechiaRecordSetType, StateArchivesRecordSetType, StateArchivesSectionRecordSetType, StateDistrictArchiveRecordSetType, StateRegionalArchiveCzechiaRecordSetType, TelevisionArchiveRecordSetType, TradeUnionArchiveRecordSetType, UniversityArchiveRecordSetType, VereinsarchivRecordSetType, VerlagsarchivRecordSetType, VerwaltungsarchivRecordSetType, WebArchiveRecordSetType, and WomensArchivesRecordSetType. - Each new type includes appropriate metadata, slots, and relationships to existing classes. - Implemented a script to detect and fix Type class violations in LinkML files.	2026-01-12 15:20:29 +01:00
kempersc	355d8be51d	centralise slots	2026-01-12 14:33:56 +01:00
kempersc	070c87af7b	refactor(migrate_wcms_resume): use recursive glob to find user JSON files and skip macOS hidden files	2026-01-11 23:32:27 +01:00
kempersc	56c373bba8	Implement fast WCMS migration script with state file checkpointing and batch processing	2026-01-11 22:26:37 +01:00
kempersc	fce186b649	enrich person profiles	2026-01-11 18:08:40 +01:00
kempersc	fd792fce2c	Refactor code structure for improved readability and maintainability	2026-01-11 15:27:14 +01:00
kempersc	55ef2a831d	feat(data): add Belgian surnames dataset with metadata and surname counts	2026-01-11 13:50:20 +01:00
kempersc	7d09e4179c	Add US surnames dataset from 2010 Census with metadata and surname counts	2026-01-11 12:28:58 +01:00
kempersc	dfb4744dc7	Evaluate data enrichments of persons	2026-01-11 12:15:27 +01:00
kempersc	49a8c341b5	chore(data): update geonames database journal file	2026-01-11 02:51:52 +01:00
kempersc	170fd73c49	feat(agents): update critical rules section to include entity resolution guidelines	2026-01-11 02:51:18 +01:00
kempersc	556cc6c294	Add workspace configuration for Git and Gitea integration - Set up GitHub integration to be disabled. - Configure Git settings including path and autofetch options. - Add Gitea instance URL and repository details. - Enable YAML support for LinkML schemas with validation. - Define file associations for YAML files. - Recommend essential extensions for development and exclude unwanted ones.	2026-01-11 02:50:39 +01:00
kempersc	b3e57e709c	Refactor code structure for improved readability and maintainability	2026-01-11 02:24:34 +01:00
kempersc	0df26a6e44	data(person): additional person profile enrichments	2026-01-11 00:41:59 +01:00
kempersc	3eb097d92e	data(person): enrich 64 person profiles with comprehensive metadata - Add inferred birth dates using EDTF notation - Add inferred birth/current settlements - Enrich employment history with temporal data - Add heritage sector relevance scores - Improve PPID component tracking - Update .gitignore with large file patterns (warc, nt, trix, geonames.db)	2026-01-11 00:38:09 +01:00
kempersc	ac36b80476	feat(rag): add companion queries for count templates Add companion_query support to fetch full entity records alongside aggregate count queries. Enables displaying results on map/list when asking 'how many museums in Amsterdam?' Backend changes: - Add companion_query, companion_query_region, companion_query_country fields to TemplateDefinition and TemplateMatchResult - Add render_template_string() for raw companion query rendering Template changes: - Add companion queries to count_institutions_by_type_and_location for settlement, region, and country level queries - Returns institution URI, name, coordinates, city for visualization	2026-01-10 18:44:06 +01:00
kempersc	f8b4ecad7d	data(person): enrich 7 person profiles with detailed employment history Update heritage professional profiles with: - Separate role entries for different positions at same institution - Employment date ranges (start_date, end_date) - Updated observed_on timestamps - Direct LinkedIn profile URLs as source Profiles updated: - Antoinet Nijssen (Noord-Hollands Archief) - Anna Lakmaker - Annelies Reus - Marianne Hamersma - Marcel Auwers - Hans Felius - Nico Vriend	2026-01-10 18:43:27 +01:00
kempersc	28c3aaf33f	enrich profiles	2026-01-10 17:31:02 +01:00
kempersc	bd257c52f4	data(person): update 2 additional profiles	2026-01-10 15:39:12 +01:00
kempersc	2f33e6a230	data(person): update DR-STAPEL profile	2026-01-10 15:38:37 +01:00
kempersc	ec18e1810d	data(person): enrich 7 profiles with detailed affiliations and GHCIDs - Add GHCID references to custodian affiliations - Add start dates for employment periods - Expand heritage type classifications (A→[A,F]) - Add detailed rationales based on career history - Add full_initials from archival publications	2026-01-10 15:36:49 +01:00
kempersc	e5a08a353d	enrich person profiles	2026-01-10 14:14:04 +01:00
kempersc	9339de2cfb	data(person): process 44,512 heritage-relevant profiles from entity extractions Processing Summary: - Scanned 94,716 LinkedIn entity files - Identified 44,512 heritage-relevant individuals (47%) - Created 1,430 new PPID-formatted profiles - Updated 43,070 existing profiles with entity data - Final count: 40,731 person profiles Profile updates include: - Merged web_claims with full provenance - Added/updated heritage_relevance scoring - Added affiliation data with custodian references - Added inferred birth decades with provenance chains (Rule 45) All data preserved per Rule 5 (additive only)	2026-01-10 14:01:29 +01:00
kempersc	6f3cf95492	data(person): fix data quality issues and PPID corrections Data Quality Corrections: - TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016, not death). Set is_living=true. Reassess heritage_relevance=false (tourism ministry is not a GLAM institution) - ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam) based on verified birth location. Update birth year to 1980 Profile Enrichments (5 profiles with XX-XX-XXX placeholders): - Add web claims with proper provenance timestamps - Add LinkedIn-verified education and position claims - Document correction rationale in modification_reason Heritage Relevance Reassessments: - Government ministries (Tourism, etc.) marked as non-heritage - Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify	2026-01-10 13:31:39 +01:00
kempersc	49f4054802	data(person/entity): add 83,845 LinkedIn profile extractions from company pages Bulk extraction of heritage professional profiles from LinkedIn company pages using extract_persons_with_provenance.py script. Key characteristics: - Source: LinkedIn company 'People' pages for heritage institutions - File format: {linkedin-slug}_{timestamp}.json - Total size: ~3.6GB - Includes: profile_data, heritage_relevance, affiliations, web_claims - Provenance: Full XPath + archived HTML references (Rule 6 compliant) - Dual timestamps: statement_created_at + source_archived_at (Rule 35) Extraction metadata includes: - extraction_agent: extract_persons_with_provenance.py - source_file: Original archived HTML filename - source_archived_at: When LinkedIn page was captured - schema_version: 1.0.0 Note: URL-encoded filenames preserve international characters (Arabic, Hebrew, Chinese, Turkish, accented Latin, etc.)	2026-01-10 13:27:08 +01:00
kempersc	30cd8842d9	data(person): update profiles with web claims and PPID corrections - Rename SENNAY-GHEBREAB profile: NL-ZH-ROT → ET-XX-ADD (Ethiopian birth) - Enrich profiles with inferred birth decades and settlements - Add web claims provenance for enriched data - Update 16 profiles with improved location resolution Files: +1 new (renamed), 16 modified, 1 deleted	2026-01-10 12:56:28 +01:00
kempersc	5eaab2bd30	data(person): enrich heritage professional profiles with web claims Batch enrichment of 3,728 person profiles with additional data: - Birth decade inference from education/career history - Location resolution for inferred birth settlements - Web claims with full provenance (source_url, retrieved_on) - Organizational subdivision extraction - Heritage relevance scoring Also includes: - 14 profile renames for PPID format corrections - Updated _manifest.json with extraction statistics - New _extraction_log.txt and _extraction_summary.json Enrichment follows AGENTS.md rules: - Rule 44: EDTF unknown date notation (XXXX, 196X, etc.) - Rule 45: Inferred data with explicit provenance - Rule 30: Confidence scoring (0.50-0.95) - Rule 31: Organizational subdivision extraction 35,052 files changed, +4,507,411 insertions, -63,118 deletions	2026-01-10 10:35:20 +01:00
kempersc	519b0b47a8	Add Playwright test results JSON file with initial test suite and failure details	2026-01-09 21:33:31 +01:00
kempersc	004d342935	chore: minor updates and evaluation results - auth.setup.ts: require env vars for test credentials (no hardcoded defaults) - manifest.json: update schema manifest - full_evaluation_results.json: add RAG evaluation results - petra-links.json: update birth date from web claim	2026-01-09 21:10:55 +01:00
kempersc	855fff5962	data(person): resolve PPID locations and enrich profiles - Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations - Update 2,463 profiles with enriched data - Add 512 new person profiles (AU, international heritage professionals) - PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}	2026-01-09 21:09:28 +01:00
kempersc	eb122e2532	data(custodian): remove 380 PENDING files after collision merge PENDING files were merged into existing custodian records in commit `eaf80ec`. These temporary collision placeholder files are no longer needed.	2026-01-09 21:06:22 +01:00
kempersc	9e67d0f967	enrich profiles	2026-01-09 20:35:19 +01:00
kempersc	eaf80ec756	data(custodian): merge PENDING collision files into existing custodians Merge staff data from 7 PENDING files into their matching custodian records: - NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff) - NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO - NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS - NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP - NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM - NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ - NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV Originals archived in data/custodian/archive/pending_collisions_20250109/ Add scripts/merge_collision_files.py for reproducible merging	2026-01-09 18:33:00 +01:00
kempersc	e9c9aefc37	data(person): regenerate PPIDs with unidecode support for non-Latin scripts - Add display_name and name_romanized fields to all 7948 person profiles - Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names now properly romanize) - Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN - Collision count reduced from 82 to 81 groups Regenerated using generate_ppids.py with unidecode support (commit `abe30cb`)	2026-01-09 18:31:53 +01:00
kempersc	c45367c60f	data(custodian): resolve more PENDING files with proper GHCIDs Additional batch of PENDING file resolutions: - DK: Aalborg Teater - FR: Airborne Museum, ALCA Nouvelle-Aquitaine - NL: 12 institutions (CODA Apeldoorn, Airborne Museum Arnhem, etc.) - SA: Saudi Arabia Ministry of Culture Files renamed from NL-XX-XXX-PENDING-* to proper country/region codes.	2026-01-09 18:29:09 +01:00
kempersc	932ec5438c	add person profiles with PPID	2026-01-09 18:26:58 +01:00
kempersc	bd06e4f864	data(custodian): merge 135 PENDING files into existing enriched records Merge data from PENDING files (with XX-XXX placeholders) into their corresponding enriched custodian records with proper GHCIDs. Countries affected: - DE: 4 institutions (Deutsche Stiftung, Jewish Museum Berlin, etc.) - ES: 1 institution (Biblioteca Nacional de España) - FR: 1 institution (NMO) - ID: 18 Indonesian museums and archives - NL: 111 Dutch institutions across all provinces - US: 1 institution (ARCA) The PENDING files are deleted after merge; originals archived in data/custodian/archive/pending_merged_20250109/	2026-01-09 18:25:56 +01:00
kempersc	a51c8c400c	data(pending): add 125 international PENDING custodian files with proper country codes Identified 125 institutions from LinkedIn staff extraction that are NOT Dutch: - FR: 45 (French museums, archives, libraries) - ID: 14 (Indonesian institutions) - GB: 14 (British institutions) - DE: 13 (German museums, foundations) - BE: 11 (Belgian museums) - IT: 6 (Italian institutions) - AU: 6 (Australian archives, museums) - Plus smaller counts from IN, US, ES, CH, DK, AT, SA, NO, IL These files have staff data from LinkedIn company pages but need GHCID resolution (currently XX-XXX placeholders for region/city). Dutch PENDING files remain: 1,283	2026-01-09 15:55:31 +01:00
kempersc	14be18e7c4	feat(data): merge staff data from 30 more PENDING files into enriched custodians Batch 2 of PENDING file resolution: - Merged LinkedIn staff data from 30 PENDING files into matching enriched custodians - Archived processed PENDING files to data/custodian/archive/pending_merged_20250109/ - Notable merges: ASML (994 staff), BBB (117), Apenheul (100), BOEI (93) Files merged include: - Corporate: ASML, BOS Foundation, Constructing the Limes - Museums: Allard Pierson, Apenheul, various regional museums - Research: Catholic Documentation Centre, Creating Cultures of Care - Cultural orgs: Cultuur Ondernemen, CultuurOost, CultuurKwadraat This continues the effort to consolidate PENDING files (1283 remaining).	2026-01-09 15:42:32 +01:00
kempersc	1f723fd5d7	feat(data): merge staff data from 35 PENDING files into enriched custodians Merged LinkedIn-extracted staff sections from PENDING files into their corresponding proper GHCID custodian files. This consolidates data from two extraction sources: - Existing enriched files: Google Maps, Museum Register, YouTube, etc. - PENDING files: LinkedIn staff data extraction Files modified: - 28 custodian files enriched with staff data - 35 PENDING files deleted (merged into proper locations) - Originals archived to archive/pending_duplicates_20250109/ Key institutions enriched: - Rijksmuseum (NL-NH-AMS-M-RM) - Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA) - Amsterdam Museum (NL-NH-AMS-M-AM) - Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA) - Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR) - And 23 more museums/archives across NL New scripts: - scripts/merge_staff_data.py: Automated staff data merger - scripts/categorize_pending_files.py: PENDING file analysis utility	2026-01-09 14:51:17 +01:00
kempersc	2c2a312e0a	feat(rag): add database routing to 8 more factual query templates Add databases: ["oxigraph"] to skip vector search for deterministic queries: - count_institutions_by_type_location (count) - count_institutions_by_type (aggregation) - find_institutions_by_founding_date (temporal) - find_custodians_by_budget_threshold (financial) - compare_locations (comparative) - find_by_founding (temporal) - events_in_period (temporal events) - institutions_by_founding_decade (temporal aggregation) Total templates with oxigraph-only routing: 12	2026-01-09 12:33:41 +01:00
kempersc	b9c30fc970	feat(rag): extend database routing to count, temporal, and financial templates Add databases: ["oxigraph"] to 5 more templates that don't benefit from vector search: - count_institutions_by_type_location - compare_locations - find_by_founding - find_custodians_by_budget_threshold - find_institutions_by_founding_date Total templates with Oxigraph-only routing: 10	2026-01-09 12:32:28 +01:00
kempersc	17a94613f3	data(custodian): resolve 57 PENDING files to proper GHCID locations Resolved NL-XX-XXX-PENDING files to proper regional GHCIDs: - 57 new files with proper location codes (city, region) - Cities include: Amsterdam, Rotterdam, Utrecht, Leiden, Groningen, etc. - 34 original PENDING files archived to archive/pending_duplicates_20250109/ Examples: - NL-XX-XXX-PENDING-AMSTERDAM-MUSEUM → NL-NH-AMS-M-AM (Amsterdam Museum) - NL-XX-XXX-PENDING-GRONINGEN-MUSEUM → NL-GR-GRO-M-GM (Groninger Museum) - NL-XX-XXX-PENDING-KUNSTHAL-ROTTERDAM → NL-ZH-ROT-G-KR (Kunsthal Rotterdam)	2026-01-09 12:19:19 +01:00
kempersc	76644f55f5	feat(rag): add database routing to geographic query templates Add databases: ["oxigraph"] to 4 geographic templates to skip vector search: - list_institutions_by_type_city - list_institutions_by_type_region - list_institutions_by_type_country - list_institutions_in_city Also add documentation explaining database routing configuration in _metadata.	2026-01-09 11:56:18 +01:00
kempersc	5255128159	fix(data): correct GHCID locations for 4 heritage custodians Location corrections based on GeoNames reverse geocoding: - NL-FR-LAN-S-L → NL-FR-DKN-S-L (Historische Werkgroep Kynhout: De Knipe) - NL-LI-HEE-A-CRGR → NL-LI-MAA-A-CRGR (Centrum Regionale Geschiedenis: Maastricht) - NL-NB-MID-S-M → NL-NB-BER-S-M (Heemkundekring De Plaets: Berlicum) - NL-OV-NIJ-A-GH → NL-OV-HEL-A-GH (Gemeente Hellendoorn: Hellendoorn)	2026-01-09 11:55:08 +01:00
kempersc	e128727b13	fix(data): correct GHCID location for Museumreddingboot Terschelling - Rename NL-FR-HOO-M-MT.yaml → NL-FR-TER-M-MT.yaml - HOO (Hooghalen) → TER (Terschelling) - correct island location - Institution is on Terschelling island, not in Drenthe	2026-01-09 11:54:37 +01:00
kempersc	c88fd3af70	Refactor code structure for improved readability and maintainability	2026-01-09 11:05:26 +01:00

197 commits
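
Several of the commits above rename custodian files from NL-XX-XXX-PENDING-* placeholders to resolved GHCIDs such as NL-NH-AMS-M-AM. A minimal sketch of how such identifiers might be told apart is given below. The five-part layout (country, region, city, type letter, abbreviation) is an assumption inferred only from the examples quoted in the commit messages, and the names `parse_ghcid`, `is_pending`, and `Ghcid` are hypothetical; they are not taken from the repository's scripts.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from examples in the commit log (e.g. NL-NH-AMS-M-AM):
# 2-letter country, 2-letter region, 3-letter city, 1-letter type, abbreviation.
GHCID_RE = re.compile(r"^([A-Z]{2})-([A-Z]{2})-([A-Z]{3})-([A-Z])-([A-Z0-9]+)$")

# Placeholder identifiers as they appear in the log: <country>-XX-XXX-PENDING-<name>.
PENDING_RE = re.compile(r"^[A-Z]{2}-XX-XXX-PENDING-.+$")


class Ghcid(NamedTuple):
    country: str
    region: str
    city: str
    type_code: str
    abbreviation: str


def is_pending(identifier: str) -> bool:
    """Return True if the identifier is an unresolved PENDING placeholder."""
    return PENDING_RE.match(identifier) is not None


def parse_ghcid(identifier: str) -> Optional[Ghcid]:
    """Split a resolved identifier into its parts, or return None if it does not fit."""
    match = GHCID_RE.match(identifier)
    return Ghcid(*match.groups()) if match else None


if __name__ == "__main__":
    for ident in ("NL-NH-AMS-M-AM", "NL-XX-XXX-PENDING-AMSTERDAM-MUSEUM"):
        print(ident, is_pending(ident), parse_ghcid(ident))
```

Keeping the placeholder check separate from the parser means a PENDING identifier is never mistaken for a resolved one, since the placeholder name does not fit the single-letter type slot.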