- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for the API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
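A minimal sketch of the test's shape, assuming a `GLM_API_TOKEN` environment variable and the public GLM chat-completions endpoint (both are assumptions; the actual script may wire this differently):

```python
import os
import sys

import requests

# Endpoint and env var name are assumptions, not taken from the script.
GLM_API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"

def get_api_token() -> str:
    """Read the GLM API token from the environment, failing fast if unset."""
    token = os.environ.get("GLM_API_TOKEN")
    if not token:
        sys.exit("GLM_API_TOKEN is not set; export it before running the test.")
    return token

def extract_person_observations(arabic_text: str, token: str) -> dict:
    """Send a structured extraction prompt and return the parsed JSON reply."""
    prompt = (
        "Extract person observations from the following Arabic waqf text. "
        "Return JSON objects with name, role, and source span.\n\n" + arabic_text
    )
    resp = requests.post(
        GLM_API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"model": "glm-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```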
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle Unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)
This brings overall geocoding coverage from 97.84% to 99.81%.
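A sketch of the normalization and fallback logic; the hyphen-variant set and the 3-digit-prefix heuristic are assumptions about how `geocode_japan_postal.py` works, not its exact behavior:

```python
import re

# Map Unicode hyphen variants (fullwidth, en/em dash, minus sign, ...) to ASCII "-".
HYPHEN_VARIANTS = dict.fromkeys(
    map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212\uff0d"), "-"
)

def normalize_jp_postal(code: str) -> str:
    """Normalize a Japanese postal code to the canonical NNN-NNNN form."""
    code = code.strip().translate(HYPHEN_VARIANTS)
    digits = re.sub(r"\D", "", code)
    return f"{digits[:3]}-{digits[3:7]}" if len(digits) == 7 else code

def lookup(code: str, db: dict[str, tuple[float, float]]) -> tuple[float, float] | None:
    """Exact lookup first; company postal codes then fall back to any entry
    sharing the 3-digit prefix of their delivery area."""
    code = normalize_jp_postal(code)
    if code in db:
        return db[code]
    prefix = code[:3]
    for key, coords in db.items():  # linear scan is acceptable at 142K entries
        if key.startswith(prefix):
            return coords
    return None
```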
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Postal codes with slashes (Ecublens/VD)
  - German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
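One plausible reading of these rules as code (canton list truncated; `geocode_from_city_name.py` likely orders and scopes the rules differently):

```python
import re

CANTON_CODES = {"AG", "SG", "ZH", "BE", "VD", "GE"}  # illustrative subset

def normalize_city_name(raw: str) -> str:
    name = raw.strip()
    # "St. Gallen" / "St.Gallen" -> "Sankt Gallen"
    name = re.sub(r"^St\.\s*", "Sankt ", name)
    # Postal-code slashes: "Ecublens/VD" -> "Ecublens"
    name = re.split(r"\s*/\s*", name)[0]
    # Hyphenated districts: "Bernex - Genève" -> "Bernex"
    name = re.split(r"\s+-\s+", name)[0]
    # German prepositions: "Hausen b. Brugg" -> "Hausen"
    name = re.split(r"\s+b(?:ei)?\.?\s+", name)[0]
    # Canton suffixes: "Buchs SG" -> "Buchs"
    head, _, tail = name.rpartition(" ")
    return head if head and tail in CANTON_CODES else name
```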
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.
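A hedged sketch of what LEGAL-FORM-FILTER might do; the designation list is illustrative, not the rule's actual inventory:

```python
import re

# Common legal form designations (illustrative, not exhaustive).
LEGAL_FORMS = [
    r"e\.?\s?V\.?", r"gGmbH", r"GmbH", r"AG", r"Ltd\.?",
    r"Inc\.?", r"LLC", r"S\.?A\.?", r"a\.?s\.?b\.?l\.?",
]
LEGAL_FORM_RE = re.compile(
    r"[\s,]*\b(?:" + "|".join(LEGAL_FORMS) + r")\s*$", re.IGNORECASE
)

def filter_legal_form(custodian_name: str) -> str:
    """Strip a trailing legal form designation from a CustodianName value."""
    return LEGAL_FORM_RE.sub("", custodian_name).strip(" ,")

# e.g. filter_legal_form("Heimatverein Musterstadt e.V.")
#      -> "Heimatverein Musterstadt"
```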
docs: Create README for value standardization rules
- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.
feat: Implement transliteration standards for non-Latin scripts
- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.
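An illustrative fragment, assuming an ISO 9-style table for Cyrillic; the rule itself covers more scripts and complete character inventories:

```python
# Small ISO 9-style subset for Cyrillic (assumption: not the rule's full table).
ISO9_CYRILLIC = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "ч": "č", "ш": "š",
}

def transliterate(emic_name: str, table: dict[str, str]) -> str:
    """Transliterate character by character; unmapped characters pass through."""
    out = []
    for ch in emic_name:
        mapped = table.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() and mapped != ch else mapped)
    return "".join(out)

# transliterate("Москва", ISO9_CYRILLIC) -> "Moskva"; a GHCID abbreviation
# would then be derived from the transliterated form, not the original script.
```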
feat: Define XPath provenance rules for web observations
- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.
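A minimal verification step as the workflow might implement it (the claim record's field names are assumed):

```python
from lxml import html

def verify_claim(archived_html: str, xpath: str, expected_value: str) -> bool:
    """Check that a claim's XPath pointer still resolves to the expected
    text inside the archived HTML snapshot."""
    nodes = html.fromstring(archived_html).xpath(xpath)
    if not nodes:
        return False
    text = nodes[0] if isinstance(nodes[0], str) else nodes[0].text_content()
    return expected_value.strip() in text.strip()

snapshot = "<html><div id='history'><p>Opened 1897</p></div></html>"
assert verify_claim(snapshot, "//div[@id='history']/p[1]", "Opened 1897")
```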
chore: Update records lifecycle diagram
- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
Key changes:
- Created scripts/lib/safe_yaml_update.py with PROTECTED_KEYS constant
- Fixed enrich_custodians_wikidata_full.py to re-read files before writing (prevents race conditions where another script modified the file)
- Added safety check to abort if protected keys would be lost
- Protected keys include: location, original_entry, ghcid, provenance, google_maps_enrichment, osm_enrichment, etc.
Root cause of data loss in 62fdd35321:
- Script loaded files into list, then processed them later
- If another script modified files between load and write, changes were lost
- Now files are re-read immediately before modification
Per AGENTS.md Rule 5: NEVER Delete Enriched Data - Additive Only
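A condensed sketch of the safeguard (merge semantics are an assumption; see scripts/lib/safe_yaml_update.py for the real implementation):

```python
from pathlib import Path

import yaml

# Subset of the protected keys listed above.
PROTECTED_KEYS = {
    "location", "original_entry", "ghcid", "provenance",
    "google_maps_enrichment", "osm_enrichment",
}

def safe_write(path: Path, new_data: dict) -> None:
    """Re-read the file immediately before writing (so concurrent edits are
    not clobbered) and abort if a protected key would be lost."""
    on_disk = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
    lost = (PROTECTED_KEYS & set(on_disk)) - set(new_data)
    if lost:
        raise RuntimeError(f"refusing to write {path}: would drop {sorted(lost)}")
    merged = {**on_disk, **new_data}  # additive only, per Rule 5
    path.write_text(
        yaml.safe_dump(merged, allow_unicode=True, sort_keys=False),
        encoding="utf-8",
    )
```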
- Add emic_name, name_language, standardized_name to CustodianName
- Add scripts for enriching custodian emic names from Wikidata
- Add YouTube and Google Maps enrichment scripts
- Update DuckLake loader for new schema fields
- Handle single-letter GLAM type codes (G, L, A, M, O, R, C, etc.)
- Handle legacy GRP.HER.* format
- Support compound types like 'M,F' -> 'MUSEUM,FEATURES' (see the sketch below)
- Fix type hint syntax for Python 3.10+
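A sketch of the code expansion; only M and F are glossed above, so the other letter meanings are guesses:

```python
# Letter meanings other than M (MUSEUM) and F (FEATURES) are assumptions.
GLAM_CODES = {
    "G": "GALLERY", "L": "LIBRARY", "A": "ARCHIVE", "M": "MUSEUM",
    "O": "OTHER", "R": "RESEARCH", "C": "COLLECTION", "F": "FEATURES",
}

def expand_glam_type(raw: str) -> str:
    """Expand legacy type codes: 'M' -> 'MUSEUM', 'M,F' -> 'MUSEUM,FEATURES',
    stripping the legacy 'GRP.HER.' prefix if present."""
    raw = raw.removeprefix("GRP.HER.")
    parts = [p.strip().upper() for p in raw.split(",") if p.strip()]
    return ",".join(GLAM_CODES.get(p, p) for p in parts)
```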
- Add fix_yaml_history.py and fix_yaml_history_v2.py for cleaning up malformed ghcid_history entries with duplicate/redundant data
- Update load_custodians_to_ducklake.py for DuckDB lakehouse loading
- Update migrate_web_archives.py for web archive management
- Update deploy.sh with improvements
- Ignore entire data/ducklake/ directory (generated databases)
Enrichment scripts for country-specific city data:
- enrich_austrian_cities.py, enrich_belgian_cities.py, enrich_belgian_v2.py
- enrich_bulgarian_cities.py, enrich_czech_cities.py, enrich_czech_cities_fast.py
- enrich_japanese_cities.py, enrich_swiss_isil_cities.py, enrich_cities_google.py
Location resolution utilities:
- resolve_cities_from_file_coords.py - Resolve cities using coordinates in filenames
- resolve_cities_wikidata.py - Use Wikidata P131 for city resolution (see the sketch after this list)
- resolve_country_codes.py - Standardize country codes
- resolve_cz_xx_regions.py - Fix Czech XX region codes
- resolve_locations_by_name.py - Name-based location lookup
- resolve_regions_from_city.py - Derive regions from city data
- update_ghcid_with_geonames.py - Update GHCIDs with GeoNames data
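For resolve_cities_wikidata.py, the P131 lookup might reduce to a query like this (one hop only; the real script presumably walks the administrative hierarchy and handles more languages):

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def resolve_city_via_p131(qid: str) -> str | None:
    """Return the label of the direct P131 (located in the administrative
    territorial entity) value for a Wikidata item."""
    query = f"""
    SELECT ?cityLabel WHERE {{
      wd:{qid} wdt:P131 ?city .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT 1
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "heritage-custodians-geocoder/0.1"},  # placeholder UA
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["cityLabel"]["value"] if bindings else None
```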
CH-Annotator integration:
- create_custodian_from_ch_annotator.py - Create custodians from annotations
- add_ch_annotator_location_claims.py - Add location claims
- extract_locations_ch_annotator.py - Extract locations from annotations
Migration and fixes:
- migrate_egyptian_from_ch.py - Migrate Egyptian data
- migrate_web_archives.py - Migrate web archive data
- fix_belgian_cities.py - Fix Belgian city data
Remove 229 custodian YAML files containing invalid characters in GHCIDs:
- Ampersand (&) in abbreviations (e.g., BM&HS, UNL&AG, DR&IMSM)
- Parentheses in abbreviations (e.g., WHO(RA, VK(, SL()
- Unicode characters in filenames (Ö, Ä, Å, É, İ, Ż, etc.)
These files are replaced with corrected versions using alphabetic-only abbreviations per AGENTS.md Rule 8 (Special Characters MUST Be Excluded).
Related scripts updated for location resolution.
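A sketch of the Rule 8 sanitization, assuming diacritics are folded to ASCII rather than dropped outright:

```python
import re
import unicodedata

def sanitize_abbreviation(abbr: str) -> str:
    """Make an abbreviation alphabetic-only per AGENTS.md Rule 8: fold
    diacritics to ASCII, then strip &, parentheses, and anything else
    outside A-Z."""
    folded = unicodedata.normalize("NFKD", abbr).encode("ascii", "ignore").decode()
    return re.sub(r"[^A-Za-z]", "", folded).upper()

assert sanitize_abbreviation("BM&HS") == "BMHS"
assert sanitize_abbreviation("WHO(RA") == "WHORA"
assert sanitize_abbreviation("ÖNB") == "ONB"
```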
- Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0.
- Replaced regex-based extraction with generative LLM inference.
- Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics.
- Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons.
- Results and statistics are saved in JSON format for further analysis.
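The statistics step might look roughly like this (the session-dict shape is assumed):

```python
from collections import Counter

def extraction_stats(session: dict) -> dict:
    """Tally entities and relationships from an annotation-session dict
    (shape assumed: 'entities'/'relationships' lists with a 'type' field)."""
    entity_types = Counter(e["type"] for e in session.get("entities", []))
    relation_types = Counter(r["type"] for r in session.get("relationships", []))
    return {
        "n_entities": sum(entity_types.values()),
        "n_relationships": sum(relation_types.values()),
        "entity_types": dict(entity_types),
        "relationship_types": dict(relation_types),
        "heritage_institutions": entity_types.get("HeritageInstitution", 0),
        "persons": entity_types.get("Person", 0),
    }
```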
- Implemented `generate_mermaid_with_instances.py` to create ER diagrams that include all classes, relationships, enum values, and instance data.
- Loaded instance data from YAML files and enriched enum definitions with meaningful annotations.
- Configured output paths for generated diagrams in both frontend and schema directories.
- Added support for excluding technical classes and limiting the number of displayed enum and instance values for readability.
- Implemented a Python script to validate KB library YAML files for required fields and data quality.
- Analyzed enrichment coverage from Wikidata and Google Maps, generating statistics.
- Created a comprehensive markdown report summarizing validation results and enrichment quality.
- Included error handling for file loading and validation processes.
- Generated JSON statistics for further analysis.
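A compressed sketch of the validation loop; the required-field set and the `wikidata_enrichment` key name are assumptions (google_maps_enrichment appears in the protected-key list above):

```python
import json
from pathlib import Path

import yaml

REQUIRED_FIELDS = ("ghcid", "location", "provenance")  # assumed required set

def validate_library(root: Path) -> dict:
    """Validate each YAML file for required fields and tally enrichment coverage."""
    stats = {"files": 0, "errors": [], "wikidata": 0, "google_maps": 0}
    for path in sorted(root.glob("*.yaml")):
        stats["files"] += 1
        try:
            doc = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
        except yaml.YAMLError as exc:
            stats["errors"].append(f"{path.name}: unparseable ({exc})")
            continue
        for field in REQUIRED_FIELDS:
            if field not in doc:
                stats["errors"].append(f"{path.name}: missing {field}")
        stats["wikidata"] += "wikidata_enrichment" in doc
        stats["google_maps"] += "google_maps_enrichment" in doc
    return stats

print(json.dumps(validate_library(Path("data/custodians")), indent=2))  # path assumed
```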
- Introduced a comprehensive class diagram for the heritage custodian observation reconstruction schema.
- Defined multiple classes including AllocationAgency, ArchiveOrganizationType, AuxiliaryDigitalPlatform, and others, with relevant attributes and relationships.
- Established inheritance and associations among classes to represent complex relationships within the schema.
- Generated on 2025-11-28, version 0.9.0, excluding the Container class.
- Implemented a Python script that fetches and enriches entries from the NDE Register using data from Wikidata.
- Utilized the Wikibase REST API and SPARQL endpoints for data retrieval.
- Added logging for tracking progress and errors during the enrichment process.
- Configured rate limiting based on authentication status for API requests.
- Created a structured output in YAML format, including detailed enrichment data.
- Generated a log file summarizing the enrichment process and results.
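The auth-dependent throttle might reduce to something like this (interval values and the env var name are assumptions):

```python
import os
import time

import requests

TOKEN = os.environ.get("WIKIDATA_TOKEN")  # hypothetical env var name
MIN_INTERVAL = 0.5 if TOKEN else 2.0      # assumed limits: stricter when anonymous
_last_request = 0.0

def throttled_get(url: str, **kwargs) -> requests.Response:
    """GET with a minimum interval between requests, chosen by auth status."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    headers = kwargs.pop("headers", {})
    if TOKEN:
        headers["Authorization"] = f"Bearer {TOKEN}"
    _last_request = time.monotonic()
    return requests.get(url, headers=headers, timeout=30, **kwargs)
```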
- Created PlantUML diagrams for custodian types, full schema, legal status, and organizational structure.
- Implemented a script to generate GraphViz DOT diagrams from OWL/RDF ontology files.
- Developed a script to generate UML diagrams from modular LinkML schema, supporting both Mermaid and PlantUML formats.
- Enhanced class definitions and relationships in UML diagrams to reflect the latest schema updates.
- Created the Country class with ISO 3166-1 alpha-2 and alpha-3 codes, ensuring minimal design without additional metadata.
- Integrated the Country class into CustodianPlace and LegalForm schemas to support country-specific feature types and legal forms.
- Removed duplicate keys in FeatureTypeEnum.yaml, resulting in 294 unique feature types.
- Eliminated "Hypernyms:" text from FeatureTypeEnum descriptions after verifying that semantic relationships are now conveyed through ontology mappings.
- Created example instance file demonstrating integration of Country with CustodianPlace and LegalForm.
- Updated documentation to reflect the completion of the Country class implementation and hypernyms removal.
- Created SHACL shapes for validating temporal consistency and bidirectional relationships in custodial collections and staff observations.
- Implemented a Python script to validate RDF data against the defined SHACL shapes using the pyshacl library.
- Added command-line interface for validation with options for specifying data formats and output reports.
- Included detailed error handling and reporting for validation results.
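The pyshacl call at the core of such a validator, with hypothetical file paths:

```python
from pyshacl import validate
from rdflib import Graph

data = Graph().parse("custodians.ttl", format="turtle")                 # path assumed
shapes = Graph().parse("shapes/custodian_shapes.ttl", format="turtle")  # path assumed

conforms, report_graph, report_text = validate(
    data,
    shacl_graph=shapes,
    inference="rdfs",  # expand subclass/subproperty links before checking
)
print(report_text)
if not conforms:
    raise SystemExit(1)  # non-zero exit so CI can fail on violations
```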
- Implemented `owl_to_mermaid.py` to convert OWL/Turtle files into Mermaid class diagrams.
- Implemented `owl_to_plantuml.py` to convert OWL/Turtle files into PlantUML class diagrams.
- Added two new PlantUML files for custodian multi-aspect diagrams.
- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.
- Implemented a new script to extract full metadata from 149 archive detail pages on archive-in-thueringen.de.
- Extracted data includes addresses, email addresses, phone numbers, directors, collection sizes, opening hours, institutional histories, and more.
- Introduced structured data parsing and error handling for robust data extraction.
- Added rate limiting to limit server load and improve scraping reliability.
- Results are saved in a JSON format with detailed metadata about the extraction process.
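A sketch of the per-page fetch with rate limiting; the CSS selectors are placeholders, not the site's actual markup:

```python
import time

import requests
from bs4 import BeautifulSoup

DELAY_SECONDS = 1.5  # assumed polite delay between detail-page requests

def scrape_detail_page(url: str) -> dict:
    """Fetch one archive detail page and pull a few structured fields."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    heading = soup.select_one("h1")  # placeholder selector
    record = {
        "url": url,
        "name": heading.get_text(strip=True) if heading else None,
        "email": next(
            (a["href"][len("mailto:"):] for a in soup.select("a[href^='mailto:']")),
            None,
        ),
    }
    time.sleep(DELAY_SECONDS)  # rate limit before the next request
    return record
```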
- Introduced `test_nlp_extractor.py` with unit tests for the InstitutionExtractor, covering various extraction patterns (ISIL, Wikidata, VIAF, city names) and ensuring proper classification of institutions (museum, library, archive).
- Added tests for extracted entities and result handling to validate the extraction process.
- Created `test_partnership_rdf_integration.py` to validate the end-to-end process of extracting partnerships from a conversation and exporting them to RDF format.
- Implemented tests for temporal properties in partnerships and ensured compliance with W3C Organization Ontology patterns.
- Verified that extracted partnerships are correctly linked with PROV-O provenance metadata.
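One of the provenance checks might reduce to an rdflib traversal like this (modelling partnerships as org:Membership is an assumption about the fixtures):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ORG = Namespace("http://www.w3.org/ns/org#")
PROV = Namespace("http://www.w3.org/ns/prov#")

def partnerships_missing_provenance(g: Graph) -> list:
    """Return partnership nodes lacking a prov:wasDerivedFrom link back to
    the source conversation."""
    return [
        p for p in g.subjects(RDF.type, ORG.Membership)
        if (p, PROV.wasDerivedFrom, None) not in g
    ]
```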