kempersc/glam - Forgejo

Author	SHA1	Message	Date
kempersc	3a6ead8fde	feat: Add legal form filtering rule for CustodianName - Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations. - Documented rationale, examples, and implementation guidelines for the filtering process. docs: Create README for value standardization rules - Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes. - Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution. feat: Implement transliteration standards for non-Latin scripts - Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration. - Included detailed guidelines for various scripts and languages, along with implementation examples. feat: Define XPath provenance rules for web observations - Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources. - Established a workflow for archiving websites and verifying claims against archived HTML. chore: Update records lifecycle diagram - Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians. - Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.	2025-12-09 16:58:41 +01:00
kempersc	7b42d720d5	geocode: add coordinates to CZ, BY, CH, FR, ES custodian files from GeoNames (1145 files)	2025-12-09 16:41:41 +01:00
kempersc	b54904ad0a	fix: normalize YAML null formatting in Eye Filmmuseum file	2025-12-09 16:34:12 +01:00
kempersc	2c25ed6a96	geocode: add coordinates to JP custodian files from GeoNames (batch 2 - remaining 3639 files)	2025-12-09 16:33:29 +01:00
kempersc	9bc454cdbf	geocode: add coordinates to JP custodian files from GeoNames (batch 1 - 3000 files)	2025-12-09 16:32:01 +01:00
kempersc	982620ba0c	normalize: add canonical location blocks (batch 7 - final) - Final 42 files updated - Normalization complete: all 27,511 custodian files have location block - 15,419 files have coordinates with coordinate_provenance - 12,092 files have address-only location blocks	2025-12-09 14:57:33 +01:00
kempersc	e28576ee65	normalize: add canonical location blocks (batch 6) - 2,546 files updated with location blocks - All 27,511 custodian files now have location: block - 15,421 files have coordinates with coordinate_provenance - 12,090 files have address-only location blocks	2025-12-09 14:44:03 +01:00
kempersc	d20978dcbe	normalize: add canonical location blocks (batch 5)	2025-12-09 14:39:02 +01:00
kempersc	3f60aa6238	normalize: add canonical location blocks (batch 4 - final)	2025-12-09 14:18:15 +01:00
kempersc	5b3d4d1ed5	normalize: add canonical location blocks (batch 3)	2025-12-09 14:14:13 +01:00
kempersc	b739ad4e61	normalize: add canonical location blocks (batch 2)	2025-12-09 13:28:59 +01:00
kempersc	bb41287730	normalize: add canonical location blocks (batch 1)	2025-12-09 13:17:11 +01:00
kempersc	a7321b1bb9	reconstruct location blocks	2025-12-09 12:25:16 +01:00
kempersc	85a951bbea	normalize: add canonical location blocks to 586 files - Fixed 469 JP files missing location: blocks (had data in original_entry.locations) - Fixed 117 additional JP files found in second pass - 1 EG file skipped (no location source data available) - Total files with location: blocks now 27,459 out of 27,511 (99.8%) - Also includes YAML formatting standardization (line wrapping) Recovery from data loss in commit `62fdd35321` is now complete.	2025-12-09 12:17:34 +01:00
kempersc	cab712659d	recover location blocks	2025-12-09 11:34:56 +01:00
kempersc	62fdd35321	Refactor code structure for improved readability and maintainability	2025-12-09 11:15:51 +01:00
kempersc	b61271220b	enrich entries	2025-12-09 10:46:43 +01:00
kempersc	bf7c773955	edit Japanese entries	2025-12-09 09:16:19 +01:00
kempersc	c283daa1a2	normalise dutch entries	2025-12-09 08:02:27 +01:00
kempersc	609866886a	enrich entries	2025-12-09 07:58:14 +01:00
kempersc	131e3ca259	normalise custodian entries	2025-12-09 07:56:35 +01:00
kempersc	40bd3cb8f5	data(custodian): add emic_name fields and remove duplicate files with name suffixes - Add emic_name, name_language, and standardized_name to 1,781 custodian files - Remove 2,239 duplicate files that had name suffixes in filename - Consolidate data into base GHCID files per PID stability rules - Part of UNESCO Memory of the World custodian enrichment	2025-12-08 14:57:34 +01:00
kempersc	7e3559f7e5	add new entries	2025-12-07 23:08:02 +01:00
kempersc	0c4c378e06	fix(data): clean up YAML structure in BE/EG custodian files (450 files) Remove redundant ch_annotator metadata and duplicate ghcid_history entries that were causing YAML parsing issues. Files now have cleaner, more consistent structure while preserving all essential data.	2025-12-07 18:46:42 +01:00
kempersc	d9325c0bb5	feat: add web archives integration and improve enrichment scripts Backend: - Attach web_archives.duckdb as read-only database in DuckLake - Create views for web_archives, web_pages, web_claims in heritage schema Scripts: - enrich_cities_google.py: Add batch processing and retry logic - migrate_web_archives.py: Improve schema handling and error recovery Frontend: - DuckLakePanel: Add web archives query support - Database.css: Improve layout for query results display	2025-12-07 17:49:07 +01:00
kempersc	1e01639c56	fix(data): resolve XXX placeholder city codes to actual GeoNames settlements Rename 144 custodian files from XXX placeholders to resolved city codes: - BR (65): ASA, RBR, MAN, FAZ, MAC, VIT, FOR, GUA, BRE, GOI, SLU, BHO, XAN, CGR, CUI, etc. - CH (24): ZUR, BER, GEN, BAL, LUC, etc. - MX (23): MEX, GDL, MTY, PUE, etc. - CL (9): SCL, VAL, etc. - CZ (5): PRG, BRN, MSV, etc. - KR (4): SEL, etc. - GB (4): LON, etc. - FR (3): PAR, etc. - IN (2): DEL, etc. - PH, JP, EE (1 each) City codes derived from GeoNames reverse geocoding using institution coordinates. GHCID format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}	2025-12-07 17:47:09 +01:00
kempersc	9d15cce65c	docs: add enrichment reports and update manifest Add enrichment reports from city resolution: - Austrian, Belgian, Bulgarian, Czech, Swiss ISIL enrichment reports - GeoNames update reports - Custodian creation reports - Entry-to-GHCID mapping file	2025-12-07 14:27:36 +01:00
kempersc	f284e87d13	feat: add 24,963 heritage custodian records from global extraction Major batch addition of heritage institution data: - Japan: 12,077 institutions (libraries, museums, archives) - Czechia: 6,760 institutions - Switzerland: 2,390 institutions - Belgium: 448 institutions - Belarus: 257 institutions - Austria: 249 institutions (with corrected GHCIDs) - Argentina: 235 institutions (bibliotecas populares) - Brazil: 155 institutions - Mexico: 110 institutions - Bulgaria: 98 institutions - Chile: 83 institutions - Egypt: 50 institutions - And additional records from VN, NL, GE, KR, GB, FR, US, IN, etc. All records include: - Standardized GHCID identifiers (alphabetic-only abbreviations) - GeoNames-resolved location data - ISO 3166-2 region codes - Provenance metadata with extraction timestamps	2025-12-07 14:24:48 +01:00
kempersc	63a6bccd9b	fix: remove custodian files with invalid GHCID special characters Remove 229 custodian YAML files containing invalid characters in GHCIDs: - Ampersand (&) in abbreviations (e.g., BM&HS, UNL&AG, DR&IMSM) - Parentheses in abbreviations (e.g., WHO(RA, VK(, SL() - Unicode characters in filenames (Ö, Ä, Å, É, İ, Ż, etc.) These files are replaced with corrected versions using alphabetic-only abbreviations per AGENTS.md Rule 8 (Special Characters MUST Be Excluded). Related scripts updated for location resolution.	2025-12-07 14:23:50 +01:00
kempersc	ee4e57bc75	add new entries	2025-12-07 00:26:01 +01:00
kempersc	1635625032	added web annotations	2025-12-06 19:50:04 +01:00
kempersc	55e2cd2340	feat: implement LLM-based extraction for Archives Lab content - Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0. - Replaced regex-based extraction with generative LLM inference. - Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics. - Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons. - Results and statistics are saved in JSON format for further analysis.	2025-12-05 23:16:21 +01:00
kempersc	4da64eeebf	improve annotator	2025-12-05 16:25:39 +01:00
kempersc	e38fb4613b	improve annotation prompt	2025-12-05 15:51:39 +01:00
kempersc	3a242370fc	annotation standards added	2025-12-05 15:30:23 +01:00
kempersc	d661947830	update enriched entries	2025-12-03 17:38:46 +01:00
kempersc	ef89b1213a	validate enrichments	2025-12-02 14:36:01 +01:00
kempersc	8ebca2f845	add pid	2025-12-02 00:00:45 +01:00
kempersc	4b833d20b2	add pids	2025-12-01 23:55:55 +01:00
kempersc	7dce283c17	Add new enums for PersonalCollectionType, ResearchCenterType, and TasteScentHeritage classifications; implement validation script for custodian names against authoritative sources	2025-12-01 18:39:22 +01:00
kempersc	48a2b26f59	feat: Add script to generate Mermaid ER diagrams with instance data from LinkML schemas - Implemented `generate_mermaid_with_instances.py` to create ER diagrams that include all classes, relationships, enum values, and instance data. - Loaded instance data from YAML files and enriched enum definitions with meaningful annotations. - Configured output paths for generated diagrams in both frontend and schema directories. - Added support for excluding technical classes and limiting the number of displayed enum and instance values for readability.	2025-12-01 16:58:03 +01:00
kempersc	097d116b72	enrich entries	2025-12-01 16:06:34 +01:00
kempersc	2497e5913f	enrich entries	2025-12-01 00:37:24 +01:00
kempersc	f3c149b1bb	update entries	2025-11-30 23:30:29 +01:00
kempersc	d623f0af4a	store archived websites	2025-11-29 20:40:46 +01:00
kempersc	572ccd5daf	archive websites	2025-11-29 18:18:04 +01:00
kempersc	0ab8f24a6b	archive websites	2025-11-29 18:05:16 +01:00
kempersc	da1eae6597	Refactor code structure for improved readability and maintainability	2025-11-29 12:27:39 +01:00
kempersc	30162e6526	Add script to validate KB library entries and generate enrichment report - Implemented a Python script to validate KB library YAML files for required fields and data quality. - Analyzed enrichment coverage from Wikidata and Google Maps, generating statistics. - Created a comprehensive markdown report summarizing validation results and enrichment quality. - Included error handling for file loading and validation processes. - Generated JSON statistics for further analysis.	2025-11-28 14:48:33 +01:00
kempersc	5cdce584b2	Add complete schema for heritage custodian observation reconstruction - Introduced a comprehensive class diagram for the heritage custodian observation reconstruction schema. - Defined multiple classes including AllocationAgency, ArchiveOrganizationType, AuxiliaryDigitalPlatform, and others, with relevant attributes and relationships. - Established inheritance and associations among classes to represent complex relationships within the schema. - Generated on 2025-11-28, version 0.9.0, excluding the Container class.	2025-11-28 13:13:23 +01:00
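
Commits 1e01639c56 and 63a6bccd9b describe the GHCID format `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}` and the rule that abbreviations must be alphabetic-only. The following is a minimal sketch of how such an identifier might be assembled and checked. The function name `build_ghcid`, the uppercase ASCII restriction, and the sample region and type codes are assumptions for illustration, not code from the repository.

```python
import re

# Assumption: "alphabetic-only" is read here as uppercase ASCII letters,
# which excludes '&', parentheses, and characters such as 'Ö' that the
# log reports as invalid.
ABBREV_RE = re.compile(r"[A-Z]+")


def build_ghcid(country: str, region: str, city: str,
                type_code: str, abbrev: str) -> str:
    """Join the five segments in the order quoted in commit 1e01639c56.

    Hypothetical helper: only the abbreviation is validated, because that
    is the only segment whose character rule the commit log states.
    """
    if not ABBREV_RE.fullmatch(abbrev):
        raise ValueError(f"abbreviation is not alphabetic-only: {abbrev!r}")
    return "-".join([country, region, city, type_code, abbrev])


if __name__ == "__main__":
    # "ZH", "M", and "ABC" are made-up sample values.
    print(build_ghcid("CH", "ZH", "ZUR", "M", "ABC"))
    try:
        build_ghcid("CH", "ZH", "ZUR", "M", "BM&HS")
    except ValueError as exc:
        print(exc)
```

An abbreviation such as `BM&HS`, cited in commit 63a6bccd9b as invalid, is rejected by this check instead of being written into a filename.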

66 commits