Commit graph

59 commits

Author SHA1 Message Date
kempersc
41959f0766 correct HCID! 2025-12-10 13:01:13 +01:00
kempersc
c4b0f17a43 geocode: complete 100% coverage - add coordinates to final 26 files (CZ, BE, AR, LB, ML) 2025-12-10 01:07:34 +01:00
kempersc
6e2c36413e geocode: add coordinates to 540 Japanese custodian files using postal codes
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)

This brings overall geocoding coverage from 97.84% to 99.81%
2025-12-10 00:27:33 +01:00
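Note: the hyphen handling and prefix fallback described in this commit could look roughly like the sketch below. It is not the code in geocode_japan_postal.py; the function names, the list of hyphen variants, the minimum prefix length and the in-memory table are all assumptions made for illustration.

```python
import re

# Hyphen-like characters seen in Japanese postal codes (assumed list, not from the repo).
HYPHEN_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\u30fc\uff0d\uff70"

# Map every hyphen variant to ASCII '-' and full-width digits to ASCII digits.
_TABLE = str.maketrans({
    **{ch: "-" for ch in HYPHEN_VARIANTS},
    **{chr(0xFF10 + i): str(i) for i in range(10)},
})


def normalize_postal_code(raw):
    """Return the 7-digit code without a hyphen, or None if the input is not one."""
    text = raw.strip().lstrip("\u3012").strip().translate(_TABLE)  # U+3012 is the postal mark
    match = re.fullmatch(r"([0-9]{3})-?([0-9]{4})", text)
    return match.group(1) + match.group(2) if match else None


def lookup_with_prefix_fallback(code, table, min_prefix=3):
    """Exact lookup first, then progressively shorter prefixes (hypothetical strategy)."""
    if code in table:
        return table[code]
    for length in range(len(code) - 1, min_prefix - 1, -1):
        prefix = code[:length]
        for key in sorted(table):  # sorted for a deterministic choice
            if key.startswith(prefix):
                return table[key]
    return None
```

Company-specific postal codes are often missing from public tables, which is presumably why a fallback to a neighbouring code with the same prefix is useful; the coordinates returned that way are approximate.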
kempersc
dee7a4c7d9 geocode: add coordinates to 147 Swiss custodian files
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Canton codes after slashes (Ecublens/VD)
  - German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
2025-12-09 22:38:33 +01:00
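Note: a minimal sketch of the city name normalization cases listed in this commit. It is not taken from scripts/geocode_from_city_name.py; the function name and the choice to keep only the first part of a compound name are assumptions.

```python
import re

# The 26 Swiss canton abbreviations.
CANTON_CODES = {
    "AG", "AI", "AR", "BE", "BL", "BS", "FR", "GE", "GL", "GR", "JU", "LU", "NE",
    "NW", "OW", "SG", "SH", "SO", "SZ", "TG", "TI", "UR", "VD", "VS", "ZG", "ZH",
}


def normalize_city(name):
    """Reduce a Swiss place string to a bare city name (hypothetical helper)."""
    city = name.strip()
    # "St. Gallen" and "St.Gallen" -> "Sankt Gallen"
    city = re.sub(r"^St\.\s*", "Sankt ", city)
    # "Bernex - Genève" -> "Bernex" (assumes the first part is the locality)
    city = re.split(r"\s+-\s+", city)[0]
    # "Hausen b. Brugg" -> "Hausen" ("b." / "bei" means "near")
    city = re.split(r"\s+(?:b\.|bei)\s+", city)[0]
    # "Buchs SG" and "Ecublens/VD" -> drop a trailing canton code
    match = re.fullmatch(r"(.+?)(?:\s+|/)([A-Z]{2})", city)
    if match and match.group(2) in CANTON_CODES:
        city = match.group(1)
    return city.strip()
```

Expanding every leading "St." to "Sankt" only suits German-language names; French- or Italian-language places would need their own rule.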
kempersc
cc61d99acf geocode: add coordinates to BG and EG custodian files
- BG: Add lat/lon from existing GeoNames IDs (28 files)
- EG: Map city codes to GeoNames (CAI→Cairo, ALX→Alexandria, etc.) (28 files)
- Fix malformed EG-IS-`A`-O-SCA.yaml → EG-IS-ISM-O-SCA.yaml
- Overall coverage: 96.4% → 96.6%
2025-12-09 21:59:58 +01:00
kempersc
2137c522db geocode: add coordinates to JP compound cities and CZ files from GeoNames
- JP: Handle Gun/Cho/Machi/Mura compound city names (2615 files)
- CZ: Map city codes to GeoNames entries (667 files)
- Overall coverage: 84.5% → 96.4%
2025-12-09 21:49:40 +01:00
kempersc
3a6ead8fde feat: Add legal form filtering rule for CustodianName
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.

docs: Create README for value standardization rules

- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.

feat: Implement transliteration standards for non-Latin scripts

- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.

feat: Define XPath provenance rules for web observations

- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.

chore: Update records lifecycle diagram

- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
2025-12-09 16:58:41 +01:00
kempersc
fade1ed5b3 fix: add safety measures to prevent data loss during enrichment
Key changes:
- Created scripts/lib/safe_yaml_update.py with PROTECTED_KEYS constant
- Fixed enrich_custodians_wikidata_full.py to re-read files before writing
  (prevents race conditions where another script modified the file)
- Added safety check to abort if protected keys would be lost
- Protected keys include: location, original_entry, ghcid, provenance,
  google_maps_enrichment, osm_enrichment, etc.

Root cause of data loss in 62fdd35321:
- Script loaded files into list, then processed them later
- If another script modified files between load and write, changes were lost
- Now files are re-read immediately before modification

Per AGENTS.md Rule 5: NEVER Delete Enriched Data - Additive Only
2025-12-09 12:27:09 +01:00
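Note: the two safety measures described here (re-read right before writing, abort if a protected key would disappear) can be sketched as follows. This is not the contents of scripts/lib/safe_yaml_update.py. JSON stands in for YAML so the sketch runs with the standard library only, the key list contains just the keys named in the message, and safe_update, lost_protected_keys and ProtectedKeyLoss are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable

# Keys named in the commit message; the real PROTECTED_KEYS constant lists more ("etc.").
PROTECTED_KEYS = (
    "location", "original_entry", "ghcid", "provenance",
    "google_maps_enrichment", "osm_enrichment",
)


class ProtectedKeyLoss(RuntimeError):
    """Raised when an update would drop a protected key."""


def lost_protected_keys(before: dict, after: dict) -> list[str]:
    """Protected keys present before the update but missing afterwards."""
    return [key for key in PROTECTED_KEYS if key in before and key not in after]


def safe_update(path: Path, enrich: Callable[[dict], dict]) -> dict:
    # Re-read immediately before modifying, so edits made by other scripts
    # since any earlier bulk load are not overwritten.
    current = json.loads(path.read_text(encoding="utf-8"))
    updated = enrich(dict(current))
    lost = lost_protected_keys(current, updated)
    if lost:
        # Abort without touching the file.
        raise ProtectedKeyLoss(f"{path.name}: would lose {', '.join(lost)}")
    path.write_text(json.dumps(updated, ensure_ascii=False, indent=2), encoding="utf-8")
    return updated
```

Re-reading narrows the window between load and write but does not close it; without file locking, two scripts writing at the same moment can still conflict.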
kempersc
a7321b1bb9 reconstruct location blocks 2025-12-09 12:25:16 +01:00
kempersc
cab712659d recover location blocks 2025-12-09 11:34:56 +01:00
kempersc
62fdd35321 Refactor code structure for improved readability and maintainability 2025-12-09 11:15:51 +01:00
kempersc
b61271220b enrich entries 2025-12-09 10:46:43 +01:00
kempersc
bf7c773955 edit Japanese entries 2025-12-09 09:16:19 +01:00
kempersc
c283daa1a2 normalise Dutch entries 2025-12-09 08:02:27 +01:00
kempersc
131e3ca259 normalise custodian entries 2025-12-09 07:56:35 +01:00
kempersc
0938cce6cf feat(loaders): update DuckLake and TypeDB loaders with relation support 2025-12-08 15:00:14 +01:00
kempersc
486bbee813 feat(wikidata): add re-enrichment and duplicate removal scripts
- Add reenrich_wikidata_with_verification.py for re-running enrichment
- Add remove_wikidata_duplicates.py for deduplication
2025-12-08 14:59:38 +01:00
kempersc
891692a4d6 feat(ghcid): add diacritics normalization and transliteration scripts
- Add fix_ghcid_diacritics.py for normalizing non-ASCII in GHCIDs
- Add resolve_diacritics_collisions.py for collision handling
- Add transliterate_emic_names.py for non-Latin script handling
- Add transliteration tests
2025-12-08 14:59:28 +01:00
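Note: one common way to normalize diacritics and detect the collisions that normalization can cause is sketched below. It is an assumption about the approach, not the code of fix_ghcid_diacritics.py or resolve_diacritics_collisions.py, and the identifiers in the usage line are made up.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(identifiers):
    """Group identifiers that become identical once diacritics are removed."""
    groups = defaultdict(list)
    for identifier in identifiers:
        groups[strip_diacritics(identifier)].append(identifier)
    return {key: values for key, values in groups.items() if len(values) > 1}


# Made-up identifiers: the first two collide after normalization.
collisions = find_collisions(["XX-MUS", "XX-M\u00dcS", "XX-LIB"])
```

Letters without a Unicode decomposition, such as Ł or ø, pass through unchanged and would need an explicit mapping table.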
kempersc
6a6557bbe8 feat(enrichment): add emic name enrichment and update CustodianName schema
- Add emic_name, name_language, standardized_name to CustodianName
- Add scripts for enriching custodian emic names from Wikidata
- Add YouTube and Google Maps enrichment scripts
- Update DuckLake loader for new schema fields
2025-12-08 14:58:50 +01:00
kempersc
7e3559f7e5 add new entries 2025-12-07 23:08:02 +01:00
kempersc
18874e6070 fix(scripts): normalize org_type codes in DuckLake loader
- Handle single-letter GLAM type codes (G, L, A, M, O, R, C, etc.)
- Handle legacy GRP.HER.* format
- Support compound types like 'M,F' -> 'MUSEUM,FEATURES'
- Fix type hint syntax for Python 3.10+
2025-12-07 19:21:14 +01:00
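Note: a rough sketch of the single-letter and compound code handling. Only the 'M,F' -> 'MUSEUM,FEATURES' mapping comes from the message; the other enum names are assumed from the GLAM acronym, codes such as O, R and C are left out because their meanings are not stated here, and the legacy GRP.HER.* format is not covered.

```python
# M and F are taken from the commit message; G, L and A are assumed names.
TYPE_CODES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "F": "FEATURES",
}


def normalize_org_type(raw):
    """Expand single-letter codes, including comma-separated compounds."""
    parts = [part.strip().upper() for part in raw.split(",") if part.strip()]
    # Unknown values (including already expanded names) pass through unchanged.
    return ",".join(TYPE_CODES.get(part, part) for part in parts)
```

Passing unknown values through unchanged keeps the loader from silently discarding codes the table does not yet know.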
kempersc
400b1c04c1 fix(scripts): force table recreation in web archives migration
Drop existing tables before creating to ensure schema updates are applied
properly instead of using IF NOT EXISTS which would skip schema changes.
2025-12-07 18:47:46 +01:00
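Note: the problem this commit fixes can be reproduced in a few lines. The sketch uses the standard library's sqlite3 instead of DuckDB so that it runs without extra dependencies; the column names are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    """Column names of a table, read via SQLite's PRAGMA table_info."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")

# A table left over from an earlier schema version.
conn.execute("CREATE TABLE web_pages (url TEXT)")

# IF NOT EXISTS sees the existing table and skips creation, so the new column is ignored.
conn.execute("CREATE TABLE IF NOT EXISTS web_pages (url TEXT, archived_at TEXT)")
stale = column_names(conn, "web_pages")

# Dropping first guarantees that the new schema is applied.
conn.execute("DROP TABLE IF EXISTS web_pages")
conn.execute("CREATE TABLE web_pages (url TEXT, archived_at TEXT)")
fresh = column_names(conn, "web_pages")
```

Dropping the table discards its rows, so this only suits tables that the migration repopulates from source data.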
kempersc
90a1f20271 chore: add YAML history fix scripts and update ducklake/deploy tooling
- Add fix_yaml_history.py and fix_yaml_history_v2.py for cleaning up
  malformed ghcid_history entries with duplicate/redundant data
- Update load_custodians_to_ducklake.py for DuckDB lakehouse loading
- Update migrate_web_archives.py for web archive management
- Update deploy.sh with improvements
- Ignore entire data/ducklake/ directory (generated databases)
2025-12-07 18:45:52 +01:00
kempersc
7f85238f67 fix(scripts): update CBS GeoJSON field names for municipality loading
Support additional field name patterns:
- 'code'/'naam' (current CBS format)
- 'provincieCode'/'provincieNaam' for province data
2025-12-07 18:40:13 +01:00
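Note: supporting several field name patterns usually comes down to trying candidate keys in order. The sketch below uses only the four field names mentioned in this commit; the helper names and the sample values are hypothetical.

```python
# Candidate property names, in order of preference (names from the commit message).
CODE_FIELDS = ("code", "provincieCode")
NAME_FIELDS = ("naam", "provincieNaam")


def first_present(properties, candidates):
    """Return the first non-empty value among the candidate keys, else None."""
    for field in candidates:
        value = properties.get(field)
        if value not in (None, ""):
            return value
    return None


def extract_code_and_name(feature):
    """Read (code, name) from a GeoJSON feature, whichever pattern it uses."""
    properties = feature.get("properties") or {}
    return first_present(properties, CODE_FIELDS), first_present(properties, NAME_FIELDS)


# Placeholder features, one per field name pattern.
municipality = {"type": "Feature", "properties": {"code": "0000", "naam": "Example"}}
province = {"type": "Feature", "properties": {"provincieCode": "00", "provincieNaam": "Example"}}
```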
kempersc
d9325c0bb5 feat: add web archives integration and improve enrichment scripts
Backend:
- Attach web_archives.duckdb as read-only database in DuckLake
- Create views for web_archives, web_pages, web_claims in heritage schema

Scripts:
- enrich_cities_google.py: Add batch processing and retry logic
- migrate_web_archives.py: Improve schema handling and error recovery

Frontend:
- DuckLakePanel: Add web archives query support
- Database.css: Improve layout for query results display
2025-12-07 17:49:07 +01:00
kempersc
83ab098cf7 feat: add PostGIS international boundary architecture
Add schema and tooling for storing administrative boundaries in PostGIS:
- 002_postgis_boundaries.sql: Complete PostGIS schema with:
  - boundary_countries (ISO 3166-1)
  - boundary_admin1 (states/provinces/regions)
  - boundary_admin2 (municipalities/districts)
  - boundary_historical (HALC pre-modern territories)
  - custodian_service_areas (computed werkgebied geometries)
  - geonames_settlements (reverse geocoding)
  - Spatial functions: find_admin_for_point, find_nearest_settlement
  - Views for API access

- load_boundaries_postgis.py: Python loader supporting:
  - GADM (Global Administrative Areas) - primary global source
  - CBS (Dutch municipality boundaries)
  - GeoNames settlements for reverse geocoding
  - Cached downloads and upsert logic

- POSTGIS_BOUNDARY_ARCHITECTURE.md: Design documentation

This replaces the static GeoJSON approach for international coverage.
2025-12-07 14:34:39 +01:00
kempersc
e45c1a3c85 feat(scripts): add city enrichment and location resolution utilities
Enrichment scripts for country-specific city data:
- enrich_austrian_cities.py, enrich_belgian_cities.py, enrich_belgian_v2.py
- enrich_bulgarian_cities.py, enrich_czech_cities.py, enrich_czech_cities_fast.py
- enrich_japanese_cities.py, enrich_swiss_isil_cities.py, enrich_cities_google.py

Location resolution utilities:
- resolve_cities_from_file_coords.py - Resolve cities using coordinates in filenames
- resolve_cities_wikidata.py - Use Wikidata P131 for city resolution
- resolve_country_codes.py - Standardize country codes
- resolve_cz_xx_regions.py - Fix Czech XX region codes
- resolve_locations_by_name.py - Name-based location lookup
- resolve_regions_from_city.py - Derive regions from city data
- update_ghcid_with_geonames.py - Update GHCIDs with GeoNames data

CH-Annotator integration:
- create_custodian_from_ch_annotator.py - Create custodians from annotations
- add_ch_annotator_location_claims.py - Add location claims
- extract_locations_ch_annotator.py - Extract locations from annotations

Migration and fixes:
- migrate_egyptian_from_ch.py - Migrate Egyptian data
- migrate_web_archives.py - Migrate web archive data
- fix_belgian_cities.py - Fix Belgian city data
2025-12-07 14:26:59 +01:00
kempersc
63a6bccd9b fix: remove custodian files with invalid GHCID special characters
Remove 229 custodian YAML files containing invalid characters in GHCIDs:
- Ampersand (&) in abbreviations (e.g., BM&HS, UNL&AG, DR&IMSM)
- Parentheses in abbreviations (e.g., WHO(RA, VK(, SL()
- Unicode characters in filenames (Ö, Ä, Å, É, İ, Ż, etc.)

These files are replaced with corrected versions using alphabetic-only
abbreviations per AGENTS.md Rule 8 (Special Characters MUST Be Excluded).

Related scripts updated for location resolution.
2025-12-07 14:23:50 +01:00
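Note: a check for alphabetic-only abbreviations could be as small as the sketch below. The function names are hypothetical, and simply dropping the invalid characters is an assumption; the commit does not say how the corrected abbreviations were derived.

```python
import re


def is_valid_abbreviation(abbreviation):
    """True if the abbreviation consists of ASCII capital letters only."""
    return re.fullmatch(r"[A-Z]+", abbreviation) is not None


def sanitize_abbreviation(abbreviation):
    """Drop every character that is not an ASCII letter (assumed strategy)."""
    return re.sub(r"[^A-Za-z]", "", abbreviation).upper()
```

Applied to the examples in the message, sanitize_abbreviation turns BM&HS into BMHS and VK( into VK. Letters with diacritics are removed rather than transliterated by this sketch, so a transliteration step would have to run first.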
kempersc
ee4e57bc75 add new entries 2025-12-07 00:26:01 +01:00
kempersc
1635625032 added web annotations 2025-12-06 19:50:04 +01:00
kempersc
55e2cd2340 feat: implement LLM-based extraction for Archives Lab content
- Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0.
- Replaced regex-based extraction with generative LLM inference.
- Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics.
- Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons.
- Results and statistics are saved in JSON format for further analysis.
2025-12-05 23:16:21 +01:00
kempersc
4da64eeebf improve annotator 2025-12-05 16:25:39 +01:00
kempersc
e38fb4613b improve annotation prompt 2025-12-05 15:51:39 +01:00
kempersc
3a242370fc annotation standards added 2025-12-05 15:30:23 +01:00
kempersc
d661947830 update enriched entries 2025-12-03 17:38:46 +01:00
kempersc
ef89b1213a validate enrichments 2025-12-02 14:36:01 +01:00
kempersc
4b833d20b2 add pids 2025-12-01 23:55:55 +01:00
kempersc
7dce283c17 Add new enums for PersonalCollectionType, ResearchCenterType, and TasteScentHeritage classifications; implement validation script for custodian names against authoritative sources 2025-12-01 18:39:22 +01:00
kempersc
48a2b26f59 feat: Add script to generate Mermaid ER diagrams with instance data from LinkML schemas
- Implemented `generate_mermaid_with_instances.py` to create ER diagrams that include all classes, relationships, enum values, and instance data.
- Loaded instance data from YAML files and enriched enum definitions with meaningful annotations.
- Configured output paths for generated diagrams in both frontend and schema directories.
- Added support for excluding technical classes and limiting the number of displayed enum and instance values for readability.
2025-12-01 16:58:03 +01:00
kempersc
097d116b72 enrich entries 2025-12-01 16:06:34 +01:00
kempersc
2497e5913f enrich entries 2025-12-01 00:37:24 +01:00
kempersc
f3c149b1bb update entries 2025-11-30 23:30:29 +01:00
kempersc
0ab8f24a6b archive websites 2025-11-29 18:05:16 +01:00
kempersc
da1eae6597 Refactor code structure for improved readability and maintainability 2025-11-29 12:27:39 +01:00
kempersc
30162e6526 Add script to validate KB library entries and generate enrichment report
- Implemented a Python script to validate KB library YAML files for required fields and data quality.
- Analyzed enrichment coverage from Wikidata and Google Maps, generating statistics.
- Created a comprehensive markdown report summarizing validation results and enrichment quality.
- Included error handling for file loading and validation processes.
- Generated JSON statistics for further analysis.
2025-11-28 14:48:33 +01:00
kempersc
5cdce584b2 Add complete schema for heritage custodian observation reconstruction
- Introduced a comprehensive class diagram for the heritage custodian observation reconstruction schema.
- Defined multiple classes including AllocationAgency, ArchiveOrganizationType, AuxiliaryDigitalPlatform, and others, with relevant attributes and relationships.
- Established inheritance and associations among classes to represent complex relationships within the schema.
- Generated on 2025-11-28, version 0.9.0, excluding the Container class.
2025-11-28 13:13:23 +01:00
kempersc
0d1741c55e Refactor code structure for improved readability and maintainability 2025-11-28 11:44:21 +01:00
kempersc
37886f0433 Refactor code structure for improved readability and maintainability 2025-11-27 17:43:14 +01:00
kempersc
5ef8ccac51 Add script to enrich NDE Register NL entries with Wikidata data
- Implemented a Python script that fetches and enriches entries from the NDE Register using data from Wikidata.
- Utilized the Wikibase REST API and SPARQL endpoints for data retrieval.
- Added logging for tracking progress and errors during the enrichment process.
- Configured rate limiting based on authentication status for API requests.
- Created a structured output in YAML format, including detailed enrichment data.
- Generated a log file summarizing the enrichment process and results.
2025-11-27 13:30:00 +01:00
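Note: rate limiting that depends on authentication status can be sketched as a minimum interval between requests. The class, the function and the default hourly limits below are placeholders, not values taken from the script or from Wikidata's documentation.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, requests_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / requests_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def limiter_for(token, anonymous_per_hour=500, authenticated_per_hour=5000):
    """Pick the stricter limit when no API token is configured (placeholder limits)."""
    limit = authenticated_per_hour if token else anonymous_per_hour
    return RateLimiter(limit)
```

Calling wait() before each API request keeps the script under the chosen limit; the injectable clock and sleep arguments exist only to make the class testable.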
kempersc
a5a66eb547 add classes 2025-11-25 12:48:07 +01:00