Commit graph

102 commits

Author SHA1 Message Date
kempersc
cc61d99acf geocode: add coordinates to BG and EG custodian files
- BG: Add lat/lon from existing GeoNames IDs (28 files)
- EG: Map city codes to GeoNames (CAI→Cairo, ALX→Alexandria, etc.) (28 files)
- Fix malformed EG-IS-\`A\`-O-SCA.yaml → EG-IS-ISM-O-SCA.yaml
- Overall coverage: 96.4% → 96.6%
2025-12-09 21:59:58 +01:00
kempersc
2137c522db geocode: add coordinates to JP compound cities and CZ files from GeoNames
- JP: Handle Gun/Cho/Machi/Mura compound city names (2615 files)
- CZ: Map city codes to GeoNames entries (667 files)
- Overall coverage: 84.5% → 96.4%
2025-12-09 21:49:40 +01:00
kempersc
92b5e58ef3 geocode: add coordinates to AT, BE, DE, GB, PL, UA, US custodian files from GeoNames 2025-12-09 20:38:34 +01:00
kempersc
3a6ead8fde feat: Add legal form filtering rule for CustodianName
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.

docs: Create README for value standardization rules

- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.

feat: Implement transliteration standards for non-Latin scripts

- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.

feat: Define XPath provenance rules for web observations

- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.

chore: Update records lifecycle diagram

- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
2025-12-09 16:58:41 +01:00
kempersc
7b42d720d5 geocode: add coordinates to CZ, BY, CH, FR, ES custodian files from GeoNames (1145 files) 2025-12-09 16:41:41 +01:00
kempersc
b54904ad0a fix: normalize YAML null formatting in Eye Filmmuseum file 2025-12-09 16:34:12 +01:00
kempersc
2c25ed6a96 geocode: add coordinates to JP custodian files from GeoNames (batch 2 - remaining 3639 files) 2025-12-09 16:33:29 +01:00
kempersc
9bc454cdbf geocode: add coordinates to JP custodian files from GeoNames (batch 1 - 3000 files) 2025-12-09 16:32:01 +01:00
kempersc
982620ba0c normalize: add canonical location blocks (batch 7 - final)
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have location block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
2025-12-09 14:57:33 +01:00
kempersc
e28576ee65 normalize: add canonical location blocks (batch 6)
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
2025-12-09 14:44:03 +01:00
kempersc
d20978dcbe normalize: add canonical location blocks (batch 5) 2025-12-09 14:39:02 +01:00
kempersc
3f60aa6238 normalize: add canonical location blocks (batch 4 - final) 2025-12-09 14:18:15 +01:00
kempersc
5b3d4d1ed5 normalize: add canonical location blocks (batch 3) 2025-12-09 14:14:13 +01:00
kempersc
b739ad4e61 normalize: add canonical location blocks (batch 2) 2025-12-09 13:28:59 +01:00
kempersc
bb41287730 normalize: add canonical location blocks (batch 1) 2025-12-09 13:17:11 +01:00
kempersc
fade1ed5b3 fix: add safety measures to prevent data loss during enrichment
Key changes:
- Created scripts/lib/safe_yaml_update.py with PROTECTED_KEYS constant
- Fixed enrich_custodians_wikidata_full.py to re-read files before writing
  (prevents race conditions where another script modified the file)
- Added safety check to abort if protected keys would be lost
- Protected keys include: location, original_entry, ghcid, provenance,
  google_maps_enrichment, osm_enrichment, etc.

Root cause of data loss in 62fdd35321:
- Script loaded files into list, then processed them later
- If another script modified files between load and write, changes were lost
- Now files are re-read immediately before modification

Per AGENTS.md Rule 5: NEVER Delete Enriched Data - Additive Only
2025-12-09 12:27:09 +01:00
kempersc
a7321b1bb9 reconstruct location blocks 2025-12-09 12:25:16 +01:00
kempersc
85a951bbea normalize: add canonical location blocks to 586 files
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)

Recovery from data loss in commit 62fdd35321 is now complete.
2025-12-09 12:17:34 +01:00
kempersc
cab712659d recover location blocks 2025-12-09 11:34:56 +01:00
kempersc
62fdd35321 Refactor code structure for improved readability and maintainability 2025-12-09 11:15:51 +01:00
kempersc
b61271220b enrich entries 2025-12-09 10:46:43 +01:00
kempersc
bf7c773955 edit Japanese entries 2025-12-09 09:16:19 +01:00
kempersc
c283daa1a2 normalise dutch entries 2025-12-09 08:02:27 +01:00
kempersc
609866886a enrich entries 2025-12-09 07:58:14 +01:00
kempersc
131e3ca259 normalise custodian entries 2025-12-09 07:56:35 +01:00
kempersc
0938cce6cf feat(loaders): update DuckLake and TypeDB loaders with relation support 2025-12-08 15:00:14 +01:00
kempersc
486bbee813 feat(wikidata): add re-enrichment and duplicate removal scripts
- Add reenrich_wikidata_with_verification.py for re-running enrichment
- Add remove_wikidata_duplicates.py for deduplication
2025-12-08 14:59:38 +01:00
kempersc
891692a4d6 feat(ghcid): add diacritics normalization and transliteration scripts
- Add fix_ghcid_diacritics.py for normalizing non-ASCII in GHCIDs
- Add resolve_diacritics_collisions.py for collision handling
- Add transliterate_emic_names.py for non-Latin script handling
- Add transliteration tests
2025-12-08 14:59:28 +01:00
kempersc
6a6557bbe8 feat(enrichment): add emic name enrichment and update CustodianName schema
- Add emic_name, name_language, standardized_name to CustodianName
- Add scripts for enriching custodian emic names from Wikidata
- Add YouTube and Google Maps enrichment scripts
- Update DuckLake loader for new schema fields
2025-12-08 14:58:50 +01:00
kempersc
35066eb5eb fix(typedb): improve concept serialization for frontend compatibility
- Add 'type' alias alongside '_type' for frontend compatibility
- Handle both bytes and string IID formats from driver
- Add 'id' alias alongside '_iid' for consistency
2025-12-08 14:58:35 +01:00
kempersc
271545fa8b docs: add Z.AI GLM API and transliteration rules to AGENTS.md
- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
2025-12-08 14:58:22 +01:00
kempersc
40bd3cb8f5 data(custodian): add emic_name fields and remove duplicate files with name suffixes
- Add emic_name, name_language, and standardized_name to 1,781 custodian files
- Remove 2,239 duplicate files that had name suffixes in filename
- Consolidate data into base GHCID files per PID stability rules
- Part of UNESCO Memory of the World custodian enrichment
2025-12-08 14:57:34 +01:00
kempersc
13f67bed19 feat(frontend): add graph visualization and data explorer features
Database Panels:
- Add D3.js force-directed graph visualization to Oxigraph and TypeDB panels
- Add 'Explore' tab with class/entity browser, graph/table toggle, and search
- Add data explorer to PostgreSQL panel with table browser, pagination, search, export
- Fix SPARQL variable naming bug in Oxigraph getGraphData() function
- Add node details panel showing selected entity attributes
- Add zoom/pan controls and node coloring by entity type

Map Features:
- Add TimelineSlider component for temporal filtering of institutions
- Support dual-handle range slider with decade histogram
- Add quick presets (Ancient, Medieval, Modern, Contemporary)
- Show institution density visualization by founding decade

Hooks:
- Extend useOxigraph with getGraphData() for graph visualization
- Extend useTypeDB with getGraphData() for graph visualization
- Extend usePostgreSQL with getTableData() and exportTableData()
- Improve useDuckLakeInstitutions with temporal filtering support

Styles:
- Add HeritageDashboard.css with shared panel styling
- Add TimelineSlider.css for timeline component styling
2025-12-08 14:56:17 +01:00
kempersc
7e3559f7e5 add new entries 2025-12-07 23:08:02 +01:00
kempersc
57c743b005 refactor(frontend): fetch NL municipalities from PostGIS API instead of static file
Replace static netherlands_municipalities_simplified.geojson with dynamic
PostGIS API call to /boundaries/countries/NL/admin2/geojson.

Transform API response properties to expected format:
- API: {code, name, name_local, admin1_code, admin1_name}
- Expected: {code, naam, provincieCode, provincieNaam}

This ensures NL boundary data comes from the authoritative PostGIS
database rather than a static file that could become outdated.
2025-12-07 19:48:07 +01:00
kempersc
12965071be fix(frontend): improve DuckLake connection detection in map page
Wait for DuckLake loading to complete before deciding whether to use
DuckLake data or fallback to static JSON. Prevents race conditions.
2025-12-07 19:23:12 +01:00
kempersc
1981dc28ed fix(frontend): normalize org_type names to letter codes in DuckLake hook
DuckLake stores full names like 'MUSEUM' but map expects single-letter
codes like 'M' for color styling. Also includes CSS fixes for Database page.
2025-12-07 19:21:58 +01:00
kempersc
18874e6070 fix(scripts): normalize org_type codes in DuckLake loader
- Handle single-letter GLAM type codes (G, L, A, M, O, R, C, etc.)
- Handle legacy GRP.HER.* format
- Support compound types like 'M,F' -> 'MUSEUM,FEATURES'
- Fix type hint syntax for Python 3.10+
2025-12-07 19:21:14 +01:00
kempersc
6d66e67bf4 fix(deploy): route Oxigraph requests through SSH tunnel
Oxigraph only listens on localhost, so deploy script now executes
curl commands via SSH instead of trying to reach it directly.
2025-12-07 19:20:56 +01:00
kempersc
810022d524 feat(frontend): add search filter and claim page filter to DuckLakePanel
- Add search bar to filter table data across all columns
- Filter web archive claims by selected page
- Include source_page in claim queries for filtering
- Fix TypeScript unused parameter warning
2025-12-07 19:20:40 +01:00
kempersc
f82dd57903 feat(frontend): add useBoundariesAPI hook for PostGIS boundary fetching
New React hook that fetches administrative boundaries from the PostGIS API:
- Supports international boundaries (NL, JP, CZ, DE, BE, CH, AT, etc.)
- Caches admin1, admin2, and GeoJSON data
- Provides point-in-polygon lookup
- Includes utility functions for filtering boundaries by code/name
- Replaces static GeoJSON file loading pattern
2025-12-07 19:20:21 +01:00
kempersc
49141c005b feat(api): add PostGIS boundary REST endpoints
Add new endpoints for querying administrative boundaries:
- GET /boundaries/countries - List all countries with boundary data
- GET /boundaries/countries/{code}/admin1 - List provinces/states
- GET /boundaries/countries/{code}/admin2 - List municipalities
- GET /boundaries/lookup?lat=&lon= - Point-in-polygon lookup
- GET /boundaries/admin2/{id}/geojson - Get single boundary GeoJSON
- GET /boundaries/countries/{code}/admin2/geojson - Get all admin2 as GeoJSON
- GET /boundaries/stats - Get boundary statistics

Uses proper joins through country_id/admin1_id foreign keys.
2025-12-07 18:56:11 +01:00
kempersc
400b1c04c1 fix(scripts): force table recreation in web archives migration
Drop existing tables before creating to ensure schema updates are applied
properly instead of using IF NOT EXISTS which would skip schema changes.
2025-12-07 18:47:46 +01:00
kempersc
0c4c378e06 fix(data): clean up YAML structure in BE/EG custodian files (450 files)
Remove redundant ch_annotator metadata and duplicate ghcid_history entries
that were causing YAML parsing issues. Files now have cleaner, more
consistent structure while preserving all essential data.
2025-12-07 18:46:42 +01:00
kempersc
90a1f20271 chore: add YAML history fix scripts and update ducklake/deploy tooling
- Add fix_yaml_history.py and fix_yaml_history_v2.py for cleaning up
  malformed ghcid_history entries with duplicate/redundant data
- Update load_custodians_to_ducklake.py for DuckDB lakehouse loading
- Update migrate_web_archives.py for web archive management
- Update deploy.sh with improvements
- Ignore entire data/ducklake/ directory (generated databases)
2025-12-07 18:45:52 +01:00
kempersc
7f85238f67 fix(scripts): update CBS GeoJSON field names for municipality loading
Support additional field name patterns:
- 'code'/'naam' (current CBS format)
- 'provincieCode'/'provincieNaam' for province data
2025-12-07 18:40:13 +01:00
kempersc
d9325c0bb5 feat: add web archives integration and improve enrichment scripts
Backend:
- Attach web_archives.duckdb as read-only database in DuckLake
- Create views for web_archives, web_pages, web_claims in heritage schema

Scripts:
- enrich_cities_google.py: Add batch processing and retry logic
- migrate_web_archives.py: Improve schema handling and error recovery

Frontend:
- DuckLakePanel: Add web archives query support
- Database.css: Improve layout for query results display
2025-12-07 17:49:07 +01:00
kempersc
1e01639c56 fix(data): resolve XXX placeholder city codes to actual GeoNames settlements
Rename 144 custodian files from XXX placeholders to resolved city codes:
- BR (65): ASA, RBR, MAN, FAZ, MAC, VIT, FOR, GUA, BRE, GOI, SLU, BHO, XAN, CGR, CUI, etc.
- CH (24): ZUR, BER, GEN, BAL, LUC, etc.
- MX (23): MEX, GDL, MTY, PUE, etc.
- CL (9): SCL, VAL, etc.
- CZ (5): PRG, BRN, MSV, etc.
- KR (4): SEL, etc.
- GB (4): LON, etc.
- FR (3): PAR, etc.
- IN (2): DEL, etc.
- PH, JP, EE (1 each)

City codes derived from GeoNames reverse geocoding using institution coordinates.
GHCID format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}
2025-12-07 17:47:09 +01:00
kempersc
83ab098cf7 feat: add PostGIS international boundary architecture
Add schema and tooling for storing administrative boundaries in PostGIS:
- 002_postgis_boundaries.sql: Complete PostGIS schema with:
  - boundary_countries (ISO 3166-1)
  - boundary_admin1 (states/provinces/regions)
  - boundary_admin2 (municipalities/districts)
  - boundary_historical (HALC pre-modern territories)
  - custodian_service_areas (computed werkgebied geometries)
  - geonames_settlements (reverse geocoding)
  - Spatial functions: find_admin_for_point, find_nearest_settlement
  - Views for API access

- load_boundaries_postgis.py: Python loader supporting:
  - GADM (Global Administrative Areas) - primary global source
  - CBS (Dutch municipality boundaries)
  - GeoNames settlements for reverse geocoding
  - Cached downloads and upsert logic

- POSTGIS_BOUNDARY_ARCHITECTURE.md: Design documentation

This replaces the static GeoJSON approach for international coverage.
2025-12-07 14:34:39 +01:00
kempersc
0b06af0fb6 chore: mark unused function and ignore ducklake databases 2025-12-07 14:28:12 +01:00