- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
- CZ: 2,170 processed (26% of 8,432)
- JP: 2,075 processed (17% of 12,096)
- AR: Started processing
- Total checkpoint: 9,762 files across all countries
- Using crawl4ai favicon extraction
- Added logo_enrichment to 771 Czech custodian files
- 87% logo hit rate using crawl4ai favicon extraction
- Total checkpoint: 9,257 files across all countries
- CZ remaining: 6,642 files
- Add digital platform discovery data with provenance
- Cleanup duplicate/incorrect custodian entries
- Add GHCID collision resolution suffixes where needed
- Update person entity profiles with career history
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)
This brings overall geocoding coverage from 97.84% to 99.81%
- Improved city name cleaning:
- Roman numeral district suffixes (Kolín V. -> Kolín)
- City + country suffixes (Genève 4 - Suisse -> Genève)
- Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
- Historical city names (Gottwaldov -> Zlín, renamed 1990)
- Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
- Handle Czech address patterns:
- House numbers with čp./č.p. prefix
- X nad/pod Y town names (rivers/landmarks)
- Hyphenated district names (Město-Část)
- Trailing numbers and suffixes
- Improved city name normalization to handle:
- St. Gallen / St.Gallen -> Sankt Gallen
- Canton suffixes (Buchs SG, Brugg AG)
- Hyphenated districts (Bernex - Genève)
- Postal codes with slashes (Ecublens/VD)
- German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.
docs: Create README for value standardization rules
- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.
feat: Implement transliteration standards for non-Latin scripts
- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.
feat: Define XPath provenance rules for web observations
- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.
chore: Update records lifecycle diagram
- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have location block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)
Recovery from data loss in commit 62fdd35321 is now complete.
- Add emic_name, name_language, and standardized_name to 1,781 custodian files
- Remove 2,239 duplicate files that had name suffixes in filename
- Consolidate data into base GHCID files per PID stability rules
- Part of UNESCO Memory of the World custodian enrichment
Remove redundant ch_annotator metadata and duplicate ghcid_history entries
that were causing YAML parsing issues. Files now have cleaner, more
consistent structure while preserving all essential data.