- Add digital platform discovery data with provenance
- Clean up duplicate/incorrect custodian entries
- Add GHCID collision resolution suffixes where needed
- Update person entity profiles with career history
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for the API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)
This brings overall geocoding coverage from 97.84% to 99.81%
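
The hyphen handling and prefix fallback listed above could look roughly like the sketch below. It is not the code in geocode_japan_postal.py: the function names, the list of dash characters, and the dictionary-shaped lookup table are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Dash-like characters mapped to ASCII "-" (an assumed, non-exhaustive list).
_HYPHENS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\u30fc\uff0d"
# Full-width digits U+FF10..U+FF19.
_FULLWIDTH_DIGITS = "".join(chr(0xFF10 + i) for i in range(10))
_TRANSLATE = str.maketrans(
    _HYPHENS + _FULLWIDTH_DIGITS,
    "-" * len(_HYPHENS) + "0123456789",
)

Coords = Tuple[float, float]


def normalize_postal_code(raw: str) -> Optional[str]:
    """Return the first 7-digit code in `raw` as NNN-NNNN, or None."""
    match = re.search(r"(\d{3})-?(\d{4})", raw.translate(_TRANSLATE))
    return f"{match.group(1)}-{match.group(2)}" if match else None


def lookup_with_prefix_fallback(
    code: str, table: Dict[str, Coords], min_prefix: int = 3
) -> Optional[Coords]:
    """Exact lookup first, then the first entry sharing the longest prefix."""
    if code in table:
        return table[code]
    digits = code.replace("-", "")
    # Shorten the prefix one digit at a time, never below `min_prefix` digits.
    for length in range(len(digits) - 1, min_prefix - 1, -1):
        prefix = digits[:length]
        for known, coords in table.items():
            if known.replace("-", "").startswith(prefix):
                return coords
    return None
```

The linear scan in the fallback keeps the sketch short; with a table of 142K entries, a pre-built prefix index would be the more practical choice.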
- Improved city name cleaning:
  - Roman numeral district suffixes (Kolín V. -> Kolín)
  - City + country suffixes (Genève 4 - Suisse -> Genève)
  - Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
  - Historical city names (Gottwaldov -> Zlín, renamed 1990)
  - Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
- Handle Czech address patterns:
  - House numbers with čp./č.p. prefix
  - X nad/pod Y town names (rivers/landmarks)
  - Hyphenated district names (Město-Část)
  - Trailing numbers and suffixes
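
A minimal sketch of the city name cleaning steps, using only the example pairs given in this commit. The function name, the regular expressions, and the list of country names are assumptions; the actual script may order or implement these rules differently.

```python
import re

# Manual replacements, limited to the two pairs named in this commit.
MANUAL_CITY_MAP = {"Gottwaldov": "Zlín", "Lugano Massagno": "Lugano"}

# Country names that may trail a city name (assumed list).
_COUNTRY_SUFFIX = re.compile(
    r"\s*-\s*(?:Suisse|Schweiz|Svizzera|Switzerland)\s*$", re.IGNORECASE
)
_POSTAL_PREFIX = re.compile(r"^p\.\s*")           # leading "p." marker
_TRAILING_NUMBER = re.compile(r"\s+\d+\s*$")      # e.g. "Genève 4"
_ROMAN_DISTRICT = re.compile(r"\s+[IVX]+\.?\s*$")  # e.g. "Kolín V."


def clean_city_name(raw: str) -> str:
    """Strip district, postal and country noise, then apply manual mappings."""
    name = raw.strip()
    name = _POSTAL_PREFIX.sub("", name)
    name = _COUNTRY_SUFFIX.sub("", name)
    name = _TRAILING_NUMBER.sub("", name)
    name = _ROMAN_DISTRICT.sub("", name)
    return MANUAL_CITY_MAP.get(name, name)
```

The country suffix is removed before the trailing number so that "Genève 4 - Suisse" reduces to "Genève" in one pass. "X nad/pod Y" names pass through unchanged because none of the patterns match them.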
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Canton codes after a slash (Ecublens/VD)
  - Abbreviated German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
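
The Swiss normalization cases could be expressed as a handful of regular expressions, as sketched below. This is not the code in scripts/geocode_from_city_name.py. The commit only states the target for the St. Gallen case; the outputs assumed here for the other cases (keeping the part before the canton code, before " - ", or before "b.") are guesses, as is the function name.

```python
import re

# Two-letter codes of the 26 Swiss cantons.
CANTONS = (
    "AG AI AR BE BL BS FR GE GL GR JU LU NE NW OW "
    "SG SH SO SZ TG TI UR VD VS ZG ZH"
).split()

_SAINT = re.compile(r"^St\.\s*")  # "St. Gallen" and "St.Gallen"
_CANTON_SUFFIX = re.compile(r"(?:\s+|\s*/\s*)(?:%s)$" % "|".join(CANTONS))
_BEI = re.compile(r"\s+b\.\s+.*$")      # "b." abbreviates "bei"
_DISTRICT = re.compile(r"\s+-\s+.*$")   # "Bernex - Genève"


def normalize_swiss_city(raw: str) -> str:
    """Reduce a Swiss address city field to a bare place name (assumed targets)."""
    name = raw.strip()
    name = _SAINT.sub("Sankt ", name)
    name = _CANTON_SUFFIX.sub("", name)
    name = _BEI.sub("", name)
    name = _DISTRICT.sub("", name)
    return name
```

The canton pattern is case-sensitive on purpose, so that place names ending in ordinary lowercase letters are not truncated.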
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.
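
As a rough illustration of the filtering idea, the sketch below drops tokens that match a small set of legal form designations. The token list and the function name are hypothetical and are not taken from the rule document, which remains the authority on what counts as a legal form.

```python
# Hypothetical legal form designations, compared with dots removed and lowercased.
LEGAL_FORM_TOKENS = {"stichting", "foundation", "gmbh", "ltd", "inc", "bv", "ev"}


def strip_legal_form(name: str) -> str:
    """Remove legal form tokens from a custodian name, keeping the other words."""
    kept = [
        token
        for token in name.split()
        if token.strip(",").replace(".", "").lower() not in LEGAL_FORM_TOKENS
    ]
    # Drop a comma left dangling after a removed trailing token.
    return " ".join(kept).rstrip(" ,")
```

A token-based filter like this removes a designation wherever it appears in the name; a stricter variant might only consider the first and last tokens.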
docs: Create README for value standardization rules
- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.
feat: Implement transliteration standards for non-Latin scripts
- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.
feat: Define XPath provenance rules for web observations
- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.
chore: Update records lifecycle diagram
- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have location block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)
Recovery from data loss in commit 62fdd35321 is now complete.
- Add emic_name, name_language, and standardized_name to 1,781 custodian files
- Remove 2,239 duplicate files that had name suffixes in filename
- Consolidate data into base GHCID files per PID stability rules
- Part of UNESCO Memory of the World custodian enrichment
Remove redundant ch_annotator metadata and duplicate ghcid_history entries
that were causing YAML parsing issues. Files now have cleaner, more
consistent structure while preserving all essential data.
Rename 144 custodian files from XXX placeholders to resolved city codes:
- BR (65): ASA, RBR, MAN, FAZ, MAC, VIT, FOR, GUA, BRE, GOI, SLU, BHO, XAN, CGR, CUI, etc.
- CH (24): ZUR, BER, GEN, BAL, LUC, etc.
- MX (23): MEX, GDL, MTY, PUE, etc.
- CL (9): SCL, VAL, etc.
- CZ (5): PRG, BRN, MSV, etc.
- KR (4): SEL, etc.
- GB (4): LON, etc.
- FR (3): PAR, etc.
- IN (2): DEL, etc.
- PH, JP, EE (1 each)
City codes derived from GeoNames reverse geocoding using institution coordinates.
GHCID format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}
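
The five-part format and the placeholder replacement can be sketched as follows. The class, the helper names, and the identifier used in the usage note are made up for illustration. Splitting on at most four hyphens, so that the abbreviation may itself contain hyphens, is an assumption rather than something the format statement guarantees.

```python
from dataclasses import dataclass

PLACEHOLDER_CITY = "XXX"


@dataclass(frozen=True)
class Ghcid:
    country: str
    region: str
    city: str
    inst_type: str
    abbrev: str

    def __str__(self) -> str:
        return "-".join(
            (self.country, self.region, self.city, self.inst_type, self.abbrev)
        )

    def with_city(self, city_code: str) -> "Ghcid":
        """Return a copy with the city component replaced."""
        return Ghcid(
            self.country, self.region, city_code, self.inst_type, self.abbrev
        )


def parse_ghcid(value: str) -> Ghcid:
    """Split {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV} into its components."""
    parts = value.split("-", 4)  # at most 5 parts; the rest stays in abbrev
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"not a five-part GHCID: {value!r}")
    return Ghcid(*parts)
```

With a made-up identifier, `str(parse_ghcid("CH-ZH-XXX-M-KHZ").with_city("ZUR"))` yields `CH-ZH-ZUR-M-KHZ`, which mirrors the rename from an XXX placeholder to a resolved city code.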