- CZ: 2,170 processed (26% of 8,432)
- JP: 2,075 processed (17% of 12,096)
- AR: Started processing
- Total checkpoint: 9,762 files across all countries
- Using crawl4ai favicon extraction
- Added logo_enrichment to 771 Czech custodian files
- 87% logo hit rate using crawl4ai favicon extraction
- Total checkpoint: 9,257 files across all countries
- CZ remaining: 6,642 files
- Add digital platform discovery data with provenance
- Cleanup duplicate/incorrect custodian entries
- Add GHCID collision resolution suffixes where needed
- Update person entity profiles with career history
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)
This brings overall geocoding coverage from 97.84% to 99.81%
- Improved city name cleaning:
- Roman numeral district suffixes (Kolín V. -> Kolín)
- City + country suffixes (Genève 4 - Suisse -> Genève)
- Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
- Historical city names (Gottwaldov -> Zlín, renamed 1990)
- Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
- Handle Czech address patterns:
- House numbers with čp./č.p. prefix
- X nad/pod Y town names (rivers/landmarks)
- Hyphenated district names (Město-Část)
- Trailing numbers and suffixes
- Improved city name normalization to handle:
- St. Gallen / St.Gallen -> Sankt Gallen
- Canton suffixes (Buchs SG, Brugg AG)
- Hyphenated districts (Bernex - Genève)
- Postal codes with slashes (Ecublens/VD)
- German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding