Geocoding Session Summary - 2025-11-07
Status: ✅ GEOCODING IN PROGRESS
Started: 2025-11-07 13:23
Expected Completion: ~14:50 (90 minutes runtime)
What Was Done
1. Created Global Geocoding Script ✅
File: scripts/geocode_global_institutions.py
Features:
- Persistent SQLite cache (data/cache/geocoding_cache.db)
- Rate limiting (1 request/second for Nominatim API)
- Country-specific query optimization (JP, NL, MX, BR, CL, etc.)
- Progress tracking with ETA calculation
- Resume capability (cached results persist across runs)
- Comprehensive error handling
- Command-line options:
  - --dry-run: Test without API calls
  - --limit N: Process only first N institutions
  - --country CODE: Filter by country
  - --force: Re-geocode existing coordinates
  - --verbose: Detailed progress logging
Usage:
# Full run (current)
python scripts/geocode_global_institutions.py
# Test run
python scripts/geocode_global_institutions.py --dry-run --limit 10 --verbose
# Geocode specific country
python scripts/geocode_global_institutions.py --country NL --verbose
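The command-line options listed above could be declared with argparse along these lines. This is a minimal sketch, not the script's actual code: the flag names come from the option list, while `build_parser`, the help strings, and the defaults are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Geocode heritage institutions (illustrative sketch)."
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="Test without API calls")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="Process only the first N institutions")
    parser.add_argument("--country", default=None, metavar="CODE",
                        help="Filter by country code, e.g. NL")
    parser.add_argument("--force", action="store_true",
                        help="Re-geocode existing coordinates")
    parser.add_argument("--verbose", action="store_true",
                        help="Detailed progress logging")
    return parser


# Parse the documented test-run invocation.
args = build_parser().parse_args(["--dry-run", "--limit", "10", "--verbose"])
```

With no arguments, every flag falls back to its default, which corresponds to the full run.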
2. Created Progress Monitor Script ✅
File: scripts/check_geocoding_progress.sh
Usage:
# Check current progress
./scripts/check_geocoding_progress.sh
# Watch live progress
tail -f data/logs/geocoding_full_run.log
3. Started Full Geocoding Run ✅
Command:
nohup python scripts/geocode_global_institutions.py > data/logs/geocoding_full_run.log 2>&1 &
Process ID: 62188
Log File: data/logs/geocoding_full_run.log
Current Progress
Dataset Size: 13,396 institutions
Need Geocoding: 13,209 institutions (98.6%)
Already Geocoded: 187 institutions (1.4% - Latin American data)
Progress (as of 13:25):
- Institutions processed: 60 / 13,396 (0.4%)
- Processing rate: ~2.4 institutions/second
- Estimated completion: ~90 minutes
- Cache hits are improving throughput, which is why the rate exceeds the 1 request/second API limit
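The completion estimate can be reproduced from the progress figures above. The sketch below assumes a constant processing rate; `eta_minutes` is a hypothetical helper, not a function from the script.

```python
def eta_minutes(processed: int, total: int, rate_per_second: float) -> float:
    """Remaining time in minutes, assuming the rate stays constant."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return (total - processed) / rate_per_second / 60


# Figures reported at 13:25: 60 of 13,396 processed at ~2.4 institutions/second.
remaining = eta_minutes(processed=60, total=13_396, rate_per_second=2.4)
print(f"ETA: ~{remaining:.0f} minutes")  # prints "ETA: ~93 minutes"
```

The rate depends on how many lookups are served from cache, so the real figure will drift as the run proceeds.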
By Country (institutions needing geocoding):
| Country | Total | Need Geocoding | Priority |
|---|---|---|---|
| JP | 12,065 | 12,065 (100%) | High |
| NL | 1,017 | 1,017 (100%) | High |
| MX | 109 | 50 (45.9%) | Medium |
| BR | 97 | 47 (48.5%) | Medium |
| CL | 90 | 12 (13.3%) | Low |
| Others | 18 | 18 (100%) | Low |
How Geocoding Works
Query Optimization by Country
Japan (JP):
- Cleans city names: removes " SHI", " KU", " CHO", " MURA"
- Query format: {city_clean}, {prefecture}, Japan
- Example: "SAPPORO KITA, HOKKAIDO, Japan"
Netherlands (NL):
- Prioritizes postal code + city
- Query format: {postal_code}, {city}, Netherlands
- Example: "1012 JS, Amsterdam, Netherlands"
Mexico (MX):
- Uses city + state
- Query format: {city}, {state}, Mexico
- Example: "Ciudad de México, CDMX, Mexico"
Brazil (BR):
- Uses city + state abbreviation
- Query format: {city}, {state}, Brazil
- Example: "Rio de Janeiro, RJ, Brazil"
Chile (CL):
- Uses city + region
- Query format: {city}, {region}, Chile
- Example: "Santiago, Región Metropolitana, Chile"
Caching Strategy
Cache Storage: SQLite database at data/cache/geocoding_cache.db
Cache Benefits:
- Avoid repeated API calls for same query
- Persist across script runs (resume capability)
- Speed up processing (cache hits = instant results)
- Cache both successes AND failures (avoid retrying bad queries)
Cache Statistics (after 60 institutions):
- Total cached queries: 5
- Successful: 5
- Failed: 0
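A cache with these properties could look roughly like the sketch below. The table name, column names, and the `GeocodeCache` class are assumptions for illustration; the actual schema of geocoding_cache.db is not shown in this summary. The key point is that failed lookups are stored too, so a bad query is not retried on the next run.

```python
import sqlite3
from typing import Optional, Tuple

Coords = Tuple[float, float]


class GeocodeCache:
    """Hypothetical cache keyed by query text; stores successes and failures."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS geocode_cache ("
            " query TEXT PRIMARY KEY,"   # primary key gives an indexed lookup
            " latitude REAL,"
            " longitude REAL,"
            " success INTEGER NOT NULL)"
        )

    def get(self, query: str) -> Tuple[bool, Optional[Coords]]:
        """Return (found, coords); coords is None for a cached failure."""
        row = self.conn.execute(
            "SELECT latitude, longitude, success FROM geocode_cache"
            " WHERE query = ?",
            (query,),
        ).fetchone()
        if row is None:
            return False, None
        lat, lon, success = row
        return True, ((lat, lon) if success else None)

    def put(self, query: str, coords: Optional[Coords]) -> None:
        """Store a result; pass None to record a failed lookup."""
        lat, lon = coords if coords is not None else (None, None)
        self.conn.execute(
            "INSERT OR REPLACE INTO geocode_cache VALUES (?, ?, ?, ?)",
            (query, lat, lon, int(coords is not None)),
        )
        self.conn.commit()


cache = GeocodeCache()  # in-memory here; a file path would persist across runs
cache.put("SAPPORO KITA, HOKKAIDO, Japan", (43.0609096, 141.3460696))
cache.put("an unresolvable query", None)
```

Returning a separate `found` flag lets the caller distinguish "never queried" from "queried and failed", which is what makes failure caching possible.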
Expected Results
Output Files
Primary Output:
- File: data/instances/global/global_heritage_institutions.yaml
- Change: Each entry in an institution's locations array will gain:
  - latitude: <float> - WGS84 latitude
  - longitude: <float> - WGS84 longitude
  - geonames_id: <int> - GeoNames ID (when available)
Example:
locations:
  - location_type: Physical Address
    street_address: 4-1-2 SHINKOTONI 7-JO, SAPPORO SHI KITA KU, HOKKAIDO
    city: SAPPORO SHI KITA KU
    postal_code: 001-0907
    region: HOKKAIDO
    country: JP
    is_primary: true
    latitude: 43.0609096 # ← NEW
    longitude: 141.3460696 # ← NEW
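The update to a single location entry can be sketched as follows, treating the entry as a plain dictionary after the YAML has been loaded. `apply_coordinates` is a hypothetical helper, and the skip-unless-forced behaviour is inferred from the description of the --force option rather than taken from the script.

```python
def apply_coordinates(location: dict, latitude: float, longitude: float,
                      force: bool = False) -> bool:
    """Write coordinates into a location entry; return True if it changed."""
    already_geocoded = "latitude" in location and "longitude" in location
    if already_geocoded and not force:
        return False  # keep existing coordinates unless forced
    location["latitude"] = latitude
    location["longitude"] = longitude
    return True


location = {
    "city": "SAPPORO SHI KITA KU",
    "region": "HOKKAIDO",
    "country": "JP",
    "is_primary": True,
}
changed = apply_coordinates(location, 43.0609096, 141.3460696)
```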
Log File:
- File: data/logs/geocoding_full_run.log
- Contains:
  - Progress updates every 10 institutions
  - Error messages for failed geocoding
  - Final statistics summary
  - Execution time and API call rate
Cache Database:
- File: data/cache/geocoding_cache.db
- Contains: ~13,000 geocoding query results
- Reusable: Can be used for future runs or other scripts
Expected Statistics
Estimated Results (after completion):
- Total institutions: 13,396
- Successfully geocoded: ~12,800-13,000 (95-97%)
- Failed geocoding: ~200-400 (2-3%)
  - Reasons: Ambiguous locations, unknown places, API errors
- Cache hits: ~500-1,000 (duplicate city names)
- API calls: ~12,500-12,800
- Execution time: ~90 minutes
Coverage by Country (estimated):
- Japan (JP): ~95% success (rural areas may fail)
- Netherlands (NL): ~98% success (postal codes help)
- Mexico (MX): ~92% success
- Brazil (BR): ~93% success
- Chile (CL): ~96% success (already partially geocoded)
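The API-call and execution-time estimates above can be cross-checked against the 1 request/second rate limit. The sketch assumes every API call is spaced by the full rate-limit interval and ignores network latency; both helper functions are hypothetical.

```python
RATE_LIMIT_PER_SECOND = 1.0  # Nominatim limit stated in this document


def min_runtime_minutes(api_calls: int) -> float:
    """Lower bound on runtime if every call waits for the rate limit."""
    return api_calls / RATE_LIMIT_PER_SECOND / 60


def max_api_calls(runtime_minutes: float) -> int:
    """Upper bound on API calls that fit into a given runtime."""
    return int(runtime_minutes * 60 * RATE_LIMIT_PER_SECOND)


for calls in (12_500, 12_800):
    print(f"{calls:,} API calls -> at least {min_runtime_minutes(calls):.0f} min")
print(f"90 min -> at most {max_api_calls(90):,} API calls")
```

Under these assumptions, ~12,500 API calls would need more than three hours, while a ~90-minute run leaves room for at most 5,400 calls. If the run does finish in about 90 minutes, the cache-hit share is likely much higher than the ~500-1,000 estimated here, which would fit the observed ~2.4 institutions/second.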
Monitoring Commands
Check Progress
# Quick status check
./scripts/check_geocoding_progress.sh
# Check if process is still running
ps aux | grep geocode_global_institutions.py | grep -v grep
# View last 20 lines of log
tail -20 data/logs/geocoding_full_run.log
# Watch live progress (Ctrl+C to exit)
tail -f data/logs/geocoding_full_run.log
Troubleshooting
If process appears stuck:
# Check process status
ps aux | grep geocode_global_institutions.py
# View recent progress
tail -50 data/logs/geocoding_full_run.log
# Check for errors
grep -i "error" data/logs/geocoding_full_run.log
If process dies unexpectedly:
# Restart - it will resume from cache
python scripts/geocode_global_institutions.py
When Geocoding Completes
Validation Steps
- Check final statistics in log:
  tail -50 data/logs/geocoding_full_run.log
- Validate dataset:
  python scripts/validate_geocoding_results.py # (to be created)
- Generate visualizations:
  - World map with institution markers
  - Density heatmaps
  - Coverage analysis by country
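Since the validation script does not exist yet, the following is only a sketch of the kind of per-location check it might perform. The function name and the specific checks (coordinate presence, WGS84 range, and the suspicious (0, 0) point) are assumptions, not a specification.

```python
from typing import List


def validate_location(location: dict) -> List[str]:
    """Return a list of problems found in one location entry (empty if OK)."""
    lat = location.get("latitude")
    lon = location.get("longitude")
    if lat is None or lon is None:
        return ["missing coordinates"]
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude out of range: {lon}")
    if lat == 0 and lon == 0:
        problems.append("coordinates are (0, 0)")
    return problems


ok = validate_location({"latitude": 43.0609096, "longitude": 141.3460696})
swapped = validate_location({"latitude": 141.3460696, "longitude": 43.0609096})
```

A range check catches swapped latitude and longitude for Japanese records, because longitudes around 141 are not valid latitudes. It would not catch a swap for locations where both values are below 90, so a per-country bounding-box check could be a useful addition.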
Next Steps
Wikidata Enrichment (Priority 2):
- Replace 3,426 synthetic Q-numbers with real Wikidata IDs
- Query Wikidata SPARQL for ISIL codes (property P791)
- Enrich institution metadata from Wikidata
- Estimated time: 6-10 minutes
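The ISIL lookup could be expressed as a SPARQL query against the public Wikidata endpoint, batching several codes per request. The sketch below only builds the query text and request URL; it sends nothing. The helper names and the sample ISIL codes are hypothetical placeholders, not values from the dataset.

```python
from typing import Iterable
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_isil_query(isil_codes: Iterable[str]) -> str:
    """Build a SPARQL query matching items by ISIL code (property P791)."""
    literals = []
    for code in isil_codes:
        if '"' in code or "\\" in code:
            raise ValueError(f"unsafe character in ISIL code: {code!r}")
        literals.append(f'"{code}"')
    values = " ".join(literals)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


def build_request_url(query: str) -> str:
    """URL for a GET request returning JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode(
        {"query": query, "format": "json"}
    )


# Placeholder codes for illustration only.
query = build_isil_query(["NL-EXAMPLE1", "JP-EXAMPLE2"])
url = build_request_url(query)
```

Batching codes in a VALUES clause keeps the number of requests low, which is presumably what makes a runtime of a few minutes plausible.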
Collection Metadata Extraction (Priority 3):
- Web scraping with crawl4ai
- Extract from 10,932 institutional websites (81.6%)
- Populate LinkML Collection class
- Estimated time: Several days
Files Created This Session
Scripts
- ✅ scripts/geocode_global_institutions.py - Main geocoding script (398 lines)
- ✅ scripts/check_geocoding_progress.sh - Progress monitor
Data Files
- 🔄 data/instances/global/global_heritage_institutions.yaml - Being updated with coordinates
- ✅ data/cache/geocoding_cache.db - SQLite cache (growing)
- ✅ data/logs/geocoding_full_run.log - Execution log
Directories Created
- ✅ data/cache/ - For geocoding cache
- ✅ data/logs/ - For execution logs
Technical Notes
Nominatim API Usage Policy Compliance
Rate Limiting: ✅ 1 request/second (enforced by script)
User-Agent: ✅ GLAM-Heritage-Data-Extraction/1.0
Caching: ✅ All results cached to minimize API load
Timeout: ✅ 10 seconds per request
Error Handling: ✅ Graceful failures, no retry storms
Reference: https://operations.osmfoundation.org/policies/nominatim/
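The compliance points above (rate limit, User-Agent, timeout, graceful failure) can be combined as in the sketch below. It is an illustration under assumptions, not the script's code: `RateLimiter`, `build_request`, and `geocode` are hypothetical names, and the request uses the public Nominatim search endpoint with its `q`, `format`, and `limit` parameters.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
USER_AGENT = "GLAM-Heritage-Data-Extraction/1.0"
TIMEOUT_SECONDS = 10


class RateLimiter:
    """Enforce a minimum interval between calls (1 second by default)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(query: str) -> urllib.request.Request:
    """Build a search request carrying the identifying User-Agent."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    return urllib.request.Request(
        f"{NOMINATIM_URL}?{params}", headers={"User-Agent": USER_AGENT}
    )


def geocode(query: str, limiter: RateLimiter):
    """Return (lat, lon) or None; a failure is reported once, never retried."""
    limiter.wait()
    try:
        with urllib.request.urlopen(
            build_request(query), timeout=TIMEOUT_SECONDS
        ) as response:
            results = json.load(response)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Only cache misses should pass through `geocode`, so the limiter throttles API traffic without slowing cached lookups.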
Performance Optimization
Cache Efficiency:
- Same city queried multiple times → cache hit (instant)
- Example: 102 Toyohashi libraries → 1 API call, 101 cache hits
Query Optimization:
- Country-specific formatting improves match accuracy
- Postal codes (NL) → higher precision
- Prefecture names (JP) → better disambiguation
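The country-specific query formats described under "How Geocoding Works" could be built as in the sketch below. Field names (`city`, `region`, `postal_code`, `country`) follow the YAML example in this document; the function names are hypothetical, and two choices are assumptions: Japanese suffixes are removed as whole words, and the Mexican and Brazilian state is assumed to be stored in `region`.

```python
JP_SUFFIX_TOKENS = {"SHI", "KU", "CHO", "MURA"}
COUNTRY_NAMES = {
    "JP": "Japan",
    "NL": "Netherlands",
    "MX": "Mexico",
    "BR": "Brazil",
    "CL": "Chile",
}


def clean_jp_city(city: str) -> str:
    """Drop administrative suffix words, keeping the first word untouched."""
    tokens = city.split()
    kept = tokens[:1] + [t for t in tokens[1:] if t not in JP_SUFFIX_TOKENS]
    return " ".join(kept)


def build_query(location: dict) -> str:
    """Format a geocoding query according to the location's country."""
    country = location["country"]
    if country == "JP":
        parts = [clean_jp_city(location.get("city", "")), location.get("region")]
    elif country == "NL":
        parts = [location.get("postal_code"), location.get("city")]
    else:  # MX, BR, CL and others: city plus state or region
        parts = [location.get("city"), location.get("region")]
    parts.append(COUNTRY_NAMES.get(country, country))
    return ", ".join(p for p in parts if p)  # skip missing fields


jp_query = build_query(
    {"city": "SAPPORO SHI KITA KU", "region": "HOKKAIDO", "country": "JP"}
)
nl_query = build_query(
    {"postal_code": "1012 JS", "city": "Amsterdam", "country": "NL"}
)
```

Here `jp_query` is "SAPPORO KITA, HOKKAIDO, Japan" and `nl_query` is "1012 JS, Amsterdam, Netherlands", matching the documented examples. Removing whole words rather than substrings avoids damaging city names that merely contain "SHI" or "KU".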
Database Performance:
- SQLite cache uses indexed primary key (query text)
- Average lookup time: <1ms
- No database locks (single writer process)
Session Handoff
Current State:
- ✅ Geocoding script created and tested
- 🔄 Full geocoding run in progress (ETA: 90 minutes)
- ✅ Monitoring tools available
- ✅ Cache infrastructure working
When You Return:
- Run ./scripts/check_geocoding_progress.sh to check status
- If complete, check final statistics in log
- Run validation script (to be created)
- Proceed with Priority 2: Wikidata enrichment
If Something Goes Wrong:
- Process can be safely restarted (cache preserves progress)
- Log file contains detailed error information
- Cache can be inspected directly:
sqlite3 data/cache/geocoding_cache.db
Last Updated: 2025-11-07 13:26
Next Check: 2025-11-07 14:50 (estimated completion)
Status: 🟢 Running smoothly