Geocoding Session Summary - 2025-11-07

Status: GEOCODING IN PROGRESS

Started: 2025-11-07 13:23
Expected Completion: ~14:50 (90 minutes runtime)


What Was Done

1. Created Global Geocoding Script

File: scripts/geocode_global_institutions.py

Features:

  • Persistent SQLite cache (data/cache/geocoding_cache.db)
  • Rate limiting (1 request/second for Nominatim API)
  • Country-specific query optimization (JP, NL, MX, BR, CL, etc.)
  • Progress tracking with ETA calculation
  • Resume capability (cached results persist across runs)
  • Comprehensive error handling
  • Command-line options:
    • --dry-run: Test without API calls
    • --limit N: Process only first N institutions
    • --country CODE: Filter by country
    • --force: Re-geocode existing coordinates
    • --verbose: Detailed progress logging

Usage:

# Full run (current)
python scripts/geocode_global_institutions.py

# Test run
python scripts/geocode_global_institutions.py --dry-run --limit 10 --verbose

# Geocode specific country
python scripts/geocode_global_institutions.py --country NL --verbose
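
The options listed above map naturally onto argparse. A minimal sketch of how the command line could be declared; the flag names and help texts come from the list above, while the function name `build_parser` and the exact parser layout are assumptions, not the script's actual code:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options described in the feature list."""
    parser = argparse.ArgumentParser(
        description="Geocode heritage institutions via Nominatim."
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="Test without API calls")
    parser.add_argument("--limit", type=int, metavar="N",
                        help="Process only the first N institutions")
    parser.add_argument("--country", metavar="CODE",
                        help="Filter by country code, e.g. NL")
    parser.add_argument("--force", action="store_true",
                        help="Re-geocode existing coordinates")
    parser.add_argument("--verbose", action="store_true",
                        help="Detailed progress logging")
    return parser


# Equivalent to the "Test run" command above.
args = build_parser().parse_args(["--dry-run", "--limit", "10", "--verbose"])
print(args.dry_run, args.limit, args.country)
```

argparse converts `--dry-run` to the attribute `dry_run`, and options that are not passed default to `None` (or `False` for the boolean flags).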

2. Created Progress Monitor Script

File: scripts/check_geocoding_progress.sh

Usage:

# Check current progress
./scripts/check_geocoding_progress.sh

# Watch live progress
tail -f data/logs/geocoding_full_run.log

3. Started Full Geocoding Run

Command:

nohup python scripts/geocode_global_institutions.py > data/logs/geocoding_full_run.log 2>&1 &

Process ID: 62188
Log File: data/logs/geocoding_full_run.log


Current Progress

Dataset Size: 13,396 institutions
Need Geocoding: 13,209 institutions (98.6%)
Already Geocoded: 187 institutions (1.4% - Latin American data)

Progress (as of 13:25):

  • Institutions processed: 60 / 13,396 (0.4%)
  • Processing rate: ~2.4 institutions/second
  • Estimated completion: ~90 minutes
  • Cache hits are already lifting throughput above the 1 request/second API limit
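
The completion estimate follows from the observed rate. A minimal sketch of that arithmetic, assuming the rate stays constant for the rest of the run (the function name `estimate_eta_minutes` is illustrative and not taken from the script):

```python
def estimate_eta_minutes(total: int, processed: int, rate_per_second: float) -> float:
    """Return the estimated remaining time in minutes at a constant rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    remaining = total - processed
    return remaining / rate_per_second / 60


# Figures from the progress snapshot above.
eta = estimate_eta_minutes(total=13_396, processed=60, rate_per_second=2.4)
print(f"ETA: about {eta:.0f} minutes")
```

With 13,336 institutions left at ~2.4 per second, this gives a little over 92 minutes, in line with the ~90 minute estimate.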

By Country (institutions needing geocoding):

Country   Total    Need Geocoding   Priority
JP        12,065   12,065 (100%)    High
NL        1,017    1,017 (100%)     High
MX        109      50 (45.9%)       Medium
BR        97       47 (48.5%)       Medium
CL        90       12 (13.3%)       Low
Others    18       18 (100%)        Low

How Geocoding Works

Query Optimization by Country

Japan (JP):

  • Cleans city names: removes " SHI", " KU", " CHO", " MURA"
  • Query format: {city_clean}, {prefecture}, Japan
  • Example: "SAPPORO KITA, HOKKAIDO, Japan"

Netherlands (NL):

  • Prioritizes postal code + city
  • Query format: {postal_code}, {city}, Netherlands
  • Example: "1012 JS, Amsterdam, Netherlands"

Mexico (MX):

  • Uses city + state
  • Query format: {city}, {state}, Mexico
  • Example: "Ciudad de México, CDMX, Mexico"

Brazil (BR):

  • Uses city + state abbreviation
  • Query format: {city}, {state}, Brazil
  • Example: "Rio de Janeiro, RJ, Brazil"

Chile (CL):

  • Uses city + region
  • Query format: {city}, {region}, Chile
  • Example: "Santiago, Región Metropolitana, Chile"
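
The per-country formats above can be sketched as one query builder. This is an illustration under assumptions, not the script's code: the field names (`city`, `region`, `postal_code`, `country`) follow the YAML records, state and region are assumed to share the `region` field, and suffixes are removed as whole tokens after the first word rather than by substring replacement.

```python
# Administrative suffixes stripped from Japanese city names (from the list above).
JP_SUFFIXES = {"SHI", "KU", "CHO", "MURA"}

COUNTRY_NAMES = {
    "JP": "Japan",
    "NL": "Netherlands",
    "MX": "Mexico",
    "BR": "Brazil",
    "CL": "Chile",
}


def clean_jp_city(city: str) -> str:
    """Drop suffix tokens such as SHI and KU, always keeping the first word."""
    tokens = city.split()
    kept = tokens[:1] + [t for t in tokens[1:] if t.upper() not in JP_SUFFIXES]
    return " ".join(kept)


def build_query(location: dict) -> str:
    """Build a country-specific geocoding query string for one location."""
    country = location.get("country", "")
    city = location.get("city", "")
    region = location.get("region", "")
    if country == "JP":
        parts = [clean_jp_city(city), region]
    elif country == "NL":
        parts = [location.get("postal_code", ""), city]
    else:
        # MX, BR, CL and others: city plus state/region.
        parts = [city, region]
    parts.append(COUNTRY_NAMES.get(country, country))
    # Skip empty components so missing fields do not leave stray commas.
    return ", ".join(p for p in parts if p)


print(build_query({"city": "SAPPORO SHI KITA KU", "region": "HOKKAIDO", "country": "JP"}))
```

Matching whole tokens avoids damaging names that merely start with a suffix, for example a ward called KUSHIRO.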

Caching Strategy

Cache Storage: SQLite database at data/cache/geocoding_cache.db

Cache Benefits:

  1. Avoid repeated API calls for same query
  2. Persist across script runs (resume capability)
  3. Speed up processing (cache hits = instant results)
  4. Cache both successes AND failures (avoid retrying bad queries)
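
A minimal sketch of such a cache, assuming a single table keyed by the query text with a NULL result marking a failed lookup. The table name, column names, and the `GeocodeCache` class are hypothetical; the real schema of geocoding_cache.db may differ.

```python
import json
import sqlite3
from typing import Callable, Optional


class GeocodeCache:
    """Tiny query -> result cache; a NULL result records a failed lookup."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS geocode_cache ("
            "query TEXT PRIMARY KEY, result TEXT)"
        )

    def lookup(self, query: str,
               fetch: Callable[[str], Optional[dict]]) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT result FROM geocode_cache WHERE query = ?", (query,)
        ).fetchone()
        if row is not None:
            # Cache hit: covers both successes and remembered failures.
            return json.loads(row[0]) if row[0] is not None else None
        result = fetch(query)  # cache miss: exactly one real lookup
        self.conn.execute(
            "INSERT INTO geocode_cache (query, result) VALUES (?, ?)",
            (query, json.dumps(result) if result is not None else None),
        )
        self.conn.commit()
        return result


calls = []


def fake_fetch(query: str) -> Optional[dict]:
    """Stand-in for the API call; the coordinates are placeholder values."""
    calls.append(query)
    if "Amsterdam" in query:
        return {"latitude": 52.37, "longitude": 4.89}
    return None


cache = GeocodeCache()
cache.lookup("1012 JS, Amsterdam, Netherlands", fake_fetch)
cache.lookup("1012 JS, Amsterdam, Netherlands", fake_fetch)  # served from cache
cache.lookup("Nowhere, Atlantis", fake_fetch)
cache.lookup("Nowhere, Atlantis", fake_fetch)  # the failure is cached too
print(len(calls))
```

Passing a file path instead of `:memory:` makes the cache persist across runs, which is what gives the resume behaviour.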

Cache Statistics (after 60 institutions):

  • Total cached queries: 5
  • Successful: 5
  • Failed: 0

Expected Results

Output Files

Primary Output:

  • File: data/instances/global/global_heritage_institutions.yaml
  • Change: Each institution's locations array will gain:
    • latitude: <float> - WGS84 latitude
    • longitude: <float> - WGS84 longitude
    • geonames_id: <int> - GeoNames ID (when available)

Example:

locations:
  - location_type: Physical Address
    street_address: 4-1-2 SHINKOTONI 7-JO, SAPPORO SHI KITA KU, HOKKAIDO
    city: SAPPORO SHI KITA KU
    postal_code: 001-0907
    region: HOKKAIDO
    country: JP
    is_primary: true
    latitude: 43.0609096      # ← NEW
    longitude: 141.3460696    # ← NEW
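
A minimal sketch of how one location entry could be updated, assuming each geocoding result is a dict with `latitude`, `longitude`, and an optional `geonames_id`. The function name and the result shape are illustrative assumptions; only the three output field names come from the list above.

```python
from typing import Optional


def apply_coordinates(location: dict, result: Optional[dict],
                      force: bool = False) -> bool:
    """Write coordinates into a location dict; return True if it was changed."""
    if result is None:
        return False  # failed lookup: leave the entry untouched
    already_geocoded = "latitude" in location and "longitude" in location
    if already_geocoded and not force:
        return False  # keep existing coordinates unless --force is given
    location["latitude"] = float(result["latitude"])
    location["longitude"] = float(result["longitude"])
    if result.get("geonames_id") is not None:
        location["geonames_id"] = int(result["geonames_id"])
    return True


location = {"city": "SAPPORO SHI KITA KU", "region": "HOKKAIDO", "country": "JP"}
apply_coordinates(location, {"latitude": 43.0609096, "longitude": 141.3460696})
print(location["latitude"], location["longitude"])
```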

Log File:

  • File: data/logs/geocoding_full_run.log
  • Contains:
    • Progress updates every 10 institutions
    • Error messages for failed geocoding
    • Final statistics summary
    • Execution time and API call rate

Cache Database:

  • File: data/cache/geocoding_cache.db
  • Contains: ~13,000 geocoding query results
  • Reusable: Can be used for future runs or other scripts

Expected Statistics

Estimated Results (after completion):

  • Total institutions: 13,396
  • Successfully geocoded: ~12,800-13,000 (95-97%)
  • Failed geocoding: ~400-600 (3-5%)
    • Reasons: Ambiguous locations, unknown places, API errors
  • Cache hits: ~500-1,000 (duplicate city names)
  • API calls: ~12,500-12,800
  • Execution time: ~90 minutes
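
One way to sanity-check these estimates against each other: the 1 request/second limit puts a floor on runtime, since every uncached query costs at least one second. A small sketch of that arithmetic (the helper names are illustrative):

```python
def min_runtime_minutes(api_calls: int, seconds_per_request: float = 1.0) -> float:
    """Lower bound on runtime when every API call waits for the rate limit."""
    return api_calls * seconds_per_request / 60


def max_api_calls(runtime_minutes: float, seconds_per_request: float = 1.0) -> int:
    """Upper bound on API calls that fit into a given runtime."""
    return int(runtime_minutes * 60 / seconds_per_request)


print(f"{min_runtime_minutes(12_500):.0f} minutes for 12,500 API calls")
print(f"{max_api_calls(90)} API calls fit into 90 minutes")
```

Under this assumption a 90-minute run leaves room for roughly 5,400 API calls, so the runtime and API-call estimates cannot both hold. The observed rate of ~2.4 institutions/second suggests the cache-hit share may turn out much higher than estimated here.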

Coverage by Country (estimated):

  • Japan (JP): ~95% success (rural areas may fail)
  • Netherlands (NL): ~98% success (postal codes help)
  • Mexico (MX): ~92% success
  • Brazil (BR): ~93% success
  • Chile (CL): ~96% success (already partially geocoded)

Monitoring Commands

Check Progress

# Quick status check
./scripts/check_geocoding_progress.sh

# Check if process is still running
ps aux | grep geocode_global_institutions.py | grep -v grep

# View last 20 lines of log
tail -20 data/logs/geocoding_full_run.log

# Watch live progress (Ctrl+C to exit)
tail -f data/logs/geocoding_full_run.log

Troubleshooting

If process appears stuck:

# Check process status
ps aux | grep geocode_global_institutions.py

# View recent progress
tail -50 data/logs/geocoding_full_run.log

# Check for errors
grep -i "error" data/logs/geocoding_full_run.log

If process dies unexpectedly:

# Restart - it will resume from cache
python scripts/geocode_global_institutions.py

When Geocoding Completes

Validation Steps

  1. Check final statistics in log:

    tail -50 data/logs/geocoding_full_run.log
    
  2. Validate dataset:

    python scripts/validate_geocoding_results.py  # (to be created)
    
  3. Generate visualizations:

    • World map with institution markers
    • Density heatmaps
    • Coverage analysis by country
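
The validation script for step 2 does not exist yet. A minimal sketch of what it might check, assuming institutions are loaded as a list of dicts with a `locations` array as in the YAML example; the function names are hypothetical and loading the YAML file is left out:

```python
from collections import defaultdict


def _is_number(value) -> bool:
    """True for int/float, but not for bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def has_valid_coordinates(location: dict) -> bool:
    """Check that latitude/longitude exist and fall inside WGS84 bounds."""
    lat, lon = location.get("latitude"), location.get("longitude")
    if not (_is_number(lat) and _is_number(lon)):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def coverage_by_country(institutions: list) -> dict:
    """Return {country: (geocoded_locations, total_locations)}."""
    totals = defaultdict(int)
    geocoded = defaultdict(int)
    for inst in institutions:
        for loc in inst.get("locations", []):
            country = loc.get("country", "??")
            totals[country] += 1
            if has_valid_coordinates(loc):
                geocoded[country] += 1
    return {c: (geocoded[c], totals[c]) for c in totals}


sample = [
    {"locations": [{"country": "JP", "latitude": 43.0609096, "longitude": 141.3460696}]},
    {"locations": [{"country": "JP"}]},  # not geocoded
    {"locations": [{"country": "NL", "latitude": 123.0, "longitude": 4.9}]},  # out of range
]
print(coverage_by_country(sample))
```

The per-country tuples can feed the coverage analysis in step 3 directly.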

Next Steps

Wikidata Enrichment (Priority 2):

  • Replace 3,426 synthetic Q-numbers with real Wikidata IDs
  • Query Wikidata SPARQL for ISIL codes (property P791)
  • Enrich institution metadata from Wikidata
  • Estimated time: 6-10 minutes
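
A sketch of what the ISIL lookup could look like, assuming institutions are matched in batches through a `VALUES` clause on P791. The helper names are hypothetical and the ISIL codes in the example are placeholders, not real identifiers. The block only builds the query and URL; it does not send a request.

```python
import urllib.parse
from typing import Sequence

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_isil_query(isil_codes: Sequence[str]) -> str:
    """Build a SPARQL query returning the Wikidata item for each ISIL code."""
    for code in isil_codes:
        if '"' in code or "\\" in code:
            raise ValueError(f"unexpected character in ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


def build_request_url(query: str) -> str:
    """URL-encode the query for a GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


# Placeholder codes for illustration only.
query = build_isil_query(["NL-EXAMPLE1", "NL-EXAMPLE2"])
print(query)
```

Batching many codes per query keeps the number of requests low, which fits the short runtime estimate above.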

Collection Metadata Extraction (Priority 3):

  • Web scraping with crawl4ai
  • Extract from 10,932 institutional websites (81.6%)
  • Populate LinkML Collection class
  • Estimated time: Several days

Files Created This Session

Scripts

  1. scripts/geocode_global_institutions.py - Main geocoding script (398 lines)
  2. scripts/check_geocoding_progress.sh - Progress monitor

Data Files

  1. 🔄 data/instances/global/global_heritage_institutions.yaml - Being updated with coordinates
  2. data/cache/geocoding_cache.db - SQLite cache (growing)
  3. data/logs/geocoding_full_run.log - Execution log

Directories Created

  1. data/cache/ - For geocoding cache
  2. data/logs/ - For execution logs

Technical Notes

Nominatim API Usage Policy Compliance

  • Rate Limiting: 1 request/second (enforced by script)
  • User-Agent: GLAM-Heritage-Data-Extraction/1.0
  • Caching: All results cached to minimize API load
  • Timeout: 10 seconds per request
  • Error Handling: Graceful failures, no retry storms

Reference: https://operations.osmfoundation.org/policies/nominatim/
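
A minimal sketch of how these settings could be enforced, assuming a simple minimum-interval limiter and the public Nominatim search endpoint. The class and function names are hypothetical; the User-Agent string and timeout are the values listed above. The block builds a request but does not send it.

```python
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"
HEADERS = {"User-Agent": "GLAM-Heritage-Data-Extraction/1.0"}
TIMEOUT_SECONDS = 10  # would be passed to urlopen(..., timeout=TIMEOUT_SECONDS)


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep until min_interval has passed; return the delay applied."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def build_request(query: str) -> urllib.request.Request:
    """Build (but do not send) a search request with the identifying header."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    return urllib.request.Request(f"{NOMINATIM_SEARCH_URL}?{params}", headers=HEADERS)


limiter = RateLimiter(min_interval=1.0)
limiter.wait()  # first call returns immediately
request = build_request("1012 JS, Amsterdam, Netherlands")
print(request.full_url)
```

Calling `limiter.wait()` only before real API requests, and not on cache hits, lets cached lookups run at full speed.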

Performance Optimization

Cache Efficiency:

  • Same city queried multiple times → cache hit (instant)
  • Example: 102 Toyohashi libraries → 1 API call, 101 cache hits
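
The Toyohashi example is simple counting: each distinct query costs one API call and every repeat is a cache hit. A short sketch, assuming all 102 libraries produce the same query string (the string shown is an assumed format):

```python
from collections import Counter
from typing import Iterable, Tuple


def cache_breakdown(queries: Iterable[str]) -> Tuple[int, int]:
    """Return (api_calls, cache_hits) for a sequence of query strings."""
    counts = Counter(queries)
    api_calls = len(counts)  # one real lookup per distinct query
    cache_hits = sum(counts.values()) - api_calls
    return api_calls, cache_hits


queries = ["TOYOHASHI, AICHI, Japan"] * 102
print(cache_breakdown(queries))  # (1, 101)
```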

Query Optimization:

  • Country-specific formatting improves match accuracy
  • Postal codes (NL) → higher precision
  • Prefecture names (JP) → better disambiguation

Database Performance:

  • SQLite cache uses indexed primary key (query text)
  • Average lookup time: <1ms
  • No lock contention (single writer process)

Session Handoff

Current State:

  • Geocoding script created and tested
  • 🔄 Full geocoding run in progress (ETA: 90 minutes)
  • Monitoring tools available
  • Cache infrastructure working

When You Return:

  1. Run ./scripts/check_geocoding_progress.sh to check status
  2. If complete, check final statistics in log
  3. Run validation script (to be created)
  4. Proceed with Priority 2: Wikidata enrichment

If Something Goes Wrong:

  • Process can be safely restarted (cache preserves progress)
  • Log file contains detailed error information
  • Cache can be inspected directly: sqlite3 data/cache/geocoding_cache.db

Last Updated: 2025-11-07 13:26
Next Check: 2025-11-07 14:50 (estimated completion)
Status: 🟢 Running smoothly