glam/GEOCODING_SESSION_2025-11-07_RESUMED.md
2025-11-19 23:25:22 +01:00


Geocoding Session - November 7, 2025 (Resumed)

Session Start

  • Time: 13:37 PST
  • Status: Resumed from previous session that crashed at 310/13,396 institutions

Critical Bugs Fixed

Bug 1: TypeError in extratags Handling

Problem: Script crashed when location_result['extratags'] was None

```python
# BEFORE (line 281):
if 'extratags' in location_result and 'geonames:id' in location_result['extratags']:
    # TypeError if extratags is None
    ...

# AFTER:
extratags = location_result.get('extratags')
if extratags and isinstance(extratags, dict) and 'geonames:id' in extratags:
    result['geonames_id'] = int(extratags['geonames:id'])
```
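
The fixed lookup can be exercised as a small standalone function (the function name and sample values here are illustrative, not from the script):

```python
def extract_geonames_id(location_result):
    """Safely read a geonames:id from a Nominatim-style result dict."""
    extratags = location_result.get('extratags')
    if extratags and isinstance(extratags, dict) and 'geonames:id' in extratags:
        return int(extratags['geonames:id'])
    return None

# The None case that previously raised TypeError now falls through cleanly:
print(extract_geonames_id({'extratags': None}))                         # None
print(extract_geonames_id({'extratags': {'geonames:id': '2950159'}}))   # 2950159
```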

Bug 2: --limit Flag Caused Data Loss

Problem: Using --limit 20 would:

  1. Slice institutions list to 20 items
  2. Process those 20
  3. Save only those 20 back to file, deleting 13,376 institutions

Impact: Lost entire dataset during testing (had to regenerate with merge_global_datasets.py)

Solution:

  • Keep full institutions list in memory
  • Create separate institutions_to_process for limited processing
  • Always save full institutions list to file

```python
# BEFORE:
if args.limit:
    institutions = institutions[:args.limit]  # TRUNCATES LIST!
    # ...saves truncated list...

# AFTER:
institutions_to_process = institutions
if args.limit:
    institutions_to_process = institutions[:args.limit]  # Process subset
    # ...saves FULL institutions list...
```
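
The safe pattern can be verified in isolation; this sketch (illustrative names, no real geocoding) checks that limiting processing never shrinks the list that gets written back out:

```python
def run(institutions, limit=None):
    # Keep the full list; only iterate over a subset when limited.
    institutions_to_process = institutions[:limit] if limit else institutions
    for inst in institutions_to_process:
        inst['geocoded'] = True  # stand-in for the real geocoding step
    return institutions  # always the FULL list goes back to the file

data = [{'name': f'inst-{i}'} for i in range(5)]
saved = run(data, limit=2)
print(len(saved))                           # 5 — nothing was truncated
print(sum('geocoded' in i for i in saved))  # 2 — only the subset was processed
```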

Bug 3: No Incremental Saves

Problem: The script saved only at the very end; a crash at institution 310 meant losing all progress

Solution: Save every 100 institutions:

```python
if not args.dry_run and i % 100 == 0 and updated_count > 0:
    print(f"\n💾 Saving progress at {i}/{len(institutions_to_process)} institutions...")
    with open(global_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, ...)
```

Current Status

Geocoding Run

  • Process ID: 70163
  • Started: 13:38 PST
  • Log File: data/logs/geocoding_full_run_fixed.log
  • Dataset: 13,396 institutions total
  • Already Geocoded: 187 (Latin America from previous work)
  • To Geocode: 13,209 (98.6%)

Performance

  • Cache: 112 queries cached (96 successful)
  • Rate: ~4.5 institutions/second (faster due to cache hits)
  • ETA: ~50 minutes (~14:28 PST completion)

Progress Saves

  • Automatic save every 100 institutions
  • Final save at completion
  • Full dataset always preserved

Files Modified

Scripts Updated

  1. scripts/geocode_global_institutions.py:

    • Fixed extratags TypeError (line 281)
    • Fixed --limit data loss bug (lines 417-436)
    • Added incremental saves every 100 institutions (lines 474-486)
    • Updated progress messages
  2. scripts/check_geocoding_progress.sh:

    • Updated log file path to geocoding_full_run_fixed.log

Data Files

  1. data/instances/global/global_heritage_institutions.yaml:

    • Regenerated after data loss incident
    • Confirmed 13,396 institutions
    • Currently being updated with coordinates
  2. data/cache/geocoding_cache.db:

    • Contains 112 geocoding queries
    • Growing as geocoding proceeds
  3. data/logs/geocoding_full_run_fixed.log:

    • New log file for current run
    • Clean start after fixes

Lessons Learned

Testing with --limit is Dangerous

  • Always test with --dry-run first
  • Verify file integrity after test runs
  • Consider adding --output flag to write to different file during testing
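
A minimal sketch of the suggested --output flag, assuming argparse; the flag does not exist in geocode_global_institutions.py today, and the argument names and default path are illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--limit', type=int, default=None)
parser.add_argument('--dry-run', action='store_true')
parser.add_argument('--output', default=None,
                    help='Write results here instead of overwriting the input file')

# Simulated test invocation: limit processing and redirect the output.
args = parser.parse_args(['--limit', '20', '--output', 'test_output.yaml'])
output_file = args.output or 'data/instances/global/global_heritage_institutions.yaml'
print(output_file)  # test_output.yaml — the source dataset is never touched
```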

Incremental Saves Are Critical

  • Large datasets (13K+ institutions) take 90+ minutes
  • Network/API failures can happen anytime
  • Saving every 100 items balances I/O vs. data safety

Cache is Essential

  • 112 cached queries reduced API calls significantly
  • Rate went from ~2.5/sec to ~4.5/sec
  • Nominatim rate limit (1 req/sec) still applies to new queries
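
The caching pattern can be sketched with SQLite; the real schema inside data/cache/geocoding_cache.db is not shown in this log, so the table and column names below are assumptions:

```python
import sqlite3
import json

# Assumed schema for illustration; an in-memory DB stands in for the cache file.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, result TEXT)')

def geocode_cached(query, geocode_fn):
    row = conn.execute('SELECT result FROM cache WHERE query = ?', (query,)).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no API call, no rate-limit wait
    result = geocode_fn(query)     # cache miss: hits Nominatim (1 req/sec applies)
    conn.execute('INSERT INTO cache (query, result) VALUES (?, ?)',
                 (query, json.dumps(result)))
    return result

calls = []
fake = lambda q: calls.append(q) or {'lat': 52.52, 'lon': 13.405}
geocode_cached('Berlin', fake)  # miss: calls the geocoder
geocode_cached('Berlin', fake)  # hit: served from SQLite
print(len(calls))  # 1
```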

Next Steps (After Geocoding Completes)

  1. Validation (~5 minutes)

    python scripts/validate_geocoding_results.py
    
    • Verify coverage (~95-97% expected)
    • Generate GeoJSON for visualization
    • Create validation report
  2. Wikidata Enrichment (~10 minutes)

    • Replace 3,426 synthetic Japanese Q-numbers
    • Query Wikidata SPARQL for ISIL codes
    • Add real Wikidata identifiers
  3. Collection Metadata (multi-day project)

    • Crawl 10,932 institutional websites
    • Extract collection descriptions
    • Map to LinkML Collection class
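
The GeoJSON step of the validation plan could look like the sketch below. The internals of validate_geocoding_results.py are not shown in this log; the record layout (locations[0] with latitude/longitude) matches the check used in the monitoring command, and all other field names are assumptions:

```python
import json

def to_geojson(institutions):
    """Build a GeoJSON FeatureCollection from geocoded institution records."""
    features = []
    for inst in institutions:
        locs = inst.get('locations') or []
        if locs and locs[0].get('latitude') is not None:
            features.append({
                'type': 'Feature',
                # GeoJSON coordinate order is [longitude, latitude]
                'geometry': {'type': 'Point',
                             'coordinates': [locs[0]['longitude'],
                                             locs[0]['latitude']]},
                'properties': {'name': inst.get('name')},
            })
    return {'type': 'FeatureCollection', 'features': features}

sample = [
    {'name': 'Geocoded Museum',
     'locations': [{'latitude': 48.8606, 'longitude': 2.3376}]},
    {'name': 'Not yet geocoded', 'locations': []},
]
fc = to_geojson(sample)
print(len(fc['features']))  # 1
```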

Monitoring Commands

```bash
# Check progress
./scripts/check_geocoding_progress.sh

# Watch live log
tail -f data/logs/geocoding_full_run_fixed.log

# Check process
ps aux | grep geocode_global_institutions

# Count geocoded institutions (run in Python)
python3 -c "
import yaml
with open('data/instances/global/global_heritage_institutions.yaml', 'r') as f:
    institutions = yaml.safe_load(f)
geocoded = sum(1 for i in institutions if i.get('locations') and i['locations'][0].get('latitude'))
print(f'Geocoded: {geocoded:,} / {len(institutions):,}')
"
```

Session End

  • Status: Geocoding running in background
  • ETA: ~14:28 PST
  • Next Session: Validate results and begin Wikidata enrichment