Geocoding Session - November 7, 2025 (Resumed)
Session Start
- Time: 13:37 PST
- Status: Resumed from previous session that crashed at 310/13,396 institutions
Critical Bugs Fixed
Bug 1: TypeError in extratags Handling
Problem: Script crashed when location_result['extratags'] was None
# BEFORE (line 281):
if 'extratags' in location_result and 'geonames:id' in location_result['extratags']:
    # TypeError if extratags is None

# AFTER:
extratags = location_result.get('extratags')
if extratags and isinstance(extratags, dict) and 'geonames:id' in extratags:
    result['geonames_id'] = int(extratags['geonames:id'])
Bug 2: --limit Flag Caused Data Loss
Problem: Using --limit 20 would:
- Slice institutions list to 20 items
- Process those 20
- Save only those 20 back to file, deleting 13,376 institutions
Impact: Lost entire dataset during testing (had to regenerate with merge_global_datasets.py)
Solution:
- Keep the full institutions list in memory
- Create a separate institutions_to_process list for limited processing
- Always save the full institutions list to file
# BEFORE:
if args.limit:
    institutions = institutions[:args.limit]  # TRUNCATES LIST!
# ...saves truncated list...

# AFTER:
institutions_to_process = institutions
if args.limit:
    institutions_to_process = institutions[:args.limit]  # Process subset
# ...saves FULL institutions list...
Bug 3: No Incremental Saves
Problem: The script only saved at the very end, so the crash at institution 310 lost all progress
Solution: Save every 100 institutions:
if not args.dry_run and i % 100 == 0 and updated_count > 0:
    print(f"\n💾 Saving progress at {i}/{len(institutions_to_process)} institutions...")
    with open(global_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, ...)
Current Status
Geocoding Run
- Process ID: 70163
- Started: 13:38 PST
- Log File: data/logs/geocoding_full_run_fixed.log
- Dataset: 13,396 institutions total
- Already Geocoded: 187 (Latin America from previous work)
- To Geocode: 13,209 (98.6%)
Performance
- Cache: 112 queries cached (96 successful)
- Rate: ~4.5 institutions/second (faster due to cache hits)
- ETA: ~50 minutes (completion around 14:28); 13,209 remaining institutions at ~4.5/second is roughly 2,935 seconds, about 49 minutes
Progress Saves
- Automatic save every 100 institutions
- Final save at completion
- Full dataset always preserved
Files Modified
Scripts Updated
- scripts/geocode_global_institutions.py:
  - Fixed extratags TypeError (line 281)
  - Fixed --limit data loss bug (lines 417-436)
  - Added incremental saves every 100 institutions (lines 474-486)
  - Updated progress messages
- scripts/check_geocoding_progress.sh:
  - Updated log file path to geocoding_full_run_fixed.log
Data Files
- data/instances/global/global_heritage_institutions.yaml:
  - Regenerated after data loss incident
  - Confirmed 13,396 institutions
  - Currently being updated with coordinates
- data/cache/geocoding_cache.db:
  - Contains 112 geocoding queries
  - Growing as geocoding proceeds
- data/logs/geocoding_full_run_fixed.log:
  - New log file for current run
  - Clean start after fixes
Lessons Learned
Testing with --limit is Dangerous
- Always test with --dry-run first
- Verify file integrity after test runs
- Consider adding an --output flag to write to a different file during testing (a minimal sketch follows)
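As a hedged illustration of that last point, here is a minimal sketch of what such an --output flag could look like; the flag name, default, and surrounding variable names are assumptions, not taken from the current script.
# Hypothetical sketch: an --output flag so test runs write to a separate file
# and never overwrite the source dataset. All names here are illustrative.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument('--limit', type=int, default=None)
parser.add_argument('--output', default=None,
                    help='Write results to this path instead of the input file')
args = parser.parse_args()

institutions = [{'name': 'Example Museum'}]  # placeholder for the loaded dataset
input_file = 'data/instances/global/global_heritage_institutions.yaml'
output_file = args.output or input_file      # default behaviour: overwrite input

with open(output_file, 'w', encoding='utf-8') as f:
    yaml.dump(institutions, f, allow_unicode=True, sort_keys=False)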
Incremental Saves Are Critical
- Large datasets (13K+ institutions) take 90+ minutes
- Network/API failures can happen anytime
- Saving every 100 items balances I/O vs. data safety
Cache is Essential
- 112 cached queries reduced API calls significantly
- Rate went from ~2.5/sec to ~4.5/sec
- Nominatim rate limit (1 req/sec) still applies to new queries
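To illustrate the cache-first flow described above, here is a minimal, self-contained sketch of consulting a SQLite cache like data/cache/geocoding_cache.db before calling Nominatim; the geocode_cache table and its columns are assumptions, since the script's actual schema is not shown here.
import sqlite3

# Hedged sketch of a cache-first lookup; the table and columns are illustrative.
conn = sqlite3.connect(':memory:')  # point at data/cache/geocoding_cache.db in practice
conn.execute('CREATE TABLE IF NOT EXISTS geocode_cache '
             '(query TEXT PRIMARY KEY, latitude REAL, longitude REAL)')
conn.execute('INSERT INTO geocode_cache VALUES (?, ?, ?)',
             ('British Museum, London', 51.5194, -0.1270))

def cached_lookup(query):
    """Return (latitude, longitude) from the cache, or None on a cache miss."""
    row = conn.execute('SELECT latitude, longitude FROM geocode_cache WHERE query = ?',
                       (query,)).fetchone()
    return row  # None means: fall through to Nominatim (respect the 1 req/sec limit)

print(cached_lookup('British Museum, London'))    # cache hit -> coordinates
print(cached_lookup('Unknown Archive, Nowhere'))  # cache miss -> None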
Next Steps (After Geocoding Completes)
- Validation (~5 minutes)
  - Run: python scripts/validate_geocoding_results.py
  - Verify coverage (~95-97% expected)
  - Generate GeoJSON for visualization (see the GeoJSON sketch after this list)
  - Create validation report
- Wikidata Enrichment (~10 minutes)
  - Replace 3,426 synthetic Japanese Q-numbers
  - Query Wikidata SPARQL for ISIL codes (see the SPARQL sketch after this list)
  - Add real Wikidata identifiers
- Collection Metadata (multi-day project)
  - Crawl 10,932 institutional websites
  - Extract collection descriptions
  - Map to LinkML Collection class
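For the validation step's GeoJSON output, a hedged sketch of the conversion: the locations/latitude field names follow the counting snippet in the monitoring section below, while the longitude field and the output filename are assumptions, and validate_geocoding_results.py may do this differently.
# Hedged sketch: convert geocoded institutions into a GeoJSON FeatureCollection.
import json
import yaml

with open('data/instances/global/global_heritage_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

features = []
for inst in institutions:
    locs = inst.get('locations') or []
    if locs and locs[0].get('latitude') is not None and locs[0].get('longitude') is not None:
        features.append({
            'type': 'Feature',
            'geometry': {'type': 'Point',
                         'coordinates': [locs[0]['longitude'], locs[0]['latitude']]},
            'properties': {'name': inst.get('name')},
        })

with open('institutions.geojson', 'w', encoding='utf-8') as f:  # assumed output path
    json.dump({'type': 'FeatureCollection', 'features': features}, f, ensure_ascii=False)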
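And for the Wikidata enrichment step, a hedged sketch of an ISIL-to-Q-number lookup against the public SPARQL endpoint; P791 is Wikidata's ISIL property, but the exact query and matching logic the project will use are not specified here.
# Hedged sketch: resolve an ISIL code to a Wikidata Q-number via SPARQL.
# Uses property P791 (ISIL); requires the 'requests' package.
import requests

SPARQL_ENDPOINT = 'https://query.wikidata.org/sparql'

def qid_for_isil(isil):
    """Return the first Q-number whose P791 (ISIL) equals the given code, or None."""
    query = f'SELECT ?item WHERE {{ ?item wdt:P791 "{isil}" }} LIMIT 1'
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'heritage-geocoding-notes/0.1 (example)'},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()['results']['bindings']
    return bindings[0]['item']['value'].rsplit('/', 1)[-1] if bindings else None

# Usage:
# qid_for_isil('DE-1')  # returns a Q-number string on a match, else None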
Monitoring Commands
# Check progress
./scripts/check_geocoding_progress.sh
# Watch live log
tail -f data/logs/geocoding_full_run_fixed.log
# Check process
ps aux | grep geocode_global_institutions
# Count geocoded institutions (run in Python)
python3 -c "
import yaml
with open('data/instances/global/global_heritage_institutions.yaml', 'r') as f:
institutions = yaml.safe_load(f)
geocoded = sum(1 for i in institutions if i.get('locations') and i['locations'][0].get('latitude'))
print(f'Geocoded: {geocoded:,} / {len(institutions):,}')
"
Session End
- Status: Geocoding running in background
- ETA: ~14:28
- Next Session: Validate results and begin Wikidata enrichment