# Geocoding Session - November 7, 2025 (Resumed)
## Session Start
- **Time**: 13:37 PST
- **Status**: Resumed from previous session that crashed at 310/13,396 institutions
## Critical Bugs Fixed
### Bug 1: TypeError in extratags Handling
**Problem**: Script crashed when `location_result['extratags']` was `None`
```python
# BEFORE (line 281): raises TypeError when extratags is None
if 'extratags' in location_result and 'geonames:id' in location_result['extratags']:
    ...

# AFTER:
extratags = location_result.get('extratags')
if extratags and isinstance(extratags, dict) and 'geonames:id' in extratags:
    result['geonames_id'] = int(extratags['geonames:id'])
```
### Bug 2: --limit Flag Caused Data Loss
**Problem**: Using `--limit 20` would:
1. Slice institutions list to 20 items
2. Process those 20
3. **Save only those 20 back to file, deleting 13,376 institutions**
**Impact**: Lost entire dataset during testing (had to regenerate with `merge_global_datasets.py`)
**Solution**:
- Keep full `institutions` list in memory
- Create separate `institutions_to_process` for limited processing
- Always save full `institutions` list to file
```python
# BEFORE:
if args.limit:
    institutions = institutions[:args.limit]  # TRUNCATES LIST!
# ...saves truncated list...

# AFTER:
institutions_to_process = institutions
if args.limit:
    institutions_to_process = institutions[:args.limit]  # Process subset
# ...saves FULL institutions list...
```
### Bug 3: No Incremental Saves
**Problem**: The script saved only at the very end, so the crash at institution 310 lost all progress
**Solution**: Save every 100 institutions:
```python
if not args.dry_run and i % 100 == 0 and updated_count > 0:
    print(f"\n💾 Saving progress at {i}/{len(institutions_to_process)} institutions...")
    with open(global_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, ...)
```
## Current Status
### Geocoding Run
- **Process ID**: 70163
- **Started**: 13:38 PST
- **Log File**: `data/logs/geocoding_full_run_fixed.log`
- **Dataset**: 13,396 institutions total
- **Already Geocoded**: 187 (Latin America from previous work)
- **To Geocode**: 13,209 (98.6%)
### Performance
- **Cache**: 112 queries cached (96 successful)
- **Rate**: ~4.5 institutions/second (faster due to cache hits)
- **ETA**: ~50 minutes (~14:28 completion)
### Progress Saves
- Automatic save every 100 institutions
- Final save at completion
- Full dataset always preserved
## Files Modified
### Scripts Updated
1. `scripts/geocode_global_institutions.py`:
- Fixed extratags TypeError (line 281)
- Fixed --limit data loss bug (lines 417-436)
- Added incremental saves every 100 institutions (lines 474-486)
- Updated progress messages
2. `scripts/check_geocoding_progress.sh`:
- Updated log file path to `geocoding_full_run_fixed.log`
### Data Files
1. `data/instances/global/global_heritage_institutions.yaml`:
- Regenerated after data loss incident
- Confirmed 13,396 institutions
- Currently being updated with coordinates
2. `data/cache/geocoding_cache.db`:
- Contains 112 geocoding queries
- Growing as geocoding proceeds
3. `data/logs/geocoding_full_run_fixed.log`:
- New log file for current run
- Clean start after fixes
## Lessons Learned
### Testing with --limit is Dangerous
- Always test with `--dry-run` first
- Verify file integrity after test runs
- Consider adding an `--output` flag to write to a different file during testing
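The suggested `--output` flag could look like the sketch below. The flag and function names are hypothetical; the script's actual argument parsing may be organized differently.

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI with an --output flag so test runs never touch the input file."""
    parser = argparse.ArgumentParser(description="Geocode heritage institutions")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N institutions")
    parser.add_argument("--dry-run", action="store_true",
                        help="report changes without writing any file")
    parser.add_argument("--output", default=None,
                        help="write results here instead of overwriting the input")
    return parser.parse_args(argv)
```

At save time the script would write to `args.output or global_file`, so omitting the flag preserves the current behavior while `--limit 20 --output test_run.yaml` leaves the real dataset untouched.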
### Incremental Saves Are Critical
- Large datasets (13K+ institutions) take 90+ minutes
- Network/API failures can happen anytime
- Saving every 100 items balances I/O vs. data safety
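One further hardening worth considering (not in the current script): each periodic save rewrites the YAML file in place, so a crash during the write itself could still truncate it. Writing to a temporary file and renaming over the target makes the swap atomic. A minimal sketch:

```python
import os
import tempfile

def save_atomically(path, text):
    """Write text to a temp file in the target's directory, then rename it over
    the target. os.replace is atomic, so a crash mid-write leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```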
### Cache is Essential
- 112 cached queries reduced API calls significantly
- Rate went from ~2.5/sec to ~4.5/sec
- Nominatim rate limit (1 req/sec) still applies to new queries
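The cache itself can be as simple as a SQLite table keyed by the query string. A minimal sketch; the real `geocoding_cache.db` schema may differ:

```python
import json
import sqlite3

class GeocodeCache:
    """Query-keyed geocoding cache backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, result TEXT)"
        )

    def get(self, query):
        """Return the cached result dict, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT result FROM cache WHERE query = ?", (query,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, query, result):
        """Store (or overwrite) the result for a query."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (query, result) VALUES (?, ?)",
            (query, json.dumps(result)),
        )
        self.conn.commit()
```

Checking the cache before calling Nominatim means only misses pay the 1 req/sec rate limit, which is where the ~2.5/sec to ~4.5/sec speedup comes from.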
## Next Steps (After Geocoding Completes)
1. **Validation** (~5 minutes)
```bash
python scripts/validate_geocoding_results.py
```
- Verify coverage (~95-97% expected)
- Generate GeoJSON for visualization
- Create validation report
2. **Wikidata Enrichment** (~10 minutes)
- Replace 3,426 synthetic Japanese Q-numbers
- Query Wikidata SPARQL for ISIL codes
- Add real Wikidata identifiers
3. **Collection Metadata** (multi-day project)
- Crawl 10,932 institutional websites
- Extract collection descriptions
- Map to LinkML Collection class
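For the GeoJSON generation in step 1, a minimal sketch, assuming the `locations`/`latitude`/`longitude` field names checked by the monitoring snippet in this document:

```python
def institutions_to_geojson(institutions):
    """Build a GeoJSON FeatureCollection from geocoded institution records."""
    features = []
    for inst in institutions:
        locations = inst.get("locations") or []
        if not locations or locations[0].get("latitude") is None:
            continue  # skip institutions that have not been geocoded yet
        loc = locations[0]
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON coordinate order is [longitude, latitude]
                "coordinates": [loc["longitude"], loc["latitude"]],
            },
            "properties": {"name": inst.get("name")},
        })
    return {"type": "FeatureCollection", "features": features}
```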
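For the Wikidata enrichment in step 2, ISIL codes correspond to Wikidata property P791, so a lookup can be built against the public SPARQL endpoint. The helper names below are illustrative and the ISIL value in the test is made up:

```python
import urllib.parse

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def isil_lookup_query(isil_code):
    """SPARQL that resolves an ISIL code to its Wikidata item via property P791."""
    return f'SELECT ?item WHERE {{ ?item wdt:P791 "{isil_code}" . }} LIMIT 1'

def sparql_request_url(query):
    """GET URL for the public endpoint, asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL}?{params}"
```

Batching codes with `VALUES` and setting a descriptive `User-Agent` header would keep the endpoint's usage policy happy on 3,426 lookups.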
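For the crawling in step 3, checking robots.txt before fetching each of the 10,932 sites keeps the crawler polite. A network-free sketch of the permission check using only the standard library (the bot name is a placeholder):

```python
import urllib.robotparser

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against already-fetched robots.txt rules (no network access)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The crawler would fetch each site's `/robots.txt` once, cache it, and consult this check before every page request.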
## Monitoring Commands
```bash
# Check progress
./scripts/check_geocoding_progress.sh
# Watch live log
tail -f data/logs/geocoding_full_run_fixed.log
# Check process
ps aux | grep geocode_global_institutions
# Count geocoded institutions (run in Python)
python3 -c "
import yaml
with open('data/instances/global/global_heritage_institutions.yaml', 'r') as f:
    institutions = yaml.safe_load(f)
geocoded = sum(1 for i in institutions if i.get('locations') and i['locations'][0].get('latitude'))
print(f'Geocoded: {geocoded:,} / {len(institutions):,}')
"
```
## Session End
- **Status**: Geocoding running in background
- **ETA**: ~14:28
- **Next Session**: Validate results and begin Wikidata enrichment