# Geocoding Session - November 7, 2025 (Resumed)

## Session Start

- **Time**: 13:37 PST
- **Status**: Resumed from previous session that crashed at 310/13,396 institutions

## Critical Bugs Fixed

### Bug 1: TypeError in extratags Handling

**Problem**: The script crashed when `location_result['extratags']` was `None`:

```python
# BEFORE (line 281):
if 'extratags' in location_result and 'geonames:id' in location_result['extratags']:
    # TypeError if extratags is None

# AFTER:
extratags = location_result.get('extratags')
if extratags and isinstance(extratags, dict) and 'geonames:id' in extratags:
    result['geonames_id'] = int(extratags['geonames:id'])
```

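The failure mode can be reproduced in isolation. This is a minimal sketch (the `location_result` dict here is illustrative, not real Nominatim output): `'key' in None` raises `TypeError` because `None` is not iterable, while `.get()` plus an `isinstance` check handles both a missing key and a null value.

```python
# Nominatim can return a null extratags field for some places
location_result = {"lat": "35.68", "extratags": None}

# BEFORE: membership test on None raises TypeError
try:
    "geonames:id" in location_result["extratags"]
except TypeError as e:
    print("BEFORE:", e)  # argument of type 'NoneType' is not iterable

# AFTER: .get() + isinstance guard skips the lookup safely
extratags = location_result.get("extratags")
if extratags and isinstance(extratags, dict) and "geonames:id" in extratags:
    print(int(extratags["geonames:id"]))
else:
    print("AFTER: no geonames:id, no crash")
```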
### Bug 2: --limit Flag Caused Data Loss

**Problem**: Using `--limit 20` would:

1. Slice the institutions list to 20 items
2. Process those 20
3. **Save only those 20 back to file, deleting the other 13,376 institutions**

**Impact**: Lost the entire dataset during testing (had to regenerate with `merge_global_datasets.py`)

**Solution**:

- Keep the full `institutions` list in memory
- Create a separate `institutions_to_process` list for limited processing
- Always save the full `institutions` list to file

```python
# BEFORE:
if args.limit:
    institutions = institutions[:args.limit]  # TRUNCATES LIST!
# ...saves truncated list...

# AFTER:
institutions_to_process = institutions
if args.limit:
    institutions_to_process = institutions[:args.limit]  # Process subset
# ...saves FULL institutions list...
```

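Why the fix works can be shown with a tiny runnable model (list contents are illustrative): slicing creates a new list, so the original survives as long as it is the *full* list that gets written back.

```python
# Model the dataset: 13,396 dummy institution records
institutions = [f"inst-{n}" for n in range(13396)]
limit = 20

# The fixed pattern: slice into a SEPARATE name, never rebind `institutions`
institutions_to_process = institutions
if limit:
    institutions_to_process = institutions[:limit]  # new list; original untouched

print(len(institutions_to_process))  # 20
print(len(institutions))             # 13396  <- this is what gets saved
```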
### Bug 3: No Incremental Saves

**Problem**: The script only saved at the very end, so a crash at institution 310 meant losing all progress.

**Solution**: Save every 100 institutions:

```python
if not args.dry_run and i % 100 == 0 and updated_count > 0:
    print(f"\n💾 Saving progress at {i}/{len(institutions_to_process)} institutions...")
    with open(global_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, ...)
```

## Current Status

### Geocoding Run

- **Process ID**: 70163
- **Started**: 13:38 PST
- **Log File**: `data/logs/geocoding_full_run_fixed.log`
- **Dataset**: 13,396 institutions total
- **Already Geocoded**: 187 (Latin America from previous work)
- **To Geocode**: 13,209 (98.6%)

### Performance

- **Cache**: 112 queries cached (96 successful)
- **Rate**: ~4.5 institutions/second (faster due to cache hits)
- **ETA**: ~50 minutes (completion around 14:28 PST)

### Progress Saves

- Automatic save every 100 institutions
- Final save at completion
- Full dataset always preserved

## Files Modified

### Scripts Updated

1. `scripts/geocode_global_institutions.py`:
   - Fixed extratags TypeError (line 281)
   - Fixed --limit data loss bug (lines 417-436)
   - Added incremental saves every 100 institutions (lines 474-486)
   - Updated progress messages

2. `scripts/check_geocoding_progress.sh`:
   - Updated log file path to `geocoding_full_run_fixed.log`

### Data Files

1. `data/instances/global/global_heritage_institutions.yaml`:
   - Regenerated after the data loss incident
   - Confirmed 13,396 institutions
   - Currently being updated with coordinates

2. `data/cache/geocoding_cache.db`:
   - Contains 112 geocoding queries
   - Growing as geocoding proceeds

3. `data/logs/geocoding_full_run_fixed.log`:
   - New log file for the current run
   - Clean start after the fixes

## Lessons Learned

### Testing with --limit is Dangerous

- Always test with `--dry-run` first
- Verify file integrity after test runs
- Consider adding an `--output` flag to write to a different file during testing

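The proposed `--output` flag is not yet implemented; a minimal sketch of how it might look (flag names other than `--limit` and `--dry-run` are assumptions) shows why it removes the risk entirely: the input file is only ever read, never rewritten, during test runs.

```python
# Sketch of a hypothetical --output flag for geocode_global_institutions.py.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Geocode institutions")
    parser.add_argument("--limit", type=int, default=None,
                        help="Process only the first N institutions")
    parser.add_argument("--dry-run", action="store_true",
                        help="Run without writing any file")
    parser.add_argument("--output", default=None,
                        help="Write results here instead of overwriting the input file")
    return parser.parse_args(argv)

args = parse_args(["--limit", "20", "--output", "test_run.yaml"])
# Save to args.output if given, else fall back to the input file:
save_path = args.output or "global_heritage_institutions.yaml"
print(save_path)  # test_run.yaml
```

During a `--limit` test, even a buggy save would only clobber `test_run.yaml`, never the 13,396-record source file.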
### Incremental Saves Are Critical

- Large datasets (13K+ institutions) take 90+ minutes
- Network/API failures can happen at any time
- Saving every 100 items balances I/O cost against data safety

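One hardening worth noting (an assumption, not something the script currently does: the snippet above writes the YAML file in place): write each periodic save to a temporary file and swap it in atomically, so a crash *mid-write* can never leave a truncated data file. Sketch below uses `json` to stay dependency-free; the real script uses `yaml.dump`.

```python
import json
import os
import tempfile

def save_atomically(records, path):
    # Write to a temp file in the same directory (os.replace must not
    # cross filesystems), then atomically swap it over the target.
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False)
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temp file
        raise

save_atomically([{"name": "Test Library"}], "progress.json")
with open("progress.json", encoding="utf-8") as f:
    print(json.load(f)[0]["name"])  # Test Library
```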
### Cache is Essential

- 112 cached queries reduced API calls significantly
- Rate went from ~2.5/sec to ~4.5/sec
- The Nominatim rate limit (1 req/sec) still applies to new queries

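The cache pattern is simple: check SQLite before hitting the API, and only rate-limited *new* queries cost a network round trip. A minimal sketch follows; the table and column names are assumptions for illustration, since the actual schema of `geocoding_cache.db` is not shown in this log.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the script would open the .db file on disk
conn.execute("CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, result TEXT)")

def geocode_cached(query, geocode_fn):
    # Return a cached result if present; otherwise call the (rate-limited)
    # geocoder once and store the answer for future runs.
    row = conn.execute("SELECT result FROM cache WHERE query = ?", (query,)).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no API call, no 1 req/sec wait
    result = geocode_fn(query)
    conn.execute("INSERT OR REPLACE INTO cache (query, result) VALUES (?, ?)",
                 (query, json.dumps(result)))
    conn.commit()
    return result

fake_api = lambda q: {"lat": -34.6, "lon": -58.4}  # stand-in for Nominatim
geocode_cached("Biblioteca Nacional, Buenos Aires", fake_api)  # API call
hit = geocode_cached("Biblioteca Nacional, Buenos Aires", fake_api)  # cache hit
print(hit["lat"])  # -34.6
```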
## Next Steps (After Geocoding Completes)

1. **Validation** (~5 minutes)

   ```bash
   python scripts/validate_geocoding_results.py
   ```

   - Verify coverage (~95-97% expected)
   - Generate GeoJSON for visualization
   - Create validation report

2. **Wikidata Enrichment** (~10 minutes)

   - Replace 3,426 synthetic Japanese Q-numbers
   - Query Wikidata SPARQL for ISIL codes
   - Add real Wikidata identifiers

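A hedged sketch of the planned ISIL-to-Q-number lookup against the public Wikidata SPARQL endpoint. P791 is Wikidata's ISIL property; the query shape and function names are assumptions, not taken from the enrichment script.

```python
import json
import urllib.parse
import urllib.request

def build_isil_query_url(isil_code):
    # SPARQL: find the entity whose ISIL (P791) equals the given code.
    query = f'SELECT ?item WHERE {{ ?item wdt:P791 "{isil_code}" }} LIMIT 1'
    return ("https://query.wikidata.org/sparql?format=json&query="
            + urllib.parse.quote(query))

def qid_for_isil(isil_code):
    req = urllib.request.Request(build_isil_query_url(isil_code),
                                 headers={"User-Agent": "geocoding-session/0.1"})
    with urllib.request.urlopen(req) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    if not bindings:
        return None  # keep the synthetic Q-number until a real one is found
    # Entity URIs look like http://www.wikidata.org/entity/Q123456
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]

url = build_isil_query_url("JP-1000001")  # "JP-..." style code is illustrative
print("P791" in urllib.parse.unquote(url))  # True
```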
3. **Collection Metadata** (multi-day project)

   - Crawl 10,932 institutional websites
   - Extract collection descriptions
   - Map to the LinkML Collection class

## Monitoring Commands

```bash
# Check progress
./scripts/check_geocoding_progress.sh

# Watch live log
tail -f data/logs/geocoding_full_run_fixed.log

# Check process
ps aux | grep geocode_global_institutions

# Count geocoded institutions (run in Python)
python3 -c "
import yaml
with open('data/instances/global/global_heritage_institutions.yaml', 'r') as f:
    institutions = yaml.safe_load(f)
geocoded = sum(1 for i in institutions if i.get('locations') and i['locations'][0].get('latitude'))
print(f'Geocoded: {geocoded:,} / {len(institutions):,}')
"
```

## Session End

- **Status**: Geocoding running in background
- **ETA**: ~14:28 PST
- **Next Session**: Validate results and begin Wikidata enrichment