# Geocoding Session Summary - 2025-11-07
## Status: ✅ GEOCODING IN PROGRESS
**Started**: 2025-11-07 13:23
**Expected Completion**: ~14:50 (~90-minute runtime)

---
## What Was Done
### 1. Created Global Geocoding Script ✅
**File**: `scripts/geocode_global_institutions.py`
**Features**:
- Persistent SQLite cache (`data/cache/geocoding_cache.db`)
- Rate limiting (1 request/second for Nominatim API)
- Country-specific query optimization (JP, NL, MX, BR, CL, etc.)
- Progress tracking with ETA calculation
- Resume capability (cached results persist across runs)
- Comprehensive error handling
- Command-line options:
  - `--dry-run`: Test without API calls
  - `--limit N`: Process only first N institutions
  - `--country CODE`: Filter by country
  - `--force`: Re-geocode existing coordinates
  - `--verbose`: Detailed progress logging
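The option surface above could be wired up with `argparse` roughly as follows (a sketch based only on the option list; help strings and defaults are assumptions, not the script's actual values):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI matching the options listed above (help strings are assumptions)."""
    p = argparse.ArgumentParser(
        description="Geocode global heritage institutions via Nominatim"
    )
    p.add_argument("--dry-run", action="store_true",
                   help="test without making API calls")
    p.add_argument("--limit", type=int, metavar="N",
                   help="process only the first N institutions")
    p.add_argument("--country", metavar="CODE",
                   help="filter by country code (e.g. JP, NL)")
    p.add_argument("--force", action="store_true",
                   help="re-geocode institutions that already have coordinates")
    p.add_argument("--verbose", action="store_true",
                   help="detailed progress logging")
    return p
```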
**Usage**:
```bash
# Full run (current)
python scripts/geocode_global_institutions.py
# Test run
python scripts/geocode_global_institutions.py --dry-run --limit 10 --verbose
# Geocode specific country
python scripts/geocode_global_institutions.py --country NL --verbose
```
### 2. Created Progress Monitor Script ✅
**File**: `scripts/check_geocoding_progress.sh`
**Usage**:
```bash
# Check current progress
./scripts/check_geocoding_progress.sh
# Watch live progress
tail -f data/logs/geocoding_full_run.log
```
### 3. Started Full Geocoding Run ✅
**Command**:
```bash
nohup python scripts/geocode_global_institutions.py > data/logs/geocoding_full_run.log 2>&1 &
```
**Process ID**: 62188
**Log File**: `data/logs/geocoding_full_run.log`

---
## Current Progress
**Dataset Size**: 13,396 institutions
**Need Geocoding**: 13,209 institutions (98.6%)
**Already Geocoded**: 187 institutions (1.4% - Latin American data)
**Progress** (as of 13:25):
- Institutions processed: 60 / 13,396 (0.4%)
- Processing rate: ~2.4 institutions/second
- Estimated completion: ~90 minutes
- Cache hits improving performance
**By Country** (institutions needing geocoding):
| Country | Total | Need Geocoding | Priority |
|---------|-------|----------------|----------|
| JP | 12,065 | 12,065 (100%) | High |
| NL | 1,017 | 1,017 (100%) | High |
| MX | 109 | 50 (45.9%) | Medium |
| BR | 97 | 47 (48.5%) | Medium |
| CL | 90 | 12 (13.3%) | Low |
| Others | 18 | 18 (100%) | Low |
---
## How Geocoding Works
### Query Optimization by Country
**Japan (JP)**:
- Cleans city names: removes " SHI", " KU", " CHO", " MURA"
- Query format: `{city_clean}, {prefecture}, Japan`
- Example: `"SAPPORO KITA, HOKKAIDO, Japan"`
**Netherlands (NL)**:
- Prioritizes postal code + city
- Query format: `{postal_code}, {city}, Netherlands`
- Example: `"1012 JS, Amsterdam, Netherlands"`
**Mexico (MX)**:
- Uses city + state
- Query format: `{city}, {state}, Mexico`
- Example: `"Ciudad de México, CDMX, Mexico"`
**Brazil (BR)**:
- Uses city + state abbreviation
- Query format: `{city}, {state}, Brazil`
- Example: `"Rio de Janeiro, RJ, Brazil"`
**Chile (CL)**:
- Uses city + region
- Query format: `{city}, {region}, Chile`
- Example: `"Santiago, Región Metropolitana, Chile"`
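The per-country formatting above can be sketched as a small dispatch function (a simplified illustration, not the script's actual code; field names follow the YAML example later in this document, and the JP suffix list is partial):

```python
# Japanese administrative suffixes stripped before querying (partial list)
JP_SUFFIXES = (" SHI", " KU", " CHO", " MURA")

def clean_jp_city(city: str) -> str:
    """Remove Japanese administrative suffixes from an uppercase city name."""
    for suffix in JP_SUFFIXES:
        city = city.replace(suffix, "")
    return city.strip()

def build_query(loc: dict) -> str:
    """Build a country-optimized Nominatim query string for one location."""
    country = loc["country"]
    if country == "JP":
        return f"{clean_jp_city(loc['city'])}, {loc['region']}, Japan"
    if country == "NL":
        return f"{loc['postal_code']}, {loc['city']}, Netherlands"
    if country == "MX":
        return f"{loc['city']}, {loc['region']}, Mexico"
    if country == "BR":
        return f"{loc['city']}, {loc['region']}, Brazil"
    if country == "CL":
        return f"{loc['city']}, {loc['region']}, Chile"
    # Fallback: city plus country code
    return f"{loc['city']}, {country}"
```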
### Caching Strategy
**Cache Storage**: SQLite database at `data/cache/geocoding_cache.db`
**Cache Benefits**:
1. Avoid repeated API calls for same query
2. Persist across script runs (resume capability)
3. Speed up processing (cache hits = instant results)
4. Cache both successes AND failures (avoid retrying bad queries)
**Cache Statistics** (after 60 institutions):
- Total cached queries: 5
- Successful: 5
- Failed: 0
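A minimal sketch of such a cache (the table and column names here are assumptions; the actual script's schema may differ). Note how failures are stored as NULL coordinates so they are never retried:

```python
import sqlite3

class GeocodeCache:
    """Persistent query cache; stores failures too, so bad queries aren't retried."""

    def __init__(self, path="data/cache/geocoding_cache.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS geocode_cache (
                   query TEXT PRIMARY KEY,   -- indexed lookup key
                   latitude REAL,            -- NULL for failed queries
                   longitude REAL
               )"""
        )

    def get(self, query):
        """Return (lat, lon) on a hit, (None, None) for a cached failure,
        or None if the query has never been attempted."""
        row = self.conn.execute(
            "SELECT latitude, longitude FROM geocode_cache WHERE query = ?",
            (query,),
        ).fetchone()
        return row if row else None

    def put(self, query, lat=None, lon=None):
        """Record a result (or, with no coordinates, a failure)."""
        self.conn.execute(
            "INSERT OR REPLACE INTO geocode_cache VALUES (?, ?, ?)",
            (query, lat, lon),
        )
        self.conn.commit()
```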
---
## Expected Results
### Output Files
**Primary Output**:
- **File**: `data/instances/global/global_heritage_institutions.yaml`
- **Change**: Each institution's `locations` array will gain:
  - `latitude: <float>` - WGS84 latitude
  - `longitude: <float>` - WGS84 longitude
  - `geonames_id: <int>` - GeoNames ID (when available)
**Example**:
```yaml
locations:
  - location_type: Physical Address
    street_address: 4-1-2 SHINKOTONI 7-JO, SAPPORO SHI KITA KU, HOKKAIDO
    city: SAPPORO SHI KITA KU
    postal_code: 001-0907
    region: HOKKAIDO
    country: JP
    is_primary: true
    latitude: 43.0609096   # ← NEW
    longitude: 141.3460696 # ← NEW
```
**Log File**:
- **File**: `data/logs/geocoding_full_run.log`
- **Contains**:
- Progress updates every 10 institutions
- Error messages for failed geocoding
- Final statistics summary
- Execution time and API call rate
**Cache Database**:
- **File**: `data/cache/geocoding_cache.db`
- **Contains**: ~13,000 geocoding query results
- **Reusable**: Can be used for future runs or other scripts
### Expected Statistics
**Estimated Results** (after completion):
- **Total institutions**: 13,396
- **Successfully geocoded**: ~12,800-13,000 (95-97%)
- **Failed geocoding**: ~400-600 (3-4.5%)
  - Reasons: ambiguous locations, unknown places, API errors
- **Cache hits**: ~500-1,000 (duplicate city names)
- **API calls**: ~12,500-12,800
- **Execution time**: ~90 minutes
**Coverage by Country** (estimated):
- Japan (JP): ~95% success (rural areas may fail)
- Netherlands (NL): ~98% success (postal codes help)
- Mexico (MX): ~92% success
- Brazil (BR): ~93% success
- Chile (CL): ~96% success (already partially geocoded)
---
## Monitoring Commands
### Check Progress
```bash
# Quick status check
./scripts/check_geocoding_progress.sh
# Check if process is still running
ps aux | grep geocode_global_institutions.py | grep -v grep
# View last 20 lines of log
tail -20 data/logs/geocoding_full_run.log
# Watch live progress (Ctrl+C to exit)
tail -f data/logs/geocoding_full_run.log
```
### Troubleshooting
**If process appears stuck**:
```bash
# Check process status
ps aux | grep geocode_global_institutions.py
# View recent progress
tail -50 data/logs/geocoding_full_run.log
# Check for errors
grep -i "error" data/logs/geocoding_full_run.log
```
**If process dies unexpectedly**:
```bash
# Restart - it will resume from cache
python scripts/geocode_global_institutions.py
```
---
## When Geocoding Completes
### Validation Steps
1. **Check final statistics** in log:
   ```bash
   tail -50 data/logs/geocoding_full_run.log
   ```
2. **Validate dataset**:
   ```bash
   python scripts/validate_geocoding_results.py  # (to be created)
   ```
3. **Generate visualizations**:
- World map with institution markers
- Density heatmaps
- Coverage analysis by country
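Since the validation script is still to be created, a starting sketch might simply count records whose primary location carries plausible WGS84 coordinates (field names follow the YAML example above; this is an assumption about the eventual script, not its implementation):

```python
def validate_coordinates(institutions):
    """Return (geocoded, missing) counts over a list of institution dicts."""
    geocoded = missing = 0
    for inst in institutions:
        # An institution counts as geocoded if any location has in-range coords
        ok = any(
            loc.get("latitude") is not None
            and loc.get("longitude") is not None
            and -90.0 <= loc["latitude"] <= 90.0
            and -180.0 <= loc["longitude"] <= 180.0
            for loc in inst.get("locations", [])
        )
        geocoded += ok
        missing += not ok
    return geocoded, missing
```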
### Next Steps (Priority 2)
**Wikidata Enrichment**:
- Replace 3,426 synthetic Q-numbers with real Wikidata IDs
- Query Wikidata SPARQL for ISIL codes (property P791)
- Enrich institution metadata from Wikidata
- Estimated time: 6-10 minutes
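The ISIL lookup could batch codes into a single SPARQL `VALUES` clause against property P791 (query built as a plain string here; endpoint interaction is omitted, and the example codes in the test are hypothetical):

```python
def isil_lookup_query(isil_codes):
    """Build a Wikidata SPARQL query resolving ISIL codes (P791) to Q-items."""
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )
```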
**Collection Metadata Extraction** (Priority 3):
- Web scraping with crawl4ai
- Extract from 10,932 institutional websites (81.6%)
- Populate LinkML Collection class
- Estimated time: Several days
---
## Files Created This Session
### Scripts
1. `scripts/geocode_global_institutions.py` - Main geocoding script (398 lines)
2. `scripts/check_geocoding_progress.sh` - Progress monitor
### Data Files
1. 🔄 `data/instances/global/global_heritage_institutions.yaml` - Being updated with coordinates
2. `data/cache/geocoding_cache.db` - SQLite cache (growing)
3. `data/logs/geocoding_full_run.log` - Execution log
### Directories Created
1. `data/cache/` - For geocoding cache
2. `data/logs/` - For execution logs
---
## Technical Notes
### Nominatim API Usage Policy Compliance
**Rate Limiting**: ✅ 1 request/second (enforced by script)
**User-Agent**: ✅ `GLAM-Heritage-Data-Extraction/1.0`
**Caching**: ✅ All results cached to minimize API load
**Timeout**: ✅ 10 seconds per request
**Error Handling**: ✅ Graceful failures, no retry storms
**Reference**: https://operations.osmfoundation.org/policies/nominatim/
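The 1 request/second pacing can be enforced with a small limiter like this (a sketch, not the script's actual implementation; the clock and sleep functions are injectable here only so the behavior is testable):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive Nominatim requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # timestamp of the previous request, if any

    def wait(self):
        """Block until at least min_interval has elapsed since the last call."""
        now = self.clock()
        if self._last is not None:
            elapsed = now - self._last
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last = self.clock()
```

Calling `wait()` before every API request guarantees at least `min_interval` seconds between consecutive calls, regardless of how fast cache hits are processed in between.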
### Performance Optimization
**Cache Efficiency**:
- Same city queried multiple times → cache hit (instant)
- Example: 102 Toyohashi libraries → 1 API call, 101 cache hits
**Query Optimization**:
- Country-specific formatting improves match accuracy
- Postal codes (NL) → higher precision
- Prefecture names (JP) → better disambiguation
**Database Performance**:
- SQLite cache uses indexed primary key (query text)
- Average lookup time: <1ms
- No database locks (single writer process)
---
## Session Handoff
**Current State**:
- Geocoding script created and tested
- 🔄 Full geocoding run in progress (ETA: 90 minutes)
- Monitoring tools available
- Cache infrastructure working
**When You Return**:
1. Run `./scripts/check_geocoding_progress.sh` to check status
2. If complete, check final statistics in log
3. Run validation script (to be created)
4. Proceed with Priority 2: Wikidata enrichment
**If Something Goes Wrong**:
- Process can be safely restarted (cache preserves progress)
- Log file contains detailed error information
- Cache can be inspected directly: `sqlite3 data/cache/geocoding_cache.db`
---
**Last Updated**: 2025-11-07 13:26
**Next Check**: 2025-11-07 14:50 (estimated completion)
**Status**: 🟢 Running smoothly