# Geocoding Session Summary - 2025-11-07

## Status: ✅ GEOCODING IN PROGRESS

**Started**: 2025-11-07 13:23
**Expected Completion**: ~14:50 (90 minutes runtime)

---

## What Was Done

### 1. Created Global Geocoding Script ✅

**File**: `scripts/geocode_global_institutions.py`

**Features**:
- Persistent SQLite cache (`data/cache/geocoding_cache.db`)
- Rate limiting (1 request/second for Nominatim API)
- Country-specific query optimization (JP, NL, MX, BR, CL, etc.)
- Progress tracking with ETA calculation
- Resume capability (cached results persist across runs)
- Comprehensive error handling
- Command-line options:
  - `--dry-run`: Test without API calls
  - `--limit N`: Process only first N institutions
  - `--country CODE`: Filter by country
  - `--force`: Re-geocode existing coordinates
  - `--verbose`: Detailed progress logging
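
The documented options map naturally onto Python's standard `argparse` module. The following is a minimal sketch of how such a parser could be declared. The function name `build_parser`, the defaults, and the description text are assumptions for illustration and are not taken from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the five documented options (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Geocode heritage institutions (illustrative sketch)."
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="Test without API calls")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="Process only first N institutions")
    parser.add_argument("--country", default=None, metavar="CODE",
                        help="Filter by country, e.g. NL")
    parser.add_argument("--force", action="store_true",
                        help="Re-geocode existing coordinates")
    parser.add_argument("--verbose", action="store_true",
                        help="Detailed progress logging")
    return parser


# Parse the same flags as the "test run" invocation in this document.
args = build_parser().parse_args(["--dry-run", "--limit", "10", "--verbose"])
print(args.dry_run, args.limit, args.country)
```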

**Usage**:
```bash
# Full run (current)
python scripts/geocode_global_institutions.py

# Test run
python scripts/geocode_global_institutions.py --dry-run --limit 10 --verbose

# Geocode specific country
python scripts/geocode_global_institutions.py --country NL --verbose
```

### 2. Created Progress Monitor Script ✅

**File**: `scripts/check_geocoding_progress.sh`

**Usage**:
```bash
# Check current progress
./scripts/check_geocoding_progress.sh

# Watch live progress
tail -f data/logs/geocoding_full_run.log
```

### 3. Started Full Geocoding Run ✅

**Command**:
```bash
nohup python scripts/geocode_global_institutions.py > data/logs/geocoding_full_run.log 2>&1 &
```

**Process ID**: 62188
**Log File**: `data/logs/geocoding_full_run.log`

---

## Current Progress

**Dataset Size**: 13,396 institutions
**Need Geocoding**: 13,209 institutions (98.6%)
**Already Geocoded**: 187 institutions (1.4%, Latin American data)

**Progress** (as of 13:25):
- Institutions processed: 60 / 13,396 (0.4%)
- Processing rate: ~2.4 institutions/second
- Estimated completion: ~90 minutes
- Cache hits are improving performance (they skip the API, which is why the rate exceeds 1 request/second)
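
As a rough check on the estimate, the arithmetic can be reproduced from the figures above. This sketch assumes the observed rate of ~2.4 institutions/second stays constant for the whole run, which may not hold if cache hits become less frequent. The helper name `eta_minutes` is hypothetical.

```python
def eta_minutes(total: int, processed: int, rate_per_second: float) -> float:
    """Minutes left if the remaining items keep the observed processing rate."""
    remaining = total - processed
    return remaining / rate_per_second / 60


# Figures reported in this summary as of 13:25.
estimate = eta_minutes(total=13_396, processed=60, rate_per_second=2.4)
print(f"about {estimate:.0f} minutes remaining")  # about 93 minutes remaining
```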

**By Country** (institutions needing geocoding):

| Country | Total | Need Geocoding | Priority |
|---------|-------|----------------|----------|
| JP | 12,065 | 12,065 (100%) | High |
| NL | 1,017 | 1,017 (100%) | High |
| MX | 109 | 50 (45.9%) | Medium |
| BR | 97 | 47 (48.5%) | Medium |
| CL | 90 | 12 (13.3%) | Low |
| Others | 18 | 18 (100%) | Low |

---

## How Geocoding Works

### Query Optimization by Country

**Japan (JP)**:
- Cleans city names: removes " SHI", " KU", " CHO", " MURA"
- Query format: `{city_clean}, {prefecture}, Japan`
- Example: `"SAPPORO KITA, HOKKAIDO, Japan"`

**Netherlands (NL)**:
- Prioritizes postal code + city
- Query format: `{postal_code}, {city}, Netherlands`
- Example: `"1012 JS, Amsterdam, Netherlands"`

**Mexico (MX)**:
- Uses city + state
- Query format: `{city}, {state}, Mexico`
- Example: `"Ciudad de México, CDMX, Mexico"`

**Brazil (BR)**:
- Uses city + state abbreviation
- Query format: `{city}, {state}, Brazil`
- Example: `"Rio de Janeiro, RJ, Brazil"`

**Chile (CL)**:
- Uses city + region
- Query format: `{city}, {region}, Chile`
- Example: `"Santiago, Región Metropolitana, Chile"`
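
The per-country formats above could be sketched as a single query builder. This is an illustration only, not the script's code: the function name `build_query`, the regular expression, and the assumption that prefecture, state, and region all live in a `region` field (as in the YAML example in this document) are hypothetical.

```python
import re

# Suffixes this summary says are stripped from Japanese city names.
JP_SUFFIXES = re.compile(r"\s(?:SHI|KU|CHO|MURA)\b")

COUNTRY_NAMES = {
    "JP": "Japan",
    "NL": "Netherlands",
    "MX": "Mexico",
    "BR": "Brazil",
    "CL": "Chile",
}


def build_query(location: dict) -> str:
    """Format a geocoding query string from a location record."""
    country = location["country"]
    # Fall back to the raw code for countries not listed above.
    name = COUNTRY_NAMES.get(country, country)
    if country == "JP":
        city = JP_SUFFIXES.sub("", location["city"]).strip()
        return f"{city}, {location['region']}, {name}"
    if country == "NL":
        return f"{location['postal_code']}, {location['city']}, {name}"
    # MX, BR, CL and others: city plus state/region.
    return f"{location['city']}, {location['region']}, {name}"


jp_query = build_query(
    {"country": "JP", "city": "SAPPORO SHI KITA KU", "region": "HOKKAIDO"}
)
print(jp_query)  # SAPPORO KITA, HOKKAIDO, Japan
```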

### Caching Strategy

**Cache Storage**: SQLite database at `data/cache/geocoding_cache.db`

**Cache Benefits**:
1. Avoid repeated API calls for the same query
2. Persist across script runs (resume capability)
3. Speed up processing (cache hits = instant results)
4. Cache both successes AND failures (avoid retrying bad queries)
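
A cache with these properties might look like the sketch below. The table name, column names, and class name `GeocodeCache` are assumptions, since the real schema of `geocoding_cache.db` is not shown in this summary. The sketch uses an in-memory database so it can run without touching the real cache file. Storing a row with `success = 0` is what allows a failed query to be skipped on later runs.

```python
import sqlite3
from typing import Optional, Tuple


class GeocodeCache:
    """Query-text-keyed cache (hypothetical schema, for illustration only)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS geocode_cache ("
            " query TEXT PRIMARY KEY,"
            " latitude REAL,"
            " longitude REAL,"
            " success INTEGER NOT NULL)"
        )

    def get(self, query: str) -> Optional[Tuple[bool, Optional[float], Optional[float]]]:
        """Return None on a cache miss, else (success, latitude, longitude)."""
        row = self.conn.execute(
            "SELECT success, latitude, longitude FROM geocode_cache WHERE query = ?",
            (query,),
        ).fetchone()
        if row is None:
            return None
        return bool(row[0]), row[1], row[2]

    def put(self, query: str, lat: Optional[float] = None,
            lon: Optional[float] = None) -> None:
        """Store a result; omitting coordinates records a failed lookup."""
        success = lat is not None and lon is not None
        self.conn.execute(
            "INSERT OR REPLACE INTO geocode_cache VALUES (?, ?, ?, ?)",
            (query, lat, lon, int(success)),
        )
        self.conn.commit()


cache = GeocodeCache()
cache.put("SAPPORO KITA, HOKKAIDO, Japan", 43.0609096, 141.3460696)
cache.put("an unresolvable query")  # failure is cached too
print(cache.get("SAPPORO KITA, HOKKAIDO, Japan"))
print(cache.get("an unresolvable query"))
print(cache.get("never seen"))
```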

**Cache Statistics** (after 60 institutions):
- Total cached queries: 5
- Successful: 5
- Failed: 0

---

## Expected Results

### Output Files

**Primary Output**:
- **File**: `data/instances/global/global_heritage_institutions.yaml`
- **Change**: Each institution's `locations` array will gain:
  - `latitude: <float>` - WGS84 latitude
  - `longitude: <float>` - WGS84 longitude
  - `geonames_id: <int>` - GeoNames ID (when available)

**Example**:
```yaml
locations:
- location_type: Physical Address
  street_address: 4-1-2 SHINKOTONI 7-JO, SAPPORO SHI KITA KU, HOKKAIDO
  city: SAPPORO SHI KITA KU
  postal_code: 001-0907
  region: HOKKAIDO
  country: JP
  is_primary: true
  latitude: 43.0609096 # ← NEW
  longitude: 141.3460696 # ← NEW
```
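
The update shown in the example, combined with the documented `--force` behavior, could be sketched as follows. The function name `apply_coordinates` and its return convention are hypothetical. The actual script may handle existing coordinates differently.

```python
def apply_coordinates(location: dict, lat: float, lon: float,
                      force: bool = False) -> bool:
    """Write lat/lon into a location record.

    Existing coordinates are left alone unless force is True, mirroring
    the documented purpose of --force. Returns True if the record changed.
    """
    already_geocoded = "latitude" in location and "longitude" in location
    if already_geocoded and not force:
        return False
    location["latitude"] = lat
    location["longitude"] = lon
    return True


record = {"city": "SAPPORO SHI KITA KU", "region": "HOKKAIDO", "country": "JP"}
changed = apply_coordinates(record, 43.0609096, 141.3460696)
print(changed, record["latitude"], record["longitude"])
```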

**Log File**:
- **File**: `data/logs/geocoding_full_run.log`
- **Contains**:
  - Progress updates every 10 institutions
  - Error messages for failed geocoding
  - Final statistics summary
  - Execution time and API call rate

**Cache Database**:
- **File**: `data/cache/geocoding_cache.db`
- **Contains**: ~13,000 geocoding query results
- **Reusable**: Can be used for future runs or other scripts

### Expected Statistics

**Estimated Results** (after completion):
- **Total institutions**: 13,396
- **Successfully geocoded**: ~12,800-13,000 (95-97%)
- **Failed geocoding**: ~200-400 (2-3%)
  - Reasons: Ambiguous locations, unknown places, API errors
- **Cache hits**: ~500-1,000 (duplicate city names)
- **API calls**: ~12,500-12,800
- **Execution time**: ~90 minutes

**Coverage by Country** (estimated):
- Japan (JP): ~95% success (rural areas may fail)
- Netherlands (NL): ~98% success (postal codes help)
- Mexico (MX): ~92% success
- Brazil (BR): ~93% success
- Chile (CL): ~96% success (already partially geocoded)

---

## Monitoring Commands

### Check Progress
```bash
# Quick status check
./scripts/check_geocoding_progress.sh

# Check if process is still running
ps aux | grep geocode_global_institutions.py | grep -v grep

# View last 20 lines of log
tail -20 data/logs/geocoding_full_run.log

# Watch live progress (Ctrl+C to exit)
tail -f data/logs/geocoding_full_run.log
```

### Troubleshooting

**If process appears stuck**:
```bash
# Check process status
ps aux | grep geocode_global_institutions.py

# View recent progress
tail -50 data/logs/geocoding_full_run.log

# Check for errors
grep -i "error" data/logs/geocoding_full_run.log
```

**If process dies unexpectedly**:
```bash
# Restart - it will resume from cache
python scripts/geocode_global_institutions.py
```

---

## When Geocoding Completes

### Validation Steps

1. **Check final statistics** in log:
   ```bash
   tail -50 data/logs/geocoding_full_run.log
   ```

2. **Validate dataset**:
   ```bash
   python scripts/validate_geocoding_results.py # (to be created)
   ```

3. **Generate visualizations**:
   - World map with institution markers
   - Density heatmaps
   - Coverage analysis by country
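
Since the validation script does not exist yet, the following is only a sketch of one check it might contain: confirming that each location has coordinates within valid WGS84 bounds. The function name `coordinate_problems` and the message wording are hypothetical.

```python
def coordinate_problems(location: dict) -> list:
    """Return a list of problems found in one location record (empty if none)."""
    lat = location.get("latitude")
    lon = location.get("longitude")
    if lat is None or lon is None:
        return ["missing coordinates"]
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude out of range: {lon}")
    return problems


good = {"latitude": 43.0609096, "longitude": 141.3460696}
swapped = {"latitude": 141.3460696, "longitude": 43.0609096}
print(coordinate_problems(good))
print(coordinate_problems(swapped))
print(coordinate_problems({"city": "Amsterdam"}))
```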

### Next Steps (Priority 2)

**Wikidata Enrichment**:
- Replace 3,426 synthetic Q-numbers with real Wikidata IDs
- Query the Wikidata SPARQL endpoint for ISIL codes (property P791)
- Enrich institution metadata from Wikidata
- Estimated time: 6-10 minutes
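
A lookup by ISIL code could be expressed as a SPARQL query along these lines. This sketch only builds the query text and makes no network call. The function name `build_isil_query`, the batching via `VALUES`, and the placeholder codes are assumptions, and the codes are assumed to contain no quote characters.

```python
def build_isil_query(isil_codes: list) -> str:
    """Build a SPARQL query matching Wikidata items by ISIL (property P791)."""
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


# Placeholder codes, not real ISIL identifiers.
query = build_isil_query(["XX-0001", "XX-0002"])
print(query)
```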

**Collection Metadata Extraction** (Priority 3):
- Web scraping with crawl4ai
- Extract from 10,932 institutional websites (81.6%)
- Populate LinkML Collection class
- Estimated time: Several days

---

## Files Created This Session

### Scripts
1. ✅ `scripts/geocode_global_institutions.py` - Main geocoding script (398 lines)
2. ✅ `scripts/check_geocoding_progress.sh` - Progress monitor

### Data Files
1. 🔄 `data/instances/global/global_heritage_institutions.yaml` - Being updated with coordinates
2. ✅ `data/cache/geocoding_cache.db` - SQLite cache (growing)
3. ✅ `data/logs/geocoding_full_run.log` - Execution log

### Directories Created
1. ✅ `data/cache/` - For geocoding cache
2. ✅ `data/logs/` - For execution logs

---

## Technical Notes

### Nominatim API Usage Policy Compliance

**Rate Limiting**: ✅ 1 request/second (enforced by script)
**User-Agent**: ✅ `GLAM-Heritage-Data-Extraction/1.0`
**Caching**: ✅ All results cached to minimize API load
**Timeout**: ✅ 10 seconds per request
**Error Handling**: ✅ Graceful failures, no retry storms

**Reference**: https://operations.osmfoundation.org/policies/nominatim/
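
One way to enforce the 1 request/second limit is to record when the last request was made and sleep for the remainder of the interval. The sketch below is an assumption about how this could be done, not the script's actual implementation. The class name `RateLimiter` and the injectable `clock` and `sleep` parameters (which make it testable without real waiting) are hypothetical. The two constants restate values from this document.

```python
import time

USER_AGENT = "GLAM-Heritage-Data-Extraction/1.0"  # value stated in this summary
TIMEOUT_SECONDS = 10                              # value stated in this summary


class RateLimiter:
    """Block so that consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Intended use: call limiter.wait() immediately before each API request,
# and skip it entirely when the result comes from the cache.
limiter = RateLimiter(min_interval=1.0)
limiter.wait()  # the first call never sleeps
```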

### Performance Optimization

**Cache Efficiency**:
- Same city queried multiple times → cache hit (instant)
- Example: 102 Toyohashi libraries → 1 API call, 101 cache hits
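
The Toyohashi figure follows from counting distinct query strings: only the first occurrence of each query needs the API. A small sketch, assuming every repeated query is served from the cache; the function name `cache_savings` and the exact query text are illustrative.

```python
from collections import Counter


def cache_savings(queries: list) -> tuple:
    """Return (api_calls, cache_hits) for a sequence of query strings."""
    distinct = Counter(queries)
    api_calls = len(distinct)            # one API call per distinct query
    cache_hits = len(queries) - api_calls
    return api_calls, cache_hits


# 102 institutions that all resolve to the same city-level query.
toyohashi = ["TOYOHASHI, AICHI, Japan"] * 102
print(cache_savings(toyohashi))  # (1, 101)
```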

**Query Optimization**:
- Country-specific formatting improves match accuracy
- Postal codes (NL) → higher precision
- Prefecture names (JP) → better disambiguation

**Database Performance**:
- SQLite cache uses indexed primary key (query text)
- Average lookup time: <1ms
- No lock contention (single writer process)

---

## Session Handoff

**Current State**:
- ✅ Geocoding script created and tested
- 🔄 Full geocoding run in progress (ETA: 90 minutes)
- ✅ Monitoring tools available
- ✅ Cache infrastructure working

**When You Return**:
1. Run `./scripts/check_geocoding_progress.sh` to check status
2. If complete, check final statistics in log
3. Run validation script (to be created)
4. Proceed with Priority 2: Wikidata enrichment

**If Something Goes Wrong**:
- Process can be safely restarted (cache preserves progress)
- Log file contains detailed error information
- Cache can be inspected directly: `sqlite3 data/cache/geocoding_cache.db`
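
Because the cache schema is not documented here, a schema-agnostic inspection is the safest starting point: list whatever tables exist and count their rows. The sketch below runs against an in-memory database with a made-up table so that it is self-contained. The function name `summarize_tables` is hypothetical. To inspect the real cache, one would pass `sqlite3.connect("data/cache/geocoding_cache.db")` instead.

```python
import sqlite3


def summarize_tables(conn: sqlite3.Connection) -> dict:
    """Map each table name in the database to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Stand-in database; the table and its contents are invented for the demo.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_cache (query TEXT PRIMARY KEY, result TEXT)")
demo.executemany(
    "INSERT INTO demo_cache VALUES (?, ?)",
    [("query one", "ok"), ("query two", "ok")],
)
summary = summarize_tables(demo)
print(summary)
```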

---

**Last Updated**: 2025-11-07 13:26
**Next Check**: 2025-11-07 14:50 (estimated completion)
**Status**: 🟢 Running smoothly