# Next Session: Mexican Institution Geocoding

## Quick Start Guide for Next Session

### Objective

Geocode 117 Mexican heritage institutions to achieve 60%+ coordinate coverage (71+ institutions).

### Current Status

- **Input file**: `data/instances/mexican_institutions_curated.yaml`
- **Current geocoding**: 6.0% (7 out of 117 institutions)
- **Target**: 60%+ (71+ institutions)

### Step-by-Step Workflow

#### 1. Create Mexican Geocoding Script (5 minutes)

Copy the Chilean script and adapt it for Mexico:

```bash
cd /Users/kempersc/apps/glam
cp scripts/geocode_chilean_institutions.py scripts/geocode_mexican_institutions.py
```

**Configuration changes needed** (lines 22-26):
```python
INPUT_FILE = Path("data/instances/mexican_institutions_curated.yaml")
OUTPUT_FILE = Path("data/instances/mexican_institutions_geocoded_v2.yaml")
REPORT_FILE = Path("data/instances/mexican_geocoding_report_v2.md")
CACHE_FILE = Path("data/instances/.geocoding_cache_mexico.yaml")
```

**Report title change** (lines 2-3 and line 439):
```python
# Change "Chilean" to "Mexican" in:
# - Docstring title
# - Print statements in main()
# - Report generation
```
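
After the copy, it can help to confirm that no Chilean wording is left behind. The following is a minimal sketch, not part of the existing script: `find_leftover_references` is a hypothetical helper, and the inline `sample` string only stands in for the copied file's contents.

```python
def find_leftover_references(source: str, needle: str = "chile") -> list[tuple[int, str]]:
    """Return (line_number, text) for every line that still mentions `needle`."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Case-insensitive match so "Chilean", "chilean" and "CHILE" are all caught.
        if needle.lower() in line.lower():
            hits.append((number, line.strip()))
    return hits


# Made-up sample standing in for the copied script (assumption, not real file content).
sample = '''"""Geocode Chilean institutions."""
INPUT_FILE = Path("data/instances/mexican_institutions_curated.yaml")
print("Chilean geocoding complete")
'''

for number, text in find_leftover_references(sample):
    print(f"line {number}: {text}")
```

To run it against the real file, pass `Path("scripts/geocode_mexican_institutions.py").read_text(encoding="utf-8")` instead of `sample`.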

#### 2. Run Geocoding (4-5 minutes first run)

```bash
cd /Users/kempersc/apps/glam
python scripts/geocode_mexican_institutions.py
```

Expected output:
- ~200-250 API calls (117 institutions × ~2 avg queries with fallbacks)
- ~4-5 minutes total time (1.1 sec/request rate limit)
- Target: 71+ institutions geocoded (60%+)
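
The time estimate follows directly from the call count and the rate limit. A small sketch of that arithmetic, assuming the only cost per call is the 1.1-second wait (network latency and cache hits are ignored); `estimated_minutes` is an illustrative name, not a function in the script:

```python
SECONDS_PER_REQUEST = 1.1  # rate limit quoted in the expected output


def estimated_minutes(api_calls: int, seconds_per_request: float = SECONDS_PER_REQUEST) -> float:
    """Total wait time in minutes if every call pays the full rate-limit delay."""
    return api_calls * seconds_per_request / 60


institutions = 117
avg_queries = 2  # rough average including fallback queries
print("estimated calls:", institutions * avg_queries)

for calls in (200, 250):
    print(f"{calls} calls: about {estimated_minutes(calls):.1f} minutes")
```

A second run should be faster if the cache file is reused, since cached lookups presumably skip the API call.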

#### 3. Validate Results (1 minute)

```bash
python scripts/validate_yaml_instance.py data/instances/mexican_institutions_geocoded_v2.yaml
```

Should show: "✅ All instances are valid!"

#### 4. Create Backup (30 seconds)

```bash
# Make sure the backup directory exists before writing the archive
mkdir -p data/instances/backups
tar -czf data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz \
  data/instances/mexican_institutions_geocoded_v2.yaml \
  data/instances/mexican_geocoding_report_v2.md \
  data/instances/.geocoding_cache_mexico.yaml
```

#### 5. Review Report

Check `data/instances/mexican_geocoding_report_v2.md` for:
- Total geocoded percentage (should be ≥60%)
- Failed institutions (for manual review)
- API usage statistics
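
The reported percentage can also be cross-checked from the output records themselves. The sketch below assumes each record carries an optional `locations` list whose entries may hold `latitude` and `longitude`; that schema is a guess and should be adjusted to the actual YAML. The sample records and coordinates are made up.

```python
def has_coordinates(institution: dict) -> bool:
    """True if any location entry has both latitude and longitude set."""
    for location in institution.get("locations") or []:
        if location.get("latitude") is not None and location.get("longitude") is not None:
            return True
    return False


def coverage(institutions: list[dict]) -> tuple[int, int, float]:
    """Return (geocoded, total, percent) for a list of institution records."""
    total = len(institutions)
    geocoded = sum(has_coordinates(inst) for inst in institutions)
    percent = 100 * geocoded / total if total else 0.0
    return geocoded, total, percent


# Placeholder records (hypothetical field names and values).
sample = [
    {"name": "Institution A", "locations": [{"city": "Puebla", "latitude": 19.0, "longitude": -98.2}]},
    {"name": "Institution B", "locations": [{"city": "Oaxaca"}]},
    {"name": "Institution C"},
]

geocoded, total, percent = coverage(sample)
print(f"{geocoded}/{total} geocoded ({percent:.1f}%)")
```

With the real file, the list would come from loading `mexican_institutions_geocoded_v2.yaml` with a YAML parser of your choice.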

### Expected Results

**Optimistic Scenario** (similar to Chilean success):
- 85%+ geocoded (100+ institutions)
- At most 17 failed institutions
- Fallback strategies highly effective

**Realistic Scenario**:
- 70-80% geocoded (82-93 institutions)
- ~25-35 failed institutions
- Some Mexican regions have lower OSM coverage

**Worst Case** (below target):
- 50-60% geocoded (59-70 institutions)
- Requires manual geocoding or refined queries for failed cases
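
The scenario percentages translate into institution counts by taking a ceiling over the 117-record total. A minimal sketch of that arithmetic, using a hypothetical helper `institutions_needed` (integer math avoids floating-point rounding surprises):

```python
def institutions_needed(total: int, percent: int) -> int:
    """Smallest count whose share of `total` is at least `percent` (integer ceiling)."""
    return (total * percent + 99) // 100


TOTAL = 117

for percent in (50, 60, 70, 80, 85):
    needed = institutions_needed(TOTAL, percent)
    print(f"{percent}% of {TOTAL}: at least {needed} geocoded, at most {TOTAL - needed} failed")
```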

### Mexican-Specific Considerations

#### Institution Name Patterns

Mexican institution names may differ from Chilean patterns:

- **Chilean**: "Museo de...", "Archivo Histórico", "Biblioteca Pública"
- **Mexican**: "Museo Nacional de...", "Archivo General del Estado de...", "Biblioteca Pública Municipal"

**Potential Adjustments** (if needed):

Update fallback query generation (around line 200) to handle:
- "Archivo General del Estado" → "Archivo Estado"
- "Biblioteca Pública Municipal" → "Biblioteca Municipal"
- State names vs. Chilean region names
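
One way these adjustments could look is sketched below. This is an illustration only: `SIMPLIFICATIONS`, `simplify_name`, and `fallback_queries` are hypothetical names, and the real fallback logic around line 200 of the script may be structured differently.

```python
# Long-form prefixes mapped to the shorter forms listed above.
SIMPLIFICATIONS = [
    ("Archivo General del Estado", "Archivo Estado"),
    ("Biblioteca Pública Municipal", "Biblioteca Municipal"),
]


def simplify_name(name: str) -> str:
    """Apply each simplification rule in order; unmatched names pass through unchanged."""
    for long_form, short_form in SIMPLIFICATIONS:
        name = name.replace(long_form, short_form)
    return name


def fallback_queries(name: str, state: str) -> list[str]:
    """Build an ordered, de-duplicated list of queries from most to least specific."""
    candidates = [
        f"{name}, {state}, Mexico",
        f"{simplify_name(name)}, {state}, Mexico",
        f"{name}, Mexico",
    ]
    unique = []
    for query in candidates:
        if query not in unique:
            unique.append(query)
    return unique


for query in fallback_queries("Archivo General del Estado de Oaxaca", "Oaxaca"):
    print(query)
```

Names that match no rule produce only two queries, because the simplified form is identical to the original and is dropped as a duplicate.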

#### Mexican State Names (Examples)

Common states in dataset:
- Ciudad de México (CDMX)
- Jalisco (Guadalajara)
- Nuevo León (Monterrey)
- Puebla
- Veracruz
- Oaxaca
- Guanajuato
- Yucatán

**OSM Coverage**: Generally good in major cities, variable in rural areas.

### Troubleshooting

**If geocoding rate is < 60%**:

1. **Review failed institutions**:
   ```bash
   grep -A3 "No results found (all strategies exhausted)" \
     data/instances/mexican_geocoding_report_v2.md
   ```

2. **Check for patterns in failures**:
   - Are they all from one region? (May indicate OSM coverage issue)
   - Are they generic names? (Need more specific queries)
   - Are they university archives? (May need different query pattern)

3. **Potential fixes**:
   - Add Mexico-specific fallback strategies
   - Mine conversation text for street addresses
   - Manual geocoding for critical institutions
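
The regional check in step 2 can be done quickly by tallying failures per state. A minimal sketch, assuming the failed institutions have already been collected as `(name, state)` pairs; the helper name and the sample entries are placeholders:

```python
from collections import Counter


def failures_by_state(failed: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Count failed institutions per state, most affected state first."""
    return Counter(state for _name, state in failed).most_common()


# Placeholder data; in practice this would be extracted from the report.
failed = [
    ("Institution A", "Oaxaca"),
    ("Institution B", "Oaxaca"),
    ("Institution C", "Puebla"),
]

for state, count in failures_by_state(failed):
    print(f"{state}: {count} failed")
```

A state that dominates the tally points toward an OSM coverage gap rather than a query problem.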

**If script errors occur**:

- Check INPUT_FILE exists: `ls data/instances/mexican_institutions_curated.yaml`
- Verify network connection for API calls
- Check cache file permissions: `ls -la data/instances/.geocoding_cache_mexico.yaml`
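
The file checks above could also be run from Python before starting a long geocoding run. This is a sketch under the assumption that the script only needs a readable input file and a writable cache location; `preflight` is a hypothetical helper, not something the script provides.

```python
import os
from pathlib import Path


def preflight(input_file: Path, cache_file: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not input_file.is_file():
        problems.append(f"missing input file: {input_file}")
    if cache_file.exists():
        # An existing cache must be both readable and writable.
        if not os.access(cache_file, os.R_OK | os.W_OK):
            problems.append(f"cache file is not readable and writable: {cache_file}")
    elif not os.access(cache_file.parent, os.W_OK):
        # No cache yet: its directory must exist and allow creating one.
        problems.append(f"cannot create cache file in: {cache_file.parent}")
    return problems


for problem in preflight(
    Path("data/instances/mexican_institutions_curated.yaml"),
    Path("data/instances/.geocoding_cache_mexico.yaml"),
):
    print(problem)
```

Network connectivity is deliberately left out, since checking it would require a real API request.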

### After Mexican Geocoding

Once Mexican institutions are geocoded, we'll have:

**Dataset Summary**:
- **Brazilian institutions**: 97 (59.8% geocoded)
- **Chilean institutions**: 90 (86.7% geocoded) ✅
- **Mexican institutions**: 117 (target: 60%+ geocoded)
- **Total**: 304 institutions
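
To see what the Mexican run means for the combined dataset, the per-country figures can be rolled up into one coverage number. The sketch below recovers the Brazilian and Chilean counts from their rounded percentages, which is an inference rather than a figure stated in the reports; all names are illustrative.

```python
# (total institutions, geocoded percentage) as stated in the dataset summary.
COMPLETED = {"Brazil": (97, 59.8), "Chile": (90, 86.7)}
MEXICO_TOTAL = 117
MEXICO_CURRENT = 7  # geocoded before this session


def geocoded_count(total: int, percent: float) -> int:
    """Recover a geocoded count from a rounded percentage."""
    return round(total * percent / 100)


def combined_percent(mexico_geocoded: int) -> float:
    """Coverage across all three countries for a given Mexican geocoded count."""
    geocoded = mexico_geocoded + sum(geocoded_count(t, p) for t, p in COMPLETED.values())
    total = MEXICO_TOTAL + sum(t for t, _ in COMPLETED.values())
    return round(100 * geocoded / total, 1)


# Smallest Mexican count that reaches 60% (integer ceiling of 0.6 * 117).
mexico_target = -(-MEXICO_TOTAL * 60 // 100)

print("combined coverage before:", combined_percent(MEXICO_CURRENT))
print("combined coverage at Mexican target:", combined_percent(mexico_target))
```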

**Next priorities**:
1. Create combined geocoding report across all 3 countries
2. Generate geographic visualization (map with all institutions)
3. Export to multiple formats (GeoJSON, RDF, CSV)
4. Update PROGRESS.md with geocoding achievements
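
For the GeoJSON export, a possible starting point is sketched below. It assumes records have already been flattened to `name`, `country`, `latitude`, and `longitude` keys, which is a hypothetical shape rather than the project's schema. Note that GeoJSON stores coordinates in longitude, latitude order.

```python
import json


def to_geojson(institutions: list[dict]) -> dict:
    """Build a GeoJSON FeatureCollection, skipping records without coordinates."""
    features = []
    for inst in institutions:
        lat, lon = inst.get("latitude"), inst.get("longitude")
        if lat is None or lon is None:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": inst["name"], "country": inst.get("country")},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder records with made-up coordinates.
sample = [
    {"name": "Institution A", "country": "MX", "latitude": 19.0, "longitude": -98.2},
    {"name": "Institution B", "country": "MX"},
]

print(json.dumps(to_geojson(sample), indent=2, ensure_ascii=False))
```

The same feature list could feed the planned map visualization, since most mapping tools accept a FeatureCollection directly.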

### Files to Create/Review

**Created during the session**:
- `scripts/geocode_mexican_institutions.py` - Geocoding script (copied in step 1)
- `data/instances/mexican_institutions_geocoded_v2.yaml` - Output data
- `data/instances/mexican_geocoding_report_v2.md` - Statistics
- `data/instances/.geocoding_cache_mexico.yaml` - API cache
- `data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz` - Backup (created in step 4)

**For review**:
- Mexican geocoding report (coverage %, failed institutions)
- Validation output (should be 0 errors)

---

## Quick Reference: Key Metrics from Chilean Success

For comparison with Mexican results:

| Metric | Chilean Result | Mexican Target |
|--------|---------------|----------------|
| Total institutions | 90 | 117 |
| Geocoded | 78 (86.7%) | 71+ (60%+) |
| Failed | 12 (13.3%) | < 47 (< 40%) |
| API calls | ~150-200 | ~200-250 |
| Execution time | 3-4 minutes | 4-5 minutes |
| Fallback effectiveness | +33.4 percentage points | Target: +20 percentage points |

---

**Document Created**: 2025-11-06
**Estimated Time for Next Session**: 20 minutes total
**Prerequisites**: Chilean geocoding complete ✅