# Next Session: Mexican Institution Geocoding

## Quick Start Guide for Next Session

### Objective

Geocode 117 Mexican heritage institutions to achieve 60%+ coordinate coverage (70+ institutions).

### Current Status

- **Input file**: `data/instances/mexican_institutions_curated.yaml`
- **Current geocoding**: 6.0% (7 out of 117 institutions)
- **Target**: 60%+ (70+ institutions)

### Step-by-Step Workflow

#### 1. Create Mexican Geocoding Script (5 minutes)

Copy the Chilean script and adapt it for Mexico:

```bash
cd /Users/kempersc/apps/glam
cp scripts/geocode_chilean_institutions.py scripts/geocode_mexican_institutions.py
```

**Configuration changes needed** (lines 22-26):

```python
INPUT_FILE = Path("data/instances/mexican_institutions_curated.yaml")
OUTPUT_FILE = Path("data/instances/mexican_institutions_geocoded_v2.yaml")
REPORT_FILE = Path("data/instances/mexican_geocoding_report_v2.md")
CACHE_FILE = Path("data/instances/.geocoding_cache_mexico.yaml")
```

**Report title change** (lines 2-3 and line 439):

```python
# Change "Chilean" to "Mexican" in:
# - Docstring title
# - Print statements in main()
# - Report generation
```

#### 2. Run Geocoding (4-5 minutes first run)

```bash
cd /Users/kempersc/apps/glam
python scripts/geocode_mexican_institutions.py
```

Expected output:

- ~200-250 API calls (117 institutions × ~2 queries on average, including fallbacks)
- ~4-5 minutes total time (rate limit of 1.1 seconds per request)
- Target: 70+ institutions geocoded (60%+)

#### 3. Validate Results (1 minute)

```bash
python scripts/validate_yaml_instance.py data/instances/mexican_institutions_geocoded_v2.yaml
```

Should show: "✅ All instances are valid!"

#### 4. Create Backup (30 seconds)

```bash
# Ensure the backup directory exists before writing the archive
mkdir -p data/instances/backups
tar -czf data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz \
  data/instances/mexican_institutions_geocoded_v2.yaml \
  data/instances/mexican_geocoding_report_v2.md \
  data/instances/.geocoding_cache_mexico.yaml
```

#### 5. Review Report

Check `data/instances/mexican_geocoding_report_v2.md` for:

- Total geocoded percentage (should be ≥60%)
- Failed institutions (for manual review)
- API usage statistics

### Expected Results

**Optimistic Scenario** (similar to Chilean success):

- 85%+ geocoded (99+ institutions)
- At most ~18 failed institutions
- Fallback strategies highly effective

**Realistic Scenario**:

- 70-80% geocoded (82-93 institutions)
- ~25-35 failed institutions
- Some Mexican regions have lower OSM coverage

**Worst Case** (below target):

- 50-60% geocoded (58-70 institutions)
- Requires manual geocoding or refined queries for failed cases

### Mexican-Specific Considerations

#### Institution Name Patterns

Mexican institution names may differ from Chilean patterns:

- **Chilean**: "Museo de...", "Archivo Histórico", "Biblioteca Pública"
- **Mexican**: "Museo Nacional de...", "Archivo General del Estado de...", "Biblioteca Pública Municipal"

**Potential Adjustments** (if needed):

Update the fallback query generation (around line 200) to handle:

- "Archivo General del Estado" → "Archivo Estado"
- "Biblioteca Pública Municipal" → "Biblioteca Municipal"
- Mexican state names in place of Chilean region names

#### Mexican State Names (Examples)

Common states in the dataset:

- Ciudad de México (CDMX)
- Jalisco (Guadalajara)
- Nuevo León (Monterrey)
- Puebla
- Veracruz
- Oaxaca
- Guanajuato
- Yucatán

**OSM Coverage**: Generally good in major cities, variable in rural areas.

### Troubleshooting

**If geocoding rate is < 60%**:

1. **Review failed institutions**:

   ```bash
   grep -A3 "No results found (all strategies exhausted)" \
     data/instances/mexican_geocoding_report_v2.md
   ```

2. **Check for patterns in failures**:
   - Are they all from one region? (May indicate an OSM coverage issue)
   - Are they generic names? (Need more specific queries)
   - Are they university archives? (May need a different query pattern)

3. **Potential fixes**:
   - Add Mexico-specific fallback strategies
   - Mine conversation text for street addresses
   - Manual geocoding for critical institutions

**If script errors occur**:

- Check that INPUT_FILE exists: `ls data/instances/mexican_institutions_curated.yaml`
- Verify the network connection for API calls
- Check cache file permissions: `ls -la data/instances/.geocoding_cache_mexico.yaml`

### After Mexican Geocoding

Once Mexican institutions are geocoded, we'll have:

**Dataset Summary**:

- **Brazilian institutions**: 97 (59.8% geocoded)
- **Chilean institutions**: 90 (86.7% geocoded) ✅
- **Mexican institutions**: 117 (target: 60%+ geocoded)
- **Total**: 304 institutions

**Next priorities**:

1. Create a combined geocoding report across all 3 countries
2. Generate a geographic visualization (map with all institutions)
3. Export to multiple formats (GeoJSON, RDF, CSV)
4. Update PROGRESS.md with geocoding achievements

### Files to Create/Review

**Created during the session**:

- `scripts/geocode_mexican_institutions.py` - Geocoding script
- `data/instances/mexican_institutions_geocoded_v2.yaml` - Output data
- `data/instances/mexican_geocoding_report_v2.md` - Statistics
- `data/instances/.geocoding_cache_mexico.yaml` - API cache
- `data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz` - Backup

**For review**:

- Mexican geocoding report (coverage %, failed institutions)
- Validation output (should be 0 errors)

---

## Quick Reference: Key Metrics from Chilean Success

For comparison with Mexican results:

| Metric | Chilean Result | Mexican Target |
|--------|----------------|----------------|
| Total institutions | 90 | 117 |
| Geocoded | 78 (86.7%) | 70+ (60%+) |
| Failed | 12 (13.3%) | < 47 (< 40%) |
| API calls | ~150-200 | ~200-250 |
| Execution time | 3-4 minutes | 4-5 minutes |
| Fallback effectiveness | +33.4 percentage points | Target: +20 percentage points |

---

**Document Created**: 2025-11-06
**Estimated Time for Next Session**: 20 minutes total
**Prerequisites**: Chilean geocoding complete ✅
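
---

## Sketch: Mexico-Specific Fallback Queries

The name adjustments listed under "Potential Adjustments" could be sketched roughly as follows. This is a minimal illustration, not the logic of the existing script: the function names `simplify_mexican_name` and `build_fallback_queries`, the `NAME_REWRITES` table, and the query string layout are all hypothetical, and only the two rewrite rules come from this guide.

```python
from typing import List, Optional

# Hypothetical rewrite rules, taken from the two examples in this guide.
# The real script's fallback logic (around line 200) may be organised differently.
NAME_REWRITES = [
    ("Archivo General del Estado", "Archivo Estado"),
    ("Biblioteca Pública Municipal", "Biblioteca Municipal"),
]


def simplify_mexican_name(name: str) -> str:
    """Apply the first matching rewrite rule; return the name unchanged otherwise."""
    for long_form, short_form in NAME_REWRITES:
        if long_form in name:
            return name.replace(long_form, short_form)
    return name


def build_fallback_queries(
    name: str, city: Optional[str] = None, state: Optional[str] = None
) -> List[str]:
    """Return candidate query strings, most specific first, without duplicates."""
    place = ", ".join(part for part in (city, state, "Mexico") if part)
    simplified = simplify_mexican_name(name)
    candidates = [
        f"{name}, {place}",        # full name with all known place parts
        f"{simplified}, {place}",  # simplified name, same place parts
    ]
    if city and state:
        # Last resort in this sketch: drop the city and search at state level.
        candidates.append(f"{simplified}, {state}, Mexico")

    seen = set()
    queries = []
    for query in candidates:
        if query not in seen:  # skip repeats when no rewrite rule applied
            seen.add(query)
            queries.append(query)
    return queries


if __name__ == "__main__":
    for q in build_fallback_queries(
        "Biblioteca Pública Municipal de Tequila", city="Tequila", state="Jalisco"
    ):
        print(q)
```

Ordering the candidates from most to least specific means the geocoder is only asked the looser questions after the precise one fails, which keeps the number of API calls close to the ~2 per institution assumed above.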