glam/NEXT_STEPS_Mexican_Geocoding.md
2025-11-19 23:25:22 +01:00

# Next Session: Mexican Institution Geocoding
## Quick Start Guide for Next Session
### Objective
Geocode 117 Mexican heritage institutions to reach at least 60% coordinate coverage (71 or more institutions, since 70 of 117 is only 59.8%).
### Current Status
- **Input file**: `data/instances/mexican_institutions_curated.yaml`
- **Current geocoding**: 6.0% (7 of 117 institutions)
- **Target**: 60%+ (71+ institutions)
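The coverage figure is the share of records that carry coordinates. The following is a minimal sketch of that calculation, not the project's own code. It assumes each record is a dict with a `locations` list whose entries may hold `latitude` and `longitude` keys; the actual schema of the curated YAML file is not shown in this guide, so those field names are assumptions.

```python
from typing import Any


def has_coordinates(institution: dict[str, Any]) -> bool:
    """Return True if any location entry has both latitude and longitude.

    The `locations`, `latitude`, and `longitude` keys are assumed names.
    """
    for location in institution.get("locations") or []:
        if location.get("latitude") is not None and location.get("longitude") is not None:
            return True
    return False


def coverage(institutions: list[dict[str, Any]]) -> tuple[int, int, float]:
    """Return (geocoded count, total count, coverage in percent)."""
    total = len(institutions)
    geocoded = sum(has_coordinates(inst) for inst in institutions)
    percent = 100.0 * geocoded / total if total else 0.0
    return geocoded, total, percent
```

The record list could be obtained by parsing the input file with a YAML library and then passed to `coverage()`.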
### Step-by-Step Workflow
#### 1. Create Mexican Geocoding Script (5 minutes)
Copy the Chilean script and adapt for Mexico:
```bash
cd /Users/kempersc/apps/glam
cp scripts/geocode_chilean_institutions.py scripts/geocode_mexican_institutions.py
```
**Configuration changes needed** (lines 22-26):
```python
INPUT_FILE = Path("data/instances/mexican_institutions_curated.yaml")
OUTPUT_FILE = Path("data/instances/mexican_institutions_geocoded_v2.yaml")
REPORT_FILE = Path("data/instances/mexican_geocoding_report_v2.md")
CACHE_FILE = Path("data/instances/.geocoding_cache_mexico.yaml")
```
**Report title change** (lines 2-3 and line 439):
```python
# Change "Chilean" to "Mexican" in:
# - Docstring title
# - Print statements in main()
# - Report generation
```
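After editing the copy, it can help to confirm that no Chilean references were missed. The helper below is a hypothetical sketch (the function name is not part of the existing scripts) that lists every line of a source string still containing "chile" in any capitalisation.

```python
from pathlib import Path


def find_leftover_references(source: str, needle: str = "chile") -> list[tuple[int, str]]:
    """Return (line number, stripped line) pairs that still mention `needle`.

    Matching is case-insensitive and substring-based, so it catches
    "Chile", "Chilean", and identifiers such as geocode_chilean_institutions.
    """
    needle = needle.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if needle in line.lower()
    ]


if __name__ == "__main__":
    script = Path("scripts/geocode_mexican_institutions.py")
    if script.is_file():
        for number, line in find_leftover_references(script.read_text(encoding="utf-8")):
            print(f"{script}:{number}: {line}")
```

An empty result suggests the docstring, print statements, and report text have all been updated.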
#### 2. Run Geocoding (4-5 minutes first run)
```bash
cd /Users/kempersc/apps/glam
python scripts/geocode_mexican_institutions.py
```
Expected output:
- ~200-250 API calls (117 institutions × ~2 avg queries with fallbacks)
- ~4-5 minutes total time (1.1 sec/request rate limit)
- Target: 71+ institutions geocoded (60%+)
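The estimate above follows from simple arithmetic. As a worked check, using only the figures quoted in this section (the function name is illustrative):

```python
INSTITUTIONS = 117
AVG_QUERIES_PER_INSTITUTION = 2   # rough average, including fallback queries
SECONDS_PER_REQUEST = 1.1         # rate limit between API calls


def estimate_run(institutions: int, avg_queries: float, seconds_per_request: float) -> tuple[float, float]:
    """Return (expected API calls, expected runtime in minutes)."""
    calls = institutions * avg_queries
    minutes = calls * seconds_per_request / 60
    return calls, minutes


calls, minutes = estimate_run(INSTITUTIONS, AVG_QUERIES_PER_INSTITUTION, SECONDS_PER_REQUEST)
print(f"{calls} calls, about {minutes:.1f} minutes")  # 234 calls, about 4.3 minutes
```

Cached responses from an earlier run would reduce both numbers, which is why the timing applies to the first run only.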
#### 3. Validate Results (1 minute)
```bash
python scripts/validate_yaml_instance.py data/instances/mexican_institutions_geocoded_v2.yaml
```
Should show: "✅ All instances are valid!"
#### 4. Create Backup (30 seconds)
```bash
# Create the backup directory first in case it does not exist yet
mkdir -p data/instances/backups
tar -czf data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz \
  data/instances/mexican_institutions_geocoded_v2.yaml \
  data/instances/mexican_geocoding_report_v2.md \
  data/instances/.geocoding_cache_mexico.yaml
```
#### 5. Review Report
Check `data/instances/mexican_geocoding_report_v2.md` for:
- Total geocoded percentage (should be ≥60%)
- Failed institutions (for manual review)
- API usage statistics
### Expected Results
**Optimistic Scenario** (similar to Chilean success):
- 85%+ geocoded (99+ institutions)
- ~18 failed institutions
- Fallback strategies highly effective
**Realistic Scenario**:
- 70-80% geocoded (82-93 institutions)
- ~25-35 failed institutions
- Some Mexican regions have lower OSM coverage
**Worst Case** (below target):
- 50-60% geocoded (58-70 institutions)
- Requires manual geocoding or refined queries for failed cases
### Mexican-Specific Considerations
#### Institution Name Patterns
Mexican institution names may differ from Chilean patterns:
- **Chilean**: "Museo de...", "Archivo Histórico", "Biblioteca Pública"
- **Mexican**: "Museo Nacional de...", "Archivo General del Estado de...", "Biblioteca Pública Municipal"
**Potential Adjustments** (if needed):
Update fallback query generation (around line 200) to handle:
- "Archivo General del Estado" → "Archivo Estado"
- "Biblioteca Pública Municipal" → "Biblioteca Municipal"
- Mexican state names in place of Chilean region names
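One way these adjustments could look is sketched below. This is not the existing script's fallback logic (which is not reproduced in this guide); `SIMPLIFICATIONS` and `fallback_queries` are hypothetical names, and the two rewrite rules are taken directly from the examples above.

```python
import re
from typing import Optional

# Hypothetical simplification rules, based on the two examples in this section.
SIMPLIFICATIONS = [
    (re.compile(r"\bArchivo General del Estado\b", re.IGNORECASE), "Archivo Estado"),
    (re.compile(r"\bBiblioteca Pública Municipal\b", re.IGNORECASE), "Biblioteca Municipal"),
]


def fallback_queries(
    name: str,
    city: Optional[str] = None,
    state: Optional[str] = None,
    country: str = "México",
) -> list[str]:
    """Build an ordered, de-duplicated list of queries, most specific first."""
    variants = [name]
    for pattern, replacement in SIMPLIFICATIONS:
        simplified = pattern.sub(replacement, name)
        if simplified not in variants:
            variants.append(simplified)

    queries: list[str] = []
    for variant in variants:
        # Try city + state, then state only, then country only.
        for place in ([city, state], [state], []):
            parts = [variant, *[p for p in place if p], country]
            query = ", ".join(parts)
            if query not in queries:
                queries.append(query)
    return queries
```

For example, `fallback_queries("Archivo General del Estado de Oaxaca", city="Oaxaca de Juárez", state="Oaxaca")` yields the full name with progressively less place context, followed by the same three forms using "Archivo Estado de Oaxaca".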
#### Mexican State Names (Examples)
Common states in dataset:
- Ciudad de México (CDMX)
- Jalisco (Guadalajara)
- Nuevo León (Monterrey)
- Puebla
- Veracruz
- Oaxaca
- Guanajuato
- Yucatán
**OSM Coverage**: OpenStreetMap coverage is generally good in major cities but variable in rural areas.
### Troubleshooting
**If geocoding rate is < 60%**:
1. **Review failed institutions**:
   ```bash
   grep -A3 "No results found (all strategies exhausted)" \
     data/instances/mexican_geocoding_report_v2.md
   ```
2. **Check for patterns in failures**:
   - Are they all from one region? (May indicate OSM coverage issue)
   - Are they generic names? (Need more specific queries)
   - Are they university archives? (May need different query pattern)
3. **Potential fixes**:
   - Add Mexico-specific fallback strategies
   - Mine conversation text for street addresses
   - Manual geocoding for critical institutions
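Checking for patterns in the failures can be done by counting failed institutions per region or per institution type. The sketch below assumes the failed records are available as flat dicts; the `region` and `institution_type` keys are assumed names, not confirmed fields of the dataset.

```python
from collections import Counter
from typing import Any, Optional


def failures_by_field(failed: list[dict[str, Any]], field: str = "region") -> Counter:
    """Count failed institutions per value of `field` (an assumed key name)."""
    return Counter(inst.get(field) or "unknown" for inst in failed)


def dominant_group(counts: Counter, threshold: float = 0.5) -> Optional[str]:
    """Return the most common value if it exceeds `threshold` of all failures."""
    total = sum(counts.values())
    if total == 0:
        return None
    value, count = counts.most_common(1)[0]
    return value if count / total > threshold else None
```

If `dominant_group(failures_by_field(failed, "region"))` returns a single state, that points to a regional coverage gap; a dominant `institution_type` would instead suggest a query pattern problem.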
**If script errors occur**:
- Check INPUT_FILE exists: `ls data/instances/mexican_institutions_curated.yaml`
- Verify network connection for API calls
- Check cache file permissions: `ls -la data/instances/.geocoding_cache_mexico.yaml`
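The file checks above can also be run as a small pre-flight step before starting the geocoder. This is a sketch under the assumption that the paths match the configuration shown earlier; `preflight` is a hypothetical helper, and it does not test network connectivity.

```python
import os
from pathlib import Path


def preflight(input_file: Path, cache_file: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not input_file.is_file():
        problems.append(f"Input file not found: {input_file}")
    if cache_file.exists():
        # An existing cache must be readable and writable.
        if not os.access(cache_file, os.R_OK | os.W_OK):
            problems.append(f"Cache file is not readable/writable: {cache_file}")
    elif not os.access(cache_file.parent, os.W_OK):
        # No cache yet: its directory must exist and allow creating it.
        problems.append(f"Cannot create cache file in: {cache_file.parent}")
    return problems


if __name__ == "__main__":
    issues = preflight(
        Path("data/instances/mexican_institutions_curated.yaml"),
        Path("data/instances/.geocoding_cache_mexico.yaml"),
    )
    for issue in issues:
        print(issue)
```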
### After Mexican Geocoding
Once Mexican institutions are geocoded, we'll have:
**Dataset Summary**:
- **Brazilian institutions**: 97 (59.8% geocoded)
- **Chilean institutions**: 90 (86.7% geocoded) ✅
- **Mexican institutions**: 117 (target: 60%+ geocoded)
- **Total**: 304 institutions
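To see what the combined coverage would be once Mexico reaches its target, the figures above can be added up. In this worked sketch the Brazilian count (58) is derived by rounding 59.8% of 97, and the Mexican count is the smallest number that reaches 60% of 117; it is a projection, not a result.

```python
import math

# (geocoded, total) per country
datasets = {
    "Brazil": (round(0.598 * 97), 97),       # 58, derived from the stated 59.8%
    "Chile": (78, 90),                       # 86.7%
    "Mexico": (math.ceil(0.60 * 117), 117),  # 71, minimum needed to hit the target
}


def combined_coverage(data: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Return (geocoded, total, coverage in percent) across all countries."""
    geocoded = sum(g for g, _ in data.values())
    total = sum(t for _, t in data.values())
    return geocoded, total, 100.0 * geocoded / total


geocoded, total, percent = combined_coverage(datasets)
print(f"{geocoded}/{total} institutions geocoded ({percent:.1f}%)")  # 207/304 (68.1%)
```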
**Next priorities**:
1. Create combined geocoding report across all 3 countries
2. Generate geographic visualization (map with all institutions)
3. Export to multiple formats (GeoJSON, RDF, CSV)
4. Update PROGRESS.md with geocoding achievements
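For the GeoJSON export, a minimal sketch is shown below. It assumes records have already been flattened to one coordinate pair each, with `name`, `country`, `latitude`, and `longitude` keys; these are illustrative names, not the dataset's confirmed schema. Note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json
from typing import Any


def to_geojson(institutions: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a GeoJSON FeatureCollection from geocoded institution records."""
    features = []
    for inst in institutions:
        lat, lon = inst.get("latitude"), inst.get("longitude")
        if lat is None or lon is None:
            continue  # skip records that were not geocoded
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": inst.get("name"), "country": inst.get("country")},
        })
    return {"type": "FeatureCollection", "features": features}


def dumps_geojson(institutions: list[dict[str, Any]]) -> str:
    """Serialise the collection, keeping accented characters readable."""
    return json.dumps(to_geojson(institutions), ensure_ascii=False, indent=2)
```

The same flattened records could feed the CSV export, so it may be worth doing the flattening once and reusing it.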
### Files to Create/Review
**Created during the session**:
- `scripts/geocode_mexican_institutions.py` - Geocoding script
- `data/instances/mexican_institutions_geocoded_v2.yaml` - Output data
- `data/instances/mexican_geocoding_report_v2.md` - Statistics
- `data/instances/.geocoding_cache_mexico.yaml` - API cache
- `data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz` - Backup
**For review**:
- Mexican geocoding report (coverage %, failed institutions)
- Validation output (should be 0 errors)
---
## Quick Reference: Key Metrics from Chilean Success
For comparison with Mexican results:
| Metric | Chilean Result | Mexican Target |
|--------|---------------|----------------|
| Total institutions | 90 | 117 |
| Geocoded | 78 (86.7%) | 71+ (60%+) |
| Failed | 12 (13.3%) | < 47 (< 40%) |
| API calls | ~150-200 | ~200-250 |
| Execution time | 3-4 minutes | 4-5 minutes |
| Fallback effectiveness | +33.4 percentage points | Target: +20 percentage points |
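Fallback effectiveness in this table is read here as the gain in coverage, in percentage points, between the first-pass queries and the final result. A small sketch of that calculation follows; the Mexican counts in the example are hypothetical placeholders, not measured results.

```python
def coverage_percent(geocoded: int, total: int) -> float:
    """Coverage as a percentage of all institutions."""
    return 100.0 * geocoded / total


def fallback_gain_points(first_pass: int, final: int, total: int) -> float:
    """Percentage-point gain attributable to fallback queries."""
    return coverage_percent(final, total) - coverage_percent(first_pass, total)


# Hypothetical example: 60 institutions found on the first pass, 85 after fallbacks.
gain = fallback_gain_points(first_pass=60, final=85, total=117)
print(f"Fallback gain: +{gain:.1f} percentage points")
```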
---
**Document Created**: 2025-11-06
**Estimated Time for Next Session**: 20 minutes total
**Prerequisites**: Chilean geocoding complete