glam/NEXT_STEPS_Mexican_Geocoding.md
2025-11-19 23:25:22 +01:00

Next Session: Mexican Institution Geocoding

Quick Start Guide for Next Session

Objective

Geocode the 117 Mexican heritage institutions so that at least 60% have coordinates (about 70 of 117 institutions).

Current Status

  • Input file: data/instances/mexican_institutions_curated.yaml
  • Current coverage: 6.0% (7 of 117 institutions)
  • Target: 60%+ (70+ institutions)
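The coverage figures above are simple ratios, and it can help to check exactly how many institutions the 60% target requires. The sketch below is illustrative only; the function names are hypothetical and not part of the existing scripts. It uses integer arithmetic so the rounding is exact.

```python
TOTAL = 117        # institutions in the curated file (from this guide)
GEOCODED_NOW = 7   # institutions that already have coordinates
TARGET_PCT = 60    # coverage goal, in whole percent


def coverage_pct(geocoded: int, total: int) -> float:
    """Coverage as a percentage of the total."""
    return 100 * geocoded / total


def min_count_for(target_pct: int, total: int) -> int:
    """Smallest geocoded count whose coverage is at least target_pct (integer ceiling)."""
    return (target_pct * total + 99) // 100


needed = min_count_for(TARGET_PCT, TOTAL)
print(f"current coverage: {coverage_pct(GEOCODED_NOW, TOTAL):.1f}%")
print(f"institutions needed for {TARGET_PCT}%: {needed}")
print(f"still to geocode: {needed - GEOCODED_NOW}")
```

Note that 70 of 117 is just under 60%, so the "70+" shorthand used in this guide is an approximation of the exact threshold.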

Step-by-Step Workflow

1. Create Mexican Geocoding Script (5 minutes)

Copy the Chilean script and adapt for Mexico:

cd /Users/kempersc/apps/glam
cp scripts/geocode_chilean_institutions.py scripts/geocode_mexican_institutions.py

Configuration changes needed (lines 22-26):

INPUT_FILE = Path("data/instances/mexican_institutions_curated.yaml")
OUTPUT_FILE = Path("data/instances/mexican_institutions_geocoded_v2.yaml")
REPORT_FILE = Path("data/instances/mexican_geocoding_report_v2.md")
CACHE_FILE = Path("data/instances/.geocoding_cache_mexico.yaml")

Report title changes (lines 2-3 and line 439):

# Change "Chilean" to "Mexican" in:
# - Docstring title
# - Print statements in main()
# - Report generation
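Because line numbers can drift between script versions, it may be safer to list every remaining "Chilean" mention than to rely on the line numbers above. This is a minimal sketch, assuming the copied script is plain UTF-8 text; `find_mentions` is a hypothetical helper, and `sample` is a tiny stand-in for the real file.

```python
import re


def find_mentions(source: str, word: str = "Chilean") -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every line containing `word`."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [
        (number, line.rstrip())
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


# Tiny stand-in for the copied script; the real file is much longer.
sample = '''"""
Geocode Chilean institutions.
"""
def main():
    print("Geocoding Chilean institutions...")
'''

for number, line in find_mentions(sample):
    print(f"{number}: {line}")
```

To check the real file, pass `Path("scripts/geocode_mexican_institutions.py").read_text(encoding="utf-8")` instead of `sample` and edit each reported line by hand.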

2. Run Geocoding (4-5 minutes first run)

cd /Users/kempersc/apps/glam
python scripts/geocode_mexican_institutions.py

Expected output:

  • ~200-250 API calls (117 institutions × ~2 avg queries with fallbacks)
  • ~4-5 minutes total time (1.1 sec/request rate limit)
  • Target: 70+ institutions geocoded (60%+)
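The call count and run time above follow from the per-institution query average and the rate limit. A short worked estimate, assuming every request takes exactly the rate-limit interval and ignoring cache hits (`estimate_run` is a hypothetical helper):

```python
def estimate_run(institutions: int, avg_queries: float, seconds_per_request: float):
    """Return (api_calls, minutes) for a cold-cache run."""
    calls = institutions * avg_queries
    minutes = calls * seconds_per_request / 60
    return calls, minutes


# Values taken from this guide: 117 institutions, ~2 queries each, 1.1 s per request.
calls, minutes = estimate_run(institutions=117, avg_queries=2.0, seconds_per_request=1.1)
print(f"about {calls:.0f} API calls, about {minutes:.1f} minutes")
```

A second run should be faster, since queries already stored in the cache file do not need a new request.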

3. Validate Results (1 minute)

python scripts/validate_yaml_instance.py data/instances/mexican_institutions_geocoded_v2.yaml

Should show: "All instances are valid!"

4. Create Backup (30 seconds)

tar -czf data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz \
  data/instances/mexican_institutions_geocoded_v2.yaml \
  data/instances/mexican_geocoding_report_v2.md \
  data/instances/.geocoding_cache_mexico.yaml

5. Review Report

Check data/instances/mexican_geocoding_report_v2.md for:

  • Total geocoded percentage (should be ≥60%)
  • Failed institutions (for manual review)
  • API usage statistics

Expected Results

Optimistic Scenario (similar to Chilean success):

  • 85%+ geocoded (99+ institutions)
  • ~18 failed institutions
  • Fallback strategies highly effective

Realistic Scenario:

  • 70-80% geocoded (82-93 institutions)
  • ~25-35 failed institutions
  • Some Mexican regions have lower OSM coverage

Worst Case (below target):

  • 50-60% geocoded (58-70 institutions)
  • Requires manual geocoding or refined queries for failed cases

Mexican-Specific Considerations

Institution Name Patterns

Mexican institution names may differ from Chilean patterns:

Chilean: "Museo de...", "Archivo Histórico", "Biblioteca Pública"
Mexican: "Museo Nacional de...", "Archivo General del Estado de...", "Biblioteca Pública Municipal"

Potential Adjustments (if needed):

Update fallback query generation (around line 200) to handle:

  • "Archivo General del Estado" → "Archivo Estado"
  • "Biblioteca Pública Municipal" → "Biblioteca Municipal"
  • State names vs. Chilean region names
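The name simplifications listed above could be expressed as rewrite rules applied when building fallback queries. This is a sketch under stated assumptions: the rule table, the `fallback_queries` function, and the query ordering are hypothetical and may not match how the Chilean script builds its queries.

```python
import re

# Hypothetical rewrite rules taken from the two examples in this guide.
SIMPLIFICATIONS = [
    (re.compile(r"\bArchivo General del Estado\b", re.IGNORECASE), "Archivo Estado"),
    (re.compile(r"\bBiblioteca Pública Municipal\b", re.IGNORECASE), "Biblioteca Municipal"),
]


def fallback_queries(name: str, state: str, country: str = "Mexico") -> list[str]:
    """Return candidate queries from most to least specific, without duplicates."""
    candidates = [f"{name}, {state}, {country}"]
    simplified = name
    for pattern, replacement in SIMPLIFICATIONS:
        simplified = pattern.sub(replacement, simplified)
    if simplified != name:
        candidates.append(f"{simplified}, {state}, {country}")
    candidates.append(f"{name}, {country}")
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(candidates))


for query in fallback_queries("Archivo General del Estado de Oaxaca", "Oaxaca"):
    print(query)
```

Trying the full name first keeps the most precise match when it exists; the simplified and state-less variants are only reached when earlier queries return nothing.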

Mexican State Names (Examples)

Common states in dataset:

  • Ciudad de México (CDMX)
  • Jalisco (Guadalajara)
  • Nuevo León (Monterrey)
  • Puebla
  • Veracruz
  • Oaxaca
  • Guanajuato
  • Yucatán

OSM Coverage: Generally good in major cities, variable in rural areas.

Troubleshooting

If geocoding rate is < 60%:

  1. Review failed institutions:

    grep -A3 "No results found (all strategies exhausted)" \
      data/instances/mexican_geocoding_report_v2.md
    
  2. Check for patterns in failures:

    • Are they all from one region? (May indicate OSM coverage issue)
    • Are they generic names? (Need more specific queries)
    • Are they university archives? (May need different query pattern)
  3. Potential fixes:

    • Add Mexico-specific fallback strategies
    • Mine conversation text for street addresses
    • Manual geocoding for critical institutions
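Checking whether failures cluster in one region can be done with a simple tally. The sketch below assumes each failed institution is available as a dict with a `region` key; that field name and the sample records are made up for illustration and are not taken from the real dataset or report.

```python
from collections import Counter


def failure_hotspots(failed: list[dict], min_share: float = 0.3) -> list[tuple[str, int]]:
    """Return (region, count) pairs whose share of all failures is at least min_share."""
    counts = Counter(item.get("region", "unknown") for item in failed)
    total = sum(counts.values())
    return [
        (region, n)
        for region, n in counts.most_common()
        if total and n / total >= min_share
    ]


# Made-up sample records, for illustration only.
failed_sample = [
    {"name": "Example archive A", "region": "Oaxaca"},
    {"name": "Example archive B", "region": "Oaxaca"},
    {"name": "Example library C", "region": "Oaxaca"},
    {"name": "Example museum D", "region": "Puebla"},
    {"name": "Example museum E", "region": "Jalisco"},
]

print(failure_hotspots(failed_sample))
```

A region that dominates the result is a hint that OSM coverage, rather than query wording, is the limiting factor there.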

If script errors occur:

  • Check INPUT_FILE exists: ls data/instances/mexican_institutions_curated.yaml
  • Verify network connection for API calls
  • Check cache file permissions: ls -la data/instances/.geocoding_cache_mexico.yaml
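The file checks above can also be run in one step before starting the script. A minimal sketch, assuming the paths from the configuration section; `preflight` is a hypothetical helper and does not test the network connection.

```python
import os
from pathlib import Path


def preflight(input_file: Path, cache_file: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the files look fine."""
    problems = []
    if not input_file.is_file():
        problems.append(f"input file not found: {input_file}")
    if cache_file.exists():
        # An existing cache must be both readable and writable.
        if not os.access(cache_file, os.R_OK | os.W_OK):
            problems.append(f"cache file is not readable and writable: {cache_file}")
    elif not os.access(cache_file.parent, os.W_OK):
        # No cache yet: the script needs to be able to create it.
        problems.append(f"cannot create cache file in: {cache_file.parent}")
    return problems


issues = preflight(
    Path("data/instances/mexican_institutions_curated.yaml"),
    Path("data/instances/.geocoding_cache_mexico.yaml"),
)
print("\n".join(issues) if issues else "preflight OK")
```

Run it from the repository root so the relative paths resolve the same way they do for the geocoding script.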

After Mexican Geocoding

Once Mexican institutions are geocoded, we'll have:

Dataset Summary:

  • Brazilian institutions: 97 (59.8% geocoded)
  • Chilean institutions: 90 (86.7% geocoded)
  • Mexican institutions: 117 (target: 60%+ geocoded)
  • Total: 304 institutions
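The combined coverage across the three countries depends on how many Mexican institutions end up geocoded. A rough worked calculation, with two assumptions: the Brazilian count of 58 is inferred from 59.8% of 97 (it is not stated in this guide), and 71 is used as the smallest Mexican count that reaches 60%.

```python
BRAZIL = (97, 58)    # (total, geocoded); 58 is inferred from 59.8% of 97
CHILE = (90, 78)     # (total, geocoded); 78 of 90 is 86.7%
MEXICO_TOTAL = 117


def combined(mexico_geocoded: int) -> tuple[int, int, float]:
    """Return (total, geocoded, coverage fraction) across all three countries."""
    rows = [BRAZIL, CHILE, (MEXICO_TOTAL, mexico_geocoded)]
    total = sum(t for t, _ in rows)
    geocoded = sum(g for _, g in rows)
    return total, geocoded, geocoded / total


for label, mexico_geocoded in [("Mexico today", 7), ("Mexico at 60% target", 71)]:
    total, geocoded, share = combined(mexico_geocoded)
    print(f"{label}: {geocoded}/{total} geocoded ({share:.1%})")
```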

Next priorities:

  1. Create combined geocoding report across all 3 countries
  2. Generate geographic visualization (map with all institutions)
  3. Export to multiple formats (GeoJSON, RDF, CSV)
  4. Update PROGRESS.md with geocoding achievements
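For the GeoJSON export in priority 3, a possible starting point is shown below. It assumes each institution record is a dict with `name`, `latitude`, and `longitude` keys; those field names are hypothetical and would need to be mapped from the actual YAML schema. The sample records are placeholders, with approximate coordinates for central Mexico City.

```python
import json


def to_geojson(records: list[dict]) -> dict:
    """Build a GeoJSON FeatureCollection, skipping records without coordinates."""
    features = []
    for record in records:
        lat, lon = record.get("latitude"), record.get("longitude")
        if lat is None or lon is None:
            continue  # not geocoded yet
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": record["name"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder records, for illustration only.
records = [
    {"name": "Example museum", "latitude": 19.43, "longitude": -99.13},
    {"name": "Example archive without coordinates"},
]

print(json.dumps(to_geojson(records), ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps accented names such as "Yucatán" readable in the output file.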

Files to Create/Review

Created during the session:

  • scripts/geocode_mexican_institutions.py - Geocoding script
  • data/instances/mexican_institutions_geocoded_v2.yaml - Output data
  • data/instances/mexican_geocoding_report_v2.md - Statistics
  • data/instances/.geocoding_cache_mexico.yaml - API cache
  • data/instances/backups/2025-11-06_mexican-geocoded-v2.tar.gz - Backup

For review:

  • Mexican geocoding report (coverage %, failed institutions)
  • Validation output (should be 0 errors)

Quick Reference: Key Metrics from Chilean Success

For comparison with Mexican results:

| Metric | Chilean Result | Mexican Target |
|---|---|---|
| Total institutions | 90 | 117 |
| Geocoded | 78 (86.7%) | 70+ (60%+) |
| Failed | 12 (13.3%) | < 47 (< 40%) |
| API calls | ~150-200 | ~200-250 |
| Execution time | 3-4 minutes | 4-5 minutes |
| Fallback effectiveness | +33.4% pts | Target: +20% pts |

Document Created: 2025-11-06
Estimated Time for Next Session: 20 minutes total
Prerequisites: Chilean geocoding complete