glam/reports/mexico/baseline_analysis.md
2025-11-19 23:25:22 +01:00

Mexican Wikidata Enrichment Campaign - Baseline Analysis

Campaign Start Date: November 12, 2025
Dataset: data/instances/mexico/mexican_institutions_geocoded.yaml
Methodology: Follows the framework used in the Brazilian campaign (Nov 6-11, 2025)


Current State

Coverage Statistics

  • Total Mexican institutions: 117
  • Current Wikidata coverage: 0/117 (0.0%)
  • Institutions without Wikidata: 117 (100%)

Comparison to Brazilian Campaign

  • Brazil starting point: 19.0% (24/126 institutions)
  • Mexico starting point: 0.0% (0/117 institutions)
  • Mexico advantage: Clean slate, no prior partial enrichment to reconcile

Institution Type Distribution

| Type | Count | % of Total | Target for Enrichment |
| --- | --- | --- | --- |
| MUSEUM | 38 | 32.5% | High priority |
| MIXED | 33 | 28.2% | ⚠️ Aggregations - selective enrichment |
| ARCHIVE | 18 | 15.4% | High priority |
| LIBRARY | 14 | 12.0% | High priority |
| OFFICIAL_INSTITUTION | 8 | 6.8% | Medium priority |
| EDUCATION_PROVIDER | 6 | 5.1% | Medium priority |
| Total | 117 | 100% | |

Non-MIXED institutions: 84 (71.8% of dataset)
MIXED institutions: 33 (28.2% - aggregations, not individual institutions)
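
The percentages above can be re-derived from the raw counts. The following is a minimal Python sketch, assuming the counts are entered by hand; in practice they would be tallied from the YAML dataset, whose field names are not shown in this report.

```python
from collections import Counter

# Counts copied from the type distribution above (not read from the YAML file).
type_counts = Counter({
    "MUSEUM": 38,
    "MIXED": 33,
    "ARCHIVE": 18,
    "LIBRARY": 14,
    "OFFICIAL_INSTITUTION": 8,
    "EDUCATION_PROVIDER": 6,
})

total = sum(type_counts.values())
non_mixed = total - type_counts["MIXED"]


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of the dataset held by each institution type.
shares = {name: pct(count, total) for name, count in type_counts.items()}
non_mixed_share = pct(non_mixed, total)
```

Checking `total`, `non_mixed`, and `shares` against the listed figures is a cheap way to catch a miscount before batching starts.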


Geographic Distribution

Institutions with geocoded cities: 58/117 (49.6%)

Top 15 Cities

  1. Ciudad de México - 4 institutions (national institutions)
  2. Aguascalientes - 3 institutions
  3. Saltillo (Coahuila) - 3 institutions
  4. Oaxaca - 3 institutions
  5. Campeche - 2 institutions
  6. Chihuahua - 2 institutions
  7. Colima - 2 institutions
  8. Durango - 2 institutions
  9. Guadalajara (Jalisco) - 2 institutions
  10. Morelia (Michoacán) - 2 institutions
  11. Puebla - 2 institutions
  12. Zacatecas - 2 institutions
  13. Mexicali (Baja California) - 1 institution
  14. La Paz (Baja California Sur) - 1 institution
  15. Tuxtla Gutiérrez (Chiapas) - 1 institution

Note: 59 institutions (50.4%) lack precise city data and will need manual geocoding during enrichment.


Priority Candidates for Batch 1

National Institutions (Highest Priority)

These institutions are nationally significant and most likely to have Wikidata entries:

  1. Museo Nacional de Antropología (MUSEUM) - Mexico's flagship anthropology museum
  2. Museo Nacional de Arte (MUNAL) (MUSEUM) - National art museum
  3. Biblioteca Nacional de México (LIBRARY) - National library
  4. Cineteca Nacional (ARCHIVE) - National film archive, Ciudad de México
  5. Fototeca Nacional (ARCHIVE) - National photo archive
  6. Instituto Nacional de Antropología e Historia (INAH) (OFFICIAL_INSTITUTION) - Federal heritage agency

Regional INAH Museums (Second Priority)

  1. Museo Regional de Antropología e Historia (INAH) - Regional anthropology museum
  2. Museo Regional de Chiapas (INAH) - Parque Madero, Chiapas
  3. Museo Regional de Historia de Aguascalientes (INAH) - Aguascalientes
  4. Museo Regional de Sonora (INAH) - Sonora state museum

Campaign Goals

Coverage Targets

Following the proven Brazilian methodology:

  • Minimum target: 65% coverage (76/117 institutions; 76/117 is 64.96%, so 77 are needed to strictly reach 65%)
  • Stretch target: 70% coverage (82/117 institutions)
  • Focus: Non-MIXED institutions (84 total)
    • 65% of 84 = 55 institutions
    • 70% of 84 = 59 institutions
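
The target counts follow from simple arithmetic. A small sketch, assuming targets are rounded to the nearest whole institution (which reproduces the figures listed here); the function and variable names are illustrative only.

```python
TOTAL = 117      # all Mexican institutions in the dataset
NON_MIXED = 84   # institutions that are not MIXED aggregations


def target_count(percent: int, population: int) -> int:
    """Institutions needed for a coverage target, rounded to the nearest
    whole institution (an assumption that matches the figures above)."""
    return round(percent * population / 100)


targets = {
    (scope, percent): target_count(percent, population)
    for scope, population in (("all", TOTAL), ("non_mixed", NON_MIXED))
    for percent in (65, 70)
}

# Share of the non-MIXED pool that the dataset-wide minimum represents.
min_share_of_non_mixed = round(100 * targets[("all", 65)] / NON_MIXED, 1)
```

Under these numbers, the dataset-wide minimum of 76 corresponds to about 90.5% of the 84 non-MIXED institutions, which may be worth keeping in mind when choosing which denominator to report against.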

Quality Standards

  • Match threshold: ≥0.85 confidence score
  • Identifier policy: 100% real Wikidata Q-numbers (zero synthetic identifiers)
  • Verification: Manual review of all matches before committing
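
The first two standards can be applied automatically before manual review. This is a hedged sketch: `is_candidate_acceptable` is a hypothetical helper, and the pattern only checks the shape of a Q-number, not whether the item exists on Wikidata.

```python
import re

MATCH_THRESHOLD = 0.85                   # from the quality standards above
QID_PATTERN = re.compile(r"Q[1-9]\d*")   # shape of a Wikidata item ID


def is_candidate_acceptable(qid: str, score: float) -> bool:
    """True if the candidate has a well-formed Q-number and meets the
    confidence threshold. Passing this check only queues the match for
    manual review; it does not prove the Q-number is real."""
    return bool(QID_PATTERN.fullmatch(qid)) and score >= MATCH_THRESHOLD


# "Q42" is used purely as a format placeholder, not as a match for any institution.
print(is_candidate_acceptable("Q42", 0.91))
print(is_candidate_acceptable("SYNTH-001", 0.99))
```

A malformed or synthetic-looking identifier is rejected regardless of its score, so it never reaches the manual review queue.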

Batch Strategy

  • Batch size: 5-6 institutions per batch
  • Estimated batches: 8-10
  • Priority order:
    1. National institutions (Museo Nacional, Biblioteca Nacional, etc.)
    2. Regional INAH museums (Museo Regional de X)
    3. State archives and libraries
    4. Municipal museums
    5. Specialized collections
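
The batching scheme can be sketched as a sort followed by fixed-size chunking. The `type` and `tier` keys and the tier labels below are assumptions made for illustration; the real YAML schema is not shown in this report, and the portal record is a made-up placeholder.

```python
from typing import Dict, Iterator, List

# Hypothetical tier labels mirroring the priority order above.
PRIORITY = {
    "national": 1,
    "regional_inah": 2,
    "state": 3,
    "municipal": 4,
    "specialized": 5,
}


def make_batches(records: List[Dict], size: int = 6) -> Iterator[List[Dict]]:
    """Yield batches of at most `size` records, highest priority first.
    MIXED records are dropped before batching."""
    eligible = [r for r in records if r["type"] != "MIXED"]
    ordered = sorted(eligible, key=lambda r: PRIORITY[r["tier"]])
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


sample = [
    {"name": "Museo Regional de Sonora", "type": "MUSEUM", "tier": "regional_inah"},
    {"name": "Example portal (placeholder)", "type": "MIXED", "tier": "specialized"},
    {"name": "Biblioteca Nacional de México", "type": "LIBRARY", "tier": "national"},
    {"name": "Cineteca Nacional", "type": "ARCHIVE", "tier": "national"},
]
batches = list(make_batches(sample, size=2))
```

Because Python's sort is stable, records within the same tier keep their original order, which keeps batch contents reproducible between runs.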

Expected Challenges

1. Spanish Language Queries

  • Similar to Brazilian Portuguese campaign
  • SPARQL queries should request Spanish labels first, with English as a fallback: SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en" }
  • Many institutions may have only Spanish Wikipedia articles

2. Complex Institutional Structure

  • INAH system: Multiple "Museo Regional" institutions with same name pattern
  • Federal vs. State: Mexico has both federal museums (INAH) and state museums
  • Municipal archives: City-level archives with similar naming (Archivo Municipal de X)
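
One way to surface these naming collisions early is to group institution names by the recurring prefix. A minimal sketch, assuming a simple prefix test is enough to flag candidates; the helper name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Name prefixes called out above as recurring across institutions.
AMBIGUOUS_PREFIXES = ("Museo Regional", "Archivo Municipal")


def group_by_ambiguous_prefix(names: List[str]) -> Dict[str, List[str]]:
    """Group names that share one of the recurring prefixes."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        for prefix in AMBIGUOUS_PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
    return dict(groups)


names = [
    "Museo Regional de Chiapas",
    "Museo Regional de Sonora",
    "Museo Regional de Historia de Aguascalientes",
    "Cineteca Nacional",
]
groups = group_by_ambiguous_prefix(names)
```

Names that land in the same group are the ones where a label match alone is weakest, so they are good candidates for an extra check on state or city during manual review.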

3. Geocoding Gaps

  • 50.4% of institutions lack precise city data
  • Will need to infer from institution names during enrichment
  • Example: "Museo Regional de Chiapas" → likely in Tuxtla Gutiérrez (state capital)
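
The name-based inference could look roughly like the sketch below. The lookup table is a small excerpt, the `source` tag is a hypothetical field, and the result is only a guess that the institution sits in the state capital, so it should stay flagged as inferred until verified.

```python
from typing import Dict, Optional

# Excerpt of Mexican states and their capitals; a full table would cover all states.
STATE_CAPITALS = {
    "Chiapas": "Tuxtla Gutiérrez",
    "Sonora": "Hermosillo",
    "Aguascalientes": "Aguascalientes",
}


def infer_city(institution_name: str) -> Optional[Dict[str, str]]:
    """Guess a city from a state name embedded in the institution name.
    Returns None when no known state name is found."""
    for state, capital in STATE_CAPITALS.items():
        if state in institution_name:
            return {"city": capital, "source": "inferred_from_name"}
    return None


guess = infer_city("Museo Regional de Chiapas")
no_guess = infer_city("Cineteca Nacional")
```

Keeping the `source` tag separate from geocoded values makes it possible to revisit inferred cities later without touching the 58 records that already have city data.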

4. MIXED Institutions

  • 28.2% are aggregations (digital platforms, catalogs, portals)
  • Not appropriate for Wikidata enrichment (no single physical institution)
  • Will skip these to maintain data quality

Campaign Timeline (Projected)

Estimated duration: 5-7 days
Based on: Brazilian campaign completed in 6 days (9 batches)

| Phase | Batches | Target Institutions | Estimated Days |
| --- | --- | --- | --- |
| Phase 1: National | Batch 1-2 | 10-12 institutions | 1-2 days |
| Phase 2: Regional | Batch 3-5 | 15-18 institutions | 2-3 days |
| Phase 3: State/Municipal | Batch 6-8 | 15-18 institutions | 2 days |
| Phase 4: Stretch Goal | Batch 9-10 (optional) | 10-12 institutions | 1-2 days |

Stop criteria: Stop when match confidence falls below the 0.85 threshold or when additional batches yield diminishing returns.
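
A possible per-batch check for the stop criteria is sketched below. Two assumptions are made here and are not taken from the campaign methodology: "match quality" is read as the mean confidence of a batch, and "diminishing returns" is read as fewer than `MIN_ACCEPTED_PER_BATCH` candidates clearing the threshold.

```python
from statistics import mean
from typing import List

MATCH_THRESHOLD = 0.85
MIN_ACCEPTED_PER_BATCH = 2   # hypothetical cutoff for "diminishing returns"


def should_stop(batch_scores: List[float]) -> bool:
    """Return True if the campaign should stop after this batch."""
    if not batch_scores:
        return True  # nothing left to match
    accepted = [s for s in batch_scores if s >= MATCH_THRESHOLD]
    low_quality = mean(batch_scores) < MATCH_THRESHOLD
    few_accepted = len(accepted) < MIN_ACCEPTED_PER_BATCH
    return low_quality or few_accepted


print(should_stop([0.95, 0.90, 0.88]))
print(should_stop([0.90, 0.60, 0.50]))
```

The cutoff of 2 is arbitrary and would need tuning against the batch size of 5-6.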


Success Metrics

Quantitative

  • Minimum 65% Wikidata coverage achieved
  • Zero synthetic Q-numbers generated
  • Average confidence score ≥0.90
  • Zero false positives (manual verification)
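
At the end of the campaign, the quantitative metrics could be checked in one pass. This sketch assumes each accepted match is a dict with hypothetical `score` and `manually_verified` keys, and it rounds coverage to one decimal place, as the percentages in this report are.

```python
from statistics import mean
from typing import Dict, List


def check_quantitative_metrics(matches: List[Dict], total: int = 117) -> Dict[str, bool]:
    """Evaluate coverage, average confidence, and verification status."""
    coverage_pct = round(100 * len(matches) / total, 1)
    scores = [m["score"] for m in matches]
    return {
        "coverage_ok": coverage_pct >= 65.0,
        "avg_confidence_ok": bool(scores) and mean(scores) >= 0.90,
        "all_verified": all(m["manually_verified"] for m in matches),
    }


# 76 identical placeholder matches, just to exercise the check.
example = [{"score": 0.95, "manually_verified": True}] * 76
report = check_quantitative_metrics(example)
```

Note that 76 matches pass only because of the rounding; the unrounded coverage is slightly under 65%.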

Qualitative

  • All national institutions enriched
  • Major regional museums enriched
  • State archives/libraries covered
  • Documentation complete for replication

Deliverables

  • Batch reports for each enrichment round (8-10 reports)
  • Updated Mexican institution YAML with Wikidata identifiers
  • Campaign summary report (following reports/brazil/brazil_campaign_summary.md template)
  • Updated PROGRESS.md with Mexican enrichment section
  • Handoff document for next campaign (India or Argentina)

Next Steps

  1. Execute Batch 1: Enrich 5-6 national institutions

    • Museo Nacional de Antropología
    • Museo Nacional de Arte (MUNAL)
    • Biblioteca Nacional de México
    • Cineteca Nacional
    • Fototeca Nacional
    • Instituto Nacional de Antropología e Historia (INAH)
  2. Document results: Create reports/mexico/batch01_report.md

  3. Iterate: Continue with Batches 2-8 (Batches 9-10 are optional) following the same methodology

  4. Monitor quality: Stop if confidence scores drop below 0.85


References

  • Brazilian campaign: reports/brazil/brazil_campaign_summary.md (67.5% coverage achieved)
  • Methodology: AGENTS.md - Wikidata enrichment workflow
  • Schema: schemas/core.yaml - Identifier class
  • Collision handling: docs/PERSISTENT_IDENTIFIERS.md - Q-number policy

Campaign Status: Baseline complete, ready for Batch 1 execution