glam/reports/mexico/baseline_analysis.md
2025-11-19 23:25:22 +01:00

# Mexican Wikidata Enrichment Campaign - Baseline Analysis
**Campaign Start Date:** November 12, 2025

**Dataset:** `data/instances/mexico/mexican_institutions_geocoded.yaml`

**Methodology:** Following proven Brazilian campaign framework (Nov 6-11, 2025)

---
## Current State
### Coverage Statistics
- **Total Mexican institutions:** 117
- **Current Wikidata coverage:** 0/117 (0.0%)
- **Institutions without Wikidata:** 117 (100%)
### Comparison to Brazilian Campaign
- **Brazil starting point:** 19.0% (24/126 institutions)
- **Mexico starting point:** 0.0% (0/117 institutions)
- **Mexico advantage:** Clean slate, no prior partial enrichment to reconcile
---
## Institution Type Distribution
| Type | Count | % of Total | Target for Enrichment |
|------|-------|------------|----------------------|
| MUSEUM | 38 | 32.5% | ✅ High priority |
| MIXED | 33 | 28.2% | ⚠️ Aggregations - selective enrichment |
| ARCHIVE | 18 | 15.4% | ✅ High priority |
| LIBRARY | 14 | 12.0% | ✅ High priority |
| OFFICIAL_INSTITUTION | 8 | 6.8% | ✅ Medium priority |
| EDUCATION_PROVIDER | 6 | 5.1% | ✅ Medium priority |
| **Total** | **117** | **100%** | |

**Non-MIXED institutions:** 84 (71.8% of dataset)

**MIXED institutions:** 33 (28.2% - aggregations, not individual institutions)
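The shares in the table can be rechecked with a few lines of Python. This is only a sketch: the counts are copied from the table rather than read from the dataset, because the YAML field that holds the institution type is not shown in this report.

```python
from collections import Counter

# Counts copied from the table above (not read from the YAML dataset).
type_counts = Counter({
    "MUSEUM": 38,
    "MIXED": 33,
    "ARCHIVE": 18,
    "LIBRARY": 14,
    "OFFICIAL_INSTITUTION": 8,
    "EDUCATION_PROVIDER": 6,
})

total = sum(type_counts.values())

# Percentage of the dataset per type, rounded to one decimal place.
shares = {t: round(100 * n / total, 1) for t, n in type_counts.items()}

# Everything that is not an aggregation (MIXED) is a candidate for enrichment.
non_mixed = total - type_counts["MIXED"]
non_mixed_share = round(100 * non_mixed / total, 1)

for inst_type, count in type_counts.most_common():
    print(f"{inst_type:<22}{count:>4}{shares[inst_type]:>7.1f}%")
print(f"Non-MIXED: {non_mixed} ({non_mixed_share}%)")
```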
---
## Geographic Distribution
**Institutions with geocoded cities:** 58/117 (49.6%)
### Top 15 Cities
1. **Ciudad de México** - 4 institutions (national institutions)
2. **Aguascalientes** - 3 institutions
3. **Saltillo** (Coahuila) - 3 institutions
4. **Oaxaca** - 3 institutions
5. **Campeche** - 2 institutions
6. **Chihuahua** - 2 institutions
7. **Colima** - 2 institutions
8. **Durango** - 2 institutions
9. **Guadalajara** (Jalisco) - 2 institutions
10. **Morelia** (Michoacán) - 2 institutions
11. **Puebla** - 2 institutions
12. **Zacatecas** - 2 institutions
13. **Mexicali** (Baja California) - 1 institution
14. **La Paz** (Baja California Sur) - 1 institution
15. **Tuxtla Gutiérrez** (Chiapas) - 1 institution
**Note:** 59 institutions (50.4%) lack precise city data and will need manual geocoding during enrichment.

---
## Priority Candidates for Batch 1
### National Institutions (Highest Priority)
These institutions are nationally significant and most likely to have Wikidata entries:
1. **Museo Nacional de Antropología** (MUSEUM) - Mexico's flagship anthropology museum
2. **Museo Nacional de Arte (MUNAL)** (MUSEUM) - National art museum
3. **Biblioteca Nacional de México** (LIBRARY) - National library
4. **Cineteca Nacional** (ARCHIVE) - National film archive, Ciudad de México
5. **Fototeca Nacional** (ARCHIVE) - National photo archive
6. **Instituto Nacional de Antropología e Historia (INAH)** (OFFICIAL_INSTITUTION) - Federal heritage agency
### Regional INAH Museums (Second Priority)
7. **Museo Regional de Antropología e Historia (INAH)** - Regional anthropology museum
8. **Museo Regional de Chiapas (INAH)** - Parque Madero, Chiapas
9. **Museo Regional de Historia de Aguascalientes (INAH)** - Aguascalientes
10. **Museo Regional de Sonora (INAH)** - Sonora state museum
---
## Campaign Goals
### Coverage Targets
Following the proven Brazilian methodology:
- **Minimum target:** 65% coverage (76/117 institutions)
- **Stretch target:** 70% coverage (82/117 institutions)
- **Focus:** Non-MIXED institutions (84 total)
  - 65% of 84 = 55 institutions
  - 70% of 84 = 59 institutions
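The target counts above can be reproduced with integer arithmetic. The sketch below assumes the figures were rounded to the nearest whole institution, which matches all four numbers; the helper name `target_count` is illustrative only.

```python
def target_count(percent: int, population: int) -> int:
    """Return percent% of population, rounded to the nearest whole institution."""
    # Integer arithmetic avoids floating-point surprises such as 0.65 * 117.
    return (percent * population + 50) // 100


TOTAL = 117      # all Mexican institutions in the dataset
NON_MIXED = 84   # institutions that are not aggregations

targets = {
    (percent, population): target_count(percent, population)
    for percent in (65, 70)
    for population in (TOTAL, NON_MIXED)
}

for (percent, population), count in sorted(targets.items()):
    print(f"{percent}% of {population} = {count} institutions")
```

Because of the rounding, 76 of 117 is about 64.96%, so a strict "at least 65%" reading of the minimum target would require 77 institutions.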
### Quality Standards
- **Match threshold:** ≥0.85 confidence score
- **Identifier policy:** 100% real Wikidata Q-numbers (zero synthetic identifiers)
- **Verification:** Manual review of all matches before committing
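One way to enforce the first two standards mechanically is a small gate that runs before a match is queued for manual review. This is a sketch under assumptions: the report does not say how the confidence score is computed, and the names `MATCH_THRESHOLD`, `QID_PATTERN`, and `is_acceptable_match` are hypothetical.

```python
import re

MATCH_THRESHOLD = 0.85  # minimum confidence score from the quality standards

# A well-formed Wikidata item identifier: "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_acceptable_match(qid: str, confidence: float) -> bool:
    """Accept a candidate only if the ID is well-formed and the score is high enough.

    This checks the shape of the identifier, not whether the item exists on
    Wikidata, so it complements manual verification rather than replacing it.
    """
    return bool(QID_PATTERN.fullmatch(qid)) and confidence >= MATCH_THRESHOLD


print(is_acceptable_match("Q42", 0.91))      # well-formed ID, score above threshold
print(is_acceptable_match("Q42", 0.80))      # score below threshold
print(is_acceptable_match("SYNTH-1", 0.99))  # not a Wikidata Q-number
```

The format check only catches identifiers that are obviously not Q-numbers; a syntactically valid but wrong Q-number would still pass, which is why the manual review step remains necessary.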
### Batch Strategy
- **Batch size:** 5-6 institutions per batch
- **Estimated batches:** 8-10 batches
- **Priority order:**
  1. National institutions (Museo Nacional, Biblioteca Nacional, etc.)
  2. Regional INAH museums (Museo Regional de X)
  3. State archives and libraries
  4. Municipal museums
  5. Specialized collections
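The batching rule could be sketched as a stable sort by priority tier followed by fixed-size chunking. The `tier` field and the tier keys below are hypothetical, since the dataset schema is not shown here. The last line also works out the capacity of the plan: 8-10 batches of 5-6 institutions cover 40-60 institutions, which is in line with the non-MIXED targets (55-59) rather than the full-dataset targets (76-82).

```python
# Hypothetical tier keys mirroring the priority order listed above.
PRIORITY_TIERS = {
    "national": 1,
    "regional_inah": 2,
    "state": 3,
    "municipal": 4,
    "specialized": 5,
}


def make_batches(institutions, batch_size=6):
    """Sort by priority tier, then split into consecutive batches."""
    # sorted() is stable, so the input order is kept within each tier.
    ordered = sorted(institutions, key=lambda inst: PRIORITY_TIERS[inst["tier"]])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


sample = [
    {"name": "Museo Regional de Sonora (INAH)", "tier": "regional_inah"},
    {"name": "Biblioteca Nacional de México", "tier": "national"},
    {"name": "Museo Nacional de Antropología", "tier": "national"},
]

batches = make_batches(sample, batch_size=2)
for number, batch in enumerate(batches, start=1):
    print(number, [inst["name"] for inst in batch])

# Smallest and largest number of institutions the batch plan can cover.
capacity_range = (8 * 5, 10 * 6)
```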
---
## Expected Challenges
### 1. Spanish Language Queries
- Similar to the Portuguese-language Brazilian campaign
- SPARQL queries will need Spanish labels: `SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en" }`
- Many institutions may have only Spanish Wikipedia articles
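A minimal sketch of how a Spanish-first label lookup might be prepared for the public Wikidata Query Service. The exact match on `rdfs:label`, the helper names `build_label_query` and `build_request_url`, and the default `LIMIT` are illustrative assumptions, not the campaign's actual tooling. The code only builds the query and URL; it does not send a request.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(name: str, limit: int = 10) -> str:
    """Build a SPARQL query for items whose Spanish label equals `name`."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@es .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(name: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": build_label_query(name), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


query = build_label_query("Museo Nacional de Antropología")
url = build_request_url("Museo Nacional de Antropología")
print(query)
```

An exact label match will miss institutions whose Wikidata label differs from the dataset name, so any hits would still go through the manual review required by the quality standards.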
### 2. Complex Institutional Structure
- **INAH system:** Multiple "Museo Regional" institutions with same name pattern
- **Federal vs. State:** Mexico has both federal museums (INAH) and state museums
- **Municipal archives:** City-level archives with similar naming (Archivo Municipal de X)
### 3. Geocoding Gaps
- 50.4% of institutions lack precise city data
- Will need to infer from institution names during enrichment
- Example: "Museo Regional de Chiapas" → likely in Tuxtla Gutiérrez (state capital)
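The name-based inference could look like the sketch below. It assumes a state-capital heuristic, as in the Chiapas example; the lookup table is deliberately partial and the function name `guess_city` is hypothetical. A result is only a candidate and would still need confirmation during enrichment.

```python
from typing import Optional

# Partial lookup of Mexican states to their capitals (extend as needed).
STATE_CAPITALS = {
    "Aguascalientes": "Aguascalientes",
    "Chiapas": "Tuxtla Gutiérrez",
    "Sonora": "Hermosillo",
}


def guess_city(institution_name: str) -> Optional[str]:
    """Guess a likely city from a state name embedded in the institution name.

    Returns None when no known state name appears, so the record can be
    routed to manual geocoding instead of receiving a wrong city.
    """
    for state, capital in STATE_CAPITALS.items():
        if state in institution_name:
            return capital
    return None


print(guess_city("Museo Regional de Chiapas"))
print(guess_city("Fototeca Nacional"))
```

Returning `None` rather than a default city keeps uncertain records visible, which matters because an institution named after a state is not necessarily located in that state's capital.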
### 4. MIXED Institutions
- 28.2% are aggregations (digital platforms, catalogs, portals)
- Not appropriate for Wikidata enrichment (no single physical institution)
- Will skip these to maintain data quality
---
## Campaign Timeline (Projected)
**Estimated duration:** 5-7 days for Phases 1-3; the optional Phase 4 stretch goal adds 1-2 days

**Based on:** Brazilian campaign completed in 6 days (9 batches)
| Phase | Batches | Target Institutions | Estimated Days |
|-------|---------|---------------------|----------------|
| Phase 1: National | Batch 1-2 | 10-12 institutions | 1-2 days |
| Phase 2: Regional | Batch 3-5 | 15-18 institutions | 2-3 days |
| Phase 3: State/Municipal | Batch 6-8 | 15-18 institutions | 2 days |
| Phase 4: Stretch Goal | Batch 9-10 (optional) | 10-12 institutions | 1-2 days |

**Stop criteria:** Stop when match confidence falls below the 0.85 threshold or when additional batches yield diminishing returns.
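The stop rule could be checked after each batch along these lines. The report does not define "match quality" or "diminishing returns", so this sketch assumes the mean confidence of the batch for the former and a hypothetical `min_accepted` count of matches at or above the threshold for the latter.

```python
from statistics import mean


def should_stop(batch_scores, threshold=0.85, min_accepted=1):
    """Return True when a finished batch suggests the campaign should end.

    batch_scores: confidence scores of the candidate matches in the batch.
    min_accepted: hypothetical floor on matches that clear the threshold,
                  used here as a stand-in for "diminishing returns".
    """
    if not batch_scores:
        # No candidates at all counts as a batch with no return.
        return True
    accepted = [score for score in batch_scores if score >= threshold]
    return mean(batch_scores) < threshold or len(accepted) < min_accepted


print(should_stop([0.95, 0.90, 0.88]))  # strong batch
print(should_stop([0.70, 0.80, 0.60]))  # quality below threshold
```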
---
## Success Metrics
### Quantitative
- [ ] Minimum 65% Wikidata coverage achieved
- [ ] Zero synthetic Q-numbers generated
- [ ] Average confidence score ≥0.90
- [ ] Zero false positives (manual verification)
### Qualitative
- [ ] All national institutions enriched
- [ ] Major regional museums enriched
- [ ] State archives/libraries covered
- [ ] Documentation complete for replication
### Deliverables
- [ ] Batch reports for each enrichment round (8-10 reports)
- [ ] Updated Mexican institution YAML with Wikidata identifiers
- [ ] Campaign summary report (following `reports/brazil/brazil_campaign_summary.md` template)
- [ ] Updated PROGRESS.md with Mexican enrichment section
- [ ] Handoff document for next campaign (India or Argentina)
---
## Next Steps
1. **Execute Batch 1:** Enrich 5-6 national institutions
   - Museo Nacional de Antropología
   - Museo Nacional de Arte (MUNAL)
   - Biblioteca Nacional de México
   - Cineteca Nacional
   - Fototeca Nacional
   - Instituto Nacional de Antropología e Historia (INAH)
2. **Document results:** Create `reports/mexico/batch01_report.md`
3. **Iterate:** Continue with Batches 2-8, plus the optional Batches 9-10, following the proven methodology
4. **Monitor quality:** Stop if confidence scores drop below 0.85
---
## References
- **Brazilian campaign:** `reports/brazil/brazil_campaign_summary.md` (67.5% coverage achieved)
- **Methodology:** `AGENTS.md` - Wikidata enrichment workflow
- **Schema:** `schemas/core.yaml` - Identifier class
- **Collision handling:** `docs/PERSISTENT_IDENTIFIERS.md` - Q-number policy
**Campaign Status:** ✅ Baseline complete, ready for Batch 1 execution