# Mexican Wikidata Enrichment Campaign - Baseline Analysis

**Campaign Start Date:** November 12, 2025
**Dataset:** `data/instances/mexico/mexican_institutions_geocoded.yaml`
**Methodology:** Following proven Brazilian campaign framework (Nov 6-11, 2025)

---

## Current State

### Coverage Statistics
- **Total Mexican institutions:** 117
- **Current Wikidata coverage:** 0/117 (0.0%)
- **Institutions without Wikidata:** 117 (100%)

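
The coverage figures above are a ratio of records that carry a Wikidata identifier to all records. A minimal sketch of that calculation follows. The record layout (an `identifiers` list whose entries have an `identifier_scheme` key) and the helper names `has_wikidata` and `wikidata_coverage` are assumptions for illustration, not the confirmed schema of the dataset.

```python
from typing import Iterable, Mapping


def has_wikidata(record: Mapping) -> bool:
    """Return True if the record lists at least one Wikidata identifier.

    Assumes a hypothetical layout: record["identifiers"] is a list of
    mappings, each with an "identifier_scheme" key.
    """
    identifiers = record.get("identifiers") or []
    return any(
        str(entry.get("identifier_scheme", "")).lower() == "wikidata"
        for entry in identifiers
    )


def wikidata_coverage(records: Iterable[Mapping]) -> tuple[int, int, float]:
    """Return (enriched, total, percent) for a collection of records."""
    records = list(records)
    total = len(records)
    enriched = sum(1 for record in records if has_wikidata(record))
    percent = round(100 * enriched / total, 1) if total else 0.0
    return enriched, total, percent


# Baseline state: 117 records, none with a Wikidata identifier yet.
baseline = [{"name": f"institution-{i}", "identifiers": []} for i in range(117)]
print(wikidata_coverage(baseline))  # (0, 117, 0.0)
```

Re-running the same function after each batch would give a comparable coverage number for the batch reports.
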
### Comparison to Brazilian Campaign
- **Brazil starting point:** 19.0% (24/126 institutions)
- **Mexico starting point:** 0.0% (0/117 institutions)
- **Mexico advantage:** Clean slate, no prior partial enrichment to reconcile

---

## Institution Type Distribution

| Type | Count | % of Total | Target for Enrichment |
|------|-------|------------|----------------------|
| MUSEUM | 38 | 32.5% | ✅ High priority |
| MIXED | 33 | 28.2% | ⚠️ Aggregations - selective enrichment |
| ARCHIVE | 18 | 15.4% | ✅ High priority |
| LIBRARY | 14 | 12.0% | ✅ High priority |
| OFFICIAL_INSTITUTION | 8 | 6.8% | ✅ Medium priority |
| EDUCATION_PROVIDER | 6 | 5.1% | ✅ Medium priority |
| **Total** | **117** | **100%** | |

**Non-MIXED institutions:** 84 (71.8% of dataset)
**MIXED institutions:** 33 (28.2% - aggregations, not individual institutions)

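
The percentages in the table follow directly from the counts. The sketch below recomputes them and the non-MIXED subtotal, using only the counts listed above; the function name `type_shares` is an illustrative choice.

```python
from collections import Counter

# Counts copied from the distribution table.
TYPE_COUNTS = Counter({
    "MUSEUM": 38,
    "MIXED": 33,
    "ARCHIVE": 18,
    "LIBRARY": 14,
    "OFFICIAL_INSTITUTION": 8,
    "EDUCATION_PROVIDER": 6,
})


def type_shares(counts):
    """Map each institution type to its share of the dataset, in percent."""
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


shares = type_shares(TYPE_COUNTS)
total = sum(TYPE_COUNTS.values())
non_mixed = total - TYPE_COUNTS["MIXED"]

for kind, share in shares.items():
    print(f"{kind}: {TYPE_COUNTS[kind]} ({share}%)")
print(f"Non-MIXED: {non_mixed} ({round(100 * non_mixed / total, 1)}%)")
```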

---

## Geographic Distribution

**Institutions with geocoded cities:** 58/117 (49.6%)

### Top 15 Cities
1. **Ciudad de México** - 4 institutions (national institutions)
2. **Aguascalientes** - 3 institutions
3. **Saltillo** (Coahuila) - 3 institutions
4. **Oaxaca** - 3 institutions
5. **Campeche** - 2 institutions
6. **Chihuahua** - 2 institutions
7. **Colima** - 2 institutions
8. **Durango** - 2 institutions
9. **Guadalajara** (Jalisco) - 2 institutions
10. **Morelia** (Michoacán) - 2 institutions
11. **Puebla** - 2 institutions
12. **Zacatecas** - 2 institutions
13. **Mexicali** (Baja California) - 1 institution
14. **La Paz** (Baja California Sur) - 1 institution
15. **Tuxtla Gutiérrez** (Chiapas) - 1 institution

**Note:** 59 institutions (50.4%) lack precise city data - will need manual geocoding during enrichment.

---

## Priority Candidates for Batch 1

### National Institutions (Highest Priority)

These institutions are nationally significant and most likely to have Wikidata entries:

1. **Museo Nacional de Antropología** (MUSEUM) - Mexico's flagship anthropology museum
2. **Museo Nacional de Arte (MUNAL)** (MUSEUM) - National art museum
3. **Biblioteca Nacional de México** (LIBRARY) - National library
4. **Cineteca Nacional** (ARCHIVE) - National film archive, Ciudad de México
5. **Fototeca Nacional** (ARCHIVE) - National photo archive
6. **Instituto Nacional de Antropología e Historia (INAH)** (OFFICIAL_INSTITUTION) - Federal heritage agency

### Regional INAH Museums (Second Priority)

7. **Museo Regional de Antropología e Historia (INAH)** - Regional anthropology museum
8. **Museo Regional de Chiapas (INAH)** - Parque Madero, Chiapas
9. **Museo Regional de Historia de Aguascalientes (INAH)** - Aguascalientes
10. **Museo Regional de Sonora (INAH)** - Sonora state museum

---

## Campaign Goals

### Coverage Targets
Following the proven Brazilian methodology:

- **Minimum target:** 65% coverage (76/117 institutions)
- **Stretch target:** 70% coverage (82/117 institutions)
- **Focus:** Non-MIXED institutions (84 total)
  - 65% of 84 ≈ 55 institutions
  - 70% of 84 ≈ 59 institutions

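
The institution counts above match rounding each percentage to the nearest whole institution. The sketch below reproduces them with integer arithmetic; the helper `target_count` and its `strict` flag are illustrative assumptions. The flag shows the one case where rounding matters: 76/117 is about 64.96%, so a strict reading of the 65% minimum would call for 77 institutions.

```python
def target_count(total: int, percent: int, strict: bool = False) -> int:
    """Number of institutions corresponding to a coverage target.

    Integer arithmetic avoids floating-point rounding surprises.
    strict=False rounds to the nearest whole institution (halves round up);
    strict=True rounds up so the percentage is actually reached.
    """
    if strict:
        return -(-total * percent // 100)  # ceiling division
    return (total * percent + 50) // 100   # round to nearest


for total in (117, 84):
    for percent in (65, 70):
        nearest = target_count(total, percent)
        at_least = target_count(total, percent, strict=True)
        print(f"{percent}% of {total}: nearest={nearest}, strict={at_least}")
```
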
### Quality Standards
- **Match threshold:** ≥0.85 confidence score
- **Identifier policy:** 100% real Wikidata Q-numbers (zero synthetic identifiers)
- **Verification:** Manual review of all matches before committing

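
The first two standards can be pre-checked automatically before manual review. The sketch below is one way to do that, assuming each candidate match is a dict with hypothetical `qid` and `confidence` keys. The regular expression only checks that a value is shaped like a Wikidata item ID; it cannot tell whether the item exists, so it complements manual verification rather than replacing it.

```python
import re

# Wikidata item IDs are "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")
MATCH_THRESHOLD = 0.85  # from the quality standards above


def is_valid_qid(value: str) -> bool:
    """True if value is syntactically a Wikidata item ID such as 'Q42'."""
    return bool(QID_PATTERN.fullmatch(value))


def passes_quality_gate(candidate: dict) -> bool:
    """Check a candidate match (hypothetical keys: 'qid', 'confidence')."""
    return (
        is_valid_qid(candidate["qid"])
        and candidate["confidence"] >= MATCH_THRESHOLD
    )


# Placeholder values for illustration only; not real campaign matches.
examples = [
    {"qid": "Q42", "confidence": 0.91},       # well-formed, above threshold
    {"qid": "Q42", "confidence": 0.80},       # below threshold
    {"qid": "Q-SYNTH-1", "confidence": 0.99}, # not a well-formed Q-number
]
print([passes_quality_gate(c) for c in examples])  # [True, False, False]
```
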
### Batch Strategy
- **Batch size:** 5-6 institutions per batch
- **Estimated batches:** 8-10 batches
- **Priority order:**
  1. National institutions (Museo Nacional, Biblioteca Nacional, etc.)
  2. Regional INAH museums (Museo Regional de X)
  3. State archives and libraries
  4. Municipal museums
  5. Specialized collections

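
The priority order and batch size above amount to a sort followed by fixed-size chunking. The sketch below illustrates this under the assumption that each institution has been tagged with a hypothetical `tier` key; the tier labels are invented for the example and are not fields of the dataset. A batch size of 3 is used only to keep the demonstration short.

```python
# Priority tiers in the order listed above; the tier keys are hypothetical labels.
PRIORITY = {
    "national": 1,
    "regional_inah": 2,
    "state": 3,
    "municipal": 4,
    "specialized": 5,
}


def make_batches(institutions, batch_size=6):
    """Sort by priority tier, then cut into consecutive batches.

    sorted() is stable, so institutions keep their input order within a tier.
    """
    ordered = sorted(institutions, key=lambda inst: PRIORITY[inst["tier"]])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


candidates = [
    {"name": "Museo Regional de Sonora (INAH)", "tier": "regional_inah"},
    {"name": "Museo Nacional de Antropología", "tier": "national"},
    {"name": "Museo Regional de Chiapas (INAH)", "tier": "regional_inah"},
    {"name": "Biblioteca Nacional de México", "tier": "national"},
    {"name": "Cineteca Nacional", "tier": "national"},
]
for number, batch in enumerate(make_batches(candidates, batch_size=3), start=1):
    print(number, [inst["name"] for inst in batch])
```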

---

## Expected Challenges

### 1. Spanish Language Queries
- Similar to Brazilian Portuguese campaign
- SPARQL queries will need Spanish labels: `SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en" }`
- Many institutions may have only Spanish Wikipedia articles

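
As an illustration of where the label service clause fits, the sketch below builds a complete query string for an exact Spanish-label lookup restricted to items whose country (`P17`) is Mexico (`Q96`). The function `build_label_query` is a hypothetical helper, exact-label matching is only one possible search strategy, and the code builds the query text without sending it anywhere.

```python
def build_label_query(label: str, language: str = "es", limit: int = 10) -> str:
    """Build a SPARQL query for items in Mexico with an exact label match."""
    # Escape backslashes and double quotes so the label is a valid SPARQL literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel ?itemDescription WHERE {{
  ?item rdfs:label "{escaped}"@{language} ;
        wdt:P17 wd:Q96 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "es,en" }}
}}
LIMIT {limit}
""".strip()


print(build_label_query("Museo Nacional de Antropología"))
```

The resulting text relies on the prefixes (`wd:`, `wdt:`, `rdfs:`, `wikibase:`, `bd:`) that the Wikidata Query Service predefines, so it could be pasted into that service as-is.
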
### 2. Complex Institutional Structure
- **INAH system:** Multiple "Museo Regional" institutions with same name pattern
- **Federal vs. State:** Mexico has both federal museums (INAH) and state museums
- **Municipal archives:** City-level archives with similar naming (Archivo Municipal de X)

### 3. Geocoding Gaps
- 50.4% of institutions lack precise city data
- Will need to infer from institution names during enrichment
- Example: "Museo Regional de Chiapas" → likely in Tuxtla Gutiérrez (state capital)

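
The name-based inference in the example above can be sketched as a lookup from state name to state capital. The mapping below is a small illustrative subset, and `guess_city` is a hypothetical helper. A state museum is not necessarily located in the capital, so the result is only a candidate for manual confirmation.

```python
from typing import Optional

# Illustrative subset of state -> capital pairs; a real run would need
# all 32 federal entities.
STATE_CAPITALS = {
    "Aguascalientes": "Aguascalientes",
    "Chiapas": "Tuxtla Gutiérrez",
    "Coahuila": "Saltillo",
    "Jalisco": "Guadalajara",
    "Michoacán": "Morelia",
    "Sonora": "Hermosillo",
}


def guess_city(institution_name: str) -> Optional[str]:
    """Guess a likely city from a state name embedded in the institution name.

    Returns None when no known state name appears in the input.
    """
    lowered = institution_name.casefold()
    for state, capital in STATE_CAPITALS.items():
        if state.casefold() in lowered:
            return capital
    return None


for name in (
    "Museo Regional de Chiapas",
    "Museo Regional de Sonora (INAH)",
    "Fototeca Nacional",
):
    print(name, "->", guess_city(name))
```
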
### 4. MIXED Institutions
- 28.2% are aggregations (digital platforms, catalogs, portals)
- Not appropriate for Wikidata enrichment (no single physical institution)
- Will skip these to maintain data quality

---

## Campaign Timeline (Projected)

**Estimated duration:** 5-7 days for Phases 1-3; the optional Phase 4 would add 1-2 days
**Based on:** Brazilian campaign completed in 6 days (9 batches)

| Phase | Batches | Target Institutions | Estimated Days |
|-------|---------|---------------------|----------------|
| Phase 1: National | Batch 1-2 | 10-12 institutions | 1-2 days |
| Phase 2: Regional | Batch 3-5 | 15-18 institutions | 2-3 days |
| Phase 3: State/Municipal | Batch 6-8 | 15-18 institutions | 2 days |
| Phase 4: Stretch Goal | Batch 9-10 (optional) | 10-12 institutions | 1-2 days |

**Stop criteria:** Stop when match confidence falls below the 0.85 threshold or when further batches yield diminishing returns.

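
The per-phase ranges in the table can be totalled to see how many institutions the plan covers with and without the optional phase, which is useful when comparing the plan against the coverage targets. The sketch below uses only the numbers from the table; `total_range` is an illustrative helper name.

```python
# (phase, low, high) institution counts copied from the timeline table.
PHASES = [
    ("Phase 1: National", 10, 12),
    ("Phase 2: Regional", 15, 18),
    ("Phase 3: State/Municipal", 15, 18),
    ("Phase 4: Stretch Goal", 10, 12),
]


def total_range(phases):
    """Sum the low and high ends of the per-phase institution ranges."""
    low = sum(phase_low for _, phase_low, _ in phases)
    high = sum(phase_high for _, _, phase_high in phases)
    return low, high


core = total_range(PHASES[:3])  # Phases 1-3 only
full = total_range(PHASES)      # including the optional Phase 4
print(f"Phases 1-3: {core[0]}-{core[1]} institutions")  # 40-48
print(f"Phases 1-4: {full[0]}-{full[1]} institutions")  # 50-60
```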

---

## Success Metrics

### Quantitative
- [ ] Minimum 65% Wikidata coverage achieved
- [ ] Zero synthetic Q-numbers generated
- [ ] Average confidence score ≥0.90
- [ ] Zero false positives (manual verification)

### Qualitative
- [ ] All national institutions enriched
- [ ] Major regional museums enriched
- [ ] State archives/libraries covered
- [ ] Documentation complete for replication

### Deliverables
- [ ] Batch reports for each enrichment round (8-10 reports)
- [ ] Updated Mexican institution YAML with Wikidata identifiers
- [ ] Campaign summary report (following `reports/brazil/brazil_campaign_summary.md` template)
- [ ] Updated PROGRESS.md with Mexican enrichment section
- [ ] Handoff document for next campaign (India or Argentina)

---

## Next Steps

1. **Execute Batch 1:** Enrich 5-6 national institutions
   - Museo Nacional de Antropología
   - Museo Nacional de Arte (MUNAL)
   - Biblioteca Nacional de México
   - Cineteca Nacional
   - Fototeca Nacional
   - Instituto Nacional de Antropología e Historia (INAH)

2. **Document results:** Create `reports/mexico/batch01_report.md`

3. **Iterate:** Continue with Batches 2-8 following the proven methodology

4. **Monitor quality:** Stop if confidence scores drop below 0.85

---

## References

- **Brazilian campaign:** `reports/brazil/brazil_campaign_summary.md` (67.5% coverage achieved)
- **Methodology:** `AGENTS.md` - Wikidata enrichment workflow
- **Schema:** `schemas/core.yaml` - Identifier class
- **Collision handling:** `docs/PERSISTENT_IDENTIFIERS.md` - Q-number policy

**Campaign Status:** ✅ Baseline complete, ready for Batch 1 execution