# Mexican Dataset Reconciliation Report

Generated: 2025-11-13T09:55:41.451246

## Executive Summary

This report documents the reconciliation between the **standalone Mexican dataset** and the **global unified dataset** created during the November 11, 2025 unification process.

## Dataset Overview

| Dataset | Institutions | Wikidata Coverage |
|---------|-------------|-------------------|
| **Standalone** (`mexican_institutions_geocoded.yaml`) | 117 | 10 (8.5%) |
| **Global - Mexican Subset** (extracted from global file) | 108 | 55 (50.9%) |
| **Difference** | 9 institutions | +45 Wikidata IDs |
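
The figures in the table can be re-derived from the raw counts. The snippet below is a minimal sketch of that arithmetic; the helper name `coverage` is illustrative and not part of the project's tooling.

```python
def coverage(with_wikidata: int, total: int) -> str:
    """Format a Wikidata coverage figure as 'count (percent%)'."""
    return f"{with_wikidata} ({with_wikidata / total:.1%})"


# Counts taken from the Dataset Overview table.
standalone = coverage(10, 117)
global_subset = coverage(55, 108)

institution_difference = 117 - 108   # institutions only in the standalone file
net_gain = 55 - 10                   # additional Wikidata IDs in the global subset
without_wikidata = 108 - 55          # global-subset institutions still lacking an ID

print(standalone, global_subset, institution_difference, net_gain, without_wikidata)
```

The last value, 53, is the number of institutions that the Next Steps section targets for further enrichment.
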

## Key Findings

### 1. Missing Institutions (9 from Standalone)

Nine institutions in the standalone file were initially absent from the global Mexican subset. Eight of them are listed below; the ninth is covered in the note that follows.

1. **CLACSO Virtual Libraries** (Type: MIXED)
2. **HathiTrust Digital Library** (Type: LIBRARY)
3. **Internet Archive** (Type: ARCHIVE)
4. **Latin American Network Information Center (LANIC)** (Type: MIXED)
5. **Library of Congress Hispanic Reading Room** (Type: LIBRARY)
6. **Nettie Lee Benson Collection (UT Austin)** (Type: MIXED)
7. **WorldCat Registry** (Type: MIXED)
8. **WorldCat.org** (Type: MIXED)

**Note**: The ninth institution, **Fonoteca Nacional**, was initially flagged as missing but does exist in the global file. Its record lacked country metadata, so the Mexican subset filter did not select it. This has been corrected.

**Analysis**: The 8 institutions listed above are all **non-Mexican international platforms and collections** (HathiTrust, Internet Archive, CLACSO, etc.). They were correctly filtered out during the November 11 unification because they are not Mexican heritage custodians.

**Recommendation**: ✅ **No action needed** - filtering was appropriate.

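
The comparison in this section, including the Fonoteca Nacional case, can be sketched as a set difference on institution names followed by a second lookup against the unfiltered global file. This is a minimal sketch under stated assumptions: the field names `name` and `country`, the country code `"MX"`, and the toy records are hypothetical, since this report does not show the actual record layout.

```python
def missing_from_subset(standalone, global_records, country="MX"):
    """Split standalone names absent from the country subset into two groups:
    names missing from the global file entirely, and names present in the
    global file but hidden because the record does not pass the country filter."""
    subset_names = {r["name"] for r in global_records if r.get("country") == country}
    all_global_names = {r["name"] for r in global_records}
    standalone_names = {r["name"] for r in standalone}

    absent_from_subset = standalone_names - subset_names
    hidden_by_filter = absent_from_subset & all_global_names
    truly_missing = absent_from_subset - all_global_names
    return sorted(truly_missing), sorted(hidden_by_filter)


# Toy records for illustration only (hypothetical layout).
standalone = [
    {"name": "Fonoteca Nacional"},
    {"name": "Fototeca Nacional"},
    {"name": "Internet Archive"},
]
global_records = [
    {"name": "Fototeca Nacional", "country": "MX"},
    {"name": "Fonoteca Nacional"},  # no country metadata, so the filter skips it
]

truly_missing, hidden_by_filter = missing_from_subset(standalone, global_records)
print("missing from global file:", truly_missing)
print("present but hidden by the country filter:", hidden_by_filter)
```
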

---

### 2. Wikidata Identifier Corrections

During reconciliation, the following Wikidata corrections were made:

| Institution | Issue | Resolution |
|------------|-------|------------|
| **Fototeca Nacional** | Carried the wrong Wikidata ID (Q5411481, which identifies Fonoteca Nacional) | ✅ Corrected to Q66432183 |
| **Instituto Nacional de Antropología e Historia** | Missing Wikidata ID | ✅ Added Q901361 |
| **Fonoteca Nacional** | Duplicate entries, one missing its Wikidata ID | ✅ Merged duplicates, added Q5411481 |
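
Two of the three issues in the table (an identifier shared by differently named institutions, and duplicate entries for one institution) are the kind of problem a simple consistency check can surface. The following is a minimal sketch, not the procedure actually used for this report; the field names `name` and `wikidata_id` and the sample records are assumptions that reconstruct the pre-correction state for illustration.

```python
from collections import defaultdict


def find_conflicts(records):
    """Return (shared_ids, duplicate_names).

    shared_ids maps a Wikidata ID to the sorted names that share it,
    for IDs attached to more than one distinct name.
    duplicate_names lists names that occur in more than one record.
    """
    names_by_qid = defaultdict(set)
    count_by_name = defaultdict(int)
    for rec in records:
        count_by_name[rec["name"]] += 1
        qid = rec.get("wikidata_id")
        if qid:
            names_by_qid[qid].add(rec["name"])

    shared_ids = {q: sorted(n) for q, n in names_by_qid.items() if len(n) > 1}
    duplicate_names = sorted(n for n, c in count_by_name.items() if c > 1)
    return shared_ids, duplicate_names


# Illustrative reconstruction of the state before the corrections.
records = [
    {"name": "Fototeca Nacional", "wikidata_id": "Q5411481"},
    {"name": "Fonoteca Nacional", "wikidata_id": "Q5411481"},
    {"name": "Fonoteca Nacional"},  # duplicate entry without an identifier
]

shared_ids, duplicate_names = find_conflicts(records)
print(shared_ids)
print(duplicate_names)
```
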

---

### 3. Wikidata Enrichment Analysis

Wikidata coverage rises from 8.5% in the standalone file to 50.9% in the global Mexican subset:

- **Standalone**: 10 Wikidata IDs (8.5%)
- **Global**: 55 Wikidata IDs (50.9%)
- **Net gain**: +45 Wikidata identifiers

**Source of enrichment**:

- 23 institutions have enrichment history records
- Enrichment occurred during the November 11-13 unification process
- Methods: Wikidata SPARQL queries, fuzzy matching, manual verification
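
The report does not specify which fuzzy-matching algorithm was used. As a stand-in, the sketch below uses `difflib.SequenceMatcher` from the Python standard library on accent-stripped, lowercased names; the function names, the candidate table, and the 0.9 threshold are all assumptions. It also suggests why manual verification is listed alongside fuzzy matching: "Fototeca Nacional" and "Fonoteca Nacional" differ by one letter and score about 0.94 under this measure, so a 0.9 threshold alone would pair them.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and strip accents so 'Antropología' compares equal to 'Antropologia'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower().strip()


def best_match(name, candidates, threshold=0.9):
    """Return (label, qid, score) for the best candidate at or above threshold, else None."""
    best = None
    for label, qid in candidates.items():
        score = SequenceMatcher(None, normalize(name), normalize(label)).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (label, qid, score)
    return best


# Hypothetical candidate labels, e.g. as returned by a Wikidata lookup.
candidates = {
    "Instituto Nacional de Antropologia e Historia": "Q901361",
    "Fonoteca Nacional": "Q5411481",
}

exact = best_match("Instituto Nacional de Antropología e Historia", candidates)
near_miss = best_match("Fototeca Nacional", candidates)           # pairs with Fonoteca at 0.9
strict = best_match("Fototeca Nacional", candidates, threshold=0.95)

print(exact, near_miss, strict)
```
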

---

## Recommendations

### ✅ Completed Actions

1. **Corrected Fototeca Nacional Wikidata ID**: Q5411481 → Q66432183
2. **Added INAH Wikidata ID**: Q901361
3. **Cleaned up Fonoteca Nacional duplicates**
4. **Verified international platform filtering**

### 🎯 Next Steps

1. **Update baseline report** (`reports/mexico/baseline_analysis.md`) to reference the global dataset
2. **Document the 53 institutions without Wikidata IDs** (50.9% coverage leaves room for improvement)
3. **Create enrichment plan** for the remaining 53 institutions
4. **Archive standalone dataset** with clear documentation that the global dataset is now authoritative

---

## Files Updated

- ✅ `data/instances/all/globalglam-20251113-mexico-deduplicated.yaml` - Corrected Wikidata IDs
- 📝 `reports/mexico/reconciliation_report.md` - This report

## Appendix: Data Quality Metrics

### Institution Type Distribution (Mexican Subset)

```
MUSEUM                  38 (35.2%)
MIXED                   27 (25.0%)
ARCHIVE                 17 (15.7%)
LIBRARY                 12 (11.1%)
OFFICIAL_INSTITUTION     8 ( 7.4%)
EDUCATION_PROVIDER       6 ( 5.6%)
```
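
A distribution like the one above can be produced with `collections.Counter`. This is a minimal sketch, not the script that generated the report; the input is simply a list of type labels rebuilt here from the reported counts, and the column width is an arbitrary choice.

```python
from collections import Counter


def distribution(values, width=22):
    """Return one formatted line per label, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        f"{label:<{width}}{n:>3} ({n / total:5.1%})"
        for label, n in counts.most_common()
    ]


# Type labels rebuilt from the counts reported in this appendix.
reported = {
    "MUSEUM": 38, "MIXED": 27, "ARCHIVE": 17,
    "LIBRARY": 12, "OFFICIAL_INSTITUTION": 8, "EDUCATION_PROVIDER": 6,
}
types = [label for label, n in reported.items() for _ in range(n)]

lines = distribution(types)
print("\n".join(lines))
```

The six counts sum to 108, which matches the size of the global Mexican subset in the Dataset Overview table.
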

### Geographic Coverage

Top 10 cities by institution count:

```
Unknown             25 (23.1%)
Mexico City         24 (22.2%)
Ciudad de México     4 ( 3.7%)
Aguascalientes       3 ( 2.8%)
Saltillo             3 ( 2.8%)
Oaxaca               3 ( 2.8%)
Campeche             2 ( 1.9%)
Chihuahua            2 ( 1.9%)
Colima               2 ( 1.9%)
Durango              2 ( 1.9%)
```
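
"Mexico City" and "Ciudad de México" are two names for the same city, so the listing above splits one location across two rows. The sketch below shows one possible way to fold such variants together before counting; the alias table and function names are hypothetical, and only the first four rows of the listing are used as input.

```python
from collections import Counter

# Hypothetical alias table: casefolded variant -> canonical label.
CITY_ALIASES = {"ciudad de méxico": "Mexico City"}


def canonical_city(city):
    """Map a raw city value to its canonical label; empty values become 'Unknown'."""
    if not city or not city.strip():
        return "Unknown"
    cleaned = city.strip()
    return CITY_ALIASES.get(cleaned.casefold(), cleaned)


def merge_counts(counts):
    """Re-aggregate per-city counts after canonicalizing the city names."""
    merged = Counter()
    for city, n in counts.items():
        merged[canonical_city(city)] += n
    return merged


# First four rows of the listing above.
reported = {"Unknown": 25, "Mexico City": 24, "Ciudad de México": 4, "Aguascalientes": 3}
merged = merge_counts(reported)
share = f"{merged['Mexico City'] / 108:.1%}"  # 108 institutions in the Mexican subset

print(merged.most_common(), share)
```

Under this assumption the combined Mexico City count would be 28, ahead of the 25 records with an unknown city.
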