glam/reports/mexico/reconciliation_report.md
2025-11-19 23:25:22 +01:00


# Mexican Dataset Reconciliation Report
Generated: 2025-11-13T09:55:41.451246
## Executive Summary
This report documents the reconciliation between the **standalone Mexican dataset** and the **global unified dataset** created during the November 11, 2025 unification process.
## Dataset Overview
| Dataset | Institutions | Wikidata Coverage |
|---------|-------------|-------------------|
| **Standalone** (`mexican_institutions_geocoded.yaml`) | 117 | 10 (8.5%) |
| **Global - Mexican Subset** (extracted from global file) | 108 | 55 (50.9%) |
| **Difference** | 9 institutions | +45 Wikidata IDs |
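
The percentages and differences in the table follow directly from the raw counts. A minimal sketch of the arithmetic, where the helper name `coverage` is illustrative and not taken from the project:

```python
def coverage(with_wikidata: int, total: int) -> float:
    """Return Wikidata coverage as a percentage, rounded to one decimal."""
    return round(100 * with_wikidata / total, 1)


# Counts taken from the Dataset Overview table.
standalone_pct = coverage(10, 117)
global_pct = coverage(55, 108)

institution_gap = 117 - 108   # institutions only in the standalone file
wikidata_gain = 55 - 10       # additional Wikidata IDs in the global subset

print(f"Standalone: {standalone_pct}%  Global: {global_pct}%")
print(f"Gap: {institution_gap} institutions, gain: +{wikidata_gain} Wikidata IDs")
```
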
## Key Findings
### 1. Missing Institutions (9 Flagged, 8 Confirmed)
The comparison flagged 9 institutions that appear in the standalone file but NOT in the global Mexican subset. The 8 confirmed as absent are:
1. **CLACSO Virtual Libraries** (Type: MIXED)
2. **HathiTrust Digital Library** (Type: LIBRARY)
3. **Internet Archive** (Type: ARCHIVE)
4. **Latin American Network Information Center (LANIC)** (Type: MIXED)
5. **Library of Congress Hispanic Reading Room** (Type: LIBRARY)
6. **Nettie Lee Benson Collection (UT Austin)** (Type: MIXED)
7. **WorldCat Registry** (Type: MIXED)
8. **WorldCat.org** (Type: MIXED)

**Note**: A 9th institution, **Fonoteca Nacional**, was initially flagged as missing but does exist in the global file. Its record lacked country metadata, which made it invisible to the Mexican subset filter. This has been corrected.

**Analysis**: All 8 missing institutions are **non-Mexican international platforms or collections** (HathiTrust, Internet Archive, CLACSO, etc.). They were correctly filtered out during the November 11 unification because they are not Mexican heritage custodians.

**Recommendation**: ✅ **No action needed**; the filtering was appropriate.

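
The Fonoteca Nacional case shows why a name missing from the Mexican subset is not necessarily missing from the global file. The sketch below separates the two situations with set differences. The record layout (`name`, `country`) and the `"MX"` country code are assumptions for illustration; the actual YAML schema is not shown in this report.

```python
# Hypothetical records: field names and the "MX" code are assumptions.
global_records = [
    {"name": "Fototeca Nacional", "country": "MX"},
    {"name": "Fonoteca Nacional", "country": None},  # country metadata missing
]
standalone_names = {"Fototeca Nacional", "Fonoteca Nacional", "Internet Archive"}


def mexican_subset(records):
    """Keep only records explicitly tagged with the Mexican country code."""
    return [r for r in records if r.get("country") == "MX"]


subset_names = {r["name"] for r in mexican_subset(global_records)}
all_global_names = {r["name"] for r in global_records}

flagged = standalone_names - subset_names         # absent from the filtered subset
truly_absent = standalone_names - all_global_names  # absent from the whole global file
hidden_by_filter = flagged - truly_absent         # present, but dropped by the filter

print(sorted(truly_absent))
print(sorted(hidden_by_filter))
```

Checking flagged names against the unfiltered global file before treating them as missing would catch records like this one.
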
---
### 2. Wikidata Identifier Corrections
During reconciliation, the following Wikidata corrections were made:
| Institution | Issue | Resolution |
|------------|-------|------------|
| **Fototeca Nacional** | Had wrong Wikidata ID (Q5411481 = Fonoteca) | ✅ Corrected to Q66432183 |
| **Instituto Nacional de Antropología e Historia** | Missing Wikidata ID Q901361 | ✅ Added Q901361 |
| **Fonoteca Nacional** | Duplicate entries, one missing Wikidata | ✅ Merged duplicates, added Q5411481 |
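
The report does not describe how the duplicate entries were merged. One plausible approach, sketched here under the assumption that duplicates share the same name and that records are flat dictionaries with a `wikidata` field, is to keep the first record and fill its empty fields from later duplicates:

```python
def merge_duplicates(records):
    """Merge records with the same (case-insensitive) name.

    The first occurrence is kept; empty fields are filled from later duplicates.
    """
    merged = {}
    for rec in records:
        key = rec["name"].casefold()
        if key not in merged:
            merged[key] = dict(rec)  # copy so the input is not mutated
            continue
        kept = merged[key]
        for field, value in rec.items():
            if kept.get(field) in (None, "") and value not in (None, ""):
                kept[field] = value
    return list(merged.values())


# Illustrative input using the IDs listed in the corrections table.
records = [
    {"name": "Fonoteca Nacional", "wikidata": None},
    {"name": "Fonoteca Nacional", "wikidata": "Q5411481"},
    {"name": "Fototeca Nacional", "wikidata": "Q66432183"},
]
result = merge_duplicates(records)
for rec in result:
    print(rec["name"], rec["wikidata"])
```
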
---
### 3. Wikidata Enrichment Analysis
Wikidata coverage in the global dataset is roughly six times that of the standalone file:
- **Standalone**: 10 Wikidata IDs (8.5%)
- **Global**: 55 Wikidata IDs (50.9%)
- **Net gain**: +45 Wikidata identifiers
**Source of enrichment**:
- 23 institutions have enrichment history records
- Enrichment occurred during November 11-13 unification process
- Methods: Wikidata SPARQL queries, fuzzy matching, manual verification
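
The fuzzy matching step is not specified further in this report. As a rough illustration, the sketch below uses Python's standard-library `difflib.SequenceMatcher` on accent-stripped, lowercased names. The function names, the candidate table, and the 0.85 threshold are all assumptions, not the project's actual implementation.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def best_match(name, candidates, threshold=0.85):
    """Return (qid, label, score) for the closest label, or None below threshold."""
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(label)).ratio(), qid, label)
        for qid, label in candidates.items()
    ]
    score, qid, label = max(scored)
    return (qid, label, score) if score >= threshold else None


# Candidate labels keyed by the Wikidata IDs mentioned in this report.
candidates = {
    "Q901361": "Instituto Nacional de Antropología e Historia",
    "Q5411481": "Fonoteca Nacional",
}

print(best_match("Instituto Nacional de Antropologia e Historia", candidates))
print(best_match("Internet Archive", candidates))
print(best_match("Fototeca Nacional", candidates))
```

In this sketch "Fototeca Nacional" matches the "Fonoteca Nacional" candidate with a score above 0.9, because the names differ by a single letter. The report does not say this is how the wrong ID was introduced, but it illustrates why the manual verification step matters for near-identical names.
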
---
## Recommendations
### ✅ Completed Actions
1. **Corrected Fototeca Nacional Wikidata ID**: Q5411481 → Q66432183
2. **Added INAH Wikidata ID**: Q901361
3. **Cleaned up Fonoteca Nacional duplicates**
4. **Verified international platform filtering**
### 🎯 Next Steps
1. **Update baseline report** (`reports/mexico/baseline_analysis.md`) to reference global dataset
2. **Document the 53 institutions without Wikidata** (50.9% coverage leaves room for improvement)
3. **Create enrichment plan** for remaining 53 institutions
4. **Archive standalone dataset** with clear documentation that global is now authoritative
---
## Files Updated
- `data/instances/all/globalglam-20251113-mexico-deduplicated.yaml` - Corrected Wikidata IDs
- 📝 `reports/mexico/reconciliation_report.md` - This report
## Appendix: Data Quality Metrics
### Institution Type Distribution (Mexican Subset)
```
MUSEUM                38 (35.2%)
MIXED                 27 (25.0%)
ARCHIVE               17 (15.7%)
LIBRARY               12 (11.1%)
OFFICIAL_INSTITUTION   8 ( 7.4%)
EDUCATION_PROVIDER     6 ( 5.6%)
```
### Geographic Coverage
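Top 10 city values by institution count. `Mexico City` and `Ciudad de México` are unnormalized variants of the same city and together account for 28 institutions (25.9%):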
```
Unknown           25 (23.1%)
Mexico City       24 (22.2%)
Ciudad de México   4 ( 3.7%)
Aguascalientes     3 ( 2.8%)
Saltillo           3 ( 2.8%)
Oaxaca             3 ( 2.8%)
Campeche           2 ( 1.9%)
Chihuahua          2 ( 1.9%)
Colima             2 ( 1.9%)
Durango            2 ( 1.9%)
```
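
The city list counts "Mexico City" and "Ciudad de México" separately. A minimal sketch of how the counts could be merged with an alias table follows; `CITY_ALIASES` and `normalized_city_counts` are hypothetical names, and the input reuses the top-10 figures above with the subset total of 108 institutions.

```python
from collections import Counter

# Hypothetical alias table mapping spelling variants to one canonical name.
CITY_ALIASES = {"Ciudad de México": "Mexico City"}


def normalized_city_counts(counts):
    """Sum counts after mapping each city name to its canonical form."""
    merged = Counter()
    for city, n in counts.items():
        merged[CITY_ALIASES.get(city, city)] += n
    return merged


top10 = {
    "Unknown": 25, "Mexico City": 24, "Ciudad de México": 4,
    "Aguascalientes": 3, "Saltillo": 3, "Oaxaca": 3,
    "Campeche": 2, "Chihuahua": 2, "Colima": 2, "Durango": 2,
}
merged = normalized_city_counts(top10)
mexico_city_share = round(100 * merged["Mexico City"] / 108, 1)

for city, n in merged.most_common(3):
    print(f"{city}: {n}")
print(f"Mexico City share: {mexico_city_share}%")
```
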