Mexican Dataset Reconciliation Report

Generated: 2025-11-13T09:55:41.451246

Executive Summary

This report documents the reconciliation between the standalone Mexican dataset and the global unified dataset created during the November 11, 2025 unification process.

Dataset Overview

| Dataset | Institutions | Wikidata Coverage |
|---|---|---|
| Standalone (mexican_institutions_geocoded.yaml) | 117 | 10 (8.5%) |
| Global - Mexican Subset (extracted from global file) | 108 | 55 (50.9%) |
| Difference | 9 institutions | +45 Wikidata IDs |
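The coverage figures above can be re-derived from the raw counts. The following is a minimal sketch; the helper name `coverage` is illustrative and not part of the unification tooling.

```python
def coverage(with_wikidata: int, total: int) -> float:
    """Return Wikidata coverage as a percentage, rounded to one decimal."""
    return round(100 * with_wikidata / total, 1)


# Counts taken from the Dataset Overview table.
standalone_pct = coverage(10, 117)
global_pct = coverage(55, 108)

institution_gap = 117 - 108  # institutions only in the standalone file
wikidata_gain = 55 - 10      # additional Wikidata IDs in the global subset

print(f"Standalone: {standalone_pct}%  Global: {global_pct}%")
print(f"Gap: {institution_gap} institutions, gain: +{wikidata_gain} Wikidata IDs")
```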

Key Findings

1. Missing Institutions (9 from Standalone)

Nine institutions appear in the standalone file but not in the global Mexican subset. Eight of them are genuinely absent from the global file:

  1. CLACSO Virtual Libraries (Type: MIXED)
  2. HathiTrust Digital Library (Type: LIBRARY)
  3. Internet Archive (Type: ARCHIVE)
  4. Latin American Network Information Center (LANIC) (Type: MIXED)
  5. Library of Congress Hispanic Reading Room (Type: LIBRARY)
  6. Nettie Lee Benson Collection (UT Austin) (Type: MIXED)
  7. WorldCat Registry (Type: MIXED)
  8. WorldCat.org (Type: MIXED)

Note: The ninth institution, Fonoteca Nacional, was initially counted as missing but does exist in the global file. Its record had no country metadata, so the Mexican subset filter did not pick it up. This has been corrected.

Analysis: All 8 remaining institutions are international platforms and collections based outside Mexico (HathiTrust, Internet Archive, CLACSO, etc.). The November 11 unification correctly filtered them out because they are not Mexican heritage custodians.

Recommendation: No action needed; the filtering was appropriate.
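The comparison behind this finding can be sketched as two set differences: one against the filtered Mexican subset and one against the full global file. Records that fall out of the first but not the second are hidden by the filter, as happened with Fonoteca Nacional. The record shape (`name`, `country`) and the sample values are assumptions for illustration, not the actual dataset schema.

```python
def names(records):
    """Collect institution names from a list of record dicts."""
    return {r["name"] for r in records}


def mexican_subset(records):
    """Keep only records explicitly tagged with country 'MX' (assumed field)."""
    return [r for r in records if r.get("country") == "MX"]


# Illustrative sample only; field names and country values are hypothetical.
standalone = [
    {"name": "Fototeca Nacional", "country": "MX"},
    {"name": "Fonoteca Nacional", "country": "MX"},
    {"name": "Internet Archive", "country": "MX"},
]
global_records = [
    {"name": "Fototeca Nacional", "country": "MX"},
    {"name": "Fonoteca Nacional", "country": None},  # missing country metadata
]

missing_from_subset = names(standalone) - names(mexican_subset(global_records))
truly_missing = names(standalone) - names(global_records)
hidden_by_filter = missing_from_subset - truly_missing

print(sorted(truly_missing))     # absent from the global file entirely
print(sorted(hidden_by_filter))  # present, but invisible to the country filter
```

Checking against the unfiltered file before reporting a record as missing separates filtering decisions from metadata gaps.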


2. Wikidata Identifier Corrections

During reconciliation, the following Wikidata corrections were made:

| Institution | Issue | Resolution |
|---|---|---|
| Fototeca Nacional | Had wrong Wikidata ID (Q5411481 = Fonoteca) | Corrected to Q66432183 |
| Instituto Nacional de Antropología e Historia | Missing Wikidata ID Q901361 | Added Q901361 |
| Fonoteca Nacional | Duplicate entries, one missing Wikidata | Merged duplicates, added Q5411481 |
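Errors like the Fototeca/Fonoteca mix-up can be surfaced by checking whether one Wikidata ID is attached to differently named records. The sketch below assumes a record shape with `name` and `wikidata` fields; the sample mirrors a plausible pre-correction state and is illustrative only.

```python
from collections import defaultdict


def shared_wikidata_ids(records):
    """Map each Wikidata ID used by more than one distinct name to those names."""
    names_by_qid = defaultdict(set)
    for record in records:
        qid = record.get("wikidata")
        if qid:  # skip records with no Wikidata ID
            names_by_qid[qid].add(record["name"])
    return {
        qid: sorted(names)
        for qid, names in names_by_qid.items()
        if len(names) > 1
    }


# Hypothetical pre-correction sample; field names are assumptions.
records = [
    {"name": "Fototeca Nacional", "wikidata": "Q5411481"},  # wrong ID
    {"name": "Fonoteca Nacional", "wikidata": "Q5411481"},
    {"name": "Fonoteca Nacional", "wikidata": None},        # duplicate entry
    {"name": "Instituto Nacional de Antropología e Historia", "wikidata": None},
]

conflicts = shared_wikidata_ids(records)
print(conflicts)
```

A flagged ID still needs manual review, since the check cannot tell which of the records holds the correct assignment.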

3. Wikidata Enrichment Analysis

The global dataset has substantially higher Wikidata coverage than the standalone file:

  • Standalone: 10 Wikidata IDs (8.5%)
  • Global: 55 Wikidata IDs (50.9%)
  • Net gain: +45 Wikidata identifiers

Source of enrichment:

  • 23 institutions have enrichment history records
  • Enrichment occurred during the November 11-13 unification process
  • Methods: Wikidata SPARQL queries, fuzzy matching, manual verification
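The fuzzy matching step is not specified in detail here. As a rough illustration of why it is paired with manual verification, the sketch below scores name similarity with Python's standard `difflib`. The 0.85 threshold and the helper names are assumptions, not the values used during unification.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(name, labels, threshold=0.85):
    """Return (label, score) pairs at or above the threshold, best first."""
    scored = [(label, similarity(name, label)) for label in labels]
    kept = [(label, score) for label, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Two distinct institutions whose names differ by a single letter.
labels = ["Fonoteca Nacional", "Fototeca Nacional"]
matches = candidates("Fototeca Nacional", labels)

for label, score in matches:
    print(f"{label}: {score:.2f}")
```

Both labels clear the threshold, so a matcher that accepts any candidate above it could assign the Fonoteca identifier to Fototeca Nacional. Near-ties like this one are the cases that need a manual check.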

Recommendations

Completed Actions

  1. Corrected Fototeca Nacional Wikidata ID: Q5411481 → Q66432183
  2. Added INAH Wikidata ID: Q901361
  3. Cleaned up Fonoteca Nacional duplicates
  4. Verified international platform filtering

🎯 Next Steps

  1. Update baseline report (reports/mexico/baseline_analysis.md) to reference global dataset
  2. Document the 53 institutions without Wikidata IDs (49.1% of the Mexican subset)
  3. Create enrichment plan for remaining 53 institutions
  4. Archive standalone dataset with clear documentation that global is now authoritative

Files Updated

  • data/instances/all/globalglam-20251113-mexico-deduplicated.yaml - Corrected Wikidata IDs
  • 📝 reports/mexico/reconciliation_report.md - This report

Appendix: Data Quality Metrics

Institution Type Distribution (Mexican Subset)

MUSEUM                          38 (35.2%)
MIXED                           27 (25.0%)
ARCHIVE                         17 (15.7%)
LIBRARY                         12 (11.1%)
OFFICIAL_INSTITUTION             8 ( 7.4%)
EDUCATION_PROVIDER               6 ( 5.6%)
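A distribution in this layout can be produced with `collections.Counter`. This is a minimal sketch that rebuilds the table from the counts above; the column widths are inferred from the listing and the function name is illustrative.

```python
from collections import Counter


def distribution_lines(types):
    """Format type counts as 'NAME  count (pct%)' lines, most frequent first."""
    counts = Counter(types)
    total = sum(counts.values())
    return [
        f"{name:<30}{n:>4} ({100 * n / total:4.1f}%)"
        for name, n in counts.most_common()
    ]


# One entry per institution, using the counts reported above.
types = (
    ["MUSEUM"] * 38
    + ["MIXED"] * 27
    + ["ARCHIVE"] * 17
    + ["LIBRARY"] * 12
    + ["OFFICIAL_INSTITUTION"] * 8
    + ["EDUCATION_PROVIDER"] * 6
)

lines = distribution_lines(types)
print("\n".join(lines))
```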

Geographic Coverage

Top 10 city values by institution count, including records with an unknown city:

Unknown                         25 (23.1%)
Mexico City                     24 (22.2%)
Ciudad de México                 4 ( 3.7%)
Aguascalientes                   3 ( 2.8%)
Saltillo                         3 ( 2.8%)
Oaxaca                           3 ( 2.8%)
Campeche                         2 ( 1.9%)
Chihuahua                        2 ( 1.9%)
Colima                           2 ( 1.9%)
Durango                          2 ( 1.9%)
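"Mexico City" and "Ciudad de México" name the same city, so the listing splits one location across two rows. A small normalization pass, sketched below, would merge them before counting. The alias map is a hypothetical example and would need to be extended for other spelling variants.

```python
from collections import Counter

# Hypothetical alias map: variant spelling -> canonical city name.
CITY_ALIASES = {"Ciudad de México": "Mexico City"}


def normalize_city(city: str) -> str:
    """Return the canonical name for a city, or the input if no alias exists."""
    return CITY_ALIASES.get(city, city)


# Counts for the top three rows of the listing above.
raw_counts = Counter({"Unknown": 25, "Mexico City": 24, "Ciudad de México": 4})

merged = Counter()
for city, n in raw_counts.items():
    merged[normalize_city(city)] += n

total_institutions = 108  # size of the Mexican subset
mexico_city_share = round(100 * merged["Mexico City"] / total_institutions, 1)

print(merged.most_common())
print(f"Mexico City: {merged['Mexico City']} ({mexico_city_share}%)")
```

With the two rows merged, Mexico City accounts for 28 institutions and moves ahead of the Unknown bucket.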