# Chilean Heritage Institutions - Batch 1 Wikidata Enrichment Report

**Date**: November 9, 2025

**Script**: `scripts/enrich_chilean_batch1_manual.py`

**Dataset**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml` (90 institutions)

## Batch 1 Results Summary

### Coverage Impact

- **Before**: 0/90 institutions with Wikidata (0.0%)
- **After**: 2/90 institutions with Wikidata (2.2%)
- **Target**: 13 institutions (diverse sample)
- **Success rate**: 2/13 auto-enriched (15.4%)

### Enrichment Breakdown

| Status | Count | Percentage |
|--------|-------|------------|
| ✅ Auto-enriched (≥85% match) | 2 | 15.4% |
| ⚠️ Manual review needed | 1 | 7.7% |
| ❌ Not found in dataset | 10 | 76.9% |
| ⏭️ Already enriched | 0 | 0.0% |

## Successfully Enriched Institutions

### 1. Universidad de Tarapacá ✅

- **Wikidata**: Q3138071
- **Match confidence**: 100.0% (HIGH)
- **Region**: Arica
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3138071
- **Verification**: ✅ Confirmed - Public university in Arica, Chile

### 2. Universidad Católica del Norte ✅

- **Wikidata**: Q3244385
- **Match confidence**: 100.0% (HIGH)
- **Region**: Antofagasta
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3244385
- **Verification**: ✅ Confirmed - Private Catholic university in Antofagasta

## Manual Review Required

### Museo Universidad de Tarapacá San Miguel de Azapa (MASMA) ⚠️

- **Dataset name**: Museo Universidad de Tarapacá San Miguel de Azapa (MASMA)
- **Region**: Arica, City: Arica
- **Suggested match**: Q86280263 (Museo Andino)
- **Match score**: 76.2% (MEDIUM)
- **Manual verification**: ❌ INCORRECT MATCH
  - Q86280263 is "Museo Andino" in Buin (near Santiago)
  - MASMA is an archaeological museum in Arica (different location and focus)
- **Action needed**: Search Wikidata for "Museo San Miguel de Azapa" or create new entry

## Institutions Not Found (10/13)

These targets could not be matched in the dataset, mostly because of name or region variations:

### Museums (3)

1. **Museo de Historia Natural de Atacama** (Atacama)
   - Not found with this exact name
2. **Museo Indígena Atacameño** (Antofagasta region)
   - Dataset has: "Museo Indígena Atacameño (El Loa)" ✓
   - Region mismatch caused failure (El Loa vs Antofagasta)
3. **Museo de Tocopilla** (Tocopilla, Antofagasta)
   - Not found with this exact name

### Archives (3)

4. **Archivo Central Andrés Bello** (Santiago)
   - Dataset has: "Universidad de Chile's Archivo Central Andrés Bello" ✓
   - Name pattern mismatch
5. **Archivo Central USACH** (Santiago)
   - Dataset has: "USACH's Archivo Patrimonial" ✓
   - Different name (Patrimonial vs Central)
6. **Archivo Histórico del Arzobispado** (Santiago)
   - Dataset has: "Arzobispado's Archivo Histórico" ✓
   - Word order difference

### Libraries (3)

7. **Biblioteca Nacional Digital** (Santiago)
   - Dataset has: "Biblioteca Nacional Digital (Iquique)" ✓
   - Region mismatch (Metropolitana vs Iquique)
8. **Biblioteca Federico Varela** (Atacama)
   - Dataset has: "Biblioteca Pública Federico Varela (Chañaral)" ✓
   - Missing "Pública" in pattern
9. **CRA Escuela El Olivar** (Arica)
   - Dataset has: "Biblioteca CRA El Olivar (Huasco)" ✓
   - Region mismatch (Arica vs Huasco)

### Education Providers (1)

10. **Universidad Arturo Prat** (Iquique, Tarapacá)
    - Not found in dataset (may not be included)

## Key Findings

### 1. Name Matching Challenges

- Chilean dataset uses varied naming conventions:
  - Possessive forms: "Universidad's Archivo" vs "Archivo Universidad"
  - Word order variations: "Arzobispado's Archivo Histórico" vs "Archivo Histórico del Arzobispado"
  - Additional qualifiers: "Biblioteca Pública" vs "Biblioteca"

### 2. Region/Location Inconsistencies

- Same institution name appears in different regions
- Some locations are provinces (El Loa, Chañaral, Huasco), not regions
- A better geographic matching strategy is needed

### 3. Wikidata Coverage for Chilean Institutions

- Universities: Good Wikidata coverage (100% success rate for 2 tested)
- Museums: Sparse Wikidata coverage (many Chilean museums not in Wikidata)
- Archives/Libraries: Very limited Wikidata coverage

### 4. Match Quality

- High-confidence matches (≥85%): Excellent quality, no false positives
- Medium-confidence matches (70-85%): Require careful verification (1/1 was incorrect)

## Recommendations for Batch 2

### Improved Search Strategy

1. **Normalize institution names** before matching:
   - Strip possessive markers ("'s")
   - Try multiple word orders
   - Handle "Pública", "Municipal", "Regional" qualifiers
2. **Fuzzy location matching**:
   - Match by city first, then region
   - Handle province/region confusion
   - Use lat/lon proximity for ambiguous cases
3. **Focus on high-probability targets**:
   - Major universities (likely in Wikidata)
   - National/regional museums
   - Biblioteca Nacional and major public libraries

### Batch 2 Targets (Revised)

Priority institutions with high Wikidata likelihood:

**Universities/Education** (5):

- Universidad de Chile
- Universidad de Santiago de Chile (USACH)
- Universidad de Concepción
- Universidad Austral de Chile
- Pontificia Universidad Católica de Chile

**Major Museums** (5):

- Museo Histórico y Antropológico (Valdivia)
- Museo Colchagua
- Museo Gabriela Mistral
- Museo Antropológico Padre Sebastián Englert (Easter Island)
- Casa Museo Isla Negra (Pablo Neruda)

**National/Regional Libraries** (3):

- Biblioteca Nacional (if in dataset)
- Major university libraries
- Regional archive centers

## Technical Improvements Needed

1. **Better fuzzy matching**:
   - Use token-based matching (not just string similarity)
   - Weight geographic proximity in matching
2. **Wikidata query optimization**:
   - Add region/city filters to SPARQL queries
   - Query by institution coordinates when available
3. **Manual verification workflow**:
   - Export candidates to CSV for batch review
   - Include Wikidata descriptions in output
   - Add Wikipedia links for verification

## Next Steps

1. ✅ Verify the 2 enriched institutions (both universities confirmed correct)
2. ⚠️ Reject the MASMA false positive (Q86280263 is the wrong institution)
3. 🔧 Refine name matching patterns based on actual dataset names
4. 📋 Create Batch 2 with revised targets (focus on universities)
5. 🎯 Goal: Reach 20+ institutions (22% coverage) by end of Batch 2

## Files Generated

- **Input**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml`
- **Backup**: `data/instances/chile/chilean_institutions_geocoded_v2.batch1_backup`
- **Output**: `data/instances/chile/chilean_institutions_batch1_enriched.yaml`
- **Script**: `scripts/enrich_chilean_batch1_manual.py`

---

**Conclusion**: Batch 1 achieved modest success (2 enrichments) but revealed important challenges with name matching and Wikidata coverage. Universities show promise for high-quality enrichment. An improved matching strategy is needed for archives and libraries.
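
The name normalization and token-based matching recommended for Batch 2 could be sketched roughly as follows. This is a minimal illustration, not the logic of `scripts/enrich_chilean_batch1_manual.py`: the function names, the `QUALIFIERS` and `STOPWORDS` lists, and the use of Jaccard token overlap as the similarity score are all assumptions chosen for the example.

```python
import re
import unicodedata

# Hypothetical lists for illustration; the real script may need different ones.
QUALIFIERS = {"publica", "municipal", "regional"}
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y"}


def normalize_name(name):
    """Reduce an institution name to an order-independent set of tokens."""
    # Drop parenthetical location suffixes such as "(Chañaral)".
    name = re.sub(r"\([^)]*\)", " ", name)
    # Strip possessive markers ("Arzobispado's" -> "Arzobispado").
    name = re.sub(r"['’]s\b", " ", name)
    # Fold accents and case so "Histórico" and "historico" compare equal.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch)).lower()
    tokens = re.findall(r"[a-z0-9]+", name)
    return frozenset(
        t for t in tokens if t not in QUALIFIERS and t not in STOPWORDS
    )


def token_similarity(a, b):
    """Jaccard overlap of normalized tokens, in the range 0.0 to 1.0."""
    tokens_a, tokens_b = normalize_name(a), normalize_name(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Word order and possessive form no longer matter.
    print(token_similarity("Arzobispado's Archivo Histórico",
                           "Archivo Histórico del Arzobispado"))  # 1.0
    # The "Pública" qualifier and the province suffix are ignored.
    print(token_similarity("Biblioteca Pública Federico Varela (Chañaral)",
                           "Biblioteca Federico Varela"))  # 1.0
```

Because the comparison works on token sets, it handles the possessive, word-order, and qualifier mismatches listed under "Institutions Not Found", but it does nothing about region mismatches; pairs such as "USACH's Archivo Patrimonial" and "Archivo Central USACH" still score only 0.5 and would need manual review.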