# Chilean Heritage Institutions - Batch 1 Wikidata Enrichment Report
- **Date**: November 9, 2025
- **Script**: `scripts/enrich_chilean_batch1_manual.py`
- **Dataset**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml` (90 institutions)
## Batch 1 Results Summary
### Coverage Impact
- **Before**: 0/90 institutions with Wikidata (0.0%)
- **After**: 2/90 institutions with Wikidata (2.2%)
- **Target**: 13 institutions (diverse sample)
- **Success rate**: 2/13 auto-enriched (15.4%)
### Enrichment Breakdown
| Status | Count | Percentage |
|--------|-------|------------|
| ✅ Auto-enriched (≥85% match) | 2 | 15.4% |
| ⚠️ Manual review needed | 1 | 7.7% |
| ❌ Not found in dataset | 10 | 76.9% |
| ⏭️ Already enriched | 0 | 0.0% |
## Successfully Enriched Institutions
### 1. Universidad de Tarapacá ✅
- **Wikidata**: Q3138071
- **Match confidence**: 100.0% (HIGH)
- **Region**: Arica
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3138071
- **Verification**: ✅ Confirmed - Public university in Arica, Chile
### 2. Universidad Católica del Norte ✅
- **Wikidata**: Q3244385
- **Match confidence**: 100.0% (HIGH)
- **Region**: Antofagasta
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3244385
- **Verification**: ✅ Confirmed - Private Catholic university in Antofagasta
## Manual Review Required
### Museo Universidad de Tarapacá San Miguel de Azapa (MASMA) ⚠️
- **Dataset name**: Museo Universidad de Tarapacá San Miguel de Azapa (MASMA)
- **Region**: Arica, City: Arica
- **Suggested match**: Q86280263 (Museo Andino)
- **Match score**: 76.2% (MEDIUM)
- **Manual verification**: ❌ INCORRECT MATCH
  - Q86280263 is "Museo Andino" in Buin (near Santiago)
  - MASMA is an archaeological museum in Arica (different location and focus)
- **Action needed**: Search Wikidata for "Museo San Miguel de Azapa" or create new entry
## Institutions Not Found (10/13)
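These targets could not be matched to dataset records. Most failures trace to name or region differences between the target list and the dataset; a few targets may not be in the dataset at all: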
### Museums (3)
1. **Museo de Historia Natural de Atacama** (Atacama)
   - Not found with this exact name
2. **Museo Indígena Atacameño** (Antofagasta region)
   - Dataset has: "Museo Indígena Atacameño (El Loa)" ✓
   - Region mismatch caused failure (El Loa vs Antofagasta)
3. **Museo de Tocopilla** (Tocopilla, Antofagasta)
   - Not found with this exact name
### Archives (3)
4. **Archivo Central Andrés Bello** (Santiago)
   - Dataset has: "Universidad de Chile's Archivo Central Andrés Bello" ✓
   - Name pattern mismatch
5. **Archivo Central USACH** (Santiago)
   - Dataset has: "USACH's Archivo Patrimonial" ✓
   - Different name (Patrimonial vs Central)
6. **Archivo Histórico del Arzobispado** (Santiago)
   - Dataset has: "Arzobispado's Archivo Histórico" ✓
   - Word order difference
### Libraries (3)
7. **Biblioteca Nacional Digital** (Santiago)
   - Dataset has: "Biblioteca Nacional Digital (Iquique)" ✓
   - Region mismatch (Metropolitana vs Iquique)
8. **Biblioteca Federico Varela** (Atacama)
   - Dataset has: "Biblioteca Pública Federico Varela (Chañaral)" ✓
   - Missing "Pública" in pattern
9. **CRA Escuela El Olivar** (Arica)
   - Dataset has: "Biblioteca CRA El Olivar (Huasco)" ✓
   - Region mismatch (Arica vs Huasco)
### Education Providers (1)
10. **Universidad Arturo Prat** (Iquique, Tarapacá)
    - Not found in dataset (may not be included)
## Key Findings
### 1. Name Matching Challenges
- Chilean dataset uses varied naming conventions:
  - Possessive forms: "Universidad's Archivo" vs "Archivo Universidad"
  - Word order variations: "Arzobispado's Archivo Histórico" vs "Archivo Histórico del Arzobispado"
  - Additional qualifiers: "Biblioteca Pública" vs "Biblioteca"
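
The three variations above can be neutralized by comparing order-independent token sets instead of raw strings. The following is a minimal sketch, not the logic of `enrich_chilean_batch1_manual.py`: the function names, the stopword list, and the qualifier list are assumptions chosen for illustration.

```python
import re
import unicodedata

# Hypothetical lists; tune them against the actual dataset names.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y"}
QUALIFIERS = {"publica", "municipal", "regional"}


def name_tokens(name: str) -> frozenset:
    """Reduce an institution name to an order-independent set of core tokens."""
    # Drop parenthetical location suffixes such as "(Chañaral)".
    name = re.sub(r"\([^)]*\)", " ", name)
    # Remove possessive markers: "Arzobispado's" -> "Arzobispado".
    name = re.sub(r"['’]s\b", "", name)
    # Fold accents so "Histórico" and "Historico" compare equal.
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c)).lower()
    tokens = re.findall(r"[a-z0-9]+", folded)
    return frozenset(t for t in tokens if t not in STOPWORDS | QUALIFIERS)


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets, in the range [0, 1]."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With this sketch, "Arzobispado's Archivo Histórico" and "Archivo Histórico del Arzobispado" reduce to the same token set, as do "Biblioteca Pública Federico Varela (Chañaral)" and "Biblioteca Federico Varela". Names that carry an extra parent institution, such as "Universidad de Chile's Archivo Central Andrés Bello", still score below 1.0 and would need a threshold or a subset check.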
### 2. Region/Location Inconsistencies
- Same institution name appears in different regions
- Some locations are provinces (El Loa, Chañaral, Huasco) not regions
- Need better geographic matching strategy
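
One way to handle the province/region confusion is to map province names to their parent region before comparing locations. The sketch below covers only the three provinces named above; the function names and the pass-through fallback are assumptions for illustration, and a real lookup would need the full list of Chilean provinces.

```python
# Province -> parent region, limited to the provinces named in this report.
PROVINCE_TO_REGION = {
    "El Loa": "Antofagasta",
    "Chañaral": "Atacama",
    "Huasco": "Atacama",
}


def canonical_region(location: str) -> str:
    """Return the parent region for a known province, else the input unchanged."""
    location = location.strip()
    return PROVINCE_TO_REGION.get(location, location)


def same_region(a: str, b: str) -> bool:
    """Compare two location labels after lifting provinces to regions."""
    return canonical_region(a) == canonical_region(b)
```

Under this mapping, the "El Loa vs Antofagasta" mismatch for Museo Indígena Atacameño resolves to the same region, while "Arica vs Huasco" for CRA El Olivar remains a genuine mismatch that needs manual checking.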
### 3. Wikidata Coverage for Chilean Institutions
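- Universities: Good Wikidata coverage (both universities found in the dataset were matched, 2/2)
- Museums: Wikidata coverage appears sparse (the one museum searched, MASMA, returned only a false positive)
- Archives/Libraries: Very limited Wikidata coverage expected; this was not measured directly, since all 6 archive and library targets failed dataset lookup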
### 4. Match Quality
- High-confidence matches (≥85%): Excellent quality, no false positives (2/2 verified correct)
- Medium-confidence matches (70-85%): Require careful verification (the only one in this batch, 1/1, was incorrect)
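
The confidence bands used in this report can be expressed as a small classifier. This is a sketch under stated assumptions: the ≥85% auto-enrich threshold comes from the results table, while treating exactly 70% as MEDIUM, the `LOW` label, and the action names are illustrative choices rather than the script's confirmed behavior.

```python
HIGH_THRESHOLD = 85.0    # from the report: auto-enrich at >= 85%
MEDIUM_THRESHOLD = 70.0  # assumed inclusive lower bound of the review band


def classify_match(score: float) -> str:
    """Map a match score (0-100) to a confidence band."""
    if score >= HIGH_THRESHOLD:
        return "HIGH"
    if score >= MEDIUM_THRESHOLD:
        return "MEDIUM"
    return "LOW"


# Hypothetical action per band; MEDIUM never writes to the dataset unreviewed.
ACTIONS = {"HIGH": "auto_enrich", "MEDIUM": "manual_review", "LOW": "skip"}


def action_for(score: float) -> str:
    """Return the follow-up action for a given match score."""
    return ACTIONS[classify_match(score)]
```

Applied to Batch 1, the two university matches (100.0%) fall in HIGH and the MASMA candidate (76.2%) falls in MEDIUM, which is why it was routed to manual review.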
## Recommendations for Batch 2
### Improved Search Strategy
1. **Normalize institution names** before matching:
   - Strip possessive markers ("'s")
   - Try multiple word orders
   - Handle "Pública", "Municipal", "Regional" qualifiers
2. **Fuzzy location matching**:
   - Match by city first, then region
   - Handle province/region confusion
   - Use lat/lon proximity for ambiguous cases
3. **Focus on high-probability targets**:
   - Major universities (likely in Wikidata)
   - National/regional museums
   - Biblioteca Nacional and major public libraries
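
For the lat/lon proximity check in item 2, a great-circle distance is usually enough to separate same-named institutions in different cities. The sketch below uses the standard haversine formula; the function names and the 25 km default cutoff are assumptions, and the coordinates in the usage note are approximate city centers, not values from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_nearby(p1: tuple, p2: tuple, max_km: float = 25.0) -> bool:
    """True if two (lat, lon) points lie within max_km of each other."""
    return haversine_km(p1[0], p1[1], p2[0], p2[1]) <= max_km


# Approximate city centers, for illustration only.
ARICA = (-18.48, -70.31)
BUIN = (-33.73, -70.74)
```

A check such as `is_nearby(ARICA, BUIN)` returns `False`, since the two cities are well over a thousand kilometers apart. A proximity gate of this kind would have rejected the MASMA/Museo Andino candidate before manual review.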
### Batch 2 Targets (Revised)
Priority institutions with high Wikidata likelihood:

**Universities/Education** (5):

- Universidad de Chile
- Universidad de Santiago de Chile (USACH)
- Universidad de Concepción
- Universidad Austral de Chile
- Pontificia Universidad Católica de Chile

**Major Museums** (5):

- Museo Histórico y Antropológico (Valdivia)
- Museo Colchagua
- Museo Gabriela Mistral
- Museo Antropológico Padre Sebastián Englert (Easter Island)
- Casa Museo Isla Negra (Pablo Neruda)

**National/Regional Libraries** (3):

- Biblioteca Nacional (if in dataset)
- Major university libraries
- Regional archive centers
## Technical Improvements Needed
1. **Better fuzzy matching**:
   - Use token-based matching (not just string similarity)
   - Weight geographic proximity in matching
2. **Wikidata query optimization**:
   - Add region/city filters to SPARQL queries
   - Query by institution coordinates when available
3. **Manual verification workflow**:
   - Export candidates to CSV for batch review
   - Include Wikidata descriptions in output
   - Add Wikipedia links for verification
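
As an illustration of item 2, a coordinate-based lookup can be built on the Wikidata Query Service's `wikibase:around` search. This is a sketch, not the existing script's query: the function name and the default radius and limit are assumptions. It relies on the standard Wikidata identifiers P625 (coordinate location), P17 (country), and Q298 (Chile), and the label service also returns descriptions for the manual review step.

```python
def build_nearby_query(lat: float, lon: float,
                       radius_km: float = 5.0, limit: int = 20) -> str:
    """Build a SPARQL query for Chilean Wikidata items near a coordinate."""
    # WKT points are written longitude first, then latitude.
    return f"""
SELECT ?item ?itemLabel ?itemDescription ?coord WHERE {{
  SERVICE wikibase:around {{
    ?item wdt:P625 ?coord .
    bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
  }}
  ?item wdt:P17 wd:Q298 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "es,en" . }}
}}
LIMIT {limit}
""".strip()
```

The returned string can be sent to the Wikidata SPARQL endpoint by whatever HTTP client the script already uses. Restricting candidates to a few kilometers around the dataset's geocoded position keeps same-named institutions elsewhere in Chile out of the candidate list.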
## Next Steps
1. ✅ Verify the 2 enriched institutions (both universities confirmed correct)
2. ⚠️ Reject the MASMA false positive (Q86280263 is the wrong institution)
3. 🔧 Refine name matching patterns based on actual dataset names
4. 📋 Create Batch 2 with revised targets (focus on universities)
5. 🎯 Goal: Reach 20+ institutions (22% coverage) by end of Batch 2
## Files Generated
- **Input**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml`
- **Backup**: `data/instances/chile/chilean_institutions_geocoded_v2.batch1_backup`
- **Output**: `data/instances/chile/chilean_institutions_batch1_enriched.yaml`
- **Script**: `scripts/enrich_chilean_batch1_manual.py`
---
**Conclusion**: Batch 1 achieved modest success (2 enrichments) but revealed important challenges with name matching and Wikidata coverage. Universities show promise for high-quality enrichment. Archives and libraries need an improved matching strategy before their Wikidata coverage can be assessed.