# Chilean Heritage Institutions - Batch 1 Wikidata Enrichment Report

**Date**: November 9, 2025
**Script**: `scripts/enrich_chilean_batch1_manual.py`
**Dataset**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml` (90 institutions)

## Batch 1 Results Summary

### Coverage Impact
- **Before**: 0/90 institutions with Wikidata (0.0%)
- **After**: 2/90 institutions with Wikidata (2.2%)
- **Target**: 13 institutions (diverse sample)
- **Success rate**: 2/13 auto-enriched (15.4%)

### Enrichment Breakdown
| Status | Count | Percentage |
|--------|-------|------------|
| ✅ Auto-enriched (≥85% match) | 2 | 15.4% |
| ⚠️ Manual review needed | 1 | 7.7% |
| ❌ Not found in dataset | 10 | 76.9% |
| ⏭️ Already enriched | 0 | 0.0% |

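The percentages in the two summaries above follow directly from the counts. A minimal sketch for recomputing them, assuming a hypothetical helper `pct` that is not part of the enrichment script:

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


TOTAL_INSTITUTIONS = 90  # institutions in the dataset
BATCH_TARGETS = 13       # institutions targeted in Batch 1

coverage_after = pct(2, TOTAL_INSTITUTIONS)  # dataset-wide Wikidata coverage
auto_enriched = pct(2, BATCH_TARGETS)        # share of targets auto-enriched
manual_review = pct(1, BATCH_TARGETS)        # share needing manual review
not_found = pct(10, BATCH_TARGETS)           # share not matched in the dataset
```

Note that coverage is measured against all 90 institutions, while the breakdown table is measured against the 13 batch targets.
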
## Successfully Enriched Institutions

### 1. Universidad de Tarapacá ✅
- **Wikidata**: Q3138071
- **Match confidence**: 100.0% (HIGH)
- **Region**: Arica
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3138071
- **Verification**: ✅ Confirmed - Public university in Arica, Chile

### 2. Universidad Católica del Norte ✅
- **Wikidata**: Q3244385
- **Match confidence**: 100.0% (HIGH)
- **Region**: Antofagasta
- **Type**: EDUCATION_PROVIDER
- **URL**: https://www.wikidata.org/wiki/Q3244385
- **Verification**: ✅ Confirmed - Private Catholic university in Antofagasta

## Manual Review Required

### Museo Universidad de Tarapacá San Miguel de Azapa (MASMA) ⚠️
- **Dataset name**: Museo Universidad de Tarapacá San Miguel de Azapa (MASMA)
- **Region**: Arica, City: Arica
- **Suggested match**: Q86280263 (Museo Andino)
- **Match score**: 76.2% (MEDIUM)
- **Manual verification**: ❌ INCORRECT MATCH
  - Q86280263 is "Museo Andino" in Buin (near Santiago)
  - MASMA is an archaeological museum in Arica (different location and focus)
- **Action needed**: Search Wikidata for "Museo San Miguel de Azapa" or create new entry

## Institutions Not Found (10/13)

These targets could not be matched automatically, mostly because of name or region variations between the target list and the dataset:

### Museums (3)
1. **Museo de Historia Natural de Atacama** (Atacama)
   - Not found with this exact name

2. **Museo Indígena Atacameño** (Antofagasta region)
   - Dataset has: "Museo Indígena Atacameño (El Loa)" ✓
   - Region mismatch caused failure (El Loa vs Antofagasta)

3. **Museo de Tocopilla** (Tocopilla, Antofagasta)
   - Not found with this exact name

### Archives (3)
4. **Archivo Central Andrés Bello** (Santiago)
   - Dataset has: "Universidad de Chile's Archivo Central Andrés Bello" ✓
   - Name pattern mismatch

5. **Archivo Central USACH** (Santiago)
   - Dataset has: "USACH's Archivo Patrimonial" ✓
   - Different name (Patrimonial vs Central)

6. **Archivo Histórico del Arzobispado** (Santiago)
   - Dataset has: "Arzobispado's Archivo Histórico" ✓
   - Word order difference

### Libraries (3)
7. **Biblioteca Nacional Digital** (Santiago)
   - Dataset has: "Biblioteca Nacional Digital (Iquique)" ✓
   - Region mismatch (Metropolitana vs Iquique)

8. **Biblioteca Federico Varela** (Atacama)
   - Dataset has: "Biblioteca Pública Federico Varela (Chañaral)" ✓
   - Missing "Pública" in pattern

9. **CRA Escuela El Olivar** (Arica)
   - Dataset has: "Biblioteca CRA El Olivar (Huasco)" ✓
   - Region mismatch (Arica vs Huasco)

### Education Providers (1)
10. **Universidad Arturo Prat** (Iquique, Tarapacá)
    - Not found in dataset (may not be included)

## Key Findings

### 1. Name Matching Challenges
- The Chilean dataset uses varied naming conventions:
  - Possessive forms: "Universidad's Archivo" vs "Archivo Universidad"
  - Word order variations: "Arzobispado's Archivo Histórico" vs "Archivo Histórico del Arzobispado"
  - Additional qualifiers: "Biblioteca Pública" vs "Biblioteca"

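One way to absorb these variations is to compare names as sets of normalized tokens rather than as raw strings. The sketch below is illustrative only and is not the method used by `scripts/enrich_chilean_batch1_manual.py`; the function names, the stopword list, and the qualifier list are all assumptions to be tuned against the real dataset.

```python
import re
import unicodedata

# Assumed lists for illustration; adjust after inspecting the dataset names.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y"}
QUALIFIERS = {"publica", "municipal", "regional"}
IGNORED = STOPWORDS | QUALIFIERS


def name_tokens(name: str) -> frozenset:
    """Reduce an institution name to a set of comparable tokens."""
    # Drop a trailing parenthetical such as "(Chañaral)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    # Strip possessive markers: "Arzobispado's" -> "Arzobispado".
    name = re.sub(r"['’]s\b", "", name)
    # Remove accents, then lowercase.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    tokens = re.findall(r"[a-z0-9]+", plain)
    return frozenset(t for t in tokens if t not in IGNORED)


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets, from 0.0 to 1.0."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Under these assumptions, "Arzobispado's Archivo Histórico" and "Archivo Histórico del Arzobispado" reduce to the same token set, as do "Biblioteca Pública Federico Varela (Chañaral)" and "Biblioteca Federico Varela". Because token sets ignore order, no explicit word-order permutation is needed.
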
### 2. Region/Location Inconsistencies
- The same institution name appears in different regions
- Some locations are provinces (El Loa, Chañaral, Huasco), not regions
- A better geographic matching strategy is needed

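The province/region confusion can be reduced by mapping provinces to their parent regions before comparing locations. The sketch below covers only the three provinces named in this report; the function names are hypothetical, and a full mapping would need to cover every province that appears in the dataset.

```python
# Provinces named in this report, mapped to their parent regions.
# Extend this table as other provinces turn up in the dataset.
PROVINCE_TO_REGION = {
    "El Loa": "Antofagasta",
    "Chañaral": "Atacama",
    "Huasco": "Atacama",
}


def canonical_region(location: str) -> str:
    """Return the parent region for a known province, else the input unchanged."""
    location = location.strip()
    return PROVINCE_TO_REGION.get(location, location)


def same_region(a: str, b: str) -> bool:
    """Compare two location labels after resolving provinces to regions."""
    return canonical_region(a).casefold() == canonical_region(b).casefold()
```

With this mapping, "El Loa" and "Antofagasta" compare as the same region, while "Huasco" and "Arica" remain a genuine mismatch.
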
### 3. Wikidata Coverage for Chilean Institutions
- Universities: Good Wikidata coverage (2 of 2 tested matched, though the sample is small)
- Museums: Sparse Wikidata coverage (many Chilean museums appear to be missing from Wikidata)
- Archives/Libraries: Very limited Wikidata coverage

### 4. Match Quality
- High-confidence matches (≥85%): Excellent quality, no false positives (2 of 2 correct)
- Medium-confidence matches (70-85%): Require careful verification (the single match in this band was incorrect)

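The two confidence bands can be expressed as a small triage function. This is a sketch under stated assumptions: the 85% and 70% thresholds come from this report, while the "reject" outcome for scores below 70% and the function name are assumptions, since the report does not say how lower scores are handled.

```python
AUTO_THRESHOLD = 85.0    # from this report: matches at or above 85% were auto-enriched
REVIEW_THRESHOLD = 70.0  # lower bound of the medium-confidence band (70-85%)


def classify_match(score: float) -> str:
    """Map a match score (0-100) to a triage decision."""
    if score >= AUTO_THRESHOLD:
        return "auto"
    if score >= REVIEW_THRESHOLD:
        return "review"
    # Assumption: anything below the medium band is discarded.
    return "reject"
```

Applied to Batch 1, both university matches (100.0%) fall in the "auto" band and the MASMA candidate (76.2%) falls in the "review" band.
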
## Recommendations for Batch 2

### Improved Search Strategy
1. **Normalize institution names** before matching:
   - Strip possessive markers ("'s")
   - Try multiple word orders
   - Handle "Pública", "Municipal", "Regional" qualifiers

2. **Fuzzy location matching**:
   - Match by city first, then region
   - Handle province/region confusion
   - Use lat/lon proximity for ambiguous cases

3. **Focus on high-probability targets**:
   - Major universities (likely in Wikidata)
   - National/regional museums
   - Biblioteca Nacional and major public libraries

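For the lat/lon proximity check in item 2, a great-circle distance is usually sufficient. The sketch below uses the haversine formula; the helper names and the 25 km default cutoff are assumptions, not values taken from the enrichment script.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_nearby(coord_a: tuple, coord_b: tuple, max_km: float = 25.0) -> bool:
    """True if two (lat, lon) pairs lie within max_km of each other (cutoff is an assumption)."""
    return haversine_km(*coord_a, *coord_b) <= max_km


# Approximate city-centre coordinates, for illustration only.
ARICA = (-18.48, -70.31)
BUIN = (-33.73, -70.74)
```

A check of this kind would have rejected the MASMA candidate before manual review: `is_nearby(ARICA, BUIN)` is false, since the two cities are well over a thousand kilometres apart.
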
### Batch 2 Targets (Revised)
Priority institutions with high Wikidata likelihood:

**Universities/Education** (5):
- Universidad de Chile
- Universidad de Santiago de Chile (USACH)
- Universidad de Concepción
- Universidad Austral de Chile
- Pontificia Universidad Católica de Chile

**Major Museums** (5):
- Museo Histórico y Antropológico (Valdivia)
- Museo Colchagua
- Museo Gabriela Mistral
- Museo Antropológico Padre Sebastián Englert (Easter Island)
- Casa Museo Isla Negra (Pablo Neruda)

**National/Regional Libraries** (3):
- Biblioteca Nacional (if in dataset)
- Major university libraries
- Regional archive centers

## Technical Improvements Needed

1. **Better fuzzy matching**:
   - Use token-based matching (not just string similarity)
   - Weight geographic proximity in matching

2. **Wikidata query optimization**:
   - Add region/city filters to SPARQL queries
   - Query by institution coordinates when available

3. **Manual verification workflow**:
   - Export candidates to CSV for batch review
   - Include Wikidata descriptions in output
   - Add Wikipedia links for verification

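As an illustration of item 2, the sketch below builds a coordinate-based SPARQL query for the Wikidata Query Service. The function name and default values are hypothetical. The query relies on the `wikibase:around` geospatial service, `wdt:P625` (coordinate location), `wdt:P17` (country), and `wd:Q298` (Chile), and assumes the standard prefixes that the public endpoint predefines. It only builds the query string; sending it is left to the caller.

```python
def build_nearby_query(lat: float, lon: float, radius_km: float = 10.0, limit: int = 20) -> str:
    """Build a SPARQL query for Wikidata items in Chile near a coordinate."""
    # WKT points are written as Point(longitude latitude).
    return f"""
SELECT ?item ?itemLabel ?itemDescription ?coord WHERE {{
  SERVICE wikibase:around {{
    ?item wdt:P625 ?coord .
    bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
  }}
  ?item wdt:P17 wd:Q298 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "es,en" . }}
}}
LIMIT {limit}
""".strip()


# Example: candidates within 10 km of approximate Arica coordinates.
query = build_nearby_query(-18.48, -70.31)
```

Selecting `?itemDescription` through the label service also returns the Wikidata description for each candidate, which supports the manual verification workflow in item 3.
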
## Next Steps

1. ✅ Verify the 2 enriched institutions (both universities confirmed correct)
2. ⚠️ Reject the MASMA false positive (Q86280263 is the wrong institution)
3. 🔧 Refine name matching patterns based on actual dataset names
4. 📋 Create Batch 2 with revised targets (focus on universities)
5. 🎯 Goal: Reach 20+ institutions (22% coverage) by the end of Batch 2

## Files Generated
- **Input**: `data/instances/chile/chilean_institutions_geocoded_v2.yaml`
- **Backup**: `data/instances/chile/chilean_institutions_geocoded_v2.batch1_backup`
- **Output**: `data/instances/chile/chilean_institutions_batch1_enriched.yaml`
- **Script**: `scripts/enrich_chilean_batch1_manual.py`

---

**Conclusion**: Batch 1 achieved modest success (2 enrichments) but revealed important challenges with name matching and Wikidata coverage. Universities show promise for high-quality enrichment. Archives and libraries need an improved matching strategy.