glam/CHILEAN_BATCH1_REPORT.md
2025-11-19 23:25:22 +01:00

Chilean Heritage Institutions - Batch 1 Wikidata Enrichment Report

Date: November 9, 2025
Script: scripts/enrich_chilean_batch1_manual.py
Dataset: data/instances/chile/chilean_institutions_geocoded_v2.yaml (90 institutions)

Batch 1 Results Summary

Coverage Impact

  • Before: 0/90 institutions with Wikidata (0.0%)
  • After: 2/90 institutions with Wikidata (2.2%)
  • Target: 13 institutions (diverse sample)
  • Success rate: 2/13 auto-enriched (15.4%)

Enrichment Breakdown

| Status | Count | Percentage |
|---|---|---|
| Auto-enriched (≥85% match) | 2 | 15.4% |
| ⚠️ Manual review needed | 1 | 7.7% |
| Not found in dataset | 10 | 76.9% |
| ⏭️ Already enriched | 0 | 0.0% |

Successfully Enriched Institutions

1. Universidad de Tarapacá

  • Wikidata: Q3138071
  • Match confidence: 100.0% (HIGH)
  • Region: Arica
  • Type: EDUCATION_PROVIDER
  • URL: https://www.wikidata.org/wiki/Q3138071
  • Verification: Confirmed - Public university in Arica, Chile

2. Universidad Católica del Norte

  • Wikidata: Q3244385
  • Match confidence: 100.0% (HIGH)
  • Region: Antofagasta
  • Type: EDUCATION_PROVIDER
  • URL: https://www.wikidata.org/wiki/Q3244385
  • Verification: Confirmed - Private Catholic university in Antofagasta

Manual Review Required

Museo Universidad de Tarapacá San Miguel de Azapa (MASMA) ⚠️

  • Dataset name: Museo Universidad de Tarapacá San Miguel de Azapa (MASMA)
  • Region: Arica, City: Arica
  • Suggested match: Q86280263 (Museo Andino)
  • Match score: 76.2% (MEDIUM)
  • Manual verification: INCORRECT MATCH
    • Q86280263 is "Museo Andino" in Buin (near Santiago)
    • MASMA is an archaeological museum in Arica (different location and focus)
  • Action needed: Search Wikidata for "Museo San Miguel de Azapa" or create new entry

Institutions Not Found (10/13)

These targets could not be matched to dataset entries. Seven failed because of name or location differences; three had no similarly named entry at all:

Museums (3)

  1. Museo de Historia Natural de Atacama (Atacama)

    • Not found with this exact name
  2. Museo Indígena Atacameño (Antofagasta region)

    • Dataset has: "Museo Indígena Atacameño (El Loa)" ✓
    • Region mismatch caused failure (El Loa vs Antofagasta)
  3. Museo de Tocopilla (Tocopilla, Antofagasta)

    • Not found with this exact name

Archives (3)

  1. Archivo Central Andrés Bello (Santiago)

    • Dataset has: "Universidad de Chile's Archivo Central Andrés Bello" ✓
    • Name pattern mismatch
  2. Archivo Central USACH (Santiago)

    • Dataset has: "USACH's Archivo Patrimonial" ✓
    • Different name (Patrimonial vs Central)
  3. Archivo Histórico del Arzobispado (Santiago)

    • Dataset has: "Arzobispado's Archivo Histórico" ✓
    • Word order difference

Libraries (3)

  1. Biblioteca Nacional Digital (Santiago)

    • Dataset has: "Biblioteca Nacional Digital (Iquique)" ✓
    • Region mismatch (Metropolitana vs Iquique)
  2. Biblioteca Federico Varela (Atacama)

    • Dataset has: "Biblioteca Pública Federico Varela (Chañaral)" ✓
    • Missing "Pública" in pattern
  3. CRA Escuela El Olivar (Arica)

    • Dataset has: "Biblioteca CRA El Olivar (Huasco)" ✓
    • Region mismatch (Arica vs Huasco)

Education Providers (1)

  1. Universidad Arturo Prat (Iquique, Tarapacá)
    • Not found in dataset (may not be included)

Key Findings

1. Name Matching Challenges

  • Chilean dataset uses varied naming conventions:
    • Possessive forms: "Universidad's Archivo" vs "Archivo Universidad"
    • Word order variations: "Arzobispado's Archivo Histórico" vs "Archivo Histórico del Arzobispado"
    • Additional qualifiers: "Biblioteca Pública" vs "Biblioteca"
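
The variations listed above can be collapsed with a normalization step before comparison. The following is a minimal sketch, not the logic used in scripts/enrich_chilean_batch1_manual.py: the function names, the stop-word list, and the qualifier list are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Hypothetical word lists; tune them against the real dataset names.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y"}
QUALIFIERS = {"publica", "municipal", "regional"}


def strip_accents(text: str) -> str:
    """Remove diacritics so that 'Histórico' and 'Historico' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(name: str) -> tuple:
    """Return an order-independent comparison key for an institution name."""
    text = strip_accents(name).lower()
    text = re.sub(r"\([^)]*\)", " ", text)  # drop "(El Loa)"-style suffixes
    text = re.sub(r"['’]s\b", " ", text)    # drop possessive markers
    tokens = re.findall(r"[a-z0-9]+", text)
    kept = [t for t in tokens if t not in STOPWORDS and t not in QUALIFIERS]
    return tuple(sorted(kept))              # sorting ignores word order


if __name__ == "__main__":
    print(name_key("Arzobispado's Archivo Histórico"))
    print(name_key("Archivo Histórico del Arzobispado"))
```

Under these assumptions the two Arzobispado spellings and the two Federico Varela spellings produce identical keys. Cases such as "Universidad de Chile's Archivo Central Andrés Bello" still differ from the target name by extra tokens, so they would need a subset or fuzzy comparison on top of this key.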

2. Region/Location Inconsistencies

  • Same institution name appears in different regions
  • Some locations are recorded as provinces (El Loa, Chañaral, Huasco) rather than regions
  • Need better geographic matching strategy
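
One way to handle the province/region confusion is to map province names to their parent region before comparing. This is a minimal sketch under stated assumptions: the lookup table and function names are hypothetical, and only the three provinces named in this report are listed.

```python
# Hypothetical lookup table; a real one would cover every Chilean province.
PROVINCE_TO_REGION = {
    "El Loa": "Antofagasta",
    "Chañaral": "Atacama",
    "Huasco": "Atacama",
}


def canonical_region(location: str) -> str:
    """Map a province name to its region; pass other values through unchanged."""
    cleaned = location.strip()
    return PROVINCE_TO_REGION.get(cleaned, cleaned)


def same_region(a: str, b: str) -> bool:
    """Compare two location labels after lifting provinces to regions."""
    return canonical_region(a).casefold() == canonical_region(b).casefold()


if __name__ == "__main__":
    print(same_region("El Loa", "Antofagasta"))  # province vs. its own region
    print(same_region("Huasco", "Arica"))        # genuinely different places
```

With this mapping the Museo Indígena Atacameño target (Antofagasta) would match the dataset entry labelled El Loa, while the CRA El Olivar case (Arica vs Huasco) remains a real mismatch that needs manual checking.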

3. Wikidata Coverage for Chilean Institutions

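  • Universities: Good Wikidata coverage in this sample (both universities found in the dataset were enriched; 2 is too few to generalize)
  • Museums: Wikidata coverage appears sparse (many Chilean museums not in Wikidata)
  • Archives/Libraries: Wikidata coverage appears very limited, although all six Batch 1 targets failed at the dataset-matching stage, so this batch did not test it directly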

4. Match Quality

  • High-confidence matches (≥85%): Both verified correct (2/2), no false positives
  • Medium-confidence matches (70% to <85%): Require careful verification (the only one, MASMA at 76.2%, was incorrect)
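
The two confidence tiers can be expressed as a small routing function. This sketch assumes that scores below 70% are rejected outright; the report does not say how such scores are handled, so that tier and all names here are hypothetical.

```python
AUTO_THRESHOLD = 85.0    # from the report: >= 85% is auto-enriched
REVIEW_THRESHOLD = 70.0  # from the report: 70% up to 85% needs review


def classify_match(score: float) -> str:
    """Route a match score (0-100) to an action."""
    if score >= AUTO_THRESHOLD:
        return "auto_enrich"
    if score >= REVIEW_THRESHOLD:
        return "manual_review"
    return "reject"  # assumption: the report does not define this tier


if __name__ == "__main__":
    for name, score in [("Universidad de Tarapacá", 100.0), ("MASMA", 76.2)]:
        print(name, classify_match(score))
```

Checking the upper bound first puts a score of exactly 85% in the auto-enrich tier, which removes the ambiguity of a "70-85%" range that shares its endpoint with "≥85%".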

Recommendations for Batch 2

Improved Search Strategy

  1. Normalize institution names before matching:

    • Strip possessive markers ("'s")
    • Try multiple word orders
    • Handle "Pública", "Municipal", "Regional" qualifiers
  2. Fuzzy location matching:

    • Match by city first, then region
    • Handle province/region confusion
    • Use lat/lon proximity for ambiguous cases
  3. Focus on high-probability targets:

    • Major universities (likely in Wikidata)
    • National/regional museums
    • Biblioteca Nacional and major public libraries
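
The lat/lon proximity check suggested above could be sketched with a great-circle distance. The coordinates below are rough approximations for the cities of Arica and Buin, used only to illustrate the MASMA false positive. The 25 km cutoff and the function names are assumptions, not values from the enrichment script.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def plausible_location(record, candidate, max_km: float = 25.0) -> bool:
    """True if a Wikidata candidate lies within max_km of the dataset record."""
    return haversine_km(*record, *candidate) <= max_km


# Approximate city coordinates (lat, lon); illustrative assumptions only.
ARICA = (-18.48, -70.31)
BUIN = (-33.73, -70.74)

if __name__ == "__main__":
    print(round(haversine_km(*ARICA, *BUIN)), "km apart")
    print(plausible_location(ARICA, BUIN))
```

A check like this would have rejected the Q86280263 suggestion before manual review, since Buin lies well over a thousand kilometres south of Arica.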

Batch 2 Targets (Revised)

Priority institutions with high Wikidata likelihood:

Universities/Education (5):

  • Universidad de Chile
  • Universidad de Santiago de Chile (USACH)
  • Universidad de Concepción
  • Universidad Austral de Chile
  • Pontificia Universidad Católica de Chile

Major Museums (5):

  • Museo Histórico y Antropológico (Valdivia)
  • Museo Colchagua
  • Museo Gabriela Mistral
  • Museo Antropológico Padre Sebastián Englert (Easter Island)
  • Casa Museo Isla Negra (Pablo Neruda)

National/Regional Libraries and Archives (3):

  • Biblioteca Nacional (if in dataset)
  • Major university libraries
  • Regional archive centers

Technical Improvements Needed

  1. Better fuzzy matching:

    • Use token-based matching (not just string similarity)
    • Weight geographic proximity in matching
  2. Wikidata query optimization:

    • Add region/city filters to SPARQL queries
    • Query by institution coordinates when available
  3. Manual verification workflow:

    • Export candidates to CSV for batch review
    • Include Wikidata descriptions in output
    • Add Wikipedia links for verification
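
As an illustration of querying by coordinates, the sketch below builds a SPARQL query for the Wikidata Query Service that returns items in Chile (P17 = Q298) with a coordinate location (P625) near a given point, together with labels and descriptions for reviewers. The function name, default radius, and result limit are assumptions. The query is only constructed here, not sent.

```python
def build_nearby_query(lat: float, lon: float,
                       radius_km: float = 10.0, limit: int = 20) -> str:
    """Build a SPARQL query for Chilean Wikidata items near a point."""
    # WKT points are written as "Point(longitude latitude)".
    return f"""
SELECT ?item ?itemLabel ?itemDescription ?coord WHERE {{
  SERVICE wikibase:around {{
    ?item wdt:P625 ?coord .
    bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
  }}
  ?item wdt:P17 wd:Q298 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "es,en" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # Approximate coordinates for Arica; an illustrative assumption.
    print(build_nearby_query(-18.48, -70.31))
```

Requesting `?itemDescription` alongside the label covers the "include Wikidata descriptions" item in the verification workflow, since the description is often enough to spot a wrong-city match.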

Next Steps

  1. Verify the 2 enriched institutions (done: both universities confirmed correct)
  2. ⚠️ Reject the MASMA false positive (Q86280263 is the wrong institution)
  3. 🔧 Refine name matching patterns based on actual dataset names
  4. 📋 Create Batch 2 with revised targets (focus on universities)
  5. 🎯 Goal: Reach 20+ institutions (22% coverage) by end of Batch 2

Files Generated

  • Input: data/instances/chile/chilean_institutions_geocoded_v2.yaml
  • Backup: data/instances/chile/chilean_institutions_geocoded_v2.batch1_backup
  • Output: data/instances/chile/chilean_institutions_batch1_enriched.yaml
  • Script: scripts/enrich_chilean_batch1_manual.py

Conclusion: Batch 1 achieved modest success (2 enrichments out of 13 targets) but revealed important challenges with name matching and Wikidata coverage. Universities show promise for high-quality enrichment. Archives and libraries need an improved matching strategy.