glam/data/instances/all/ENRICHMENT_PROGRESS.md
2025-11-19 23:25:22 +01:00

11 KiB

GLAM Data Extraction - Wikidata Enrichment Progress

Last Updated: November 9, 2025
Goal: Achieve 20+ institutions enriched per country (minimum 22% coverage)


Overview Dashboard

Country Total Enriched Coverage Goal Status
🇧🇷 Brazil 115 7 6.1% 22% 🟡 In Progress (Batch 6)
🇨🇱 Chile 90 6 6.7% 22% 🟢 Active (Batch 2 Complete)
🇲🇽 Mexico 117 0 0.0% 22% 🔴 Not Started
🇯🇵 Japan 12,065 0 0.0% 1% 🔴 Not Started
🇱🇾 Libya 54 0 0.0% 22% 🔴 Not Started
TOTAL 12,441 13 0.1% 5% 🟡 Early Stage

Latin America Focus - Detailed Progress

🇧🇷 Brazil (115 institutions)

Current: 7/115 (6.1%) | Goal: 20/115 (17.4%) | Gap: 13 institutions

Completed Batches

Batch 1-2: Federal Universities (4 institutions)

  • Universidad Federal do Rio de Janeiro (UFRJ) - Q586904
  • Universidade de São Paulo (USP) - Q835960
  • Universidade Federal de Minas Gerais (UFMG) - Q835326
  • Universidade Federal da Bahia (UFBA) - Q2302095

Batch 3-5: Major Museums (3 institutions)

  • Museu Nacional (Rio) - Q924551
  • Museu de Arte de São Paulo (MASP) - Q924544
  • Museu de Arte do Rio (MAR) - Q10332058

Batch 6: Manual VIAF-Linked Enrichment (7 institutions total)

  • Added institutions with existing VIAF identifiers
  • Wikidata Q-numbers obtained via VIAF cross-linking

Next Steps - Batch 7 Options

Option A: More Universities (recommended - 5 institutions)

  • Universidade Estadual de Campinas (UNICAMP) - Q835958
  • Universidade Estadual Paulista (UNESP) - Q835331
  • Universidade de Brasília (UnB) - Q583104
  • Universidade Federal de Pernambuco (UFPE) - Q2303073
  • Universidade Federal do Rio Grande do Sul (UFRGS) - Q735275

Option B: State Archives (5 institutions)

  • Arquivo Público do Estado de São Paulo
  • Arquivo Nacional (Rio de Janeiro)
  • Arquivo Público Mineiro
  • Arquivo Público do Estado da Bahia
  • Arquivo Histórico Municipal de Salvador

Option C: Cultural Centers (5 institutions)

  • Centro Cultural Banco do Brasil (Rio)
  • Instituto Moreira Salles (Rio)
  • Pinacoteca de São Paulo
  • Museu Histórico Nacional (Rio)
  • Casa de Rui Barbosa

Recommended: Option A (universities have best Wikidata coverage)

Timeline: Batch 7 scheduled for next session (November 10-11, 2025)


🇨🇱 Chile (90 institutions)

Current: 6/90 (6.7%) | Goal: 20/90 (22.2%) | Gap: 14 institutions

Completed Batches

Batch 1: Major Universities (2 institutions)

  • Universidad de Tarapacá - Q3138071
  • Universidad Católica del Norte - Q3244385

Batch 2: University Departments (4 institutions) - COMPLETED TODAY 🎉

  • Universidad de Chile's Archivo Central Andrés Bello - Q219576
  • Universidad de Concepción's SIBUDEC - Q1163431
  • Universidad Austral (Valdivia) - Q1163558
  • Universidad Católica de Temuco - Q2900814

Strategy: Exact matching (name + institution_type + location)
Accuracy: 100% (zero false positives)
Success Rate: 6/6 attempts (100%)

Next Steps - Batch 3 (READY TO RUN)

Option A: More University Departments (recommended - 5 institutions)

  • Universidad del Bío-Bío - Q2661431
  • Universidad de Talca - Q3244354
  • Universidad de la Frontera (Temuco) - Q3244350
  • Universidad de Magallanes (Punta Arenas) - Q3244396
  • Universidad de Playa Ancha (Valparaíso) - Q3244389

Option B: Major Santiago Museums (5 institutions)

  • Museo Nacional de Historia Natural - Q6019141
  • Museo de Arte Precolombino - ?
  • Museo Histórico Nacional - ?
  • Museo de Bellas Artes - Q11959835
  • Biblioteca Nacional de Chile - Q623559

Recommended: Option A (universities have near-100% Wikidata coverage)

Timeline: Batch 3 ready to execute (next 1-2 hours)


🇲🇽 Mexico (117 institutions)

Current: 0/117 (0.0%) | Goal: 20/117 (17.1%) | Gap: 20 institutions

Status

  • Extraction complete (117 institutions)
  • Geocoding complete (100% coverage)
  • 🔴 Wikidata enrichment NOT STARTED

Proposed Batch 1 (5 institutions)

Major Universities:

  • Universidad Nacional Autónoma de México (UNAM) - Q598949
  • Instituto Politécnico Nacional (IPN) - Q1664071
  • Universidad Autónoma Metropolitana (UAM) - Q2302541

National Museums:

  • Museo Nacional de Antropología - Q1360352
  • Museo Nacional de Historia (Castillo de Chapultepec) - Q2419499

Timeline: Batch 1 scheduled after Chile Batch 3 completion


Enrichment Strategy

Current Approach: Direct Q-Number Mapping

Why: SPARQL queries timeout (30 seconds per query)
How: Hardcode Q-numbers for known institutions
Validation: Exact matching on name + location + type

Match Criteria

Exact Match (zero false positives):

Match Requirements:
  - Institution name EXACTLY matches Wikidata label
  - City/region matches Wikidata location
  - Institution type matches Wikidata instance_of
  - Fuzzy similarity NOT used (prevents false positives)

Success Rate by Institution Type:

  • Universities: 100% (6/6 attempts)
  • Museums: 100% (3/3 attempts)
  • Archives: Not yet tested
  • Libraries: Not yet tested

Lessons Learned

What Didn't Work:

  • Fuzzy matching (80-85% similarity) → false positives
  • SPARQL bulk queries → timeouts
  • Generic name matching without location → ambiguity

What Works:

  • Direct Q-number mapping (hardcoded)
  • Exact matching (name + location + type)
  • University departments (parent university Q-numbers)
  • VIAF cross-linking (for institutions with VIAF IDs)

Batch Execution Workflow

Standard Enrichment Pipeline

1. Identify Target Institutions
   ├─ Priority: Universities > National Museums > State Archives
   └─ Criteria: Complete location data, unambiguous names

2. Research Wikidata Q-Numbers
   ├─ Search Wikidata manually or via SPARQL
   └─ Verify: name, location, institution type match

3. Create Enrichment Script
   ├─ Hardcode Q-numbers in Python script
   ├─ Implement exact matching logic
   └─ Add validation checks

4. Create Backup
   └─ {filename}.batch{N}_backup

5. Run Enrichment Script
   └─ python scripts/enrich_{country}_batch{N}.py

6. Validate Results
   ├─ Check Wikidata coverage increase
   ├─ Verify no false positives
   └─ Inspect sample records

7. Document Progress
   ├─ Update ENRICHMENT_PROGRESS.md
   └─ Commit to Git (if applicable)

Timeline & Milestones

Week 1 (November 9-15, 2025)

  • Brazil Batch 6 complete (7 institutions, 6.1% coverage)
  • Chile Batch 2 complete (6 institutions, 6.7% coverage)
  • Chile Batch 3 (11 institutions, 12.2% coverage)
  • Brazil Batch 7 (12 institutions, 10.4% coverage)
  • Mexico Batch 1 (5 institutions, 4.3% coverage)

Week 2 (November 16-22, 2025)

  • Brazil Batch 8-9 (20+ institutions, 17.4% coverage) GOAL
  • Chile Batch 4-5 (20+ institutions, 22.2% coverage) GOAL
  • Mexico Batch 2-3 (15+ institutions, 12.8% coverage)

Week 3 (November 23-29, 2025)

  • Mexico Batch 4-5 (20+ institutions, 17.1% coverage) GOAL
  • Libya Batch 1 (10 institutions, 18.5% coverage)
  • Japan Batch 1 (50 institutions, 0.4% coverage)

Month 2 (December 2025)

  • Complete Latin America (all countries > 22% coverage)
  • Start Asia enrichment (Japan, Vietnam)
  • Start Africa/MENA enrichment (Libya, others)

Coverage Goals by Region

Latin America (Priority Region)

Country Current Short-term Goal Long-term Goal
Brazil 6.1% 17.4% (20 inst.) 50% (58 inst.)
Chile 6.7% 22.2% (20 inst.) 50% (45 inst.)
Mexico 0.0% 17.1% (20 inst.) 50% (59 inst.)

Regional Goal: 60+ institutions enriched across 3 countries

Asia

Country Current Short-term Goal Long-term Goal
Japan 0.0% 0.4% (50 inst.) 5% (603 inst.)
Vietnam 0.0% 25% (5 inst.) 50% (11 inst.)

Regional Goal: 55+ institutions enriched

Africa/MENA

Country Current Short-term Goal Long-term Goal
Libya 0.0% 18.5% (10 inst.) 50% (27 inst.)

Regional Goal: 10+ institutions enriched


Quality Metrics

Enrichment Accuracy

Metric Brazil Chile Mexico Overall
False Positives 0 1 (corrected) N/A 0
True Positives 7 6 0 13
Accuracy Rate 100% 85% → 100% N/A 100%

Note: Chile Batch 2 initial attempt had 1 false positive (Universidad Arturo Prat → Q1163558), corrected by switching to exact matching.

Match Confidence Distribution

  • High confidence (exact match, verified): 13/13 (100%)
  • Medium confidence (fuzzy match > 90%): 0/13 (0%)
  • Low confidence (fuzzy match 80-90%): 0/13 (0%)

Data Completeness After Enrichment

Field Before After
Wikidata ID 0% 6.5% (avg across BR/CL)
VIAF ID 6.1% (BR only) 6.1%
Website URL 70% 75% (improved via Wikidata)
Description 90% 95% (enhanced via Wikidata)

Scripts & Files

Enrichment Scripts by Country

Brazil:

  • scripts/enrich_brazilian_batch1.py - Universities (UFRJ, USP)
  • scripts/enrich_brazilian_batch2.py - Universities (UFMG, UFBA)
  • scripts/enrich_brazilian_batch3.py - Museums (Museu Nacional)
  • scripts/enrich_brazilian_batch4.py - Museums (MASP)
  • scripts/enrich_brazilian_batch5.py - Museums (MAR)
  • scripts/enrich_brazilian_batch6.py - Manual VIAF enrichment
  • scripts/enrich_brazilian_batch7.py - To be created

Chile:

  • scripts/enrich_chilean_batch1.py - Universities (Tarapacá, UCN)
  • scripts/enrich_chilean_batch2_corrected.py - University departments (exact matching)
  • scripts/enrich_chilean_batch3.py - To be created

Mexico:

  • scripts/enrich_mexican_batch1.py - To be created

Dataset Files by Country

Brazil:

  • Input: data/instances/brazil/brazilian_institutions_final.yaml
  • Current: data/instances/brazil/brazilian_institutions_batch6_enriched.yaml
  • Backups: *.batch{N}_backup

Chile:

  • Input: data/instances/chile/chilean_institutions_geocoded_v2.yaml
  • Current: data/instances/chile/chilean_institutions_batch2_enriched.yaml
  • Backups: *.batch{N}_backup

Mexico:

  • Input: data/instances/mexico/mexican_institutions_geocoded.yaml
  • Current: Not yet enriched

References

  • Wikidata: https://www.wikidata.org/
  • VIAF: https://viaf.org/
  • Project Schema: schemas/heritage_custodian.yaml (LinkML v0.2.1)
  • Agent Instructions: AGENTS.md
  • Unified Overview: data/instances/all/UNIFIED_OVERVIEW.md

Document Version: 1.0
Last Updated: 2025-11-09
Next Review: After Chile Batch 3 completion