11 KiB
GLAM Data Extraction - Wikidata Enrichment Progress
Last Updated: November 9, 2025
Goal: Achieve 20+ institutions enriched per country (minimum 22% coverage)
Overview Dashboard
| Country | Total | Enriched | Coverage | Goal | Status |
|---|---|---|---|---|---|
| 🇧🇷 Brazil | 115 | 7 | 6.1% | 22% | 🟡 In Progress (Batch 6) |
| 🇨🇱 Chile | 90 | 6 | 6.7% | 22% | 🟢 Active (Batch 2 Complete) |
| 🇲🇽 Mexico | 117 | 0 | 0.0% | 22% | 🔴 Not Started |
| 🇯🇵 Japan | 12,065 | 0 | 0.0% | 1% | 🔴 Not Started |
| 🇱🇾 Libya | 54 | 0 | 0.0% | 22% | 🔴 Not Started |
| TOTAL | 12,441 | 13 | 0.1% | 5% | 🟡 Early Stage |
Latin America Focus - Detailed Progress
🇧🇷 Brazil (115 institutions)
Current: 7/115 (6.1%) | Goal: 20/115 (17.4%) | Gap: 13 institutions
Completed Batches
✅ Batch 1-2: Federal Universities (4 institutions)
- Universidad Federal do Rio de Janeiro (UFRJ) - Q586904
- Universidade de São Paulo (USP) - Q835960
- Universidade Federal de Minas Gerais (UFMG) - Q835326
- Universidade Federal da Bahia (UFBA) - Q2302095
✅ Batch 3-5: Major Museums (3 institutions)
- Museu Nacional (Rio) - Q924551
- Museu de Arte de São Paulo (MASP) - Q924544
- Museu de Arte do Rio (MAR) - Q10332058
✅ Batch 6: Manual VIAF-Linked Enrichment (7 institutions total)
- Added institutions with existing VIAF identifiers
- Wikidata Q-numbers obtained via VIAF cross-linking
Next Steps - Batch 7 Options
Option A: More Universities (recommended - 5 institutions)
- Universidade Estadual de Campinas (UNICAMP) - Q835958
- Universidade Estadual Paulista (UNESP) - Q835331
- Universidade de Brasília (UnB) - Q583104
- Universidade Federal de Pernambuco (UFPE) - Q2303073
- Universidade Federal do Rio Grande do Sul (UFRGS) - Q735275
Option B: State Archives (5 institutions)
- Arquivo Público do Estado de São Paulo
- Arquivo Nacional (Rio de Janeiro)
- Arquivo Público Mineiro
- Arquivo Público do Estado da Bahia
- Arquivo Histórico Municipal de Salvador
Option C: Cultural Centers (5 institutions)
- Centro Cultural Banco do Brasil (Rio)
- Instituto Moreira Salles (Rio)
- Pinacoteca de São Paulo
- Museu Histórico Nacional (Rio)
- Casa de Rui Barbosa
Recommended: Option A (universities have best Wikidata coverage)
Timeline: Batch 7 scheduled for next session (November 10-11, 2025)
🇨🇱 Chile (90 institutions)
Current: 6/90 (6.7%) | Goal: 20/90 (22.2%) | Gap: 14 institutions
Completed Batches
✅ Batch 1: Major Universities (2 institutions)
- Universidad de Tarapacá - Q3138071
- Universidad Católica del Norte - Q3244385
✅ Batch 2: University Departments (4 institutions) - COMPLETED TODAY 🎉
- Universidad de Chile's Archivo Central Andrés Bello - Q219576
- Universidad de Concepción's SIBUDEC - Q1163431
- Universidad Austral (Valdivia) - Q1163558
- Universidad Católica de Temuco - Q2900814
Strategy: Exact matching (name + institution_type + location)
Accuracy: 100% (zero false positives)
Success Rate: 6/6 attempts (100%)
Next Steps - Batch 3 (READY TO RUN)
Option A: More University Departments (recommended - 5 institutions)
- Universidad del Bío-Bío - Q2661431
- Universidad de Talca - Q3244354
- Universidad de la Frontera (Temuco) - Q3244350
- Universidad de Magallanes (Punta Arenas) - Q3244396
- Universidad de Playa Ancha (Valparaíso) - Q3244389
Option B: Major Santiago Museums (5 institutions)
- Museo Nacional de Historia Natural - Q6019141
- Museo de Arte Precolombino - ?
- Museo Histórico Nacional - ?
- Museo de Bellas Artes - Q11959835
- Biblioteca Nacional de Chile - Q623559
Recommended: Option A (universities have near-100% Wikidata coverage)
Timeline: Batch 3 ready to execute (next 1-2 hours)
🇲🇽 Mexico (117 institutions)
Current: 0/117 (0.0%) | Goal: 20/117 (17.1%) | Gap: 20 institutions
Status
- ✅ Extraction complete (117 institutions)
- ✅ Geocoding complete (100% coverage)
- 🔴 Wikidata enrichment NOT STARTED
Proposed Batch 1 (5 institutions)
Major Universities:
- Universidad Nacional Autónoma de México (UNAM) - Q598949
- Instituto Politécnico Nacional (IPN) - Q1664071
- Universidad Autónoma Metropolitana (UAM) - Q2302541
National Museums:
- Museo Nacional de Antropología - Q1360352
- Museo Nacional de Historia (Castillo de Chapultepec) - Q2419499
Timeline: Batch 1 scheduled after Chile Batch 3 completion
Enrichment Strategy
Current Approach: Direct Q-Number Mapping
Why: SPARQL queries timeout (30 seconds per query)
How: Hardcode Q-numbers for known institutions
Validation: Exact matching on name + location + type
Match Criteria
Exact Match (zero false positives):
Match Requirements:
- Institution name EXACTLY matches Wikidata label
- City/region matches Wikidata location
- Institution type matches Wikidata instance_of
- Fuzzy similarity NOT used (prevents false positives)
Success Rate by Institution Type:
- Universities: 100% (6/6 attempts)
- Museums: 100% (3/3 attempts)
- Archives: Not yet tested
- Libraries: Not yet tested
Lessons Learned
❌ What Didn't Work:
- Fuzzy matching (80-85% similarity) → false positives
- SPARQL bulk queries → timeouts
- Generic name matching without location → ambiguity
✅ What Works:
- Direct Q-number mapping (hardcoded)
- Exact matching (name + location + type)
- University departments (parent university Q-numbers)
- VIAF cross-linking (for institutions with VIAF IDs)
Batch Execution Workflow
Standard Enrichment Pipeline
1. Identify Target Institutions
├─ Priority: Universities > National Museums > State Archives
└─ Criteria: Complete location data, unambiguous names
2. Research Wikidata Q-Numbers
├─ Search Wikidata manually or via SPARQL
└─ Verify: name, location, institution type match
3. Create Enrichment Script
├─ Hardcode Q-numbers in Python script
├─ Implement exact matching logic
└─ Add validation checks
4. Create Backup
└─ {filename}.batch{N}_backup
5. Run Enrichment Script
└─ python scripts/enrich_{country}_batch{N}.py
6. Validate Results
├─ Check Wikidata coverage increase
├─ Verify no false positives
└─ Inspect sample records
7. Document Progress
├─ Update ENRICHMENT_PROGRESS.md
└─ Commit to Git (if applicable)
Timeline & Milestones
Week 1 (November 9-15, 2025)
- Brazil Batch 6 complete (7 institutions, 6.1% coverage)
- Chile Batch 2 complete (6 institutions, 6.7% coverage)
- Chile Batch 3 (11 institutions, 12.2% coverage)
- Brazil Batch 7 (12 institutions, 10.4% coverage)
- Mexico Batch 1 (5 institutions, 4.3% coverage)
Week 2 (November 16-22, 2025)
- Brazil Batch 8-9 (20+ institutions, 17.4% coverage) ✅ GOAL
- Chile Batch 4-5 (20+ institutions, 22.2% coverage) ✅ GOAL
- Mexico Batch 2-3 (15+ institutions, 12.8% coverage)
Week 3 (November 23-29, 2025)
- Mexico Batch 4-5 (20+ institutions, 17.1% coverage) ✅ GOAL
- Libya Batch 1 (10 institutions, 18.5% coverage)
- Japan Batch 1 (50 institutions, 0.4% coverage)
Month 2 (December 2025)
- Complete Latin America (all countries > 22% coverage)
- Start Asia enrichment (Japan, Vietnam)
- Start Africa/MENA enrichment (Libya, others)
Coverage Goals by Region
Latin America (Priority Region)
| Country | Current | Short-term Goal | Long-term Goal |
|---|---|---|---|
| Brazil | 6.1% | 17.4% (20 inst.) | 50% (58 inst.) |
| Chile | 6.7% | 22.2% (20 inst.) | 50% (45 inst.) |
| Mexico | 0.0% | 17.1% (20 inst.) | 50% (59 inst.) |
Regional Goal: 60+ institutions enriched across 3 countries
Asia
| Country | Current | Short-term Goal | Long-term Goal |
|---|---|---|---|
| Japan | 0.0% | 0.4% (50 inst.) | 5% (603 inst.) |
| Vietnam | 0.0% | 25% (5 inst.) | 50% (11 inst.) |
Regional Goal: 55+ institutions enriched
Africa/MENA
| Country | Current | Short-term Goal | Long-term Goal |
|---|---|---|---|
| Libya | 0.0% | 18.5% (10 inst.) | 50% (27 inst.) |
Regional Goal: 10+ institutions enriched
Quality Metrics
Enrichment Accuracy
| Metric | Brazil | Chile | Mexico | Overall |
|---|---|---|---|---|
| False Positives | 0 | 1 (corrected) | N/A | 0 |
| True Positives | 7 | 6 | 0 | 13 |
| Accuracy Rate | 100% | 85% → 100% | N/A | 100% |
Note: Chile Batch 2 initial attempt had 1 false positive (Universidad Arturo Prat → Q1163558), corrected by switching to exact matching.
Match Confidence Distribution
- High confidence (exact match, verified): 13/13 (100%)
- Medium confidence (fuzzy match > 90%): 0/13 (0%)
- Low confidence (fuzzy match 80-90%): 0/13 (0%)
Data Completeness After Enrichment
| Field | Before | After |
|---|---|---|
| Wikidata ID | 0% | 6.5% (avg across BR/CL) |
| VIAF ID | 6.1% (BR only) | 6.1% |
| Website URL | 70% | 75% (improved via Wikidata) |
| Description | 90% | 95% (enhanced via Wikidata) |
Scripts & Files
Enrichment Scripts by Country
Brazil:
scripts/enrich_brazilian_batch1.py- Universities (UFRJ, USP)scripts/enrich_brazilian_batch2.py- Universities (UFMG, UFBA)scripts/enrich_brazilian_batch3.py- Museums (Museu Nacional)scripts/enrich_brazilian_batch4.py- Museums (MASP)scripts/enrich_brazilian_batch5.py- Museums (MAR)scripts/enrich_brazilian_batch6.py- Manual VIAF enrichmentscripts/enrich_brazilian_batch7.py- To be created
Chile:
scripts/enrich_chilean_batch1.py- Universities (Tarapacá, UCN)scripts/enrich_chilean_batch2_corrected.py- University departments (exact matching)scripts/enrich_chilean_batch3.py- To be created
Mexico:
scripts/enrich_mexican_batch1.py- To be created
Dataset Files by Country
Brazil:
- Input:
data/instances/brazil/brazilian_institutions_final.yaml - Current:
data/instances/brazil/brazilian_institutions_batch6_enriched.yaml - Backups:
*.batch{N}_backup
Chile:
- Input:
data/instances/chile/chilean_institutions_geocoded_v2.yaml - Current:
data/instances/chile/chilean_institutions_batch2_enriched.yaml - Backups:
*.batch{N}_backup
Mexico:
- Input:
data/instances/mexico/mexican_institutions_geocoded.yaml - Current: Not yet enriched
References
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- Project Schema:
schemas/heritage_custodian.yaml(LinkML v0.2.1) - Agent Instructions:
AGENTS.md - Unified Overview:
data/instances/all/UNIFIED_OVERVIEW.md
Document Version: 1.0
Last Updated: 2025-11-09
Next Review: After Chile Batch 3 completion