# GLAM Data Extraction - Wikidata Enrichment Progress **Last Updated**: November 9, 2025 **Goal**: Achieve 20+ institutions enriched per country (minimum 22% coverage) --- ## Overview Dashboard | Country | Total | Enriched | Coverage | Goal | Status | |---------|-------|----------|----------|------|--------| | 🇧🇷 Brazil | 115 | 7 | 6.1% | 22% | 🟡 In Progress (Batch 6) | | 🇨🇱 Chile | 90 | 6 | 6.7% | 22% | 🟢 Active (Batch 2 Complete) | | 🇲🇽 Mexico | 117 | 0 | 0.0% | 22% | 🔴 Not Started | | 🇯🇵 Japan | 12,065 | 0 | 0.0% | 1% | 🔴 Not Started | | 🇱🇾 Libya | 54 | 0 | 0.0% | 22% | 🔴 Not Started | | **TOTAL** | **12,441** | **13** | **0.1%** | **5%** | **🟡 Early Stage** | --- ## Latin America Focus - Detailed Progress ### 🇧🇷 Brazil (115 institutions) **Current**: 7/115 (6.1%) | **Goal**: 20/115 (17.4%) | **Gap**: 13 institutions #### Completed Batches ✅ **Batch 1-2: Federal Universities** (4 institutions) - Universidad Federal do Rio de Janeiro (UFRJ) - Q586904 - Universidade de São Paulo (USP) - Q835960 - Universidade Federal de Minas Gerais (UFMG) - Q835326 - Universidade Federal da Bahia (UFBA) - Q2302095 ✅ **Batch 3-5: Major Museums** (3 institutions) - Museu Nacional (Rio) - Q924551 - Museu de Arte de São Paulo (MASP) - Q924544 - Museu de Arte do Rio (MAR) - Q10332058 ✅ **Batch 6: Manual VIAF-Linked Enrichment** (7 institutions total) - Added institutions with existing VIAF identifiers - Wikidata Q-numbers obtained via VIAF cross-linking #### Next Steps - Batch 7 Options **Option A: More Universities** (recommended - 5 institutions) - [ ] Universidade Estadual de Campinas (UNICAMP) - Q835958 - [ ] Universidade Estadual Paulista (UNESP) - Q835331 - [ ] Universidade de Brasília (UnB) - Q583104 - [ ] Universidade Federal de Pernambuco (UFPE) - Q2303073 - [ ] Universidade Federal do Rio Grande do Sul (UFRGS) - Q735275 **Option B: State Archives** (5 institutions) - [ ] Arquivo Público do Estado de São Paulo - [ ] Arquivo Nacional (Rio de Janeiro) - [ ] Arquivo Público Mineiro - [ ] Arquivo Público do Estado da Bahia - [ ] Arquivo Histórico Municipal de Salvador **Option C: Cultural Centers** (5 institutions) - [ ] Centro Cultural Banco do Brasil (Rio) - [ ] Instituto Moreira Salles (Rio) - [ ] Pinacoteca de São Paulo - [ ] Museu Histórico Nacional (Rio) - [ ] Casa de Rui Barbosa **Recommended**: Option A (universities have best Wikidata coverage) **Timeline**: Batch 7 scheduled for next session (November 10-11, 2025) --- ### 🇨🇱 Chile (90 institutions) **Current**: 6/90 (6.7%) | **Goal**: 20/90 (22.2%) | **Gap**: 14 institutions #### Completed Batches ✅ **Batch 1: Major Universities** (2 institutions) - Universidad de Tarapacá - Q3138071 - Universidad Católica del Norte - Q3244385 ✅ **Batch 2: University Departments** (4 institutions) - **COMPLETED TODAY** 🎉 - Universidad de Chile's Archivo Central Andrés Bello - Q219576 - Universidad de Concepción's SIBUDEC - Q1163431 - Universidad Austral (Valdivia) - Q1163558 - Universidad Católica de Temuco - Q2900814 **Strategy**: Exact matching (name + institution_type + location) **Accuracy**: 100% (zero false positives) **Success Rate**: 6/6 attempts (100%) #### Next Steps - Batch 3 (READY TO RUN) **Option A: More University Departments** (recommended - 5 institutions) - [ ] Universidad del Bío-Bío - Q2661431 - [ ] Universidad de Talca - Q3244354 - [ ] Universidad de la Frontera (Temuco) - Q3244350 - [ ] Universidad de Magallanes (Punta Arenas) - Q3244396 - [ ] Universidad de Playa Ancha (Valparaíso) - Q3244389 **Option B: Major Santiago Museums** (5 institutions) - [ ] Museo Nacional de Historia Natural - Q6019141 - [ ] Museo de Arte Precolombino - ? - [ ] Museo Histórico Nacional - ? - [ ] Museo de Bellas Artes - Q11959835 - [ ] Biblioteca Nacional de Chile - Q623559 **Recommended**: Option A (universities have near-100% Wikidata coverage) **Timeline**: Batch 3 ready to execute (next 1-2 hours) --- ### 🇲🇽 Mexico (117 institutions) **Current**: 0/117 (0.0%) | **Goal**: 20/117 (17.1%) | **Gap**: 20 institutions #### Status - ✅ Extraction complete (117 institutions) - ✅ Geocoding complete (100% coverage) - 🔴 Wikidata enrichment NOT STARTED #### Proposed Batch 1 (5 institutions) **Major Universities**: - [ ] Universidad Nacional Autónoma de México (UNAM) - Q598949 - [ ] Instituto Politécnico Nacional (IPN) - Q1664071 - [ ] Universidad Autónoma Metropolitana (UAM) - Q2302541 **National Museums**: - [ ] Museo Nacional de Antropología - Q1360352 - [ ] Museo Nacional de Historia (Castillo de Chapultepec) - Q2419499 **Timeline**: Batch 1 scheduled after Chile Batch 3 completion --- ## Enrichment Strategy ### Current Approach: Direct Q-Number Mapping **Why**: SPARQL queries timeout (30 seconds per query) **How**: Hardcode Q-numbers for known institutions **Validation**: Exact matching on name + location + type ### Match Criteria **Exact Match** (zero false positives): ```yaml Match Requirements: - Institution name EXACTLY matches Wikidata label - City/region matches Wikidata location - Institution type matches Wikidata instance_of - Fuzzy similarity NOT used (prevents false positives) ``` **Success Rate by Institution Type**: - Universities: 100% (6/6 attempts) - Museums: 100% (3/3 attempts) - Archives: Not yet tested - Libraries: Not yet tested ### Lessons Learned ❌ **What Didn't Work**: - Fuzzy matching (80-85% similarity) → false positives - SPARQL bulk queries → timeouts - Generic name matching without location → ambiguity ✅ **What Works**: - Direct Q-number mapping (hardcoded) - Exact matching (name + location + type) - University departments (parent university Q-numbers) - VIAF cross-linking (for institutions with VIAF IDs) --- ## Batch Execution Workflow ### Standard Enrichment Pipeline ``` 1. Identify Target Institutions ├─ Priority: Universities > National Museums > State Archives └─ Criteria: Complete location data, unambiguous names 2. Research Wikidata Q-Numbers ├─ Search Wikidata manually or via SPARQL └─ Verify: name, location, institution type match 3. Create Enrichment Script ├─ Hardcode Q-numbers in Python script ├─ Implement exact matching logic └─ Add validation checks 4. Create Backup └─ {filename}.batch{N}_backup 5. Run Enrichment Script └─ python scripts/enrich_{country}_batch{N}.py 6. Validate Results ├─ Check Wikidata coverage increase ├─ Verify no false positives └─ Inspect sample records 7. Document Progress ├─ Update ENRICHMENT_PROGRESS.md └─ Commit to Git (if applicable) ``` --- ## Timeline & Milestones ### Week 1 (November 9-15, 2025) - [x] Brazil Batch 6 complete (7 institutions, 6.1% coverage) - [x] Chile Batch 2 complete (6 institutions, 6.7% coverage) - [ ] Chile Batch 3 (11 institutions, 12.2% coverage) - [ ] Brazil Batch 7 (12 institutions, 10.4% coverage) - [ ] Mexico Batch 1 (5 institutions, 4.3% coverage) ### Week 2 (November 16-22, 2025) - [ ] Brazil Batch 8-9 (20+ institutions, 17.4% coverage) ✅ GOAL - [ ] Chile Batch 4-5 (20+ institutions, 22.2% coverage) ✅ GOAL - [ ] Mexico Batch 2-3 (15+ institutions, 12.8% coverage) ### Week 3 (November 23-29, 2025) - [ ] Mexico Batch 4-5 (20+ institutions, 17.1% coverage) ✅ GOAL - [ ] Libya Batch 1 (10 institutions, 18.5% coverage) - [ ] Japan Batch 1 (50 institutions, 0.4% coverage) ### Month 2 (December 2025) - [ ] Complete Latin America (all countries > 22% coverage) - [ ] Start Asia enrichment (Japan, Vietnam) - [ ] Start Africa/MENA enrichment (Libya, others) --- ## Coverage Goals by Region ### Latin America (Priority Region) | Country | Current | Short-term Goal | Long-term Goal | |---------|---------|-----------------|----------------| | Brazil | 6.1% | 17.4% (20 inst.) | 50% (58 inst.) | | Chile | 6.7% | 22.2% (20 inst.) | 50% (45 inst.) | | Mexico | 0.0% | 17.1% (20 inst.) | 50% (59 inst.) | **Regional Goal**: 60+ institutions enriched across 3 countries ### Asia | Country | Current | Short-term Goal | Long-term Goal | |---------|---------|-----------------|----------------| | Japan | 0.0% | 0.4% (50 inst.) | 5% (603 inst.) | | Vietnam | 0.0% | 25% (5 inst.) | 50% (11 inst.) | **Regional Goal**: 55+ institutions enriched ### Africa/MENA | Country | Current | Short-term Goal | Long-term Goal | |---------|---------|-----------------|----------------| | Libya | 0.0% | 18.5% (10 inst.) | 50% (27 inst.) | **Regional Goal**: 10+ institutions enriched --- ## Quality Metrics ### Enrichment Accuracy | Metric | Brazil | Chile | Mexico | Overall | |--------|--------|-------|--------|---------| | False Positives | 0 | 1 (corrected) | N/A | 0 | | True Positives | 7 | 6 | 0 | 13 | | Accuracy Rate | 100% | 85% → 100% | N/A | 100% | **Note**: Chile Batch 2 initial attempt had 1 false positive (Universidad Arturo Prat → Q1163558), corrected by switching to exact matching. ### Match Confidence Distribution - **High confidence** (exact match, verified): 13/13 (100%) - **Medium confidence** (fuzzy match > 90%): 0/13 (0%) - **Low confidence** (fuzzy match 80-90%): 0/13 (0%) ### Data Completeness After Enrichment | Field | Before | After | |-------|--------|-------| | Wikidata ID | 0% | 6.5% (avg across BR/CL) | | VIAF ID | 6.1% (BR only) | 6.1% | | Website URL | 70% | 75% (improved via Wikidata) | | Description | 90% | 95% (enhanced via Wikidata) | --- ## Scripts & Files ### Enrichment Scripts by Country **Brazil**: - `scripts/enrich_brazilian_batch1.py` - Universities (UFRJ, USP) - `scripts/enrich_brazilian_batch2.py` - Universities (UFMG, UFBA) - `scripts/enrich_brazilian_batch3.py` - Museums (Museu Nacional) - `scripts/enrich_brazilian_batch4.py` - Museums (MASP) - `scripts/enrich_brazilian_batch5.py` - Museums (MAR) - `scripts/enrich_brazilian_batch6.py` - Manual VIAF enrichment - `scripts/enrich_brazilian_batch7.py` - *To be created* **Chile**: - `scripts/enrich_chilean_batch1.py` - Universities (Tarapacá, UCN) - `scripts/enrich_chilean_batch2_corrected.py` - University departments (exact matching) - `scripts/enrich_chilean_batch3.py` - *To be created* **Mexico**: - `scripts/enrich_mexican_batch1.py` - *To be created* ### Dataset Files by Country **Brazil**: - Input: `data/instances/brazil/brazilian_institutions_final.yaml` - Current: `data/instances/brazil/brazilian_institutions_batch6_enriched.yaml` - Backups: `*.batch{N}_backup` **Chile**: - Input: `data/instances/chile/chilean_institutions_geocoded_v2.yaml` - Current: `data/instances/chile/chilean_institutions_batch2_enriched.yaml` - Backups: `*.batch{N}_backup` **Mexico**: - Input: `data/instances/mexico/mexican_institutions_geocoded.yaml` - Current: *Not yet enriched* --- ## References - **Wikidata**: https://www.wikidata.org/ - **VIAF**: https://viaf.org/ - **Project Schema**: `schemas/heritage_custodian.yaml` (LinkML v0.2.1) - **Agent Instructions**: `AGENTS.md` - **Unified Overview**: `data/instances/all/UNIFIED_OVERVIEW.md` --- **Document Version**: 1.0 **Last Updated**: 2025-11-09 **Next Review**: After Chile Batch 3 completion