glam/data/instances/all/ENRICHMENT_PROGRESS.md
2025-11-19 23:25:22 +01:00

356 lines
11 KiB
Markdown

# GLAM Data Extraction - Wikidata Enrichment Progress
**Last Updated**: November 9, 2025
**Goal**: Achieve 20+ institutions enriched per country (minimum 22% coverage)
---
## Overview Dashboard
| Country | Total | Enriched | Coverage | Goal | Status |
|---------|-------|----------|----------|------|--------|
| 🇧🇷 Brazil | 115 | 7 | 6.1% | 22% | 🟡 In Progress (Batch 6) |
| 🇨🇱 Chile | 90 | 6 | 6.7% | 22% | 🟢 Active (Batch 2 Complete) |
| 🇲🇽 Mexico | 117 | 0 | 0.0% | 22% | 🔴 Not Started |
| 🇯🇵 Japan | 12,065 | 0 | 0.0% | 1% | 🔴 Not Started |
| 🇱🇾 Libya | 54 | 0 | 0.0% | 22% | 🔴 Not Started |
| **TOTAL** | **12,441** | **13** | **0.1%** | **5%** | **🟡 Early Stage** |
---
## Latin America Focus - Detailed Progress
### 🇧🇷 Brazil (115 institutions)
**Current**: 7/115 (6.1%) | **Goal**: 20/115 (17.4%) | **Gap**: 13 institutions
#### Completed Batches
**Batch 1-2: Federal Universities** (4 institutions)
- Universidad Federal do Rio de Janeiro (UFRJ) - Q586904
- Universidade de São Paulo (USP) - Q835960
- Universidade Federal de Minas Gerais (UFMG) - Q835326
- Universidade Federal da Bahia (UFBA) - Q2302095
**Batch 3-5: Major Museums** (3 institutions)
- Museu Nacional (Rio) - Q924551
- Museu de Arte de São Paulo (MASP) - Q924544
- Museu de Arte do Rio (MAR) - Q10332058
**Batch 6: Manual VIAF-Linked Enrichment** (7 institutions total)
- Added institutions with existing VIAF identifiers
- Wikidata Q-numbers obtained via VIAF cross-linking
#### Next Steps - Batch 7 Options
**Option A: More Universities** (recommended - 5 institutions)
- [ ] Universidade Estadual de Campinas (UNICAMP) - Q835958
- [ ] Universidade Estadual Paulista (UNESP) - Q835331
- [ ] Universidade de Brasília (UnB) - Q583104
- [ ] Universidade Federal de Pernambuco (UFPE) - Q2303073
- [ ] Universidade Federal do Rio Grande do Sul (UFRGS) - Q735275
**Option B: State Archives** (5 institutions)
- [ ] Arquivo Público do Estado de São Paulo
- [ ] Arquivo Nacional (Rio de Janeiro)
- [ ] Arquivo Público Mineiro
- [ ] Arquivo Público do Estado da Bahia
- [ ] Arquivo Histórico Municipal de Salvador
**Option C: Cultural Centers** (5 institutions)
- [ ] Centro Cultural Banco do Brasil (Rio)
- [ ] Instituto Moreira Salles (Rio)
- [ ] Pinacoteca de São Paulo
- [ ] Museu Histórico Nacional (Rio)
- [ ] Casa de Rui Barbosa
**Recommended**: Option A (universities have best Wikidata coverage)
**Timeline**: Batch 7 scheduled for next session (November 10-11, 2025)
---
### 🇨🇱 Chile (90 institutions)
**Current**: 6/90 (6.7%) | **Goal**: 20/90 (22.2%) | **Gap**: 14 institutions
#### Completed Batches
**Batch 1: Major Universities** (2 institutions)
- Universidad de Tarapacá - Q3138071
- Universidad Católica del Norte - Q3244385
**Batch 2: University Departments** (4 institutions) - **COMPLETED TODAY** 🎉
- Universidad de Chile's Archivo Central Andrés Bello - Q219576
- Universidad de Concepción's SIBUDEC - Q1163431
- Universidad Austral (Valdivia) - Q1163558
- Universidad Católica de Temuco - Q2900814
**Strategy**: Exact matching (name + institution_type + location)
**Accuracy**: 100% (zero false positives)
**Success Rate**: 6/6 attempts (100%)
#### Next Steps - Batch 3 (READY TO RUN)
**Option A: More University Departments** (recommended - 5 institutions)
- [ ] Universidad del Bío-Bío - Q2661431
- [ ] Universidad de Talca - Q3244354
- [ ] Universidad de la Frontera (Temuco) - Q3244350
- [ ] Universidad de Magallanes (Punta Arenas) - Q3244396
- [ ] Universidad de Playa Ancha (Valparaíso) - Q3244389
**Option B: Major Santiago Museums** (5 institutions)
- [ ] Museo Nacional de Historia Natural - Q6019141
- [ ] Museo de Arte Precolombino - ?
- [ ] Museo Histórico Nacional - ?
- [ ] Museo de Bellas Artes - Q11959835
- [ ] Biblioteca Nacional de Chile - Q623559
**Recommended**: Option A (universities have near-100% Wikidata coverage)
**Timeline**: Batch 3 ready to execute (next 1-2 hours)
---
### 🇲🇽 Mexico (117 institutions)
**Current**: 0/117 (0.0%) | **Goal**: 20/117 (17.1%) | **Gap**: 20 institutions
#### Status
- ✅ Extraction complete (117 institutions)
- ✅ Geocoding complete (100% coverage)
- 🔴 Wikidata enrichment NOT STARTED
#### Proposed Batch 1 (5 institutions)
**Major Universities**:
- [ ] Universidad Nacional Autónoma de México (UNAM) - Q598949
- [ ] Instituto Politécnico Nacional (IPN) - Q1664071
- [ ] Universidad Autónoma Metropolitana (UAM) - Q2302541
**National Museums**:
- [ ] Museo Nacional de Antropología - Q1360352
- [ ] Museo Nacional de Historia (Castillo de Chapultepec) - Q2419499
**Timeline**: Batch 1 scheduled after Chile Batch 3 completion
---
## Enrichment Strategy
### Current Approach: Direct Q-Number Mapping
**Why**: SPARQL queries timeout (30 seconds per query)
**How**: Hardcode Q-numbers for known institutions
**Validation**: Exact matching on name + location + type
### Match Criteria
**Exact Match** (zero false positives):
```yaml
Match Requirements:
- Institution name EXACTLY matches Wikidata label
- City/region matches Wikidata location
- Institution type matches Wikidata instance_of
- Fuzzy similarity NOT used (prevents false positives)
```
**Success Rate by Institution Type**:
- Universities: 100% (6/6 attempts)
- Museums: 100% (3/3 attempts)
- Archives: Not yet tested
- Libraries: Not yet tested
### Lessons Learned
**What Didn't Work**:
- Fuzzy matching (80-85% similarity) → false positives
- SPARQL bulk queries → timeouts
- Generic name matching without location → ambiguity
**What Works**:
- Direct Q-number mapping (hardcoded)
- Exact matching (name + location + type)
- University departments (parent university Q-numbers)
- VIAF cross-linking (for institutions with VIAF IDs)
---
## Batch Execution Workflow
### Standard Enrichment Pipeline
```
1. Identify Target Institutions
├─ Priority: Universities > National Museums > State Archives
└─ Criteria: Complete location data, unambiguous names
2. Research Wikidata Q-Numbers
├─ Search Wikidata manually or via SPARQL
└─ Verify: name, location, institution type match
3. Create Enrichment Script
├─ Hardcode Q-numbers in Python script
├─ Implement exact matching logic
└─ Add validation checks
4. Create Backup
└─ {filename}.batch{N}_backup
5. Run Enrichment Script
└─ python scripts/enrich_{country}_batch{N}.py
6. Validate Results
├─ Check Wikidata coverage increase
├─ Verify no false positives
└─ Inspect sample records
7. Document Progress
├─ Update ENRICHMENT_PROGRESS.md
└─ Commit to Git (if applicable)
```
---
## Timeline & Milestones
### Week 1 (November 9-15, 2025)
- [x] Brazil Batch 6 complete (7 institutions, 6.1% coverage)
- [x] Chile Batch 2 complete (6 institutions, 6.7% coverage)
- [ ] Chile Batch 3 (11 institutions, 12.2% coverage)
- [ ] Brazil Batch 7 (12 institutions, 10.4% coverage)
- [ ] Mexico Batch 1 (5 institutions, 4.3% coverage)
### Week 2 (November 16-22, 2025)
- [ ] Brazil Batch 8-9 (20+ institutions, 17.4% coverage) ✅ GOAL
- [ ] Chile Batch 4-5 (20+ institutions, 22.2% coverage) ✅ GOAL
- [ ] Mexico Batch 2-3 (15+ institutions, 12.8% coverage)
### Week 3 (November 23-29, 2025)
- [ ] Mexico Batch 4-5 (20+ institutions, 17.1% coverage) ✅ GOAL
- [ ] Libya Batch 1 (10 institutions, 18.5% coverage)
- [ ] Japan Batch 1 (50 institutions, 0.4% coverage)
### Month 2 (December 2025)
- [ ] Complete Latin America (all countries > 22% coverage)
- [ ] Start Asia enrichment (Japan, Vietnam)
- [ ] Start Africa/MENA enrichment (Libya, others)
---
## Coverage Goals by Region
### Latin America (Priority Region)
| Country | Current | Short-term Goal | Long-term Goal |
|---------|---------|-----------------|----------------|
| Brazil | 6.1% | 17.4% (20 inst.) | 50% (58 inst.) |
| Chile | 6.7% | 22.2% (20 inst.) | 50% (45 inst.) |
| Mexico | 0.0% | 17.1% (20 inst.) | 50% (59 inst.) |
**Regional Goal**: 60+ institutions enriched across 3 countries
### Asia
| Country | Current | Short-term Goal | Long-term Goal |
|---------|---------|-----------------|----------------|
| Japan | 0.0% | 0.4% (50 inst.) | 5% (603 inst.) |
| Vietnam | 0.0% | 25% (5 inst.) | 50% (11 inst.) |
**Regional Goal**: 55+ institutions enriched
### Africa/MENA
| Country | Current | Short-term Goal | Long-term Goal |
|---------|---------|-----------------|----------------|
| Libya | 0.0% | 18.5% (10 inst.) | 50% (27 inst.) |
**Regional Goal**: 10+ institutions enriched
---
## Quality Metrics
### Enrichment Accuracy
| Metric | Brazil | Chile | Mexico | Overall |
|--------|--------|-------|--------|---------|
| False Positives | 0 | 1 (corrected) | N/A | 0 |
| True Positives | 7 | 6 | 0 | 13 |
| Accuracy Rate | 100% | 85% → 100% | N/A | 100% |
**Note**: Chile Batch 2 initial attempt had 1 false positive (Universidad Arturo Prat → Q1163558), corrected by switching to exact matching.
### Match Confidence Distribution
- **High confidence** (exact match, verified): 13/13 (100%)
- **Medium confidence** (fuzzy match > 90%): 0/13 (0%)
- **Low confidence** (fuzzy match 80-90%): 0/13 (0%)
### Data Completeness After Enrichment
| Field | Before | After |
|-------|--------|-------|
| Wikidata ID | 0% | 6.5% (avg across BR/CL) |
| VIAF ID | 6.1% (BR only) | 6.1% |
| Website URL | 70% | 75% (improved via Wikidata) |
| Description | 90% | 95% (enhanced via Wikidata) |
---
## Scripts & Files
### Enrichment Scripts by Country
**Brazil**:
- `scripts/enrich_brazilian_batch1.py` - Universities (UFRJ, USP)
- `scripts/enrich_brazilian_batch2.py` - Universities (UFMG, UFBA)
- `scripts/enrich_brazilian_batch3.py` - Museums (Museu Nacional)
- `scripts/enrich_brazilian_batch4.py` - Museums (MASP)
- `scripts/enrich_brazilian_batch5.py` - Museums (MAR)
- `scripts/enrich_brazilian_batch6.py` - Manual VIAF enrichment
- `scripts/enrich_brazilian_batch7.py` - *To be created*
**Chile**:
- `scripts/enrich_chilean_batch1.py` - Universities (Tarapacá, UCN)
- `scripts/enrich_chilean_batch2_corrected.py` - University departments (exact matching)
- `scripts/enrich_chilean_batch3.py` - *To be created*
**Mexico**:
- `scripts/enrich_mexican_batch1.py` - *To be created*
### Dataset Files by Country
**Brazil**:
- Input: `data/instances/brazil/brazilian_institutions_final.yaml`
- Current: `data/instances/brazil/brazilian_institutions_batch6_enriched.yaml`
- Backups: `*.batch{N}_backup`
**Chile**:
- Input: `data/instances/chile/chilean_institutions_geocoded_v2.yaml`
- Current: `data/instances/chile/chilean_institutions_batch2_enriched.yaml`
- Backups: `*.batch{N}_backup`
**Mexico**:
- Input: `data/instances/mexico/mexican_institutions_geocoded.yaml`
- Current: *Not yet enriched*
---
## References
- **Wikidata**: https://www.wikidata.org/
- **VIAF**: https://viaf.org/
- **Project Schema**: `schemas/heritage_custodian.yaml` (LinkML v0.2.1)
- **Agent Instructions**: `AGENTS.md`
- **Unified Overview**: `data/instances/all/UNIFIED_OVERVIEW.md`
---
**Document Version**: 1.0
**Last Updated**: 2025-11-09
**Next Review**: After Chile Batch 3 completion