356 lines
11 KiB
Markdown
356 lines
11 KiB
Markdown
# GLAM Data Extraction - Wikidata Enrichment Progress
|
|
|
|
**Last Updated**: November 9, 2025
|
|
**Goal**: Achieve 20+ institutions enriched per country (minimum 22% coverage)
|
|
|
|
---
|
|
|
|
## Overview Dashboard
|
|
|
|
| Country | Total | Enriched | Coverage | Goal | Status |
|
|
|---------|-------|----------|----------|------|--------|
|
|
| 🇧🇷 Brazil | 115 | 7 | 6.1% | 22% | 🟡 In Progress (Batch 6) |
|
|
| 🇨🇱 Chile | 90 | 6 | 6.7% | 22% | 🟢 Active (Batch 2 Complete) |
|
|
| 🇲🇽 Mexico | 117 | 0 | 0.0% | 22% | 🔴 Not Started |
|
|
| 🇯🇵 Japan | 12,065 | 0 | 0.0% | 1% | 🔴 Not Started |
|
|
| 🇱🇾 Libya | 54 | 0 | 0.0% | 22% | 🔴 Not Started |
|
|
| **TOTAL** | **12,441** | **13** | **0.1%** | **5%** | **🟡 Early Stage** |
|
|
|
|
---
|
|
|
|
## Latin America Focus - Detailed Progress
|
|
|
|
### 🇧🇷 Brazil (115 institutions)
|
|
|
|
**Current**: 7/115 (6.1%) | **Goal**: 20/115 (17.4%) | **Gap**: 13 institutions
|
|
|
|
#### Completed Batches
|
|
|
|
✅ **Batch 1-2: Federal Universities** (4 institutions)
|
|
- Universidad Federal do Rio de Janeiro (UFRJ) - Q586904
|
|
- Universidade de São Paulo (USP) - Q835960
|
|
- Universidade Federal de Minas Gerais (UFMG) - Q835326
|
|
- Universidade Federal da Bahia (UFBA) - Q2302095
|
|
|
|
✅ **Batch 3-5: Major Museums** (3 institutions)
|
|
- Museu Nacional (Rio) - Q924551
|
|
- Museu de Arte de São Paulo (MASP) - Q924544
|
|
- Museu de Arte do Rio (MAR) - Q10332058
|
|
|
|
✅ **Batch 6: Manual VIAF-Linked Enrichment** (7 institutions total)
|
|
- Added institutions with existing VIAF identifiers
|
|
- Wikidata Q-numbers obtained via VIAF cross-linking
|
|
|
|
#### Next Steps - Batch 7 Options
|
|
|
|
**Option A: More Universities** (recommended - 5 institutions)
|
|
- [ ] Universidade Estadual de Campinas (UNICAMP) - Q835958
|
|
- [ ] Universidade Estadual Paulista (UNESP) - Q835331
|
|
- [ ] Universidade de Brasília (UnB) - Q583104
|
|
- [ ] Universidade Federal de Pernambuco (UFPE) - Q2303073
|
|
- [ ] Universidade Federal do Rio Grande do Sul (UFRGS) - Q735275
|
|
|
|
**Option B: State Archives** (5 institutions)
|
|
- [ ] Arquivo Público do Estado de São Paulo
|
|
- [ ] Arquivo Nacional (Rio de Janeiro)
|
|
- [ ] Arquivo Público Mineiro
|
|
- [ ] Arquivo Público do Estado da Bahia
|
|
- [ ] Arquivo Histórico Municipal de Salvador
|
|
|
|
**Option C: Cultural Centers** (5 institutions)
|
|
- [ ] Centro Cultural Banco do Brasil (Rio)
|
|
- [ ] Instituto Moreira Salles (Rio)
|
|
- [ ] Pinacoteca de São Paulo
|
|
- [ ] Museu Histórico Nacional (Rio)
|
|
- [ ] Casa de Rui Barbosa
|
|
|
|
**Recommended**: Option A (universities have best Wikidata coverage)
|
|
|
|
**Timeline**: Batch 7 scheduled for next session (November 10-11, 2025)
|
|
|
|
---
|
|
|
|
### 🇨🇱 Chile (90 institutions)
|
|
|
|
**Current**: 6/90 (6.7%) | **Goal**: 20/90 (22.2%) | **Gap**: 14 institutions
|
|
|
|
#### Completed Batches
|
|
|
|
✅ **Batch 1: Major Universities** (2 institutions)
|
|
- Universidad de Tarapacá - Q3138071
|
|
- Universidad Católica del Norte - Q3244385
|
|
|
|
✅ **Batch 2: University Departments** (4 institutions) - **COMPLETED TODAY** 🎉
|
|
- Universidad de Chile's Archivo Central Andrés Bello - Q219576
|
|
- Universidad de Concepción's SIBUDEC - Q1163431
|
|
- Universidad Austral (Valdivia) - Q1163558
|
|
- Universidad Católica de Temuco - Q2900814
|
|
|
|
**Strategy**: Exact matching (name + institution_type + location)
|
|
**Accuracy**: 100% (zero false positives)
|
|
**Success Rate**: 6/6 attempts (100%)
|
|
|
|
#### Next Steps - Batch 3 (READY TO RUN)
|
|
|
|
**Option A: More University Departments** (recommended - 5 institutions)
|
|
- [ ] Universidad del Bío-Bío - Q2661431
|
|
- [ ] Universidad de Talca - Q3244354
|
|
- [ ] Universidad de la Frontera (Temuco) - Q3244350
|
|
- [ ] Universidad de Magallanes (Punta Arenas) - Q3244396
|
|
- [ ] Universidad de Playa Ancha (Valparaíso) - Q3244389
|
|
|
|
**Option B: Major Santiago Museums** (5 institutions)
|
|
- [ ] Museo Nacional de Historia Natural - Q6019141
|
|
- [ ] Museo de Arte Precolombino - ?
|
|
- [ ] Museo Histórico Nacional - ?
|
|
- [ ] Museo de Bellas Artes - Q11959835
|
|
- [ ] Biblioteca Nacional de Chile - Q623559
|
|
|
|
**Recommended**: Option A (universities have near-100% Wikidata coverage)
|
|
|
|
**Timeline**: Batch 3 ready to execute (next 1-2 hours)
|
|
|
|
---
|
|
|
|
### 🇲🇽 Mexico (117 institutions)
|
|
|
|
**Current**: 0/117 (0.0%) | **Goal**: 20/117 (17.1%) | **Gap**: 20 institutions
|
|
|
|
#### Status
|
|
- ✅ Extraction complete (117 institutions)
|
|
- ✅ Geocoding complete (100% coverage)
|
|
- 🔴 Wikidata enrichment NOT STARTED
|
|
|
|
#### Proposed Batch 1 (5 institutions)
|
|
|
|
**Major Universities**:
|
|
- [ ] Universidad Nacional Autónoma de México (UNAM) - Q598949
|
|
- [ ] Instituto Politécnico Nacional (IPN) - Q1664071
|
|
- [ ] Universidad Autónoma Metropolitana (UAM) - Q2302541
|
|
|
|
**National Museums**:
|
|
- [ ] Museo Nacional de Antropología - Q1360352
|
|
- [ ] Museo Nacional de Historia (Castillo de Chapultepec) - Q2419499
|
|
|
|
**Timeline**: Batch 1 scheduled after Chile Batch 3 completion
|
|
|
|
---
|
|
|
|
## Enrichment Strategy
|
|
|
|
### Current Approach: Direct Q-Number Mapping
|
|
|
|
**Why**: SPARQL queries timeout (30 seconds per query)
|
|
**How**: Hardcode Q-numbers for known institutions
|
|
**Validation**: Exact matching on name + location + type
|
|
|
|
### Match Criteria
|
|
|
|
**Exact Match** (zero false positives):
|
|
```yaml
|
|
Match Requirements:
|
|
- Institution name EXACTLY matches Wikidata label
|
|
- City/region matches Wikidata location
|
|
- Institution type matches Wikidata instance_of
|
|
- Fuzzy similarity NOT used (prevents false positives)
|
|
```
|
|
|
|
**Success Rate by Institution Type**:
|
|
- Universities: 100% (6/6 attempts)
|
|
- Museums: 100% (3/3 attempts)
|
|
- Archives: Not yet tested
|
|
- Libraries: Not yet tested
|
|
|
|
### Lessons Learned
|
|
|
|
❌ **What Didn't Work**:
|
|
- Fuzzy matching (80-85% similarity) → false positives
|
|
- SPARQL bulk queries → timeouts
|
|
- Generic name matching without location → ambiguity
|
|
|
|
✅ **What Works**:
|
|
- Direct Q-number mapping (hardcoded)
|
|
- Exact matching (name + location + type)
|
|
- University departments (parent university Q-numbers)
|
|
- VIAF cross-linking (for institutions with VIAF IDs)
|
|
|
|
---
|
|
|
|
## Batch Execution Workflow
|
|
|
|
### Standard Enrichment Pipeline
|
|
|
|
```
|
|
1. Identify Target Institutions
|
|
├─ Priority: Universities > National Museums > State Archives
|
|
└─ Criteria: Complete location data, unambiguous names
|
|
|
|
2. Research Wikidata Q-Numbers
|
|
├─ Search Wikidata manually or via SPARQL
|
|
└─ Verify: name, location, institution type match
|
|
|
|
3. Create Enrichment Script
|
|
├─ Hardcode Q-numbers in Python script
|
|
├─ Implement exact matching logic
|
|
└─ Add validation checks
|
|
|
|
4. Create Backup
|
|
└─ {filename}.batch{N}_backup
|
|
|
|
5. Run Enrichment Script
|
|
└─ python scripts/enrich_{country}_batch{N}.py
|
|
|
|
6. Validate Results
|
|
├─ Check Wikidata coverage increase
|
|
├─ Verify no false positives
|
|
└─ Inspect sample records
|
|
|
|
7. Document Progress
|
|
├─ Update ENRICHMENT_PROGRESS.md
|
|
└─ Commit to Git (if applicable)
|
|
```
|
|
|
|
---
|
|
|
|
## Timeline & Milestones
|
|
|
|
### Week 1 (November 9-15, 2025)
|
|
|
|
- [x] Brazil Batch 6 complete (7 institutions, 6.1% coverage)
|
|
- [x] Chile Batch 2 complete (6 institutions, 6.7% coverage)
|
|
- [ ] Chile Batch 3 (11 institutions, 12.2% coverage)
|
|
- [ ] Brazil Batch 7 (12 institutions, 10.4% coverage)
|
|
- [ ] Mexico Batch 1 (5 institutions, 4.3% coverage)
|
|
|
|
### Week 2 (November 16-22, 2025)
|
|
|
|
- [ ] Brazil Batch 8-9 (20+ institutions, 17.4% coverage) ✅ GOAL
|
|
- [ ] Chile Batch 4-5 (20+ institutions, 22.2% coverage) ✅ GOAL
|
|
- [ ] Mexico Batch 2-3 (15+ institutions, 12.8% coverage)
|
|
|
|
### Week 3 (November 23-29, 2025)
|
|
|
|
- [ ] Mexico Batch 4-5 (20+ institutions, 17.1% coverage) ✅ GOAL
|
|
- [ ] Libya Batch 1 (10 institutions, 18.5% coverage)
|
|
- [ ] Japan Batch 1 (50 institutions, 0.4% coverage)
|
|
|
|
### Month 2 (December 2025)
|
|
|
|
- [ ] Complete Latin America (all countries > 22% coverage)
|
|
- [ ] Start Asia enrichment (Japan, Vietnam)
|
|
- [ ] Start Africa/MENA enrichment (Libya, others)
|
|
|
|
---
|
|
|
|
## Coverage Goals by Region
|
|
|
|
### Latin America (Priority Region)
|
|
|
|
| Country | Current | Short-term Goal | Long-term Goal |
|
|
|---------|---------|-----------------|----------------|
|
|
| Brazil | 6.1% | 17.4% (20 inst.) | 50% (58 inst.) |
|
|
| Chile | 6.7% | 22.2% (20 inst.) | 50% (45 inst.) |
|
|
| Mexico | 0.0% | 17.1% (20 inst.) | 50% (59 inst.) |
|
|
|
|
**Regional Goal**: 60+ institutions enriched across 3 countries
|
|
|
|
### Asia
|
|
|
|
| Country | Current | Short-term Goal | Long-term Goal |
|
|
|---------|---------|-----------------|----------------|
|
|
| Japan | 0.0% | 0.4% (50 inst.) | 5% (603 inst.) |
|
|
| Vietnam | 0.0% | 25% (5 inst.) | 50% (11 inst.) |
|
|
|
|
**Regional Goal**: 55+ institutions enriched
|
|
|
|
### Africa/MENA
|
|
|
|
| Country | Current | Short-term Goal | Long-term Goal |
|
|
|---------|---------|-----------------|----------------|
|
|
| Libya | 0.0% | 18.5% (10 inst.) | 50% (27 inst.) |
|
|
|
|
**Regional Goal**: 10+ institutions enriched
|
|
|
|
---
|
|
|
|
## Quality Metrics
|
|
|
|
### Enrichment Accuracy
|
|
|
|
| Metric | Brazil | Chile | Mexico | Overall |
|
|
|--------|--------|-------|--------|---------|
|
|
| False Positives | 0 | 1 (corrected) | N/A | 0 |
|
|
| True Positives | 7 | 6 | 0 | 13 |
|
|
| Accuracy Rate | 100% | 85% → 100% | N/A | 100% |
|
|
|
|
**Note**: Chile Batch 2 initial attempt had 1 false positive (Universidad Arturo Prat → Q1163558), corrected by switching to exact matching.
|
|
|
|
### Match Confidence Distribution
|
|
|
|
- **High confidence** (exact match, verified): 13/13 (100%)
|
|
- **Medium confidence** (fuzzy match > 90%): 0/13 (0%)
|
|
- **Low confidence** (fuzzy match 80-90%): 0/13 (0%)
|
|
|
|
### Data Completeness After Enrichment
|
|
|
|
| Field | Before | After |
|
|
|-------|--------|-------|
|
|
| Wikidata ID | 0% | 6.5% (avg across BR/CL) |
|
|
| VIAF ID | 6.1% (BR only) | 6.1% |
|
|
| Website URL | 70% | 75% (improved via Wikidata) |
|
|
| Description | 90% | 95% (enhanced via Wikidata) |
|
|
|
|
---
|
|
|
|
## Scripts & Files
|
|
|
|
### Enrichment Scripts by Country
|
|
|
|
**Brazil**:
|
|
- `scripts/enrich_brazilian_batch1.py` - Universities (UFRJ, USP)
|
|
- `scripts/enrich_brazilian_batch2.py` - Universities (UFMG, UFBA)
|
|
- `scripts/enrich_brazilian_batch3.py` - Museums (Museu Nacional)
|
|
- `scripts/enrich_brazilian_batch4.py` - Museums (MASP)
|
|
- `scripts/enrich_brazilian_batch5.py` - Museums (MAR)
|
|
- `scripts/enrich_brazilian_batch6.py` - Manual VIAF enrichment
|
|
- `scripts/enrich_brazilian_batch7.py` - *To be created*
|
|
|
|
**Chile**:
|
|
- `scripts/enrich_chilean_batch1.py` - Universities (Tarapacá, UCN)
|
|
- `scripts/enrich_chilean_batch2_corrected.py` - University departments (exact matching)
|
|
- `scripts/enrich_chilean_batch3.py` - *To be created*
|
|
|
|
**Mexico**:
|
|
- `scripts/enrich_mexican_batch1.py` - *To be created*
|
|
|
|
### Dataset Files by Country
|
|
|
|
**Brazil**:
|
|
- Input: `data/instances/brazil/brazilian_institutions_final.yaml`
|
|
- Current: `data/instances/brazil/brazilian_institutions_batch6_enriched.yaml`
|
|
- Backups: `*.batch{N}_backup`
|
|
|
|
**Chile**:
|
|
- Input: `data/instances/chile/chilean_institutions_geocoded_v2.yaml`
|
|
- Current: `data/instances/chile/chilean_institutions_batch2_enriched.yaml`
|
|
- Backups: `*.batch{N}_backup`
|
|
|
|
**Mexico**:
|
|
- Input: `data/instances/mexico/mexican_institutions_geocoded.yaml`
|
|
- Current: *Not yet enriched*
|
|
|
|
---
|
|
|
|
## References
|
|
|
|
- **Wikidata**: https://www.wikidata.org/
|
|
- **VIAF**: https://viaf.org/
|
|
- **Project Schema**: `schemas/heritage_custodian.yaml` (LinkML v0.2.1)
|
|
- **Agent Instructions**: `AGENTS.md`
|
|
- **Unified Overview**: `data/instances/all/UNIFIED_OVERVIEW.md`
|
|
|
|
---
|
|
|
|
**Document Version**: 1.0
|
|
**Last Updated**: 2025-11-09
|
|
**Next Review**: After Chile Batch 3 completion
|