# Brazil Wikidata Enrichment - Batch 16 Plan

**Date**: November 11, 2025
**Current Coverage**: 63.2% (79/125)
**Target Coverage**: 67.2-71.2% (84-89/125)
**Goal**: Enrich 5-10 of the remaining 46 institutions without Wikidata

---

## Current Status

### Coverage Statistics
- Total Brazilian institutions: **125**
- With Wikidata: **79** (63.2%)
- Without Wikidata: **46** (36.8%)
- Remaining to reach 70% target: **9 institutions** (88/125)
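The "9 institutions" figure comes from rounding up: 70% of 125 is 87.5, so 88 enriched institutions are needed. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not part of the project's scripts.

```python
import math

def remaining_to_target(enriched: int, total: int, target_pct: float) -> int:
    """Smallest number of additional institutions needed to reach target_pct."""
    needed = math.ceil(total * target_pct / 100)  # 70% of 125 -> 87.5 -> 88
    return max(0, needed - enriched)

print(remaining_to_target(79, 125, 70))  # 9
```

Rounding up matters here: 87 institutions would give 69.6%, just short of the target.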

### Institution Breakdown (46 without Wikidata)
After filtering out technical systems:
- **Real institutions**: 35
- **Technical systems/platforms**: 11 (APIs, DSpace, Tainacan, etc.)

---

## Batch 16 Strategy

### Priority Targeting

Focus on institutions with **highest Wikidata likelihood**:

1. ✅ **State/regional museums** (likely documented)
2. ✅ **Official cultural foundations** (government institutions)
3. ✅ **University collections** (academic institutions)
4. ✅ **Major state archives** (public institutions)
5. ⚠️ **Regional heritage projects** (lower priority - may lack Wikidata)

### Search Improvements

1. **Portuguese-language queries**: Use native Portuguese names
2. **State-level searches**: "Museu [State]", "Arquivo [State]"
3. **Alternative name variants**: Test abbreviations (MHAM, FEM, etc.)
4. **Geographic context**: Include city/state in searches
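The search improvements above amount to generating several query strings per institution. The following is a rough sketch of one way to do that, assuming each institution has a Portuguese name and optionally an abbreviation, city, and state; `build_queries` is a hypothetical helper, not an existing project function.

```python
def build_queries(name_pt, abbreviation=None, city=None, state=None):
    """Return de-duplicated search strings in the order they are generated."""
    queries = [name_pt]
    if abbreviation:
        queries.append(abbreviation)
    # Add geographic context for both the full name and the abbreviation
    for place in (city, state):
        if place:
            queries.append(f"{name_pt} {place}")
            if abbreviation:
                queries.append(f"{abbreviation} {place}")
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(queries))

queries = build_queries(
    "Fundação de Cultura Elias Mansour", abbreviation="FEM", state="Acre"
)
for q in queries:
    print(q)
# Fundação de Cultura Elias Mansour
# FEM
# Fundação de Cultura Elias Mansour Acre
# FEM Acre
```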

---

## Top 15 Candidates for Batch 16

### Tier 1: High Likelihood (Archives, Major Museums, Official Institutions)

#### 1. **DEAP Archives** ⭐⭐⭐
- **Type**: ARCHIVE
- **Location**: Paraná state
- **Description**: 100,000+ immigrant records online
- **Search Strategy**:
  - "Departamento Estadual de Arquivo Público Paraná"
  - "DEAP Paraná"
  - "Arquivo Público do Paraná"
- **Likelihood**: **VERY HIGH** (state archive with large collection)

#### 2. **APESP** ⭐⭐⭐
- **Type**: MIXED (likely ARCHIVE)
- **Location**: São Paulo
- **Description**: 25M documents, 1M+ digitized pages
- **Search Strategy**:
  - "Arquivo Público do Estado de São Paulo"
  - "APESP São Paulo"
- **Likelihood**: **VERY HIGH** (major state archive, largest in Brazil)

#### 3. **Museu dos Povos Acreanos** ⭐⭐⭐
- **Type**: MUSEUM
- **Location**: Rio Branco, Acre
- **Description**: Opened 2023, $2.8M World Bank funding
- **Search Strategy**:
  - "Museu dos Povos Acreanos"
  - "Museum of Acrean Peoples"
  - "Rio Branco museum"
- **Likelihood**: **HIGH** (new museum, recent World Bank project)

#### 4. **Museu Histórico de Alcântara** ⭐⭐
- **Type**: MUSEUM
- **Location**: Alcântara, Maranhão
- **Description**: 10,000 pieces
- **Search Strategy**:
  - "Museu Histórico de Alcântara"
  - "Alcântara historical museum"
  - "Museu Casa Histórica Alcântara"
- **Likelihood**: **MEDIUM-HIGH** (colonial city, UNESCO site area)

#### 5. **Sistema Brasileiro de Museus (SBM)** ⭐⭐⭐
- **Type**: OFFICIAL_INSTITUTION
- **Location**: National (Brasília)
- **Description**: Created 2004, updated 2013 (Decree 8.124), IBRAM coordination
- **Search Strategy**:
  - "Sistema Brasileiro de Museus"
  - "Brazilian Museum System"
  - "SBM Brasil IBRAM"
- **Likelihood**: **VERY HIGH** (federal system, official government program)

#### 6. **Fundação de Cultura Elias Mansour** ⭐⭐
- **Type**: OFFICIAL_INSTITUTION
- **Location**: Acre state
- **Description**: State cultural foundation (https://www.femcultura.ac.gov.br/)
- **Search Strategy**:
  - "Fundação de Cultura Elias Mansour"
  - "FEM Acre"
  - "Elias Mansour Cultural Foundation"
- **Likelihood**: **MEDIUM-HIGH** (state foundation with active website)

#### 7. **FCRB** ⭐⭐⭐
- **Type**: MIXED (likely LIBRARY/OFFICIAL_INSTITUTION)
- **Location**: Rio de Janeiro
- **Description**: RUBI repository (DSpace), UNESCO recognition
- **Search Strategy**:
  - "Fundação Casa de Rui Barbosa"
  - "FCRB Rio de Janeiro"
  - "Casa de Rui Barbosa"
- **Likelihood**: **VERY HIGH** (major federal foundation, UNESCO recognized)

#### 8. **FUMDHAM** ⭐⭐
- **Type**: MIXED (likely OFFICIAL_INSTITUTION/RESEARCH_CENTER)
- **Location**: São Raimundo Nonato, Piauí
- **Description**: Rock art preservation (Serra da Capivara)
- **Search Strategy**:
  - "Fundação Museu do Homem Americano"
  - "FUMDHAM"
  - "Serra da Capivara foundation"
- **Likelihood**: **MEDIUM-HIGH** (UNESCO World Heritage site management)

#### 9. **Museu da Memória Rondoniense** ⭐
- **Type**: MUSEUM
- **Location**: Porto Velho, Rondônia
- **Description**: @museudamemoriarondoniense, 10,000+ records
- **Search Strategy**:
  - "Museu da Memória Rondoniense"
  - "Porto Velho memory museum"
- **Likelihood**: **MEDIUM** (state capital museum, active social media)

#### 10. **MuseusBr** ⭐⭐⭐
- **Type**: OFFICIAL_INSTITUTION
- **Location**: National platform
- **Description**: National platform with thousands of museum records, IBRAM governance
- **Search Strategy**:
  - "MuseusBr platform"
  - "MuseusBr IBRAM"
  - "Brazilian museums platform"
- **Likelihood**: **HIGH** (official IBRAM platform, government system)

---

### Tier 2: Medium Likelihood (University Collections, Regional Museums)

#### 11. **MUSEAR/UFMT** ⭐
- **Type**: EDUCATION_PROVIDER
- **Location**: Mato Grosso (UFMT - Federal University)
- **Description**: @musearufmt, 3,000+ pieces, 29+ tribes
- **Search Strategy**:
  - "MUSEAR UFMT"
  - "Museu Arqueologia Etnologia UFMT"
  - "Universidade Federal Mato Grosso museum"
- **Likelihood**: **MEDIUM** (university museum, anthropology focus)

#### 12. **Instituto Insikiran** ⭐
- **Type**: OFFICIAL_INSTITUTION
- **Location**: Roraima
- **Description**: Indigenous higher education, 300+ graduates
- **Search Strategy**:
  - "Instituto Insikiran"
  - "UFRR Insikiran"
  - "Indigenous education Roraima"
- **Likelihood**: **MEDIUM** (UFRR institute, indigenous focus)

#### 13. **Natural History Museum Campina Grande** ⭐
- **Type**: MUSEUM
- **Location**: Campina Grande, Paraíba
- **Description**: Natural history focus
- **Search Strategy**:
  - "Museu História Natural Campina Grande"
  - "Natural history museum Paraíba"
  - "MUHNA Campina Grande"
- **Likelihood**: **MEDIUM** (regional natural history museum)

#### 14. **SECULT Amapá** ⭐
- **Type**: OFFICIAL_INSTITUTION
- **Location**: Amapá state
- **Description**: State Culture Secretariat (gab.secult@secult.ap.gov.br)
- **Search Strategy**:
  - "Secretaria de Cultura Amapá"
  - "SECULT Amapá"
  - "Amapá cultural department"
- **Likelihood**: **MEDIUM** (state government department)

#### 15. **Casa das Minas / Casa de Nagô** ⭐
- **Type**: MIXED (likely HOLY_SITES)
- **Location**: Maranhão
- **Description**: Afro-Brazilian heritage (Tambor de Mina)
- **Search Strategy**:
  - "Casa das Minas São Luís"
  - "Casa Nagô Maranhão"
  - "Tambor de Mina"
- **Likelihood**: **MEDIUM** (important Afro-Brazilian religious sites)
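Every candidate entry above carries the same fields (type, location, search strings, likelihood), so the list can be held as structured records and ordered by likelihood before searching. This is only an illustrative sketch: the `Candidate` class, its field names, and the numeric ranking are assumptions, not the project's schema.

```python
from dataclasses import dataclass, field

# Assumed ordering of the likelihood labels used in this plan (lower = search first)
LIKELIHOOD_RANK = {"VERY HIGH": 0, "HIGH": 1, "MEDIUM-HIGH": 2, "MEDIUM": 3}

@dataclass
class Candidate:
    name: str
    inst_type: str
    location: str
    likelihood: str
    queries: list[str] = field(default_factory=list)

candidates = [
    Candidate("Museu Histórico de Alcântara", "MUSEUM", "Alcântara, Maranhão",
              "MEDIUM-HIGH", ["Museu Histórico de Alcântara"]),
    Candidate("APESP", "MIXED", "São Paulo", "VERY HIGH",
              ["Arquivo Público do Estado de São Paulo", "APESP São Paulo"]),
    Candidate("MuseusBr", "OFFICIAL_INSTITUTION", "National platform",
              "HIGH", ["MuseusBr IBRAM"]),
]

# sorted() is stable, so candidates with equal likelihood keep their listed order
ordered = sorted(candidates, key=lambda c: LIKELIHOOD_RANK[c.likelihood])
print([c.name for c in ordered])
# ['APESP', 'MuseusBr', 'Museu Histórico de Alcântara']
```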

---

## Excluded from Batch 16

### Technical Systems (Not Heritage Institutions)
- APIs, DSpace, AtoM, LOCKSS Cariniana (technical platforms)
- Tainacan implementations (content management system)
- Metadata, Hemeroteca Digital (technical services)
- Mapa Cultural (mapping platform)

### Duplicate Entries
- Fundação de Cultura Elias Mansour vs. FEM (same institution entered twice; searched once, as candidate 6)

### Low-Priority Regional Projects
- Jalapão Heritage (regional project, unlikely to have a Wikidata item)
- Ouro Preto System (municipal system)
- Guarani-Kaiowá Projects (anthropological documentation, not an institution)

---

## Batch 16 Execution Plan

### Phase 1: Search Top 10 Candidates (Tier 1)
**Target**: 5-7 successful matches

1. DEAP Archives (Paraná state archive)
2. APESP (São Paulo state archive)
3. Sistema Brasileiro de Museus (national museum system)
4. FCRB - Casa de Rui Barbosa (federal foundation)
5. MuseusBr (IBRAM national platform)
6. Museu dos Povos Acreanos (Acre museum, 2023)
7. FUMDHAM (Serra da Capivara, UNESCO site)
8. Fundação Elias Mansour (Acre cultural foundation)
9. Museu Histórico Alcântara (Maranhão colonial museum)
10. Museu da Memória Rondoniense (Rondônia state museum)
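One way to run these searches is through the Wikidata API's `wbsearchentities` action with Portuguese as the search language. The sketch below only builds the request URL and sends nothing; the helper name `search_url` and the result limit of 5 are assumptions, and the plan does not state which search interface the project actually uses.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_url(query: str, language: str = "pt", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for an item search."""
    params = {
        "action": "wbsearchentities",
        "search": query,
        "language": language,  # search Portuguese labels and aliases
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

print(search_url("DEAP Paraná"))
```

The returned URL can be fetched with any HTTP client; each result in the JSON response carries a Q-identifier and label that still need to be checked against the institution record.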

### Phase 2: Search Tier 2 Candidates (if needed)
**Target**: 2-3 additional matches

11. MUSEAR/UFMT (university anthropology museum)
12. Instituto Insikiran (indigenous education institute)
13. Natural History Museum Campina Grande
14. SECULT Amapá (state culture department)
15. Casa das Minas/Nagô (Afro-Brazilian heritage sites)

---

## Success Criteria

### Minimum Success
- **5+ institutions enriched**
- Coverage: 63.2% → 67.2% (84/125)

### Target Success
- **7-8 institutions enriched**
- Coverage: 63.2% → 68.8-69.6% (86-87/125)
- Within 1-2 institutions of 70% goal

### Maximum Success
- **10+ institutions enriched**
- Coverage: 63.2% → **71.2%+ (89+/125)** ✨ **70% GOAL ACHIEVED**
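The coverage figures for each success level can be reproduced from the current count of 79 out of 125. The following sketch is only a cross-check of the arithmetic; the constant and function names are illustrative.

```python
TOTAL = 125
CURRENT = 79
GOAL_PCT = 70

def project(added: int) -> tuple[int, float, bool]:
    """Return (enriched count, coverage %, goal reached) after `added` matches."""
    enriched = CURRENT + added
    pct = round(100 * enriched / TOTAL, 1)
    # Compare unrounded integers so rounding cannot flip the result
    reached = 100 * enriched >= GOAL_PCT * TOTAL
    return enriched, pct, reached

for added in (5, 7, 8, 10):
    enriched, pct, reached = project(added)
    print(f"+{added}: {enriched}/{TOTAL} = {pct}% (70% reached: {reached})")
# +5: 84/125 = 67.2% (70% reached: False)
# +7: 86/125 = 68.8% (70% reached: False)
# +8: 87/125 = 69.6% (70% reached: False)
# +10: 89/125 = 71.2% (70% reached: True)
```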

---

## Search Quality Standards

### Match Criteria
- **Minimum similarity**: 0.85 (high confidence)
- **Manual verification**: Flag scores 0.85-0.90 for review
- **Identifier requirements**: Prioritize institutions with multiple external IDs
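The two thresholds above define three outcomes for a candidate match. A minimal sketch of that rule follows, assuming the review band is inclusive at both ends (the plan does not say whether a score of exactly 0.90 is flagged); `classify_match` is a hypothetical name, and how the similarity score itself is computed is left out.

```python
def classify_match(score: float, minimum: float = 0.85, review_upto: float = 0.90) -> str:
    """Map a similarity score to 'reject', 'review', or 'accept'."""
    if score < minimum:
        return "reject"   # below the 0.85 minimum similarity
    if score <= review_upto:
        return "review"   # 0.85-0.90: flag for manual verification
    return "accept"       # above 0.90: accept without manual review

for score in (0.80, 0.85, 0.90, 0.95):
    print(score, classify_match(score))
# 0.8 reject
# 0.85 review
# 0.9 review
# 0.95 accept
```

Keeping the thresholds as parameters makes it easy to tighten the review band if early batches show false positives.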

### Documentation Standards
- Complete descriptions (100+ words minimum)
- Alternative names (Portuguese + English)
- GeoNames location IDs
- Full provenance metadata
- Enrichment history with confidence scores
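These standards can be checked mechanically before a record is merged. The sketch below assumes a record is a plain dictionary; all key names (`description`, `alternative_names`, `geonames_id`, `provenance`, `enrichment_history`, `confidence`) are hypothetical placeholders and would need to be mapped to the project's actual LinkML schema.

```python
def documentation_gaps(record: dict) -> list[str]:
    """Return a list of unmet documentation standards (empty if none)."""
    gaps = []
    if len(record.get("description", "").split()) < 100:
        gaps.append("description shorter than 100 words")
    names = record.get("alternative_names", {})
    for lang in ("pt", "en"):
        if not names.get(lang):
            gaps.append(f"missing {lang} alternative name")
    if not record.get("geonames_id"):
        gaps.append("missing GeoNames ID")
    if not record.get("provenance"):
        gaps.append("missing provenance metadata")
    history = record.get("enrichment_history", [])
    if not history or any("confidence" not in entry for entry in history):
        gaps.append("enrichment history lacks confidence scores")
    return gaps

# An empty record fails every check
print(len(documentation_gaps({})))  # 6
```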

---

## Expected Outcomes

### Optimistic Scenario (70% Goal Reached)
If 10+ institutions enriched → **Coverage: 71.2%+** ✨
- **Mission accomplished**: 70% target exceeded
- Move to quality verification phase
- Prepare for other countries

### Realistic Scenario (Approaching 70%)
If 7-8 institutions enriched → **Coverage: 68.8-69.6%**
- **One more batch** (Batch 17) needed to secure 70%
- Focus remaining efforts on university collections

### Pessimistic Scenario (Incremental Progress)
If 5-6 institutions enriched → **Coverage: 67.2-68.0%**
- **Two more batches** (Batches 17-18) to reach 70%
- May need to create Wikidata items for notable institutions

---

## Risk Assessment

### Low-Risk Targets (Tier 1: 1-10)
- **APESP, DEAP, FCRB, SBM**: Major state/federal institutions → VERY HIGH Wikidata likelihood
- **Expected success rate**: 70-80% (7-8 of 10 found)

### Medium-Risk Targets (Tier 2: 11-15)
- **MUSEAR, Insikiran, Campina Grande, SECULT**: Regional/academic institutions
- **Expected success rate**: 40-60% (2-3 of 5 found)

### Overall Expected Success
- **Combined success rate**: 60-73% (9-11 of 15 institutions found, summing the two tier estimates)
- **Likely coverage after Batch 16**: **68.8-72.0%** (86-87/125 from Tier 1 alone, 88-90/125 if Tier 2 is also searched)

---

## Next Steps

1. ✅ **Execute Batch 16 searches** for Tier 1 candidates (top 10)
2. ✅ **Extract full metadata** for successful matches
3. ✅ **Create batch16_enriched.yaml** with LinkML-compliant records
4. ✅ **Merge into main dataset** (create backup first)
5. ✅ **Generate Batch 16 report** with coverage statistics
6. 🎯 **Assess if 70% reached** or if Batch 17 needed
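Step 4 (back up, then merge) could look roughly like the sketch below. It is not the contents of `merge_batch16.py`: the helper names, the `id` key, and the choice to fail on unknown records are assumptions, and YAML loading and saving is omitted so the example depends only on the standard library.

```python
import shutil
from pathlib import Path

def backup(path: Path, suffix: str = ".bak.batch16") -> Path:
    """Copy the dataset next to itself before it is modified."""
    target = path.with_name(path.name + suffix)
    shutil.copy2(path, target)  # copy2 also preserves file timestamps
    return target

def merge_records(main: list[dict], enriched: list[dict], key: str = "id") -> list[dict]:
    """Overlay enriched fields onto matching main records, keeping main order."""
    updates = {rec[key]: rec for rec in enriched}
    merged = [{**rec, **updates.pop(rec[key], {})} for rec in main]
    if updates:
        # Enrichment should only touch institutions already in the dataset
        raise KeyError(f"enriched records not found in main dataset: {sorted(updates)}")
    return merged

main = [{"id": "br-001", "name": "APESP"}, {"id": "br-002", "name": "FCRB"}]
enriched = [{"id": "br-001", "wikidata_id": "Q-PLACEHOLDER"}]  # placeholder, not a real QID
print(merge_records(main, enriched))
# [{'id': 'br-001', 'name': 'APESP', 'wikidata_id': 'Q-PLACEHOLDER'}, {'id': 'br-002', 'name': 'FCRB'}]
```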

---

## Files to Create

```
data/instances/brazil/
└── batch16_enriched.yaml (5-10 institutions)

scripts/
└── merge_batch16.py (merge script)

reports/brazil/
├── batch16_plan.md (this file)
└── batch16_report.md (to be created after execution)

data/instances/all/
├── globalglam-20251111.yaml.bak.batch16 (backup)
└── globalglam-20251111.yaml (updated)
```

---

## Timeline

- **Planning**: ✅ Complete (this document)
- **Execution**: Ready to begin
- **Estimated duration**: 30-45 minutes (searches + extraction)
- **Report generation**: 10 minutes

**Total estimated time**: 1 hour for complete Batch 16 cycle

---

**Ready to execute**: Yes ✅
**Next action**: Begin Wikidata searches for Tier 1 candidates
**Goal**: Reach 70% coverage (88/125 institutions)