# Session Summary: Brazil Batch 8 Merge - Investigation Complete

**Date**: November 11, 2025
**Phase**: Phase 2 - Latin America Regional Campaign
**Status**: Brazil Merge Complete ✅

---

## Problem Identified

The previous session summary claimed that Brazil had 69/212 (32.5%) Wikidata coverage, but analysis of the master dataset found only 29/212 (13.7%). Investigation showed that the batch enrichment files had never been fully merged into the master dataset.

## Investigation Results

### Brazil Dataset Structure Discovery

Found **two separate Brazil datasets**:

1. **Batch enrichment files** (115 institutions):
   - `brazilian_institutions_batch6_enriched.yaml` (7 with Wikidata)
   - `brazilian_institutions_batch7_enriched.yaml` (8 with Wikidata)
   - `brazilian_institutions_batch8_enriched.yaml` (9 with Wikidata) ← **Most recent**
   - These are iterative enrichments of the same 115-institution subset

2. **Master unified dataset** (212 institutions):
   - All from CONVERSATION_NLP extraction
   - Contains more institutions than the batch files
   - Previously had only 7 institutions from the batch enrichments merged

### Root Cause

Batch 8 enrichment was completed, but **2 institutions were never merged** back into the master dataset:
1. Biblioteca Nacional Digital (BNDigital) - Q948882
2. Biblioteca Brasiliana Guita e José Mindlin - Q18500412

## Actions Taken

### 1. Created Brazil Merge Script ✅

**File**: `scripts/merge_brazil_batch8.py`

Features:
- Matches institutions by ID URL
- Merges only new Wikidata enrichments (those not already in the master dataset)
- Preserves GHCID fields from the master dataset
- Reports merge statistics and verification results
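
The merge behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/merge_brazil_batch8.py`: the record layout (`id`, `identifiers`, `identifier_scheme`, `identifier_value`, `ghcid`), the function names, and the toy records with placeholder QIDs are all assumptions made for the example.

```python
from copy import deepcopy


def has_wikidata(record):
    """Return True if the record carries at least one Wikidata identifier."""
    return any(
        ident.get("identifier_scheme") == "Wikidata"
        for ident in record.get("identifiers", [])
    )


def merge_wikidata(master, batch):
    """Copy Wikidata identifiers from batch records into matching master records.

    Records are matched on their `id` URL. Master records that already have a
    Wikidata identifier are skipped, and no other master field (for example
    the hypothetical `ghcid` key) is modified.
    """
    merged = deepcopy(master)  # leave the caller's data untouched
    by_id = {rec["id"]: rec for rec in merged}
    stats = {"merged": 0, "skipped": 0, "unmatched": 0}

    for enriched in batch:
        if not has_wikidata(enriched):
            continue  # nothing to merge from this batch record
        target = by_id.get(enriched["id"])
        if target is None:
            stats["unmatched"] += 1
        elif has_wikidata(target):
            stats["skipped"] += 1
        else:
            new_ids = [
                ident for ident in enriched["identifiers"]
                if ident.get("identifier_scheme") == "Wikidata"
            ]
            target.setdefault("identifiers", []).extend(deepcopy(new_ids))
            stats["merged"] += 1
    return merged, stats


# Toy data (placeholder IDs and QIDs): one record is already enriched.
master = [
    {"id": "https://example.org/inst/a", "name": "A", "ghcid": "BR-XX-A",
     "identifiers": [{"identifier_scheme": "Wikidata", "identifier_value": "Q1"}]},
    {"id": "https://example.org/inst/b", "name": "B", "ghcid": "BR-XX-B"},
]
batch = [
    {"id": "https://example.org/inst/a", "identifiers": [
        {"identifier_scheme": "Wikidata", "identifier_value": "Q1"}]},
    {"id": "https://example.org/inst/b", "identifiers": [
        {"identifier_scheme": "Wikidata", "identifier_value": "Q2"}]},
]

merged, stats = merge_wikidata(master, batch)
print(stats)  # {'merged': 1, 'skipped': 1, 'unmatched': 0}
```

Working on a deep copy keeps the original master list intact, so the merged result can be compared against it before anything is written back to disk.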

### 2. Executed Merge ✅

**Results**:
- ✅ Merged 2 new Wikidata enrichments
- ✅ Skipped 7 already-merged institutions
- ✅ Saved updated dataset: `data/instances/all/globalglam-20251111.yaml`

**Coverage improvement**:
- Before: 29/212 institutions (13.7%)
- After: 31/212 institutions (14.6%)
- **Improvement**: +2 institutions

### 3. Verified Merge ✅

Both institutions now have Wikidata identifiers in the master dataset:
- ✅ Biblioteca Nacional Digital (BNDigital) → Q948882
- ✅ Biblioteca Brasiliana Guita e José Mindlin → Q18500412
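
A check of this kind could be scripted along the following lines. The two names and QIDs come from this summary; the record layout (`name`, `identifiers`, `identifier_scheme`, `identifier_value`) and the helper names are assumptions for illustration, and the in-memory `master` list merely stands in for the loaded YAML dataset.

```python
# Institutions that the batch 8 merge was expected to enrich.
EXPECTED = {
    "Biblioteca Nacional Digital (BNDigital)": "Q948882",
    "Biblioteca Brasiliana Guita e José Mindlin": "Q18500412",
}


def wikidata_ids(record):
    """Collect the Wikidata QIDs attached to one record (assumed layout)."""
    return {
        ident["identifier_value"]
        for ident in record.get("identifiers", [])
        if ident.get("identifier_scheme") == "Wikidata"
    }


def verify(master, expected):
    """Return the names whose expected QID is absent from the master records."""
    by_name = {rec["name"]: rec for rec in master}
    missing = []
    for name, qid in expected.items():
        rec = by_name.get(name)
        if rec is None or qid not in wikidata_ids(rec):
            missing.append(name)
    return missing


# Stand-in for the merged master dataset.
master = [
    {"name": "Biblioteca Nacional Digital (BNDigital)",
     "identifiers": [{"identifier_scheme": "Wikidata",
                      "identifier_value": "Q948882"}]},
    {"name": "Biblioteca Brasiliana Guita e José Mindlin",
     "identifiers": [{"identifier_scheme": "Wikidata",
                      "identifier_value": "Q18500412"}]},
]

missing = verify(master, EXPECTED)
print("all verified" if not missing else f"missing: {missing}")
```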

## Corrected Brazil Status

### Current Coverage: 31/212 institutions (14.6%)

**Wikidata Enrichment Sources**:
- **📦 Batch 8 enrichments**: 9 institutions (targeted manual enrichment)
- **🌍 Conversation extraction**: 22 institutions (already had Wikidata from NLP)

**All 31 Institutions with Wikidata**:
1. Arquivo Público (Q8203651) 🌍
2. Biblioteca Brasiliana Guita e José Mindlin (Q18500412) 📦
3. Biblioteca Nacional Digital (BNDigital) (Q948882) 📦
4. CCBB Brasília (Q56693296) 🌍
5. Centro Dragão do Mar (Q5305525) 🌍
6. Dom Bosco Museum (Q10333447) 📦
7. FUNDAJ (Q10286348) 🌍
8. Forte do Presépio (Q56694297) 🌍
9. Forte dos Reis Magos (Q3304114) 🌍
10. Geopark Araripe (Q10288918) 📦
11. IMS (Q6041378) 🌍
12. Inhotim (Q478245) 🌍
13. Instituto Ricardo Brennand (Q2216591) 🌍
14. MAM-BA (Q10333768) 🌍
15. MARCO (Q10333754) 📦
16. MARGS (Q7335252) 🌍
17. MASP (Q82941) 🌍
18. MON (Q4991927) 🌍
19. Memorial do RS (Q10328566) 📦
20. Museu Goeldi (Q3328425) 🌍
21. Museu Nacional (Q1850416) 🌍
22. Museu Zoroastro Artiaga (Q10333459) 🌍
23. Museu da Borracha (Q10333651) 🌍
24. Museu do Homem Sergipano (Q10333684) 📦
25. Museu do Piauí (Q10333916) 🌍
26. Parque Memorial Quilombo dos Palmares (Q10345196) 🌍
27. Pinacoteca (Q2095209) 🌍
28. São Luís UNESCO Site (Q8343768) 📦
29. Teatro Amazonas (Q1434444) 🌍
30. Teatro da Paz (Q3063375) 🌍
31. UNESCO Goiás Velho (Q427697) 📦

📦 = From batch 8 targeted enrichment
🌍 = From conversation NLP extraction

## Global Dataset Status

### Overall Statistics
- **Total institutions**: 13,502
- **With Wikidata**: 7,858
- **Global coverage**: 58.2%

### Phase 1 Complete (7 countries at 100%)
- ✅ Belgium: 7 institutions
- ✅ United States: 7 institutions
- ✅ Great Britain: 4 institutions
- ✅ Italy: 3 institutions
- ✅ Russia: 1 institution
- ✅ Denmark: 1 institution
- ✅ Luxembourg: 1 institution

**Phase 1 Total**: 24 institutions across 7 countries

## Phase 2 Status: Latin America Campaign

### Current Coverage by Country

| Country | Institutions | With Wikidata | Coverage | Status |
|---------|-------------|---------------|----------|--------|
| 🇧🇷 Brazil | 212 | 31 | 14.6% | ⏳ LOW - Primary target |
| 🇲🇽 Mexico | 192 | 96 | 50.0% | 🎯 MEDIUM - Good progress |
| 🇨🇱 Chile | 180 | 97 | 53.9% | 🎯 MEDIUM - Good progress |
| 🇦🇷 Argentina | 2 | 1 | 50.0% | ✅ Small dataset |

### Priority Targets

**Immediate Priority**: 🇧🇷 **Brazil** (14.6% → target 30%+)
- 181 institutions without Wikidata
- Focus on major museums, archives, and cultural centers
- Use the batch enrichment strategy (proven with batch 8)

**Secondary Targets**:
- 🇲🇽 Mexico: 50.0% → 70%+ (96 institutions still without Wikidata)
- 🇨🇱 Chile: 53.9% → 70%+ (83 institutions still without Wikidata)
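
To size these targets, it helps to work out how many additional institutions each country needs before it crosses its threshold. The sketch below does that arithmetic using the counts from the coverage table; the function name and the dictionary layout are illustrative choices, not part of the project scripts.

```python
def needed_for_target(total, with_wikidata, target_pct):
    """Return how many more enriched institutions reach `target_pct` coverage.

    Uses integer ceiling division so that, for example, 30% of 212 (63.6)
    rounds up to 64 required institutions.
    """
    required = (total * target_pct + 99) // 100
    return max(0, required - with_wikidata)


# (total institutions, currently with Wikidata, target coverage in percent)
targets = {
    "Brazil": (212, 31, 30),
    "Mexico": (192, 96, 70),
    "Chile": (180, 97, 70),
}

for country, (total, enriched, pct) in targets.items():
    gap = needed_for_target(total, enriched, pct)
    print(f"{country}: {gap} more institutions needed for {pct}%")
```

For Brazil, 30% of 212 rounds up to 64 institutions, which is 33 more than the current 31.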

## Files Modified Today

### Created
- `scripts/merge_brazil_batch8.py` - Brazil batch 8 merge script ✅

### Updated
- `data/instances/all/globalglam-20251111.yaml` - Master dataset (+2 Wikidata enrichments) ✅

## Key Findings

1. **Batch enrichment files are subsets**: The 115-institution batch files are not the complete Brazil dataset
2. **Master has more institutions**: 212 total from conversation extraction vs. 115 in the batch files
3. **Multiple enrichment sources**: Both targeted batch enrichment and conversation NLP contribute Wikidata identifiers
4. **Previous coverage claim was wrong**: Actual coverage is 14.6%, not 32.5%

## Next Steps

### Option A: Continue Brazil Batch Enrichment (Recommended)

**Create Batch 9**: Enrich 10-15 more high-priority institutions
- Focus on major museums without Wikidata
- Target institutions likely to have Wikidata entries
- Use existing batch 8 script as template

**Candidates** (181 institutions without Wikidata):
- State museums (Museu do Estado)
- Municipal archives (Arquivo Municipal)
- Cultural centers (Centro Cultural)
- Historical institutes (Instituto Histórico)

### Option B: Switch to Mexico/Chile Campaign

**Mexico** (96 with Wikidata, 96 remaining):
- 50% coverage achieved
- Push to 70%+ with targeted enrichment
- Large museums and archives likely in Wikidata

**Chile** (97 with Wikidata, 83 remaining):
- 53.9% coverage achieved
- Similar strategy to Mexico
- Regional museums and archives

### Option C: Create Analysis Report

**Generate enrichment candidates list**:
- Group by institution type (museum, archive, library)
- Prioritize by size/importance indicators
- Check Wikidata manually for existence
- Create targeted enrichment batch
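
A first pass at the grouping step might look like the sketch below. It guesses the institution type from keywords in the name, which is only a rough heuristic; the keyword table, the `name` and `wikidata` keys, and the sample records are assumptions for illustration, and a real report would more likely read a type field from the dataset.

```python
from collections import defaultdict

# Hypothetical keyword table; checked in order, first match wins.
TYPE_KEYWORDS = [
    ("museum", ("museu", "museum")),
    ("archive", ("arquivo", "archive")),
    ("library", ("biblioteca", "library")),
]


def guess_type(name):
    """Guess an institution type from its name, falling back to 'other'."""
    lowered = name.lower()
    for inst_type, keywords in TYPE_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return inst_type
    return "other"


def group_candidates(institutions):
    """Group institutions that still lack a Wikidata ID by guessed type."""
    groups = defaultdict(list)
    for inst in institutions:
        if inst.get("wikidata"):
            continue  # already enriched, not a candidate
        groups[guess_type(inst["name"])].append(inst["name"])
    return dict(groups)


# Sample records; only Museu Nacional already has a QID.
institutions = [
    {"name": "Museu Nacional", "wikidata": "Q1850416"},
    {"name": "Museu do Estado"},
    {"name": "Arquivo Municipal"},
    {"name": "Centro Cultural"},
    {"name": "Instituto Histórico"},
]

for inst_type, names in group_candidates(institutions).items():
    print(f"{inst_type}: {names}")
```

Anything landing in `other` (cultural centers, historical institutes) would still need manual review before being assigned to a batch.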

## Success Metrics

### Phase 1 Achievements ✅
- 7 countries at 100% coverage (24 institutions)
- All small datasets complete
- Quick wins strategy: **COMPLETE**

### Phase 2 Progress 🔄
- Brazil merge issue: **RESOLVED** ✅
- Brazil coverage: 14.6% (baseline established)
- Ready for Latin America campaign scale-up

### Overall Progress
- Dataset: 13,502 institutions
- Wikidata coverage: 58.2%
- Countries represented: 60+
- Focus regions: Europe (complete), Latin America (in progress)

## Recommendations

**Immediate next action**: Create **Brazil Batch 9 Enrichment Script**

**Strategy**:
1. Analyze the 181 institutions without Wikidata
2. Prioritize by type (museums > archives > cultural centers)
3. Search Wikidata manually for 10-15 high-value targets
4. Create `scripts/enrich_brazil_batch9.py`
5. Merge into master with `scripts/merge_brazil_batch9.py`
6. Iterate until 30%+ coverage is achieved

**Timeline**:
- Batch 9: +10 institutions (→ 19.3% coverage)
- Batch 10: +10 institutions (→ 24.1% coverage)
- Batch 11: +15 institutions (→ 31.1% coverage) ← **Target reached**
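
The projected coverage after each batch follows directly from the running count of enriched institutions divided by the 212-institution total. A small sketch of that calculation, assuming every planned enrichment succeeds (the function name and tuple layout are illustrative):

```python
def project_coverage(total, current, batches):
    """Return (batch name, cumulative count, coverage %) for each planned batch."""
    projection = []
    for name, added in batches:
        current += added
        projection.append((name, current, round(100 * current / total, 1)))
    return projection


# Planned batch sizes from the timeline; starting point is 31/212.
planned = [("Batch 9", 10), ("Batch 10", 10), ("Batch 11", 15)]

for name, count, pct in project_coverage(212, 31, planned):
    flag = " (target reached)" if pct >= 30 else ""
    print(f"{name}: {count}/212 = {pct}%{flag}")
```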

---

**Session Status**: ✅ COMPLETE
**Files Created**: 1 (merge script)
**Institutions Enriched**: 2 (Brazil)
**Global Coverage**: 58.2% (unchanged, 2 institutions merged)
**Phase 2**: Ready to scale up Brazil enrichment campaign