Session Summary: Brazil Batch 8 Merge - Investigation Complete
Date: November 11, 2025
Phase: Phase 2 - Latin America Regional Campaign
Status: Brazil Merge Complete ✅
Problem Identified
The previous session summary claimed Brazil had 69/212 (32.5%) Wikidata coverage, but analysis of the master dataset showed only 29/212 (13.7%). Investigation revealed that the batch enrichment files had never been fully merged.
Investigation Results
Brazil Dataset Structure Discovery
Found two separate Brazil datasets:
- Batch enrichment files (115 institutions):
  - brazilian_institutions_batch6_enriched.yaml (7 with Wikidata)
  - brazilian_institutions_batch7_enriched.yaml (8 with Wikidata)
  - brazilian_institutions_batch8_enriched.yaml (9 with Wikidata) ← Most recent
  - These are iterative enrichments of a 115-institution subset
- Master unified dataset (212 institutions):
  - All from CONVERSATION_NLP extraction
  - Contains MORE institutions than batch files
  - Only had 7 institutions from batch enrichments merged previously
Root Cause
Batch 8 enrichment was completed, but 2 of its 9 enriched institutions were never merged back into the master dataset:
- Biblioteca Nacional Digital (BNDigital) - Q948882
- Biblioteca Brasiliana Guita e José Mindlin - Q18500412
Actions Taken
1. Created Brazil Merge Script ✅
File: scripts/merge_brazil_batch8.py
Features:
- Matches institutions by ID URL
- Merges only NEW Wikidata enrichments (not already in master)
- Preserves GHCID fields from master dataset
- Reports merge statistics and verification
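The merge behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/merge_brazil_batch8.py: the record layout (an `id` URL, an `identifiers` list of `scheme`/`value` entries, and a `ghcid` field) is assumed, and `wikidata_id` and `merge_wikidata` are hypothetical helper names.

```python
from copy import deepcopy


def wikidata_id(record):
    """Return the record's Wikidata QID, or None (assumed `identifiers` layout)."""
    for ident in record.get("identifiers", []):
        if ident.get("scheme") == "Wikidata":
            return ident.get("value")
    return None


def merge_wikidata(master, enriched):
    """Copy Wikidata identifiers from `enriched` into a copy of `master`.

    Records are matched on their `id` URL. Only master records that lack
    Wikidata are changed; all other master fields (including the assumed
    `ghcid`) are left untouched.
    """
    merged = deepcopy(master)
    by_id = {rec["id"]: rec for rec in merged}
    stats = {"merged": 0, "skipped": 0, "unmatched": 0}

    for rec in enriched:
        qid = wikidata_id(rec)
        if qid is None:
            continue  # this batch record has nothing to contribute
        target = by_id.get(rec["id"])
        if target is None:
            stats["unmatched"] += 1  # batch record not present in master
        elif wikidata_id(target) is not None:
            stats["skipped"] += 1  # master already has Wikidata
        else:
            target.setdefault("identifiers", []).append(
                {"scheme": "Wikidata", "value": qid}
            )
            stats["merged"] += 1
    return merged, stats


# Illustrative records; the id URLs and GHCID values are placeholders.
master = [
    {"id": "https://example.org/inst/a", "ghcid": "BR-XX-A", "identifiers": []},
    {"id": "https://example.org/inst/b", "ghcid": "BR-XX-B",
     "identifiers": [{"scheme": "Wikidata", "value": "Q82941"}]},
]
batch = [
    {"id": "https://example.org/inst/a",
     "identifiers": [{"scheme": "Wikidata", "value": "Q948882"}]},
    {"id": "https://example.org/inst/b",
     "identifiers": [{"scheme": "Wikidata", "value": "Q82941"}]},
]
merged, stats = merge_wikidata(master, batch)
print(stats)
```

Working on a deep copy keeps the input list unmodified, so the merge can be re-run or compared against the original before anything is written back to disk.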
2. Executed Merge ✅
Results:
- ✅ Merged 2 new Wikidata enrichments
- ✅ Skipped 7 already-merged institutions
- ✅ Saved updated dataset: data/instances/all/globalglam-20251111.yaml
Coverage improvement:
- Before: 29/212 institutions (13.7%)
- After: 31/212 institutions (14.6%)
- Improvement: +2 institutions
3. Verified Merge ✅
Both institutions now have Wikidata in master dataset:
- ✅ Biblioteca Nacional Digital (BNDigital) → Q948882
- ✅ Biblioteca Brasiliana Guita e José Mindlin → Q18500412
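A check of this kind could be scripted along the following lines. The sketch assumes master records are dictionaries with a `name` field and an `identifiers` list of `scheme`/`value` entries; that layout and the helper name `missing_enrichments` are illustrative assumptions, not the project's actual verification code.

```python
# QIDs expected in the master dataset after the batch 8 merge.
EXPECTED = {
    "Biblioteca Nacional Digital (BNDigital)": "Q948882",
    "Biblioteca Brasiliana Guita e José Mindlin": "Q18500412",
}


def wikidata_id(record):
    """Return the record's Wikidata QID, or None (assumed `identifiers` layout)."""
    for ident in record.get("identifiers", []):
        if ident.get("scheme") == "Wikidata":
            return ident.get("value")
    return None


def missing_enrichments(records, expected):
    """Return names from `expected` whose QID is absent or different in `records`."""
    found = {rec["name"]: wikidata_id(rec) for rec in records}
    return [name for name, qid in expected.items() if found.get(name) != qid]


# Stand-in for the relevant slice of the loaded master dataset.
records = [
    {"name": "Biblioteca Nacional Digital (BNDigital)",
     "identifiers": [{"scheme": "Wikidata", "value": "Q948882"}]},
    {"name": "Biblioteca Brasiliana Guita e José Mindlin",
     "identifiers": [{"scheme": "Wikidata", "value": "Q18500412"}]},
]
problems = missing_enrichments(records, EXPECTED)
print("OK" if not problems else f"Missing or mismatched: {problems}")
```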
Corrected Brazil Status
Current Coverage: 31/212 institutions (14.6%)
Wikidata Enrichment Sources:
- 📦 Batch 8 enrichments: 9 institutions (targeted manual enrichment)
- 🌍 Conversation extraction: 22 institutions (already had Wikidata from NLP)
All 31 Institutions with Wikidata:
- Arquivo Público (Q8203651) 🌍
- Biblioteca Brasiliana Guita e José Mindlin (Q18500412) 📦
- Biblioteca Nacional Digital (BNDigital) (Q948882) 📦
- CCBB Brasília (Q56693296) 🌍
- Centro Dragão do Mar (Q5305525) 🌍
- Dom Bosco Museum (Q10333447) 📦
- FUNDAJ (Q10286348) 🌍
- Forte do Presépio (Q56694297) 🌍
- Forte dos Reis Magos (Q3304114) 🌍
- Geopark Araripe (Q10288918) 📦
- IMS (Q6041378) 🌍
- Inhotim (Q478245) 🌍
- Instituto Ricardo Brennand (Q2216591) 🌍
- MAM-BA (Q10333768) 🌍
- MARCO (Q10333754) 📦
- MARGS (Q7335252) 🌍
- MASP (Q82941) 🌍
- MON (Q4991927) 🌍
- Memorial do RS (Q10328566) 📦
- Museu Goeldi (Q3328425) 🌍
- Museu Nacional (Q1850416) 🌍
- Museu Zoroastro Artiaga (Q10333459) 🌍
- Museu da Borracha (Q10333651) 🌍
- Museu do Homem Sergipano (Q10333684) 📦
- Museu do Piauí (Q10333916) 🌍
- Parque Memorial Quilombo dos Palmares (Q10345196) 🌍
- Pinacoteca (Q2095209) 🌍
- São Luís UNESCO Site (Q8343768) 📦
- Teatro Amazonas (Q1434444) 🌍
- Teatro da Paz (Q3063375) 🌍
- UNESCO Goiás Velho (Q427697) 📦
📦 = From batch 8 targeted enrichment
🌍 = From conversation NLP extraction
Global Dataset Status
Overall Statistics
- Total institutions: 13,502
- With Wikidata: 7,858
- Global coverage: 58.2%
Phase 1 Complete (7 countries at 100%)
✅ Belgium: 7 institutions
✅ United States: 7 institutions
✅ Great Britain: 4 institutions
✅ Italy: 3 institutions
✅ Russia: 1 institution
✅ Denmark: 1 institution
✅ Luxembourg: 1 institution
Phase 1 Total: 24 institutions across 7 countries
Phase 2 Status: Latin America Campaign
Current Coverage by Country
| Country | Institutions | With Wikidata | Coverage | Status |
|---|---|---|---|---|
| 🇧🇷 Brazil | 212 | 31 | 14.6% | ⏳ LOW - Primary target |
| 🇲🇽 Mexico | 192 | 96 | 50.0% | 🎯 MEDIUM - Good progress |
| 🇨🇱 Chile | 180 | 97 | 53.9% | 🎯 MEDIUM - Good progress |
| 🇦🇷 Argentina | 2 | 1 | 50.0% | ✅ Small dataset |
Priority Targets
Immediate Priority: 🇧🇷 Brazil (14.6% → target 30%+)
- 181 institutions without Wikidata
- Focus on major museums, archives, cultural centers
- Use batch enrichment strategy (proven with batch 8)
Secondary Targets:
- 🇲🇽 Mexico: 50.0% → 70%+ (96 institutions still without Wikidata)
- 🇨🇱 Chile: 53.9% → 70%+ (83 institutions still without Wikidata)
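The gap between current coverage and each target can be worked out with a small calculation. The sketch below uses the totals from the coverage table; `needed_for_target` is a hypothetical helper, and integer arithmetic is used for the ceiling to avoid floating-point rounding surprises.

```python
def needed_for_target(total, with_wikidata, target_pct):
    """Smallest number of additional enrichments that reaches target_pct coverage."""
    required = -(-total * target_pct // 100)  # integer ceiling of total * pct / 100
    return max(0, required - with_wikidata)


# (total institutions, currently with Wikidata, target coverage in percent)
countries = {
    "Brazil": (212, 31, 30),
    "Mexico": (192, 96, 70),
    "Chile": (180, 97, 70),
}
for name, (total, have, target) in countries.items():
    more = needed_for_target(total, have, target)
    print(f"{name}: {more} more to reach {target}%")
```

On these figures the targets require 33 more enrichments for Brazil, 39 for Mexico, and 29 for Chile, which is well below the full count of institutions still lacking Wikidata in each country.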
Files Modified Today
Created
- scripts/merge_brazil_batch8.py - Brazil batch 8 merge script ✅
Updated
- data/instances/all/globalglam-20251111.yaml - Master dataset (+2 Wikidata enrichments) ✅
Key Findings
- Batch enrichment files are subsets: The 115-institution batch files are NOT the complete Brazil dataset
- Master has MORE institutions: 212 total from conversation extraction vs. 115 in batches
- Multiple enrichment sources: Both targeted batch enrichment AND conversation NLP contribute Wikidata
- Previous coverage claim was wrong: Actual 14.6%, not 32.5%
Next Steps
Option A: Continue Brazil Batch Enrichment (Recommended)
Create Batch 9: Enrich 10-15 more high-priority institutions
- Focus on major museums without Wikidata
- Target institutions likely to have Wikidata entries
- Use existing batch 8 script as template
Candidates (181 institutions without Wikidata):
- State museums (Museu do Estado)
- Municipal archives (Arquivo Municipal)
- Cultural centers (Centro Cultural)
- Historical institutes (Instituto Histórico)
Option B: Switch to Mexico/Chile Campaign
Mexico (96 with Wikidata, 96 remaining):
- 50% coverage achieved
- Push to 70%+ with targeted enrichment
- Large museums and archives likely in Wikidata
Chile (97 with Wikidata, 83 remaining):
- 53.9% coverage achieved
- Similar strategy to Mexico
- Regional museums and archives
Option C: Create Analysis Report
Generate enrichment candidates list:
- Group by institution type (museum, archive, library)
- Prioritize by size/importance indicators
- Check Wikidata manually for existence
- Create targeted enrichment batch
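A candidates list of this kind might be generated as sketched below. The field names (`name`, `type`, `identifiers`), the type ordering in `TYPE_PRIORITY`, and the helper names are all assumptions made for illustration; the sample records are placeholders rather than entries from the dataset.

```python
from collections import defaultdict

# Assumed ordering; types not listed here sort after these, alphabetically.
TYPE_PRIORITY = {"museum": 0, "archive": 1, "library": 2}


def has_wikidata(record):
    """True if the record carries a Wikidata identifier (assumed layout)."""
    return any(
        ident.get("scheme") == "Wikidata"
        for ident in record.get("identifiers", [])
    )


def enrichment_candidates(records):
    """Group records lacking Wikidata by type, highest-priority type first."""
    groups = defaultdict(list)
    for rec in records:
        if not has_wikidata(rec):
            groups[rec.get("type", "unknown")].append(rec["name"])
    ordered = sorted(
        groups, key=lambda t: (TYPE_PRIORITY.get(t, len(TYPE_PRIORITY)), t)
    )
    return {t: sorted(groups[t]) for t in ordered}


# Placeholder records, not taken from the dataset.
records = [
    {"name": "Example Archive", "type": "archive", "identifiers": []},
    {"name": "Example Museum B", "type": "museum", "identifiers": []},
    {"name": "Example Museum A", "type": "museum"},
    {"name": "Example Library", "type": "library",
     "identifiers": [{"scheme": "Wikidata", "value": "Q-PLACEHOLDER"}]},
]
candidates = enrichment_candidates(records)
for inst_type, names in candidates.items():
    print(f"{inst_type}: {len(names)} candidate(s): {', '.join(names)}")
```

The grouped output only narrows the search; whether a Wikidata entry actually exists for each candidate still has to be checked manually.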
Success Metrics
Phase 1 Achievements ✅
- 7 countries at 100% coverage (24 institutions)
- All small datasets complete
- Quick wins strategy: COMPLETE
Phase 2 Progress 🔄
- Brazil merge issue: RESOLVED ✅
- Brazil coverage: 14.6% (baseline established)
- Ready for Latin America campaign scale-up
Overall Progress
- Dataset: 13,502 institutions
- Wikidata coverage: 58.2%
- Countries represented: 60+
- Focus regions: Europe (complete), Latin America (in progress)
Recommendations
Immediate next action: Create Brazil Batch 9 Enrichment Script
Strategy:
- Analyze 181 institutions without Wikidata
- Prioritize by type (museums > archives > cultural centers)
- Manual Wikidata search for 10-15 high-value targets
- Create scripts/enrich_brazil_batch9.py
- Merge into master with scripts/merge_brazil_batch9.py
- Iterate until 30%+ coverage achieved
Timeline:
- Batch 9: +10 institutions (→ 19.3% coverage)
- Batch 10: +10 institutions (→ 24.1% coverage)
- Batch 11: +15 institutions (→ 31.1% coverage) ← Target reached
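The projected coverage after each batch follows directly from the current count of 31 out of 212. A minimal sketch for recomputing the timeline when batch sizes change (`project_coverage` is a hypothetical helper name):

```python
def project_coverage(total, current, batch_sizes):
    """Return (institutions with Wikidata, coverage %) after each planned batch."""
    results = []
    for size in batch_sizes:
        current += size
        results.append((current, round(100 * current / total, 1)))
    return results


# Planned batches 9, 10 and 11 add 10, 10 and 15 institutions respectively.
projection = project_coverage(212, 31, [10, 10, 15])
for batch, (count, pct) in zip((9, 10, 11), projection):
    print(f"Batch {batch}: {count}/212 = {pct}%")
```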
Session Status: ✅ COMPLETE
Files Created: 1 (merge script)
Institutions Enriched: 2 (Brazil)
Global Coverage: 58.2% (unchanged at this precision; 2 institutions merged)
Phase 2: Ready to scale up Brazil enrichment campaign