glam/SESSION_SUMMARY_20251111_BRAZIL_MERGE.md
2025-11-19 23:25:22 +01:00

Session Summary: Brazil Batch 8 Merge - Investigation Complete

Date: November 11, 2025
Phase: Phase 2 - Latin America Regional Campaign
Status: Brazil Merge Complete


Problem Identified

A previous session summary claimed that Brazil had 69/212 (32.5%) Wikidata coverage, but analysis of the master dataset showed only 29/212 (13.7%). Investigation revealed that the batch enrichment files had never been fully merged.

Investigation Results

Brazil Dataset Structure Discovery

Found two separate Brazil datasets:

  1. Batch enrichment files (115 institutions):

    • brazilian_institutions_batch6_enriched.yaml (7 with Wikidata)
    • brazilian_institutions_batch7_enriched.yaml (8 with Wikidata)
    • brazilian_institutions_batch8_enriched.yaml (9 with Wikidata) ← Most recent
    • These are iterative enrichments of a 115-institution subset
  2. Master unified dataset (212 institutions):

    • All from CONVERSATION_NLP extraction
    • Contains MORE institutions than batch files
    • Previously had only 7 institutions merged from the batch enrichments

Root Cause

Batch 8 enrichment was completed, but 2 of its 9 Wikidata-enriched institutions were never merged back into the master dataset:

  1. Biblioteca Nacional Digital (BNDigital) - Q948882
  2. Biblioteca Brasiliana Guita e José Mindlin - Q18500412

Actions Taken

1. Created Brazil Merge Script

File: scripts/merge_brazil_batch8.py

Features:

  • Matches institutions by ID URL
  • Merges only NEW Wikidata enrichments (not already in master)
  • Preserves GHCID fields from master dataset
  • Reports merge statistics and verification
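
The behaviour listed above can be sketched roughly as follows. This is not the contents of scripts/merge_brazil_batch8.py: the record layout (an `id` URL, an `identifiers` list with `identifier_scheme` and `identifier_value` keys, and a `ghcid` field) and both function names are assumptions made for illustration.

```python
from copy import deepcopy


def wikidata_ids(record):
    """Return the Wikidata Q-numbers attached to a record (assumed schema)."""
    return [
        ident["identifier_value"]
        for ident in record.get("identifiers", [])
        if ident.get("identifier_scheme") == "Wikidata"
    ]


def merge_wikidata(master, enriched):
    """Copy Wikidata identifiers from enriched records into matching master records.

    Records are matched on their `id` URL. Master records that already carry a
    Wikidata identifier are skipped, and no other master field (such as the
    assumed `ghcid`) is touched.
    """
    merged = deepcopy(master)  # leave the caller's data unmodified
    by_id = {rec["id"]: rec for rec in merged}
    stats = {"merged": 0, "skipped": 0, "unmatched": 0}

    for rec in enriched:
        new_ids = wikidata_ids(rec)
        if not new_ids:
            continue  # this batch record has nothing to contribute
        target = by_id.get(rec["id"])
        if target is None:
            stats["unmatched"] += 1  # batch record not present in master
        elif wikidata_ids(target):
            stats["skipped"] += 1  # master already has a Wikidata identifier
        else:
            target.setdefault("identifiers", []).extend(
                {"identifier_scheme": "Wikidata", "identifier_value": qid}
                for qid in new_ids
            )
            stats["merged"] += 1
    return merged, stats
```

Returning the counters alongside the merged list keeps the reporting step separate from the merge itself, so the same numbers can be printed or checked in a verification pass.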

2. Executed Merge

Results:

  • Merged 2 new Wikidata enrichments
  • Skipped 7 already-merged institutions
  • Saved updated dataset: data/instances/all/globalglam-20251111.yaml

Coverage improvement:

  • Before: 29/212 institutions (13.7%)
  • After: 31/212 institutions (14.6%)
  • Improvement: +2 institutions
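
The coverage figures are the share of a country's records that carry at least one Wikidata identifier. A minimal sketch of that count, assuming hypothetical `country` and `identifiers` fields on each record (the real dataset schema may differ):

```python
def wikidata_coverage(records, country):
    """Return (with_wikidata, total, percent) for one country code."""
    in_country = [r for r in records if r.get("country") == country]
    with_wikidata = sum(
        1
        for r in in_country
        if any(
            i.get("identifier_scheme") == "Wikidata"
            for i in r.get("identifiers", [])
        )
    )
    total = len(in_country)
    # Guard against an empty country to avoid division by zero.
    percent = round(100 * with_wikidata / total, 1) if total else 0.0
    return with_wikidata, total, percent
```

Running the same function on the master dataset before and after a merge gives a quick check that the reported improvement matches the data.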

3. Verified Merge

Both institutions now have Wikidata in master dataset:

  • Biblioteca Nacional Digital (BNDigital) → Q948882
  • Biblioteca Brasiliana Guita e José Mindlin → Q18500412

Corrected Brazil Status

Current Coverage: 31/212 institutions (14.6%)

Wikidata Enrichment Sources:

  • 📦 Batch 8 enrichments: 9 institutions (targeted manual enrichment)
  • 🌍 Conversation extraction: 22 institutions (already had Wikidata from NLP)

All 31 Institutions with Wikidata:

  1. Arquivo Público (Q8203651) 🌍
  2. Biblioteca Brasiliana Guita e José Mindlin (Q18500412) 📦
  3. Biblioteca Nacional Digital (BNDigital) (Q948882) 📦
  4. CCBB Brasília (Q56693296) 🌍
  5. Centro Dragão do Mar (Q5305525) 🌍
  6. Dom Bosco Museum (Q10333447) 📦
  7. FUNDAJ (Q10286348) 🌍
  8. Forte do Presépio (Q56694297) 🌍
  9. Forte dos Reis Magos (Q3304114) 🌍
  10. Geopark Araripe (Q10288918) 📦
  11. IMS (Q6041378) 🌍
  12. Inhotim (Q478245) 🌍
  13. Instituto Ricardo Brennand (Q2216591) 🌍
  14. MAM-BA (Q10333768) 🌍
  15. MARCO (Q10333754) 📦
  16. MARGS (Q7335252) 🌍
  17. MASP (Q82941) 🌍
  18. MON (Q4991927) 🌍
  19. Memorial do RS (Q10328566) 📦
  20. Museu Goeldi (Q3328425) 🌍
  21. Museu Nacional (Q1850416) 🌍
  22. Museu Zoroastro Artiaga (Q10333459) 🌍
  23. Museu da Borracha (Q10333651) 🌍
  24. Museu do Homem Sergipano (Q10333684) 📦
  25. Museu do Piauí (Q10333916) 🌍
  26. Parque Memorial Quilombo dos Palmares (Q10345196) 🌍
  27. Pinacoteca (Q2095209) 🌍
  28. São Luís UNESCO Site (Q8343768) 📦
  29. Teatro Amazonas (Q1434444) 🌍
  30. Teatro da Paz (Q3063375) 🌍
  31. UNESCO Goiás Velho (Q427697) 📦

📦 = From batch 8 targeted enrichment
🌍 = From conversation NLP extraction

Global Dataset Status

Overall Statistics

  • Total institutions: 13,502
  • With Wikidata: 7,858
  • Global coverage: 58.2%

Phase 1 Complete (7 countries at 100%)

Belgium: 7 institutions
United States: 7 institutions
Great Britain: 4 institutions
Italy: 3 institutions
Russia: 1 institution
Denmark: 1 institution
Luxembourg: 1 institution

Phase 1 Total: 24 institutions across 7 countries

Phase 2 Status: Latin America Campaign

Current Coverage by Country

| Country | Institutions | With Wikidata | Coverage | Status |
|---|---|---|---|---|
| 🇧🇷 Brazil | 212 | 31 | 14.6% | LOW - Primary target |
| 🇲🇽 Mexico | 192 | 96 | 50.0% | 🎯 MEDIUM - Good progress |
| 🇨🇱 Chile | 180 | 97 | 53.9% | 🎯 MEDIUM - Good progress |
| 🇦🇷 Argentina | 2 | 1 | 50.0% | Small dataset |

Priority Targets

Immediate Priority: 🇧🇷 Brazil (14.6% → target 30%+)

  • 181 institutions without Wikidata
  • Focus on major museums, archives, cultural centers
  • Use batch enrichment strategy (proven with batch 8)

Secondary Targets:

  • 🇲🇽 Mexico: 50.0% → 70%+ (96 institutions still without Wikidata)
  • 🇨🇱 Chile: 53.9% → 70%+ (83 institutions still without Wikidata)
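
To turn these percentage targets into workload, the number of additional enrichments each country needs can be computed directly from the coverage table. The sketch below is only an illustration; the function name is hypothetical, and it assumes targets are given as whole percentages.

```python
def institutions_needed(total, with_wikidata, target_percent):
    """Smallest number of additional enrichments that reaches target_percent.

    Integer arithmetic avoids floating-point rounding at exact boundaries.
    """
    required = (total * target_percent + 99) // 100  # ceil(total * pct / 100)
    return max(0, required - with_wikidata)


# (total institutions, currently with Wikidata, target percent)
targets = {
    "Brazil": (212, 31, 30),
    "Mexico": (192, 96, 70),
    "Chile": (180, 97, 70),
}
for country, (total, have, pct) in targets.items():
    more = institutions_needed(total, have, pct)
    print(f"{country}: {more} more to reach {pct}%")
```

With the figures above this works out to 33 more institutions for Brazil, 39 for Mexico, and 29 for Chile.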

Files Modified Today

Created

  • scripts/merge_brazil_batch8.py - Brazil batch 8 merge script

Updated

  • data/instances/all/globalglam-20251111.yaml - Master dataset (+2 Wikidata enrichments)

Key Findings

  1. Batch enrichment files are subsets: The 115-institution batch files are NOT the complete Brazil dataset
  2. Master has MORE institutions: 212 total from conversation extraction vs. 115 in batches
  3. Multiple enrichment sources: Both targeted batch enrichment AND conversation NLP contribute Wikidata
  4. Previous coverage claim was wrong: actual coverage is 14.6% after the merge (13.7% before), not 32.5%

Next Steps

Option A: Continue Brazil Enrichment

Create Batch 9: Enrich 10-15 more high-priority institutions

  • Focus on major museums without Wikidata
  • Target institutions likely to have Wikidata entries
  • Use existing batch 8 script as template

Candidates (181 institutions without Wikidata):

  • State museums (Museu do Estado)
  • Municipal archives (Arquivo Municipal)
  • Cultural centers (Centro Cultural)
  • Historical institutes (Instituto Histórico)

Option B: Switch to Mexico/Chile Campaign

Mexico (96 with Wikidata, 96 remaining):

  • 50% coverage achieved
  • Push to 70%+ with targeted enrichment
  • Large museums and archives likely in Wikidata

Chile (97 with Wikidata, 83 remaining):

  • 53.9% coverage achieved
  • Similar strategy to Mexico
  • Regional museums and archives

Option C: Create Analysis Report

Generate enrichment candidates list:

  • Group by institution type (museum, archive, library)
  • Prioritize by size/importance indicators
  • Check Wikidata manually for existence
  • Create targeted enrichment batch
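
A candidates list of this kind could be generated along the following lines. The `name`, `institution_type`, and `identifiers` fields, the type codes, and the priority order are all assumptions for illustration; size or importance indicators are not modelled here.

```python
from collections import defaultdict

# Assumed priority order: lower number means enrich first.
TYPE_PRIORITY = {"museum": 0, "archive": 1, "library": 2}


def has_wikidata(record):
    """True if the record carries a Wikidata identifier (assumed schema)."""
    return any(
        i.get("identifier_scheme") == "Wikidata"
        for i in record.get("identifiers", [])
    )


def enrichment_candidates(records):
    """Group records without Wikidata by type, ordered by assumed priority."""
    groups = defaultdict(list)
    for rec in records:
        if not has_wikidata(rec):
            groups[rec.get("institution_type", "unknown")].append(rec["name"])
    # Known types come first in priority order; others follow alphabetically.
    ordered = sorted(
        groups, key=lambda t: (TYPE_PRIORITY.get(t, len(TYPE_PRIORITY)), t)
    )
    return {t: sorted(groups[t]) for t in ordered}
```

The resulting mapping can then be reviewed by hand, checking Wikidata for each name before it goes into an enrichment batch.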

Success Metrics

Phase 1 Achievements

  • 7 countries at 100% coverage (24 institutions)
  • All small datasets (1 to 7 institutions per country) complete
  • Quick wins strategy: COMPLETE

Phase 2 Progress 🔄

  • Brazil merge issue: RESOLVED
  • Brazil coverage: 14.6% (baseline established)
  • Ready for Latin America campaign scale-up

Overall Progress

  • Dataset: 13,502 institutions
  • Wikidata coverage: 58.2%
  • Countries represented: 60+
  • Focus regions: Europe (complete), Latin America (in progress)

Recommendations

Immediate next action: Create Brazil Batch 9 Enrichment Script

Strategy:

  1. Analyze 181 institutions without Wikidata
  2. Prioritize by type (museums > archives > cultural centers)
  3. Manual Wikidata search for 10-15 high-value targets
  4. Create scripts/enrich_brazil_batch9.py
  5. Merge into master with scripts/merge_brazil_batch9.py
  6. Iterate until 30%+ coverage achieved

Timeline:

  • Batch 9: +10 institutions (→ 19.3% coverage)
  • Batch 10: +10 institutions (→ 24.1% coverage)
  • Batch 11: +15 institutions (→ 31.1% coverage) ← Target reached
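
The projected percentages follow from adding each batch to the current 31/212. A small sketch for recomputing them when batch sizes change, assuming every institution in a batch ends up with a Wikidata match (the function name is hypothetical):

```python
def project_coverage(total, current, batch_sizes):
    """Return cumulative (count, percent) after each planned batch."""
    projections = []
    count = current
    for size in batch_sizes:
        count += size
        projections.append((count, round(100 * count / total, 1)))
    return projections


# Planned sizes for batches 9, 10, and 11, starting from 31 of 212.
planned = project_coverage(212, 31, [10, 10, 15])
for batch, (count, pct) in zip((9, 10, 11), planned):
    print(f"Batch {batch}: {count}/212 = {pct}%")
```

If some candidates turn out to have no Wikidata entry, the batch sizes passed in should be the number of successful matches rather than the number attempted.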

Session Status: COMPLETE
Files Created: 1 (merge script)
Institutions Enriched: 2 (Brazil)
Global Coverage: 58.2% (unchanged at one decimal place after merging 2 institutions)
Phase 2: Ready to scale up Brazil enrichment campaign