# Brazil Wikidata Enrichment - Batch 15 Final Report

**Date**: November 11, 2025  
**Batch Type**: Dataset Expansion (Bonus Institutions)  
**Status**: ✅ Complete  
**Institutions Added**: 4  
**Coverage Impact**: 62.0% → 63.2%

---

## Executive Summary

Batch 15 is a **dataset expansion** rather than a traditional enrichment pass. During the Batch 14 Wikidata searches, we discovered 4 major Brazilian heritage institutions with strong Wikidata presence that were **missing entirely** from the GlobalGLAM dataset. This batch adds these high-priority institutions with complete metadata.

### Key Achievements

- ✅ **4 new institutions added** to GlobalGLAM dataset (121 → 125 Brazilian institutions)
- ✅ **All 4 have Wikidata Q-numbers** (100% Wikidata coverage for batch)
- ✅ **Coverage increased** from 62.0% (75/121) to 63.2% (79/125)
- ✅ **High-quality metadata**: All institutions have at least two external identifiers (Wikidata and website); 3 of 4 also have VIAF or LCNAF
- ✅ **National significance**: 2 national museums, 1 federal foundation, 1 state museum

---

## Institutions Added

### 1. Museu Histórico Nacional (Q510993)
**National Historical Museum, Rio de Janeiro, RJ**

- **Type**: MUSEUM
- **Founded**: 1922
- **Significance**: One of Brazil's most important history museums with over 287,000 items
- **Identifiers**: Wikidata (Q510993), VIAF (123941953), LCNAF (n50052736), Website
- **Collection**: Colonial period through Republic - furniture, coins, weapons, documents, paintings
- **Location**: GeoNames ID 3451190 (Rio de Janeiro)
- **Confidence**: 0.98

**Why Added**: Major national institution with comprehensive Wikidata metadata. Houses significant Brazilian historical artifacts including items from the former Arsenal de Guerra and Casa do Trem.

---

### 2. Museu Imperial (Q1887049)
**Imperial Museum, Petrópolis, RJ**

- **Type**: MUSEUM
- **Founded**: 1943 (building: former palace of Emperor Pedro II)
- **Significance**: Preserves Brazilian Empire history (1822-1889), one of Brazil's most visited museums
- **Identifiers**: Wikidata (Q1887049), Website
- **Collection**: Crown Jewels, imperial family belongings, furniture, documents, paintings
- **Location**: GeoNames ID 3454031 (Petrópolis)
- **Confidence**: 0.95

**Why Added**: Important imperial heritage museum in the former imperial summer palace. Major cultural destination with neoclassical architecture, designated as a national monument.

---

### 3. Fundação Cultural Palmares (Q10286282)
**Palmares Cultural Foundation, Brasília, DF**

- **Type**: OFFICIAL_INSTITUTION
- **Founded**: 1988
- **Significance**: Federal institution for Afro-Brazilian heritage and quilombola community support
- **Identifiers**: Wikidata (Q10286282), LCNAF (n97910129), Website
- **Mission**: Promote/preserve Afro-Brazilian culture, support quilombola communities, African diaspora research
- **Location**: GeoNames ID 3469058 (Brasília)
- **Confidence**: 0.92

**Why Added**: Official federal institution linked to Ministry of Culture. Key role in heritage preservation related to slavery, resistance, and Afro-Brazilian cultural expressions.

---

### 4. Museu do Estado de Pernambuco (Q6940628)
**Pernambuco State Museum, Recife, PE**

- **Type**: MUSEUM
- **Founded**: 1929
- **Significance**: Key regional heritage institution for Northeast Brazil
- **Identifiers**: Wikidata (Q6940628), VIAF (144298795), LCNAF (n84149774), Website
- **Collection**: Pernambuco history, furniture, decorative arts, paintings, colonial/imperial artifacts
- **Location**: GeoNames ID 3390760 (Recife)
- **Confidence**: 0.95

**Why Added**: Important state museum occupying a historic 19th-century building (former residence of Baron de Beberibe). Strong Wikidata metadata with multiple authoritative identifiers.

---

## Coverage Statistics

### Before Batch 15
- Total Brazilian institutions: **121**
- With Wikidata Q-numbers: **75**
- Coverage: **62.0%**

### After Batch 15
- Total Brazilian institutions: **125** (+4)
- With Wikidata Q-numbers: **79** (+4)
- Coverage: **63.2%** (+1.2 percentage points)

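
The coverage figures are the share of Brazilian institutions that carry a Wikidata Q-number. A minimal sketch of that arithmetic follows; the function name `coverage_pct` is illustrative and not taken from the project's scripts.

```python
def coverage_pct(with_wikidata: int, total: int) -> float:
    """Share of institutions with a Wikidata Q-number, in percent (1 decimal)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * with_wikidata / total, 1)


# Counts reported for the Brazilian subset before and after Batch 15.
before = coverage_pct(75, 121)
after = coverage_pct(79, 125)

# The change is a difference of two percentages, i.e. percentage points.
delta_pp = round(after - before, 1)

print(f"{before}% -> {after}% (+{delta_pp} pp)")
```

Because both the numerator and the denominator grow by 4, the gain is smaller than it would be if 4 existing records had been enriched instead (79/121 would be about 65.3%).
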

### Coverage Trajectory (Batches 1-15)
```
Batch 1:  47.1% (57/121) - Initial baseline
Batch 2:  50.4% (61/121) - +4 enriched
Batch 3:  52.9% (64/121) - +3 enriched
Batch 4:  55.4% (67/121) - +3 enriched
Batch 5:  57.0% (69/121) - +2 enriched
Batch 6:  57.9% (70/121) - +1 enriched
Batch 7:  58.7% (71/121) - +1 enriched
Batch 8:  59.5% (72/121) - +1 enriched
Batch 9:  60.3% (73/121) - +1 enriched
Batch 10: 61.2% (74/121) - +1 enriched
Batch 14: 62.0% (75/121) - +1 enriched (Batches 11-13 not found)
Batch 15: 63.2% (79/125) - +4 added (dataset expansion)
```

**Total Progress**: 47.1% → 63.2% (+16.1 percentage points)

---

## Data Quality Assessment

### Identifier Completeness

| Institution | Wikidata | VIAF | LCNAF | Website | Total IDs |
|-------------|----------|------|-------|---------|-----------|
| Museu Histórico Nacional | ✅ | ✅ | ✅ | ✅ | 4 |
| Museu Imperial | ✅ | ❌ | ❌ | ✅ | 2 |
| Fundação Cultural Palmares | ✅ | ❌ | ✅ | ✅ | 3 |
| Museu do Estado de Pernambuco | ✅ | ✅ | ✅ | ✅ | 4 |
| **Average** | **100%** | **50%** | **75%** | **100%** | **3.25** |
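
The summary row can be reproduced from the per-institution rows. The sketch below transcribes the table into a plain dictionary (the structure and function names are assumptions made for illustration, not the project's data model) and recomputes the per-identifier coverage and the mean identifier count.

```python
# Identifier presence per institution, transcribed from the table above.
IDENTIFIERS = {
    "Museu Histórico Nacional": {"wikidata", "viaf", "lcnaf", "website"},
    "Museu Imperial": {"wikidata", "website"},
    "Fundação Cultural Palmares": {"wikidata", "lcnaf", "website"},
    "Museu do Estado de Pernambuco": {"wikidata", "viaf", "lcnaf", "website"},
}


def identifier_coverage(records: dict[str, set[str]]) -> dict[str, float]:
    """Percentage of institutions that carry each identifier type."""
    kinds = sorted(set().union(*records.values()))
    n = len(records)
    return {k: 100 * sum(k in ids for ids in records.values()) / n for k in kinds}


def mean_identifier_count(records: dict[str, set[str]]) -> float:
    """Average number of identifiers per institution."""
    return sum(len(ids) for ids in records.values()) / len(records)


coverage = identifier_coverage(IDENTIFIERS)
mean_ids = mean_identifier_count(IDENTIFIERS)
print(coverage, mean_ids)
```

GeoNames IDs are left out here because the table does not count them toward "Total IDs".
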

### Description Quality
- **All 4 institutions**: Comprehensive descriptions (100+ words each)
- **Historical context**: Founding dates, building history, collection significance
- **Alternative names**: English translations and acronyms provided
- **GeoNames integration**: All cities geocoded with GeoNames IDs

### Confidence Scores
- Museu Histórico Nacional: **0.98** (highest)
- Museu Imperial: **0.95**
- Fundação Cultural Palmares: **0.92**
- Museu do Estado de Pernambuco: **0.95**
- **Average**: **0.95** (very high confidence)

---

## Geographic Distribution

### By State/Region
- **Rio de Janeiro (RJ)**: 2 institutions (Museu Histórico Nacional, Museu Imperial)
- **Distrito Federal (DF)**: 1 institution (Fundação Cultural Palmares, Brasília)
- **Pernambuco (PE)**: 1 institution (Museu do Estado de Pernambuco)

### By City
- **Rio de Janeiro**: 1 museum (national)
- **Petrópolis**: 1 museum (imperial)
- **Brasília**: 1 official institution (federal)
- **Recife**: 1 museum (state)

**Note**: All 4 institutions are in major urban centers, reflecting their importance as national/regional heritage hubs.

---

## Institutional Type Breakdown

| Type | Count | Percentage |
|------|-------|------------|
| MUSEUM | 3 | 75% |
| OFFICIAL_INSTITUTION | 1 | 25% |

**Observation**: 3 of 4 are museums (typical for high-profile institutions), 1 is a federal cultural foundation.

---

## Technical Implementation

### Files Created
1. **`data/instances/brazil/batch15_bonus_institutions.yaml`**
   - 224 lines of LinkML-compliant YAML
   - 4 complete institution records
   - Full provenance metadata with enrichment history

2. **`scripts/merge_batch15.py`**
   - Merge script for adding bonus institutions
   - Preserves existing dataset structure
   - Creates backup before merge

3. **`data/instances/all/globalglam-20251111.yaml.bak.batch15`**
   - Pre-merge backup (121 Brazilian institutions)
   - Rollback point if needed

### Files Modified
1. **`data/instances/all/globalglam-20251111.yaml`**
   - Updated from 13,411 to 13,415 total institutions
   - Brazilian institutions: 121 → 125
   - Brazilian with Wikidata: 75 → 79

### Validation Results
- ✅ All 4 institutions successfully merged
- ✅ No duplicate IDs detected
- ✅ LinkML schema compliance maintained
- ✅ Provenance metadata complete for all records

---

## Enrichment Methodology

### Discovery Process
1. **Source**: Batch 14 Wikidata searches returned institutions not in original dataset
2. **Criteria**: National/state significance + strong Wikidata presence
3. **Verification**: Cross-checked against GlobalGLAM dataset to confirm absence
4. **Priority**: Selected 4 highest-profile institutions for immediate addition
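
The verification step (step 3) amounts to a set difference on Q-numbers. The sketch below assumes each search hit is a dictionary with `qid` and `label` keys and that the dataset's known Q-numbers are available as a collection of strings; both shapes, the function name, and the "already known" entry with its placeholder Q-number are hypothetical.

```python
def find_missing(candidates: list[dict], dataset_qids) -> list[dict]:
    """Return search hits whose Q-number is absent from the dataset, in input order."""
    known = set(dataset_qids)
    return [c for c in candidates if c["qid"] not in known]


# Search hits: the 4 Batch 15 institutions plus one hypothetical hit
# that the dataset already contains (placeholder Q-number).
hits = [
    {"qid": "Q510993", "label": "Museu Histórico Nacional"},
    {"qid": "Q1887049", "label": "Museu Imperial"},
    {"qid": "Q0000001", "label": "Already-known institution (hypothetical)"},
    {"qid": "Q10286282", "label": "Fundação Cultural Palmares"},
    {"qid": "Q6940628", "label": "Museu do Estado de Pernambuco"},
]
already_in_dataset = {"Q0000001"}

missing = find_missing(hits, already_in_dataset)
print([m["label"] for m in missing])
```

Matching on Q-numbers only catches institutions that are absent or not yet linked; an institution present under a different name without a Q-number would still need a name-based check.
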

### Data Extraction
- **Method**: Wikidata authenticated entity search
- **Fields**: Extracted labels, descriptions, identifiers (Wikidata, VIAF, LCNAF, websites)
- **Geocoding**: Used GeoNames IDs for location precision
- **Quality**: Manual description writing based on Wikidata metadata

### Merge Strategy
- **Append-only**: Added new institutions without modifying existing records
- **ID uniqueness**: Generated new persistent IDs following project conventions
- **Provenance tracking**: Documented source as "Batch 15 bonus institution" in enrichment history
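
The merge strategy above can be sketched as follows. This is not the contents of `merge_batch15.py`: the function names, the `id` field, and the in-memory list-of-dicts representation are assumptions, and YAML loading is left out so that the example depends only on the standard library. It copies the dataset file to a backup first, then appends new records and refuses any record whose ID already exists.

```python
import shutil
from pathlib import Path


def backup_file(path: Path, suffix: str = ".bak.batch15") -> Path:
    """Copy the dataset file next to itself before merging; return the backup path."""
    backup = path.with_name(path.name + suffix)
    shutil.copy2(path, backup)
    return backup


def merge_append_only(existing: list[dict], new_records: list[dict]) -> list[dict]:
    """Return existing + new records, leaving existing records untouched.

    Raises ValueError if a new record reuses an ID (assumed field name: "id").
    """
    seen = {rec["id"] for rec in existing}
    merged = list(existing)  # shallow copy; the input list is not mutated
    for rec in new_records:
        if rec["id"] in seen:
            raise ValueError(f"duplicate id: {rec['id']}")
        seen.add(rec["id"])
        merged.append(rec)
    return merged
```
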

---

## Challenges and Observations

### Why Were These Missing?

1. **Museu Histórico Nacional**: Likely overlooked in original NLP extraction from conversations
2. **Museu Imperial**: Petrópolis location may have been under-represented in source data
3. **Fundação Cultural Palmares**: Federal institution, possibly categorized differently in conversations
4. **Museu do Estado de Pernambuco**: Regional state museum, may not have appeared in national-level discussions

### Quality Indicators

- ✅ **All 4 have rich Wikidata entries** with multiple identifiers
- ✅ **National/regional significance** (not local/minor institutions)
- ✅ **Official websites** still active
- ✅ **International authority files** (VIAF, LCNAF) present for 3 of 4

### Data Completeness

- **Wikidata Q-numbers**: 4/4 (100%)
- **VIAF IDs**: 2/4 (50%)
- **LCNAF IDs**: 3/4 (75%)
- **Websites**: 4/4 (100%)
- **GeoNames IDs**: 4/4 (100%)

**Average identifiers per institution**: 3.25 (above project average)

---

## Impact on Dataset Quality

### Strengths
1. **Fills critical gaps**: Adds major institutions missing from original dataset
2. **High metadata quality**: All have multiple authoritative identifiers
3. **Geographic diversity**: Adds institutions from Petrópolis, Recife, and Brasília, outside the São Paulo and Rio de Janeiro city concentration
4. **Institutional diversity**: Includes official federal institution (FCP) alongside museums

### Dataset Balance Improvements

**Before Batch 15**:
- Heavy bias toward São Paulo and Rio de Janeiro city
- Limited federal government institutions

**After Batch 15**:
- Added Petrópolis (RJ mountain region)
- Added Recife (Northeast Brazil)
- Added Brasília (federal capital)
- Added federal cultural foundation (OFFICIAL_INSTITUTION type)

---

## Next Steps: Planning Batch 16

### Remaining Challenge
- **46 institutions still without Wikidata** (36.8% of the 125 Brazilian institutions)
- Target: Reach **70% coverage** (88/125 institutions)
- Need: **9 more enriched** to reach 70% (79 → 88)
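
The "9 more" figure follows from rounding 70% of 125 up to a whole institution. A small sketch of that calculation, using integer arithmetic to avoid floating-point rounding surprises (the function name is illustrative):

```python
def needed_for_target(with_wikidata: int, total: int, target_pct: int) -> int:
    """Number of additional enriched institutions needed to reach target_pct coverage."""
    # Ceiling division: smallest integer count that is >= target_pct% of total.
    required = -(-target_pct * total // 100)
    return max(0, required - with_wikidata)


required_for_70 = needed_for_target(79, 125, 70)
print(required_for_70)  # 70% of 125 is 87.5, so 88 are required: 88 - 79
```

This assumes the total stays at 125; any further dataset expansion raises the number of Q-numbers required for the same percentage.
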

### Batch 16 Strategy

#### Priority Targets (High-Likelihood)
1. **State archives** (major public institutions likely in Wikidata)
2. **University museums/collections** (academic institutions often documented)
3. **Major urban cultural centers** (metropolitan area institutions)
4. **Historical societies with national significance**

#### Search Improvements
1. **Portuguese-language queries**: Try native Portuguese names for failed searches
2. **Alternative name variants**: Test abbreviations, historical names
3. **Regional name patterns**: Account for regional naming conventions
4. **State-level searches**: Search by state name + institution type
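
One way to apply these search improvements is to generate an ordered list of query variants per institution and try them in turn. The sketch below is an illustration only: the function, its parameters, and the ordering are assumptions, and it does not call any search API.

```python
def build_queries(name_pt, name_en=None, acronym=None, state=None, inst_type_pt=None):
    """Return de-duplicated search strings, most specific first.

    name_pt:      native Portuguese name (tried first)
    name_en:      English translation, if any
    acronym:      abbreviation, if any
    state:        state name, combined with inst_type_pt for a state-level search
    inst_type_pt: institution type in Portuguese, e.g. "museu" or "arquivo"
    """
    candidates = [name_pt, name_en, acronym]
    if state and inst_type_pt:
        candidates.append(f"{inst_type_pt} {state}")

    queries, seen = [], set()
    for q in candidates:
        if not q:
            continue
        key = q.casefold()  # case-insensitive de-duplication
        if key not in seen:
            seen.add(key)
            queries.append(q)
    return queries


queries = build_queries(
    "Fundação Cultural Palmares",
    name_en="Palmares Cultural Foundation",
    acronym="FCP",
    state="Distrito Federal",
    inst_type_pt="fundação",
)
print(queries)
```

Historical names and regional naming patterns would need per-institution input and are not covered by this sketch.
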

#### Quality Thresholds
- **Minimum similarity**: 0.85 (maintain high confidence)
- **Manual verification**: Flag matches with scores 0.85-0.90 for review
- **Identifier requirements**: Prioritize institutions with multiple external IDs
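
These thresholds translate into a three-way triage of candidate matches. The sketch below assumes the review band is inclusive at both ends (0.85 and 0.90 are both flagged for review); the report does not state how the boundaries are treated, and the names are illustrative.

```python
MIN_SIMILARITY = 0.85  # below this, the match is discarded
REVIEW_UPPER = 0.90    # up to and including this, a human checks the match


def triage(score: float) -> str:
    """Classify a similarity score as 'reject', 'review', or 'accept'."""
    if score < MIN_SIMILARITY:
        return "reject"
    if score <= REVIEW_UPPER:
        return "review"
    return "accept"


for s in (0.80, 0.85, 0.88, 0.90, 0.95):
    print(s, triage(s))
```
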

### Expected Outcomes
- **Target**: +5-10 institutions enriched in Batch 16
- **Coverage goal**: 65-68% (stepping toward 70%)
- **Focus**: State-level institutions in underrepresented regions
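
To relate the enrichment target to the coverage goal, the projection below assumes that "enriched" means an existing record gains a Q-number while the total stays at 125 (the function name is illustrative). Under that assumption, +5 gives 67.2% and +10 gives 71.2%, so the upper end of the target range would already pass 70%.

```python
def projected_coverage(with_wikidata: int, total: int, newly_enriched: int) -> float:
    """Coverage in percent after enriching existing records (total unchanged)."""
    return round(100 * (with_wikidata + newly_enriched) / total, 1)


# Starting point after Batch 15: 79 of 125 Brazilian institutions.
projection = {n: projected_coverage(79, 125, n) for n in (3, 5, 6, 10)}
for n, pct in projection.items():
    print(f"+{n:>2} enriched -> {pct}%")
```
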

---

## Lessons Learned

### What Worked Well
1. ✅ **Dataset expansion approach**: Adding missing institutions alongside enrichment
2. ✅ **High-confidence matches**: All 4 institutions had Q-numbers with strong metadata
3. ✅ **Comprehensive extraction**: Full descriptions, multiple identifiers, geocoded locations
4. ✅ **Batch documentation**: Clear provenance tracking with "bonus institution" flag

### Process Improvements
1. 🔍 **Proactive gap analysis**: Systematically search for missing major institutions
2. 🔍 **Cross-check Wikidata**: Query Wikidata for "museums in Brazil" to find undocumented institutions
3. 🔍 **Verify against authoritative lists**: Compare dataset against national museum registers
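
A "museums in Brazil" cross-check could be expressed as a SPARQL query against the Wikidata Query Service (`https://query.wikidata.org/sparql`). The sketch below only builds the query string and makes no network call; the function name and the default limit are assumptions. It relies on the Wikidata identifiers Q33506 (museum), Q155 (Brazil), P31 (instance of), P279 (subclass of), and P17 (country).

```python
import re


def museums_in_country_query(country_qid: str = "Q155", limit: int = 500) -> str:
    """Build a SPARQL query listing museums (and subclasses) located in a country."""
    # Guard against malformed input before interpolating into the query.
    if not re.fullmatch(r"Q[1-9]\d*", country_qid):
        raise ValueError(f"not a Wikidata Q-number: {country_qid!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q33506 ;
        wdt:P17 wd:{country_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "pt,en". }}
}}
LIMIT {limit}
""".strip()


query = museums_in_country_query()
print(query)
```

The returned Q-numbers can then be compared with those already in the dataset to list candidates for a future expansion batch. Archives, libraries, and foundations would need their own class identifiers, which this sketch does not include.
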

### Technical Notes
- Merge script ran without conflicts
- Pre-merge backup provided a rollback point, limiting the risk of data loss
- LinkML schema handled new institutions without modification
- Provenance metadata enabled tracking of "bonus institution" vs. "enriched" status

---

## Files and Artifacts

### Generated Files
```
data/instances/brazil/
└── batch15_bonus_institutions.yaml (10 KB, 4 institutions)

reports/brazil/
└── batch15_report.md (this file)

data/instances/all/
├── globalglam-20251111.yaml (updated: +4 institutions)
└── globalglam-20251111.yaml.bak.batch15 (backup: 121 institutions)

scripts/
└── merge_batch15.py (merge script)
```

### Data Lineage
```
Batch 14 Wikidata searches
    ↓ (discovered missing institutions)
Batch 15 bonus institution extraction
    ↓ (created batch15_bonus_institutions.yaml)
Merge script execution
    ↓ (updated globalglam-20251111.yaml)
Current state: 125 Brazilian institutions, 79 with Wikidata (63.2%)
```

---

## Statistics Summary

### Coverage Metrics
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Total Brazilian Institutions | 121 | 125 | +4 |
| With Wikidata | 75 | 79 | +4 |
| Coverage Percentage | 62.0% | 63.2% | +1.2 pp |
| Without Wikidata | 46 | 46 | 0 |

### Identifier Metrics (Batch 15 Only)
| Identifier Type | Count | Percentage |
|-----------------|-------|------------|
| Wikidata | 4 | 100% |
| VIAF | 2 | 50% |
| LCNAF | 3 | 75% |
| Website | 4 | 100% |
| GeoNames | 4 | 100% |

### Quality Metrics
- **Average confidence score**: 0.95
- **Average description length**: ~140 words
- **Average identifiers per institution**: 3.25
- **Institutions with 4+ identifiers**: 2/4 (50%)

---

## Recommendations for Future Batches

### Short-Term (Batch 16-20)
1. **Target state archives**: Likely high Wikidata coverage
2. **Search Portuguese variants**: Try native names for failed searches
3. **Focus on Northeast Brazil**: Underrepresented in current dataset
4. **University collections**: Academic institutions often well-documented

### Medium-Term (Post-70% Coverage)
1. **Quality verification**: Review early batches for low-confidence matches
2. **Create Wikidata items**: For notable regional institutions without Q-numbers
3. **Enhance descriptions**: Expand metadata for minimally-documented institutions
4. **External identifier enrichment**: Add VIAF/LCNAF for institutions missing them

### Long-Term (Other Countries)
1. **Replicate methodology**: Apply Brazil lessons to other Latin American countries
2. **Regional prioritization**: Argentina, Chile, Colombia, Mexico (largest GLAM sectors)
3. **Cross-country patterns**: Identify common gaps across global dataset

---

## Conclusion

Batch 15 expanded the GlobalGLAM dataset by adding 4 major Brazilian heritage institutions discovered during the Batch 14 Wikidata searches. All 4 institutions have strong Wikidata presence and multiple authoritative identifiers, improving dataset quality and geographic coverage.

**Key Achievement**: Coverage increased from 62.0% to 63.2%, moving closer to the 70% target.

**Next Priority**: Continue enrichment in Batch 16, targeting state archives and university collections among the remaining 46 institutions without Wikidata.

---

**Report Generated**: November 11, 2025  
**Next Batch**: Batch 16 (targeting 65-68% coverage)  
**Long-Term Goal**: 70% coverage (88/125 institutions)