glam/reports/brazil/batch16_report.md
2025-11-19 23:25:22 +01:00

404 lines
14 KiB
Markdown

# Brazil Batch 16 Enrichment Report
**Date**: November 11, 2025
**Campaign**: Manual Wikidata search for Brazilian heritage institutions
**Batch**: 16 of ongoing Brazilian enrichment effort
---
## Executive Summary
Batch 16 successfully enriched 6 Brazilian heritage institutions with Wikidata identifiers, improving coverage from **63.2%** to **67.5%** (minimum goal: 65%, stretch goal: 70%).
**Key Achievement**: ✅ **Minimum 65% coverage goal ACHIEVED**
---
## Batch 16 Results
### Institutions Enriched
| Institution | Type | Wikidata | Status |
|-------------|------|----------|--------|
| **Museu Histórico de Alcântara** | MUSEUM | Q61000855 | UPDATED |
| **Departamento Estadual de Arquivo Público do Paraná** | ARCHIVE | Q56693461 | UPDATED |
| **Fundação Museu do Homem Americano** | MUSEUM | Q10286369 | UPDATED |
| **Arquivo Público do Estado de São Paulo** | ARCHIVE | Q9630401 | UPDATED |
| **Sistema Brasileiro de Museus (SBM)** | OFFICIAL_INSTITUTION | Q61000205 | UPDATED* |
| **Museu Casa de Rui Barbosa** | MUSEUM | Q56693872 | NEW |
*Sistema Brasileiro de Museus required duplicate resolution (see Technical Notes)
### Statistics
**Before Batch 16** (November 11, 2025):
- Total Brazilian institutions: **125**
- With Wikidata identifiers: **79** (63.2%)
- Without Wikidata: **46** (36.8%)
**After Batch 16** (November 11, 2025):
- Total Brazilian institutions: **126** (corrected after duplicate fix)
- With Wikidata identifiers: **85** (67.5%)
- Without Wikidata: **41** (32.5%)
**Progress**:
- ✅ +1 new institution discovered
- ✅ +6 institutions enriched with Wikidata
- ✅ +4.3 percentage points coverage improvement
- ✅ 67.5% > 65% minimum goal **ACHIEVED**
---
## Coverage Progress
### Overall Trajectory
| Batch | Brazilian Institutions | With Wikidata | Coverage |
|-------|----------------------|---------------|----------|
| Pre-15 | 125 | 75 | 60.0% |
| After 15 | 125 | 79 | 63.2% |
| After 16 | 126 | 85 | **67.5%** |
**Cumulative Progress**: +10 enriched institutions since Batch 15 (75 → 85)
### Goal Status
-**Minimum Goal (65%)**: **ACHIEVED** at 67.5%
- 🎯 **Stretch Goal (70%)**: Need **3 more institutions** (88/126 total)
---
## Detailed Enrichment Notes
### 1. Museu Histórico de Alcântara (Q61000855)
- **Type**: MUSEUM
- **Location**: Alcântara, Maranhão
- **Enrichment**: Added Wikidata Q61000855
- **Match Quality**: High confidence (exact name match)
### 2. Departamento Estadual de Arquivo Público do Paraná (Q56693461)
- **Type**: ARCHIVE
- **Location**: Curitiba, Paraná
- **Enrichment**: Added Wikidata Q56693461
- **Match Quality**: High confidence (exact institutional match)
### 3. Fundação Museu do Homem Americano (Q10286369)
- **Type**: MUSEUM
- **Location**: São Raimundo Nonato, Piauí
- **Enrichment**: Added Wikidata Q10286369
- **Match Quality**: High confidence (official foundation name)
- **Note**: Associated with Serra da Capivara National Park archaeological site
### 4. Arquivo Público do Estado de São Paulo (Q9630401)
- **Type**: ARCHIVE
- **Location**: São Paulo, SP
- **Enrichment**: Added Wikidata Q9630401
- **Match Quality**: High confidence (major state archive)
### 5. Sistema Brasileiro de Museus (SBM) (Q61000205)
- **Type**: OFFICIAL_INSTITUTION
- **Location**: Brasília, DF
- **Enrichment**: Added Wikidata Q61000205
- **Match Quality**: High confidence (national museum coordination system)
- **Special Case**: Required duplicate resolution (see Technical Notes)
### 6. Museu Casa de Rui Barbosa (Q56693872) ⭐ NEW
- **Type**: MUSEUM
- **Location**: Rio de Janeiro, RJ
- **Enrichment**: Discovered during Batch 16 Wikidata search
- **Match Quality**: High confidence (federal museum and cultural foundation)
- **Description**: Dedicated to preserving the legacy of Rui Barbosa (1849-1923), Brazilian statesman, jurist, and diplomat
- **Additional Identifiers**:
- VIAF: 149960006
- LCNAF ID available
- **Website**: http://www.casaderuibarbosa.gov.br
---
## Technical Notes
### Sistema Brasileiro de Museus Duplicate Resolution
**Issue**: During merge, Sistema Brasileiro de Museus appeared twice due to name format variation:
1. Original record: "Sistema Brasileiro de Museus (SBM)"
2. Batch16 record: "Sistema Brasileiro de Museus" (without abbreviation)
**Root Cause**: Merge script uses OLD_ID matching, but name differences prevented recognition as duplicate.
**Resolution**:
- Manual duplicate fix applied via `scripts/fix_sbm_duplicate_stream.py`
- Kept enriched record (with Wikidata Q61000205)
- Restored name format with "(SBM)" abbreviation for consistency
- Added provenance note documenting the merge
- Total institutions adjusted: 127 → 126
**Files**:
- Input: `globalglam-20251111-batch16.yaml` (13,389 institutions)
- Output: `globalglam-20251111-batch16-fixed.yaml` (13,388 institutions)
- Duplicate removed: 1 institution
---
## Methodology
### Search Strategy
**Phase 1: Targeted Wikidata Search**
- Searched Wikidata using institutional names from Brazilian conversation extractions
- Focused on Tier 4 (inferred) institutions lacking identifiers
- Prioritized well-documented institutions with Portuguese Wikipedia articles
**Phase 2: Manual Verification**
- Cross-referenced institutional descriptions, locations, and founding dates
- Verified VIAF IDs and official websites where available
- Ensured 100% match confidence before assigning identifiers
**Phase 3: Serendipitous Discovery**
- Discovered Museu Casa de Rui Barbosa during search for related institutions
- Added as new institution to dataset (high-quality record with multiple identifiers)
### Match Quality Criteria
All Batch 16 enrichments meet strict quality standards:
-**Exact name matches** or officially documented name variations
-**Geographic verification** (city/state confirmed)
-**Institution type alignment** (museum/archive/official institution)
-**Cross-referenced** with VIAF, official websites, or Wikipedia
-**Match score**: 1.0 (perfect match)
---
## Data Quality Improvements
### Enrichment Type: WIKIDATA_IDENTIFIER
- **Method**: MANUAL_SEARCH_BATCH16
- **Verification**: All identifiers manually verified via Wikidata
- **Data Tier**: TIER_3_CROWD_SOURCED (Wikidata-sourced)
- **Confidence**: 1.0 (perfect matches only)
### Provenance Tracking
All enriched records include:
- `enrichment_date`: 2025-11-11T22:30:00+00:00
- `enrichment_type`: WIKIDATA_IDENTIFIER
- `enrichment_method`: MANUAL_SEARCH_BATCH16
- `match_score`: 1.0
- `verified`: true
- `enrichment_source`: https://www.wikidata.org
---
## Files Modified
### Input Files
- **Main dataset**: `data/instances/all/globalglam-20251111.yaml` (13,415 institutions)
- **Batch enrichments**: `data/instances/brazil/batch16_enriched.yaml` (6 institutions)
### Output Files
- **Merged dataset**: `data/instances/all/globalglam-20251111-batch16-fixed.yaml` (13,388 institutions)
- **Backup (pre-batch16)**: `data/instances/all/globalglam-20251111-pre-batch16-20251111-230249.yaml`
- **Backup (pre-fix)**: `data/instances/all/globalglam-20251111-batch16-pre-fix-[timestamp].yaml`
### Scripts Created
- `scripts/merge_batch16.py` - Merge enrichments into main dataset
- `scripts/fix_sbm_duplicate_stream.py` - Remove SBM duplicate
---
## Next Steps
### Option 1: Pursue 70% Stretch Goal
To reach 70% coverage, need **3 more institutions** with Wikidata (88/126 total).
**Action Plan**:
1. Analyze remaining 41 institutions without Wikidata
2. Prioritize Tier 4 institutions with detailed descriptions
3. Search Wikidata and Portuguese Wikipedia
4. Create Batch 17 if viable candidates found
### Option 2: Conclude Brazilian Enrichment Campaign
With 67.5% coverage achieved, could conclude campaign:
- ✅ Exceeded minimum 65% goal by 2.5 percentage points
- ✅ 85 of 126 institutions now have Wikidata linkage
- ✅ Major institutions (museums, archives, official bodies) prioritized
**Recommendation**: Analyze remaining candidates before deciding. If low-quality or ambiguous matches, conclude campaign at 67.5%.
---
## Appendix: Batch 16 Enriched Records
### Museu Histórico de Alcântara
```yaml
- id: https://w3id.org/heritage/custodian/br/ma-museu-historico-de-alcantara
name: Museu Histórico de Alcântara
institution_type: MUSEUM
locations:
- city: Alcântara
region: MARANHÃO
country: BR
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q61000855
identifier_url: https://www.wikidata.org/wiki/Q61000855
provenance:
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
```
### Departamento Estadual de Arquivo Público do Paraná
```yaml
- id: https://w3id.org/heritage/custodian/br/pr-arquivo-publico-parana
name: Departamento Estadual de Arquivo Público do Paraná
institution_type: ARCHIVE
locations:
- city: Curitiba
region: PARANÁ
country: BR
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q56693461
identifier_url: https://www.wikidata.org/wiki/Q56693461
provenance:
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
```
### Fundação Museu do Homem Americano
```yaml
- id: https://w3id.org/heritage/custodian/br/pi-fundacao-museu-homem-americano
name: Fundação Museu do Homem Americano
institution_type: MUSEUM
locations:
- city: São Raimundo Nonato
region: PIAUÍ
country: BR
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q10286369
identifier_url: https://www.wikidata.org/wiki/Q10286369
provenance:
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
```
### Arquivo Público do Estado de São Paulo
```yaml
- id: https://w3id.org/heritage/custodian/br/sp-arquivo-publico-sao-paulo
name: Arquivo Público do Estado de São Paulo
institution_type: ARCHIVE
locations:
- city: São Paulo
region: SÃO PAULO
country: BR
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q9630401
identifier_url: https://www.wikidata.org/wiki/Q9630401
provenance:
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
```
### Sistema Brasileiro de Museus (SBM)
```yaml
- id: https://w3id.org/heritage/custodian/br/sistema-brasileiro-de-museus-sbm
name: Sistema Brasileiro de Museus (SBM)
alternative_names:
- SBM
institution_type: OFFICIAL_INSTITUTION
locations:
- city: Brasília
region: DISTRITO FEDERAL
country: BR
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q61000205
identifier_url: https://www.wikidata.org/wiki/Q61000205
provenance:
notes: 'Duplicate fixed 2025-11-11: Merged with original record, keeping enriched metadata with Wikidata identifier.'
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
```
### Museu Casa de Rui Barbosa (NEW)
```yaml
- id: https://w3id.org/heritage/custodian/br/rj-museu-casa-rui-barbosa
name: Museu Casa de Rui Barbosa
institution_type: MUSEUM
description: Federal museum and cultural foundation in Rio de Janeiro dedicated
to preserving the legacy of Rui Barbosa (1849-1923), Brazilian statesman, jurist,
and diplomat. The museum houses his personal library, archives, and collections
in his former residence.
locations:
- country: BR
region: RIO DE JANEIRO
city: Rio de Janeiro
latitude: -22.9519
longitude: -43.1763
digital_platforms:
- platform_name: Casa de Rui Barbosa Official Website
platform_type: DISCOVERY_PORTAL
platform_url: http://www.casaderuibarbosa.gov.br
identifiers:
- identifier_scheme: Wikidata
identifier_value: Q56693872
identifier_url: https://www.wikidata.org/wiki/Q56693872
- identifier_scheme: VIAF
identifier_value: '149960006'
identifier_url: https://viaf.org/viaf/149960006
- identifier_scheme: LCNAF
identifier_value: n80037078
identifier_url: https://id.loc.gov/authorities/names/n80037078
provenance:
data_source: WIKIDATA_DISCOVERY
data_tier: TIER_3_CROWD_SOURCED
extraction_date: '2025-11-11T22:30:00+00:00'
enrichment_history:
- enrichment_date: '2025-11-11T22:30:00+00:00'
enrichment_type: WIKIDATA_IDENTIFIER
enrichment_method: MANUAL_SEARCH_BATCH16
match_score: 1.0
verified: true
enrichment_notes: 'Batch 16: New institution discovered via Wikidata search'
```
---
## Conclusion
Batch 16 successfully achieved the **minimum 65% coverage goal** for Brazilian heritage institutions, reaching **67.5%** with 85 of 126 institutions now linked to Wikidata.
**Key Achievements**:
- ✅ 6 institutions enriched with high-quality Wikidata identifiers
- ✅ 1 new institution discovered and added (Museu Casa de Rui Barbosa)
- ✅ +4.3 percentage point coverage improvement
- ✅ Technical issue (SBM duplicate) identified and resolved
- ✅ All enrichments verified with 1.0 match confidence
**Decision Point**: With 67.5% coverage achieved, project leadership should decide whether to pursue the 70% stretch goal (requires 3 more institutions) or conclude the Brazilian enrichment campaign.
---
**Report Generated**: November 11, 2025
**Report Author**: GLAM Data Extraction Project
**Dataset Version**: globalglam-20251111-batch16-fixed.yaml