# Brazilian Institutions Wikidata Enrichment Report
**Generated**: 2025-11-11 09:30 UTC
## Summary
| Metric | Value |
|--------|-------|
| **Total Brazilian institutions** | 115 |
| **Institutions with Wikidata (before)** | 8 (7.0%) |
| **Institutions enriched (Batch 8)** | 1 |
| **Institutions with Wikidata (after)** | 9 (7.8%) |
| **Coverage increase** | +0.8 percentage points |
## Methodology
- **Query**: Wikidata SPARQL endpoint for Brazilian museums, libraries, archives (2,976 results)
- **Normalization**: Portuguese + English name normalization (remove prefixes/suffixes)
- **Matching**: Fuzzy string matching using SequenceMatcher
- **Threshold**: 65% similarity (lowered from 75% to capture more matches)
- **Type compatibility**: Prevented museum/archive/library mismatches
- **Manual verification**: All matches manually reviewed to remove false positives
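
The matching and type-compatibility steps above can be sketched roughly as follows. This is a minimal illustration, not the actual enrichment script: the record keys (`name`, `type`, `label`, `qid`) and the function names are assumptions made for the example, and only `difflib.SequenceMatcher` and the 0.65 threshold come from the report.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.65  # similarity cutoff stated in the report


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] from SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(institution: dict, candidates: list, threshold: float = THRESHOLD):
    """Return (candidate, score) for the best type-compatible candidate, or None.

    The dictionary keys used here are hypothetical; the real data model may differ.
    """
    best = None
    for cand in candidates:
        # Type compatibility: never pair a museum with a library or archive.
        if cand["type"] != institution["type"]:
            continue
        score = similarity(institution["name"], cand["label"])
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best
```

Anything returned by `best_match` is still only a candidate: as noted above, every match goes through manual review before it is written to the dataset.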
## Batch 8 Enrichment Results
### Successful Match (Manually Verified)
| # | Institution Name | City | Type | Match Score | Wikidata Q-Number | Wikidata Name |
|---|------------------|------|------|-------------|-------------------|---------------|
| 1 | Biblioteca Nacional Digital (BNDigital) | Rio de Janeiro | LIBRARY | 0.706 | [Q948882](https://www.wikidata.org/wiki/Q948882) | Biblioteca Nacional do Brasil |
**Enrichment Details:**
- Added Wikidata Q948882 (Biblioteca Nacional do Brasil)
- Added VIAF identifier: 133262876
- Added founding date: 1810-01-01
- Added official website: https://gov.br/bn
- Match score: 70.6% (manually verified as correct - BNDigital is the digital platform of Brazil's National Library)
### False Positives Removed
The automated script initially matched 3 institutions at the 0.65 threshold, but 2 of them were false positives:
| Institution | Incorrect Match | Reason for Rejection |
|-------------|-----------------|----------------------|
| Parque Memorial Quilombo dos Palmares | Biblioteca Pública Municipal Zumbi dos Palmares (Q87755023) | Different institution types (memorial park vs. library), only partial name overlap ("Palmares") |
| Forte Santa Catarina (PB) | Biblioteca Pública do Estado de Santa Catarina (Q16498218) | Geographic mismatch (PB vs. SC state), different institution types (fort vs. library) |
**Lesson Learned**: Geographic validation is critical - matching "Santa Catarina" in name without verifying state/region led to incorrect enrichment.
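
One way to apply this lesson is a state-level guard that runs after fuzzy matching. The sketch below is an assumption about how such a check could look, not part of the existing script: it presumes both the institution and the Wikidata candidate already carry a two-letter state code, and the function names and the three-way verdict are hypothetical.

```python
from typing import Optional


def state_compatible(inst_state: Optional[str], cand_state: Optional[str]) -> Optional[bool]:
    """True or False when both state codes are known; None when either is missing."""
    if not inst_state or not cand_state:
        return None
    return inst_state.strip().upper() == cand_state.strip().upper()


def accept_match(score: float, inst_state: Optional[str], cand_state: Optional[str],
                 threshold: float = 0.65) -> str:
    """Classify a fuzzy match as 'accept', 'reject', or 'needs_review'."""
    if score < threshold:
        return "reject"
    verdict = state_compatible(inst_state, cand_state)
    if verdict is False:
        # e.g. a fort in PB matched against a library in SC
        return "reject"
    # Unknown state on either side cannot confirm the match, so defer to a human.
    return "accept" if verdict else "needs_review"
```

With a guard like this, the Forte Santa Catarina (PB) candidate would be rejected automatically, while institutions with no usable state information would be routed to manual review instead of being silently accepted.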
## Coverage by State (Top 10)
| State | Total Institutions | With Wikidata | Coverage |
|-------|-------------------|---------------|----------|
| TO | 12 | 0 | 0.0% |
| GO | 7 | 2 | 28.6% |
| SC | 7 | 0 | 0.0% |
| Unknown | 7 | 1 | 14.3% |
| RS | 6 | 1 | 16.7% |
| RJ | 6 | 2 | 33.3% ⬆️ |
| AP | 5 | 0 | 0.0% |
| AC | 4 | 0 | 0.0% |
| BA | 4 | 0 | 0.0% |
| CE | 4 | 1 | 25.0% |
**Note**: Rio de Janeiro (RJ) coverage improved from 16.7% (1/6) to 33.3% (2/6) with Biblioteca Nacional Digital enrichment.
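
The coverage figures in this table are simple ratios rounded to one decimal place. A small sketch of how they could be recomputed from the records is shown below; the field names `state` and `wikidata_id` are assumptions for illustration and may not match the YAML schema.

```python
from collections import defaultdict


def coverage(with_wikidata: int, total: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100 * with_wikidata / total, 1)


def coverage_by_state(records: list) -> dict:
    """Map each state to (total, with_wikidata, coverage_percent)."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for rec in records:
        state = rec.get("state") or "Unknown"  # missing state goes to "Unknown"
        totals[state] += 1
        if rec.get("wikidata_id"):
            linked[state] += 1
    return {s: (totals[s], linked[s], coverage(linked[s], totals[s])) for s in totals}
```

For example, `coverage(1, 6)` gives 16.7 and `coverage(2, 6)` gives 33.3, which matches the RJ change described in the note above.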
## Challenges and Analysis
### Why Low Match Rate?
The Brazilian dataset presents unique challenges:
1. **Generic/Abbreviated Names** (60% of institutions):
- "SECULT", "UFAC Repository", "FEM"
- Wikidata labels use full official names, so abbreviations score poorly in fuzzy matching
2. **Region-Only Locations** (45% of institutions):
- Many institutions have only state codes (AC, AP, MS, RS)
- No city names for geographic validation
- Example: "Museu da Borracha" (region: AC) cannot be verified against a Wikidata entry without a city
3. **Small/Local Institutions** (55%):
- Municipal museums, university repositories, state archives
- Many not yet cataloged in Wikidata
- Require manual Wikidata entry creation
4. **Mixed Institution Types** (26% classified as MIXED):
- Memorial parks, cultural complexes, heritage sites
- Don't fit Wikidata's museu/biblioteca/arquivo taxonomy cleanly
- Example: Serra da Barriga (archaeological site + museum)
### Name Normalization Patterns Tried
**Portuguese Prefixes Removed** (each shown with the English equivalent that was also removed):
- Museu → Museum
- Biblioteca → Library
- Arquivo → Archive
- Fundação → Foundation
- Instituto → Institute
- Centro Cultural → Cultural Center
**Result**: Even with normalization, most institution names in our dataset are too short or generic to match confidently.
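
A normalization pass along these lines might look like the sketch below. It is an illustration under assumptions, not the script's actual code: the prefix list is taken from the table above, while accent stripping, punctuation removal, and the function names are choices made for this example.

```python
import re
import unicodedata

# Portuguese prefixes from the list above plus their English equivalents
# (accents already stripped, lowercase).
PREFIXES = [
    "centro cultural", "cultural center",
    "museu", "museum",
    "biblioteca", "library",
    "arquivo", "archive",
    "fundacao", "foundation",
    "instituto", "institute",
]


def strip_accents(text: str) -> str:
    """Drop combining marks so 'Fundação' becomes 'Fundacao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, then remove one leading prefix."""
    text = strip_accents(name).lower()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation to spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    # Try longer prefixes first so "centro cultural" wins over shorter ones.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if text == prefix or text.startswith(prefix + " "):
            text = text[len(prefix):].strip()
            break
    return text
```

The sketch also shows why normalization alone does not help much here: `normalize_name("SECULT")` is just `"secult"`, and `normalize_name("Museu da Borracha")` leaves only `"da borracha"`, so there is very little text left to compare.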
## Comparison with Other Enrichments
| Region | Total | Enriched | Success Rate | Dataset Characteristics |
|--------|-------|----------|--------------|-------------------------|
| **Georgia** | 14 | 11 | **78.6%** ✅ | Full institution names, cities included, national/major institutions |
| **Brazil (Batch 8)** | 115 | 1 | **0.9%** ⚠️ | Abbreviated names, region-only locations, local/municipal institutions |
**Key Difference**: Georgian dataset contained well-documented national institutions with complete metadata. Brazilian dataset skews toward smaller, regional institutions with minimal metadata.
## Remaining Work
**Institutions still without Wikidata**: 106 (92.2%)
### Next Steps
1. **Add Cities to Location Data** (Priority: HIGH)
- Research and add city names for the institutions with region-only locations (about 45% of the dataset)
- Will enable geographic validation in fuzzy matching
- Script: `scripts/add_brazilian_cities.py` (to be created)
2. **Expand Full Names** (Priority: HIGH)
- Research official full names for abbreviated institutions:
- SECULT → Secretaria de Estado da Cultura
- UFAC Repository → Repositório Institucional da Universidade Federal do Acre
- FEM → Fundação de Cultura Elias Mansour
- Will dramatically improve fuzzy match scores
3. **Create New Wikidata Entries** (Priority: MEDIUM)
- For institutions confirmed to not exist in Wikidata
- Focus on major state institutions first (SECULT agencies, state archives)
- Estimated 30-40 institutions need new Wikidata entries
4. **ISIL Code Research** (Priority: LOW)
- Brazilian ISIL codes not yet included in dataset
- Research IBICT (Brazilian ISIL agency) registry
- ISIL codes would provide authoritative cross-references
5. **Alternative Enrichment Strategy** (Priority: MEDIUM)
- Try enriching from Brazilian digital platforms:
- Tainacan: https://tainacan.org/
- Brasiliana USP: https://www.bbm.usp.br/
- BDTD (theses/dissertations): https://bdtd.ibict.br/
- These platforms may have better coverage of local institutions
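
For step 2 (Expand Full Names), the expansion could be applied as a lookup before fuzzy matching. The sketch below uses only the three expansions listed above; the function name and the idea of keying on the bare name are assumptions, and a real mapping would probably need to key on name plus state, since more than one state may have an agency called SECULT.

```python
# Expansions taken from the Next Steps list; extend as research confirms more names.
FULL_NAMES = {
    "SECULT": "Secretaria de Estado da Cultura",
    "UFAC Repository": "Repositório Institucional da Universidade Federal do Acre",
    "FEM": "Fundação de Cultura Elias Mansour",
}


def expand_name(name: str, mapping: dict = FULL_NAMES) -> str:
    """Return the researched full name if one is known, else the name unchanged."""
    return mapping.get(name.strip(), name)
```

Names without a known expansion pass through untouched, so the lookup can be run over the whole dataset without affecting institutions that already have full names.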
## Files
- **Input**: `data/instances/brazil/brazilian_institutions_batch7_enriched.yaml`
- **Output**: `data/instances/brazil/brazilian_institutions_batch8_enriched.yaml`
- **Script**: `scripts/enrich_brazilian_institutions_batch7_fuzzy.py` (threshold: 0.65, manual verification)
## Recommendations
Given the low match rate (0.9% vs. Georgia's 78.6%), **Brazilian enrichment requires a different approach**:
### Short-Term Strategy
1. **Manual Wikidata lookup** for top 20 largest institutions (national museums, state archives)
2. **Add complete metadata** (full names, cities) before re-running fuzzy matching
3. **Lower priority** for small municipal institutions without Wikidata entries
### Long-Term Strategy
1. **Create Wikidata entries** for major Brazilian heritage institutions not yet cataloged
2. **Integrate with Brazilian registries**:
- IBRAM (Brazilian Museums Institute) registry
- CONARQ (National Archives Council) registry
- IBICT (Brazilian Institute of Information in Science and Technology) library registry
3. **Collaborate with Brazilian Wikidata community** to improve heritage institution coverage
---
**Conclusion**: Brazilian dataset enrichment is progressing slowly (7.0% → 7.8%) due to data quality challenges. Focus next on **metadata completion** (full names + cities) before attempting further automated enrichment.
**Next Priority Region**: Consider switching to **Tunisia** (69 institutions, 2.9% coverage) or **Libya** (54 institutions, 18.5% coverage), which may have better-documented institution names in conversations.