175 lines
7.7 KiB
Markdown
175 lines
7.7 KiB
Markdown
# Brazilian Institutions Wikidata Enrichment Report
|
|
|
|
**Generated**: 2025-11-11 09:30 UTC
|
|
|
|
## Summary
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| **Total Brazilian institutions** | 115 |
|
|
| **Institutions with Wikidata (before)** | 8 (7.0%) |
|
|
| **Institutions enriched (Batch 8)** | 1 |
|
|
| **Institutions with Wikidata (after)** | 9 (7.8%) |
|
|
| **Coverage increase** | +0.8 percentage points |
|
|
|
|
## Methodology
|
|
|
|
- **Query**: Wikidata SPARQL endpoint for Brazilian museums, libraries, archives (2,976 results)
|
|
- **Normalization**: Portuguese + English name normalization (remove prefixes/suffixes)
|
|
- **Matching**: Fuzzy string matching using SequenceMatcher
|
|
- **Threshold**: 65% similarity (lowered from 75% to capture more matches)
|
|
- **Type compatibility**: Prevented museum/archive/library mismatches
|
|
- **Manual verification**: All matches manually reviewed to remove false positives
|
|
|
|
## Batch 8 Enrichment Results
|
|
|
|
### Successful Match (Manually Verified)
|
|
|
|
| # | Institution Name | City | Type | Match Score | Wikidata Q-Number | Wikidata Name |
|
|
|---|------------------|------|------|-------------|-------------------|---------------|
|
|
| 1 | Biblioteca Nacional Digital (BNDigital) | Rio de Janeiro | LIBRARY | 0.706 | [Q948882](https://www.wikidata.org/wiki/Q948882) | Biblioteca Nacional do Brasil |
|
|
|
|
**Enrichment Details:**
|
|
- Added Wikidata Q948882 (Biblioteca Nacional do Brasil)
|
|
- Added VIAF identifier: 133262876
|
|
- Added founding date: 1810-01-01
|
|
- Added official website: https://gov.br/bn
|
|
- Match score: 70.6% (manually verified as correct - BNDigital is the digital platform of Brazil's National Library)
|
|
|
|
### False Positives Removed
|
|
|
|
The automated script initially matched 3 institutions at 0.65 threshold, but 2 were false positives:
|
|
|
|
| Institution | Incorrect Match | Reason for Rejection |
|
|
|-------------|-----------------|----------------------|
|
|
| Parque Memorial Quilombo dos Palmares | Biblioteca Pública Municipal Zumbi dos Palmares (Q87755023) | Different institution types (memorial park vs. library), only partial name overlap ("Palmares") |
|
|
| Forte Santa Catarina (PB) | Biblioteca Pública do Estado de Santa Catarina (Q16498218) | Geographic mismatch (PB vs. SC state), different institution types (fort vs. library) |
|
|
|
|
**Lesson Learned**: Geographic validation is critical - matching "Santa Catarina" in name without verifying state/region led to incorrect enrichment.
|
|
|
|
## Coverage by State (Top 10)
|
|
|
|
| State | Total Institutions | With Wikidata | Coverage |
|
|
|-------|-------------------|---------------|----------|
|
|
| TO | 12 | 0 | 0.0% |
|
|
| GO | 7 | 2 | 28.6% |
|
|
| SC | 7 | 0 | 0.0% |
|
|
| Unknown | 7 | 1 | 14.3% |
|
|
| RS | 6 | 1 | 16.7% |
|
|
| RJ | 6 | 2 | 33.3% ⬆️ |
|
|
| AP | 5 | 0 | 0.0% |
|
|
| AC | 4 | 0 | 0.0% |
|
|
| BA | 4 | 0 | 0.0% |
|
|
| CE | 4 | 1 | 25.0% |
|
|
|
|
**Note**: Rio de Janeiro (RJ) coverage improved from 16.7% (1/6) to 33.3% (2/6) with Biblioteca Nacional Digital enrichment.
|
|
|
|
## Challenges and Analysis
|
|
|
|
### Why Low Match Rate?
|
|
|
|
The Brazilian dataset presents unique challenges:
|
|
|
|
1. **Generic/Abbreviated Names** (60% of institutions):
|
|
- "SECULT", "UFAC Repository", "FEM"
|
|
- Wikidata requires full official names for matching
|
|
|
|
2. **Region-Only Locations** (45% of institutions):
|
|
- Many institutions have only state codes (AC, AP, MS, RS)
|
|
- No city names for geographic validation
|
|
- Example: "Museu da Borracha" (region: AC) - impossible to verify if it matches Wikidata entry without city
|
|
|
|
3. **Small/Local Institutions** (55%):
|
|
- Municipal museums, university repositories, state archives
|
|
- Many not yet cataloged in Wikidata
|
|
- Require manual Wikidata entry creation
|
|
|
|
4. **Mixed Institution Types** (26% classified as MIXED):
|
|
- Memorial parks, cultural complexes, heritage sites
|
|
- Don't fit Wikidata's museo/biblioteca/arquivo taxonomy cleanly
|
|
- Example: Serra da Barriga (archaeological site + museum)
|
|
|
|
### Name Normalization Patterns Tried
|
|
|
|
**Portuguese Prefixes Removed**:
|
|
- Museu → Museum
|
|
- Biblioteca → Library
|
|
- Arquivo → Archive
|
|
- Fundação → Foundation
|
|
- Instituto → Institute
|
|
- Centro Cultural → Cultural Center
|
|
|
|
**Result**: Even with normalization, most institution names in our dataset are too short or generic to match confidently.
|
|
|
|
## Comparison with Other Enrichments
|
|
|
|
| Region | Total | Enriched | Success Rate | Dataset Characteristics |
|
|
|--------|-------|----------|--------------|-------------------------|
|
|
| **Georgia** | 14 | 11 | **78.6%** ✅ | Full institution names, cities included, national/major institutions |
|
|
| **Brazil (Batch 8)** | 115 | 1 | **0.9%** ⚠️ | Abbreviated names, region-only locations, local/municipal institutions |
|
|
|
|
**Key Difference**: Georgian dataset contained well-documented national institutions with complete metadata. Brazilian dataset skews toward smaller, regional institutions with minimal metadata.
|
|
|
|
## Remaining Work
|
|
|
|
**Institutions still without Wikidata**: 106 (92.2%)
|
|
|
|
### Next Steps
|
|
|
|
1. **Add Cities to Location Data** (Priority: HIGH)
|
|
- Research and add city names for 45 institutions with region-only locations
|
|
- Will enable geographic validation in fuzzy matching
|
|
- Script: `scripts/add_brazilian_cities.py` (to be created)
|
|
|
|
2. **Expand Full Names** (Priority: HIGH)
|
|
- Research official full names for abbreviated institutions:
|
|
- SECULT → Secretaria de Estado da Cultura
|
|
- UFAC Repository → Repositório Institucional da Universidade Federal do Acre
|
|
- FEM → Fundação de Cultura Elias Mansour
|
|
- Will dramatically improve fuzzy match scores
|
|
|
|
3. **Create New Wikidata Entries** (Priority: MEDIUM)
|
|
- For institutions confirmed to not exist in Wikidata
|
|
- Focus on major state institutions first (SECULT agencies, state archives)
|
|
- Estimated 30-40 institutions need new Wikidata entries
|
|
|
|
4. **ISIL Code Research** (Priority: LOW)
|
|
- Brazilian ISIL codes not yet included in dataset
|
|
- Research IBICT (Brazilian ISIL agency) registry
|
|
- ISIL codes would provide authoritative cross-references
|
|
|
|
5. **Alternative Enrichment Strategy** (Priority: MEDIUM)
|
|
- Try enriching from Brazilian digital platforms:
|
|
- Tainacan: https://tainacan.org/
|
|
- Brasiliana USP: https://www.bbm.usp.br/
|
|
- BDTD (theses/dissertations): https://bdtd.ibict.br/
|
|
- These platforms may have better coverage of local institutions
|
|
|
|
## Files
|
|
|
|
- **Input**: `data/instances/brazil/brazilian_institutions_batch7_enriched.yaml`
|
|
- **Output**: `data/instances/brazil/brazilian_institutions_batch8_enriched.yaml`
|
|
- **Script**: `scripts/enrich_brazilian_institutions_batch7_fuzzy.py` (threshold: 0.65, manual verification)
|
|
|
|
## Recommendations
|
|
|
|
Given the low match rate (0.9% vs. Georgia's 78.6%), **Brazilian enrichment requires a different approach**:
|
|
|
|
### Short-Term Strategy
|
|
1. ✅ **Manual Wikidata lookup** for top 20 largest institutions (national museums, state archives)
|
|
2. ✅ **Add complete metadata** (full names, cities) before re-running fuzzy matching
|
|
3. ✅ **Lower priority** for small municipal institutions without Wikidata entries
|
|
|
|
### Long-Term Strategy
|
|
1. **Create Wikidata entries** for major Brazilian heritage institutions not yet cataloged
|
|
2. **Integrate with Brazilian registries**:
|
|
- IBRAM (Brazilian Museums Institute) registry
|
|
- CONARQ (National Archives Council) registry
|
|
- IBICT (Brazilian Information Science Institute) library registry
|
|
3. **Collaborate with Brazilian Wikidata community** to improve heritage institution coverage
|
|
|
|
---
|
|
|
|
**Conclusion**: Brazilian dataset enrichment is progressing slowly (7.0% → 7.8%) due to data quality challenges. Focus next on **metadata completion** (full names + cities) before attempting further automated enrichment.
|
|
|
|
**Next Priority Region**: Consider switching to **Tunisia** (69 institutions, 2.9% coverage) or **Libya** (54 institutions, 18.5% coverage) which may have better-documented institution names in conversations.
|