Brazilian Institutions Wikidata Enrichment Report
Generated: 2025-11-11 09:30 UTC
Summary
| Metric | Value |
|---|---|
| Total Brazilian institutions | 115 |
| Institutions with Wikidata (before) | 8 (7.0%) |
| Institutions enriched (Batch 8) | 1 |
| Institutions with Wikidata (after) | 9 (7.8%) |
| Coverage increase | +0.9 percentage points (1 of 115 institutions) |
Methodology
- Query: Wikidata SPARQL endpoint for Brazilian museums, libraries, archives (2,976 results)
- Normalization: Portuguese + English name normalization (remove prefixes/suffixes)
- Matching: Fuzzy string matching using SequenceMatcher
- Threshold: 65% similarity (lowered from 75% to capture more matches)
- Type compatibility: Prevented museum/archive/library mismatches
- Manual verification: All matches manually reviewed to remove false positives
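The matching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the record layout (`name`, `label`, `type`, `qid` keys) and the function names are assumptions made for the example. Only `SequenceMatcher` and the 0.65 threshold come from the methodology.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.65  # similarity threshold stated in the methodology


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(institution: dict, candidates: list):
    """Return (candidate, score) for the best type-compatible candidate, or None.

    `institution` and each candidate are plain dicts; the key names are
    hypothetical and chosen only for this sketch.
    """
    best = None
    for cand in candidates:
        # Type compatibility: never match a museum to a library or archive.
        if cand["type"] != institution["type"]:
            continue
        score = similarity(institution["name"], cand["label"])
        if score >= THRESHOLD and (best is None or score > best[1]):
            best = (cand, score)
    return best
```

Anything returned by `best_match` would still go to manual review, since a score above the threshold only indicates string similarity, not identity.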
Batch 8 Enrichment Results
Successful Match (Manually Verified)
| # | Institution Name | City | Type | Match Score | Wikidata Q-Number | Wikidata Name |
|---|---|---|---|---|---|---|
| 1 | Biblioteca Nacional Digital (BNDigital) | Rio de Janeiro | LIBRARY | 0.706 | Q948882 | Biblioteca Nacional do Brasil |
Enrichment Details:
- Added Wikidata Q948882 (Biblioteca Nacional do Brasil)
- Added VIAF identifier: 133262876
- Added founding date: 1810-01-01
- Added official website: https://gov.br/bn
- Match score: 70.6% (manually verified as correct - BNDigital is the digital platform of Brazil's National Library)
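The identifiers listed above correspond to the Wikidata properties P214 (VIAF ID), P571 (inception) and P856 (official website). A hedged sketch of how they could be requested for a verified Q-number from the Wikidata SPARQL endpoint is shown below; the function names are illustrative, and the code only builds the query and URL without sending a request.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_identifier_query(qid: str) -> str:
    """Build a SPARQL query for VIAF ID, inception date and official website."""
    return f"""
SELECT ?viaf ?inception ?website WHERE {{
  OPTIONAL {{ wd:{qid} wdt:P214 ?viaf. }}
  OPTIONAL {{ wd:{qid} wdt:P571 ?inception. }}
  OPTIONAL {{ wd:{qid} wdt:P856 ?website. }}
}}
""".strip()


def build_request_url(qid: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": build_identifier_query(qid), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


# Example: the URL for the item matched in this batch.
url = build_request_url("Q948882")
```

Each property sits in its own `OPTIONAL` block so that an item missing one value still returns the others.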
False Positives Removed
The automated script initially matched 3 institutions at the 0.65 threshold, but 2 were false positives:
| Institution | Incorrect Match | Reason for Rejection |
|---|---|---|
| Parque Memorial Quilombo dos Palmares | Biblioteca Pública Municipal Zumbi dos Palmares (Q87755023) | Different institution types (memorial park vs. library), only partial name overlap ("Palmares") |
| Forte Santa Catarina (PB) | Biblioteca Pública do Estado de Santa Catarina (Q16498218) | Geographic mismatch (PB vs. SC state), different institution types (fort vs. library) |
Lesson Learned: Geographic validation is critical - matching "Santa Catarina" in the name without verifying the state/region produced an incorrect match that only manual review caught.
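A geographic check of the kind this lesson calls for might look like the sketch below. It assumes both the dataset record and the Wikidata candidate carry a two-letter state code, which is not always true in the current data. The function name and the three result labels are hypothetical.

```python
def geographic_check(institution_state, candidate_state):
    """Compare two Brazilian state codes.

    Returns "match", "mismatch", or "unknown" (when either side is missing),
    so that unverifiable pairs can be routed to manual review instead of
    being silently accepted or rejected.
    """
    if not institution_state or not candidate_state:
        return "unknown"
    if institution_state.strip().upper() == candidate_state.strip().upper():
        return "match"
    return "mismatch"


# The Forte Santa Catarina case from the table above: PB vs. SC.
assert geographic_check("PB", "SC") == "mismatch"
```

Returning "unknown" rather than a boolean keeps region-only and location-less records visible as a separate review queue.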
Coverage by State (Top 10)
| State | Total Institutions | With Wikidata | Coverage |
|---|---|---|---|
| TO | 12 | 0 | 0.0% |
| GO | 7 | 2 | 28.6% |
| SC | 7 | 0 | 0.0% |
| Unknown | 7 | 1 | 14.3% |
| RS | 6 | 1 | 16.7% |
| RJ | 6 | 2 | 33.3% ⬆️ |
| AP | 5 | 0 | 0.0% |
| AC | 4 | 0 | 0.0% |
| BA | 4 | 0 | 0.0% |
| CE | 4 | 1 | 25.0% |
Note: Rio de Janeiro (RJ) coverage improved from 16.7% (1/6) to 33.3% (2/6) with Biblioteca Nacional Digital enrichment.
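The per-state figures above can be recomputed from the institution records. A minimal sketch follows, assuming each record is a dict with hypothetical `state` and `wikidata_id` keys; the actual YAML schema is not shown in this report.

```python
from collections import defaultdict


def coverage_by_state(institutions):
    """Return rows of (state, total, with_wikidata, coverage_percent).

    Rows are sorted by total institutions (descending), then state code.
    Records without a state are grouped under "Unknown".
    """
    totals = defaultdict(int)
    enriched = defaultdict(int)
    for inst in institutions:
        state = inst.get("state") or "Unknown"
        totals[state] += 1
        if inst.get("wikidata_id"):
            enriched[state] += 1
    rows = [
        (state, totals[state], enriched[state],
         round(100 * enriched[state] / totals[state], 1))
        for state in totals
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Running this before and after a batch makes changes such as the RJ improvement easy to spot.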
Challenges and Analysis
Why Low Match Rate?
The Brazilian dataset presents unique challenges:
- Generic/Abbreviated Names (60% of institutions):
  - "SECULT", "UFAC Repository", "FEM"
  - Fuzzy matching against Wikidata labels needs full official names
- Region-Only Locations (45% of institutions):
  - Many institutions have only state codes (AC, AP, MS, RS)
  - No city names for geographic validation
  - Example: "Museu da Borracha" (region: AC) - impossible to verify whether it matches a Wikidata entry without a city
- Small/Local Institutions (55%):
  - Municipal museums, university repositories, state archives
  - Many not yet cataloged in Wikidata
  - Require manual Wikidata entry creation
- Mixed Institution Types (26% classified as MIXED):
  - Memorial parks, cultural complexes, heritage sites
  - Don't fit Wikidata's museu/biblioteca/arquivo taxonomy cleanly
  - Example: Serra da Barriga (archaeological site + museum)
Name Normalization Patterns Tried
Portuguese Prefixes Removed:
- Museu → Museum
- Biblioteca → Library
- Arquivo → Archive
- Fundação → Foundation
- Instituto → Institute
- Centro Cultural → Cultural Center
Result: Even with normalization, most institution names in our dataset are too short or generic to match confidently.
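To illustrate the normalization step, here is a rough sketch that strips accents, lowercases, and removes a leading type word. It is an assumption about how such a step could work, not the project's actual code, and the prefix list simply mirrors the words listed above.

```python
import re
import unicodedata

# Accent-free, lowercase forms of the prefixes listed above.
TYPE_PREFIXES = (
    "centro cultural", "museu", "biblioteca",
    "arquivo", "fundacao", "instituto",
)


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'Fundação' -> 'Fundacao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and accents, remove one leading type word."""
    text = strip_accents(name).lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    for prefix in TYPE_PREFIXES:
        if text.startswith(prefix + " "):
            text = text[len(prefix):].strip()
            break
    return text
```

With this sketch, "Museu da Borracha" becomes "da borracha", while an abbreviation such as "SECULT" is left as "secult", which shows how little text remains for the matcher to compare.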
Comparison with Other Enrichments
| Region | Total | Enriched | Success Rate | Dataset Characteristics |
|---|---|---|---|---|
| Georgia | 14 | 11 | 78.6% ✅ | Full institution names, cities included, national/major institutions |
| Brazil (Batch 8) | 115 | 1 | 0.9% ⚠️ | Abbreviated names, region-only locations, local/municipal institutions |
Key Difference: Georgian dataset contained well-documented national institutions with complete metadata. Brazilian dataset skews toward smaller, regional institutions with minimal metadata.
Remaining Work
Institutions still without Wikidata: 106 (92.2%)
Next Steps
- Add Cities to Location Data (Priority: HIGH)
  - Research and add city names for 45 institutions with region-only locations
  - Will enable geographic validation in fuzzy matching
  - Script: scripts/add_brazilian_cities.py (to be created)
- Expand Full Names (Priority: HIGH)
  - Research official full names for abbreviated institutions:
    - SECULT → Secretaria de Estado da Cultura
    - UFAC Repository → Repositório Institucional da Universidade Federal do Acre
    - FEM → Fundação de Cultura Elias Mansour
  - Should substantially improve fuzzy match scores for these institutions
- Create New Wikidata Entries (Priority: MEDIUM)
  - For institutions confirmed not to exist in Wikidata
  - Focus on major state institutions first (SECULT agencies, state archives)
  - Estimated 30-40 institutions need new Wikidata entries
- ISIL Code Research (Priority: LOW)
  - Brazilian ISIL codes not yet included in dataset
  - Research IBICT (Brazilian ISIL agency) registry
  - ISIL codes would provide authoritative cross-references
- Alternative Enrichment Strategy (Priority: MEDIUM)
  - Try enriching from Brazilian digital platforms:
    - Tainacan: https://tainacan.org/
    - Brasiliana USP: https://www.bbm.usp.br/
    - BDTD (theses/dissertations): https://bdtd.ibict.br/
  - These platforms may have better coverage of local institutions
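The name-expansion step could be applied as a simple lookup before matching. The sketch below uses only the three expansions listed above. The function name is hypothetical, and a real table would need to be researched and maintained by hand.

```python
# Expansions taken from the "Expand Full Names" step; extend as names are researched.
ABBREVIATION_EXPANSIONS = {
    "SECULT": "Secretaria de Estado da Cultura",
    "UFAC Repository": "Repositório Institucional da Universidade Federal do Acre",
    "FEM": "Fundação de Cultura Elias Mansour",
}


def expand_name(name: str) -> str:
    """Return the researched full name if one is known, else the name unchanged."""
    return ABBREVIATION_EXPANSIONS.get(name.strip(), name)
```

Keeping the original name alongside the expanded one in the dataset would preserve provenance, since the expansion is a researched assumption rather than source data.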
Files
- Input: data/instances/brazil/brazilian_institutions_batch7_enriched.yaml
- Output: data/instances/brazil/brazilian_institutions_batch8_enriched.yaml
- Script: scripts/enrich_brazilian_institutions_batch7_fuzzy.py (threshold: 0.65, manual verification)
Recommendations
Given the low match rate (0.9% vs. Georgia's 78.6%), Brazilian enrichment requires a different approach:
Short-Term Strategy
- ✅ Manual Wikidata lookup for top 20 largest institutions (national museums, state archives)
- ✅ Add complete metadata (full names, cities) before re-running fuzzy matching
- ✅ Lower priority for small municipal institutions without Wikidata entries
Long-Term Strategy
- Create Wikidata entries for major Brazilian heritage institutions not yet cataloged
- Integrate with Brazilian registries:
- IBRAM (Brazilian Museums Institute) registry
- CONARQ (National Archives Council) registry
- IBICT (Brazilian Information Science Institute) library registry
- Collaborate with Brazilian Wikidata community to improve heritage institution coverage
Conclusion: Brazilian dataset enrichment is progressing slowly (7.0% → 7.8%) due to data quality challenges. Focus next on metadata completion (full names + cities) before attempting further automated enrichment.
Next Priority Region: Consider switching to Tunisia (69 institutions, 2.9% coverage) or Libya (54 institutions, 18.5% coverage), which may have better-documented institution names in conversations.