glam/data/instances/brazil/BRAZIL_ENRICHMENT_REPORT.md

Brazilian Institutions Wikidata Enrichment Report

Generated: 2025-11-11 09:30 UTC

Summary

| Metric | Value |
|---|---|
| Total Brazilian institutions | 115 |
| Institutions with Wikidata (before) | 8 (7.0%) |
| Institutions enriched (Batch 8) | 1 |
| Institutions with Wikidata (after) | 9 (7.8%) |
| Coverage increase | +0.9 percentage points (1/115 ≈ 0.87) |

Methodology

  • Query: Wikidata SPARQL endpoint for Brazilian museums, libraries, archives (2,976 results)
  • Normalization: Portuguese + English name normalization (remove prefixes/suffixes)
  • Matching: Fuzzy string matching using SequenceMatcher
  • Threshold: 65% similarity (lowered from 75% to capture more matches)
  • Type compatibility: Prevented museum/archive/library mismatches
  • Manual verification: All matches manually reviewed to remove false positives
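The matching steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the record layout (`name`, `type`, `label`, `qid`) and the helper names `similarity` and `best_match` are assumptions, and only `difflib.SequenceMatcher` and the 0.65 threshold come from the methodology.

```python
from difflib import SequenceMatcher
from typing import Optional

THRESHOLD = 0.65  # similarity cutoff stated in the methodology


def similarity(a: str, b: str) -> float:
    """Case-insensitive SequenceMatcher ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(institution: dict, candidates: list,
               threshold: float = THRESHOLD) -> Optional[tuple]:
    """Return (candidate, score) for the best type-compatible candidate, or None.

    `institution` and `candidates` use a hypothetical layout:
    {"name": ..., "type": ...} and {"qid": ..., "label": ..., "type": ...}.
    """
    best = None
    for cand in candidates:
        # Type compatibility: never pair a museum with a library, and so on.
        if cand["type"] != institution["type"]:
            continue
        score = similarity(institution["name"], cand["label"])
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best


# Illustrative call; the candidate list would normally come from the SPARQL results.
wikidata_candidates = [
    {"qid": "Q948882", "label": "Biblioteca Nacional do Brasil", "type": "LIBRARY"},
]
match = best_match(
    {"name": "Biblioteca Nacional Digital", "type": "LIBRARY"},
    wikidata_candidates,
)
```

Anything returned by a function like this is still only a candidate; the manual review step is what removes false positives.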

Batch 8 Enrichment Results

Successful Match (Manually Verified)

| # | Institution Name | City | Type | Match Score | Wikidata Q-Number | Wikidata Name |
|---|---|---|---|---|---|---|
| 1 | Biblioteca Nacional Digital (BNDigital) | Rio de Janeiro | LIBRARY | 0.706 | Q948882 | Biblioteca Nacional do Brasil |

Enrichment Details:

  • Added Wikidata Q948882 (Biblioteca Nacional do Brasil)
  • Added VIAF identifier: 133262876
  • Added founding date: 1810-01-01
  • Added official website: https://gov.br/bn
  • Match score: 70.6% (manually verified as correct - BNDigital is the digital platform of Brazil's National Library)

False Positives Removed

The automated script initially matched 3 institutions at the 0.65 threshold, but manual review found that 2 of them were false positives:

| Institution | Incorrect Match | Reason for Rejection |
|---|---|---|
| Parque Memorial Quilombo dos Palmares | Biblioteca Pública Municipal Zumbi dos Palmares (Q87755023) | Different institution types (memorial park vs. library), only partial name overlap ("Palmares") |
| Forte Santa Catarina (PB) | Biblioteca Pública do Estado de Santa Catarina (Q16498218) | Geographic mismatch (PB vs. SC state), different institution types (fort vs. library) |

Lesson Learned: Geographic validation is critical. Matching on "Santa Catarina" in the name without verifying the state or region produced an incorrect match that only manual review caught.
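A geographic check of the kind this lesson calls for might look like the sketch below. It assumes both the dataset record and the Wikidata candidate carry a two-letter state code; how the candidate's state would be obtained is not covered by this report, and the function name `check_state` is hypothetical.

```python
from typing import Optional


def check_state(inst_state: Optional[str], cand_state: Optional[str]) -> str:
    """Compare two Brazilian state codes (e.g. "PB", "SC").

    Returns "match", "mismatch", or "unknown" when either side has no state,
    so callers can route unknown cases to manual review instead of accepting them.
    """
    if not inst_state or not cand_state:
        return "unknown"
    same = inst_state.strip().upper() == cand_state.strip().upper()
    return "match" if same else "mismatch"


# Forte Santa Catarina (PB) vs. a library in Santa Catarina state (SC).
verdict = check_state("PB", "SC")
```

Returning a three-way result rather than a boolean keeps region-only or missing locations visible as a separate case rather than silently passing them.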

Coverage by State (Top 10)

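| State | Total Institutions | With Wikidata | Coverage |
|---|---|---|---|
| TO | 12 | 0 | 0.0% |
| GO | 7 | 2 | 28.6% |
| SC | 7 | 0 | 0.0% |
| Unknown | 7 | 1 | 14.3% |
| RS | 6 | 1 | 16.7% |
| RJ | 6 | 2 | 33.3% ⬆️ |
| AP | 5 | 0 | 0.0% |
| AC | 4 | 0 | 0.0% |
| BA | 4 | 0 | 0.0% |
| CE | 4 | 1 | 25.0% |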

Note: Rio de Janeiro (RJ) coverage improved from 16.7% (1/6) to 33.3% (2/6) with Biblioteca Nacional Digital enrichment.
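Per-state coverage figures like these can be recomputed from the institution records with a few lines of Python. The sketch below assumes each record is a dict with optional `state` and `wikidata` keys, which is a hypothetical layout and not necessarily the schema of the YAML files; ties in the total are broken alphabetically, which may differ from the ordering used in the table.

```python
from collections import Counter


def coverage_by_state(records):
    """Return rows of (state, total, with_wikidata, coverage_percent).

    Rows are sorted by total (descending), then by state code.
    Records without a state are grouped under "Unknown".
    """
    totals, with_qid = Counter(), Counter()
    for rec in records:
        state = rec.get("state") or "Unknown"
        totals[state] += 1
        if rec.get("wikidata"):
            with_qid[state] += 1
    rows = [
        (state, totals[state], with_qid[state],
         100.0 * with_qid[state] / totals[state])
        for state in totals
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Illustrative data only: 6 RJ records, 2 of which have a Wikidata ID.
sample = [{"state": "RJ", "wikidata": "Q948882"},
          {"state": "RJ", "wikidata": "Q_PLACEHOLDER"}]
sample += [{"state": "RJ"} for _ in range(4)]
for state, total, matched, pct in coverage_by_state(sample):
    print(f"{state} {total} {matched} {pct:.1f}%")
```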

Challenges and Analysis

Why Low Match Rate?

The Brazilian dataset presents unique challenges:

  1. Generic/Abbreviated Names (60% of institutions):

    • "SECULT", "UFAC Repository", "FEM"
    • Fuzzy matching against Wikidata labels needs full official names; abbreviations give it almost nothing to compare
  2. Region-Only Locations (45% of institutions):

    • Many institutions have only state codes (AC, AP, MS, RS)
    • No city names for geographic validation
    • Example: "Museu da Borracha" (region: AC) cannot be verified against a Wikidata entry without a city
  3. Small/Local Institutions (55%):

    • Municipal museums, university repositories, state archives
    • Many not yet cataloged in Wikidata
    • Require manual Wikidata entry creation
  4. Mixed Institution Types (26% classified as MIXED):

    • Memorial parks, cultural complexes, heritage sites
    • Don't fit Wikidata's museu/biblioteca/arquivo (museum/library/archive) taxonomy cleanly
    • Example: Serra da Barriga (archaeological site + museum)

Name Normalization Patterns Tried

Portuguese prefixes removed (English equivalents shown for reference):

  • Museu → Museum
  • Biblioteca → Library
  • Arquivo → Archive
  • Fundação → Foundation
  • Instituto → Institute
  • Centro Cultural → Cultural Center

Result: Even with normalization, most institution names in our dataset are too short or generic to match confidently.
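To make the effect of prefix removal concrete, here is a minimal normalization sketch. It is an assumption about how such a step could work, not the code used for Batch 8: the function names are hypothetical, and only the prefix list is taken from this report.

```python
import re
import unicodedata

# Prefixes from the list above, accent-folded; the longest is checked first.
PREFIXES = ("centro cultural", "biblioteca", "instituto",
            "fundacao", "arquivo", "museu")


def strip_accents(text: str) -> str:
    """Fold accented characters to ASCII equivalents (ç -> c, ã -> a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name: str) -> str:
    """Lowercase, fold accents, drop punctuation, and remove one leading prefix."""
    text = strip_accents(name).lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)   # punctuation becomes spaces
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    for prefix in PREFIXES:
        if text.startswith(prefix + " "):
            text = text[len(prefix):].strip()
            break
    return text


print(normalize_name("Museu da Borracha"))
print(normalize_name("Fundação de Cultura Elias Mansour"))
```

After the prefix is gone, a name such as "Museu da Borracha" leaves only two short tokens to compare, which is consistent with the result noted above.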

Comparison with Other Enrichments

| Region | Total | Enriched | Success Rate | Dataset Characteristics |
|---|---|---|---|---|
| Georgia | 14 | 11 | 78.6% | Full institution names, cities included, national/major institutions |
| Brazil (Batch 8) | 115 | 1 | 0.9% ⚠️ | Abbreviated names, region-only locations, local/municipal institutions |

Key Difference: Georgian dataset contained well-documented national institutions with complete metadata. Brazilian dataset skews toward smaller, regional institutions with minimal metadata.

Remaining Work

Institutions still without Wikidata: 106 (92.2%)

Next Steps

  1. Add Cities to Location Data (Priority: HIGH)

    • Research and add city names for 45 institutions with region-only locations
    • Will enable geographic validation in fuzzy matching
    • Script: scripts/add_brazilian_cities.py (to be created)
  2. Expand Full Names (Priority: HIGH)

    • Research official full names for abbreviated institutions:
      • SECULT → Secretaria de Estado da Cultura
      • UFAC Repository → Repositório Institucional da Universidade Federal do Acre
      • FEM → Fundação de Cultura Elias Mansour
    • Will dramatically improve fuzzy match scores
  3. Create New Wikidata Entries (Priority: MEDIUM)

    • For institutions confirmed to not exist in Wikidata
    • Focus on major state institutions first (SECULT agencies, state archives)
    • Estimated 30-40 institutions need new Wikidata entries
  4. ISIL Code Research (Priority: LOW)

    • Brazilian ISIL codes not yet included in dataset
    • Research IBICT (Brazilian ISIL agency) registry
    • ISIL codes would provide authoritative cross-references
  5. Alternative Enrichment Strategy (Priority: MEDIUM)
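Step 2 above (expanding abbreviated names) could be applied as a simple lookup before fuzzy matching is re-run. The sketch below uses only the three expansions listed in that step; the table name `EXPANSIONS` and the function `expand_name` are hypothetical, and a real table would need to be researched per institution.

```python
# Expansions taken from step 2 above; keys are compared case-insensitively.
EXPANSIONS = {
    "secult": "Secretaria de Estado da Cultura",
    "ufac repository": "Repositório Institucional da Universidade Federal do Acre",
    "fem": "Fundação de Cultura Elias Mansour",
}


def expand_name(name: str) -> str:
    """Return the researched full name if one is known, else the name unchanged."""
    return EXPANSIONS.get(name.strip().casefold(), name)


print(expand_name("FEM"))
print(expand_name("Museu da Borracha"))  # no expansion known, returned as-is
```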

Files

  • Input: data/instances/brazil/brazilian_institutions_batch7_enriched.yaml
  • Output: data/instances/brazil/brazilian_institutions_batch8_enriched.yaml
  • Script: scripts/enrich_brazilian_institutions_batch7_fuzzy.py (threshold: 0.65, manual verification)

Recommendations

Given the low match rate (0.9% vs. Georgia's 78.6%), Brazilian enrichment requires a different approach:

Short-Term Strategy

  1. Manual Wikidata lookup for top 20 largest institutions (national museums, state archives)
  2. Add complete metadata (full names, cities) before re-running fuzzy matching
  3. Lower priority for small municipal institutions without Wikidata entries

Long-Term Strategy

  1. Create Wikidata entries for major Brazilian heritage institutions not yet cataloged
  2. Integrate with Brazilian registries:
    • IBRAM (Brazilian Museums Institute) registry
    • CONARQ (National Archives Council) registry
    • IBICT (Brazilian Information Science Institute) library registry
  3. Collaborate with Brazilian Wikidata community to improve heritage institution coverage

Conclusion: Brazilian dataset enrichment is progressing slowly (7.0% → 7.8%) due to data quality challenges. Focus next on metadata completion (full names + cities) before attempting further automated enrichment.

Next Priority Region: Consider switching to Tunisia (69 institutions, 2.9% coverage) or Libya (54 institutions, 18.5% coverage), which may have better-documented institution names in the source conversations.