BATCH 12 ENRICHMENT REPORT - Brazilian Institutions
Date: 2025-11-11
Batch: 12
Status: ✅ COMPLETE
Success Rate: 100% (10/10)
Executive Summary
Successfully enriched 10 Brazilian institutions with verified Wikidata Q-numbers, using the authenticated Wikidata search API and SPARQL queries. This batch focused primarily on federal universities (8 institutions) and historical/research institutes (2 institutions).
Coverage Milestone Achieved 🎉
- Previous coverage (Batch 11): 47.1% (57/121)
- New coverage (Batch 12): 55.4% (67/121) ← Passed 50% threshold!
- Institutions enriched: 10
- Remaining to enrich: 54
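The coverage figures above follow directly from the institution counts. A minimal sketch of the arithmetic (the `coverage` helper is illustrative and not taken from the enrichment script):

```python
def coverage(enriched: int, total: int) -> float:
    """Return coverage as a percentage, rounded to one decimal place."""
    return round(100 * enriched / total, 1)


TOTAL_BRAZILIAN = 121  # Brazilian institutions in the dataset

before = coverage(57, TOTAL_BRAZILIAN)  # after Batch 11
after = coverage(67, TOTAL_BRAZILIAN)   # after Batch 12
remaining = TOTAL_BRAZILIAN - 67        # institutions still without a Q-number

print(f"{before}% -> {after}%, {remaining} remaining")
```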
Enriched Institutions
Federal Universities (8 institutions)
| # | Institution | Q-number | Wikidata Label | State | Confidence |
|---|---|---|---|---|---|
| 1 | UFPR | Q1232831 | Universidade Federal do Paraná | Paraná | 95% |
| 2 | UFPE | Q2322256 | Universidade Federal de Pernambuco | Pernambuco | 95% |
| 3 | UFPI | Q945699 | Universidade Federal do Piauí | Piauí | 95% |
| 4 | UFRN | Q3847505 | Universidade Federal do Rio Grande do Norte | Rio Grande do Norte | 95% |
| 5 | UFRR | Q7894378 | Universidade Federal de Roraima | Roraima | 95% |
| 6 | UFS | Q7894380 | Universidade Federal de Sergipe | Sergipe | 95% |
| 7 | UFT | Q4481798 | Fundação Universidade Federal do Tocantins | Tocantins | 95% |
| 8 | UFAM | Q5440476 | Universidade Federal do Amazonas | Amazonas | 95% |
Historical & Research Institutes (2 institutions)
| # | Institution | Q-number | Wikidata Label | State | Confidence |
|---|---|---|---|---|---|
| 9 | Instituto Histórico | Q108221092 | Instituto Histórico e Geográfico de Mato Grosso | Mato Grosso | 93% |
| 10 | UFMS Repositories | Q5440478 | Universidade Federal de Mato Grosso do Sul | Mato Grosso do Sul | 95% |
Methodology
Search Strategy
- Primary Search: Wikidata authenticated search API
  - Query format: Full Portuguese institution name
  - Fallback: English translation with abbreviation
- Verification: SPARQL queries when needed
  - Example: UFRN required a SPARQL query to disambiguate from a library entity
  - Query pattern: Universities (P31: Q3918) in Brazilian states (P17: Q155, P131: state)
- Metadata Validation: All Q-numbers verified via get_metadata() API
  - Confirmed Portuguese labels match expected institution names
  - Verified descriptions indicate correct institution type (university, not library/archive/museum)
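The primary-then-fallback search order described above could look roughly like the sketch below. It assumes the search goes through the `wbsearchentities` action of the Wikidata API; the `find_candidates` helper and the injected `search` callable are hypothetical, and authentication is left out because the report does not describe how it is done.

```python
from typing import Callable

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_params(query: str, language: str) -> dict:
    """Query parameters for the wbsearchentities action."""
    return {
        "action": "wbsearchentities",
        "search": query,
        "language": language,
        "type": "item",
        "format": "json",
    }


def find_candidates(
    name_pt: str,
    name_en: str,
    abbreviation: str,
    search: Callable[[dict], list],
) -> list:
    """Try the full Portuguese name first, then English name plus abbreviation.

    `search` is a hypothetical callable that sends the parameters to
    WIKIDATA_API and returns the list of matching entities.
    """
    attempts = [
        search_params(name_pt, "pt"),
        search_params(f"{name_en} {abbreviation}", "en"),
    ]
    for params in attempts:
        results = search(params)
        if results:
            return results
    return []
```

Passing the search function in as an argument keeps the fallback logic testable without network access.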
Data Quality Issues Resolved
False Positive Corrections (from initial search)
- UFRR initial match: Q118133039 (Museu de Solos - Soil Museum) ❌
  - Correct match: Q7894378 (Universidade Federal de Roraima) ✅
- UFS initial match: Q50811482 (Tomo - Academic journal) ❌
  - Correct match: Q7894380 (Universidade Federal de Sergipe) ✅
- UFRN initial match: Q107617217 (No label/description found) ❌
  - Correct match via SPARQL: Q3847505 (Universidade Federal do Rio Grande do Norte) ✅
SPARQL Query Example
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q3918 .      # Instance of: university
  ?item wdt:P17 wd:Q155 .       # Country: Brazil
  ?item wdt:P131* wd:Q43255 .   # Located in: Rio Grande do Norte state
  SERVICE wikibase:label { bd:serviceParam wikibase:language "pt,en". }
}
Result: Q3847505 - Universidade Federal do Rio Grande do Norte
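To reuse this query for other states, the state Q-number can be turned into a parameter. The sketch below only builds the query string; the helper name is hypothetical, and sending the query to the public endpoint (https://query.wikidata.org/sparql) is left to the caller.

```python
import re

# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """\
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q3918 .        # instance of: university
  ?item wdt:P17 wd:Q155 .         # country: Brazil
  ?item wdt:P131* wd:{state} .    # located in the given state
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{languages}". }}
}}
"""


def universities_in_state_query(state_qid: str, languages: str = "pt,en") -> str:
    """Build the SPARQL query for universities located in one Brazilian state."""
    if not re.fullmatch(r"Q[1-9]\d*", state_qid):
        raise ValueError(f"not a Q-number: {state_qid!r}")
    return QUERY_TEMPLATE.format(state=state_qid, languages=languages)


# Q43255 is the state identifier used in the query above.
print(universities_in_state_query("Q43255"))
```

Validating the Q-number before formatting avoids pasting arbitrary text into the query.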
Enrichment Statistics
Success Metrics
- Total attempts: 10
- Successful enrichments: 10
- Failed matches: 0
- Success rate: 100%
Confidence Distribution
- 95% confidence: 9 institutions (exact name matches)
- 93% confidence: 1 institution (Instituto Histórico - partial name match)
Geographic Distribution (States Covered)
| Region | States | Count |
|---|---|---|
| Northeast | Pernambuco, Piauí, Rio Grande do Norte, Sergipe | 4 |
| North | Amazonas, Roraima, Tocantins | 3 |
| South | Paraná | 1 |
| Central-West | Mato Grosso, Mato Grosso do Sul | 2 |
Impact Analysis
Coverage Progress
Batch 11 (47.1%) ████████████████████░░░░░░░░░░░░░░░░░░
57/121 institutions
Batch 12 (55.4%) ██████████████████████░░░░░░░░░░░░░░░░
67/121 institutions (+10)
Target (80%) ████████████████████████████████░░░░░░
97/121 institutions (30 more needed)
Enrichment Velocity
- Batch 11: 10 institutions (ending at 47.1%, the baseline for Batch 12)
- Batch 12: 10 institutions (from 47.1% to 55.4%)
- Increase: +8.3 percentage points
- Average per batch: 10 institutions
Projection to 80% Coverage
- Current: 67/121 (55.4%)
- Target: 97/121 (80%)
- Remaining: 30 institutions
- Estimated batches needed: 3 batches (13-15)
- Estimated completion: Mid-late November 2025
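The projection is a ceiling division over the remaining gap. A small sketch of that calculation, assuming every future batch adds exactly 10 institutions (the function name is illustrative):

```python
import math


def batches_needed(current: int, total: int, target_percent: int, per_batch: int):
    """Return (target count, institutions remaining, batches required)."""
    target_count = math.ceil(total * target_percent / 100)
    remaining = max(0, target_count - current)
    return target_count, remaining, math.ceil(remaining / per_batch)


target, remaining_to_target, batches = batches_needed(
    current=67, total=121, target_percent=80, per_batch=10
)
print(target, remaining_to_target, batches)
```

A lower yield per batch increases the batch count, so `per_batch` is the parameter to vary when testing the estimate.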
Technical Implementation
Files Modified
- Main dataset: data/instances/all/globalglam-20251111.yaml
  - Added 10 Wikidata identifiers
  - Updated provenance metadata (enrichment_history, last_updated)
  - Created backup: globalglam-20251111.batch12_backup
- Batch output: data/instances/brazil/batch12_enriched.yaml
  - Summary file with 10 enriched institutions
  - Includes Q-numbers, labels, confidence scores
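A backup with the naming shown above can be produced before the dataset is modified. This is a minimal sketch that assumes the backup is a plain file copy with the `.yaml` suffix replaced; the actual script may do it differently.

```python
import shutil
from pathlib import Path


def backup_dataset(dataset: Path, batch: int) -> Path:
    """Copy the dataset next to itself as <stem>.batch<N>_backup."""
    backup = dataset.with_suffix(f".batch{batch}_backup")
    shutil.copy2(dataset, backup)  # copy2 also preserves timestamps
    return backup
```

For `globalglam-20251111.yaml` and batch 12, this yields `globalglam-20251111.batch12_backup` in the same directory.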
Enrichment Script
- File: enrich_brazil_batch12.py
- Features:
  - Fuzzy name matching (exact and partial)
  - Empty name string bug fix (from Batch 11)
  - Provenance tracking with timestamps
  - Enrichment history entries
  - Automatic backup creation
  - Coverage statistics reporting
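The exact/partial matching and the empty-name guard might be implemented along these lines. This is a sketch, not the script's code: it assumes partial matching is a substring test on accent-stripped names, which would also explain why an empty name has to be rejected explicitly (an empty string is a substring of every label).

```python
import unicodedata
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def match_type(expected: str, label: str) -> Optional[str]:
    """Return "exact", "partial", or None when the names do not match."""
    a, b = normalize(expected), normalize(label)
    if not a or not b:
        return None  # guard: "" would otherwise match every label
    if a == b:
        return "exact"
    if a in b or b in a:
        return "partial"
    return None


print(match_type("Instituto Histórico", "Instituto Histórico e Geográfico de Mato Grosso"))
```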
Provenance Metadata Example
provenance:
  enrichment_history:
    - enrichment_date: "2025-11-11T..."
      enrichment_type: WIKIDATA_IDENTIFIER
      enrichment_method: WIKIDATA_AUTHENTICATED_SEARCH
      match_score: 0.95
      verified: true
      enrichment_source: https://www.wikidata.org
      enrichment_notes: "Batch 12: Federal University of Paraná - exact match. Wikidata label: Universidade Federal do Paraná"
  last_updated: "2025-11-11T..."
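An entry of this shape could be generated and attached to a record as follows. The field names come from the example above; the helper functions and the in-memory `record` dict are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def provenance_entry(match_score: float, notes: str,
                     now: Optional[datetime] = None) -> dict:
    """Build one enrichment_history entry with a UTC timestamp."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "enrichment_date": timestamp,
        "enrichment_type": "WIKIDATA_IDENTIFIER",
        "enrichment_method": "WIKIDATA_AUTHENTICATED_SEARCH",
        "match_score": match_score,
        "verified": True,
        "enrichment_source": "https://www.wikidata.org",
        "enrichment_notes": notes,
    }


def record_enrichment(record: dict, entry: dict) -> None:
    """Append the entry to the history and refresh last_updated."""
    provenance = record.setdefault("provenance", {})
    provenance.setdefault("enrichment_history", []).append(entry)
    provenance["last_updated"] = entry["enrichment_date"]
```

Appending instead of overwriting keeps earlier batches visible in `enrichment_history`.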
Data Quality Assurance
Verification Checklist
- ✅ All Q-numbers are real Wikidata entities (no synthetic identifiers)
- ✅ All Q-numbers verified via get_metadata() API
- ✅ Portuguese labels match expected institution names
- ✅ Descriptions confirm correct institution types
- ✅ All institutions verified as Brazilian (country: BR)
- ✅ No duplicate Q-numbers across dataset
- ✅ Confidence scores accurately reflect match quality
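Two of these checks, well-formed identifiers and no duplicates, are easy to automate. The sketch below runs them over the ten Q-numbers from this batch; `check_qids` is a hypothetical helper, and a format check alone does not prove that an entity exists in Wikidata.

```python
import re
from collections import Counter

QID_PATTERN = re.compile(r"Q[1-9]\d*")

BATCH_12_QIDS = [
    "Q1232831", "Q2322256", "Q945699", "Q3847505", "Q7894378",
    "Q7894380", "Q4481798", "Q5440476", "Q108221092", "Q5440478",
]


def check_qids(qids: list) -> tuple:
    """Return (malformed identifiers, identifiers that appear more than once)."""
    malformed = [q for q in qids if not QID_PATTERN.fullmatch(q)]
    duplicates = sorted(q for q, n in Counter(qids).items() if n > 1)
    return malformed, duplicates


malformed, duplicates = check_qids(BATCH_12_QIDS)
print("malformed:", malformed, "duplicates:", duplicates)
```

Running the same check over the full dataset, not just one batch, is what catches duplicates across batches.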
Name Matching Quality
| Match Type | Count | Example |
|---|---|---|
| Exact abbreviation match | 9 | UFPR → UFPR |
| Partial name match | 1 | Instituto Histórico → Instituto Histórico |
Challenges & Solutions
Challenge 1: False Positive Matches
Problem: Initial Wikidata searches returned incorrect entities:
- UFRR matched to soil museum instead of university
- UFS matched to academic journal instead of university
Solution:
- Implemented metadata verification step
- Re-searched with more specific queries (full Portuguese names)
- Verified descriptions confirm institution type
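The description check can be expressed as a simple keyword filter. This is a rough sketch: the keyword lists are assumptions rather than the ones used in the script, and the sample descriptions are invented for illustration.

```python
# Words that signal a different institution type (assumed list).
REJECTED_TYPES = (
    "museu", "museum", "journal", "revista",
    "biblioteca", "library", "arquivo", "archive",
)


def looks_like_university(description: str) -> bool:
    """Accept only descriptions that mention a university and no other type."""
    text = description.lower()
    if not text:
        return False  # entities without a description cannot be verified
    if any(word in text for word in REJECTED_TYPES):
        return False
    return "universidade" in text or "university" in text


print(looks_like_university("university in Boa Vista, Brazil"))
print(looks_like_university("academic journal"))
```

Rejecting empty descriptions also covers cases where a match has no label or description at all.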
Challenge 2: Missing Wikidata Labels
Problem: UFRN initially matched Q107617217 with no label/description
Solution:
- Used SPARQL query to find universities in Rio Grande do Norte state
- Found correct entity Q3847505 with proper metadata
- Validated via Portuguese label and state location
Challenge 3: Abbreviation Ambiguity
Problem: Brazilian federal universities use standard abbreviations (UFX format) that may match multiple entities
Solution:
- Always verify state/location matches expected state
- Check description mentions "universidade federal" (federal university)
- Use SPARQL with geographic filters when needed
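The first two checks can be combined into a single predicate over search candidates. In this sketch the `Candidate` fields are hypothetical (the `state` value would have to be resolved from P131 beforehand), and the phrase is searched in both label and description, which is slightly more permissive than checking the description alone.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    qid: str
    label: str
    description: str
    state: str  # hypothetical field, e.g. resolved from P131


def is_plausible_federal_university(candidate: Candidate, expected_state: str) -> bool:
    """Require a matching state and the phrase "universidade federal"."""
    same_state = candidate.state.casefold() == expected_state.casefold()
    text = f"{candidate.label} {candidate.description}".casefold()
    return same_state and "universidade federal" in text


# Illustrative candidates for the abbreviation "UFS"; descriptions are invented.
candidates = [
    Candidate("Q50811482", "Tomo", "academic journal", "Sergipe"),
    Candidate("Q7894380", "Universidade Federal de Sergipe", "university", "Sergipe"),
]
accepted = [c.qid for c in candidates if is_plausible_federal_university(c, "Sergipe")]
print(accepted)
```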
Lessons Learned
- Always verify metadata: Search API can return partial matches; metadata validation is essential
- SPARQL is powerful: When search fails, SPARQL with property filters (P31, P17, P131) yields accurate results
- Federal university pattern: Brazilian federal universities generally follow the naming convention "Universidade Federal de/do [State]" - search by the full name rather than the abbreviation for better matches
- Empty name bug fixed: Batch 11 fix (checking for non-empty names) prevented false positives in Batch 12
Next Steps (Batch 13)
Priority Candidates (54 remaining institutions)
High Priority (likely in Wikidata)
- Major state museums: Museu de [State] institutions
- State universities: UNESP, UNICAMP branches
- National libraries/archives: Biblioteca Nacional branches
- Federal heritage agencies: IPHAN regional offices
Medium Priority (may exist in Wikidata)
- Municipal museums with Wikipedia articles
- Historical societies (Sociedade Histórica)
- Religious archives with notable collections
Low Priority (unlikely in Wikidata)
- Small municipal archives
- Personal collections
- Recently established institutions
- Digital-only repositories
Recommended Batch 13 Targets
Focus on state museums and major cultural institutions:
- Target: 10-12 institutions
- Search strategy: "[Institution name] Brazil [State]"
- Expected success rate: 70-80% (some may not exist in Wikidata)
Appendix: Q-number Verification Log
All Q-numbers verified on 2025-11-11:
Q1232831 ✅ Label: "Universidade Federal do Paraná" (pt)
Q2322256 ✅ Label: "Universidade Federal de Pernambuco" (pt)
Q945699 ✅ Label: "Universidade Federal do Piauí" (pt)
Q3847505 ✅ Label: "Universidade Federal do Rio Grande do Norte" (pt) [SPARQL]
Q7894378 ✅ Label: "Universidade Federal de Roraima" (pt)
Q7894380 ✅ Label: "Universidade Federal de Sergipe" (pt)
Q4481798 ✅ Label: "Fundação Universidade Federal do Tocantins" (pt)
Q5440476 ✅ Label: "Universidade Federal do Amazonas" (pt)
Q108221092 ✅ Label: "Instituto Histórico e Geográfico de Mato Grosso" (pt)
Q5440478 ✅ Label: "Universidade Federal de Mato Grosso do Sul" (pt)
Report Metadata
- Report generated: 2025-11-11
- Batch number: 12
- Dataset version: globalglam-20251111.yaml
- Schema version: LinkML v0.2.1
- Enrichment script: enrich_brazil_batch12.py
- Total institutions in dataset: 13,411
- Brazilian institutions: 121
- Enrichment author: AI Agent (OpenCode + Claude)
- Verification method: Wikidata authenticated API + SPARQL
✅ BATCH 12 COMPLETE - 55.4% COVERAGE ACHIEVED