BATCH 12 ENRICHMENT REPORT - Brazilian Institutions

Date: 2025-11-11
Batch: 12
Status: COMPLETE
Success Rate: 100% (10/10)


Executive Summary

Successfully enriched 10 Brazilian institutions with verified Wikidata Q-numbers using the authenticated Wikidata search API and SPARQL queries. This batch focused primarily on federal universities (8 institutions) and historical/research institutes (2 institutions).

Coverage Milestone Achieved 🎉

  • Previous coverage (Batch 11): 47.1% (57/121)
  • New coverage (Batch 12): 55.4% (67/121) ← Passed 50% threshold!
  • Institutions enriched: 10
  • Remaining to enrich: 54
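The coverage figures above are plain ratios over the 121 Brazilian institutions. A minimal sketch of the arithmetic follows; the function name `coverage_pct` is illustrative and is not taken from the enrichment script.

```python
def coverage_pct(enriched: int, total: int) -> float:
    """Return coverage as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * enriched / total, 1)


TOTAL_BRAZILIAN = 121  # from the report metadata

before = coverage_pct(57, TOTAL_BRAZILIAN)  # coverage after Batch 11
after = coverage_pct(67, TOTAL_BRAZILIAN)   # coverage after Batch 12
remaining = TOTAL_BRAZILIAN - 67            # institutions still without a Q-number
```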

Enriched Institutions

Federal Universities (8 institutions)

| # | Institution | Q-number | Wikidata Label | State | Confidence |
|---|-------------|----------|----------------|-------|------------|
| 1 | UFPR | Q1232831 | Universidade Federal do Paraná | Paraná | 95% |
| 2 | UFPE | Q2322256 | Universidade Federal de Pernambuco | Pernambuco | 95% |
| 3 | UFPI | Q945699 | Universidade Federal do Piauí | Piauí | 95% |
| 4 | UFRN | Q3847505 | Universidade Federal do Rio Grande do Norte | Rio Grande do Norte | 95% |
| 5 | UFRR | Q7894378 | Universidade Federal de Roraima | Roraima | 95% |
| 6 | UFS | Q7894380 | Universidade Federal de Sergipe | Sergipe | 95% |
| 7 | UFT | Q4481798 | Fundação Universidade Federal do Tocantins | Tocantins | 95% |
| 8 | UFAM | Q5440476 | Universidade Federal do Amazonas | Amazonas | 95% |

Historical & Research Institutes (2 institutions)

| # | Institution | Q-number | Wikidata Label | State | Confidence |
|---|-------------|----------|----------------|-------|------------|
| 9 | Instituto Histórico | Q108221092 | Instituto Histórico e Geográfico de Mato Grosso | Mato Grosso | 93% |
| 10 | UFMS Repositories | Q5440478 | Universidade Federal de Mato Grosso do Sul | Mato Grosso do Sul | 95% |

Methodology

Search Strategy

  1. Primary Search: Wikidata authenticated search API

    • Query format: Full Portuguese institution name
    • Fallback: English translation with abbreviation
  2. Verification: SPARQL queries when needed

    • Example: UFRN required SPARQL query to disambiguate from library entity
    • Query pattern: Universities (P31: Q3918) in Brazilian states (P17: Q155, P131: state)
  3. Metadata Validation: All Q-numbers verified via get_metadata() API

    • Confirmed Portuguese labels match expected institution names
    • Verified descriptions indicate correct institution type (university, not library/archive/museum)
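The primary-then-fallback search order described above could be sketched as follows. This is an illustration under assumptions, not the script's actual code: it targets the public `wbsearchentities` action of the Wikidata API, leaves authentication out entirely, and the helper names (`build_search_url`, `search_with_fallback`) and the injected `search_fn` callable are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name: str, language: str = "pt", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for an institution name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def search_with_fallback(
    queries: Iterable[Tuple[str, str]],
    search_fn: Callable[[str, str], List[Dict]],
) -> Optional[Dict]:
    """Try each (name, language) query in order and return the first hit.

    search_fn is injected so the network call (and any authentication)
    stays outside this sketch; it should return a list of candidate dicts.
    """
    for name, language in queries:
        results = search_fn(name, language)
        if results:
            return results[0]
    return None
```

A caller would pass the full Portuguese name first and the English name with abbreviation second. The first hit is only a candidate and still needs the metadata validation step.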

Data Quality Issues Resolved

  • UFRR initial match: Q118133039 (Museu de Solos - Soil Museum)

    • Correct match: Q7894378 (Universidade Federal de Roraima)
  • UFS initial match: Q50811482 (Tomo - Academic journal)

    • Correct match: Q7894380 (Universidade Federal de Sergipe)
  • UFRN initial match: Q107617217 (No label/description found)

    • Correct match via SPARQL: Q3847505 (Universidade Federal do Rio Grande do Norte)

SPARQL Query Example

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q3918 .           # Instance of: university
  ?item wdt:P17 wd:Q155 .            # Country: Brazil
  ?item wdt:P131* wd:Q43255 .        # Located in: Rio Grande do Norte state
  SERVICE wikibase:label { bd:serviceParam wikibase:language "pt,en". }
}

Result: Q3847505 - Universidade Federal do Rio Grande do Norte
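For reference, the query above can be parameterised by state and sent to the public Wikidata Query Service endpoint. The sketch below is an assumption about how that might look in Python using only the standard library; the function names and the User-Agent string are hypothetical, and the report does not say how the script actually issues SPARQL requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_university_query(state_qid: str) -> str:
    """Return the report's query with the state Q-number substituted."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q3918 .
  ?item wdt:P17 wd:Q155 .
  ?item wdt:P131* wd:{state_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "pt,en". }}
}}
""".strip()


def parse_bindings(payload: dict) -> list:
    """Extract (Q-number, label) pairs from a SPARQL JSON result."""
    rows = []
    for binding in payload["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((qid, label))
    return rows


def run_query(query: str, user_agent: str = "glam-enrichment-sketch/0.1") -> list:
    """Send the query to WDQS and return parsed rows (requires network access)."""
    url = f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
    request = Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )
    with urlopen(request, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Keeping `parse_bindings` separate from `run_query` lets the parsing be checked offline against a saved response.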


Enrichment Statistics

Success Metrics

  • Total attempts: 10
  • Successful enrichments: 10
  • Failed matches: 0
  • Success rate: 100%

Confidence Distribution

  • 95% confidence: 9 institutions (exact name matches)
  • 93% confidence: 1 institution (Instituto Histórico - partial name match)

Geographic Distribution (States Covered)

| Region | States | Count |
|--------|--------|-------|
| Northeast | Pernambuco, Piauí, Rio Grande do Norte, Sergipe | 4 |
| North | Amazonas, Roraima, Tocantins | 3 |
| South | Paraná | 1 |
| Central-West | Mato Grosso, Mato Grosso do Sul | 2 |

Impact Analysis

Coverage Progress

Batch 11 (47.1%) ████████████████████░░░░░░░░░░░░░░░░░░
                  57/121 institutions

Batch 12 (55.4%) ██████████████████████░░░░░░░░░░░░░░░░
                  67/121 institutions (+10)
                  
Target (80%)     ████████████████████████████████░░░░░░
                  97/121 institutions (30 more needed)

Enrichment Velocity

  • Batch 11: 10 institutions (ending at 47.1%, the baseline for Batch 12)
  • Batch 12: 10 institutions (from 47.1% to 55.4%)
  • Increase: +8.3 percentage points
  • Average per batch: 10 institutions

Projection to 80% Coverage

  • Current: 67/121 (55.4%)
  • Target: 97/121 (80%)
  • Remaining: 30 institutions
  • Estimated batches needed: 3 batches (13-15)
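The projection can be reproduced with a few lines of arithmetic. The sketch below assumes every attempted institution is enriched successfully, as in Batch 12; the function name is illustrative.

```python
import math


def batches_to_target(enriched: int, total: int, target_pct: int, per_batch: int):
    """Return (target count, institutions remaining, batches needed)."""
    # Smallest institution count that reaches at least target_pct coverage.
    target_count = math.ceil(target_pct * total / 100)
    remaining = max(0, target_count - enriched)
    batches = math.ceil(remaining / per_batch)
    return target_count, remaining, batches


target_count, remaining, batches = batches_to_target(
    enriched=67, total=121, target_pct=80, per_batch=10
)
```

If later batches succeed at a lower rate, the number of attempts per batch (or the number of batches) has to grow accordingly.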
  • Estimated completion: Mid-late November 2025

Technical Implementation

Files Modified

  • Main dataset: data/instances/all/globalglam-20251111.yaml

    • Added 10 Wikidata identifiers
    • Updated provenance metadata (enrichment_history, last_updated)
    • Created backup: globalglam-20251111.batch12_backup
  • Batch output: data/instances/brazil/batch12_enriched.yaml

    • Summary file with 10 enriched institutions
    • Includes Q-numbers, labels, confidence scores

Enrichment Script

  • File: enrich_brazil_batch12.py
  • Features:
    • Fuzzy name matching (exact and partial)
    • Empty name string bug fix (from Batch 11)
    • Provenance tracking with timestamps
    • Enrichment history entries
    • Automatic backup creation
    • Coverage statistics reporting
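The automatic backup step might look like the sketch below. It is an assumption about the approach rather than the script's code: the function name and the refusal to overwrite an existing backup are illustrative choices, while the suffix mirrors the backup file name listed under Files Modified.

```python
import shutil
from pathlib import Path


def create_backup(dataset: Path, suffix: str = ".batch12_backup") -> Path:
    """Copy the dataset next to itself, replacing its extension with suffix."""
    backup = dataset.with_suffix(suffix)
    if backup.exists():
        # Refuse to overwrite so an earlier backup is never lost silently.
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copy2(dataset, backup)  # copy2 also preserves file timestamps
    return backup
```

Calling `create_backup(Path("data/instances/all/globalglam-20251111.yaml"))` before writing would produce `globalglam-20251111.batch12_backup` in the same directory.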

Provenance Metadata Example

provenance:
  enrichment_history:
    - enrichment_date: "2025-11-11T..."
      enrichment_type: WIKIDATA_IDENTIFIER
      enrichment_method: WIKIDATA_AUTHENTICATED_SEARCH
      match_score: 0.95
      verified: true
      enrichment_source: https://www.wikidata.org
      enrichment_notes: "Batch 12: Federal University of Paraná - exact match. Wikidata label: Universidade Federal do Paraná"
  last_updated: "2025-11-11T..."
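An entry of this shape could be assembled in Python before being serialised to YAML. The sketch below copies the field names from the example above; the helper names, the `now` parameter, and the use of UTC ISO 8601 timestamps are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional


def make_enrichment_entry(
    name_en: str,
    wikidata_label: str,
    match_score: float,
    match_kind: str = "exact",
    batch: int = 12,
    now: Optional[datetime] = None,
) -> dict:
    """Build one enrichment_history entry with the fields shown above."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "enrichment_date": timestamp,
        "enrichment_type": "WIKIDATA_IDENTIFIER",
        "enrichment_method": "WIKIDATA_AUTHENTICATED_SEARCH",
        "match_score": match_score,
        "verified": True,
        "enrichment_source": "https://www.wikidata.org",
        "enrichment_notes": (
            f"Batch {batch}: {name_en} - {match_kind} match. "
            f"Wikidata label: {wikidata_label}"
        ),
    }


def append_enrichment(record: dict, entry: dict) -> dict:
    """Append the entry to the record's provenance and refresh last_updated."""
    provenance = record.setdefault("provenance", {})
    provenance.setdefault("enrichment_history", []).append(entry)
    provenance["last_updated"] = entry["enrichment_date"]
    return record
```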

Data Quality Assurance

Verification Checklist

  • All Q-numbers are real Wikidata entities (no synthetic identifiers)
  • All Q-numbers verified via get_metadata() API
  • Portuguese labels match expected institution names
  • Descriptions confirm correct institution types
  • All institutions verified as Brazilian (country: BR)
  • No duplicate Q-numbers across dataset
  • Confidence scores accurately reflect match quality
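Two of the checklist items, well-formed Q-numbers and no duplicates, can be checked mechanically. The sketch below is illustrative and the helper names are hypothetical. A format check only shows that an identifier looks like a Q-number, so confirming the entity exists still needs the metadata lookup.

```python
import re
from collections import Counter

# A Q-number is "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


def find_invalid_qids(qids):
    """Return identifiers that do not look like Wikidata Q-numbers."""
    return [qid for qid in qids if not QID_PATTERN.match(qid)]


def find_duplicate_qids(qids):
    """Return Q-numbers that occur more than once, sorted for stable output."""
    return sorted(qid for qid, count in Counter(qids).items() if count > 1)


BATCH12_QIDS = [
    "Q1232831", "Q2322256", "Q945699", "Q3847505", "Q7894378",
    "Q7894380", "Q4481798", "Q5440476", "Q108221092", "Q5440478",
]
```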

Name Matching Quality

| Match Type | Count | Example |
|------------|-------|---------|
| Exact abbreviation match | 9 | UFPR → UFPR |
| Partial name match | 1 | Instituto Histórico → Instituto Histórico |
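One way to separate exact from partial matches, while also guarding against the empty-name bug fixed in Batch 11, is sketched below. The normalisation rules and function names are assumptions; only the two confidence values (0.95 and 0.93) come from this report.

```python
import unicodedata
from typing import Optional

# Confidence values as reported for Batch 12.
CONFIDENCE = {"exact": 0.95, "partial": 0.93}


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def classify_match(dataset_name: str, candidate_name: str) -> Optional[str]:
    """Return "exact", "partial", or None when the names do not match."""
    left, right = normalize(dataset_name), normalize(candidate_name)
    if not left or not right:
        # An empty string is a substring of everything, so reject it early.
        return None
    if left == right:
        return "exact"
    if left in right or right in left:
        return "partial"
    return None
```

Substring containment is a deliberately simple stand-in for fuzzy matching. It can also match inside a longer word, so results still need the metadata check.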

Challenges & Solutions

Challenge 1: False Positive Matches

Problem: Initial Wikidata searches returned incorrect entities:

  • UFRR matched to soil museum instead of university
  • UFS matched to academic journal instead of university

Solution:

  1. Implemented metadata verification step
  2. Re-searched with more specific queries (full Portuguese names)
  3. Verified descriptions confirm institution type
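The metadata verification step can be approximated with a keyword check over a candidate's label and description. The sketch below is a heuristic under assumptions: the keyword lists and the function name are illustrative, and the report does not state which terms the script actually checks.

```python
REQUIRED_TERMS = ("universidade", "university")
REJECTED_TERMS = (
    "museu", "museum", "biblioteca", "library",
    "arquivo", "archive", "journal", "revista",
)


def is_expected_university(label: str, description: str = "") -> bool:
    """Accept a candidate only if it reads as a university and nothing else."""
    text = f"{label} {description}".casefold()
    if any(term in text for term in REJECTED_TERMS):
        return False
    return any(term in text for term in REQUIRED_TERMS)
```

With the two false positives from this batch, `is_expected_university("Museu de Solos", "Soil Museum")` and `is_expected_university("Tomo", "Academic journal")` are both rejected, while a candidate labelled "Universidade Federal de Roraima" passes.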

Challenge 2: Missing Wikidata Labels

Problem: UFRN initially matched Q107617217 with no label/description

Solution:

  1. Used SPARQL query to find universities in Rio Grande do Norte state
  2. Found correct entity Q3847505 with proper metadata
  3. Validated via Portuguese label and state location
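A candidate with no label, like the first UFRN match, can be detected before it is accepted. The sketch below assumes the entity is a dict shaped like one item of a Wikidata `wbgetentities` response, with `labels` keyed by language code. The return shape of the `get_metadata()` call mentioned in this report is not documented here and may differ, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def get_label(entity: dict, languages: Sequence[str] = ("pt", "en")) -> Optional[str]:
    """Return the first non-empty label in the preferred languages, if any."""
    labels = entity.get("labels", {})
    for language in languages:
        value = labels.get(language, {}).get("value", "").strip()
        if value:
            return value
    return None


def needs_sparql_fallback(entity: dict) -> bool:
    """Flag candidates that cannot be verified by label alone."""
    return get_label(entity) is None
```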

Challenge 3: Abbreviation Ambiguity

Problem: Brazilian federal universities use standard abbreviations (UFX format) that may match multiple entities

Solution:

  1. Always verify state/location matches expected state
  2. Check description mentions "universidade federal" (federal university)
  3. Use SPARQL with geographic filters when needed

Lessons Learned

  1. Always verify metadata: Search API can return partial matches; metadata validation is essential
  2. SPARQL is powerful: When search fails, SPARQL with property filters (P31, P17, P131) yields accurate results
  3. Federal university pattern: Brazilian federal universities follow the naming convention "Universidade Federal de/do [State]"; searching on the full name gives better matches than the abbreviation
  4. Empty name bug fixed: Batch 11 fix (checking for non-empty names) prevented false positives in Batch 12

Next Steps (Batch 13)

Priority Candidates (54 remaining institutions)

High Priority (likely in Wikidata)

  1. Major state museums: Museu de [State] institutions
  2. State universities: UNESP, UNICAMP branches
  3. National libraries/archives: Biblioteca Nacional branches
  4. Federal heritage agencies: IPHAN regional offices

Medium Priority (may exist in Wikidata)

  1. Municipal museums with Wikipedia articles
  2. Historical societies (Sociedade Histórica)
  3. Religious archives with notable collections

Low Priority (unlikely in Wikidata)

  1. Small municipal archives
  2. Personal collections
  3. Recently established institutions
  4. Digital-only repositories

For Batch 13, focus on state museums and major cultural institutions:

  • Target: 10-12 institutions
  • Search strategy: "[Institution name] Brazil [State]"
  • Expected success rate: 70-80% (some may not exist in Wikidata)

Appendix: Q-number Verification Log

All Q-numbers verified on 2025-11-11:

Q1232831  ✅ Label: "Universidade Federal do Paraná" (pt)
Q2322256  ✅ Label: "Universidade Federal de Pernambuco" (pt)
Q945699   ✅ Label: "Universidade Federal do Piauí" (pt)
Q3847505  ✅ Label: "Universidade Federal do Rio Grande do Norte" (pt) [SPARQL]
Q7894378  ✅ Label: "Universidade Federal de Roraima" (pt)
Q7894380  ✅ Label: "Universidade Federal de Sergipe" (pt)
Q4481798  ✅ Label: "Fundação Universidade Federal do Tocantins" (pt)
Q5440476  ✅ Label: "Universidade Federal do Amazonas" (pt)
Q108221092 ✅ Label: "Instituto Histórico e Geográfico de Mato Grosso" (pt)
Q5440478  ✅ Label: "Universidade Federal de Mato Grosso do Sul" (pt)

Report Metadata

  • Report generated: 2025-11-11
  • Batch number: 12
  • Dataset version: globalglam-20251111.yaml
  • Schema version: LinkML v0.2.1
  • Enrichment script: enrich_brazil_batch12.py
  • Total institutions in dataset: 13,411
  • Brazilian institutions: 121
  • Enrichment author: AI Agent (OpenCode + Claude)
  • Verification method: Wikidata authenticated API + SPARQL

BATCH 12 COMPLETE - 55.4% COVERAGE ACHIEVED