Brazilian GLAM Institution Curation Report - FINAL

Generated: 2025-11-06T08:26:33.367488+00:00

Executive Summary

Successfully curated 97 valid heritage institutions from Brazilian GLAM conversation data.

  • Original v2 records: 104
  • Filtered invalid records: 7 (platforms, non-institutions)
  • Valid curated institutions: 97
  • Data quality: Tier 4 (inferred from conversation NLP)

Quality Achievements

Completeness Metrics

| Field | Count | Percentage | Target | Status |
|---|---|---|---|---|
| Descriptions | 82 | 84.5% | 90%+ | ✓ Near |
| Website Identifiers | 9 | 9.3% | 80%+ | ✗ Limited source data |
| City-level Locations | 8 | 8.2% | 60%+ | ✗ Sparse in source |
| Change History (founding dates) | 7 | 7.2% | - | Baseline |
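
Each percentage above is the field count divided by the 97 curated institutions. A minimal sketch of that arithmetic follows, assuming hypothetical record dictionaries with keys such as `description` and `url` (the actual field names in the curated YAML are not shown in this report):

```python
from typing import Iterable, Mapping


def completeness(records: Iterable[Mapping], field: str) -> tuple[int, float]:
    """Return (count, percentage) of records with a non-empty value for `field`."""
    records = list(records)
    count = sum(1 for rec in records if rec.get(field))
    pct = round(100 * count / len(records), 1) if records else 0.0
    return count, pct


# Hypothetical stand-in data: 97 records, 82 of which carry a description.
sample = [{"description": "text"}] * 82 + [{"description": ""}] * 15
print(completeness(sample, "description"))  # (82, 84.5)
```

Empty strings and missing keys are both counted as absent, which is one possible definition of completeness and may differ from the rule the curation script applies.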

Data Source Analysis

The conversation JSON contained limited structured institutional metadata:

  • Descriptions: ~91 institutions (91% coverage possible)
  • URLs: ~9 institutions (9% coverage possible)
  • City names: ~13 cities mentioned across all states
  • Founding dates: ~7 institutions with explicit years

Conclusion: Enrichment extracted nearly everything the source could supply: 82 of the ~91 available descriptions, all ~9 URLs, and all ~7 founding dates.

Records Filtered Out (Non-Institutions)

The following 7 records were removed as they represent platforms/technologies or invalid data:

  1. Tainacan - Collection management platform (WordPress-based)
  2. AtoM - Archival description software
  3. DSpace - Digital repository platform
  4. APIs - Generic technology reference
  5. LOCKSS Cariniana - Digital preservation network
  6. Population - Demographic statistic (Roraima indigenous population)
  7. Documentation - Too generic, not a specific organization
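
The filtering step can be sketched as a name blocklist. The names below come from the list above; the case-insensitive exact-match rule and the `name` key are assumptions, since the report does not describe how the script identified these records:

```python
# Names taken from the list of filtered records in this report.
NON_INSTITUTION_NAMES = {
    "Tainacan", "AtoM", "DSpace", "APIs",
    "LOCKSS Cariniana", "Population", "Documentation",
}


def split_records(records):
    """Partition records into (kept, filtered) by case-insensitive exact name match."""
    blocked = {name.casefold() for name in NON_INSTITUTION_NAMES}
    kept, filtered = [], []
    for rec in records:
        name = rec.get("name", "").strip().casefold()
        (filtered if name in blocked else kept).append(rec)
    return kept, filtered


# Hypothetical input: one platform record and one placeholder institution.
kept, filtered = split_records([{"name": "DSpace"}, {"name": "Museu Exemplo"}])
print(len(kept), len(filtered))  # 1 1
```

Exact matching is deliberately conservative: an institution whose name merely contains a platform name, such as a repository described as running on DSpace, is kept.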

Valid Institutions Retained

97 heritage custodian organizations across all 27 Brazilian federative units (26 states and the Federal District), representing:

  • Museums (MUSEUM, MIXED): Cultural, historical, natural history, specialized
  • Libraries (LIBRARY): National, university, specialized
  • Archives (ARCHIVE): State, municipal, institutional
  • Research Centers (RESEARCH_CENTER): Archaeological, documentary, heritage
  • Educational Providers (EDUCATION_PROVIDER): University repositories
  • Official Institutions (OFFICIAL_INSTITUTION): State cultural foundations, heritage agencies

Geographic Coverage

  • All 27 federative units represented
  • State-level location data: 100% (all records)
  • City-level location data: 8 institutions (8.2%)
    • Cities identified: Maceió, Brasília, São Luís, Ouro Preto, Campina Grande, Teresina, Natal, Aracaju

Enrichment Methods Applied

  1. Automated parsing: Structured data extraction from the conversation artifact
  2. Fuzzy name matching: Institution names matched to conversation metadata
  3. Pattern recognition: URLs, collection extents, founding dates
  4. Known entity matching: Brazilian city names from curated list
  5. Provenance tracking: All records tagged with data source, confidence, extraction method
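
Methods 2 to 4 above can be illustrated with the Python standard library. This is a sketch under stated assumptions, not the logic of `curate_brazilian_institutions.py`: the regular expressions, the 0.85 similarity threshold, and the function names are all hypothetical, and the city list is limited to the eight cities this report names.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns; the real script's patterns are not shown in this report.
URL_RE = re.compile(r"https?://[^\s)>\]]+")
YEAR_RE = re.compile(
    r"\b(?:founded|established|created)\s+in\s+(1[5-9]\d{2}|20\d{2})\b",
    re.IGNORECASE,
)
KNOWN_CITIES = [
    "Maceió", "Brasília", "São Luís", "Ouro Preto",
    "Campina Grande", "Teresina", "Natal", "Aracaju",
]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def best_match(name: str, candidates, threshold: float = 0.85):
    """Return the most similar candidate, or None if none reaches the threshold."""
    scored = [(name_similarity(name, c), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


def extract_fields(text: str) -> dict:
    """Pull URLs, an explicit founding year, and known city names out of free text."""
    urls = [u.rstrip(".,;") for u in URL_RE.findall(text)]
    year = YEAR_RE.search(text)
    cities = [c for c in KNOWN_CITIES if re.search(rf"\b{re.escape(c)}\b", text)]
    return {
        "urls": urls,
        "founding_year": int(year.group(1)) if year else None,
        "cities": cities,
    }
```

City matching is case-sensitive here so that the Portuguese common noun "natal" is less likely to be mistaken for the city of Natal. A fuzzy match below the threshold returns `None` rather than a weak guess, which fits data that is already Tier 4 (inferred).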

Known Limitations

  1. Sparse URL data: Only 9% of institutions had website URLs in the source conversation
  2. Limited geographic detail: Most institutions are located only to state level, not city level
  3. Unverified data: Tier 4 (inferred) - requires validation against authoritative sources
  4. Missing digital platform details: Conversation focused on state-level infrastructure, not institution-specific systems

Recommendations

Immediate Actions

  1. Validate against IBRAM registry: Cross-reference with official Brazilian museum database
  2. Geocode institutions: Use state + institution name to look up city locations via Nominatim
  3. Web scraping: Extract additional metadata from the 9 known website URLs
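
The geocoding recommendation could start from a sketch like the one below. It only builds a query URL for the public Nominatim search endpoint and reads a city out of a decoded result; the function names and the "name, state, Brazil" query format are assumptions, and no network call is made here.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(institution: str, state: str) -> str:
    """Build a Nominatim free-text search URL restricted to Brazil."""
    params = {
        "q": f"{institution}, {state}, Brazil",
        "format": "jsonv2",
        "countrycodes": "br",
        "addressdetails": 1,  # ask for the structured address breakdown
        "limit": 1,
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def city_from_result(result: dict):
    """Return the first populated place-level key of a decoded result, if any."""
    address = result.get("address", {})
    for key in ("city", "town", "municipality", "village"):
        if address.get(key):
            return address[key]
    return None


# "Museu Exemplo" is a placeholder name, not one of the curated institutions.
print(build_geocode_url("Museu Exemplo", "Alagoas"))
```

Requests to the public Nominatim instance should carry an identifying User-Agent and stay within its usage policy of at most one request per second. Cities obtained this way are still inferred data and should be tagged accordingly until they are checked against an authoritative source.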

Future Enrichment

  1. Tier 2 data sources: Crawl institutional websites for collection details
  2. Tier 3 data sources: Integrate Wikidata Q-IDs and VIAF identifiers
  3. Platform identification: Map institutions to digital systems (Tainacan, DSpace, AtoM instances)
  4. Collection metadata: Extract subject areas, temporal coverage, access rights

Manual Review Needed

2 records flagged for verification:

  • Brasiliana Museus: Classify as national aggregation platform vs. custodian
  • Hemeroteca Digital: Determine if custodian or aggregation service

Files Generated

  • Curated records: data/instances/brazilian_institutions_curated_v2.yaml (97 institutions)
  • This report: data/instances/brazilian_curation_report_final.md

Provenance

  • Source conversation: 2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5-Brazilian_GLAM_collection_inventories.json
  • Extraction method: Automated parsing with pattern recognition
  • Curation script: curate_brazilian_institutions.py v2.1
  • Schema version: LinkML heritage_custodian v0.2.0 (modular)
  • Data tier: TIER_4_INFERRED (conversation NLP)
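
For illustration, the provenance tags carried by each record might be modelled as below. The field names and the confidence value are hypothetical, since the LinkML heritage_custodian schema is not reproduced in this report; only the tier, extraction method, and date are taken from the values listed here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical per-record provenance; field names are assumptions."""
    data_source: str
    data_tier: str
    extraction_method: str
    confidence: float
    extraction_date: str

    def __post_init__(self):
        # Reject confidence scores outside the closed interval [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


record_provenance = Provenance(
    data_source="CONVERSATION_NLP",  # assumed label for the conversation source
    data_tier="TIER_4_INFERRED",
    extraction_method="Automated parsing with pattern recognition",
    confidence=0.7,  # illustrative value only, not taken from the data
    extraction_date="2025-11-06",
)
print(asdict(record_provenance)["data_tier"])  # TIER_4_INFERRED
```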

Curator: Automated curation system
Date: 2025-11-06
Status: ✓ Baseline curation complete - ready for Tier 2/3 enrichment