# GLAM Instance Data - Authoritative Files

**Last Updated**: 2025-11-06
**Status**: Consolidated and Archived

## Authoritative Dataset

### Latin American GLAM Institutions (Brazil, Chile, Mexico)

**File**: `latin_american_institutions_AUTHORITATIVE.yaml`

- **Total Institutions**: 304
  - Brazil: 97
  - Chile: 90
  - Mexico: 117
- **Data Tier**: TIER_4_INFERRED (conversation NLP extraction)
- **Enrichments Applied**:
  - ✅ Wikidata IDs: 56 institutions (18.4%)
  - ✅ VIAF IDs: 19 institutions (6.3%) - *API unavailable, IDs preserved*
  - ✅ OpenStreetMap data: 83 institutions (27.3%)
  - ✅ Geocoding: 187 institutions (61.5%)
  - ✅ ISIL Gap Documentation: All 304 institutions
- **File Size**: 470 KB
- **Schema Version**: LinkML v0.2.0 (modular)
- **Last Enrichment**: 2025-11-06 (OpenStreetMap enrichment)

**Enrichment Details**:

| Enrichment Type | Count | Examples |
|----------------|-------|----------|
| Street addresses | 33 | "Avenida Feliciano Coelho 1502" |
| Contact info | 19 | Phone numbers, email addresses |
| Websites | 16 | Institutional URLs from OSM |
| Alternative names | 13 | Multilingual, official names |
| Opening hours | 10 | OSM opening_hours format |

**Use This File For**:

- Production data pipelines
- Export generation (JSON-LD, CSV, GeoJSON)
- Geographic visualization
- Cross-linking with other datasets
- Schema validation
- Research and analysis

## Archived Files

All superseded files have been archived to maintain data provenance and enable rollback if needed.
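One optional way to make a later rollback verifiable is to record a checksum for every archived file. The sketch below is not one of the project's scripts: the helper names `sha256_of` and `build_manifest` are hypothetical, and SHA-256 is an assumed choice of hash. The archive path is the one documented in this README.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Map each file path (relative to archive_dir) to its SHA-256 digest."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


if __name__ == "__main__":
    # Archive location as documented in this README.
    archive = Path("archive/2025-11-06_pre-consolidation")
    if archive.is_dir():
        for rel_path, digest in build_manifest(archive).items():
            print(f"{digest}  {rel_path}")
    else:
        print(f"Archive directory not found: {archive}")
```

Comparing a stored manifest with a freshly computed one before copying a file back would show whether the archived version has changed since consolidation.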
### Archive Location

`archive/2025-11-06_pre-consolidation/`

### Archive Structure

```
archive/2025-11-06_pre-consolidation/
├── intermediate_versions/                              # Enrichment pipeline stages
│   ├── latin_american_institutions.yaml                # Original combined (313 KB)
│   ├── latin_american_institutions_documented.yaml     # + ISIL gap notes (444 KB)
│   ├── latin_american_institutions_enriched.yaml       # + Wikidata (329 KB)
│   ├── latin_american_institutions_viaf_enriched.yaml  # + VIAF IDs (446 KB)
│   └── latin_american_institutions_osm_enriched.yaml   # + OSM data (470 KB) ← SOURCE OF AUTHORITATIVE
├── individual_countries/                               # Pre-combination country files
│   ├── brazilian_institutions.yaml                     # 97 institutions (84 KB)
│   ├── chilean_institutions.yaml                       # 90 institutions (107 KB)
│   └── mexican_institutions.yaml                       # 117 institutions (122 KB)
├── backup_files/                                       # Temporary backup files
│   ├── mexican_institutions.yaml.bak
│   └── mexican_institutions.yaml.bak2
├── latin_american_combination_report.md                # Country combination report
└── latin_american_validation_report.md                 # Validation report
```

### Enrichment Pipeline History

The authoritative file represents the final stage of a 5-phase enrichment pipeline:

1. **Phase 1: Wikidata Enrichment** (2025-11-06)
   - Script: `scripts/enrich_from_wikidata.py`
   - Result: 56 Wikidata IDs added
   - Output: `latin_american_institutions_enriched.yaml`

2. **Phase 2: ISIL Gap Documentation** (2025-11-06)
   - Script: `scripts/add_isil_gap_notes.py`
   - Result: All 304 institutions documented
   - Output: `latin_american_institutions_documented.yaml`

3. **Phase 3: National Library Outreach** (2025-11-06)
   - Script: `scripts/draft_national_library_emails.py`
   - Result: 3 bilingual emails drafted
   - Documentation: `docs/national_library_outreach_emails.md`

4. **Phase 4: VIAF Enrichment** (2025-11-06) ❌ BLOCKED
   - Script: `scripts/enrich_from_viaf.py`
   - Status: VIAF XML/JSON API returns HTTP 404
   - Result: 19 existing VIAF IDs preserved
   - Output: `latin_american_institutions_viaf_enriched.yaml`

5. 
   **Phase 5: OpenStreetMap Enrichment** (2025-11-06) ✅
   - Scripts:
     - `scripts/enrich_from_osm_batched.py`
     - `scripts/resume_osm_enrichment.py`
   - Result: 83 institutions enriched with OSM data
   - Output: `latin_american_institutions_osm_enriched.yaml` → **AUTHORITATIVE**

See `PROGRESS.md` for detailed enrichment statistics and `docs/osm_enrichment_report.md` for Phase 5 analysis.

## Export Files

All exports are generated from the authoritative file.

**Location**: `exports/`

**Generated Files**:

1. `latin_american_institutions_osm_enriched.jsonld` (576 KB) - Linked Data format
2. `latin_american_institutions_osm_enriched.csv` (113 KB) - Spreadsheet format
3. `latin_american_institutions_osm_enriched.geojson` (124 KB) - Geographic format (187 institutions)
4. `latin_american_osm_enriched_statistics.json` (0.9 KB) - Summary statistics

**Export Script**: `scripts/export_latin_american_datasets.py`

## Other Directories

### `brazil/`, `chile/`, `mexico/`

Individual country extraction workspaces. Superseded by consolidated file.

### `cache/`

Geocoding and API response caches. Used for performance optimization.

### `reports/`

Validation reports, quality checks, and analysis documents.

### `test_outputs/`

Development and testing outputs. Not for production use.

### `backups/`

Timestamped backup archives from previous processing stages:

- `2025-11-06_pre-geocoding.tar.gz`
- `2025-11-06_chilean-geocoded-v2.tar.gz`
- `2025-11-06_mexican-geocoded-final.tar.gz`
- etc.

## Data Quality Notes

### Known Limitations

1. **VIAF Enrichment Incomplete**
   - VIAF XML/JSON API unavailable (HTTP 404)
   - Only 19 VIAF IDs from original extractions
   - See `PROGRESS.md` Phase 4 for details

2. **OSM Enrichment Partial**
   - 186 institutions have OSM IDs (61.2%)
   - Only 83 successfully enriched (44.6% enrichment rate)
   - 34 fetch errors (504 gateway timeouts)
   - Missing OSM tags for many heritage institutions

3. 
   **ISIL Codes Missing**
   - No public ISIL registries for BR/MX/CL
   - National library outreach in progress
   - Deadline: 2025-11-13

4. **Geocoding Coverage**
   - 61.5% geocoded (187/304 institutions)
   - 117 institutions lack coordinates
   - Opportunities: Google Places API, manual verification

### Confidence Scores

All extractions include provenance metadata with confidence scores:

- **0.9-1.0**: Explicit mentions with authoritative sources
- **0.7-0.9**: Clear mentions with context
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, needs verification

### Data Tiers

- **TIER_1_AUTHORITATIVE**: CSV registries (not applicable to Latin America)
- **TIER_2_VERIFIED**: Institutional websites (not yet applied)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap (56 + 83 institutions)
- **TIER_4_INFERRED**: NLP extraction from conversations (all 304 institutions)

## Usage Guidelines

### Reading the Authoritative File

```python
import yaml  # provided by the PyYAML package

# Load the dataset; the code below treats the result as a list of institution records
with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")
```

### Validating Against Schema

```bash
linkml-validate -s schemas/heritage_custodian.yaml \
  data/instances/latin_american_institutions_AUTHORITATIVE.yaml
```

### Generating Exports

```bash
python scripts/export_latin_american_datasets.py
```

### Filtering by Country

```python
import yaml  # provided by the PyYAML package

with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Keep institutions that have at least one location with country code 'BR'
brazilian_institutions = [
    inst for inst in institutions
    if inst.get('locations')
    and any(loc.get('country') == 'BR' for loc in inst['locations'])
]
```

## Rollback Instructions

If you need to revert to a previous version:

1. Identify the desired version in `archive/2025-11-06_pre-consolidation/intermediate_versions/`

2. 
   Copy to instances directory:

   ```bash
   cp archive/2025-11-06_pre-consolidation/intermediate_versions/latin_american_institutions_enriched.yaml \
      latin_american_institutions_AUTHORITATIVE.yaml
   ```

3. Regenerate exports if needed

## Next Steps

### Immediate (By 2025-11-13)

1. **National Library Outreach**: Submit 3 email drafts for ISIL codes
2. **Data Quality Review**: Verify fuzzy Wikidata matches (37 matches below 95% confidence)
3. **Geographic Visualization**: Create interactive map from GeoJSON

### Future Enhancements

1. **Web Scraping**: Crawl institutional websites (126 URLs available)
2. **Google Places API**: Enrich 117 non-geocoded institutions
3. **OSM Contribution**: Add missing heritage institutions to OpenStreetMap
4. **Schema Validation**: Run linkml-validate on all 304 records
5. **Relationship Extraction**: Map institutional partnerships and networks

## Contact

**Project**: GLAM Data Extraction
**Schema**: LinkML v0.2.0 (modular)
**Documentation**: `/docs/plan/global_glam/`
**Issues**: See `PROGRESS.md` for known issues and blockers

---

**Archive Date**: 2025-11-06
**Archival Reason**: Consolidation to single authoritative file
**Archived Files**: 12 YAML files, 2 MD reports
**Archive Size**: ~2.5 MB total
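
The archive summary above (file counts and total size) can be rechecked by walking the archive directory. This is a minimal sketch, not part of the project's scripts: `summarize_archive` is a hypothetical helper, and grouping by file suffix is an assumption about how the counts were derived. Backup files such as `.bak` appear under their own suffixes.

```python
from collections import Counter
from pathlib import Path


def summarize_archive(archive_dir: Path) -> tuple[Counter, int]:
    """Count files per suffix and sum their sizes in bytes."""
    counts: Counter = Counter()
    total_bytes = 0
    for path in archive_dir.rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix or "(none)"] += 1
            total_bytes += path.stat().st_size
    return counts, total_bytes


if __name__ == "__main__":
    # Archive location as documented in this README.
    archive = Path("archive/2025-11-06_pre-consolidation")
    if archive.is_dir():
        counts, total_bytes = summarize_archive(archive)
        for suffix, n in sorted(counts.items()):
            print(f"{suffix}: {n} file(s)")
        print(f"Total size: {total_bytes / 1_000_000:.2f} MB")
    else:
        print(f"Archive directory not found: {archive}")
```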