# GLAM Instance Data - Authoritative Files

**Last Updated**: 2025-11-06
**Status**: Consolidated and Archived

## Authoritative Dataset

### Latin American GLAM Institutions (Brazil, Chile, Mexico)

**File**: `latin_american_institutions_AUTHORITATIVE.yaml`

- **Total Institutions**: 304
  - Brazil: 97
  - Chile: 90
  - Mexico: 117
- **Data Tier**: TIER_4_INFERRED (conversation NLP extraction)
- **Enrichments Applied**:
  - ✅ Wikidata IDs: 56 institutions (18.4%)
  - ✅ VIAF IDs: 19 institutions (6.3%) - *API unavailable, IDs preserved*
  - ✅ OpenStreetMap data: 83 institutions (27.3%)
  - ✅ Geocoding: 187 institutions (61.5%)
  - ✅ ISIL Gap Documentation: All 304 institutions
- **File Size**: 470 KB
- **Schema Version**: LinkML v0.2.0 (modular)
- **Last Enrichment**: 2025-11-06 (OpenStreetMap enrichment)
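
The coverage percentages in the summary follow directly from the reported counts and the 304-institution total. A minimal arithmetic check is sketched below; the dictionary and function names are illustrative and are not part of the project scripts.

```python
TOTAL_INSTITUTIONS = 304

# Counts as reported in the dataset summary.
ENRICHMENT_COUNTS = {
    "Wikidata IDs": 56,
    "VIAF IDs": 19,
    "OpenStreetMap data": 83,
    "Geocoding": 187,
}


def coverage_percent(count, total=TOTAL_INSTITUTIONS):
    """Return count/total as an unrounded percentage."""
    return 100.0 * count / total


for name, count in ENRICHMENT_COUNTS.items():
    print(f"{name}: {count}/{TOTAL_INSTITUTIONS} = {coverage_percent(count):.2f}%")
```

Note that 19/304 is exactly 6.25%, which the summary reports rounded up to 6.3%.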

**Enrichment Details**:

| Enrichment Type | Count | Examples |
|----------------|-------|----------|
| Street addresses | 33 | "Avenida Feliciano Coelho 1502" |
| Contact info | 19 | Phone numbers, email addresses |
| Websites | 16 | Institutional URLs from OSM |
| Alternative names | 13 | Multilingual, official names |
| Opening hours | 10 | OSM opening_hours format |

**Use This File For**:

- Production data pipelines
- Export generation (JSON-LD, CSV, GeoJSON)
- Geographic visualization
- Cross-linking with other datasets
- Schema validation
- Research and analysis

## Archived Files

All superseded files have been archived to maintain data provenance and enable rollback if needed.

### Archive Location

`archive/2025-11-06_pre-consolidation/`

### Archive Structure

```
archive/2025-11-06_pre-consolidation/
├── intermediate_versions/  # Enrichment pipeline stages
│   ├── latin_american_institutions.yaml  # Original combined (313 KB)
│   ├── latin_american_institutions_documented.yaml  # + ISIL gap notes (444 KB)
│   ├── latin_american_institutions_enriched.yaml  # + Wikidata (329 KB)
│   ├── latin_american_institutions_viaf_enriched.yaml  # + VIAF IDs (446 KB)
│   └── latin_american_institutions_osm_enriched.yaml  # + OSM data (470 KB) ← SOURCE OF AUTHORITATIVE
├── individual_countries/  # Pre-combination country files
│   ├── brazilian_institutions.yaml  # 97 institutions (84 KB)
│   ├── chilean_institutions.yaml  # 90 institutions (107 KB)
│   └── mexican_institutions.yaml  # 117 institutions (122 KB)
├── backup_files/  # Temporary backup files
│   ├── mexican_institutions.yaml.bak
│   └── mexican_institutions.yaml.bak2
├── latin_american_combination_report.md  # Country combination report
└── latin_american_validation_report.md  # Validation report
```

### Enrichment Pipeline History

The authoritative file represents the final stage of a 5-phase enrichment pipeline:

1. **Phase 1: Wikidata Enrichment** (2025-11-06)
   - Script: `scripts/enrich_from_wikidata.py`
   - Result: 56 Wikidata IDs added
   - Output: `latin_american_institutions_enriched.yaml`

2. **Phase 2: ISIL Gap Documentation** (2025-11-06)
   - Script: `scripts/add_isil_gap_notes.py`
   - Result: All 304 institutions documented
   - Output: `latin_american_institutions_documented.yaml`

3. **Phase 3: National Library Outreach** (2025-11-06)
   - Script: `scripts/draft_national_library_emails.py`
   - Result: 3 bilingual emails drafted
   - Documentation: `docs/national_library_outreach_emails.md`

4. **Phase 4: VIAF Enrichment** (2025-11-06) ❌ BLOCKED
   - Script: `scripts/enrich_from_viaf.py`
   - Status: VIAF XML/JSON API returns HTTP 404
   - Result: 19 existing VIAF IDs preserved
   - Output: `latin_american_institutions_viaf_enriched.yaml`

5. **Phase 5: OpenStreetMap Enrichment** (2025-11-06) ✅
   - Scripts:
     - `scripts/enrich_from_osm_batched.py`
     - `scripts/resume_osm_enrichment.py`
   - Result: 83 institutions enriched with OSM data
   - Output: `latin_american_institutions_osm_enriched.yaml` → **AUTHORITATIVE**

See `PROGRESS.md` for detailed enrichment statistics and `docs/osm_enrichment_report.md` for Phase 5 analysis.

## Export Files

All exports are generated from the authoritative file.

**Location**: `exports/`

**Generated Files**:
1. `latin_american_institutions_osm_enriched.jsonld` (576 KB) - Linked Data format
2. `latin_american_institutions_osm_enriched.csv` (113 KB) - Spreadsheet format
3. `latin_american_institutions_osm_enriched.geojson` (124 KB) - Geographic format (187 institutions)
4. `latin_american_osm_enriched_statistics.json` (0.9 KB) - Summary statistics

**Export Script**: `scripts/export_latin_american_datasets.py`

## Other Directories

### `brazil/`, `chile/`, `mexico/`
Individual country extraction workspaces. Superseded by consolidated file.

### `cache/`
Geocoding and API response caches. Used for performance optimization.

### `reports/`
Validation reports, quality checks, and analysis documents.

### `test_outputs/`
Development and testing outputs. Not for production use.

### `backups/`
Timestamped backup archives from previous processing stages:
- `2025-11-06_pre-geocoding.tar.gz`
- `2025-11-06_chilean-geocoded-v2.tar.gz`
- `2025-11-06_mexican-geocoded-final.tar.gz`
- etc.

## Data Quality Notes

### Known Limitations

1. **VIAF Enrichment Incomplete**
   - VIAF XML/JSON API unavailable (HTTP 404)
   - Only 19 VIAF IDs from original extractions
   - See `PROGRESS.md` Phase 4 for details

2. **OSM Enrichment Partial**
   - 186 institutions have OSM IDs (61.2%)
   - Only 83 successfully enriched (44.6% enrichment rate)
   - 34 fetch errors (504 gateway timeouts)
   - Missing OSM tags for many heritage institutions

3. **ISIL Codes Missing**
   - No public ISIL registries for BR/MX/CL
   - National library outreach in progress
   - Deadline: 2025-11-13

4. **Geocoding Coverage**
   - 61.5% geocoded (187/304 institutions)
   - 117 institutions lack coordinates
   - Opportunities: Google Places API, manual verification

### Confidence Scores

All extractions include provenance metadata with confidence scores:
- **0.9-1.0**: Explicit mentions with authoritative sources
- **0.7-0.9**: Clear mentions with context
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, needs verification
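
The bands above share their boundary values (for example, 0.9 appears in two ranges), so any code that buckets scores has to pick a convention. The sketch below is one possible reading, assuming a boundary score goes to the higher band; the function name and the label strings are illustrative and not taken from the project code.

```python
def confidence_band(score: float) -> str:
    """Map a confidence score to a descriptive band.

    Assumption: a score on a shared boundary (0.9, 0.7, 0.5) is assigned
    to the higher band. Scores below 0.3 are not described in this README.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score out of range: {score}")
    if score >= 0.9:
        return "explicit mention, authoritative source"
    if score >= 0.7:
        return "clear mention with context"
    if score >= 0.5:
        return "inferred from context"
    if score >= 0.3:
        return "low confidence, needs verification"
    return "below documented range"


for example in (0.95, 0.9, 0.6, 0.35):
    print(example, "->", confidence_band(example))
```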

### Data Tiers

- **TIER_1_AUTHORITATIVE**: CSV registries (not applicable to Latin America)
- **TIER_2_VERIFIED**: Institutional websites (not yet applied)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap (56 + 83 institutions)
- **TIER_4_INFERRED**: NLP extraction from conversations (all 304 institutions)
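
If code needs to compare tiers, one option is an ordered enumeration. The sketch below assumes that a lower tier number means a more authoritative source, which the tier names suggest but this README does not state outright. `DataTier` and `most_authoritative` are hypothetical names, not part of the schema.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Tier names from the list above; lower value = more authoritative (assumed)."""
    TIER_1_AUTHORITATIVE = 1
    TIER_2_VERIFIED = 2
    TIER_3_CROWD_SOURCED = 3
    TIER_4_INFERRED = 4


def most_authoritative(tier_names):
    """Return the name of the strongest tier among the given tier names."""
    return min(DataTier[name] for name in tier_names).name


# An institution extracted by NLP and later matched in Wikidata or OSM.
print(most_authoritative(["TIER_4_INFERRED", "TIER_3_CROWD_SOURCED"]))
```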

## Usage Guidelines

### Reading the Authoritative File

```python
import yaml

# Load the consolidated dataset; the top level is a list of institution records.
with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")
```

### Validating Against Schema

```bash
linkml-validate -s schemas/heritage_custodian.yaml \
  data/instances/latin_american_institutions_AUTHORITATIVE.yaml
```

### Generating Exports

```bash
python scripts/export_latin_american_datasets.py
```
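
After regenerating, a quick sanity check is to count the point features in the GeoJSON export. The sketch below assumes the export is a standard GeoJSON `FeatureCollection` with one `Point` feature per geocoded institution; that layout is an assumption and is not confirmed by the export script. The helper name `count_point_features` and the sample data are illustrative.

```python
import json


def count_point_features(geojson_text: str) -> int:
    """Count Point features in a GeoJSON FeatureCollection given as text."""
    collection = json.loads(geojson_text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    return sum(
        1
        for feature in collection.get("features", [])
        # A feature may carry a null geometry, so guard before reading its type.
        if (feature.get("geometry") or {}).get("type") == "Point"
    )


# Tiny in-memory sample standing in for the real export file.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-70.65, -33.44]},
         "properties": {"name": "Example institution"}},
        {"type": "Feature", "geometry": None,
         "properties": {"name": "No coordinates"}},
    ],
})
print(count_point_features(sample))
```

Applied to the contents of `exports/latin_american_institutions_osm_enriched.geojson`, the count would be expected to match the 187 geocoded institutions if the layout assumption holds.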

### Filtering by Country

```python
import yaml

with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Keep institutions with at least one location in Brazil (country code 'BR').
brazilian_institutions = [
    inst for inst in institutions
    if inst.get('locations') and
       any(loc.get('country') == 'BR' for loc in inst['locations'])
]
```
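
The same `locations` / `country` fields can be used to tally institutions per country in one pass. This is a minimal sketch that works on an in-memory list; the helper name `count_by_country` and the `name` key in the sample records are illustrative, and each institution is counted once per distinct country code.

```python
from collections import Counter


def count_by_country(institutions):
    """Count institutions per country code, once per distinct country."""
    counts = Counter()
    for inst in institutions:
        # Collect distinct country codes; tolerate missing or empty 'locations'.
        countries = {
            loc.get('country')
            for loc in (inst.get('locations') or [])
            if loc.get('country')
        }
        counts.update(countries)
    return counts


# Small stand-in for the list returned by yaml.safe_load().
sample = [
    {'name': 'Example A', 'locations': [{'country': 'BR'}]},
    {'name': 'Example B', 'locations': [{'country': 'CL'}, {'country': 'CL'}]},
    {'name': 'Example C', 'locations': []},
    {'name': 'Example D'},
]
print(count_by_country(sample))
```

On the full dataset the tallies should come out as BR 97, CL 90, and MX 117, provided every record carries a location with a country code. Records without one drop out of the count, which makes this a cheap completeness check.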

## Rollback Instructions

If you need to revert to a previous version:

1. Identify the desired version in `archive/2025-11-06_pre-consolidation/intermediate_versions/`
2. Copy it over the authoritative file in the instances directory (this example restores the Wikidata-enriched stage):
   ```bash
   cp archive/2025-11-06_pre-consolidation/intermediate_versions/latin_american_institutions_enriched.yaml \
      latin_american_institutions_AUTHORITATIVE.yaml
   ```
3. Regenerate exports if needed
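
Because the `cp` in step 2 overwrites the current authoritative file, it can be worth keeping a copy of it first. The helper below is a sketch of that idea using only the standard library; the function name `rollback` and the `.pre-rollback` suffix are hypothetical and not part of the project.

```python
import shutil
from pathlib import Path


def rollback(archived: Path, authoritative: Path) -> Path:
    """Replace `authoritative` with `archived`, backing up the current file.

    Returns the path where the previous authoritative file was saved.
    """
    if not archived.is_file():
        raise FileNotFoundError(archived)
    # Hypothetical naming convention for the safety copy.
    backup = authoritative.with_name(authoritative.name + ".pre-rollback")
    if authoritative.exists():
        shutil.copy2(authoritative, backup)
    shutil.copy2(archived, authoritative)
    return backup
```

It would be called with the archived stage from step 1 and the path to `latin_american_institutions_AUTHORITATIVE.yaml`, followed by step 3.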

## Next Steps

### Immediate (By 2025-11-13)
1. **National Library Outreach**: Submit 3 email drafts for ISIL codes
2. **Data Quality Review**: Verify fuzzy Wikidata matches (37 matches below 95% confidence)
3. **Geographic Visualization**: Create interactive map from GeoJSON

### Future Enhancements
1. **Web Scraping**: Crawl institutional websites (126 URLs available)
2. **Google Places API**: Enrich 117 non-geocoded institutions
3. **OSM Contribution**: Add missing heritage institutions to OpenStreetMap
4. **Schema Validation**: Run linkml-validate on all 304 records
5. **Relationship Extraction**: Map institutional partnerships and networks

## Contact

**Project**: GLAM Data Extraction
**Schema**: LinkML v0.2.0 (modular)
**Documentation**: `/docs/plan/global_glam/`
**Issues**: See `PROGRESS.md` for known issues and blockers

---

**Archive Date**: 2025-11-06
**Archival Reason**: Consolidation to single authoritative file
**Archived Files**: 12 YAML files, 2 MD reports
**Archive Size**: ~2.5 MB total