# GLAM Instance Data - Authoritative Files
**Last Updated**: 2025-11-06
**Status**: Consolidated and Archived
## Authoritative Dataset
### Latin American GLAM Institutions (Brazil, Chile, Mexico)
**File**: `latin_american_institutions_AUTHORITATIVE.yaml`
- **Total Institutions**: 304
  - Brazil: 97
  - Chile: 90
  - Mexico: 117
- **Data Tier**: TIER_4_INFERRED (conversation NLP extraction)
- **Enrichments Applied**:
  - ✅ Wikidata IDs: 56 institutions (18.4%)
  - ✅ VIAF IDs: 19 institutions (6.3%) - *API unavailable, IDs preserved*
  - ✅ OpenStreetMap data: 83 institutions (27.3%)
  - ✅ Geocoding: 187 institutions (61.5%)
  - ✅ ISIL Gap Documentation: All 304 institutions
- **File Size**: 470 KB
- **Schema Version**: LinkML v0.2.0 (modular)
- **Last Enrichment**: 2025-11-06 (OpenStreetMap enrichment)
**Enrichment Details**:
| Enrichment Type | Count | Examples |
|----------------|-------|----------|
| Street addresses | 33 | "Avenida Feliciano Coelho 1502" |
| Contact info | 19 | Phone numbers, email addresses |
| Websites | 16 | Institutional URLs from OSM |
| Alternative names | 13 | Multilingual, official names |
| Opening hours | 10 | OSM opening_hours format |
**Use This File For**:
- Production data pipelines
- Export generation (JSON-LD, CSV, GeoJSON)
- Geographic visualization
- Cross-linking with other datasets
- Schema validation
- Research and analysis
## Archived Files
All superseded files have been archived to maintain data provenance and enable rollback if needed.
### Archive Location
`archive/2025-11-06_pre-consolidation/`
### Archive Structure
```
archive/2025-11-06_pre-consolidation/
├── intermediate_versions/ # Enrichment pipeline stages
│ ├── latin_american_institutions.yaml # Original combined (313 KB)
│ ├── latin_american_institutions_documented.yaml # + ISIL gap notes (444 KB)
│ ├── latin_american_institutions_enriched.yaml # + Wikidata (329 KB)
│ ├── latin_american_institutions_viaf_enriched.yaml # + VIAF IDs (446 KB)
│ └── latin_american_institutions_osm_enriched.yaml # + OSM data (470 KB) ← SOURCE OF AUTHORITATIVE
├── individual_countries/ # Pre-combination country files
│ ├── brazilian_institutions.yaml # 97 institutions (84 KB)
│ ├── chilean_institutions.yaml # 90 institutions (107 KB)
│ └── mexican_institutions.yaml # 117 institutions (122 KB)
├── backup_files/ # Temporary backup files
│ ├── mexican_institutions.yaml.bak
│ └── mexican_institutions.yaml.bak2
├── latin_american_combination_report.md # Country combination report
└── latin_american_validation_report.md # Validation report
```
### Enrichment Pipeline History
The authoritative file represents the final stage of a 5-phase enrichment pipeline:
1. **Phase 1: Wikidata Enrichment** (2025-11-06)
   - Script: `scripts/enrich_from_wikidata.py`
   - Result: 56 Wikidata IDs added
   - Output: `latin_american_institutions_enriched.yaml`
2. **Phase 2: ISIL Gap Documentation** (2025-11-06)
   - Script: `scripts/add_isil_gap_notes.py`
   - Result: All 304 institutions documented
   - Output: `latin_american_institutions_documented.yaml`
3. **Phase 3: National Library Outreach** (2025-11-06)
   - Script: `scripts/draft_national_library_emails.py`
   - Result: 3 bilingual emails drafted
   - Documentation: `docs/national_library_outreach_emails.md`
4. **Phase 4: VIAF Enrichment** (2025-11-06) ❌ BLOCKED
   - Script: `scripts/enrich_from_viaf.py`
   - Status: VIAF XML/JSON API returns HTTP 404
   - Result: 19 existing VIAF IDs preserved
   - Output: `latin_american_institutions_viaf_enriched.yaml`
5. **Phase 5: OpenStreetMap Enrichment** (2025-11-06) ✅
   - Scripts:
     - `scripts/enrich_from_osm_batched.py`
     - `scripts/resume_osm_enrichment.py`
   - Result: 83 institutions enriched with OSM data
   - Output: `latin_american_institutions_osm_enriched.yaml` (source of the **AUTHORITATIVE** file)
See `PROGRESS.md` for detailed enrichment statistics and `docs/osm_enrichment_report.md` for Phase 5 analysis.
## Export Files
All exports are generated from the authoritative file.
**Location**: `exports/`
**Generated Files**:
1. `latin_american_institutions_osm_enriched.jsonld` (576 KB) - Linked Data format
2. `latin_american_institutions_osm_enriched.csv` (113 KB) - Spreadsheet format
3. `latin_american_institutions_osm_enriched.geojson` (124 KB) - Geographic format (187 institutions)
4. `latin_american_osm_enriched_statistics.json` (0.9 KB) - Summary statistics
**Export Script**: `scripts/export_latin_american_datasets.py`
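
For readers who want to see roughly what the GeoJSON export involves, here is a minimal sketch. It is not the code in `scripts/export_latin_american_datasets.py`; the `latitude`, `longitude`, and `name` field names and the sample records are assumptions made for illustration. It skips institutions without coordinates, which would be consistent with the GeoJSON file holding only the 187 geocoded institutions.

```python
import json


def to_geojson(institutions):
    """Build a GeoJSON FeatureCollection from records that carry coordinates.

    Assumed (hypothetical) record shape: each institution has a `name` and a
    `locations` list whose entries may hold `latitude` and `longitude`.
    """
    features = []
    for inst in institutions:
        for loc in inst.get("locations") or []:
            lat, lon = loc.get("latitude"), loc.get("longitude")
            if lat is None or lon is None:
                continue  # not geocoded, so it cannot be placed on a map
            features.append({
                "type": "Feature",
                # GeoJSON positions are ordered [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": inst.get("name"), "country": loc.get("country")},
            })
            break  # emit one feature per institution: its first geocoded location
    return {"type": "FeatureCollection", "features": features}


# Made-up sample records; the real dataset is not read here.
sample = [
    {"name": "Example Museum",
     "locations": [{"country": "BR", "latitude": -1.45, "longitude": -48.5}]},
    {"name": "Example Archive", "locations": [{"country": "CL"}]},
]
collection = to_geojson(sample)
print(json.dumps(collection, indent=2))
```

Only the first sample record ends up in the output, because the second has no coordinates.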
## Other Directories
### `brazil/`, `chile/`, `mexico/`
Individual country extraction workspaces. Superseded by the consolidated authoritative file.
### `cache/`
Geocoding and API response caches. Used for performance optimization.
### `reports/`
Validation reports, quality checks, and analysis documents.
### `test_outputs/`
Development and testing outputs. Not for production use.
### `backups/`
Timestamped backup archives from previous processing stages:
- `2025-11-06_pre-geocoding.tar.gz`
- `2025-11-06_chilean-geocoded-v2.tar.gz`
- `2025-11-06_mexican-geocoded-final.tar.gz`
- etc.
## Data Quality Notes
### Known Limitations
1. **VIAF Enrichment Incomplete**
   - VIAF XML/JSON API unavailable (HTTP 404)
   - Only 19 VIAF IDs, all carried over from the original extractions
   - See `PROGRESS.md` Phase 4 for details
2. **OSM Enrichment Partial**
   - 186 institutions have OSM IDs (61.2% of 304)
   - Only 83 of those were successfully enriched (44.6% enrichment rate)
   - 34 fetch errors (HTTP 504 gateway timeouts)
   - Many heritage institutions lack the relevant OSM tags
3. **ISIL Codes Missing**
   - No public ISIL registries for BR/MX/CL
   - National library outreach in progress
   - Deadline: 2025-11-13
4. **Geocoding Coverage**
   - 61.5% geocoded (187/304 institutions)
   - 117 institutions lack coordinates
   - Opportunities: Google Places API, manual verification
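
The percentages above use two different denominators, which is easy to misread. The following sketch simply recomputes them from the counts quoted in this README; it does not read the dataset, and the variable names are illustrative.

```python
# Counts quoted in this README; nothing here is read from the dataset itself.
TOTAL = 304
WITH_OSM_ID = 186
OSM_ENRICHED = 83
GEOCODED = 187


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of all institutions that have an OSM ID.
osm_id_coverage = pct(WITH_OSM_ID, TOTAL)
# Share of OSM-linked institutions that were actually enriched (denominator is 186, not 304).
osm_enrichment_rate = pct(OSM_ENRICHED, WITH_OSM_ID)
# Share of all institutions with coordinates, and the remainder without.
geocoding_coverage = pct(GEOCODED, TOTAL)
not_geocoded = TOTAL - GEOCODED

print(osm_id_coverage, osm_enrichment_rate, geocoding_coverage, not_geocoded)
```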
### Confidence Scores
All extractions include provenance metadata with confidence scores:
- **0.9-1.0**: Explicit mentions with authoritative sources
- **0.7-0.9**: Clear mentions with context
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, needs verification
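
As a sketch of how these bands might be applied in code, the helper below maps a score to a band label. The function name and the labels are hypothetical. Because the listed ranges share their endpoints, it assumes each band includes its lower bound (so 0.9 falls in the top band). Scores below 0.3 are not described above and are reported as unclassified.

```python
def confidence_band(score: float) -> str:
    """Map a confidence score in [0, 1] to a band label (labels are illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score out of range: {score}")
    if score >= 0.9:
        return "explicit"      # explicit mention with an authoritative source
    if score >= 0.7:
        return "clear"         # clear mention with context
    if score >= 0.5:
        return "inferred"      # inferred from context
    if score >= 0.3:
        return "low"           # low confidence, needs verification
    return "unclassified"      # below the ranges documented in this README


for value in (0.95, 0.9, 0.72, 0.5, 0.31, 0.1):
    print(value, confidence_band(value))
```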
### Data Tiers
- **TIER_1_AUTHORITATIVE**: CSV registries (not applicable to Latin America)
- **TIER_2_VERIFIED**: Institutional websites (not yet applied)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap (56 + 83 institutions)
- **TIER_4_INFERRED**: NLP extraction from conversations (all 304 institutions)
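
One way to work with the tiers in code is an ordered enumeration, sketched below. The class and helper names are hypothetical and are not taken from the LinkML schema. The ordering assumes a lower tier number means a more authoritative source, as the tier names suggest.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Data tiers as listed in this README; lower value = more authoritative (assumed)."""
    TIER_1_AUTHORITATIVE = 1
    TIER_2_VERIFIED = 2
    TIER_3_CROWD_SOURCED = 3
    TIER_4_INFERRED = 4


def most_authoritative(tiers):
    """Return the strongest tier among the sources that contributed to a record."""
    return min(tiers)


# A record extracted by NLP (tier 4) and later enriched from Wikidata or OSM (tier 3).
best = most_authoritative([DataTier.TIER_4_INFERRED, DataTier.TIER_3_CROWD_SOURCED])
print(best.name)
```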
## Usage Guidelines
### Reading the Authoritative File
```python
import yaml

# Load the consolidated dataset; the records are iterated as a list of institutions.
with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")
```
### Validating Against Schema
```bash
linkml-validate -s schemas/heritage_custodian.yaml \
  data/instances/latin_american_institutions_AUTHORITATIVE.yaml
```
### Generating Exports
```bash
python scripts/export_latin_american_datasets.py
```
### Filtering by Country
```python
import yaml

with open('latin_american_institutions_AUTHORITATIVE.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Keep institutions that have at least one location with country code 'BR'.
brazilian_institutions = [
    inst for inst in institutions
    if inst.get('locations') and
    any(loc.get('country') == 'BR' for loc in inst['locations'])
]
```
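
The same `locations` / `country` fields can be used to tally institutions per country, for example to check the totals quoted at the top of this README (97 / 90 / 117). The sketch below works on already-loaded records; the helper name and the sample data are made up for illustration.

```python
from collections import Counter


def count_by_country(institutions):
    """Count institutions per country code, counting each institution once per country."""
    counts = Counter()
    for inst in institutions:
        # `locations` may be missing or empty; a set avoids double-counting
        # an institution that lists several locations in the same country.
        countries = {
            loc.get("country")
            for loc in inst.get("locations") or []
            if loc.get("country")
        }
        counts.update(countries)
    return counts


# Made-up sample records; in practice pass the list loaded with yaml.safe_load().
sample = [
    {"name": "Example Museum", "locations": [{"country": "BR"}]},
    {"name": "Example Archive", "locations": [{"country": "CL"}, {"country": "CL"}]},
    {"name": "Example Library", "locations": [{"country": "MX"}]},
    {"name": "Record Without Location"},
]
print(count_by_country(sample))
```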
## Rollback Instructions
If you need to revert to a previous version:
1. Identify the desired version in `archive/2025-11-06_pre-consolidation/intermediate_versions/`
2. Copy it over the authoritative file in the instances directory:
   ```bash
   cp archive/2025-11-06_pre-consolidation/intermediate_versions/latin_american_institutions_enriched.yaml \
     latin_american_institutions_AUTHORITATIVE.yaml
   ```
3. Regenerate exports if needed (`python scripts/export_latin_american_datasets.py`)
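
Because the copy overwrites the current authoritative file, it can be worth keeping a copy of that file first. The sketch below shows one way to do so. The `rollback` helper and the timestamped `.bak` naming are hypothetical and not part of the project scripts.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional

ARCHIVE_SUBDIR = Path("archive/2025-11-06_pre-consolidation/intermediate_versions")
AUTHORITATIVE = "latin_american_institutions_AUTHORITATIVE.yaml"


def rollback(instances_dir: Path, version_name: str) -> Optional[Path]:
    """Restore an archived version over the authoritative file.

    Returns the path of the backup made of the previous authoritative file,
    or None if there was no authoritative file to back up.
    """
    source = instances_dir / ARCHIVE_SUBDIR / version_name
    target = instances_dir / AUTHORITATIVE
    if not source.is_file():
        raise FileNotFoundError(f"archived version not found: {source}")

    backup = None
    if target.exists():
        # Keep the file being replaced so the rollback itself can be undone.
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
        backup = target.with_name(f"{target.name}.{stamp}.bak")
        shutil.copy2(target, backup)

    shutil.copy2(source, target)
    return backup
```

Called from the instances directory, `rollback(Path("."), "latin_american_institutions_enriched.yaml")` would mirror the `cp` command above while leaving a backup next to the restored file.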
## Next Steps
### Immediate (By 2025-11-13)
1. **National Library Outreach**: Send the 3 drafted emails requesting ISIL codes
2. **Data Quality Review**: Verify fuzzy Wikidata matches (37 matches below 95% confidence)
3. **Geographic Visualization**: Create interactive map from GeoJSON
### Future Enhancements
1. **Web Scraping**: Crawl institutional websites (126 URLs available)
2. **Google Places API**: Enrich 117 non-geocoded institutions
3. **OSM Contribution**: Add missing heritage institutions to OpenStreetMap
4. **Schema Validation**: Run linkml-validate on all 304 records
5. **Relationship Extraction**: Map institutional partnerships and networks
## Contact
**Project**: GLAM Data Extraction
**Schema**: LinkML v0.2.0 (modular)
**Documentation**: `/docs/plan/global_glam/`
**Issues**: See `PROGRESS.md` for known issues and blockers
---
**Archive Date**: 2025-11-06
**Archival Reason**: Consolidation to single authoritative file
**Archived Files**: 12 YAML files, 2 MD reports
**Archive Size**: ~2.5 MB total