# GLAM Data Extraction - Quick Reference Index
**Last Updated**: November 11, 2025

---
## 📂 Directory: data/instances/all/
This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.
### Files in this Directory
| File | Description | Size | Use Case |
|------|-------------|------|----------|
| **globalglam-20251111.yaml** | Master dataset (PRIMARY) | 24MB | All analysis and exports |
| **UNIFIED_OVERVIEW.md** | Complete project documentation | 22KB | Read for comprehensive understanding |
| **FILE_STATUS.md** | File authority reference | 11KB | Determine which files to use |
| **ENRICHMENT_PROGRESS.md** | Wikidata enrichment tracking | 11KB | Track batch-by-batch progress |
| **ENRICHMENT_CANDIDATES.yaml** | Institutions needing enrichment | 2.8MB | Enrichment planning |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics | 3.3KB | Parse for programmatic analysis |
| **README.md** | This file (quick reference) | 10KB | Start here for navigation |
---
## 🚀 Quick Start
### For Humans: Read the Overview
Start with **UNIFIED_OVERVIEW.md** for:
- Executive summary of 13,502 institutions across 18 countries
- Schema architecture (LinkML v0.2.1)
- Geographic coverage by region
- Data quality framework
- Technical implementation details
- Known issues and solutions
### For File Navigation: Check File Status
Read **FILE_STATUS.md** for:
- Which files are authoritative (use these)
- Which files are archived (don't use)
- Relationship between master dataset and enrichment files
- Merge workflow documentation
- Quick reference commands
### For Tracking Progress: Check Enrichment Status
Read **ENRICHMENT_PROGRESS.md** for:
- Current Wikidata coverage by country
- Completed batches (Brazil, Chile)
- Next steps and timelines
- Batch execution workflow
- Quality metrics and accuracy rates
### For Programmatic Access: Parse Statistics
Use **DATASET_STATISTICS.yaml** for:
- Institution counts by country
- Wikidata/geocoding coverage percentages
- Institution type distributions
- Data tier classifications
- Regional totals
---
## 📊 Key Statistics (as of Nov 11, 2025)
```
Global Dataset: 13,502 institutions
Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched: 7,520 (55.7%)
Geocoded: 8,178 (60.6%)
Master File: globalglam-20251111.yaml (24 MB)

Top Countries:
  Japan: 12,065 institutions (89.4% of dataset)
  Netherlands: 622 institutions
  Tunisia: 69 institutions
  Mexico: 57 institutions
  Chile: 56 institutions
```
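The percentages in the block above follow directly from the raw counts. A minimal sketch of that arithmetic, using only the figures quoted there (the `pct` helper and the dictionary keys are illustrative names, not part of the project's scripts):

```python
# Counts quoted in the statistics block above.
TOTAL = 13_502
COUNTS = {
    "wikidata_enriched": 7_520,
    "geocoded": 8_178,
    "japan": 12_065,
}


def pct(count: int, total: int = TOTAL) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in COUNTS.items():
    print(f"{label}: {count:,} ({pct(count)}%)")
```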
---
## 🎯 Current Focus: Enrichment and Merge Workflow
### Tunisia (Enrichment Complete, Pending Merge)
- **Status**: Enhanced file ready (252 KB)
- **Location**: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
- **Action**: Need to merge enriched data back into master dataset
### Georgia (Enrichment Complete, Pending Merge)
- **Status**: Batch 3 complete (22 KB)
- **Location**: `data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml`
- **Action**: Need to merge enriched data back into master dataset
### Latin America (Active Enrichment)
- **Chile**: Batch 20 enriched (53.9% Wikidata coverage)
- **Brazil**: Batch 6 enriched (13.7% Wikidata coverage)
- **Mexico**: Geocoding complete, enrichment starting
---
## 📁 Project File Structure
```
data/instances/
├── all/ ← YOU ARE HERE
│ ├── globalglam-20251111.yaml ← Master dataset (PRIMARY - 13,502 institutions)
│ ├── UNIFIED_OVERVIEW.md ← Complete documentation
│ ├── FILE_STATUS.md ← File authority reference
│ ├── ENRICHMENT_PROGRESS.md ← Wikidata enrichment tracking
│ ├── ENRICHMENT_CANDIDATES.yaml ← Institutions needing enrichment
│ ├── DATASET_STATISTICS.yaml ← Machine-readable metrics
│ └── README.md ← This file
├── brazil/
│ └── brazilian_institutions_batch6_enriched.yaml (current)
├── chile/
│ └── chilean_institutions_batch20_enriched.yaml (current)
├── mexico/
│ └── mexican_institutions_geocoded.yaml (current)
├── tunisia/
│ └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
├── georgia/
│ └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
├── japan/
│ └── jp_institutions_resolved.yaml (12,065 institutions)
├── global/
│ └── global_heritage_institutions_merged.yaml (older version)
└── exports/
    ├── *.jsonld (JSON-LD)
    ├── *.csv (CSV)
    └── *.geojson (GeoJSON)
```
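The layout above can also be walked programmatically. The following is a rough sketch, assuming the script runs from the repository root and that `all/`, `global/`, and `exports/` are the only non-country directories; `country_files` and `NON_COUNTRY_DIRS` are hypothetical names introduced here for illustration:

```python
from pathlib import Path

# Directories under data/instances/ that are not per-country datasets,
# based on the tree shown above (an assumption, adjust if the layout changes).
NON_COUNTRY_DIRS = {"all", "global", "exports"}


def country_files(root: Path) -> dict[str, list[str]]:
    """Map each country directory name to its sorted YAML file names."""
    result = {}
    for child in sorted(root.iterdir()):
        if child.is_dir() and child.name not in NON_COUNTRY_DIRS:
            result[child.name] = sorted(p.name for p in child.glob("*.yaml"))
    return result


if __name__ == "__main__":
    root = Path("data/instances")
    if root.is_dir():
        for country, files in country_files(root).items():
            print(country, files)
```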
---
## 🔗 Related Documentation
### Core Documentation
- **FILE_STATUS.md**: Which files are authoritative (START HERE for file navigation)
- **UNIFIED_OVERVIEW.md**: Complete project documentation
- **ENRICHMENT_PROGRESS.md**: Enrichment tracking
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Project Progress**: `/Users/kempersc/apps/glam/PROGRESS.md`
- **Schema Modules**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`
### Technical Specifications
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **UUID Strategy**: `/Users/kempersc/apps/glam/docs/UUID_STRATEGY.md`
- **GHCID Collisions**: `/Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md`
### Schema Files
- **Main Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **Core Classes**: `/Users/kempersc/apps/glam/schemas/core.yaml`
- **Enumerations**: `/Users/kempersc/apps/glam/schemas/enums.yaml`
- **Provenance**: `/Users/kempersc/apps/glam/schemas/provenance.yaml`
---
## 🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)
19 types with single-letter GHCID codes:
| Code | Type | Example |
|------|------|---------|
| G | GALLERY | Commercial galleries |
| L | LIBRARY | National libraries |
| A | ARCHIVE | National archives |
| M | MUSEUM | Art/history museums |
| O | OFFICIAL_INSTITUTION | Government agencies |
| R | RESEARCH_CENTER | Research institutes |
| C | CORPORATION | Corporate archives |
| U | UNKNOWN | Type undetermined |
| B | BOTANICAL_ZOO | Botanical gardens, zoos |
| E | EDUCATION_PROVIDER | Universities, schools |
| S | COLLECTING_SOCIETY | Historical societies |
| F | FEATURES | Physical landmarks (monuments, statues, memorials) |
| I | INTANGIBLE_HERITAGE_GROUP | Traditional performance groups, oral history |
| X | MIXED | Multiple types |
| P | PERSONAL_COLLECTION | Private collections |
| H | HOLY_SITES | Religious heritage sites |
| D | DIGITAL_PLATFORM | Online archives, digital libraries |
| N | NGO | Heritage advocacy organizations |
| T | TASTE_SMELL | Historic restaurants, perfumeries, distilleries |

**Mnemonic**: **GLAMORCUBESFIXPHDNT**

---
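The taxonomy table can be mirrored as a simple lookup. This is an illustrative sketch only: the codes and type names are copied from the table, while `GHCID_TYPES`, `type_for_code`, and the fallback to `UNKNOWN` are assumptions made here, and the authoritative definitions may differ in the project's schema files.

```python
# Single-letter GHCID codes mapped to institution types, copied from the table.
GHCID_TYPES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "O": "OFFICIAL_INSTITUTION",
    "R": "RESEARCH_CENTER",
    "C": "CORPORATION",
    "U": "UNKNOWN",
    "B": "BOTANICAL_ZOO",
    "E": "EDUCATION_PROVIDER",
    "S": "COLLECTING_SOCIETY",
    "F": "FEATURES",
    "I": "INTANGIBLE_HERITAGE_GROUP",
    "X": "MIXED",
    "P": "PERSONAL_COLLECTION",
    "H": "HOLY_SITES",
    "D": "DIGITAL_PLATFORM",
    "N": "NGO",
    "T": "TASTE_SMELL",
}

# Dicts preserve insertion order, so joining the keys spells the mnemonic.
MNEMONIC = "".join(GHCID_TYPES)


def type_for_code(code: str) -> str:
    """Look up a type by code; unrecognized codes fall back to UNKNOWN (an assumption)."""
    return GHCID_TYPES.get(code.upper(), GHCID_TYPES["U"])


print(MNEMONIC, len(GHCID_TYPES))
```

Keeping the codes in table order means the mnemonic doubles as a quick check that no type was dropped or reordered.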
## 📈 Coverage Goals
### Latin America (Priority)
| Country | Current | Goal | Status |
|---------|---------|------|--------|
| Brazil | 6.1% | 17.4% | 🟡 Batch 6 complete |
| Chile | 6.7% | 22.2% | 🟢 Batch 2 complete |
| Mexico | 0.0% | 17.1% | 🔴 Not started |

**Regional Goal**: 60+ institutions (19% coverage)
### Global
- **Short-term**: 8,000 enriched institutions (31% of the 25,963 records collected before deduplication)
- **Long-term**: 13,000 enriched institutions (50% of the same pre-deduplication total)
---
## 💡 Quick Tips
### For Reading
1. Start with **FILE_STATUS.md** to understand file relationships
2. Read **UNIFIED_OVERVIEW.md** for big picture
3. Consult **ENRICHMENT_PROGRESS.md** for detailed enrichment status
4. Reference **DATASET_STATISTICS.yaml** for metrics
### For Analysis
```python
import yaml

# Load the master dataset (len() assumes the file is a top-level list of records)
with open('data/instances/all/globalglam-20251111.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)
print(f"Total institutions: {len(institutions)}")

# Load the machine-readable statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r', encoding='utf-8') as f:
    stats = yaml.safe_load(f)

# Access Wikidata coverage for one country
japan = stats['datasets']['japan']
japan_coverage = japan['wikidata_coverage']
print(f"Japan: {japan_coverage['count']}/{japan['total_institutions']}")
```
### For Navigation
- **Country datasets**: `data/instances/{country}/`
- **Global merged**: `data/instances/global/`
- **Exports**: `data/instances/exports/`
- **Scripts**: `scripts/enrich_*.py`, `scripts/geocode_*.py`
---
## 🏆 Recent Achievements
### November 11, 2025
- **Master dataset unified**: 13,502 institutions from 18 countries
- **Major deduplication**: 48.0% duplicate rate (12,461 duplicates removed)
- **File status documented**: Created FILE_STATUS.md for clarity
- **Documentation updated**: All metrics reflect actual merged dataset
### November 10-11, 2025
- **Tunisia enrichment**: Enhanced file with Wikidata data (pending merge)
- **Georgia enrichment**: Batch 3 complete (pending merge)
### November 6-9, 2025
- **Chilean enrichment**: Batch 20 complete (53.9% coverage)
- **Brazilian enrichment**: Batch 6 complete (13.7% coverage)
- **Mexican geocoding**: 117 institutions (100% geocoded)
---
## 🚦 Next Steps
### Immediate (High Priority)
1. **Merge Tunisia enrichment** → Update master dataset with enhanced data
2. **Merge Georgia enrichment** → Update master dataset with batch 3 results
3. **Update statistics** → Recalculate coverage after merges
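
The merge and recalculation steps above could look roughly like the sketch below. It is not the project's merge pipeline: it assumes each record is a dict carrying a unique `id` field and that enrichment is marked by a non-empty `wikidata_id` field, both of which are hypothetical names chosen for illustration. The actual record keys are defined in the schema files and FILE_STATUS.md.

```python
def merge_enriched(master: list[dict], enriched: list[dict], key: str = "id") -> list[dict]:
    """Return a new list in which enriched records override master records sharing a key."""
    # Copy each master record so the caller's data is left untouched.
    by_key = {rec[key]: dict(rec) for rec in master}
    for rec in enriched:
        # Shallow merge: enriched fields overwrite same-named master fields;
        # records missing from the master are appended at the end.
        by_key.setdefault(rec[key], {}).update(rec)
    return list(by_key.values())


def wikidata_coverage(records: list[dict], field: str = "wikidata_id") -> float:
    """Percentage of records with a non-empty value in the given field."""
    if not records:
        return 0.0
    enriched = sum(1 for rec in records if rec.get(field))
    return round(100 * enriched / len(records), 1)


# Placeholder records for illustration only, not real dataset entries.
master = [{"id": "rec-1", "name": "Example A"}, {"id": "rec-2", "name": "Example B"}]
enriched = [{"id": "rec-2", "wikidata_id": "Q_PLACEHOLDER"}]

merged = merge_enriched(master, enriched)
print(len(merged), wikidata_coverage(merged))
```

A shallow merge is the simplest choice here; nested fields such as provenance blocks would need a field-by-field strategy, which this sketch does not attempt.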
### Short-Term (This Week)
- Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
- Start Japan enrichment pilot (50 institutions)
- Document merge workflow for reproducibility
### Long-Term (This Month)
- Reach 8,000 Wikidata-enriched institutions (about 59% of the 13,502 in the master dataset)
- Implement automated merge pipeline
- Create export formats (JSON-LD, RDF, CSV)
---
## 📞 Reference Links
- **Wikidata**: https://www.wikidata.org/
- **VIAF**: https://viaf.org/
- **LinkML**: https://linkml.io/
- **Nominatim**: https://nominatim.openstreetmap.org/
---
**Document Version**: 1.0

**Created**: November 9, 2025

**Maintained By**: GLAM Data Extraction Project