# GLAM Data Extraction - Quick Reference Index

**Last Updated**: November 11, 2025

---

## 📂 Directory: data/instances/all/

This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.

### Files in this Directory

| File | Description | Size | Use Case |
|------|-------------|------|----------|
| **globalglam-20251111.yaml** | Master dataset (PRIMARY) | 24MB | All analysis and exports |
| **UNIFIED_OVERVIEW.md** | Complete project documentation | 22KB | Read for comprehensive understanding |
| **FILE_STATUS.md** | File authority reference | 11KB | Determine which files to use |
| **ENRICHMENT_PROGRESS.md** | Wikidata enrichment tracking | 11KB | Track batch-by-batch progress |
| **ENRICHMENT_CANDIDATES.yaml** | Institutions needing enrichment | 2.8MB | Enrichment planning |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics | 3.3KB | Parse for programmatic analysis |
| **README.md** | This file (quick reference) | 5KB | Start here for navigation |

---

## 🚀 Quick Start

### For Humans: Read the Overview

Start with **UNIFIED_OVERVIEW.md** for:
- Executive summary of 13,502 institutions across 18 countries
- Schema architecture (LinkML v0.2.1)
- Geographic coverage by region
- Data quality framework
- Technical implementation details
- Known issues and solutions

### For File Navigation: Check File Status

Read **FILE_STATUS.md** for:
- Which files are authoritative (use these)
- Which files are archived (don't use)
- Relationship between master dataset and enrichment files
- Merge workflow documentation
- Quick reference commands

### For Tracking Progress: Check Enrichment Status

Read **ENRICHMENT_PROGRESS.md** for:
- Current Wikidata coverage by country
- Completed batches (Brazil, Chile)
- Next steps and timelines
- Batch execution workflow
- Quality metrics and accuracy rates

### For Programmatic Access: Parse Statistics

Use **DATASET_STATISTICS.yaml** for:
- Institution counts by country
- Wikidata/geocoding coverage percentages
- Institution type distributions
- Data tier classifications
- Regional totals

---

## 📊 Key Statistics (as of Nov 11, 2025)

```
Global Dataset:       13,502 institutions
Countries:            18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched:    7,520 (55.7%)
Geocoded:             8,178 (60.6%)
Master File:          globalglam-20251111.yaml (24 MB)

Top Countries:
  Japan:      12,065 institutions (89.4% of dataset)
  Netherlands:   622 institutions
  Tunisia:        69 institutions
  Mexico:         57 institutions
  Chile:          56 institutions
```
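
The percentages above follow directly from the counts. As a quick sanity check, the sketch below recomputes them from the figures quoted in this section; the helper name `pct` is illustrative and not part of the project scripts.

```python
TOTAL = 13_502  # institutions in the master dataset (figure quoted above)


def pct(count: int, total: int = TOTAL) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(f"Wikidata enriched: {pct(7_520)}%")   # 55.7%
print(f"Geocoded:          {pct(8_178)}%")   # 60.6%
print(f"Japan share:       {pct(12_065)}%")  # 89.4%
```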

---

## 🎯 Current Focus: Enrichment and Merge Workflow

### Tunisia (Enrichment Complete, Pending Merge)
- **Status**: Enhanced file ready (252 KB)
- **Location**: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
- **Action**: Merge the enriched data back into the master dataset

### Georgia (Enrichment Complete, Pending Merge)
- **Status**: Batch 3 complete (22 KB)
- **Location**: `data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml`
- **Action**: Merge the enriched data back into the master dataset

### Latin America (Active Enrichment)
- **Chile**: Batch 20 enriched (53.9% Wikidata coverage)
- **Brazil**: Batch 6 enriched (13.7% Wikidata coverage)
- **Mexico**: Geocoding complete, enrichment starting
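
The pending merges for Tunisia and Georgia amount to overlaying enriched records onto the matching records in the master dataset. The sketch below shows one possible way to do that. It is not the project's merge workflow (see **FILE_STATUS.md** for that): it assumes each file is a list of records with a unique `id` field, and the function name `merge_enriched`, the field names, and the sample values are all hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def merge_enriched(master: list[Record], enriched: list[Record], key: str = "id") -> list[Record]:
    """Overlay enriched records onto master records that share the same key.

    Enriched fields win on conflict. Master records without an enriched
    counterpart are kept unchanged, and enriched records that are missing
    from the master are appended at the end.
    """
    enriched_by_key = {rec[key]: rec for rec in enriched}
    seen = set()
    merged = []
    for rec in master:
        seen.add(rec[key])
        update = enriched_by_key.get(rec[key])
        merged.append({**rec, **update} if update else dict(rec))
    merged.extend(dict(rec) for rec in enriched if rec[key] not in seen)
    return merged


# Made-up sample records, for illustration only
master = [
    {"id": "inst-1", "name": "Example Museum"},
    {"id": "inst-2", "name": "Example Library"},
]
enriched = [{"id": "inst-1", "name": "Example Museum", "wikidata_id": "Q0"}]

result = merge_enriched(master, enriched)
print(result)
```

Matching on a single identifier keeps the merge idempotent: running it twice with the same enriched file yields the same result.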

---

## 📁 Project File Structure

```
data/instances/
├── all/                          ← YOU ARE HERE
│   ├── globalglam-20251111.yaml  ← Master dataset (PRIMARY - 13,502 institutions)
│   ├── UNIFIED_OVERVIEW.md       ← Complete documentation
│   ├── FILE_STATUS.md            ← File authority reference
│   ├── ENRICHMENT_PROGRESS.md    ← Wikidata enrichment tracking
│   ├── ENRICHMENT_CANDIDATES.yaml ← Institutions needing enrichment
│   ├── DATASET_STATISTICS.yaml   ← Machine-readable metrics
│   └── README.md                 ← This file
│
├── brazil/
│   └── brazilian_institutions_batch6_enriched.yaml (current)
│
├── chile/
│   └── chilean_institutions_batch20_enriched.yaml (current)
│
├── mexico/
│   └── mexican_institutions_geocoded.yaml (current)
│
├── tunisia/
│   └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
│
├── georgia/
│   └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
│
├── japan/
│   └── jp_institutions_resolved.yaml (12,065 institutions)
│
├── global/
│   └── global_heritage_institutions_merged.yaml (older version)
│
└── exports/
    ├── *.jsonld (JSON-LD)
    ├── *.csv (CSV)
    └── *.geojson (GeoJSON)
```

---

## 🔗 Related Documentation

### Core Documentation
- **FILE_STATUS.md**: Which files are authoritative (START HERE for file navigation)
- **UNIFIED_OVERVIEW.md**: Complete project documentation
- **ENRICHMENT_PROGRESS.md**: Enrichment tracking
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Project Progress**: `/Users/kempersc/apps/glam/PROGRESS.md`
- **Schema Modules**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`

### Technical Specifications
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **UUID Strategy**: `/Users/kempersc/apps/glam/docs/UUID_STRATEGY.md`
- **GHCID Collisions**: `/Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md`

### Schema Files
- **Main Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **Core Classes**: `/Users/kempersc/apps/glam/schemas/core.yaml`
- **Enumerations**: `/Users/kempersc/apps/glam/schemas/enums.yaml`
- **Provenance**: `/Users/kempersc/apps/glam/schemas/provenance.yaml`

---

## 🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)

19 types with single-letter GHCID codes:

| Code | Type | Example |
|------|------|---------|
| G | GALLERY | Commercial galleries |
| L | LIBRARY | National libraries |
| A | ARCHIVE | National archives |
| M | MUSEUM | Art/history museums |
| O | OFFICIAL_INSTITUTION | Government agencies |
| R | RESEARCH_CENTER | Research institutes |
| C | CORPORATION | Corporate archives |
| U | UNKNOWN | Type undetermined |
| B | BOTANICAL_ZOO | Botanical gardens, zoos |
| E | EDUCATION_PROVIDER | Universities, schools |
| S | COLLECTING_SOCIETY | Historical societies |
| F | FEATURES | Physical landmarks (monuments, statues, memorials) |
| I | INTANGIBLE_HERITAGE_GROUP | Traditional performance groups, oral history |
| X | MIXED | Multiple types |
| P | PERSONAL_COLLECTION | Private collections |
| H | HOLY_SITES | Religious heritage sites |
| D | DIGITAL_PLATFORM | Online archives, digital libraries |
| N | NGO | Heritage advocacy organizations |
| T | TASTE_SMELL | Historic restaurants, perfumeries, distilleries |

**Mnemonic**: **GLAMORCUBESFIXPHDNT**
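
For scripts that need to decode the type letter, the table above can be expressed as a plain lookup. This is a minimal sketch: the names `GHCID_TYPE_CODES` and `type_for_code` are illustrative and are not taken from the schema files.

```python
# Single-letter codes and type names, copied from the taxonomy table above
GHCID_TYPE_CODES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "O": "OFFICIAL_INSTITUTION",
    "R": "RESEARCH_CENTER",
    "C": "CORPORATION",
    "U": "UNKNOWN",
    "B": "BOTANICAL_ZOO",
    "E": "EDUCATION_PROVIDER",
    "S": "COLLECTING_SOCIETY",
    "F": "FEATURES",
    "I": "INTANGIBLE_HERITAGE_GROUP",
    "X": "MIXED",
    "P": "PERSONAL_COLLECTION",
    "H": "HOLY_SITES",
    "D": "DIGITAL_PLATFORM",
    "N": "NGO",
    "T": "TASTE_SMELL",
}

MNEMONIC = "GLAMORCUBESFIXPHDNT"


def type_for_code(code: str) -> str:
    """Return the institution type for a single-letter code (case-insensitive)."""
    return GHCID_TYPE_CODES[code.upper()]


# The dictionary keys, in order, spell out the mnemonic
assert "".join(GHCID_TYPE_CODES) == MNEMONIC

print(type_for_code("m"))  # MUSEUM
```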

---

## 📈 Coverage Goals

### Latin America (Priority)
| Country | Current | Goal | Status |
|---------|---------|------|--------|
| Brazil | 6.1% | 17.4% | 🟡 Batch 6 complete |
| Chile | 6.7% | 22.2% | 🟢 Batch 2 complete |
| Mexico | 0.0% | 17.1% | 🔴 Not started |

**Regional Goal**: 60+ institutions (19% coverage)

### Global
- **Short-term**: 8,000 Wikidata-enriched institutions (about 59% of the 13,502-institution master dataset)
- **Long-term**: 13,000 Wikidata-enriched institutions (about 96% of the master dataset)

---

## 💡 Quick Tips

### For Reading
1. Start with **FILE_STATUS.md** to understand file relationships
2. Read **UNIFIED_OVERVIEW.md** for big picture
3. Consult **ENRICHMENT_PROGRESS.md** for detailed enrichment status
4. Reference **DATASET_STATISTICS.yaml** for metrics

### For Analysis
```python
import yaml  # PyYAML

# Paths are relative to the repository root.
# Load the master dataset (assumed to be a top-level YAML list of institution records)
with open('data/instances/all/globalglam-20251111.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")

# Load the machine-readable statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r', encoding='utf-8') as f:
    stats = yaml.safe_load(f)

# Report Wikidata coverage for one country
japan = stats['datasets']['japan']
print(f"Japan: {japan['wikidata_coverage']['count']}/{japan['total_institutions']}")
```

### For Navigation
- **Country datasets**: `data/instances/{country}/`
- **Global merged**: `data/instances/global/`
- **Exports**: `data/instances/exports/`
- **Scripts**: `scripts/enrich_*.py`, `scripts/geocode_*.py`
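
To see which YAML files each directory currently holds, a small directory walk is enough. The sketch below assumes it is run from the repository root; the function name `list_country_files` is hypothetical. Note that it also lists non-country directories such as `all/` and `global/`.

```python
from __future__ import annotations

from pathlib import Path


def list_country_files(root: str | Path = "data/instances") -> dict[str, list[str]]:
    """Map each subdirectory of root to the sorted names of its YAML files."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to list, for example when run outside the repository
        return {}
    return {
        d.name: sorted(p.name for p in d.glob("*.yaml"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


for directory, files in list_country_files().items():
    print(f"{directory}/: {', '.join(files) or '(no YAML files)'}")
```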

---

## 🏆 Recent Achievements

### November 11, 2025
- ✅ **Master dataset unified**: 13,502 institutions from 18 countries
- ✅ **Major deduplication**: 48.0% duplicate rate (12,461 duplicates removed)
- ✅ **File status documented**: Created FILE_STATUS.md for clarity
- ✅ **Documentation updated**: All metrics reflect actual merged dataset
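
The deduplication figures can be cross-checked with simple arithmetic. The sketch below assumes the duplicate rate is defined as duplicates removed divided by the record count before deduplication. That definition is not stated here, but it is consistent with the quoted numbers.

```python
kept = 13_502     # institutions in the unified master dataset
removed = 12_461  # duplicates removed during the merge

before = kept + removed                          # records before deduplication
duplicate_rate = round(100 * removed / before, 1)

print(f"Records before deduplication: {before:,}")  # 25,963
print(f"Duplicate rate: {duplicate_rate}%")          # 48.0%
```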

### November 10-11, 2025
- ✅ **Tunisia enrichment**: Enhanced file with Wikidata data (pending merge)
- ✅ **Georgia enrichment**: Batch 3 complete (pending merge)

### November 6-9, 2025
- ✅ **Chilean enrichment**: Batch 20 complete (53.9% coverage)
- ✅ **Brazilian enrichment**: Batch 6 complete (13.7% coverage)
- ✅ **Mexican geocoding**: 117 institutions (100% geocoded)

---

## 🚦 Next Steps

### Immediate (High Priority)
1. **Merge Tunisia enrichment** → Update master dataset with enhanced data
2. **Merge Georgia enrichment** → Update master dataset with batch 3 results
3. **Update statistics** → Recalculate coverage after merges

### Short-Term (This Week)
- Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
- Start Japan enrichment pilot (50 institutions)
- Document merge workflow for reproducibility

### Long-Term (This Month)
- Reach 8,000 Wikidata enriched institutions (60% coverage)
- Implement automated merge pipeline
- Create export formats (JSON-LD, RDF, CSV)

---

## 📞 Reference Links

- **Wikidata**: https://www.wikidata.org/
- **VIAF**: https://viaf.org/
- **LinkML**: https://linkml.io/
- **Nominatim**: https://nominatim.openstreetmap.org/

---

**Document Version**: 1.0
**Created**: November 9, 2025
**Maintained By**: GLAM Data Extraction Project