# GLAM Data Extraction - Quick Reference Index

**Last Updated**: November 11, 2025

---

## 📂 Directory: data/instances/all/

This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.

### Files in this Directory

| File | Description | Size | Use Case |
|------|-------------|------|----------|
| **globalglam-20251111.yaml** | Master dataset (PRIMARY) | 24MB | All analysis and exports |
| **UNIFIED_OVERVIEW.md** | Complete project documentation | 22KB | Read for comprehensive understanding |
| **FILE_STATUS.md** | File authority reference | 11KB | Determine which files to use |
| **ENRICHMENT_PROGRESS.md** | Wikidata enrichment tracking | 11KB | Track batch-by-batch progress |
| **ENRICHMENT_CANDIDATES.yaml** | Institutions needing enrichment | 2.8MB | Enrichment planning |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics | 3.3KB | Parse for programmatic analysis |
| **README.md** | This file (quick reference) | 5KB | Start here for navigation |

---

## 🚀 Quick Start

### For Humans: Read the Overview

Start with **UNIFIED_OVERVIEW.md** for:
- Executive summary of 13,502 institutions across 18 countries
- Schema architecture (LinkML v0.2.1)
- Geographic coverage by region
- Data quality framework
- Technical implementation details
- Known issues and solutions

### For File Navigation: Check File Status

Read **FILE_STATUS.md** for:
- Which files are authoritative (use these)
- Which files are archived (don't use)
- Relationship between master dataset and enrichment files
- Merge workflow documentation
- Quick reference commands

### For Tracking Progress: Check Enrichment Status

Read **ENRICHMENT_PROGRESS.md** for:
- Current Wikidata coverage by country
- Completed batches (Brazil, Chile)
- Next steps and timelines
- Batch execution workflow
- Quality metrics and accuracy rates

### For Programmatic Access: Parse Statistics

Use **DATASET_STATISTICS.yaml** for:
- Institution counts by country
- Wikidata/geocoding coverage percentages
- Institution type distributions
- Data tier classifications
- Regional totals

---

## 📊 Key Statistics (as of Nov 11, 2025)

```
Global Dataset:       13,502 institutions
Countries:            18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched:    7,520 (55.7%)
Geocoded:             8,178 (60.6%)
Master File:          globalglam-20251111.yaml (24 MB)

Top Countries:
  Japan:        12,065 institutions (89.4% of dataset)
  Netherlands:  622 institutions
  Tunisia:      69 institutions
  Mexico:       57 institutions
  Chile:        56 institutions
```

---

## 🎯 Current Focus: Enrichment and Merge Workflow

### Tunisia (Enrichment Complete, Pending Merge)
- **Status**: Enhanced file ready (252 KB)
- **Location**: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
- **Action**: Need to merge enriched data back into master dataset

### Georgia (Enrichment Complete, Pending Merge)
- **Status**: Batch 3 complete (22 KB)
- **Location**: `data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml`
- **Action**: Need to merge enriched data back into master dataset

### Latin America (Active Enrichment)
- **Chile**: Batch 20 enriched (53.9% Wikidata coverage)
- **Brazil**: Batch 6 enriched (13.7% Wikidata coverage)
- **Mexico**: Geocoding complete, enrichment starting

---

## 📁 Project File Structure

```
data/instances/
├── all/                                    ← YOU ARE HERE
│   ├── globalglam-20251111.yaml            ← Master dataset (PRIMARY - 13,502 institutions)
│   ├── UNIFIED_OVERVIEW.md                 ← Complete documentation
│   ├── FILE_STATUS.md                      ← File authority reference
│   ├── ENRICHMENT_PROGRESS.md              ← Wikidata enrichment tracking
│   ├── ENRICHMENT_CANDIDATES.yaml          ← Institutions needing enrichment
│   ├── DATASET_STATISTICS.yaml             ← Machine-readable metrics
│   └── README.md                           ← This file
│
├── brazil/
│   └── brazilian_institutions_batch6_enriched.yaml (current)
│
├── chile/
│   └── chilean_institutions_batch20_enriched.yaml (current)
│
├── mexico/
│   └── mexican_institutions_geocoded.yaml (current)
│
├── tunisia/
│   └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
│
├── georgia/
│   └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
│
├── japan/
│   └── jp_institutions_resolved.yaml (12,065 institutions)
│
├── global/
│   └── global_heritage_institutions_merged.yaml (older version)
│
└── exports/
    ├── *.jsonld (JSON-LD)
    ├── *.csv (CSV)
    └── *.geojson (GeoJSON)
```

---

## 🔗 Related Documentation

### Core Documentation
- **FILE_STATUS.md**: Which files are authoritative (START HERE for file navigation)
- **UNIFIED_OVERVIEW.md**: Complete project documentation
- **ENRICHMENT_PROGRESS.md**: Enrichment tracking
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Project Progress**: `/Users/kempersc/apps/glam/PROGRESS.md`
- **Schema Modules**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`

### Technical Specifications
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **UUID Strategy**: `/Users/kempersc/apps/glam/docs/UUID_STRATEGY.md`
- **GHCID Collisions**: `/Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md`

### Schema Files
- **Main Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **Core Classes**: `/Users/kempersc/apps/glam/schemas/core.yaml`
- **Enumerations**: `/Users/kempersc/apps/glam/schemas/enums.yaml`
- **Provenance**: `/Users/kempersc/apps/glam/schemas/provenance.yaml`

---

## 🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)

19 types with single-letter GHCID codes:

| Code | Type | Example |
|------|------|---------|
| G | GALLERY | Commercial galleries |
| L | LIBRARY | National libraries |
| A | ARCHIVE | National archives |
| M | MUSEUM | Art/history museums |
| O | OFFICIAL_INSTITUTION | Government agencies |
| R | RESEARCH_CENTER | Research institutes |
| C | CORPORATION | Corporate archives |
| U | UNKNOWN | Type undetermined |
| B | BOTANICAL_ZOO | Botanical gardens, zoos |
| E | EDUCATION_PROVIDER | Universities, schools |
| S | COLLECTING_SOCIETY | Historical societies |
| F | FEATURES | Physical landmarks (monuments, statues, memorials) |
| I | INTANGIBLE_HERITAGE_GROUP | Traditional performance groups, oral history |
| X | MIXED | Multiple types |
| P | PERSONAL_COLLECTION | Private collections |
| H | HOLY_SITES | Religious heritage sites |
| D | DIGITAL_PLATFORM | Online archives, digital libraries |
| N | NGO | Heritage advocacy organizations |
| T | TASTE_SMELL | Historic restaurants, perfumeries, distilleries |

**Mnemonic**: **GLAMORCUBESFIXPHDNT**

---

## 📈 Coverage Goals

### Latin America (Priority)

| Country | Current | Goal | Status |
|---------|---------|------|--------|
| Brazil | 6.1% | 17.4% | 🟡 Batch 6 complete |
| Chile | 6.7% | 22.2% | 🟢 Batch 2 complete |
| Mexico | 0.0% | 17.1% | 🔴 Not started |

**Regional Goal**: 60+ institutions (19% coverage)

### Global
- **Short-term**: 8,000 institutions (31% coverage)
- **Long-term**: 13,000 institutions (50% coverage)

---

## 💡 Quick Tips

### For Reading
1. Start with **FILE_STATUS.md** to understand file relationships
2. Read **UNIFIED_OVERVIEW.md** for big picture
3. Consult **ENRICHMENT_PROGRESS.md** for detailed enrichment status
4. Reference **DATASET_STATISTICS.yaml** for metrics

### For Analysis

```python
import yaml  # PyYAML

# Load master dataset (UTF-8: institution names include non-ASCII text)
with open('data/instances/all/globalglam-20251111.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")

# Load statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r', encoding='utf-8') as f:
    stats = yaml.safe_load(f)

# Access coverage by country
japan = stats['datasets']['japan']
japan_coverage = japan['wikidata_coverage']
print(f"Japan: {japan_coverage['count']}/{japan['total_institutions']}")
```

### For Navigation
- **Country datasets**: `data/instances/{country}/`
- **Global merged**: `data/instances/global/`
- **Exports**: `data/instances/exports/`
- **Scripts**: `scripts/enrich_*.py`, `scripts/geocode_*.py`

---

## 🏆 Recent Achievements

### November 11, 2025
- ✅ **Master dataset unified**: 13,502 institutions from 18 countries
- ✅ **Major deduplication**: 48.0% duplicate rate (12,461 duplicates removed)
- ✅ **File status documented**: Created FILE_STATUS.md for clarity
- ✅ **Documentation updated**: All metrics reflect actual merged dataset

### November 10-11, 2025
- ✅ **Tunisia enrichment**: Enhanced file with Wikidata data (pending merge)
- ✅ **Georgia enrichment**: Batch 3 complete (pending merge)

### November 6-9, 2025
- ✅ **Chilean enrichment**: Batch 20 complete (53.9% coverage)
- ✅ **Brazilian enrichment**: Batch 6 complete (13.7% coverage)
- ✅ **Mexican geocoding**: 117 institutions (100% geocoded)

---

## 🚦 Next Steps

### Immediate (High Priority)
1. **Merge Tunisia enrichment** → Update master dataset with enhanced data
2. **Merge Georgia enrichment** → Update master dataset with batch 3 results
3. **Update statistics** → Recalculate coverage after merges

### Short-Term (This Week)
- Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
- Start Japan enrichment pilot (50 institutions)
- Document merge workflow for reproducibility

### Long-Term (This Month)
- Reach 8,000 Wikidata enriched institutions (60% coverage)
- Implement automated merge pipeline
- Create export formats (JSON-LD, RDF, CSV)

---

## 📞 Reference Links

- **Wikidata**: https://www.wikidata.org/
- **VIAF**: https://viaf.org/
- **LinkML**: https://linkml.io/
- **Nominatim**: https://nominatim.openstreetmap.org/

---

**Document Version**: 1.0

**Created**: November 9, 2025

**Maintained By**: GLAM Data Extraction Project
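
---

## 🧩 Appendix: Merge and Coverage Sketch

The immediate next steps (merging the Tunisia and Georgia enrichment files into the master dataset, then recalculating coverage) might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual merge script: it assumes each institution is a mapping with a unique `id` field and an optional `identifiers` list whose entries carry `identifier_scheme` and `identifier_value`. Those field names, and the helper names `merge_enriched`, `has_wikidata`, and `wikidata_coverage`, are hypothetical; check them against the LinkML schema and FILE_STATUS.md before adapting this.

```python
from typing import Any

Record = dict[str, Any]


def merge_enriched(master: list[Record], enriched: list[Record], key: str = "id") -> list[Record]:
    """Return a new list in which enriched records update master records with the same key.

    Assumption: `key` uniquely identifies an institution in both files.
    The merge is shallow: a top-level field present in the enriched record
    replaces the master's value for that field (including whole lists).
    """
    enriched_by_key = {rec[key]: rec for rec in enriched if key in rec}
    merged: list[Record] = []
    matched = set()
    for rec in master:
        update = enriched_by_key.get(rec.get(key))
        if update is not None:
            merged.append({**rec, **update})
            matched.add(rec[key])
        else:
            merged.append(dict(rec))
    # Enriched records with no counterpart in the master are appended, not dropped.
    for rec_key, rec in enriched_by_key.items():
        if rec_key not in matched:
            merged.append(dict(rec))
    return merged


def has_wikidata(rec: Record) -> bool:
    """True if the record lists at least one identifier with scheme 'Wikidata' (assumed layout)."""
    return any(
        ident.get("identifier_scheme") == "Wikidata"
        for ident in (rec.get("identifiers") or [])
    )


def wikidata_coverage(records: list[Record]) -> dict[str, Any]:
    """Count Wikidata-enriched records and express the share as a percentage."""
    total = len(records)
    count = sum(1 for rec in records if has_wikidata(rec))
    percentage = round(100 * count / total, 1) if total else 0.0
    return {"count": count, "total": total, "percentage": percentage}


if __name__ == "__main__":
    # Tiny in-memory example with placeholder values; real data would come from
    # yaml.safe_load() on the master and country files.
    master = [{"id": "a", "name": "A"}, {"id": "b", "name": "B"}]
    enriched = [
        {"id": "b", "identifiers": [{"identifier_scheme": "Wikidata", "identifier_value": "Q0"}]},
        {"id": "c", "name": "C"},
    ]
    merged = merge_enriched(master, enriched)
    print(wikidata_coverage(merged))
```

Returning a new list instead of mutating the master in place keeps the original dataset available for a before/after comparison of coverage. If enriched files add identifiers that should be combined with existing ones, the shallow merge would need to be replaced with a field-aware merge.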