GLAM Data Extraction - Quick Reference Index
Last Updated: November 11, 2025
📂 Directory: data/instances/all/
This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.
Files in this Directory
| File | Description | Size | Use Case |
|---|---|---|---|
| globalglam-20251111.yaml | Master dataset (PRIMARY) | 24MB | All analysis and exports |
| UNIFIED_OVERVIEW.md | Complete project documentation | 22KB | Read for comprehensive understanding |
| FILE_STATUS.md | File authority reference | 11KB | Determine which files to use |
| ENRICHMENT_PROGRESS.md | Wikidata enrichment tracking | 11KB | Track batch-by-batch progress |
| ENRICHMENT_CANDIDATES.yaml | Institutions needing enrichment | 2.8MB | Enrichment planning |
| DATASET_STATISTICS.yaml | Machine-readable metrics | 3.3KB | Parse for programmatic analysis |
| README.md | This file (quick reference) | 5KB | Start here for navigation |
🚀 Quick Start
For Humans: Read the Overview
Start with UNIFIED_OVERVIEW.md for:
- Executive summary of 13,502 institutions across 18 countries
- Schema architecture (LinkML v0.2.1)
- Geographic coverage by region
- Data quality framework
- Technical implementation details
- Known issues and solutions
For File Navigation: Check File Status
Read FILE_STATUS.md for:
- Which files are authoritative (use these)
- Which files are archived (don't use)
- Relationship between master dataset and enrichment files
- Merge workflow documentation
- Quick reference commands
For Tracking Progress: Check Enrichment Status
Read ENRICHMENT_PROGRESS.md for:
- Current Wikidata coverage by country
- Completed batches (Brazil, Chile)
- Next steps and timelines
- Batch execution workflow
- Quality metrics and accuracy rates
For Programmatic Access: Parse Statistics
Use DATASET_STATISTICS.yaml for:
- Institution counts by country
- Wikidata/geocoding coverage percentages
- Institution type distributions
- Data tier classifications
- Regional totals
📊 Key Statistics (as of Nov 11, 2025)
Global Dataset: 13,502 institutions
Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched: 7,520 (55.7%)
Geocoded: 8,178 (60.6%)
Master File: globalglam-20251111.yaml (24 MB)
Top Countries:
Japan: 12,065 institutions (89.4% of dataset)
Netherlands: 622 institutions
Tunisia: 69 institutions
Mexico: 57 institutions
Chile: 56 institutions
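The percentages above can be reproduced from the raw counts. The snippet below is a minimal sketch that only uses the figures quoted in this section; the `percent` helper is a hypothetical name, not part of the project's scripts.

```python
# Counts copied from the Key Statistics section of this README.
TOTAL_INSTITUTIONS = 13_502
WIKIDATA_ENRICHED = 7_520
GEOCODED = 8_178
JAPAN = 12_065


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(WIKIDATA_ENRICHED, TOTAL_INSTITUTIONS))  # 55.7
print(percent(GEOCODED, TOTAL_INSTITUTIONS))           # 60.6
print(percent(JAPAN, TOTAL_INSTITUTIONS))              # 89.4
```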
🎯 Current Focus: Enrichment and Merge Workflow
Tunisia (Enrichment Complete, Pending Merge)
- Status: Enhanced file ready (252 KB)
- Location: data/instances/tunisia/tunisian_institutions_enhanced.yaml
- Action: Need to merge enriched data back into master dataset
Georgia (Enrichment Complete, Pending Merge)
- Status: Batch 3 complete (22 KB)
- Location: data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml
- Action: Need to merge enriched data back into master dataset
Latin America (Active Enrichment)
- Chile: Batch 20 enriched (53.9% Wikidata coverage)
- Brazil: Batch 6 enriched (13.7% Wikidata coverage)
- Mexico: Geocoding complete, enrichment starting
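The pending Tunisia and Georgia merges could look roughly like the sketch below. It is not the project's merge pipeline: it assumes that both files load as lists of dictionaries and that every record carries a unique `id` field. The `merge_enriched` function, the `id` and `wikidata_id` keys, and the sample values are all hypothetical placeholders.

```python
from typing import Any

Record = dict[str, Any]


def merge_enriched(master: list[Record], enriched: list[Record], key: str = "id") -> list[Record]:
    """Return a new list in which enriched records update master records with the same key.

    Enriched records that have no match in the master list are appended.
    """
    merged = {rec[key]: dict(rec) for rec in master}
    for rec in enriched:
        if rec[key] in merged:
            # Enriched values win; fields missing from the enriched record are kept.
            merged[rec[key]].update(rec)
        else:
            merged[rec[key]] = dict(rec)
    return list(merged.values())


# Placeholder records (assumed shape, not real data from the dataset).
master = [
    {"id": "TN-001", "name": "Example Library"},
    {"id": "TN-002", "name": "Example Museum"},
]
enriched = [{"id": "TN-001", "wikidata_id": "Q0"}]

result = merge_enriched(master, enriched)
print(result[0])
```

Note that `dict.update` is a shallow merge: a nested field present in the enriched record replaces the master's nested field wholesale, which may or may not be what the real workflow needs.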
📁 Project File Structure
data/instances/
├── all/ ← YOU ARE HERE
│ ├── globalglam-20251111.yaml ← Master dataset (PRIMARY - 13,502 institutions)
│ ├── UNIFIED_OVERVIEW.md ← Complete documentation
│ ├── FILE_STATUS.md ← File authority reference
│ ├── ENRICHMENT_PROGRESS.md ← Wikidata enrichment tracking
│ ├── ENRICHMENT_CANDIDATES.yaml ← Institutions needing enrichment
│ ├── DATASET_STATISTICS.yaml ← Machine-readable metrics
│ └── README.md ← This file
│
├── brazil/
│ └── brazilian_institutions_batch6_enriched.yaml (current)
│
├── chile/
│ └── chilean_institutions_batch20_enriched.yaml (current)
│
├── mexico/
│ └── mexican_institutions_geocoded.yaml (current)
│
├── tunisia/
│ └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
│
├── georgia/
│ └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
│
├── japan/
│ └── jp_institutions_resolved.yaml (12,065 institutions)
│
├── global/
│ └── global_heritage_institutions_merged.yaml (older version)
│
└── exports/
├── *.jsonld (JSON-LD)
├── *.csv (CSV)
└── *.geojson (GeoJSON)
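As an illustration of how a file in `exports/` might be produced, the sketch below converts geocoded records into a GeoJSON FeatureCollection using only the standard library. The record keys `name`, `latitude`, and `longitude` are assumptions for this example; the real field names are defined by the project schema and may differ.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection from records that carry coordinates."""
    features = []
    for rec in records:
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if lat is None or lon is None:
            continue  # skip records that have not been geocoded
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": rec.get("name")},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder records (assumed shape, not real data from the dataset).
sample = [
    {"name": "Example Museum", "latitude": 35.0, "longitude": 139.0},
    {"name": "Not yet geocoded"},
]
collection = to_geojson(sample)
print(json.dumps(collection, indent=2))
```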
🔗 Related Documentation
Core Documentation
- FILE_STATUS.md: Which files are authoritative (START HERE for file navigation)
- UNIFIED_OVERVIEW.md: Complete project documentation
- ENRICHMENT_PROGRESS.md: Enrichment tracking
- Agent Instructions: /Users/kempersc/apps/glam/AGENTS.md
- Project Progress: /Users/kempersc/apps/glam/PROGRESS.md
- Schema Modules: /Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md
Technical Specifications
- Persistent Identifiers: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
- UUID Strategy: /Users/kempersc/apps/glam/docs/UUID_STRATEGY.md
- GHCID Collisions: /Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md
Schema Files
- Main Schema: /Users/kempersc/apps/glam/schemas/heritage_custodian.yaml
- Core Classes: /Users/kempersc/apps/glam/schemas/core.yaml
- Enumerations: /Users/kempersc/apps/glam/schemas/enums.yaml
- Provenance: /Users/kempersc/apps/glam/schemas/provenance.yaml
🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)
19 types with single-letter GHCID codes:
| Code | Type | Example |
|---|---|---|
| G | GALLERY | Commercial galleries |
| L | LIBRARY | National libraries |
| A | ARCHIVE | National archives |
| M | MUSEUM | Art/history museums |
| O | OFFICIAL_INSTITUTION | Government agencies |
| R | RESEARCH_CENTER | Research institutes |
| C | CORPORATION | Corporate archives |
| U | UNKNOWN | Type undetermined |
| B | BOTANICAL_ZOO | Botanical gardens, zoos |
| E | EDUCATION_PROVIDER | Universities, schools |
| S | COLLECTING_SOCIETY | Historical societies |
| F | FEATURES | Physical landmarks (monuments, statues, memorials) |
| I | INTANGIBLE_HERITAGE_GROUP | Traditional performance groups, oral history |
| X | MIXED | Multiple types |
| P | PERSONAL_COLLECTION | Private collections |
| H | HOLY_SITES | Religious heritage sites |
| D | DIGITAL_PLATFORM | Online archives, digital libraries |
| N | NGO | Heritage advocacy organizations |
| T | TASTE_SMELL | Historic restaurants, perfumeries, distilleries |
Mnemonic: GLAMORCUBESFIXPHDNT
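For scripts that need to translate between codes and type names, the table above can be expressed as a plain mapping. This is a minimal sketch: the codes and type names come from the table, while `INSTITUTION_TYPES` and `type_for_code` are hypothetical names and the authoritative definition remains the enumerations in the schema files.

```python
# Single-letter GHCID codes mapped to type names, in mnemonic order.
INSTITUTION_TYPES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "O": "OFFICIAL_INSTITUTION",
    "R": "RESEARCH_CENTER",
    "C": "CORPORATION",
    "U": "UNKNOWN",
    "B": "BOTANICAL_ZOO",
    "E": "EDUCATION_PROVIDER",
    "S": "COLLECTING_SOCIETY",
    "F": "FEATURES",
    "I": "INTANGIBLE_HERITAGE_GROUP",
    "X": "MIXED",
    "P": "PERSONAL_COLLECTION",
    "H": "HOLY_SITES",
    "D": "DIGITAL_PLATFORM",
    "N": "NGO",
    "T": "TASTE_SMELL",
}

MNEMONIC = "GLAMORCUBESFIXPHDNT"

# The dictionary keys, read in insertion order, spell the mnemonic.
assert "".join(INSTITUTION_TYPES) == MNEMONIC


def type_for_code(code: str) -> str:
    """Return the type name for a single-letter code (case-insensitive)."""
    try:
        return INSTITUTION_TYPES[code.upper()]
    except KeyError:
        raise ValueError(f"Unknown institution type code: {code!r}") from None


print(type_for_code("m"))  # MUSEUM
```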
📈 Coverage Goals
Latin America (Priority)
| Country | Current | Goal | Status |
|---|---|---|---|
| Brazil | 6.1% | 17.4% | 🟡 Batch 6 complete |
| Chile | 6.7% | 22.2% | 🟢 Batch 2 complete |
| Mexico | 0.0% | 17.1% | 🔴 Not started |
Regional Goal: 60+ institutions (19% coverage)
Global
- Short-term: 8,000 institutions (31% coverage)
- Long-term: 13,000 institutions (50% coverage)
💡 Quick Tips
For Reading
- Start with FILE_STATUS.md to understand file relationships
- Read UNIFIED_OVERVIEW.md for big picture
- Consult ENRICHMENT_PROGRESS.md for detailed enrichment status
- Reference DATASET_STATISTICS.yaml for metrics
For Analysis
import yaml

# Load master dataset
with open('data/instances/all/globalglam-20251111.yaml', 'r') as f:
    institutions = yaml.safe_load(f)
print(f"Total institutions: {len(institutions)}")

# Load statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r') as f:
    stats = yaml.safe_load(f)

# Access coverage by country
japan_coverage = stats['datasets']['japan']['wikidata_coverage']
print(f"Japan: {japan_coverage['count']}/{stats['datasets']['japan']['total_institutions']}")
For Navigation
- Country datasets: data/instances/{country}/
- Global merged: data/instances/global/
- Exports: data/instances/exports/
- Scripts: scripts/enrich_*.py, scripts/geocode_*.py
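To see which YAML files each country directory currently holds, the layout above can be walked with `pathlib`. This is a small sketch that assumes it is run from the repository root; `list_country_files` is a hypothetical helper, and the directories it skips are the non-country folders named in this README.

```python
from pathlib import Path


def list_country_files(root="data/instances", skip=("all", "global", "exports")):
    """Map each country directory name to the sorted YAML file names inside it."""
    base = Path(root)
    result = {}
    if not base.is_dir():
        return result  # nothing to list if the path does not exist
    for country_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        if country_dir.name in skip:
            continue
        result[country_dir.name] = sorted(f.name for f in country_dir.glob("*.yaml"))
    return result


for country, files in list_country_files().items():
    print(f"{country}: {', '.join(files) or '(no YAML files)'}")
```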
🏆 Recent Achievements
November 11, 2025
- ✅ Master dataset unified: 13,502 institutions from 18 countries
- ✅ Major deduplication: 48.0% duplicate rate (12,461 duplicates removed)
- ✅ File status documented: Created FILE_STATUS.md for clarity
- ✅ Documentation updated: All metrics reflect actual merged dataset
November 10-11, 2025
- ✅ Tunisia enrichment: Enhanced file with Wikidata data (pending merge)
- ✅ Georgia enrichment: Batch 3 complete (pending merge)
November 6-9, 2025
- ✅ Chilean enrichment: Batch 20 complete (53.9% coverage)
- ✅ Brazilian enrichment: Batch 6 complete (13.7% coverage)
- ✅ Mexican geocoding: 117 institutions (100% geocoded)
🚦 Next Steps
Immediate (High Priority)
- Merge Tunisia enrichment → Update master dataset with enhanced data
- Merge Georgia enrichment → Update master dataset with batch 3 results
- Update statistics → Recalculate coverage after merges
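The "Update statistics" step could be sketched as a per-country recount over the merged records. The field names `country` and `wikidata_id` are assumptions made for this example, as is the `wikidata_coverage` function; adapt them to the actual schema before use.

```python
from collections import Counter


def wikidata_coverage(records, country_key="country", wikidata_key="wikidata_id"):
    """Return total, enriched count, and coverage percentage per country."""
    totals, enriched = Counter(), Counter()
    for rec in records:
        country = rec.get(country_key, "unknown")
        totals[country] += 1
        if rec.get(wikidata_key):  # counts only non-empty identifiers
            enriched[country] += 1
    return {
        country: {
            "total": totals[country],
            "enriched": enriched[country],
            "percent": round(100 * enriched[country] / totals[country], 1),
        }
        for country in totals
    }


# Placeholder records (assumed shape, not real data from the dataset).
records = [
    {"country": "CL", "wikidata_id": "Q0"},
    {"country": "CL"},
    {"country": "BR", "wikidata_id": None},
]
coverage = wikidata_coverage(records)
print(coverage)
```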
Short-Term (This Week)
- Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
- Start Japan enrichment pilot (50 institutions)
- Document merge workflow for reproducibility
Long-Term (This Month)
- Reach 8,000 Wikidata enriched institutions (60% coverage)
- Implement automated merge pipeline
- Create export formats (JSON-LD, RDF, CSV)
📞 Reference Links
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- LinkML: https://linkml.io/
- Nominatim: https://nominatim.openstreetmap.org/
Document Version: 1.0
Created: November 9, 2025
Maintained By: GLAM Data Extraction Project