# GLAM Data Extraction - Quick Reference Index

**Last Updated**: November 11, 2025

---

## 📂 Directory: data/instances/all/

This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.

### Files in this Directory

| File | Description | Size | Use Case |
|------|-------------|------|----------|
| **globalglam-20251111.yaml** | Master dataset (PRIMARY) | 24 MB | All analysis and exports |
| **UNIFIED_OVERVIEW.md** | Complete project documentation | 22 KB | Read for comprehensive understanding |
| **FILE_STATUS.md** | File authority reference | 11 KB | Determine which files to use |
| **ENRICHMENT_PROGRESS.md** | Wikidata enrichment tracking | 11 KB | Track batch-by-batch progress |
| **ENRICHMENT_CANDIDATES.yaml** | Institutions needing enrichment | 2.8 MB | Enrichment planning |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics | 3.3 KB | Parse for programmatic analysis |
| **README.md** | This file (quick reference) | 5 KB | Start here for navigation |

---

## 🚀 Quick Start

### For Humans: Read the Overview

Start with **UNIFIED_OVERVIEW.md** for:
- Executive summary of 13,502 institutions across 18 countries
- Schema architecture (LinkML v0.2.1)
- Geographic coverage by region
- Data quality framework
- Technical implementation details
- Known issues and solutions

### For File Navigation: Check File Status

Read **FILE_STATUS.md** for:
- Which files are authoritative (use these)
- Which files are archived (don't use)
- Relationship between master dataset and enrichment files
- Merge workflow documentation
- Quick reference commands

### For Tracking Progress: Check Enrichment Status

Read **ENRICHMENT_PROGRESS.md** for:
- Current Wikidata coverage by country
- Completed batches (Brazil, Chile)
- Next steps and timelines
- Batch execution workflow
- Quality metrics and accuracy rates

### For Programmatic Access: Parse Statistics

Use **DATASET_STATISTICS.yaml** for:
- Institution counts by country
- Wikidata/geocoding coverage percentages
- Institution type distributions
- Data tier classifications
- Regional totals

---

## 📊 Key Statistics (as of Nov 11, 2025)

```
Global Dataset: 13,502 institutions
Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched: 7,520 (55.7%)
Geocoded: 8,178 (60.6%)
Master File: globalglam-20251111.yaml (24 MB)

Top Countries:
  Japan: 12,065 institutions (89.4% of dataset)
  Netherlands: 622 institutions
  Tunisia: 69 institutions
  Mexico: 57 institutions
  Chile: 56 institutions
```
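
The percentages in the block above are each a count divided by the 13,502-institution total. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and do not come from the project's scripts.

```python
# Figures copied from the Key Statistics block above.
TOTAL_INSTITUTIONS = 13_502
COUNTS = {
    "wikidata_enriched": 7_520,
    "geocoded": 8_178,
    "japan": 12_065,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Hypothetical helper output: one percentage per headline figure.
shares = {name: share(count, TOTAL_INSTITUTIONS) for name, count in COUNTS.items()}

for name, value in shares.items():
    print(f"{name}: {value}%")
```

Rerunning this after each merge is one cheap way to check that the published percentages still match the counts.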

---

## 🎯 Current Focus: Enrichment and Merge Workflow

### Tunisia (Enrichment Complete, Pending Merge)
- **Status**: Enhanced file ready (252 KB)
- **Location**: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
- **Action**: Merge enriched data back into the master dataset

### Georgia (Enrichment Complete, Pending Merge)
- **Status**: Batch 3 complete (22 KB)
- **Location**: `data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml`
- **Action**: Merge enriched data back into the master dataset

### Latin America (Active Enrichment)
- **Chile**: Batch 20 enriched (53.9% Wikidata coverage)
- **Brazil**: Batch 6 enriched (13.7% Wikidata coverage)
- **Mexico**: Geocoding complete, enrichment starting
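
The pending Tunisia and Georgia merges amount to overlaying enriched records onto their counterparts in the master dataset. The sketch below shows one way this could look. It assumes each record is a dict carrying a unique `id` field and that a shallow, top-level overwrite is acceptable; neither assumption is confirmed by this README, and the project's actual merge workflow is described in FILE_STATUS.md.

```python
def merge_enriched(master, enriched, key="id"):
    """Return a new list in which enriched records overlay matching master records.

    Assumption: every record is a dict with a unique value under `key`.
    The merge is shallow, so enriched top-level fields replace master fields.
    """
    updates = {record[key]: record for record in enriched}
    merged = []
    for record in master:
        patch = updates.pop(record[key], None)
        merged.append({**record, **patch} if patch else dict(record))
    # Enriched records with no match in the master are appended, not dropped.
    merged.extend(updates.values())
    return merged


# Toy records for illustration only; the identifiers are placeholders.
master = [
    {"id": "a", "name": "Institution A"},
    {"id": "b", "name": "Institution B"},
]
enriched = [
    {"id": "b", "name": "Institution B", "wikidata_id": "Q-PLACEHOLDER"},
    {"id": "c", "name": "Institution C"},
]

result = merge_enriched(master, enriched)
print(len(result), result[1])
```

Returning a new list leaves the loaded master data untouched, which makes it easier to diff before and after writing the merged file.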

---

## 📁 Project File Structure

```
data/instances/
├── all/                                    ← YOU ARE HERE
│   ├── globalglam-20251111.yaml            ← Master dataset (PRIMARY - 13,502 institutions)
│   ├── UNIFIED_OVERVIEW.md                 ← Complete documentation
│   ├── FILE_STATUS.md                      ← File authority reference
│   ├── ENRICHMENT_PROGRESS.md              ← Wikidata enrichment tracking
│   ├── ENRICHMENT_CANDIDATES.yaml          ← Institutions needing enrichment
│   ├── DATASET_STATISTICS.yaml             ← Machine-readable metrics
│   └── README.md                           ← This file
│
├── brazil/
│   └── brazilian_institutions_batch6_enriched.yaml (current)
│
├── chile/
│   └── chilean_institutions_batch20_enriched.yaml (current)
│
├── mexico/
│   └── mexican_institutions_geocoded.yaml (current)
│
├── tunisia/
│   └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
│
├── georgia/
│   └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
│
├── japan/
│   └── jp_institutions_resolved.yaml (12,065 institutions)
│
├── global/
│   └── global_heritage_institutions_merged.yaml (older version)
│
└── exports/
    ├── *.jsonld (JSON-LD)
    ├── *.csv (CSV)
    └── *.geojson (GeoJSON)
```

---

## 🔗 Related Documentation

### Core Documentation
- **FILE_STATUS.md**: Which files are authoritative (START HERE for file navigation)
- **UNIFIED_OVERVIEW.md**: Complete project documentation
- **ENRICHMENT_PROGRESS.md**: Enrichment tracking
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Project Progress**: `/Users/kempersc/apps/glam/PROGRESS.md`
- **Schema Modules**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`

### Technical Specifications
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **UUID Strategy**: `/Users/kempersc/apps/glam/docs/UUID_STRATEGY.md`
- **GHCID Collisions**: `/Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md`

### Schema Files
- **Main Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **Core Classes**: `/Users/kempersc/apps/glam/schemas/core.yaml`
- **Enumerations**: `/Users/kempersc/apps/glam/schemas/enums.yaml`
- **Provenance**: `/Users/kempersc/apps/glam/schemas/provenance.yaml`

---

## 🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)

19 types with single-letter GHCID codes:

| Code | Type | Example |
|------|------|---------|
| G | GALLERY | Commercial galleries |
| L | LIBRARY | National libraries |
| A | ARCHIVE | National archives |
| M | MUSEUM | Art/history museums |
| O | OFFICIAL_INSTITUTION | Government agencies |
| R | RESEARCH_CENTER | Research institutes |
| C | CORPORATION | Corporate archives |
| U | UNKNOWN | Type undetermined |
| B | BOTANICAL_ZOO | Botanical gardens, zoos |
| E | EDUCATION_PROVIDER | Universities, schools |
| S | COLLECTING_SOCIETY | Historical societies |
| F | FEATURES | Physical landmarks (monuments, statues, memorials) |
| I | INTANGIBLE_HERITAGE_GROUP | Traditional performance groups, oral history |
| X | MIXED | Multiple types |
| P | PERSONAL_COLLECTION | Private collections |
| H | HOLY_SITES | Religious heritage sites |
| D | DIGITAL_PLATFORM | Online archives, digital libraries |
| N | NGO | Heritage advocacy organizations |
| T | TASTE_SMELL | Historic restaurants, perfumeries, distilleries |

**Mnemonic**: **GLAMORCUBESFIXPHDNT**
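
For scripts that need to translate a GHCID type code into its type name, the table above can be expressed as a lookup. This is a sketch only: the dictionary and function names are hypothetical, and the authoritative definitions live in the schema's enumerations file.

```python
# Code-to-type mapping transcribed from the taxonomy table above.
GHCID_TYPE_CODES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "O": "OFFICIAL_INSTITUTION",
    "R": "RESEARCH_CENTER",
    "C": "CORPORATION",
    "U": "UNKNOWN",
    "B": "BOTANICAL_ZOO",
    "E": "EDUCATION_PROVIDER",
    "S": "COLLECTING_SOCIETY",
    "F": "FEATURES",
    "I": "INTANGIBLE_HERITAGE_GROUP",
    "X": "MIXED",
    "P": "PERSONAL_COLLECTION",
    "H": "HOLY_SITES",
    "D": "DIGITAL_PLATFORM",
    "N": "NGO",
    "T": "TASTE_SMELL",
}

MNEMONIC = "GLAMORCUBESFIXPHDNT"


def type_for_code(code: str) -> str:
    """Return the type name for a single-letter code (case-insensitive).

    Raises KeyError for letters that are not in the taxonomy.
    """
    return GHCID_TYPE_CODES[code.upper()]


# The mnemonic is the 19 codes read in table order.
assert "".join(GHCID_TYPE_CODES) == MNEMONIC

print(type_for_code("m"))
```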

---

## 📈 Coverage Goals

### Latin America (Priority)

| Country | Current | Goal | Status |
|---------|---------|------|--------|
| Brazil | 6.1% | 17.4% | 🟡 Batch 6 complete |
| Chile | 6.7% | 22.2% | 🟢 Batch 2 complete |
| Mexico | 0.0% | 17.1% | 🔴 Not started |

**Regional Goal**: 60+ institutions (19% coverage)

### Global
- **Short-term**: 8,000 institutions (31% coverage)
- **Long-term**: 13,000 institutions (50% coverage)

---

## 💡 Quick Tips

### For Reading
1. Start with **FILE_STATUS.md** to understand file relationships
2. Read **UNIFIED_OVERVIEW.md** for the big picture
3. Consult **ENRICHMENT_PROGRESS.md** for detailed enrichment status
4. Reference **DATASET_STATISTICS.yaml** for metrics

### For Analysis
```python
import yaml  # Requires PyYAML

# Load master dataset
with open('data/instances/all/globalglam-20251111.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")

# Load statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r', encoding='utf-8') as f:
    stats = yaml.safe_load(f)

# Access coverage by country
japan = stats['datasets']['japan']
japan_coverage = japan['wikidata_coverage']
print(f"Japan: {japan_coverage['count']}/{japan['total_institutions']}")
```

### For Navigation
- **Country datasets**: `data/instances/{country}/`
- **Global merged**: `data/instances/global/`
- **Exports**: `data/instances/exports/`
- **Scripts**: `scripts/enrich_*.py`, `scripts/geocode_*.py`
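
The `data/instances/{country}/` layout makes it straightforward to enumerate the per-country files from a script. The following is a minimal sketch with a hypothetical function name. It assumes the non-country directories are exactly `all`, `global`, and `exports`, as shown in the file structure, and that country datasets use the `.yaml` extension.

```python
from pathlib import Path


def country_datasets(root="data/instances", skip=("all", "global", "exports")):
    """Map each country directory name to the sorted YAML file names inside it.

    Returns an empty dict when `root` does not exist.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    found = {}
    for country_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if country_dir.name in skip:
            continue  # Not a per-country directory.
        found[country_dir.name] = sorted(f.name for f in country_dir.glob("*.yaml"))
    return found


for country, files in country_datasets().items():
    print(country, files)
```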

---

## 🏆 Recent Achievements

### November 11, 2025
- ✅ **Master dataset unified**: 13,502 institutions from 18 countries
- ✅ **Major deduplication**: 48.0% duplicate rate (12,461 duplicates removed)
- ✅ **File status documented**: Created FILE_STATUS.md for clarity
- ✅ **Documentation updated**: All metrics reflect actual merged dataset

### November 10-11, 2025
- ✅ **Tunisia enrichment**: Enhanced file with Wikidata data (pending merge)
- ✅ **Georgia enrichment**: Batch 3 complete (pending merge)

### November 6-9, 2025
- ✅ **Chilean enrichment**: Batch 20 complete (53.9% coverage)
- ✅ **Brazilian enrichment**: Batch 6 complete (13.7% coverage)
- ✅ **Mexican geocoding**: 117 institutions (100% geocoded)

---

## 🚦 Next Steps

### Immediate (High Priority)
1. **Merge Tunisia enrichment** → Update master dataset with enhanced data
2. **Merge Georgia enrichment** → Update master dataset with batch 3 results
3. **Update statistics** → Recalculate coverage after merges

### Short-Term (This Week)
- Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
- Start Japan enrichment pilot (50 institutions)
- Document merge workflow for reproducibility

### Long-Term (This Month)
- Reach 8,000 Wikidata enriched institutions (60% coverage)
- Implement automated merge pipeline
- Create export formats (JSON-LD, RDF, CSV)
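
To put the 8,000-institution target in context, the remaining gap can be worked out from the figures under Key Statistics (13,502 institutions, 7,520 already enriched). This is an illustrative calculation with a hypothetical helper name, not a project script.

```python
def coverage_gap(total: int, enriched: int, target_count: int) -> tuple[int, float]:
    """Return (records still to enrich, target as a percentage of the dataset)."""
    remaining = max(target_count - enriched, 0)
    target_pct = 100 * target_count / total
    return remaining, target_pct


# Figures taken from the Key Statistics section of this README.
remaining, target_pct = coverage_gap(total=13_502, enriched=7_520, target_count=8_000)
print(f"{remaining} more institutions needed; target is {target_pct:.1f}% of the dataset")
```

Under these figures the target works out to a little over 59% of the current dataset, which the goal above rounds to 60%.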

---

## 📞 Reference Links

- **Wikidata**: https://www.wikidata.org/
- **VIAF**: https://viaf.org/
- **LinkML**: https://linkml.io/
- **Nominatim**: https://nominatim.openstreetmap.org/

---

**Document Version**: 1.0
**Created**: November 9, 2025
**Maintained By**: GLAM Data Extraction Project