# File Status Reference - GLAM Data Extraction Project

**Last Updated**: November 11, 2025
**Purpose**: Document which files are authoritative, archived, or superseded

---

## 🎯 Current Authoritative Files

### Master Dataset (PRIMARY)

**File**: `globalglam-20251111.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 24 MB
**Institutions**: 13,502
**Created**: 2025-11-11 15:17 UTC
**Status**: ✅ **AUTHORITATIVE - Use this file**

**Description**:
This is the current master dataset containing all unified heritage institutions from 18 countries. This file represents the November 11, 2025 merge of all regional datasets with deduplication applied.

**Source Records**:
- Raw records merged: 25,963
- Duplicates removed: 12,461 (48.0% duplicate rate)
- Final unique institutions: 13,502

**Coverage**:
- Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, Belgium, Algeria, Norway, United States, Great Britain, Vietnam, Thailand, Cambodia, Malaysia, Indonesia)
- Wikidata coverage: 55.7% (7,520 institutions)
- Geocoding coverage: 60.6% (8,178 institutions)

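
The record counts and percentages above can be cross-checked against each other. The following is a minimal sketch that uses only the numbers quoted in this section; the variable and function names are illustrative and not taken from any project script.

```python
# Figures quoted in this section of the reference.
raw_records = 25_963
duplicates_removed = 12_461
unique_institutions = 13_502
with_wikidata = 7_520
with_geocoding = 8_178


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Raw records minus duplicates should equal the final institution count.
assert raw_records - duplicates_removed == unique_institutions

duplicate_rate = pct(duplicates_removed, raw_records)
wikidata_coverage = pct(with_wikidata, unique_institutions)
geocoding_coverage = pct(with_geocoding, unique_institutions)

print(duplicate_rate, wikidata_coverage, geocoding_coverage)  # 48.0 55.7 60.6
```

Running the same check after a future merge, with updated figures, is a cheap way to catch a statistics block that was not recalculated.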

**Use Cases**:
- ✅ All data analysis and reporting
- ✅ Export generation (JSON-LD, RDF, CSV)
- ✅ Enrichment pipeline inputs
- ✅ Geographic visualization
- ✅ Statistical analysis

---

### Enrichment Tracking

**File**: `ENRICHMENT_CANDIDATES.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 2.8 MB
**Records**: 13,461 institutions (99.7% of master dataset)
**Status**: ✅ **AUTHORITATIVE - Enrichment planning**

**Description**:
Machine-readable list of institutions that need Wikidata/geocoding enrichment. Excludes 41 institutions that already have complete metadata.

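
To illustrate how such a candidate list could be derived, here is a minimal sketch. It assumes that "complete metadata" means a record has a Wikidata identifier and coordinates; the field names `wikidata_id`, `latitude`, and `longitude`, and the sample records, are hypothetical and may not match the real schema.

```python
from typing import Any

# Hypothetical field names; the real schema of the master dataset may differ.
REQUIRED_FIELDS = ("wikidata_id", "latitude", "longitude")


def missing_fields(record: dict[str, Any]) -> list[str]:
    """List the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def enrichment_candidates(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one entry per record that still lacks at least one field."""
    candidates = []
    for record in records:
        missing = missing_fields(record)
        if missing:
            candidates.append({"id": record["id"], "missing": missing})
    return candidates


# Made-up sample records, for illustration only.
sample = [
    {"id": "ex-001", "wikidata_id": "Q_EXAMPLE", "latitude": 1.0, "longitude": 2.0},
    {"id": "ex-002", "wikidata_id": None, "latitude": 1.0, "longitude": 2.0},
    {"id": "ex-003"},
]
candidates = enrichment_candidates(sample)
print([c["id"] for c in candidates])  # ['ex-002', 'ex-003']
```

Recording which fields are missing, rather than a bare list of IDs, makes it easier to split the work into Wikidata batches and geocoding batches.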

**Use Cases**:
- ✅ Planning batch enrichment workflows
- ✅ Prioritizing countries for enrichment
- ✅ Generating candidate lists for SPARQL queries
- ✅ Tracking enrichment progress over time

---

### Statistics and Metadata

**File**: `DATASET_STATISTICS.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 3.0 KB
**Status**: ✅ **AUTHORITATIVE - Metrics**

**Description**:
Machine-readable statistics for programmatic access. Includes country-by-country breakdowns, coverage percentages, and regional totals.

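
As a rough illustration of what a country-by-country breakdown involves, the sketch below computes institution counts and Wikidata coverage per country from a list of records. The field names `country` and `wikidata_id`, the output keys, and the sample data are assumptions; the actual layout of `DATASET_STATISTICS.yaml` is not documented here.

```python
from collections import defaultdict
from typing import Any


def coverage_by_country(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Count institutions per country and the share that has a Wikidata ID."""
    totals: dict[str, int] = defaultdict(int)
    with_wikidata: dict[str, int] = defaultdict(int)
    for record in records:
        country = record["country"]        # hypothetical field name
        totals[country] += 1
        if record.get("wikidata_id"):      # hypothetical field name
            with_wikidata[country] += 1
    return {
        country: {
            "institutions": total,
            "wikidata_coverage_pct": round(100 * with_wikidata[country] / total, 1),
        }
        for country, total in sorted(totals.items())
    }


# Made-up sample records, for illustration only.
sample = [
    {"country": "TN", "wikidata_id": "Q_EXAMPLE"},
    {"country": "TN", "wikidata_id": None},
    {"country": "GE", "wikidata_id": None},
]
stats = coverage_by_country(sample)
print(stats["TN"], stats["GE"])
```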

**Use Cases**:
- ✅ Programmatic access to dataset metrics
- ✅ Dashboard generation
- ✅ Progress tracking over time
- ✅ API responses for dataset statistics

---

## 🗄️ Archived/Superseded Files

### Obsolete Master Dataset

**Files**:
- `unified_global_heritage_institutions.yaml.backup` (24 MB)
- `unified_global_heritage_institutions.yaml.backup2` (24 MB)
- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (24 MB)

**Location**: `/Users/kempersc/apps/glam/data/instances/all/archive/`
**Archived**: November 11, 2025
**Status**: ⚠️ **ARCHIVED - Do not use**

**Description**:
These are backup copies from the November 11 merge process. They contain incomplete data (4,036 institutions) from earlier stages of the unification workflow. **These files have been moved to the archive directory.**

**Why Archived**:
- Superseded by `globalglam-20251111.yaml` (13,502 institutions)
- Incomplete geographic coverage (only 4,036 institutions vs. 13,502)
- Missing major countries (Japan, Tunisia, Georgia, etc.)
- Created during intermediate merge steps
- Scripts updated to reference new master dataset

**Archive Details**:
- See `archive/ARCHIVE_NOTES.md` for complete documentation
- Total archive size: 72 MB (3 files)
- Can be safely deleted after December 11, 2025

**Action Required**:
- ❌ Do NOT use for analysis
- ❌ Do NOT reference in documentation
- ❌ Do NOT use as merge input
- ✅ May be deleted after 30-day retention period (December 11, 2025)

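
The retention deadline follows directly from the archive date plus 30 days. A minimal sketch of that arithmetic, with an illustrative helper name (`may_delete`) that is not part of any project script:

```python
from datetime import date, timedelta

ARCHIVED_ON = date(2025, 11, 11)   # archive date stated above
RETENTION = timedelta(days=30)     # 30-day retention period
DELETE_AFTER = ARCHIVED_ON + RETENTION


def may_delete(today: date) -> bool:
    """True once the retention period has fully elapsed."""
    return today > DELETE_AFTER


print(DELETE_AFTER.isoformat())    # 2025-12-11
```

The comparison is strict on purpose: the text says the files may be deleted *after* December 11, so the deadline day itself still counts as retained.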

---

## 📂 Country-Specific Enrichment Files

These are **separate work products** that contain enriched data for specific countries. They are NOT yet part of the master dataset and need to be merged/re-applied.

### Tunisia Enrichment

**File**: `tunisian_institutions_enhanced.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/tunisia/`
**Size**: 252 KB
**Institutions**: 69
**Status**: 🟡 **ENRICHED - Not yet merged into master**

**Enrichment Work**:
- Wikidata enrichment performed (November 10-11, 2025)
- Geographic data enhanced
- Institution descriptions expanded

**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 1.4% Wikidata coverage for Tunisia
- Enrichment file shows higher coverage
- **Action needed**: Re-merge enriched data into master dataset

---

### Georgia Enrichment

**File**: `georgian_institutions_enriched_batch3_final.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/georgia/`
**Size**: 22 KB
**Institutions**: 14
**Status**: 🟡 **ENRICHED - Not yet merged into master**

**Enrichment Work**:
- Batch 1-3 enrichment completed (November 9-10, 2025)
- Wikidata identifiers added
- Geographic coordinates verified

**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 0% Wikidata coverage for Georgia
- Enrichment files show significant progress
- **Action needed**: Re-merge enriched data into master dataset

---

### Latin America Enrichment (In Progress)

**Locations**:
- `/Users/kempersc/apps/glam/data/instances/brazil/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/chile/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/mexico/` - Geocoding complete

**Status**: 🔄 **ACTIVE ENRICHMENT - Partially merged**

**Description**:
These directories contain batch-by-batch enrichment work. The baseline data IS in the master dataset, but ongoing enrichment updates may not be reflected yet.

**Current State**:
- Master dataset contains baseline Latin America data
- Enrichment work continues in subdirectories
- Periodic merges update master dataset

---

## 🔄 Merge Workflow Relationship

```
Regional Datasets        Enrichment Work              Master Dataset
─────────────────        ───────────────              ──────────────

japan/
tunisia/        →        tunisia/enhanced.yaml    →
georgia/        →        georgia/batch3.yaml      →   globalglam-20251111.yaml
netherlands/                                          (13,502 institutions)
chile/          →        chile/batch20.yaml       →
brazil/         →        brazil/batch6.yaml       →
mexico/
libya/
...

                                  ↓
                         Enrichment applied
                         separately, then
                         merged back in
```

### Merge Process

1. **Initial Unification** (November 11, 2025):
   - Regional datasets merged → `globalglam-20251111.yaml`
   - Deduplication applied
   - Baseline statistics captured

2. **Ongoing Enrichment**:
   - Country-specific directories contain enrichment work
   - Batch processing adds Wikidata/geocoding data
   - Files remain in subdirectories until merge

3. **Re-Merge Enrichment**:
   - Enriched data merged back into master dataset
   - Master dataset updated with new identifiers
   - Statistics recalculated

4. **Current Gap**:
   - Tunisia and Georgia enrichment files exist but have not been merged
   - Master dataset shows pre-enrichment statistics
   - The merge workflow must be run to update the master

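
The re-merge step (step 3) can be pictured as overlaying enriched fields onto master records that share an identifier. The sketch below is one possible approach under stated assumptions, not the project's actual merge script: the `id`, `name`, and `wikidata_id` fields and the sample records are hypothetical, and the rule "non-empty enriched values win" is an illustrative policy.

```python
from typing import Any

Record = dict[str, Any]


def remerge(master: list[Record], enriched: list[Record]) -> list[Record]:
    """Overlay non-empty enriched fields onto master records with the same id.

    Enriched records whose id is not in the master are ignored here; whether
    they should be appended instead is a project decision.
    """
    enriched_by_id = {rec["id"]: rec for rec in enriched}
    merged = []
    for rec in master:
        updates = enriched_by_id.get(rec["id"], {})
        # Only non-empty enriched values replace what the master already has.
        filled = {k: v for k, v in updates.items() if v not in (None, "")}
        merged.append({**rec, **filled})
    return merged


# Made-up sample records, for illustration only.
master = [
    {"id": "ex-001", "name": "Example Museum", "wikidata_id": None},
    {"id": "ex-002", "name": "Example Archive", "wikidata_id": None},
]
enriched = [{"id": "ex-001", "name": "", "wikidata_id": "Q_EXAMPLE"}]
result = remerge(master, enriched)
print(result[0]["wikidata_id"], result[1]["wikidata_id"])  # Q_EXAMPLE None
```

The function returns new records instead of mutating the master in place, so the pre-merge state stays available for comparison when statistics are recalculated.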

---

## 🚨 Important Notes

### DO NOT Use These Files

- ❌ `unified_global_heritage_institutions.yaml.backup`
- ❌ `unified_global_heritage_institutions.yaml.backup2`
- ❌ `unified_global_heritage_institutions_backup_20251111_092645.yaml`

**Reason**: Incomplete data (4,036 institutions vs. 13,502 in master)

### DO Use These Files

- ✅ `globalglam-20251111.yaml` - Master dataset (PRIMARY)
- ✅ `ENRICHMENT_CANDIDATES.yaml` - Enrichment planning
- ✅ `DATASET_STATISTICS.yaml` - Metrics and statistics
- ✅ Country-specific enrichment files (for their specific enrichment data)

---

## 📋 File Naming Conventions

### Master Dataset Naming

**Format**: `globalglam-YYYYMMDD.yaml`

**Example**: `globalglam-20251111.yaml`

**Rationale**:
- Date-stamped for version control
- "globalglam" prefix indicates unified dataset
- ISO 8601 basic date format (`YYYYMMDD`; sortable, unambiguous)

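
Because the date stamp is sortable, the most recent master file can be selected by parsing the filename. A minimal sketch, assuming the `globalglam-YYYYMMDD.yaml` pattern above; the helper names and the older filename in the sample list are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

MASTER_PATTERN = re.compile(r"^globalglam-(\d{8})\.yaml$")


def master_date(filename: str) -> Optional[date]:
    """Return the date encoded in a master filename, or None if it does not match."""
    match = MASTER_PATTERN.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None


def latest_master(filenames: list[str]) -> Optional[str]:
    """Pick the most recent master dataset among the given filenames."""
    dated = [(d, name) for name in filenames if (d := master_date(name)) is not None]
    return max(dated)[1] if dated else None


files = [
    "globalglam-20251111.yaml",
    "globalglam-20251020.yaml",   # hypothetical older master, for illustration
    "ENRICHMENT_CANDIDATES.yaml",
]
print(latest_master(files))  # globalglam-20251111.yaml
```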

### Country-Specific Files

**Format**: `{country}_institutions_enriched_batch{N}.yaml`

**Examples** (existing files do not all follow this pattern exactly):
- `tunisian_institutions_enhanced.yaml`
- `georgian_institutions_enriched_batch3_final.yaml`
- `chilean_institutions_batch20_enriched.yaml`

### Backup Files

**Format**: `{original_filename}.backup` or `{original_stem}_backup_YYYYMMDD_HHMMSS.yaml`

**Examples**:
- `unified_global_heritage_institutions.yaml.backup`
- `unified_global_heritage_institutions_backup_20251111_092645.yaml`

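
A script can recognize both backup forms with two regular expressions. The sketch below is illustrative; the function names are hypothetical, and the optional numeric suffix is an assumption made so that `.backup2`, which appears among the archived files, is also matched.

```python
import re
from datetime import datetime
from typing import Optional

# The two backup naming forms described above. The optional digits after
# ".backup" are an assumption that covers the archived ".backup2" file.
SUFFIX_BACKUP = re.compile(r"\.backup\d*$")
TIMESTAMPED_BACKUP = re.compile(r"_backup_(\d{8}_\d{6})\.yaml$")


def is_backup(filename: str) -> bool:
    """True if the filename matches either backup naming form."""
    return bool(SUFFIX_BACKUP.search(filename) or TIMESTAMPED_BACKUP.search(filename))


def backup_timestamp(filename: str) -> Optional[datetime]:
    """Return the timestamp embedded in a timestamped backup name, if any."""
    m = TIMESTAMPED_BACKUP.search(filename)
    return datetime.strptime(m.group(1), "%Y%m%d_%H%M%S") if m else None


name = "unified_global_heritage_institutions_backup_20251111_092645.yaml"
print(is_backup(name), backup_timestamp(name))  # True 2025-11-11 09:26:45
```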

---

## 🔍 Quick Reference Commands

### Check Master Dataset Size

```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 24M globalglam-20251111.yaml
```

### Count Institutions in Master

```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 13502
```

### List All Archived Files

```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/archive/
# Output: 3 backup files (24 MB each) + ARCHIVE_NOTES.md
```

### Check Enrichment Candidate Count

```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/ENRICHMENT_CANDIDATES.yaml
# Output: 13461
```

### List Country-Specific Enrichment Files

```bash
# Match both "enriched" and "enhanced" names (the Tunisia file uses "enhanced")
find /Users/kempersc/apps/glam/data/instances -type f \( -name "*enriched*.yaml" -o -name "*enhanced*.yaml" \) | sort
```

---

## 📊 Current Status Summary

| File | Status | Size | Records | Use Case |
|------|--------|------|---------|----------|
| `globalglam-20251111.yaml` | ✅ **ACTIVE** | 24 MB | 13,502 | Master dataset |
| `ENRICHMENT_CANDIDATES.yaml` | ✅ **ACTIVE** | 2.8 MB | 13,461 | Enrichment planning |
| `DATASET_STATISTICS.yaml` | ✅ **ACTIVE** | 3.0 KB | - | Metrics |
| `archive/*backup*` files | ⚠️ **ARCHIVED** | 72 MB | 4,036 | Do not use; moved to `archive/` |
| `tunisia/tunisian_institutions_enhanced.yaml` | 🟡 **PENDING MERGE** | 252 KB | 69 | Tunisia enrichment |
| `georgia/georgian_institutions_enriched_batch3_final.yaml` | 🟡 **PENDING MERGE** | 22 KB | 14 | Georgia enrichment |
| Other country enrichment files (Brazil, Chile, Mexico) | 🔄 **ACTIVE** | Varies | Varies | Ongoing enrichment |

---

## 🛠️ Recommended Actions

### For Data Analysis
1. Always use `globalglam-20251111.yaml` as the source of truth
2. Reference `DATASET_STATISTICS.yaml` for metrics
3. Check country subdirectories for latest enrichment status

### For Enrichment Work
1. Use `ENRICHMENT_CANDIDATES.yaml` to identify targets
2. Work in country-specific subdirectories
3. Merge enriched data back to master when a batch is complete

### For Cleanup
1. Verify master dataset integrity
2. Keep the archived `*backup*` files until the 30-day retention period ends (December 11, 2025), then delete them to free 72 MB of disk space
3. Document any new merge workflows

### For Documentation
1. Update `UNIFIED_OVERVIEW.md` when master dataset changes
2. Update `DATASET_STATISTICS.yaml` after merges
3. Update this file when new authoritative files are created

---

## 📅 Version History

Entries are listed in chronological order, oldest first.

| Date | Action | Files Affected |
|------|--------|----------------|
| 2025-11-09 to 2025-11-10 | Georgia enrichment | `georgia/georgian_institutions_enriched_batch3_final.yaml` |
| 2025-11-10 to 2025-11-11 | Tunisia enrichment | `tunisia/tunisian_institutions_enhanced.yaml` |
| 2025-11-11 09:26 UTC | Backup files created | `*backup*` files (now archived) |
| 2025-11-11 15:17 UTC | Master dataset created | `globalglam-20251111.yaml` |
| 2025-11-11 (later) | Scripts updated to new filename | 13 enrichment/merge scripts |
| 2025-11-11 (later) | Backup files archived | Moved to `archive/` directory |

---

## 🔗 Related Documentation

- **UNIFIED_OVERVIEW.md** - Complete project documentation
- **ENRICHMENT_PROGRESS.md** - Enrichment tracking and batch status
- **UNIFICATION_REPORT.md** - November 11 merge technical details
- **UNIFICATION_SUMMARY.md** - Merge process summary
- **README.md** - Quick reference and navigation

---

**Document Version**: 1.0
**Created**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project