# File Status Reference - GLAM Data Extraction Project
**Last Updated**: November 11, 2025
**Purpose**: Document which files are authoritative, archived, or superseded
---
## 🎯 Current Authoritative Files
### Master Dataset (PRIMARY)
**File**: `globalglam-20251111.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 24 MB
**Institutions**: 13,502
**Created**: 2025-11-11 15:17 UTC
**Status**: ✅ **AUTHORITATIVE - Use this file**
**Description**:
This is the current master dataset. It contains every heritage institution from all 18 countries, produced by the November 11, 2025 merge of the regional datasets with deduplication applied.
**Source Records**:
- Raw records merged: 25,963
- Duplicates removed: 12,461 (48.0% duplicate rate)
- Final unique institutions: 13,502
**Coverage**:
- Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, Belgium, Algeria, Norway, United States, Great Britain, Vietnam, Thailand, Cambodia, Malaysia, Indonesia)
- Wikidata coverage: 55.7% (7,520 institutions)
- Geocoding coverage: 60.6% (8,178 institutions)
**Use Cases**:
- ✅ All data analysis and reporting
- ✅ Export generation (JSON-LD, RDF, CSV)
- ✅ Enrichment pipeline inputs
- ✅ Geographic visualization
- ✅ Statistical analysis
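The duplicate rate and coverage percentages listed for the master dataset follow directly from the record counts. The sketch below recomputes them as a sanity check; the helper name `pct` is illustrative and not part of the project's scripts.

```bash
#!/usr/bin/env bash
# pct PART TOTAL: print PART as a percentage of TOTAL, rounded to one decimal.
# (Hypothetical helper, not taken from the project's scripts.)
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

raw=25963       # raw records merged
unique=13502    # final unique institutions
duplicates=$((raw - unique))

echo "duplicates removed: $duplicates"
echo "duplicate rate:     $(pct "$duplicates" "$raw")%"
echo "Wikidata coverage:  $(pct 7520 "$unique")%"
echo "geocoding coverage: $(pct 8178 "$unique")%"
```

Running it reproduces the figures above: 12,461 duplicates, a 48.0% duplicate rate, and 55.7% and 60.6% coverage.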
---
### Enrichment Tracking
**File**: `ENRICHMENT_CANDIDATES.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 2.8 MB
**Records**: 13,461 institutions (99.7% of master dataset)
**Status**: ✅ **AUTHORITATIVE - Enrichment planning**
**Description**:
Machine-readable list of institutions that still need Wikidata and/or geocoding enrichment. It excludes the 41 institutions (13,502 − 13,461) that already have complete metadata.
**Use Cases**:
- ✅ Planning batch enrichment workflows
- ✅ Prioritizing countries for enrichment
- ✅ Generating candidate lists for SPARQL queries
- ✅ Tracking enrichment progress over time
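The candidate count and the master count should always differ by the number of institutions that are already complete. A minimal sketch of that cross-check follows, assuming each record starts with a top-level `- id:` line (the same pattern the grep commands in this file rely on). The function names are illustrative.

```bash
#!/usr/bin/env bash
# count_records FILE: count records, assuming each starts with "- id:" in column 0.
count_records() {
  # grep -c exits with status 1 when the count is 0; the count is still printed.
  grep -c '^- id:' "$1" || true
}

# already_complete MASTER CANDIDATES: records in MASTER that are not listed
# as enrichment candidates (a difference of counts, not an ID-level comparison).
already_complete() {
  local master candidates
  master=$(count_records "$1")
  candidates=$(count_records "$2")
  echo $((master - candidates))
}

# Usage (paths as documented in this file):
#   cd /Users/kempersc/apps/glam/data/instances/all/
#   already_complete globalglam-20251111.yaml ENRICHMENT_CANDIDATES.yaml
```

With the counts listed in this file (13,502 and 13,461), the result should be 41. A different number suggests that one of the two files has changed since this document was written.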
---
### Statistics and Metadata
**File**: `DATASET_STATISTICS.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 3.0 KB
**Status**: ✅ **AUTHORITATIVE - Metrics**
**Description**:
Machine-readable statistics for programmatic access. Includes country-by-country breakdowns, coverage percentages, and regional totals.
**Use Cases**:
- ✅ Programmatic access to dataset metrics
- ✅ Dashboard generation
- ✅ Progress tracking over time
- ✅ API responses for dataset statistics
---
## 🗄️ Archived/Superseded Files
### Obsolete Master Dataset
**Files**:
- `unified_global_heritage_institutions.yaml.backup` (24 MB)
- `unified_global_heritage_institutions.yaml.backup2` (24 MB)
- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (24 MB)
**Location**: `/Users/kempersc/apps/glam/data/instances/all/archive/`
**Archived**: November 11, 2025
**Status**: ⚠️ **ARCHIVED - Do not use**
**Description**:
These are backup copies from the November 11 merge process. They contain incomplete data (4,036 institutions) from earlier stages of the unification workflow. **These files have been moved to the archive directory.**
**Why Archived**:
- Superseded by `globalglam-20251111.yaml` (13,502 institutions)
- Incomplete geographic coverage (only 4,036 institutions vs. 13,502)
- Missing major countries (Japan, Tunisia, Georgia, etc.)
- Created during intermediate merge steps
- Scripts updated to reference new master dataset
**Archive Details**:
- See `archive/ARCHIVE_NOTES.md` for complete documentation
- Total archive size: 72 MB (3 files)
- Can be safely deleted after December 11, 2025
**Action Required**:
- ❌ Do NOT use for analysis
- ❌ Do NOT reference in documentation
- ❌ Do NOT use as merge input
- ✅ May be deleted after 30-day retention period (December 11, 2025)
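Before deleting the archive, it is worth confirming that the retention date has actually passed. The sketch below compares dates as `YYYYMMDD` integers, which avoids the differences between GNU and BSD `date` arithmetic. The function name `retention_expired` is illustrative, and the script only reports; it deletes nothing.

```bash
#!/usr/bin/env bash
# retention_expired TODAY CUTOFF: both arguments are dates written as YYYYMMDD.
# Succeeds (exit status 0) only when TODAY is strictly after CUTOFF.
retention_expired() {
  [ "$1" -gt "$2" ]
}

cutoff=20251211   # end of the 30-day retention period stated above
archive_dir="/Users/kempersc/apps/glam/data/instances/all/archive"

if retention_expired "$(date +%Y%m%d)" "$cutoff"; then
  echo "Retention period is over; files in $archive_dir may be deleted."
else
  echo "Retention period is still running; keep $archive_dir."
fi
```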
---
## 📂 Country-Specific Enrichment Files
These are **separate work products** that contain enriched data for specific countries. They are NOT part of the master dataset yet and need to be merged/re-applied.
### Tunisia Enrichment
**File**: `tunisian_institutions_enhanced.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/tunisia/`
**Size**: 252 KB
**Institutions**: 69
**Status**: 🟡 **ENRICHED - Not yet merged into master**
**Enrichment Work**:
- Wikidata enrichment performed (November 10-11, 2025)
- Geographic data enhanced
- Institution descriptions expanded
**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 1.4% Wikidata coverage for Tunisia
- Enrichment file shows higher coverage
- **Action needed**: Re-merge enriched data into master dataset
---
### Georgia Enrichment
**File**: `georgian_institutions_enriched_batch3_final.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/georgia/`
**Size**: 22 KB
**Institutions**: 14
**Status**: 🟡 **ENRICHED - Not yet merged into master**
**Enrichment Work**:
- Batch 1-3 enrichment completed (November 9-10, 2025)
- Wikidata identifiers added
- Geographic coordinates verified
**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 0% Wikidata coverage for Georgia
- Enrichment files show significant progress
- **Action needed**: Re-merge enriched data into master dataset
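To put a number on the gap between an enrichment file and the master, one option is to count records that carry a Wikidata identifier. The sketch below is a rough line-based estimate, not a YAML parser. It assumes that records start with `- id:` and that a Wikidata identifier appears on a line containing `identifier_scheme: Wikidata`. That pattern is an assumption about the schema and can be overridden with the second argument.

```bash
#!/usr/bin/env bash
# wikidata_coverage FILE [PATTERN]
# Prints "<records with Wikidata> <total records> <percent>".
# PATTERN defaults to an ASSUMED schema line; adjust it to the real field name.
# Counts matching lines, so it assumes at most one Wikidata entry per record.
wikidata_coverage() {
  local file=$1
  local pattern=${2:-'identifier_scheme: Wikidata'}
  local total with
  total=$(grep -c '^- id:' "$file" || true)
  with=$(grep -c -- "$pattern" "$file" || true)
  awk -v w="$with" -v t="$total" \
    'BEGIN { printf "%d %d %.1f\n", w, t, (t ? 100 * w / t : 0) }'
}

# Usage (file name as documented above):
#   wikidata_coverage /Users/kempersc/apps/glam/data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml
```

Comparing the result for an enrichment file with the per-country figure in the master shows how much a re-merge would change.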
---
### Latin America Enrichment (In Progress)
**Locations**:
- `/Users/kempersc/apps/glam/data/instances/brazil/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/chile/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/mexico/` - Geocoding complete
**Status**: 🔄 **ACTIVE ENRICHMENT - Partially merged**
**Description**:
These directories contain batch-by-batch enrichment work. The baseline data IS in the master dataset, but ongoing enrichment updates may not be reflected yet.
**Current State**:
- Master dataset contains baseline Latin America data
- Enrichment work continues in subdirectories
- Periodic merges update master dataset
---
## 🔄 Merge Workflow Relationship
```
Regional Datasets        Enrichment Work              Master Dataset
─────────────────        ───────────────              ──────────────
japan/
tunisia/             →   tunisia/enhanced.yaml    →
georgia/             →   georgia/batch3.yaml      →   globalglam-20251111.yaml
netherlands/                                          (13,502 institutions)
chile/               →   chile/batch20.yaml       →
brazil/              →   brazil/batch6.yaml       →
mexico/
libya/
...
                         Enrichment applied
                         separately, then
                         merged back in
```
### Merge Process
1. **Initial Unification** (November 11, 2025):
- Regional datasets merged → `globalglam-20251111.yaml`
- Deduplication applied
- Baseline statistics captured
2. **Ongoing Enrichment**:
- Country-specific directories contain enrichment work
- Batch processing adds Wikidata/geocoding data
- Files remain in subdirectories until merge
3. **Re-Merge Enrichment**:
- Enriched data merged back into master dataset
- Master dataset updated with new identifiers
- Statistics recalculated
4. **Current Gap**:
- Tunisia and Georgia enrichment files exist but have not been merged
- Master dataset still shows pre-enrichment statistics for both countries
- The merge workflow must be run to update the master
---
## 🚨 Important Notes
### DO NOT Use These Files
- ❌ `unified_global_heritage_institutions.yaml.backup`
- ❌ `unified_global_heritage_institutions.yaml.backup2`
- ❌ `unified_global_heritage_institutions_backup_20251111_092645.yaml`

**Reason**: Incomplete data (4,036 institutions vs. 13,502 in master)
### DO Use These Files
- ✅ `globalglam-20251111.yaml` - Master dataset (PRIMARY)
- ✅ `ENRICHMENT_CANDIDATES.yaml` - Enrichment planning
- ✅ `DATASET_STATISTICS.yaml` - Metrics and statistics
- ✅ Country-specific enrichment files (for their specific enrichment data)
---
## 📋 File Naming Conventions
### Master Dataset Naming
**Format**: `globalglam-YYYYMMDD.yaml`
**Example**: `globalglam-20251111.yaml`
**Rationale**:
- Date-stamped for version control
- "globalglam" prefix indicates unified dataset
- ISO 8601 date format (sortable, unambiguous)
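Because the date stamp sorts lexicographically, the newest master file can be found by comparing names alone. The sketch below validates the `globalglam-YYYYMMDD.yaml` pattern and picks the latest match in a directory. Both function names are illustrative and not part of the project's scripts.

```bash
#!/usr/bin/env bash
# is_master_name NAME: succeed if NAME matches globalglam-YYYYMMDD.yaml.
# Checks the shape only (eight digits), not whether the date is a valid one.
is_master_name() {
  [[ $1 =~ ^globalglam-[0-9]{8}\.yaml$ ]]
}

# latest_master DIR: print the master file name in DIR with the highest
# date stamp. Prints nothing and returns non-zero if none is found.
latest_master() {
  local dir=$1 f name latest=""
  for f in "$dir"/globalglam-*.yaml; do
    name=${f##*/}                       # strip the directory part
    is_master_name "$name" || continue  # skip non-matching names
    if [[ $name > $latest ]]; then
      latest=$name
    fi
  done
  [ -n "$latest" ] && echo "$latest"
}

# Usage:
#   latest_master /Users/kempersc/apps/glam/data/instances/all
```

Scripts that resolve the master this way would not need a hard-coded file name when a new dated master is created.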
### Country-Specific Files
**Format**: `{country}_institutions_enriched_batch{N}.yaml` (nominal; the existing files listed below vary in suffix and word order)
**Examples**:
- `tunisian_institutions_enhanced.yaml`
- `georgian_institutions_enriched_batch3_final.yaml`
- `chilean_institutions_batch20_enriched.yaml`
### Backup Files
**Format**: `{original_filename}.backup` (optionally with a numeric suffix, as in `.backup2`) or `{original_name_without_extension}_backup_YYYYMMDD_HHMMSS.yaml`
**Examples**:
- `unified_global_heritage_institutions.yaml.backup`
- `unified_global_heritage_institutions_backup_20251111_092645.yaml`
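The timestamped backup form can be generated mechanically. A minimal sketch, assuming the source file ends in `.yaml`; the function name `backup_name` is illustrative:

```bash
#!/usr/bin/env bash
# backup_name FILE [STAMP]: print the timestamped backup name for FILE.
# STAMP defaults to the current local time as YYYYMMDD_HHMMSS.
backup_name() {
  local file=$1
  local stamp=${2:-$(date +%Y%m%d_%H%M%S)}
  local base=${file##*/}                 # drop any directory part
  echo "${base%.yaml}_backup_${stamp}.yaml"
}

backup_name unified_global_heritage_institutions.yaml 20251111_092645
# prints: unified_global_heritage_institutions_backup_20251111_092645.yaml
```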
---
## 🔍 Quick Reference Commands
### Check Master Dataset Size
```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Expected: the size column of the listing shows about 24M
```
### Count Institutions in Master
```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 13502
```
### List All Archived Files
```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/archive/
# Output: 3 backup files (24 MB each) + ARCHIVE_NOTES.md
```
### Check Enrichment Candidate Count
```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/ENRICHMENT_CANDIDATES.yaml
# Output: 13461
```
### List Country-Specific Enrichment Files
```bash
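# Match both "enriched" and "enhanced" names (the Tunisia file uses "enhanced")
find /Users/kempersc/apps/glam/data/instances -type f \( -name "*enriched*.yaml" -o -name "*enhanced*.yaml" \) | sort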
```
---
## 📊 Current Status Summary
| File | Status | Size | Records | Use Case |
|------|--------|------|---------|----------|
| `globalglam-20251111.yaml` | ✅ **ACTIVE** | 24 MB | 13,502 | Master dataset |
| `ENRICHMENT_CANDIDATES.yaml` | ✅ **ACTIVE** | 2.8 MB | 13,461 | Enrichment planning |
| `DATASET_STATISTICS.yaml` | ✅ **ACTIVE** | 3.0 KB | - | Metrics |
| `archive/*backup*` files (3) | ⚠️ **ARCHIVED** | 72 MB total | 4,036 | Moved to archive/ |
| `tunisia/tunisian_institutions_enhanced.yaml` | 🟡 **PENDING MERGE** | 252 KB | 69 | Tunisia enrichment |
| `georgia/georgian_institutions_enriched_batch3_final.yaml` | 🟡 **PENDING MERGE** | 22 KB | 14 | Georgia enrichment |
| Country enrichment files | 🔄 **ACTIVE** | Varies | Varies | Ongoing enrichment |
---
## 🛠️ Recommended Actions
### For Data Analysis
1. Always use `globalglam-20251111.yaml` as the source of truth
2. Reference `DATASET_STATISTICS.yaml` for metrics
3. Check country subdirectories for latest enrichment status
### For Enrichment Work
1. Use `ENRICHMENT_CANDIDATES.yaml` to identify targets
2. Work in country-specific subdirectories
3. Merge enriched data back into the master when a batch is complete
### For Cleanup
1. Verify master dataset integrity
2. Delete the archived backup files in `archive/` once the retention period ends (after December 11, 2025) to free 72 MB of disk space
3. Document any new merge workflows
### For Documentation
1. Update `UNIFIED_OVERVIEW.md` when master dataset changes
2. Update `DATASET_STATISTICS.yaml` after merges
3. Update this file when new authoritative files are created
---
## 📅 Version History
| Date | Action | Files Affected |
|------|--------|----------------|
| 2025-11-11 (later) | Backup files archived | Moved to `archive/` directory |
| 2025-11-11 (later) | Scripts updated to new filename | 13 enrichment/merge scripts |
| 2025-11-11 15:17 UTC | Master dataset created | `globalglam-20251111.yaml` |
| 2025-11-11 09:26 UTC | Backup files created | `*backup*` files (now archived) |
| 2025-11-10 to 2025-11-11 | Tunisia enrichment | `tunisia/tunisian_institutions_enhanced.yaml` |
| 2025-11-09 to 2025-11-10 | Georgia enrichment | `georgia/georgian_institutions_enriched_batch3_final.yaml` |
---
## 🔗 Related Documentation
- **UNIFIED_OVERVIEW.md** - Complete project documentation
- **ENRICHMENT_PROGRESS.md** - Enrichment tracking and batch status
- **UNIFICATION_REPORT.md** - November 11 merge technical details
- **UNIFICATION_SUMMARY.md** - Merge process summary
- **README.md** - Quick reference and navigation
---
**Document Version**: 1.0
**Created**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project