# File Status Reference - GLAM Data Extraction Project

**Last Updated**: November 11, 2025
**Purpose**: Document which files are authoritative, archived, or superseded

---

## 🎯 Current Authoritative Files

### Master Dataset (PRIMARY)

**File**: `globalglam-20251111.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 24 MB
**Institutions**: 13,502
**Created**: 2025-11-11 15:17 UTC
**Status**: ✅ **AUTHORITATIVE - Use this file**

**Description**: This is the current master dataset containing all unified heritage institutions from 18 countries. This file represents the November 11, 2025 merge of all regional datasets with deduplication applied.

**Source Records**:
- Raw records merged: 25,963
- Duplicates removed: 12,461 (48.0% duplicate rate)
- Final unique institutions: 13,502

**Coverage**:
- Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, Belgium, Algeria, Norway, United States, Great Britain, Vietnam, Thailand, Cambodia, Malaysia, Indonesia)
- Wikidata coverage: 55.7% (7,520 institutions)
- Geocoding coverage: 60.6% (8,178 institutions)

**Use Cases**:
- ✅ All data analysis and reporting
- ✅ Export generation (JSON-LD, RDF, CSV)
- ✅ Enrichment pipeline inputs
- ✅ Geographic visualization
- ✅ Statistical analysis

---

### Enrichment Tracking

**File**: `ENRICHMENT_CANDIDATES.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 2.8 MB
**Records**: 13,461 institutions (99.7% of master dataset)
**Status**: ✅ **AUTHORITATIVE - Enrichment planning**

**Description**: Machine-readable list of institutions that need Wikidata/geocoding enrichment. Excludes 41 institutions that already have complete metadata.

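The counts quoted for the master dataset and the candidate list can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the figures stated in this document; the helper name `pct` and the variable names are hypothetical and not part of the project's scripts.

```bash
#!/usr/bin/env bash
# Sanity-check the figures reported in this document.
# All numbers are copied from the text; pct is a hypothetical helper.

# Print part/total as a percentage rounded to one decimal place.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f", 100 * part / total }'
}

raw=25963          # raw records merged
dupes=12461        # duplicates removed
master=13502       # final unique institutions
candidates=13461   # records in ENRICHMENT_CANDIDATES.yaml

echo "unique after dedup:  $((raw - dupes))"
echo "duplicate rate:      $(pct "$dupes" "$raw")%"
echo "Wikidata coverage:   $(pct 7520 "$master")%"
echo "geocoding coverage:  $(pct 8178 "$master")%"
echo "candidate share:     $(pct "$candidates" "$master")%"
echo "already complete:    $((master - candidates))"
```

Each line should reproduce the corresponding figure in this document: 13502, 48.0%, 55.7%, 60.6%, 99.7%, and 41. If a later merge changes any count, rerunning the check with the new numbers shows quickly whether the percentages still agree.
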

**Use Cases**:
- ✅ Planning batch enrichment workflows
- ✅ Prioritizing countries for enrichment
- ✅ Generating candidate lists for SPARQL queries
- ✅ Tracking enrichment progress over time

---

### Statistics and Metadata

**File**: `DATASET_STATISTICS.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/all/`
**Size**: 3.0 KB
**Status**: ✅ **AUTHORITATIVE - Metrics**

**Description**: Machine-readable statistics for programmatic access. Includes country-by-country breakdowns, coverage percentages, and regional totals.

**Use Cases**:
- ✅ Programmatic access to dataset metrics
- ✅ Dashboard generation
- ✅ Progress tracking over time
- ✅ API responses for dataset statistics

---

## 🗄️ Archived/Superseded Files

### Obsolete Master Dataset

**Files**:
- `unified_global_heritage_institutions.yaml.backup` (24 MB)
- `unified_global_heritage_institutions.yaml.backup2` (24 MB)
- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (24 MB)

**Location**: `/Users/kempersc/apps/glam/data/instances/all/archive/`
**Archived**: November 11, 2025
**Status**: ⚠️ **ARCHIVED - Do not use**

**Description**: These are backup copies from the November 11 merge process. They contain incomplete data (4,036 institutions) from earlier stages of the unification workflow. **These files have been moved to the archive directory.**

**Why Archived**:
- Superseded by `globalglam-20251111.yaml` (13,502 institutions)
- Incomplete geographic coverage (only 4,036 institutions vs. 13,502)
- Missing major countries (Japan, Tunisia, Georgia, etc.)
- Created during intermediate merge steps
- Scripts updated to reference new master dataset

**Archive Details**:
- See `archive/ARCHIVE_NOTES.md` for complete documentation
- Total archive size: 72 MB (3 files)
- Can be safely deleted after December 11, 2025

**Action Required**:
- ❌ Do NOT use for analysis
- ❌ Do NOT reference in documentation
- ❌ Do NOT use as merge input
- ✅ May be deleted after 30-day retention period (December 11, 2025)

---

## 📂 Country-Specific Enrichment Files

These are **separate work products** that contain enriched data for specific countries. They are NOT part of the master dataset yet and need to be merged/re-applied.

### Tunisia Enrichment

**File**: `tunisian_institutions_enhanced.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/tunisia/`
**Size**: 252 KB
**Institutions**: 69
**Status**: 🟡 **ENRICHED - Not yet merged into master**

**Enrichment Work**:
- Wikidata enrichment performed (November 10-11, 2025)
- Geographic data enhanced
- Institution descriptions expanded

**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 1.4% Wikidata coverage
- Enrichment file shows higher coverage
- **Action needed**: Re-merge enriched data into master dataset

---

### Georgia Enrichment

**File**: `georgian_institutions_enriched_batch3_final.yaml`
**Location**: `/Users/kempersc/apps/glam/data/instances/georgia/`
**Size**: 22 KB
**Institutions**: 14
**Status**: 🟡 **ENRICHED - Not yet merged into master**

**Enrichment Work**:
- Batch 1-3 enrichment completed (November 9-10, 2025)
- Wikidata identifiers added
- Geographic coordinates verified

**Current State in Master**:
- Master dataset (`globalglam-20251111.yaml`) shows 0% Wikidata coverage
- Enrichment files show significant progress
- **Action needed**: Re-merge enriched data into master dataset

---

### Latin America Enrichment (In Progress)

**Locations**:
- `/Users/kempersc/apps/glam/data/instances/brazil/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/chile/` - Batch enrichment ongoing
- `/Users/kempersc/apps/glam/data/instances/mexico/` - Geocoding complete

**Status**: 🔄 **ACTIVE ENRICHMENT - Partially merged**

**Description**: These directories contain batch-by-batch enrichment work. The baseline data IS in the master dataset, but ongoing enrichment updates may not be reflected yet.

**Current State**:
- Master dataset contains baseline Latin America data
- Enrichment work continues in subdirectories
- Periodic merges update master dataset

---

## 🔄 Merge Workflow Relationship

```
Regional Datasets     Enrichment Work              Master Dataset
─────────────────     ───────────────              ──────────────
japan/
tunisia/         →    tunisia/enhanced.yaml   →
georgia/         →    georgia/batch3.yaml     →    globalglam-20251111.yaml
netherlands/                                       (13,502 institutions)
chile/           →    chile/batch20.yaml      →
brazil/          →    brazil/batch6.yaml      →
mexico/
libya/
...
                               ↓
                Enrichment applied separately,
                     then merged back in
```

### Merge Process

1. **Initial Unification** (November 11, 2025):
   - Regional datasets merged → `globalglam-20251111.yaml`
   - Deduplication applied
   - Baseline statistics captured

2. **Ongoing Enrichment**:
   - Country-specific directories contain enrichment work
   - Batch processing adds Wikidata/geocoding data
   - Files remain in subdirectories until merge

3. **Re-Merge Enrichment**:
   - Enriched data merged back into master dataset
   - Master dataset updated with new identifiers
   - Statistics recalculated

4. **Current Gap**:
   - Tunisia and Georgia enrichment files exist but not merged
   - Master dataset shows pre-enrichment statistics
   - Need to run merge workflow to update master

---

## 🚨 Important Notes

### DO NOT Use These Files

- ❌ `unified_global_heritage_institutions.yaml.backup`
- ❌ `unified_global_heritage_institutions.yaml.backup2`
- ❌ `unified_global_heritage_institutions_backup_20251111_092645.yaml`

**Reason**: Incomplete data (4,036 institutions vs.
13,502 in master)

### DO Use These Files

- ✅ `globalglam-20251111.yaml` - Master dataset (PRIMARY)
- ✅ `ENRICHMENT_CANDIDATES.yaml` - Enrichment planning
- ✅ `DATASET_STATISTICS.yaml` - Metrics and statistics
- ✅ Country-specific enrichment files (for their specific enrichment data)

---

## 📋 File Naming Conventions

### Master Dataset Naming

**Format**: `globalglam-YYYYMMDD.yaml`
**Example**: `globalglam-20251111.yaml`

**Rationale**:
- Date-stamped for version control
- "globalglam" prefix indicates unified dataset
- ISO 8601 date format (sortable, unambiguous)

### Country-Specific Files

**Format**: `{country}_institutions_enriched_batch{N}.yaml`

**Examples**:
- `tunisian_institutions_enhanced.yaml`
- `georgian_institutions_enriched_batch3_final.yaml`
- `chilean_institutions_batch20_enriched.yaml`

Existing files do not all follow this pattern: note `enhanced` vs. `enriched` and the varying position of the batch number. Match on both spellings when searching.

### Backup Files

**Format**: `{original_filename}.backup` or `{original_filename}_backup_YYYYMMDD_HHMMSS.yaml`

**Examples**:
- `unified_global_heritage_institutions.yaml.backup`
- `unified_global_heritage_institutions_backup_20251111_092645.yaml`

---

## 🔍 Quick Reference Commands

### Check Master Dataset Size

```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Expected size: 24M
```

### Count Institutions in Master

```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Expected: 13502
```

### List All Archived Files

```bash
ls -lh /Users/kempersc/apps/glam/data/instances/all/archive/
# Expected: 3 backup files (24 MB each) + ARCHIVE_NOTES.md
```

### Check Enrichment Candidate Count

```bash
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/ENRICHMENT_CANDIDATES.yaml
# Expected: 13461
```

### List Country-Specific Enrichment Files

```bash
# Match both "enriched" and "enhanced" so the Tunisia file is included
find /Users/kempersc/apps/glam/data/instances \
  \( -name "*enriched*.yaml" -o -name "*enhanced*.yaml" \) -type f | sort
```

---

## 📊 Current Status Summary

| File | Status | Size | Records | Use Case |
|------|--------|------|---------|----------|
| `globalglam-20251111.yaml` | ✅ **ACTIVE** | 24 MB | 13,502 | Master dataset |
| `ENRICHMENT_CANDIDATES.yaml` | ✅ **ACTIVE** | 2.8 MB | 13,461 | Enrichment planning |
| `DATASET_STATISTICS.yaml` | ✅ **ACTIVE** | 3.0 KB | - | Metrics |
| `archive/*backup*` files | ⚠️ **ARCHIVED** | 72 MB | 4,036 | Moved to archive/ |
| `tunisia/tunisian_institutions_enhanced.yaml` | 🟡 **PENDING MERGE** | 252 KB | 69 | Tunisia enrichment |
| `georgia/georgian_institutions_enriched_batch3_final.yaml` | 🟡 **PENDING MERGE** | 22 KB | 14 | Georgia enrichment |
| Country enrichment files | 🔄 **ACTIVE** | Varies | Varies | Ongoing enrichment |

---

## 🛠️ Recommended Actions

### For Data Analysis

1. Always use `globalglam-20251111.yaml` as the source of truth
2. Reference `DATASET_STATISTICS.yaml` for metrics
3. Check country subdirectories for latest enrichment status

### For Enrichment Work

1. Use `ENRICHMENT_CANDIDATES.yaml` to identify targets
2. Work in country-specific subdirectories
3. Merge enriched data back to master when a batch is complete

### For Cleanup

1. Verify master dataset integrity
2. Delete the archived `*backup*` files once the 30-day retention period ends (December 11, 2025) to free 72 MB of disk space
3. Document any new merge workflows

### For Documentation

1. Update `UNIFIED_OVERVIEW.md` when master dataset changes
2. Update `DATASET_STATISTICS.yaml` after merges
3. Update this file when new authoritative files are created

---

## 📅 Version History

| Date | Action | Files Affected |
|------|--------|----------------|
| 2025-11-09 to 2025-11-10 | Georgia enrichment | `georgia/georgian_institutions_enriched_batch3_final.yaml` |
| 2025-11-10 to 2025-11-11 | Tunisia enrichment | `tunisia/tunisian_institutions_enhanced.yaml` |
| 2025-11-11 09:26 UTC | Backup files created | `*backup*` files (now archived) |
| 2025-11-11 15:17 UTC | Master dataset created | `globalglam-20251111.yaml` |
| 2025-11-11 (later) | Scripts updated to new filename | 13 enrichment/merge scripts |
| 2025-11-11 (later) | Backup files archived | Moved to `archive/` directory |

---

## 🔗 Related Documentation

- **UNIFIED_OVERVIEW.md** - Complete project documentation
- **ENRICHMENT_PROGRESS.md** - Enrichment tracking and batch status
- **UNIFICATION_REPORT.md** - November 11 merge technical details
- **UNIFICATION_SUMMARY.md** - Merge process summary
- **README.md** - Quick reference and navigation

---

**Document Version**: 1.0
**Created**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project