File Status Reference - GLAM Data Extraction Project
Last Updated: November 11, 2025
Purpose: Document which files are authoritative, archived, or superseded
🎯 Current Authoritative Files
Master Dataset (PRIMARY)
File: globalglam-20251111.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 24 MB
Institutions: 13,502
Created: 2025-11-11 15:17 UTC
Status: ✅ AUTHORITATIVE - Use this file
Description: The current master dataset, containing all unified heritage institutions from 18 countries. It is the result of the November 11, 2025 merge of all regional datasets, with deduplication applied.
Source Records:
- Raw records merged: 25,963
- Duplicates removed: 12,461 (48.0% duplicate rate)
- Final unique institutions: 13,502
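The figures above can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the numbers quoted in this document, not part of the merge tooling; the variable names are illustrative.

```shell
# Figures copied from the "Source Records" list in this document.
raw=25963
dupes=12461

# Unique institutions = raw records minus removed duplicates.
unique=$((raw - dupes))

# Duplicate rate as a percentage with one decimal place
# (awk does the floating-point math that POSIX sh lacks).
rate=$(awk -v d="$dupes" -v r="$raw" 'BEGIN { printf "%.1f", 100 * d / r }')

echo "unique=$unique rate=${rate}%"
```

Both results agree with the list: 13,502 unique institutions and a 48.0% duplicate rate.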
Coverage:
- Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, Belgium, Algeria, Norway, United States, Great Britain, Vietnam, Thailand, Cambodia, Malaysia, Indonesia)
- Wikidata coverage: 55.7% (7,520 institutions)
- Geocoding coverage: 60.6% (8,178 institutions)
Use Cases:
- ✅ All data analysis and reporting
- ✅ Export generation (JSON-LD, RDF, CSV)
- ✅ Enrichment pipeline inputs
- ✅ Geographic visualization
- ✅ Statistical analysis
Enrichment Tracking
File: ENRICHMENT_CANDIDATES.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 2.8 MB
Records: 13,461 institutions (99.7% of master dataset)
Status: ✅ AUTHORITATIVE - Enrichment planning
Description: Machine-readable list of institutions that need Wikidata/geocoding enrichment. Excludes 41 institutions that already have complete metadata.
Use Cases:
- ✅ Planning batch enrichment workflows
- ✅ Prioritizing countries for enrichment
- ✅ Generating candidate lists for SPARQL queries
- ✅ Tracking enrichment progress over time
Statistics and Metadata
File: DATASET_STATISTICS.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 3.0 KB
Status: ✅ AUTHORITATIVE - Metrics
Description: Machine-readable statistics for programmatic access. Includes country-by-country breakdowns, coverage percentages, and regional totals.
Use Cases:
- ✅ Programmatic access to dataset metrics
- ✅ Dashboard generation
- ✅ Progress tracking over time
- ✅ API responses for dataset statistics
🗄️ Archived/Superseded Files
Obsolete Master Dataset
Files:
- unified_global_heritage_institutions.yaml.backup (24 MB)
- unified_global_heritage_institutions.yaml.backup2 (24 MB)
- unified_global_heritage_institutions_backup_20251111_092645.yaml (24 MB)
Location: /Users/kempersc/apps/glam/data/instances/all/archive/
Archived: November 11, 2025
Status: ⚠️ ARCHIVED - Do not use
Description: These are backup copies from the November 11 merge process. They contain incomplete data (4,036 institutions) from earlier stages of the unification workflow. These files have been moved to the archive directory.
Why Archived:
- Superseded by globalglam-20251111.yaml (13,502 institutions)
- Incomplete geographic coverage (only 4,036 institutions vs. 13,502)
- Missing major countries (Japan, Tunisia, Georgia, etc.)
- Created during intermediate merge steps
- Scripts updated to reference new master dataset
Archive Details:
- See archive/ARCHIVE_NOTES.md for complete documentation
- Total archive size: 72 MB (3 files)
- Can be safely deleted after December 11, 2025
Action Required:
- ❌ Do NOT use for analysis
- ❌ Do NOT reference in documentation
- ❌ Do NOT use as merge input
- ✅ May be deleted after 30-day retention period (December 11, 2025)
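A minimal sketch of how the retention rule could be checked before cleanup. It assumes dates are compared as YYYYMMDD integers and that "after December 11, 2025" means strictly later than that day; `retention_expired` is a hypothetical helper, and the script only lists files (dry run) rather than deleting them.

```shell
# Succeeds (exit 0) when date $1 (YYYYMMDD) is strictly after cutoff $2 (YYYYMMDD).
retention_expired() {
    [ "$1" -gt "$2" ]
}

archive_dir="/Users/kempersc/apps/glam/data/instances/all/archive"
cutoff=20251211
today=$(date -u +%Y%m%d)

if retention_expired "$today" "$cutoff"; then
    # Dry run: show the deletion candidates instead of removing them.
    ls -lh "$archive_dir"/*backup* 2>/dev/null || echo "no backup files found"
else
    echo "Retention period runs through $cutoff; keep the archive."
fi
```

Replacing the `ls` with an actual delete is left as a deliberate manual step, after the master dataset has been verified.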
📂 Country-Specific Enrichment Files
These are separate work products that contain enriched data for specific countries. They are NOT part of the master dataset yet and need to be merged/re-applied.
Tunisia Enrichment
File: tunisian_institutions_enhanced.yaml
Location: /Users/kempersc/apps/glam/data/instances/tunisia/
Size: 252 KB
Institutions: 69
Status: 🟡 ENRICHED - Not yet merged into master
Enrichment Work:
- Wikidata enrichment performed (November 10-11, 2025)
- Geographic data enhanced
- Institution descriptions expanded
Current State in Master:
- Master dataset (globalglam-20251111.yaml) shows 1.4% Wikidata coverage
- Enrichment file shows higher coverage
- Action needed: Re-merge enriched data into master dataset
Georgia Enrichment
File: georgian_institutions_enriched_batch3_final.yaml
Location: /Users/kempersc/apps/glam/data/instances/georgia/
Size: 22 KB
Institutions: 14
Status: 🟡 ENRICHED - Not yet merged into master
Enrichment Work:
- Batch 1-3 enrichment completed (November 9-10, 2025)
- Wikidata identifiers added
- Geographic coordinates verified
Current State in Master:
- Master dataset (globalglam-20251111.yaml) shows 0% Wikidata coverage
- Enrichment files show significant progress
- Action needed: Re-merge enriched data into master dataset
Latin America Enrichment (In Progress)
Locations:
- /Users/kempersc/apps/glam/data/instances/brazil/ - Batch enrichment ongoing
- /Users/kempersc/apps/glam/data/instances/chile/ - Batch enrichment ongoing
- /Users/kempersc/apps/glam/data/instances/mexico/ - Geocoding complete
Status: 🔄 ACTIVE ENRICHMENT - Partially merged
Description: These directories contain batch-by-batch enrichment work. The baseline data IS in the master dataset, but ongoing enrichment updates may not be reflected yet.
Current State:
- Master dataset contains baseline Latin America data
- Enrichment work continues in subdirectories
- Periodic merges update master dataset
🔄 Merge Workflow Relationship
Regional Datasets      Enrichment Work            Master Dataset
─────────────────      ───────────────            ──────────────
japan/
tunisia/           →   tunisia/enhanced.yaml  →
georgia/           →   georgia/batch3.yaml    →   globalglam-20251111.yaml
netherlands/                                      (13,502 institutions)
chile/             →   chile/batch20.yaml     →
brazil/            →   brazil/batch6.yaml     →
mexico/
libya/
...
                       ↓
                       Enrichment applied
                       separately, then
                       merged back in
Merge Process
1. Initial Unification (November 11, 2025):
   - Regional datasets merged → globalglam-20251111.yaml
   - Deduplication applied
   - Baseline statistics captured
2. Ongoing Enrichment:
   - Country-specific directories contain enrichment work
   - Batch processing adds Wikidata/geocoding data
   - Files remain in subdirectories until merge
3. Re-Merge Enrichment:
   - Enriched data merged back into master dataset
   - Master dataset updated with new identifiers
   - Statistics recalculated
4. Current Gap:
   - Tunisia and Georgia enrichment files exist but not merged
   - Master dataset shows pre-enrichment statistics
   - Need to run merge workflow to update master
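Before running the merge workflow, it may help to confirm that the two pending enrichment files are present. The sketch below is a pre-flight check only. It assumes the enrichment files use the same top-level `- id:` record convention as the master dataset, and `count_records` is a hypothetical helper, not an existing project script.

```shell
base="/Users/kempersc/apps/glam/data/instances"

# Prints the number of top-level records in a YAML list file, using the
# '^- id:' convention (assumed to apply to enrichment files as well).
count_records() {
    grep -c '^- id:' "$1"
}

# Report each pending enrichment file named in this document.
for f in \
    "$base/tunisia/tunisian_institutions_enhanced.yaml" \
    "$base/georgia/georgian_institutions_enriched_batch3_final.yaml"
do
    if [ -f "$f" ]; then
        echo "pending merge: $f ($(count_records "$f") records)"
    else
        echo "missing: $f"
    fi
done
```

If the assumption about the record layout holds, the reported counts should match the figures listed for Tunisia (69) and Georgia (14).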
🚨 Important Notes
DO NOT Use These Files
❌ unified_global_heritage_institutions.yaml.backup
❌ unified_global_heritage_institutions.yaml.backup2
❌ unified_global_heritage_institutions_backup_20251111_092645.yaml
Reason: Incomplete data (4,036 institutions vs. 13,502 in master)
DO Use These Files
✅ globalglam-20251111.yaml - Master dataset (PRIMARY)
✅ ENRICHMENT_CANDIDATES.yaml - Enrichment planning
✅ DATASET_STATISTICS.yaml - Metrics and statistics
✅ Country-specific enrichment files (for their specific enrichment data)
📋 File Naming Conventions
Master Dataset Naming
Format: globalglam-YYYYMMDD.yaml
Example: globalglam-20251111.yaml
Rationale:
- Date-stamped for version control
- "globalglam" prefix indicates unified dataset
- ISO 8601 date format (sortable, unambiguous)
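The naming rule can be expressed as two small shell helpers. This is an illustrative sketch: `master_name` and `is_master_name` are hypothetical names, and the validation only checks for eight digits, not for a real calendar date.

```shell
# Builds the master dataset filename for a given date stamp (YYYYMMDD).
master_name() {
    printf 'globalglam-%s.yaml\n' "$1"
}

# Succeeds when a filename has the form globalglam-YYYYMMDD.yaml
# (eight digits; the date itself is not validated).
is_master_name() {
    case "$1" in
        globalglam-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].yaml) return 0 ;;
        *) return 1 ;;
    esac
}

# Name a dataset created today (UTC), then validate the documented example.
master_name "$(date -u +%Y%m%d)"
is_master_name "globalglam-20251111.yaml" && echo "valid"
```

Because the date stamp has a fixed width, a plain lexical sort of the filenames is also a chronological sort.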
Country-Specific Files
Format: {country}_institutions_enriched_batch{N}.yaml
Examples:
- tunisian_institutions_enhanced.yaml
- georgian_institutions_enriched_batch3_final.yaml
- chilean_institutions_batch20_enriched.yaml
Backup Files
Format: {original_filename}.backup or {original_filename}_backup_YYYYMMDD_HHMMSS.yaml
Examples:
- unified_global_heritage_institutions.yaml.backup
- unified_global_heritage_institutions_backup_20251111_092645.yaml
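A short sketch of how the timestamped form could be generated. `backup_name` is a hypothetical helper, and using UTC for the timestamp is an assumption based on the UTC times in the version history.

```shell
# Builds a timestamped backup name from an original .yaml filename.
# $1 = original filename, $2 = timestamp as YYYYMMDD_HHMMSS
backup_name() {
    printf '%s_backup_%s.yaml\n' "${1%.yaml}" "$2"
}

# Name a backup taken right now (UTC).
backup_name "unified_global_heritage_institutions.yaml" "$(date -u +%Y%m%d_%H%M%S)"
```

Given the timestamp `20251111_092645`, the helper reproduces the second example above.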
🔍 Quick Reference Commands
Check Master Dataset Size
ls -lh /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 24M globalglam-20251111.yaml
Count Institutions in Master
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 13502
List All Archived Files
ls -lh /Users/kempersc/apps/glam/data/instances/all/archive/
# Output: 3 backup files (24 MB each) + ARCHIVE_NOTES.md
Check Enrichment Candidate Count
grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/ENRICHMENT_CANDIDATES.yaml
# Output: 13461
List Country-Specific Enrichment Files
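find /Users/kempersc/apps/glam/data/instances \( -name "*enriched*.yaml" -o -name "*enhanced*.yaml" \) -type f | sort
# Matches both "enriched" and "enhanced" names (e.g. tunisian_institutions_enhanced.yaml)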
📊 Current Status Summary
| File | Status | Size | Records | Use Case |
|---|---|---|---|---|
| globalglam-20251111.yaml | ✅ ACTIVE | 24 MB | 13,502 | Master dataset |
| ENRICHMENT_CANDIDATES.yaml | ✅ ACTIVE | 2.8 MB | 13,461 | Enrichment planning |
| DATASET_STATISTICS.yaml | ✅ ACTIVE | 3.0 KB | - | Metrics |
| archive/*.backup* files | ⚠️ ARCHIVED | 72 MB | 4,036 | Moved to archive/ |
| tunisia/enhanced.yaml | 🟡 PENDING MERGE | 252 KB | 69 | Tunisia enrichment |
| georgia/batch3.yaml | 🟡 PENDING MERGE | 22 KB | 14 | Georgia enrichment |
| Country enrichment files | 🔄 ACTIVE | Varies | Varies | Ongoing enrichment |
🛠️ Recommended Actions
For Data Analysis
- Always use globalglam-20251111.yaml as the source of truth
- Reference DATASET_STATISTICS.yaml for metrics
- Check country subdirectories for latest enrichment status
For Enrichment Work
- Use ENRICHMENT_CANDIDATES.yaml to identify targets
- Work in country-specific subdirectories
- Merge enriched data back to master when batch complete
For Cleanup
- Verify master dataset integrity
- Delete the archived *.backup* files once the 30-day retention period ends (after December 11, 2025) to save disk space
- Document any new merge workflows
For Documentation
- Update UNIFIED_OVERVIEW.md when master dataset changes
- Update DATASET_STATISTICS.yaml after merges
- Update this file when new authoritative files are created
📅 Version History
| Date | Action | Files Affected |
|---|---|---|
| 2025-11-11 15:17 UTC | Master dataset created | globalglam-20251111.yaml |
| 2025-11-11 (later) | Scripts updated to new filename | 13 enrichment/merge scripts |
| 2025-11-11 (later) | Backup files archived | Moved to archive/ directory |
| 2025-11-11 09:26 UTC | Backup files created | *.backup* files (now archived) |
| 2025-11-10 to 2025-11-11 | Tunisia enrichment | tunisia/enhanced.yaml |
| 2025-11-09 to 2025-11-10 | Georgia enrichment | georgia/batch3.yaml |
🔗 Related Documentation
- UNIFIED_OVERVIEW.md - Complete project documentation
- ENRICHMENT_PROGRESS.md - Enrichment tracking and batch status
- UNIFICATION_REPORT.md - November 11 merge technical details
- UNIFICATION_SUMMARY.md - Merge process summary
- README.md - Quick reference and navigation
Document Version: 1.0
Created: November 11, 2025
Maintained By: GLAM Data Extraction Project