glam/data/instances/all/FILE_STATUS.md
2025-11-19 23:25:22 +01:00

File Status Reference - GLAM Data Extraction Project

Last Updated: November 11, 2025
Purpose: Document which files are authoritative, archived, or superseded


🎯 Current Authoritative Files

Master Dataset (PRIMARY)

File: globalglam-20251111.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 24 MB
Institutions: 13,502
Created: 2025-11-11 15:17 UTC
Status: AUTHORITATIVE - Use this file

Description: The current master dataset, unifying heritage institutions from 18 countries. It is the product of the November 11, 2025 merge of all regional datasets, with deduplication applied.

Source Records:

  • Raw records merged: 25,963
  • Duplicates removed: 12,461 (48.0% duplicate rate)
  • Final unique institutions: 13,502

Coverage:

  • Countries: 18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, Belgium, Algeria, Norway, United States, Great Britain, Vietnam, Thailand, Cambodia, Malaysia, Indonesia)
  • Wikidata coverage: 55.7% (7,520 institutions)
  • Geocoding coverage: 60.6% (8,178 institutions)
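
The counts and percentages above can be rechecked with a few lines of shell. This is a minimal sketch that uses only the figures quoted in this section; the helper name `pct` is illustrative and not part of the project's scripts.

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL with one decimal place.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

raw=25963     # raw records merged
dups=12461    # duplicates removed
unique=$((raw - dups))

echo "unique institutions: $unique"                 # 13502
echo "duplicate rate: $(pct "$dups" "$raw")%"       # 48.0%
echo "Wikidata coverage: $(pct 7520 "$unique")%"    # 55.7%
echo "geocoding coverage: $(pct 8178 "$unique")%"   # 60.6%
```

Rerunning the same check after each merge is a cheap way to catch a statistics file that has drifted from the dataset.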

Use Cases:

  • All data analysis and reporting
  • Export generation (JSON-LD, RDF, CSV)
  • Enrichment pipeline inputs
  • Geographic visualization
  • Statistical analysis

Enrichment Tracking

File: ENRICHMENT_CANDIDATES.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 2.8 MB
Records: 13,461 institutions (99.7% of master dataset)
Status: AUTHORITATIVE - Enrichment planning

Description: Machine-readable list of institutions that need Wikidata/geocoding enrichment. Excludes the 41 institutions (13,502 minus 13,461) that already have complete metadata.

Use Cases:

  • Planning batch enrichment workflows
  • Prioritizing countries for enrichment
  • Generating candidate lists for SPARQL queries
  • Tracking enrichment progress over time

Statistics and Metadata

File: DATASET_STATISTICS.yaml
Location: /Users/kempersc/apps/glam/data/instances/all/
Size: 3.0 KB
Status: AUTHORITATIVE - Metrics

Description: Machine-readable statistics for programmatic access. Includes country-by-country breakdowns, coverage percentages, and regional totals.

Use Cases:

  • Programmatic access to dataset metrics
  • Dashboard generation
  • Progress tracking over time
  • API responses for dataset statistics

🗄️ Archived/Superseded Files

Obsolete Master Dataset

Files:

  • unified_global_heritage_institutions.yaml.backup (24 MB)
  • unified_global_heritage_institutions.yaml.backup2 (24 MB)
  • unified_global_heritage_institutions_backup_20251111_092645.yaml (24 MB)

Location: /Users/kempersc/apps/glam/data/instances/all/archive/
Archived: November 11, 2025
Status: ⚠️ ARCHIVED - Do not use

Description: These are backup copies from the November 11 merge process. They contain incomplete data (4,036 institutions) from earlier stages of the unification workflow. These files have been moved to the archive directory.

Why Archived:

  • Superseded by globalglam-20251111.yaml (13,502 institutions)
  • Incomplete geographic coverage (only 4,036 institutions vs. 13,502)
  • Missing major countries (Japan, Tunisia, Georgia, etc.)
  • Created during intermediate merge steps
  • Scripts updated to reference new master dataset

Archive Details:

  • See archive/ARCHIVE_NOTES.md for complete documentation
  • Total archive size: 72 MB (3 files)
  • Can be safely deleted after December 11, 2025

Action Required:

  • Do NOT use for analysis
  • Do NOT reference in documentation
  • Do NOT use as merge input
  • May be deleted after 30-day retention period (December 11, 2025)
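
One way to enforce the rules above is a guard at the top of any script that takes a dataset path. The sketch below is an assumption about how such a check could look, not an existing project helper; the function name `reject_archived` is hypothetical, and the patterns are derived from the three backup file names listed in this section.

```shell
# reject_archived FILE: return non-zero if FILE sits in an archive/ directory
# or carries one of the backup name patterns listed in this document.
reject_archived() {
  case "$1" in
    */archive/*|*.backup|*.backup[0-9]*|*_backup_*)
      echo "refusing archived or backup file: $1" >&2
      return 1
      ;;
  esac
  return 0
}

# Usage inside a hypothetical merge script:
# reject_archived "$input" || exit 1
```

The check works on the path string only, so it also rejects a backup file that was copied out of archive/ without being renamed.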

📂 Country-Specific Enrichment Files

These are separate work products that contain enriched data for specific countries. They are NOT part of the master dataset yet and need to be merged/re-applied.

Tunisia Enrichment

File: tunisian_institutions_enhanced.yaml
Location: /Users/kempersc/apps/glam/data/instances/tunisia/
Size: 252 KB
Institutions: 69
Status: 🟡 ENRICHED - Not yet merged into master

Enrichment Work:

  • Wikidata enrichment performed (November 10-11, 2025)
  • Geographic data enhanced
  • Institution descriptions expanded

Current State in Master:

  • Master dataset (globalglam-20251111.yaml) shows 1.4% Wikidata coverage
  • Enrichment file shows higher coverage
  • Action needed: Re-merge enriched data into master dataset

Georgia Enrichment

File: georgian_institutions_enriched_batch3_final.yaml
Location: /Users/kempersc/apps/glam/data/instances/georgia/
Size: 22 KB
Institutions: 14
Status: 🟡 ENRICHED - Not yet merged into master

Enrichment Work:

  • Batch 1-3 enrichment completed (November 9-10, 2025)
  • Wikidata identifiers added
  • Geographic coordinates verified

Current State in Master:

  • Master dataset (globalglam-20251111.yaml) shows 0% Wikidata coverage
  • Enrichment files show significant progress
  • Action needed: Re-merge enriched data into master dataset

Latin America Enrichment (In Progress)

Locations:

  • /Users/kempersc/apps/glam/data/instances/brazil/ - Batch enrichment ongoing
  • /Users/kempersc/apps/glam/data/instances/chile/ - Batch enrichment ongoing
  • /Users/kempersc/apps/glam/data/instances/mexico/ - Geocoding complete

Status: 🔄 ACTIVE ENRICHMENT - Partially merged

Description: These directories contain batch-by-batch enrichment work. The baseline data IS in the master dataset, but ongoing enrichment updates may not be reflected yet.

Current State:

  • Master dataset contains baseline Latin America data
  • Enrichment work continues in subdirectories
  • Periodic merges update master dataset

🔄 Merge Workflow Relationship

Regional Datasets          Enrichment Work          Master Dataset
─────────────────          ───────────────          ──────────────

japan/                                               
tunisia/              →    tunisia/enhanced.yaml  →  
georgia/              →    georgia/batch3.yaml   →  globalglam-20251111.yaml
netherlands/                                         (13,502 institutions)
chile/                →    chile/batch20.yaml    →  
brazil/               →    brazil/batch6.yaml    →  
mexico/                                              
libya/                                               
...                                                  

                      ↓
              Enrichment applied
              separately, then
              merged back in

Merge Process

  1. Initial Unification (November 11, 2025):

    • Regional datasets merged → globalglam-20251111.yaml
    • Deduplication applied
    • Baseline statistics captured
  2. Ongoing Enrichment:

    • Country-specific directories contain enrichment work
    • Batch processing adds Wikidata/geocoding data
    • Files remain in subdirectories until merge
  3. Re-Merge Enrichment:

    • Enriched data merged back into master dataset
    • Master dataset updated with new identifiers
    • Statistics recalculated
  4. Current Gap:

    • Tunisia and Georgia enrichment files exist but have not been merged
    • Master dataset still shows pre-enrichment statistics for both countries
    • The merge workflow must be run to update the master

🚨 Important Notes

DO NOT Use These Files

unified_global_heritage_institutions.yaml.backup
unified_global_heritage_institutions.yaml.backup2
unified_global_heritage_institutions_backup_20251111_092645.yaml

Reason: Incomplete data (4,036 institutions vs. 13,502 in master)

DO Use These Files

globalglam-20251111.yaml - Master dataset (PRIMARY)
ENRICHMENT_CANDIDATES.yaml - Enrichment planning
DATASET_STATISTICS.yaml - Metrics and statistics
Country-specific enrichment files (for their specific enrichment data)


📋 File Naming Conventions

Master Dataset Naming

Format: globalglam-YYYYMMDD.yaml

Example: globalglam-20251111.yaml

Rationale:

  • Date-stamped for version control
  • "globalglam" prefix indicates unified dataset
  • ISO 8601 date format (sortable, unambiguous)
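
Because the date stamp is YYYYMMDD, a plain lexical sort orders master files chronologically. The sketch below shows how a script might validate a name against the convention and pick the newest master in a directory; both function names are hypothetical and not taken from the project's scripts.

```shell
# is_master_name NAME: succeed if NAME matches globalglam-YYYYMMDD.yaml.
is_master_name() {
  case "$1" in
    globalglam-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].yaml) return 0 ;;
    *) return 1 ;;
  esac
}

# latest_master DIR: print the newest date-stamped master file name in DIR.
# Lexical order equals chronological order for YYYYMMDD stamps.
latest_master() {
  for path in "$1"/globalglam-*.yaml; do
    name=${path##*/}
    if is_master_name "$name"; then
      printf '%s\n' "$name"
    fi
  done | sort | tail -n 1
}

# Example:
# latest_master /Users/kempersc/apps/glam/data/instances/all
```

The name check is purely syntactic: it confirms eight digits, not that they form a valid calendar date.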

Country-Specific Files

Format: {country}_institutions_enriched_batch{N}.yaml (nominal pattern)

Examples (existing files vary in suffix and word order, so match them loosely):

  • tunisian_institutions_enhanced.yaml
  • georgian_institutions_enriched_batch3_final.yaml
  • chilean_institutions_batch20_enriched.yaml

Backup Files

Format: {original_filename}.backup or {original_filename}_backup_YYYYMMDD_HHMMSS.yaml

Examples:

  • unified_global_heritage_institutions.yaml.backup
  • unified_global_heritage_institutions_backup_20251111_092645.yaml
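
The timestamped variant of the backup format can be generated rather than typed by hand. This is a minimal sketch; the function name `backup_name` is hypothetical, and the use of a UTC timestamp is an assumption based on the UTC times recorded elsewhere in this document.

```shell
# backup_name FILE [STAMP]: derive a timestamped backup name for a .yaml file.
# STAMP defaults to the current UTC time as YYYYMMDD_HHMMSS (assumed convention).
backup_name() {
  stamp=${2:-$(date -u +%Y%m%d_%H%M%S)}
  printf '%s_backup_%s.yaml\n' "${1%.yaml}" "$stamp"
}

# Example with the timestamp from the archived file listed above:
backup_name unified_global_heritage_institutions.yaml 20251111_092645
# prints: unified_global_heritage_institutions_backup_20251111_092645.yaml
```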

🔍 Quick Reference Commands

Check Master Dataset Size

ls -lh /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Expected: the size column shows 24M

Count Institutions in Master

grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/globalglam-20251111.yaml
# Output: 13502

List All Archived Files

ls -lh /Users/kempersc/apps/glam/data/instances/all/archive/
# Output: 3 backup files (24 MB each) + ARCHIVE_NOTES.md

Check Enrichment Candidate Count

grep -c '^- id:' /Users/kempersc/apps/glam/data/instances/all/ENRICHMENT_CANDIDATES.yaml
# Output: 13461
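
The two counts above should differ by the 41 institutions that already have complete metadata. The sketch below wraps the same grep in small helpers so the cross-check can be scripted; the function names are illustrative, and the approach assumes, as the commands above do, that every record starts with a top-level `- id:` line.

```shell
# count_records FILE: count top-level list entries that start with "- id:".
count_records() {
  grep -c '^- id:' "$1"
}

# complete_count MASTER CANDIDATES: records in MASTER that are absent
# from CANDIDATES, computed as the difference between the two counts.
complete_count() {
  echo $(( $(count_records "$1") - $(count_records "$2") ))
}

# Example (paths as used elsewhere in this document):
# dir=/Users/kempersc/apps/glam/data/instances/all
# complete_count "$dir/globalglam-20251111.yaml" "$dir/ENRICHMENT_CANDIDATES.yaml"
```

This compares counts only. It does not verify that every candidate ID actually exists in the master file.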

List Country-Specific Enrichment Files

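find /Users/kempersc/apps/glam/data/instances \( -name "*enriched*.yaml" -o -name "*enhanced*.yaml" \) -type f | sort
# Also matches *enhanced* so that tunisian_institutions_enhanced.yaml is listed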

📊 Current Status Summary

| File | Status | Size | Records | Use Case |
|------|--------|------|---------|----------|
| globalglam-20251111.yaml | ACTIVE | 24 MB | 13,502 | Master dataset |
| ENRICHMENT_CANDIDATES.yaml | ACTIVE | 2.8 MB | 13,461 | Enrichment planning |
| DATASET_STATISTICS.yaml | ACTIVE | 3.0 KB | - | Metrics |
| archive/*.backup* files | ⚠️ ARCHIVED | 72 MB | 4,036 | Moved to archive/ |
| tunisia/enhanced.yaml | 🟡 PENDING MERGE | 252 KB | 69 | Tunisia enrichment |
| georgia/batch3.yaml | 🟡 PENDING MERGE | 22 KB | 14 | Georgia enrichment |
| Country enrichment files | 🔄 ACTIVE | Varies | Varies | Ongoing enrichment |

For Data Analysis

  1. Always use globalglam-20251111.yaml as the source of truth
  2. Reference DATASET_STATISTICS.yaml for metrics
  3. Check country subdirectories for latest enrichment status

For Enrichment Work

  1. Use ENRICHMENT_CANDIDATES.yaml to identify targets
  2. Work in country-specific subdirectories
  3. Merge enriched data back into the master when a batch is complete

For Cleanup

  1. Verify master dataset integrity
  2. Delete the archived backup files in archive/ once the 30-day retention period ends (December 11, 2025) to free disk space
  3. Document any new merge workflows

For Documentation

  1. Update UNIFIED_OVERVIEW.md when master dataset changes
  2. Update DATASET_STATISTICS.yaml after merges
  3. Update this file when new authoritative files are created

📅 Version History

| Date | Action | Files Affected |
|------|--------|----------------|
| 2025-11-11 15:17 UTC | Master dataset created | globalglam-20251111.yaml |
| 2025-11-11 (later) | Scripts updated to new filename | 13 enrichment/merge scripts |
| 2025-11-11 (later) | Backup files archived | Moved to archive/ directory |
| 2025-11-11 09:26 UTC | Backup files created | *.backup* files (now archived) |
| 2025-11-10-11 | Tunisia enrichment | tunisia/enhanced.yaml |
| 2025-11-09-10 | Georgia enrichment | georgia/batch3.yaml |

Related Documentation

  • UNIFIED_OVERVIEW.md - Complete project documentation
  • ENRICHMENT_PROGRESS.md - Enrichment tracking and batch status
  • UNIFICATION_REPORT.md - November 11 merge technical details
  • UNIFICATION_SUMMARY.md - Merge process summary
  • README.md - Quick reference and navigation

Document Version: 1.0
Created: November 11, 2025
Maintained By: GLAM Data Extraction Project