GLAM Data Extraction - Quick Reference Index

Last Updated: November 11, 2025


📂 Directory: data/instances/all/

This directory contains unified documentation and statistics for the entire GLAM Data Extraction Project.

Files in this Directory

File                        Description                      Size    Use Case
globalglam-20251111.yaml    Master dataset (PRIMARY)         24 MB   All analysis and exports
UNIFIED_OVERVIEW.md         Complete project documentation   22 KB   Read for comprehensive understanding
FILE_STATUS.md              File authority reference         11 KB   Determine which files to use
ENRICHMENT_PROGRESS.md      Wikidata enrichment tracking     11 KB   Track batch-by-batch progress
ENRICHMENT_CANDIDATES.yaml  Institutions needing enrichment  2.8 MB  Enrichment planning
DATASET_STATISTICS.yaml     Machine-readable metrics         3.3 KB  Parse for programmatic analysis
README.md                   This file (quick reference)      5 KB    Start here for navigation

🚀 Quick Start

For Humans: Read the Overview

Start with UNIFIED_OVERVIEW.md for:

  • Executive summary of 13,502 institutions across 18 countries
  • Schema architecture (LinkML v0.2.1)
  • Geographic coverage by region
  • Data quality framework
  • Technical implementation details
  • Known issues and solutions

For File Navigation: Check File Status

Read FILE_STATUS.md for:

  • Which files are authoritative (use these)
  • Which files are archived (don't use)
  • Relationship between master dataset and enrichment files
  • Merge workflow documentation
  • Quick reference commands

For Tracking Progress: Check Enrichment Status

Read ENRICHMENT_PROGRESS.md for:

  • Current Wikidata coverage by country
  • Completed batches (Brazil, Chile)
  • Next steps and timelines
  • Batch execution workflow
  • Quality metrics and accuracy rates

For Programmatic Access: Parse Statistics

Use DATASET_STATISTICS.yaml for:

  • Institution counts by country
  • Wikidata/geocoding coverage percentages
  • Institution type distributions
  • Data tier classifications
  • Regional totals

📊 Key Statistics (as of Nov 11, 2025)

Global Dataset:       13,502 institutions
Countries:            18 (Japan, Tunisia, Georgia, Netherlands, Chile, Brazil, Mexico, Libya, and 10 others)
Wikidata Enriched:    7,520 (55.7%)
Geocoded:             8,178 (60.6%)
Master File:          globalglam-20251111.yaml (24 MB)

Top Countries:
  Japan:      12,065 institutions (89.4% of dataset)
  Netherlands:   622 institutions
  Tunisia:        69 institutions
  Mexico:         57 institutions
  Chile:          56 institutions

🎯 Current Focus: Enrichment and Merge Workflow

Tunisia (Enrichment Complete, Pending Merge)

  • Status: Enhanced file ready (252 KB)
  • Location: data/instances/tunisia/tunisian_institutions_enhanced.yaml
  • Action: Merge the enriched data back into the master dataset

Georgia (Enrichment Complete, Pending Merge)

  • Status: Batch 3 complete (22 KB)
  • Location: data/instances/georgia/georgian_institutions_enriched_batch3_final.yaml
  • Action: Merge the enriched data back into the master dataset
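
Both pending merges follow the same pattern: records from an enriched country file update their counterparts in the master dataset. The sketch below shows one way this could look. It assumes each record is a dict with a unique `id` key and that enriched fields should override master fields. The key name, the override rule, the helper `merge_enriched`, and the sample records are all illustrative assumptions, not the project's documented merge workflow (FILE_STATUS.md covers that).

```python
from typing import Any

Record = dict[str, Any]


def merge_enriched(master: list[Record], enriched: list[Record], key: str = "id") -> list[Record]:
    """Return a new list where enriched records update matching master records.

    Hypothetical helper: matching on `key` and letting enriched values win
    are assumptions made for illustration.
    """
    enriched_by_key = {rec[key]: rec for rec in enriched}
    merged = []
    for rec in master:
        update = enriched_by_key.pop(rec[key], None)
        # Shallow merge: enriched fields override, untouched fields survive.
        merged.append({**rec, **update} if update else dict(rec))
    # Enriched records with no master counterpart are appended at the end.
    merged.extend(dict(rec) for rec in enriched_by_key.values())
    return merged


# Made-up sample records; identifiers and field names are placeholders.
master = [
    {"id": "tn-001", "name": "Example Library", "country": "TN"},
    {"id": "ge-001", "name": "Example Museum", "country": "GE"},
]
enriched = [
    {"id": "tn-001", "wikidata_id": "Q0000000"},
    {"id": "tn-002", "name": "Example Archive", "country": "TN"},
]

result = merge_enriched(master, enriched)
print(len(result))  # 3
```

The function returns a new list rather than mutating `master`, so the merged result can be compared against the original before anything is written back to disk.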

Latin America (Active Enrichment)

  • Chile: Batch 20 enriched (53.9% Wikidata coverage)
  • Brazil: Batch 6 enriched (13.7% Wikidata coverage)
  • Mexico: Geocoding complete, enrichment starting

📁 Project File Structure

data/instances/
├── all/                          ← YOU ARE HERE
│   ├── globalglam-20251111.yaml  ← Master dataset (PRIMARY - 13,502 institutions)
│   ├── UNIFIED_OVERVIEW.md       ← Complete documentation
│   ├── FILE_STATUS.md            ← File authority reference
│   ├── ENRICHMENT_PROGRESS.md    ← Wikidata enrichment tracking
│   ├── ENRICHMENT_CANDIDATES.yaml ← Institutions needing enrichment
│   ├── DATASET_STATISTICS.yaml   ← Machine-readable metrics
│   └── README.md                 ← This file
│
├── brazil/
│   └── brazilian_institutions_batch6_enriched.yaml (current)
│
├── chile/
│   └── chilean_institutions_batch20_enriched.yaml (current)
│
├── mexico/
│   └── mexican_institutions_geocoded.yaml (current)
│
├── tunisia/
│   └── tunisian_institutions_enhanced.yaml (enriched, pending merge)
│
├── georgia/
│   └── georgian_institutions_enriched_batch3_final.yaml (enriched, pending merge)
│
├── japan/
│   └── jp_institutions_resolved.yaml (12,065 institutions)
│
├── global/
│   └── global_heritage_institutions_merged.yaml (older version)
│
└── exports/
    ├── *.jsonld (JSON-LD)
    ├── *.csv (CSV)
    └── *.geojson (GeoJSON)
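
As an illustration of the GeoJSON export format listed in the tree, the sketch below turns one institution record into a GeoJSON Feature. The `name`, `latitude`, and `longitude` keys, the helper `to_geojson_feature`, and the sample record are hypothetical; the real field names are defined by the project schema and may differ.

```python
import json


def to_geojson_feature(record: dict) -> dict:
    """Build a GeoJSON Feature from one institution record.

    The `latitude` and `longitude` keys are hypothetical field names.
    """
    return {
        "type": "Feature",
        # GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [record["longitude"], record["latitude"]],
        },
        # Everything except the coordinates is carried over as properties.
        "properties": {
            k: v for k, v in record.items() if k not in ("latitude", "longitude")
        },
    }


records = [{"name": "Example Museum", "latitude": 52.37, "longitude": 4.89}]  # made-up record
collection = {
    "type": "FeatureCollection",
    "features": [to_geojson_feature(r) for r in records],
}
print(json.dumps(collection, indent=2))
```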

Core Documentation

  • FILE_STATUS.md: Which files are authoritative (START HERE for file navigation)
  • UNIFIED_OVERVIEW.md: Complete project documentation
  • ENRICHMENT_PROGRESS.md: Enrichment tracking
  • Agent Instructions: /Users/kempersc/apps/glam/AGENTS.md
  • Project Progress: /Users/kempersc/apps/glam/PROGRESS.md
  • Schema Modules: /Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md

Technical Specifications

  • Persistent Identifiers: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
  • UUID Strategy: /Users/kempersc/apps/glam/docs/UUID_STRATEGY.md
  • GHCID Collisions: /Users/kempersc/apps/glam/docs/plan/global_glam/07-ghcid-collision-resolution.md

Schema Files

  • Main Schema: /Users/kempersc/apps/glam/schemas/heritage_custodian.yaml
  • Core Classes: /Users/kempersc/apps/glam/schemas/core.yaml
  • Enumerations: /Users/kempersc/apps/glam/schemas/enums.yaml
  • Provenance: /Users/kempersc/apps/glam/schemas/provenance.yaml

🎓 Institution Type Taxonomy (GLAMORCUBESFIXPHDNT)

19 types with single-letter GHCID codes:

Code  Type                       Example
G     GALLERY                    Commercial galleries
L     LIBRARY                    National libraries
A     ARCHIVE                    National archives
M     MUSEUM                     Art/history museums
O     OFFICIAL_INSTITUTION       Government agencies
R     RESEARCH_CENTER            Research institutes
C     CORPORATION                Corporate archives
U     UNKNOWN                    Type undetermined
B     BOTANICAL_ZOO              Botanical gardens, zoos
E     EDUCATION_PROVIDER         Universities, schools
S     COLLECTING_SOCIETY         Historical societies
F     FEATURES                   Physical landmarks (monuments, statues, memorials)
I     INTANGIBLE_HERITAGE_GROUP  Traditional performance groups, oral history
X     MIXED                      Multiple types
P     PERSONAL_COLLECTION        Private collections
H     HOLY_SITES                 Religious heritage sites
D     DIGITAL_PLATFORM           Online archives, digital libraries
N     NGO                        Heritage advocacy organizations
T     TASTE_SMELL                Historic restaurants, perfumeries, distilleries

Mnemonic: GLAMORCUBESFIXPHDNT
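
For scripts that need to translate between codes and type names, the taxonomy can be held in a plain dictionary. This is a minimal sketch: the names `GHCID_TYPE_CODES` and `CODE_FOR_TYPE` are illustrative, and the authoritative definitions live in the schema's enumerations file.

```python
# Code-to-type mapping, in mnemonic order (dicts preserve insertion order).
GHCID_TYPE_CODES = {
    "G": "GALLERY",
    "L": "LIBRARY",
    "A": "ARCHIVE",
    "M": "MUSEUM",
    "O": "OFFICIAL_INSTITUTION",
    "R": "RESEARCH_CENTER",
    "C": "CORPORATION",
    "U": "UNKNOWN",
    "B": "BOTANICAL_ZOO",
    "E": "EDUCATION_PROVIDER",
    "S": "COLLECTING_SOCIETY",
    "F": "FEATURES",
    "I": "INTANGIBLE_HERITAGE_GROUP",
    "X": "MIXED",
    "P": "PERSONAL_COLLECTION",
    "H": "HOLY_SITES",
    "D": "DIGITAL_PLATFORM",
    "N": "NGO",
    "T": "TASTE_SMELL",
}

MNEMONIC = "GLAMORCUBESFIXPHDNT"

# Sanity checks: 19 distinct codes whose order spells the mnemonic.
assert len(GHCID_TYPE_CODES) == 19
assert "".join(GHCID_TYPE_CODES) == MNEMONIC

# Reverse lookup from type name to single-letter code.
CODE_FOR_TYPE = {name: code for code, name in GHCID_TYPE_CODES.items()}

print(GHCID_TYPE_CODES["M"])      # MUSEUM
print(CODE_FOR_TYPE["ARCHIVE"])   # A
```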


📈 Coverage Goals

Latin America (Priority)

Country  Current  Goal   Status
Brazil   6.1%     17.4%  🟡 Batch 6 complete
Chile    6.7%     22.2%  🟢 Batch 2 complete
Mexico   0.0%     17.1%  🔴 Not started

Regional Goal: 60+ institutions (19% coverage)

Global

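  • Short-term: 8,000 institutions (31% coverage)
  • Long-term: 13,000 institutions (50% coverage)

These percentages appear to be measured against the pre-deduplication total of 25,963 records (13,502 unique institutions plus 12,461 removed duplicates). Against the current 13,502 institutions, the same targets correspond to about 59% and 96%.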

💡 Quick Tips

For Reading

  1. Start with FILE_STATUS.md to understand file relationships
  2. Read UNIFIED_OVERVIEW.md for big picture
  3. Consult ENRICHMENT_PROGRESS.md for detailed enrichment status
  4. Reference DATASET_STATISTICS.yaml for metrics

For Analysis

import yaml  # PyYAML

# Load master dataset (24 MB; parsing may take a while)
with open('data/instances/all/globalglam-20251111.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

print(f"Total institutions: {len(institutions)}")

# Load statistics
with open('data/instances/all/DATASET_STATISTICS.yaml', 'r', encoding='utf-8') as f:
    stats = yaml.safe_load(f)

# Access Wikidata coverage for one country
japan = stats['datasets']['japan']
print(f"Japan: {japan['wikidata_coverage']['count']}/{japan['total_institutions']}")

For Navigation

  • Country datasets: data/instances/{country}/
  • Global merged: data/instances/global/
  • Exports: data/instances/exports/
  • Scripts: scripts/enrich_*.py, scripts/geocode_*.py
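
The directory layout above can also be walked programmatically. The sketch below lists the YAML files in each country directory, treating `all`, `global`, and `exports` as shared (non-country) directories. The helper name `country_datasets` is illustrative, and the code assumes it is run from the repository root.

```python
from pathlib import Path


def country_datasets(instances_root: Path) -> dict[str, list[Path]]:
    """Map each country directory name to its YAML files, skipping shared directories."""
    shared = {"all", "global", "exports"}  # non-country directories in the project tree
    return {
        d.name: sorted(d.glob("*.yaml"))
        for d in sorted(instances_root.iterdir())
        if d.is_dir() and d.name not in shared
    }


root = Path("data/instances")
if root.is_dir():  # only list files when run from the repository root
    for country, files in country_datasets(root).items():
        print(country, [f.name for f in files])
```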

🏆 Recent Achievements

November 11, 2025

  • Master dataset unified: 13,502 institutions from 18 countries
  • Major deduplication: 48.0% duplicate rate (12,461 duplicates removed)
  • File status documented: Created FILE_STATUS.md for clarity
  • Documentation updated: All metrics reflect actual merged dataset
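
The two deduplication figures are consistent with each other if the duplicate rate is taken relative to the pre-deduplication record count, that is, unique institutions plus removed duplicates. That reading of "duplicate rate" is an assumption; the arithmetic can be checked as follows.

```python
unique_institutions = 13_502  # master dataset after unification
duplicates_removed = 12_461   # duplicates removed during unification

# Assumed definition: rate = duplicates / (unique + duplicates).
records_before = unique_institutions + duplicates_removed
duplicate_rate = 100 * duplicates_removed / records_before

print(records_before)            # 25963
print(f"{duplicate_rate:.1f}%")  # 48.0%
```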

November 10-11, 2025

  • Tunisia enrichment: Enhanced file with Wikidata data (pending merge)
  • Georgia enrichment: Batch 3 complete (pending merge)

November 6-9, 2025

  • Chilean enrichment: Batch 20 complete (53.9% coverage)
  • Brazilian enrichment: Batch 6 complete (13.7% coverage)
  • Mexican geocoding: 117 institutions (100% geocoded)

🚦 Next Steps

Immediate (High Priority)

  1. Merge Tunisia enrichment → Update master dataset with enhanced data
  2. Merge Georgia enrichment → Update master dataset with batch 3 results
  3. Update statistics → Recalculate coverage after merges
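
Step 3 amounts to recounting institutions and Wikidata-enriched institutions per country once the merges are done. A minimal sketch of that recount, assuming each record carries a `country` field and an optional `wikidata_id` field; both field names, the output keys, and the sample records are hypothetical and may not match the project schema or the layout of DATASET_STATISTICS.yaml.

```python
from collections import Counter


def coverage_by_country(records: list[dict]) -> dict[str, dict]:
    """Count total and Wikidata-enriched institutions per country.

    `country` and `wikidata_id` are hypothetical field names.
    """
    totals, enriched = Counter(), Counter()
    for rec in records:
        country = rec["country"]
        totals[country] += 1
        if rec.get("wikidata_id"):
            enriched[country] += 1
    return {
        country: {
            "total_institutions": total,
            "wikidata_count": enriched[country],
            "wikidata_pct": round(100 * enriched[country] / total, 1),
        }
        for country, total in totals.items()
    }


# Made-up records; the Wikidata identifiers are placeholders.
sample = [
    {"country": "TN", "wikidata_id": "Q0000001"},
    {"country": "TN"},
    {"country": "GE", "wikidata_id": "Q0000002"},
]

recalculated = coverage_by_country(sample)
print(recalculated["TN"])
# {'total_institutions': 2, 'wikidata_count': 1, 'wikidata_pct': 50.0}
```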

Short-Term (This Week)

  • Continue Latin America enrichment (Brazil Batch 7, Chile Batch 21)
  • Start Japan enrichment pilot (50 institutions)
  • Document merge workflow for reproducibility

Long-Term (This Month)

  • Reach 8,000 Wikidata-enriched institutions (roughly 60% coverage)
  • Implement automated merge pipeline
  • Create export formats (JSON-LD, RDF, CSV)


Document Version: 1.0
Created: November 9, 2025
Maintained By: GLAM Data Extraction Project