glam/data/unified/QUICK_START_UNIFIED_DB.md
2025-11-21 22:12:33 +01:00

Unified GLAM Database - Quick Start

Last Updated: 2025-11-20
Database Version: 1.0.0 (Phase 1)
Total Institutions: 1,678 across 8 countries


Quick Access

Database Files

# JSON format (2.5 MB, complete)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.json

# SQLite format (20 KB, partial due to overflow issue)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.db

Query Examples

Python (JSON)

import json

# Load database (explicit UTF-8, since institution names contain non-ASCII characters)
with open('data/unified/glam_unified_database.json', 'r', encoding='utf-8') as f:
    db = json.load(f)

# Get metadata
print(f"Total institutions: {db['metadata']['total_institutions']}")
print(f"Countries: {', '.join(db['metadata']['countries'])}")

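# Find Finnish museums
# (source_country is the lowercase dataset name, e.g. 'finland';
#  .get() avoids a KeyError on records that lack a field)
finnish_museums = [
    inst for inst in db['institutions']
    if inst.get('source_country') == 'finland'
    and inst.get('institution_type') == 'MUSEUM'
]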
print(f"Finnish museums: {len(finnish_museums)}")

# Get country statistics
for country, stats in db['country_stats'].items():
    print(f"{country}: {stats['total']} institutions ({stats['with_wikidata']} with Wikidata)")

SQLite (after fixing overflow)

# Count by country
sqlite3 data/unified/glam_unified_database.db \
  "SELECT country, COUNT(*) FROM institutions GROUP BY country ORDER BY COUNT(*) DESC;"

# Find institutions with Wikidata
sqlite3 data/unified/glam_unified_database.db \
  "SELECT name, country FROM institutions WHERE has_wikidata=1 LIMIT 10;"

# Search by institution type
sqlite3 data/unified/glam_unified_database.db \
  "SELECT name, city FROM institutions WHERE institution_type='MUSEUM';"

Database Schema

JSON Structure

{
  "metadata": {
    "export_date": "2025-11-20T15:17:03+00:00",
    "total_institutions": 1678,
    "unique_ghcids": 565,
    "duplicates": 269,
    "countries": ["finland", "denmark", ...]
  },
  "country_stats": {
    "finland": {
      "total": 817,
      "with_ghcid": 817,
      "with_wikidata": 63,
      "with_website": 58,
      "by_type": {"LIBRARY": 789, "MUSEUM": 15, ...}
    }
  },
  "institutions": [
    {
      "id": "https://w3id.org/heritage/custodian/fi/...",
      "ghcid": "FI-A-A-L-ALKU-Q39176216",
      "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Alakylän kirjasto",
      "institution_type": "LIBRARY",
      "country": "FI",
      "city": "Alavi",
      "has_wikidata": true,
      "has_website": false,
      "raw_record": "{...full LinkML record...}"
    }
  ]
}
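
The raw_record field in the structure above appears to hold the full record as a JSON-encoded string rather than a nested object. A minimal sketch of decoding it defensively, assuming that encoding holds for every record; the helper name decode_raw_record and the "identifiers" key in the sample are hypothetical:

```python
import json

def decode_raw_record(inst: dict) -> dict:
    """Return the embedded record as a dict, or {} if absent or not valid JSON."""
    raw = inst.get("raw_record")
    if isinstance(raw, dict):
        # Already a nested object: nothing to decode.
        return raw
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}

# Made-up sample record, not taken from the real database.
sample = {
    "name": "Example Library",
    "raw_record": json.dumps({"name": "Example Library", "identifiers": []}),
}
record = decode_raw_record(sample)
print(record["name"])
```

Returning an empty dict on failure keeps list comprehensions over db['institutions'] from aborting on a single malformed record.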

SQLite Schema

CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric INTEGER,  -- ⚠️ Overflow issue
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,  -- stored as 0/1 (SQLite has no native boolean type)
    has_website BOOLEAN,   -- stored as 0/1
    raw_record TEXT  -- Full JSON record
);

CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT
);
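
One possible workaround for the ghcid_numeric overflow, sketched under the assumption that the failure comes from values outside SQLite's signed 64-bit INTEGER range (Python's sqlite3 module refuses to bind such integers). The sketch stores the value as TEXT and converts it back on read. It is not the fix used by build_unified_database.py; the table is reduced to two columns and the value is made up:

```python
import sqlite3

SQLITE_MAX_INT = 2**63 - 1  # largest value an SQLite INTEGER can hold

conn = sqlite3.connect(":memory:")
# Reduced table for illustration: ghcid_numeric is declared TEXT, not INTEGER.
conn.execute("CREATE TABLE institutions (id TEXT PRIMARY KEY, ghcid_numeric TEXT)")

def insert_numeric(conn, inst_id, ghcid_numeric):
    """Store the numeric GHCID as a decimal string so no precision is lost."""
    value = None if ghcid_numeric is None else str(ghcid_numeric)
    conn.execute("INSERT INTO institutions VALUES (?, ?)", (inst_id, value))

big = 2**64 + 5  # made-up value that does not fit in 64 bits
insert_numeric(conn, "example-1", big)

stored = conn.execute(
    "SELECT ghcid_numeric FROM institutions WHERE id = ?", ("example-1",)
).fetchone()[0]
restored = int(stored)  # convert back to a Python int on read
print(restored == big)
```

The trade-off is that numeric comparisons and sorting on this column no longer work in SQL without a cast, so this only suits a column used as an identifier.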

Statistics at a Glance

Overall

  • Total Institutions: 1,678
  • Unique GHCIDs: 565 (33.7%)
  • Wikidata Coverage: 258 (15.4%)
  • Website Coverage: 198 (11.8%)

By Country

Country          Institutions  GHCID coverage  Wikidata coverage  Data tier
🇫🇮 Finland       817           100%            7.7%               TIER_1
🇧🇪 Belgium       421           0%              0%                 TIER_1
🇧🇾 Belarus       167           0%              3.0%               TIER_1
🇳🇱 Netherlands   153           0%              73.2%              TIER_1
🇨🇱 Chile         90            0%              78.9%              TIER_4
🇪🇬 Egypt         29            58.6%           24.1%              TIER_4

By Institution Type

  • Libraries: 1,478 (88.1%)
  • Museums: 80 (4.8%)
  • Archives: 73 (4.4%)
  • Education Providers: 12 (0.7%)
  • Official Institutions: 12 (0.7%)
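
The figures in this section can in principle be recomputed from the institutions list in the JSON export. A rough sketch, assuming each record carries institution_type and has_wikidata as in the JSON structure; the summarize helper is hypothetical and the sample rows are made up:

```python
from collections import Counter

def summarize(institutions):
    """Return counts and percentages by type, plus Wikidata coverage."""
    total = len(institutions)

    def pct(n):
        # Percentage of the total, rounded to one decimal like the report.
        return round(100 * n / total, 1) if total else 0.0

    by_type = Counter(inst.get("institution_type", "UNKNOWN") for inst in institutions)
    with_wikidata = sum(1 for inst in institutions if inst.get("has_wikidata"))
    return {
        "total": total,
        "by_type": {t: (n, pct(n)) for t, n in by_type.most_common()},
        "wikidata": (with_wikidata, pct(with_wikidata)),
    }

# Made-up rows for illustration only.
rows = [
    {"institution_type": "LIBRARY", "has_wikidata": True},
    {"institution_type": "LIBRARY", "has_wikidata": False},
    {"institution_type": "LIBRARY"},
    {"institution_type": "MUSEUM", "has_wikidata": False},
]
summary = summarize(rows)
print(summary)
```

Against the real file, summarize(db['institutions']) would be the call, with db loaded as in the Python query example.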

Known Limitations (Phase 1)

  1. ⚠️ Denmark excluded (2,348 institutions) - parser error
  2. ⚠️ Canada excluded (9,565 institutions) - nested dict error
  3. ⚠️ SQLite export incomplete - INTEGER overflow on ghcid_numeric
  4. 🔍 269 duplicate GHCIDs - need collision resolution
  5. 📝 Missing GHCIDs - none assigned for Belgium, Netherlands, Belarus, or Chile; Egypt only partially covered (58.6%)

Phase 2 is planned to fix these issues and bring the total to 13,591 institutions (the current 1,678 plus 2,348 from Denmark and 9,565 from Canada).
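
As a starting point for collision resolution, duplicate GHCIDs can be located with a simple count. This sketch assumes "duplicates" means records that have a GHCID minus the number of distinct GHCIDs, which is consistent with the reported figures (roughly 834 records with a GHCID, 565 unique, 269 duplicates). The ghcid_duplicates helper is hypothetical and the sample values are made up:

```python
from collections import Counter

def ghcid_duplicates(institutions):
    """Count records with a GHCID, distinct GHCIDs, and colliding GHCIDs."""
    counts = Counter(inst["ghcid"] for inst in institutions if inst.get("ghcid"))
    with_ghcid = sum(counts.values())
    unique = len(counts)
    # GHCIDs shared by more than one record, mapped to how many records share them.
    collisions = {ghcid: n for ghcid, n in counts.items() if n > 1}
    return {
        "with_ghcid": with_ghcid,
        "unique": unique,
        "duplicates": with_ghcid - unique,
        "collisions": collisions,
    }

# Made-up identifiers for illustration only.
rows = [
    {"ghcid": "XX-EXAMPLE-1"},
    {"ghcid": "XX-EXAMPLE-1"},
    {"ghcid": "XX-EXAMPLE-2"},
    {"ghcid": None},
    {"ghcid": "XX-EXAMPLE-1"},
]
report = ghcid_duplicates(rows)
print(report)
```

The collisions mapping lists which identifiers need attention first; records without a GHCID are ignored rather than counted as duplicates of each other.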


Rebuilding the Database

To rebuild with updated country datasets:

# Run the unification script
python3 scripts/build_unified_database.py

# Output will be in:
# - data/unified/glam_unified_database.json
# - data/unified/glam_unified_database.db

To add a new country dataset:

  1. Open scripts/build_unified_database.py
  2. Add the country to the COUNTRY_DATASETS dict, together with the path to its dataset
  3. Run the script
  4. Check UNIFIED_DATABASE_REPORT.md for the results
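
The steps above might look roughly like the following. The actual shape of COUNTRY_DATASETS in build_unified_database.py is not shown in this guide, so the mapping from country name to path, the file names, and both helper functions are assumptions for illustration only:

```python
from pathlib import Path

# Hypothetical shape: lowercase country name -> dataset path.
# The real dict in scripts/build_unified_database.py may be structured differently.
COUNTRY_DATASETS = {
    "finland": Path("data/finland_isil/example_dataset.yaml"),  # hypothetical file name
}

def register_country(datasets, country, path):
    """Add a country under a normalized lowercase key; refuse to overwrite."""
    key = country.strip().lower()
    if key in datasets:
        raise ValueError(f"{key} is already registered")
    datasets[key] = Path(path)
    return key

def missing_datasets(datasets):
    """List countries whose dataset file does not exist on disk."""
    return sorted(c for c, p in datasets.items() if not p.exists())

key = register_country(COUNTRY_DATASETS, " Denmark ", "data/denmark/example_dataset.yaml")
print(key, missing_datasets(COUNTRY_DATASETS))
```

Checking for missing files before running the build gives an early, readable failure instead of an error partway through unification.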

Documentation

  • Full Report: UNIFIED_DATABASE_REPORT.md - Detailed statistics and analysis
  • Session Summary: SESSION_SUMMARY_20251120_FINLAND_UNIFIED.md - Summary of the 2025-11-20 working session
  • Finland Report: data/finland_isil/FINLAND_ISIL_HARVEST_REPORT.md - Finnish dataset details
  • Main Progress: PROGRESS.md - Overall project status

Support

For questions or issues:

  • Check UNIFIED_DATABASE_REPORT.md for detailed documentation
  • Review AGENTS.md for extraction guidelines
  • See PROGRESS.md for project history

Version: 1.0.0 (Phase 1)
Next Update: Phase 2 (Denmark + Canada integration)