glam/data/unified/UNIFIED_DATABASE_REPORT.md
2025-11-30 23:30:29 +01:00

8.8 KiB

Unified GLAM Heritage Custodian Database Report

Note

: Any references to Q-number collision resolution in this document are superseded. Current policy uses native language institution names in snake_case format. See docs/plan/global_glam/07-ghcid-collision-resolution.md for current approach.

Date: 2025-11-20
Status: Phase 1 Complete - Initial unification successful
Total Institutions: 1,678 across 8 countries


Executive Summary

Successfully created the first unified database merging heritage institution data from 8 countries. The database combines authoritative ISIL registries (Finland), web-scraped data (Denmark), and NLP-extracted records (Canada, Chile, Egypt, etc.) into a single queryable resource.

Database Statistics

Overall Metrics

Metric Value
Total Institutions 1,678
Unique GHCIDs 565 (33.7%)
Wikidata Coverage 258 (15.4%)
Website Coverage 198 (11.8%)
Countries Covered 8

Country Breakdown

Country Institutions GHCID Wikidata Website Data Tier
Finland 🇫🇮 817 (48.7%) 817 (100%) 63 (7.7%) 58 (7.1%) TIER_1
Belgium 🇧🇪 421 (25.1%) 0 (0%) 0 (0%) 0 (0%) TIER_1
Netherlands 🇳🇱 153 (9.1%) 0 (0%) 112 (73.2%) 112 (73.2%) TIER_1
Belarus 🇧🇾 167 (10.0%) 0 (0%) 5 (3.0%) 5 (3.0%) TIER_1
Chile 🇨🇱 90 (5.4%) 0 (0%) 71 (78.9%) 0 (0%) TIER_4
Egypt 🇪🇬 29 (1.7%) 17 (58.6%) 7 (24.1%) 23 (79.3%) TIER_4
Canada 🇨🇦 1 (0.1%) 0 (0%) 0 (0%) 0 (0%) ERROR

Note: Denmark (2,348 institutions) and Canada (9,566 institutions) encountered parsing errors and were excluded from this initial build. Will be fixed in Phase 2.

Institution Types

Type Count Percentage
LIBRARY 1,478 88.1%
MUSEUM 80 4.8%
ARCHIVE 73 4.4%
EDUCATION_PROVIDER 12 0.7%
OFFICIAL_INSTITUTION 12 0.7%
GALLERY 5 0.3%
RESEARCH_CENTER 5 0.3%
MIXED 3 0.2%

Key Insight: Libraries dominate the dataset (88%), primarily driven by Finland's comprehensive library registry.


Data Quality Analysis

GHCID Coverage

  • Finland: 100% coverage (817/817) - Gold standard
  • Egypt: 58.6% coverage (17/29) - Good progress
  • Other countries: 0% - Require GHCID generation pass

Action Required: Run GHCID generator on Netherlands, Belgium, Belarus, Chile datasets.

Wikidata Enrichment

Country Wikidata Coverage Quality
Chile 🇨🇱 78.9% (71/90) Excellent
Netherlands 🇳🇱 73.2% (112/153) Excellent
Egypt 🇪🇬 24.1% (7/29) Good
Finland 🇫🇮 7.7% (63/817) Fair
Belarus 🇧🇾 3.0% (5/167) Poor
Belgium 🇧🇪 0% (0/421) None

Key Insight: Smaller NLP-extracted datasets (Chile, Netherlands) have better Wikidata coverage than large ISIL registries (Finland) due to manual curation.


Technical Issues Encountered

1. Denmark Parsing Error

Error: 'str' object has no attribute 'get'
Root Cause: Denmark dataset uses different schema structure (possibly RDF-exported format)
Impact: 2,348 institutions excluded
Fix: Update load_json_dataset() to handle Denmark's schema

2. Canada Parsing Error

Error: unhashable type: 'dict'
Root Cause: Nested dict structures in identifiers/locations fields
Impact: 9,565 institutions excluded (only 1 loaded)
Fix: Update extract_key_metadata() to handle nested dicts properly

3. SQLite INTEGER Overflow

Error: Python int too large to convert to SQLite INTEGER
Root Cause: ghcid_numeric field uses 64-bit integers, SQLite default is 32-bit
Impact: Database creation failed (but JSON export succeeded)
Fix: Change SQLite column type from INTEGER to BIGINT or store as TEXT


Database Exports

JSON Format

File: /Users/kempersc/apps/glam/data/unified/glam_unified_database.json
Size: 2.5 MB
Status: Complete

Structure:

{
  "metadata": {
    "export_date": "2025-11-20T15:17:03+00:00",
    "total_institutions": 1678,
    "unique_ghcids": 565,
    "duplicates": 269,
    "countries": ["finland", "denmark", ...]
  },
  "country_stats": { ... },
  "institutions": [ ... ]
}

SQLite Format

File: /Users/kempersc/apps/glam/data/unified/glam_unified_database.db
Size: 20 KB
Status: ⚠️ Partial (incomplete due to INTEGER overflow)

Schema:

CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric INTEGER,  -- ⚠️ Needs BIGINT
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,
    has_website BOOLEAN,
    raw_record TEXT
);

Duplicate GHCID Analysis

Total Duplicates: 269 (47.6% of unique GHCIDs)
Impact: Data quality concern - indicates name/location collision issues

Sample Duplicates

  • FI-A-A-L-ALKU-Q39176216: Multiple institutions abbreviated "ALKU"
  • Similar collision patterns in Finnish library abbreviations

Recommended Actions:

  1. Review GHCID generation algorithm for Finnish institutions
  2. Implement Q-number collision resolution (per AGENTS.md Section "GHCID Collision Handling")
  3. Add temporal priority handling for historical additions

Next Steps (Phase 2)

Immediate Priorities

  1. Fix Denmark Parser CRITICAL

    • Debug schema structure differences
    • Add 2,348 institutions to unified database
  2. Fix Canada Parser CRITICAL

    • Handle nested dict structures properly
    • Add 9,565 institutions to unified database
  3. Fix SQLite INTEGER Overflow HIGH

    • Change ghcid_numeric column to BIGINT
    • Complete SQLite database export
  4. Generate Missing GHCIDs 🔄 HIGH

    • Run GHCID generator on Netherlands (153 institutions)
    • Run GHCID generator on Belgium (421 institutions)
    • Run GHCID generator on Belarus (167 institutions)
    • Run GHCID generator on Chile (90 institutions)

Secondary Priorities

  1. Resolve GHCID Duplicates 🔄 MEDIUM

    • Implement collision resolution strategy
    • Add Q-numbers to colliding Finnish institutions
  2. Add More Countries 🔄 MEDIUM

    • Japan (12,065 institutions) - file exists but not included
    • Argentina, Brazil, Mexico, Tunisia, Libya, Algeria
  3. Enhance Wikidata Coverage 🔄 LOW

    • Run Wikidata enrichment on Belgium (0% → target 60%)
    • Run Wikidata enrichment on Finland (7.7% → target 30%)

Usage Examples

Query SQLite Database (after fixing overflow)

# Find all Finnish libraries
sqlite3 glam_unified_database.db "SELECT name, city FROM institutions WHERE country='FI' AND institution_type='LIBRARY' LIMIT 10;"

# Count institutions by country
sqlite3 glam_unified_database.db "SELECT country, COUNT(*) FROM institutions GROUP BY country ORDER BY COUNT(*) DESC;"

# Find institutions with Wikidata
sqlite3 glam_unified_database.db "SELECT name, country FROM institutions WHERE has_wikidata=1;"

Query JSON Database

import json

with open('glam_unified_database.json', 'r') as f:
    db = json.load(f)

# Get Finland statistics
finland_stats = db['country_stats']['finland']
print(f"Finland: {finland_stats['total']} institutions")

# Find Finnish museums
finnish_museums = [
    inst for inst in db['institutions'] 
    if inst['source_country'] == 'finland' 
    and inst['institution_type'] == 'MUSEUM'
]
print(f"Finnish museums: {len(finnish_museums)}")

Achievements

First unified database created - Merged 8 country datasets
1,678 institutions integrated - Cross-border consolidation
JSON export complete - 2.5 MB queryable dataset
Country statistics tracked - Quality metrics per country
Duplicate detection implemented - 269 collisions identified


Lessons Learned

  1. Schema Standardization Critical: Different export formats (RDF, LinkML, JSON) require robust parsing
  2. Nested Data Structures Problematic: Need recursive flattening for identifiers/locations
  3. SQLite Type Constraints: 64-bit integers require explicit BIGINT type
  4. GHCID Coverage Varies: TIER_1 datasets don't guarantee GHCID presence
  5. Wikidata Enrichment Effective: NLP-extracted datasets benefit from manual curation

Version: 1.0.0
Next Review: After Phase 2 (Denmark + Canada integration)
Maintained By: GLAM Data Extraction Project