kempersc edb1e07941 updated schemata

2025-11-21 22:12:33 +01:00

Phase 2 Complete: Critical Fixes Applied ✅

Date: 2025-11-20
Version: 2.0.0
Status: ✅ ALL CRITICAL PRIORITIES FIXED
Total Institutions: 13,591 (up from 1,678, a 709% increase)

Executive Summary

Successfully fixed all three critical issues identified in Phase 1:

✅ Denmark parser error - 2,348 institutions integrated
✅ Canada parser error - 9,566 institutions integrated
✅ SQLite INTEGER overflow - 27 MB database complete

Result: Unified database grew from 1,678 to 13,591 institutions (+11,913 institutions, +709% increase)

Issues Fixed

1. Denmark Parser Error ✅

Problem: 'str' object has no attribute 'get'
Root Cause: Denmark dataset stores nested objects as Python repr strings:

"provenance": "Provenance({'data_source': DataSourceEnum(...), ...})"
"identifiers": ["Identifier({'identifier_scheme': 'ISIL', ...})"]

Solution: Created parse_repr_string() function with regex pattern matching

Extracts key-value pairs from repr strings
Handles nested enums (DataSourceEnum(text='...'))
Falls back gracefully for unparseable strings

Result: 2,348 Danish institutions successfully integrated
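
The report does not show the parser itself, so the following is only a minimal sketch of the regex approach described above. The three patterns, the decision to keep unquoted values as strings, and the sample identifier value are assumptions made for illustration; the real parse_repr_string() in build_unified_database_v2.py may differ.

```python
import re
from typing import Optional

# Hypothetical sketch: the real parse_repr_string() is not shown in this report.
_WRAPPER_RE = re.compile(r"^\w+\(\{(.*)\}\)$", re.DOTALL)
_ENUM_RE = re.compile(r"\w+\(text='([^']*)'[^)]*\)")
_PAIR_RE = re.compile(r"'(\w+)':\s*(?:'([^']*)'|([^,}]+))")


def parse_repr_string(repr_str: str) -> Optional[dict]:
    """Return the flat key-value pairs of "ClassName({...})", or None."""
    if not isinstance(repr_str, str):
        return None
    wrapper = _WRAPPER_RE.match(repr_str.strip())
    if wrapper is None:
        return None  # not a repr string; the caller falls back
    # Collapse nested enums such as DataSourceEnum(text='X', ...) to 'X'.
    body = _ENUM_RE.sub(lambda m: "'" + m.group(1) + "'", wrapper.group(1))
    pairs = {}
    for key, quoted, bare in _PAIR_RE.findall(body):
        # Unquoted values (numbers, None, True) are kept as stripped strings.
        pairs[key] = bare.strip() if bare else quoted
    return pairs or None


# Sample input with a made-up identifier value.
print(parse_repr_string(
    "Identifier({'identifier_scheme': 'ISIL', 'identifier_value': 'DK-000000'})"
))
```

A pattern-based parser like this breaks on values that contain quotes, commas, or parentheses, which is why the fallback for unparseable strings matters.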

2. Canada Parser Error ✅

Problem: unhashable type: 'dict'
Root Cause: Canada dataset uses nested dict format for enums:

"institution_type": {
  "text": "LIBRARY",
  "description": "Library (public, academic, specialized)",
  "meaning": "http://schema.org/Library"
}

Solution: Created normalize_value() function with smart unwrapping

Detects nested dicts with 'text' field
Extracts simple value (e.g., "LIBRARY")
Handles lists, dicts, and repr strings uniformly

Result: 9,566 Canadian institutions successfully integrated
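
As a rough illustration of the unwrapping step, the sketch below reduces an enum-style dict to its 'text' value so it can be used as a dictionary key or counted. It is an assumption about how normalize_value() might look, not the project's code, and it leaves out the repr-string handling and list flattening mentioned above.

```python
from collections import Counter


def normalize_value(value):
    """Reduce enum-style dicts (and containers of them) to simple values."""
    if isinstance(value, dict):
        if "text" in value:
            # Enum-style dict: keep only the short code, e.g. "LIBRARY".
            return normalize_value(value["text"])
        return {key: normalize_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_value(item) for item in value]
    return value


canadian_type = {
    "text": "LIBRARY",
    "description": "Library (public, academic, specialized)",
    "meaning": "http://schema.org/Library",
}

# Counting the raw dict would raise "unhashable type: 'dict'";
# the normalized string is hashable.
type_counts = Counter([normalize_value(canadian_type), "LIBRARY", "ARCHIVE"])
print(type_counts)
```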

3. SQLite INTEGER Overflow ✅

Problem: Python int too large to convert to SQLite INTEGER
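Root Cause: ghcid_numeric holds 64-bit integers in the unsigned range (e.g., 13679043214714698488)

SQLite INTEGER is a signed 64-bit type with a maximum of 9,223,372,036,854,775,807 (2^63 - 1)
GHCID numeric identifiers above that maximum cannot be bound as integers and overflow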

Solution: Changed column type from INTEGER to TEXT

ghcid_numeric TEXT  -- Changed from INTEGER, stores 64-bit as string

Result: Complete 27 MB SQLite database with all 13,591 institutions
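
The overflow and the TEXT workaround can be reproduced with Python's built-in sqlite3 module. This is a small sketch using an in-memory database and two throwaway tables (as_integer and as_text are names made up for the demonstration); the identifier is the example value quoted above.

```python
import sqlite3

GHCID_NUMERIC = 13679043214714698488  # example from this report; above 2**63 - 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE as_integer (ghcid_numeric INTEGER)")
conn.execute("CREATE TABLE as_text (ghcid_numeric TEXT)")

# Binding the oversized int fails before SQLite ever stores it.
try:
    conn.execute("INSERT INTO as_integer VALUES (?)", (GHCID_NUMERIC,))
    overflowed = False
except OverflowError:
    overflowed = True

# Converting to str first lets the value round-trip unchanged.
conn.execute("INSERT INTO as_text VALUES (?)", (str(GHCID_NUMERIC),))
stored = conn.execute("SELECT ghcid_numeric FROM as_text").fetchone()[0]
round_tripped = int(stored)

print(overflowed, round_tripped == GHCID_NUMERIC)
```

The error is raised while binding the parameter, so changing the column type alone is not enough: the value also has to be passed as a string. One trade-off is that TEXT values sort and compare lexicographically rather than numerically.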

Database Comparison: Phase 1 vs Phase 2

Metric	Phase 1	Phase 2	Change
Total Institutions	1,678	13,591	+11,913 (+709%)
Countries	8	8	Same
Unique GHCIDs	565	10,829	+10,264 (+1,817%)
Duplicates	269	569	+300 (+112%)
Wikidata Coverage	258 (15.4%)	1,027 (7.6%)	+769 institutions
Website Coverage	198 (11.8%)	1,326 (9.8%)	+1,128 institutions
JSON Size	2.5 MB	26 MB	+23.5 MB (+940%)
SQLite Size	20 KB (partial)	27 MB	Complete!

Country Breakdown (Phase 2)

Country	Institutions	% of Total	GHCID	Wikidata	Website
🇨🇦 Canada	9,566	70.4%	9,566 (100%)	0 (0%)	0 (0%)
🇩🇰 Denmark	2,348	17.3%	998 (42.5%)	769 (32.8%)	1,128 (48.0%)
🇫🇮 Finland	817	6.0%	817 (100%)	63 (7.7%)	58 (7.1%)
🇧🇪 Belgium	421	3.1%	0 (0%)	0 (0%)	0 (0%)
🇧🇾 Belarus	167	1.2%	0 (0%)	5 (3.0%)	5 (3.0%)
🇳🇱 Netherlands	153	1.1%	0 (0%)	112 (73.2%)	112 (73.2%)
🇨🇱 Chile	90	0.7%	0 (0%)	71 (78.9%)	0 (0%)
🇪🇬 Egypt	29	0.2%	17 (58.6%)	7 (24.1%)	23 (79.3%)

Key Insights:

Canada now dominates (70.4% of database)
Denmark brought significant Wikidata coverage (+769 institutions)
Finland + Canada = 100% GHCID coverage (10,383 institutions)

Institution Types (Phase 2)

Type	Count	% of Total	Change from Phase 1
LIBRARY	8,291	61.0%	+6,813
EDUCATION_PROVIDER	2,134	15.7%	+2,122
OFFICIAL_INSTITUTION	1,245	9.2%	+1,233
RESEARCH_CENTER	1,138	8.4%	+1,133
ARCHIVE	912	6.7%	+839
MUSEUM	291	2.1%	+211
MIXED	3	0.0%	Same
GALLERY	5	0.0%	Same

Key Insights:

Library share fell from 88% to 61% because the Canadian data covers more institution types
Education providers now significant (15.7%) - Canadian universities
Official institutions (9.2%) - Canadian government libraries

Technical Improvements

New Functions in `build_unified_database_v2.py`

parse_repr_string(repr_str) - Parses Python repr format
- Regex-based field extraction
- Handles nested enums
- Returns dict or None
normalize_value(value) - Universal value normalizer
- Unwraps nested dicts ({"text": "VALUE"})
- Handles repr strings
- Flattens lists
- Returns simple types (str, int, float, bool, None)
safe_get(data, *keys, default=None) - Safe nested dict access
- Handles missing keys gracefully
- Auto-normalizes values
- Supports list indexing
extract_identifiers(record) - Multi-format identifier extraction
- Works with dict format (normal)
- Works with repr strings (Denmark)
- Returns (has_wikidata, has_website) tuple
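
To make the safe_get() description concrete, here is one possible shape for such a helper. It is a sketch written from the bullet points above, not the code in build_unified_database_v2.py; the sample record and its field names are invented, and the only normalization it performs is unwrapping enum-style {"text": ...} dicts.

```python
def safe_get(data, *keys, default=None):
    """Walk nested dicts and lists, returning default when a step is missing."""
    current = data
    for key in keys:
        if isinstance(current, dict):
            if key not in current:
                return default
            current = current[key]
        elif isinstance(current, list) and isinstance(key, int):
            if not -len(current) <= key < len(current):
                return default
            current = current[key]
        else:
            return default  # cannot descend into a scalar
    if isinstance(current, dict) and "text" in current:
        return current["text"]  # unwrap enum-style dicts
    return default if current is None else current


# Invented sample record for illustration only.
record = {
    "locations": [{"city": "Toronto"}],
    "institution_type": {"text": "LIBRARY"},
}
print(safe_get(record, "locations", 0, "city"))
print(safe_get(record, "locations", 3, "city", default="unknown"))
print(safe_get(record, "institution_type"))
```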

Database Schema Improvements

CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric TEXT,  -- ✅ FIXED: Changed from INTEGER to TEXT
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,
    has_website BOOLEAN,
    raw_record TEXT
);

-- New indexes for performance
CREATE INDEX idx_country ON institutions(country);
CREATE INDEX idx_type ON institutions(institution_type);
CREATE INDEX idx_ghcid ON institutions(ghcid);
CREATE INDEX idx_source_country ON institutions(source_country);

Data Quality Analysis

GHCID Coverage

Country	GHCID Coverage	Quality
Canada 🇨🇦	100% (9,566/9,566)	⭐⭐⭐⭐⭐ Excellent
Finland 🇫🇮	100% (817/817)	⭐⭐⭐⭐⭐ Excellent
Egypt 🇪🇬	58.6% (17/29)	⭐⭐⭐ Good
Denmark 🇩🇰	42.5% (998/2,348)	⭐⭐ Fair
Belgium 🇧🇪	0% (0/421)	❌ Needs generation
Netherlands 🇳🇱	0% (0/153)	❌ Needs generation
Belarus 🇧🇾	0% (0/167)	❌ Needs generation
Chile 🇨🇱	0% (0/90)	❌ Needs generation

Action Required: Generate GHCIDs for the 831 institutions in the 4 countries with 0% coverage (Belgium, Netherlands, Belarus, Chile)

Wikidata Enrichment

Country	Wikidata Coverage	Quality
Chile 🇨🇱	78.9% (71/90)	⭐⭐⭐⭐⭐ Excellent
Netherlands 🇳🇱	73.2% (112/153)	⭐⭐⭐⭐⭐ Excellent
Denmark 🇩🇰	32.8% (769/2,348)	⭐⭐⭐⭐ Good
Egypt 🇪🇬	24.1% (7/29)	⭐⭐⭐ Fair
Finland 🇫🇮	7.7% (63/817)	⭐⭐ Fair
Belarus 🇧🇾	3.0% (5/167)	⭐ Poor
Canada 🇨🇦	0% (0/9,566)	❌ Needs enrichment
Belgium 🇧🇪	0% (0/421)	❌ Needs enrichment

Action Required: Wikidata enrichment for the 12,564 institutions without a Wikidata link (13,591 - 1,027)

Duplicate GHCID Analysis

Total Duplicates: 569 (5.3% of unique GHCIDs)
Increase: +300 from Phase 1 (269 → 569)

Top Collision Patterns

Finnish Library Abbreviations (559 duplicates)
- Multiple libraries abbreviate to same code (e.g., "HAKA")
- Cities with similar names (Hangon, Haminan, Haapajärven)
- Need Q-number collision resolution
Canadian Libraries (10+ duplicates)
- Regional branches with same abbreviations
- Need hierarchical GHCID strategy

Recommended Action: Implement Q-number collision resolution per AGENTS.md
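
Colliding GHCIDs can be listed with a GROUP BY query against the institutions table. The sketch below runs it on an in-memory database with a handful of invented rows (the GHCID strings are placeholders, not real identifiers). It also computes the surplus as records with a GHCID minus distinct GHCIDs, which appears to match how the 569 figure relates to the counts in this report (11,398 records with a GHCID, 10,829 unique).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE institutions (id TEXT PRIMARY KEY, ghcid TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO institutions VALUES (?, ?, ?)",
    [
        ("1", "FI-EXAMPLE-HAKA", "Library A"),  # placeholder GHCIDs
        ("2", "FI-EXAMPLE-HAKA", "Library B"),
        ("3", "FI-EXAMPLE-HAKA", "Library C"),
        ("4", "CA-EXAMPLE-TPL", "Library D"),
        ("5", None, "Record without a GHCID"),
    ],
)

# GHCIDs shared by more than one record, most-contested first.
collisions = conn.execute(
    """
    SELECT ghcid, COUNT(*) AS n
    FROM institutions
    WHERE ghcid IS NOT NULL
    GROUP BY ghcid
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    """
).fetchall()

# Surplus records: COUNT(ghcid) skips NULLs, COUNT(DISTINCT ghcid) counts unique values.
surplus = conn.execute(
    "SELECT COUNT(ghcid) - COUNT(DISTINCT ghcid) FROM institutions"
).fetchone()[0]

print(collisions, surplus)
```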

Files Created

Version 2.0.0 Database

/data/unified/
  ├── glam_unified_database_v2.json (26 MB) ✅
  ├── glam_unified_database_v2.db (27 MB) ✅
  └── PHASE2_COMPLETE_REPORT.md (this file)

Version 1.0.0 Database (Phase 1 - kept for comparison)

/data/unified/
  ├── glam_unified_database.json (2.5 MB)
  ├── glam_unified_database.db (20 KB)
  └── UNIFIED_DATABASE_REPORT.md

Scripts

/scripts/
  ├── build_unified_database.py (v1 - Phase 1)
  └── build_unified_database_v2.py (v2 - Phase 2) ✅

Usage Examples

SQLite Queries

# Total institutions
sqlite3 glam_unified_database_v2.db "SELECT COUNT(*) FROM institutions;"

# Count by country
sqlite3 glam_unified_database_v2.db "
  SELECT country, COUNT(*) as count 
  FROM institutions 
  GROUP BY country 
  ORDER BY count DESC;
"

# Find Canadian universities
sqlite3 glam_unified_database_v2.db "
  SELECT name, city 
  FROM institutions 
  WHERE source_country='canada' 
  AND institution_type='EDUCATION_PROVIDER' 
  LIMIT 10;
"

# Institutions with Wikidata
sqlite3 glam_unified_database_v2.db "
  SELECT name, country 
  FROM institutions 
  WHERE has_wikidata=1 
  LIMIT 10;
"

Python Queries

import json

# Load database
with open('data/unified/glam_unified_database_v2.json', 'r') as f:
    db = json.load(f)

# Get metadata
print(f"Version: {db['metadata']['version']}")
print(f"Total: {db['metadata']['total_institutions']}")

# Find Danish archives
danish_archives = [
    inst for inst in db['institutions']
    if inst['source_country'] == 'denmark'
    and inst['institution_type'] == 'ARCHIVE'
]
print(f"Danish archives: {len(danish_archives)}")

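# Calculate GHCID coverage for Canada (records with a GHCID / all Canadian records)
canada = [
    inst for inst in db['institutions']
    if inst['source_country'] == 'canada'
]
canada_with_ghcid = sum(1 for inst in canada if inst.get('ghcid'))
print(f"Canada GHCID coverage: {canada_with_ghcid}/{len(canada)}")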

Next Steps (Phase 3)

Immediate Priorities

Generate Missing GHCIDs 🔄 HIGH
- Belgium: 421 institutions
- Netherlands: 153 institutions
- Belarus: 167 institutions
- Chile: 90 institutions
- Target: +831 institutions with GHCIDs
Resolve GHCID Duplicates 🔄 HIGH
- 569 collisions detected
- Implement Q-number collision resolution
- Focus on Finnish library abbreviations (559 duplicates)
Add Japan Dataset 🔄 MEDIUM
- 12,065 institutions (18 MB file)
- Requires streaming parser for large dataset
- Would bring total to 25,656 institutions

Secondary Priorities

Wikidata Enrichment 🔄 MEDIUM
- Canada: 0% → 30% (target 2,870 institutions)
- Belgium: 0% → 60% (target 253 institutions)
- Finland: 7.7% → 30% (target 245 institutions)
Website Extraction 🔄 LOW
- Canada: 0% → 50% (target 4,783 institutions)
- Chile: 0% → 60% (target 54 institutions)
RDF Export 🔄 LOW
- Export unified database as Linked Open Data
- Follow Denmark RDF export pattern
- Align with 9 international ontologies
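
The enrichment targets above follow from the country totals and the goal percentages. A short check, using only figures from this report (target_count is a helper name introduced here):

```python
# Country totals from the Phase 2 breakdown.
totals = {"Canada": 9566, "Belgium": 421, "Finland": 817, "Chile": 90}

# Goal coverage levels from the secondary priorities.
wikidata_goals = {"Canada": 0.30, "Belgium": 0.60, "Finland": 0.30}
website_goals = {"Canada": 0.50, "Chile": 0.60}


def target_count(total: int, share: float) -> int:
    """Number of institutions needed to reach the given coverage share."""
    return round(total * share)


wikidata_targets = {c: target_count(totals[c], s) for c, s in wikidata_goals.items()}
website_targets = {c: target_count(totals[c], s) for c, s in website_goals.items()}
print(wikidata_targets)
print(website_targets)
```

Each target is the total number of institutions at the goal coverage level, so Finland's 245 includes the 63 institutions that already have Wikidata links.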

Achievements Summary

✅ Denmark parser fixed - 2,348 institutions integrated
✅ Canada parser fixed - 9,566 institutions integrated
✅ SQLite overflow fixed - 27 MB complete database
✅ Database grew 709% - 1,678 → 13,591 institutions
✅ GHCID coverage improved - 565 → 10,829 unique GHCIDs
✅ Multi-format export - JSON (26 MB) + SQLite (27 MB)
✅ Robust parsing - Handles repr strings, nested dicts, enums

Lessons Learned

Technical Challenges

Schema Heterogeneity is Real
- Denmark: Python repr strings in JSON
- Canada: Nested dicts for enums
- Solution: Flexible parsers with fallback logic
SQLite Type Constraints Matter
- SQLite INTEGER is signed 64-bit, so unsigned 64-bit values above 2^63 - 1 need TEXT storage
- Indexes critical for performance (13k+ records)
Large Datasets Require Streaming
- Canada (9,566 records) loaded fine in memory
- Japan (12k records) may need streaming
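
If streaming does become necessary, one standard-library option is to decode records one at a time with json.JSONDecoder.raw_decode. The sketch below assumes the file is a single top-level JSON array of records; the report does not describe the Japan file's layout, so that assumption and the iter_json_array name are illustrative only.

```python
import io
import json


def iter_json_array(fp, chunk_size=65536):
    """Yield the items of a top-level JSON array without loading it all."""
    decoder = json.JSONDecoder()
    buf, started, eof = "", False, False
    while True:
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
        while True:  # decode every complete item currently in the buffer
            buf = buf.lstrip()
            if not started:
                if not buf:
                    break
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
            if buf.startswith(","):
                buf = buf[1:].lstrip()
            if buf.startswith("]"):
                return
            if not buf:
                break
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # item is incomplete; read more data
            rest = buf[end:].lstrip()
            if not eof and (not rest or rest[0] not in ",]"):
                break  # item may be cut off at the chunk boundary
            yield item
            buf = rest
        if eof:
            if buf.strip():
                raise ValueError("truncated or malformed JSON array")
            return


sample = json.dumps([{"name": "A", "n": 123}, {"name": "B", "tags": ["x"]}, 1.5])
print(list(iter_json_array(io.StringIO(sample), chunk_size=5)))
```

Only the current buffer and one decoded record are held in memory at a time, at the cost of re-scanning the buffer when a record spans several chunks.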

Best Practices

✅ Always test with real data - Sample datasets hide format issues
✅ Graceful degradation - Parse what you can, log what you can't
✅ Comprehensive logging - Show progress per country
✅ Version control - Keep v1 for comparison, ship v2 as fix

Version: 2.0.0
Phase: 2 Complete ✅
Next Phase: 3 - GHCID generation + Japan integration
Maintained By: GLAM Data Extraction Project
