Phase 2 Complete: Critical Fixes Applied ✅
Note: Any references to Q-number collision resolution in this document are superseded. Current policy uses native-language institution names in snake_case format. See docs/plan/global_glam/07-ghcid-collision-resolution.md for the current approach.
Date: 2025-11-20
Version: 2.0.0
Status: ✅ ALL CRITICAL PRIORITIES FIXED
Total Institutions: 13,591 (from 1,678) - +709% increase
Executive Summary
Successfully fixed all three critical issues identified in Phase 1:
- ✅ Denmark parser error - 2,348 institutions integrated
- ✅ Canada parser error - 9,566 institutions integrated
- ✅ SQLite INTEGER overflow - 27 MB database complete
Result: Unified database grew from 1,678 to 13,591 institutions (+11,913 institutions, +709% increase)
Issues Fixed
1. Denmark Parser Error ✅
Problem: 'str' object has no attribute 'get'
Root Cause: Denmark dataset stores nested objects as Python repr strings:
"provenance": "Provenance({'data_source': DataSourceEnum(...), ...})"
"identifiers": ["Identifier({'identifier_scheme': 'ISIL', ...})"]
Solution: Created parse_repr_string() function with regex pattern matching
- Extracts key-value pairs from repr strings
- Handles nested enums (DataSourceEnum(text='...'))
- Falls back gracefully for unparseable strings
Result: 2,348 Danish institutions successfully integrated
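The regex approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_unified_database_v2.py: it assumes repr strings have the shape ClassName({...}) with single-quoted keys and string values, and the sample values 'DK-700300' and 'CSV_REGISTRY' are hypothetical.

```python
import re

# Matches "ClassName({ ... })" and captures the class name and the dict body.
_REPR_RE = re.compile(r"^\s*(\w+)\(\{(.*)\}\)\s*$", re.DOTALL)
# Matches 'key': 'string value' pairs.
_STR_PAIR_RE = re.compile(r"'(\w+)':\s*'([^']*)'")
# Matches 'key': EnumName(text='VALUE' ...) pairs and keeps only the text.
_ENUM_PAIR_RE = re.compile(r"'(\w+)':\s*\w+\(\s*text='([^']*)'")


def parse_repr_string(repr_str):
    """Return a dict of fields recovered from a repr string, or None."""
    if not isinstance(repr_str, str):
        return None
    match = _REPR_RE.match(repr_str)
    if match is None:
        return None  # graceful fallback: the caller keeps the raw value
    body = match.group(2)
    fields = dict(_STR_PAIR_RE.findall(body))
    fields.update(_ENUM_PAIR_RE.findall(body))
    return fields or None


# Hypothetical sample inputs in the shape shown in the root cause above.
identifier = parse_repr_string(
    "Identifier({'identifier_scheme': 'ISIL', 'identifier_value': 'DK-700300'})"
)
provenance = parse_repr_string(
    "Provenance({'data_source': DataSourceEnum(text='CSV_REGISTRY'), 'data_tier': 'TIER_1'})"
)
```

Values containing quotes or nested dicts would defeat these patterns, which is why the fallback to None matters: unparseable records can be logged and skipped rather than aborting the whole country.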
2. Canada Parser Error ✅
Problem: unhashable type: 'dict'
Root Cause: Canada dataset uses nested dict format for enums:
"institution_type": {
  "text": "LIBRARY",
  "description": "Library (public, academic, specialized)",
  "meaning": "http://schema.org/Library"
}
Solution: Created normalize_value() function with smart unwrapping
- Detects nested dicts with 'text' field
- Extracts simple value (e.g., "LIBRARY")
- Handles lists, dicts, and repr strings uniformly
Result: 9,566 Canadian institutions successfully integrated
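A rough sketch of the unwrapping idea, assuming that "handles lists" means taking the first usable element (the report does not say how lists are reduced). Repr-string handling is left out here, and the sample records are hypothetical.

```python
from collections import Counter


def normalize_value(value):
    """Reduce an enum-like value to a simple, hashable type where possible."""
    # Enum-style dict: {"text": "LIBRARY", "description": ..., "meaning": ...}
    if isinstance(value, dict):
        if "text" in value:
            return normalize_value(value["text"])
        return None  # a dict without 'text' has no obvious scalar form
    # Lists: normalize each element and keep the first usable one (assumption).
    if isinstance(value, (list, tuple)):
        for item in value:
            simple = normalize_value(item)
            if simple is not None:
                return simple
        return None
    return value  # str, int, float, bool and None pass through unchanged


records = [
    {"institution_type": {"text": "LIBRARY", "meaning": "http://schema.org/Library"}},
    {"institution_type": "LIBRARY"},
    {"institution_type": [{"text": "MUSEUM"}]},
]

# Counter(r["institution_type"] for r in records) would raise
# "TypeError: unhashable type: 'dict'" on the first record.
type_counts = Counter(normalize_value(r["institution_type"]) for r in records)
```

The error in the problem statement appears whenever a raw dict is used as a dict key or set member, for example when counting institutions by type, so normalizing before counting avoids it.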
3. SQLite INTEGER Overflow ✅
Problem: Python int too large to convert to SQLite INTEGER
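Root Cause: ghcid_numeric holds unsigned 64-bit integers (e.g., 13679043214714698488)
- SQLite INTEGER is a signed 64-bit type with a maximum of 9,223,372,036,854,775,807 (2^63 - 1)
- GHCID numeric identifiers above that maximum overflow when Python binds them as INTEGER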
Solution: Changed column type from INTEGER to TEXT
ghcid_numeric TEXT -- Changed from INTEGER, stores 64-bit as string
Result: Complete 27 MB SQLite database with all 13,591 institutions
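The overflow and the TEXT workaround can be reproduced with Python's standard sqlite3 module. The table names as_integer and as_text are hypothetical, chosen only for this demonstration.

```python
import sqlite3

GHCID_NUMERIC = 13679043214714698488  # example value from this report
SQLITE_MAX_INTEGER = 2**63 - 1        # largest value a SQLite INTEGER can hold

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE as_integer (ghcid_numeric INTEGER)")
conn.execute("CREATE TABLE as_text (ghcid_numeric TEXT)")

# Binding the Python int directly fails before the row is written.
try:
    conn.execute("INSERT INTO as_integer VALUES (?)", (GHCID_NUMERIC,))
    overflowed = False
except OverflowError:
    overflowed = True

# Converting to str first stores the full value without loss.
conn.execute("INSERT INTO as_text VALUES (?)", (str(GHCID_NUMERIC),))
stored = conn.execute("SELECT ghcid_numeric FROM as_text").fetchone()[0]
restored = int(stored)
```

One trade-off worth keeping in mind: TEXT values compare lexicographically, so numeric range queries or ORDER BY on ghcid_numeric would need the comparison done in Python after converting back with int().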
Database Comparison: Phase 1 vs Phase 2
| Metric | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| Total Institutions | 1,678 | 13,591 | +11,913 (+709%) |
| Countries | 8 | 8 | Same |
| Unique GHCIDs | 565 | 10,829 | +10,264 (+1,817%) |
| Duplicates | 269 | 569 | +300 (+112%) |
| Wikidata Coverage | 258 (15.4%) | 1,027 (7.6%) | +769 institutions |
| Website Coverage | 198 (11.8%) | 1,326 (9.8%) | +1,128 institutions |
| JSON Size | 2.5 MB | 26 MB | +23.5 MB (+940%) |
| SQLite Size | 20 KB (partial) | 27 MB | Complete! |
Country Breakdown (Phase 2)
| Country | Institutions | % of Total | GHCID | Wikidata | Website |
|---|---|---|---|---|---|
| 🇨🇦 Canada | 9,566 | 70.4% | 9,566 (100%) | 0 (0%) | 0 (0%) |
| 🇩🇰 Denmark | 2,348 | 17.3% | 998 (42.5%) | 769 (32.8%) | 1,128 (48.0%) |
| 🇫🇮 Finland | 817 | 6.0% | 817 (100%) | 63 (7.7%) | 58 (7.1%) |
| 🇧🇪 Belgium | 421 | 3.1% | 0 (0%) | 0 (0%) | 0 (0%) |
| 🇧🇾 Belarus | 167 | 1.2% | 0 (0%) | 5 (3.0%) | 5 (3.0%) |
| 🇳🇱 Netherlands | 153 | 1.1% | 0 (0%) | 112 (73.2%) | 112 (73.2%) |
| 🇨🇱 Chile | 90 | 0.7% | 0 (0%) | 71 (78.9%) | 0 (0%) |
| 🇪🇬 Egypt | 29 | 0.2% | 17 (58.6%) | 7 (24.1%) | 23 (79.3%) |
Key Insights:
- Canada now dominates (70.4% of database)
- Denmark brought significant Wikidata coverage (+769 institutions)
- Finland + Canada = 100% GHCID coverage (10,383 institutions)
Institution Types (Phase 2)
| Type | Count | % of Total | Change from Phase 1 |
|---|---|---|---|
| LIBRARY | 8,291 | 61.0% | +6,813 |
| EDUCATION_PROVIDER | 2,134 | 15.7% | +2,122 |
| OFFICIAL_INSTITUTION | 1,245 | 9.2% | +1,233 |
| RESEARCH_CENTER | 1,138 | 8.4% | +1,133 |
| ARCHIVE | 912 | 6.7% | +839 |
| MUSEUM | 291 | 2.1% | +211 |
| MIXED | 3 | 0.0% | Same |
| GALLERY | 5 | 0.0% | Same |
Key Insights:
- Library dominance reduced (88% → 61%) due to Canadian diversity
- Education providers now significant (15.7%) - Canadian universities
- Official institutions (9.2%) - Canadian government libraries
Technical Improvements
New Functions in build_unified_database_v2.py
- parse_repr_string(repr_str) - Parses Python repr format
  - Regex-based field extraction
  - Handles nested enums
  - Returns dict or None
- normalize_value(value) - Universal value normalizer
  - Unwraps nested dicts ({"text": "VALUE"})
  - Handles repr strings
  - Flattens lists
  - Returns simple types (str, int, float, bool, None)
- safe_get(data, *keys, default=None) - Safe nested dict access
  - Handles missing keys gracefully
  - Auto-normalizes values
  - Supports list indexing
- extract_identifiers(record) - Multi-format identifier extraction
  - Works with dict format (normal)
  - Works with repr strings (Denmark)
  - Returns (has_wikidata, has_website) tuple
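As an illustration of the safe-access pattern, here is one way safe_get could look. It follows the signature listed above but is only a sketch: the value normalization step is omitted, and the sample record with its field names (locations, city) is hypothetical.

```python
def safe_get(data, *keys, default=None):
    """Walk nested dicts and lists, returning default if any step is missing."""
    current = data
    for key in keys:
        if isinstance(current, dict):
            if key not in current:
                return default
            current = current[key]
        elif isinstance(current, (list, tuple)) and isinstance(key, int):
            # List indexing: out-of-range positions fall back to the default.
            if not -len(current) <= key < len(current):
                return default
            current = current[key]
        else:
            # Strings, numbers, None: nothing further to descend into.
            return default
    return default if current is None else current


# Hypothetical record used only to exercise the function.
record = {"name": "Example Library", "locations": [{"city": "Aarhus"}]}

city = safe_get(record, "locations", 0, "city")
missing_index = safe_get(record, "locations", 3, "city", default="unknown")
missing_key = safe_get(record, "provenance", "data_source")
```

Returning the default for a non-dict intermediate value is what prevents the Denmark-style "'str' object has no attribute 'get'" failure when a nested field turns out to be a string.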
Database Schema Improvements
CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric TEXT,  -- ✅ FIXED: Changed from INTEGER to TEXT
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,
    has_website BOOLEAN,
    raw_record TEXT
);
-- New indexes for performance
CREATE INDEX idx_country ON institutions(country);
CREATE INDEX idx_type ON institutions(institution_type);
CREATE INDEX idx_ghcid ON institutions(ghcid);
CREATE INDEX idx_source_country ON institutions(source_country);
Data Quality Analysis
GHCID Coverage
| Country | GHCID Coverage | Quality |
|---|---|---|
| Canada 🇨🇦 | 100% (9,566/9,566) | ⭐⭐⭐⭐⭐ Excellent |
| Finland 🇫🇮 | 100% (817/817) | ⭐⭐⭐⭐⭐ Excellent |
| Egypt 🇪🇬 | 58.6% (17/29) | ⭐⭐⭐ Good |
| Denmark 🇩🇰 | 42.5% (998/2,348) | ⭐⭐ Fair |
| Belgium 🇧🇪 | 0% (0/421) | ❌ Needs generation |
| Netherlands 🇳🇱 | 0% (0/153) | ❌ Needs generation |
| Belarus 🇧🇾 | 0% (0/167) | ❌ Needs generation |
| Chile 🇨🇱 | 0% (0/90) | ❌ Needs generation |
Action Required: Generate GHCIDs for 831 institutions across 4 countries
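Coverage figures like those in the table above could be recomputed with a single grouped query. The sketch below runs against an in-memory table holding only the three relevant columns of the institutions schema; the rows and GHCID values ("XX-1" and so on) are placeholders, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE institutions (id TEXT PRIMARY KEY, ghcid TEXT, source_country TEXT)"
)
conn.executemany(
    "INSERT INTO institutions VALUES (?, ?, ?)",
    [
        ("1", "XX-1", "finland"),
        ("2", "XX-2", "finland"),
        ("3", None, "belgium"),
        ("4", "", "belgium"),
        ("5", "XX-3", "egypt"),
        ("6", None, "egypt"),
    ],
)

# Treat both NULL and empty string as "no GHCID".
COVERAGE_SQL = """
SELECT source_country,
       COUNT(*) AS total,
       SUM(CASE WHEN ghcid IS NOT NULL AND ghcid <> '' THEN 1 ELSE 0 END) AS with_ghcid
FROM institutions
GROUP BY source_country
ORDER BY source_country
"""

coverage = {
    country: (with_ghcid, total, round(100.0 * with_ghcid / total, 1))
    for country, total, with_ghcid in conn.execute(COVERAGE_SQL)
}
```

Pointing sqlite3.connect at glam_unified_database_v2.db instead of ":memory:" and skipping the inserts should run the same query against the real database, assuming it uses the schema shown in this report.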
Wikidata Enrichment
| Country | Wikidata Coverage | Quality |
|---|---|---|
| Chile 🇨🇱 | 78.9% (71/90) | ⭐⭐⭐⭐⭐ Excellent |
| Netherlands 🇳🇱 | 73.2% (112/153) | ⭐⭐⭐⭐⭐ Excellent |
| Denmark 🇩🇰 | 32.8% (769/2,348) | ⭐⭐⭐⭐ Good |
| Egypt 🇪🇬 | 24.1% (7/29) | ⭐⭐⭐ Fair |
| Finland 🇫🇮 | 7.7% (63/817) | ⭐⭐ Fair |
| Belarus 🇧🇾 | 3.0% (5/167) | ⭐ Poor |
| Canada 🇨🇦 | 0% (0/9,566) | ❌ Needs enrichment |
| Belgium 🇧🇪 | 0% (0/421) | ❌ Needs enrichment |
Action Required: Wikidata enrichment for the 12,564 institutions without a Wikidata link (13,591 total minus 1,027 already linked)
Duplicate GHCID Analysis
Total Duplicates: 569 (5.3% of unique GHCIDs)
Increase: +300 from Phase 1 (269 → 569)
Top Collision Patterns
- Finnish Library Abbreviations (559 duplicates)
  - Multiple libraries abbreviate to same code (e.g., "HAKA")
  - Cities with similar names (Hangon, Haminan, Haapajärven)
  - Need Q-number collision resolution
- Canadian Libraries (10+ duplicates)
  - Regional branches with same abbreviations
  - Need hierarchical GHCID strategy
Recommended Action: Implement Q-number collision resolution per AGENTS.md
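The reported figures are consistent with counting a duplicate as a surplus record: 11,398 records carry a GHCID (9,566 + 998 + 817 + 17), 10,829 of those GHCIDs are unique, and 11,398 - 10,829 = 569. A small sketch of that count, assuming records are dicts with 'ghcid' and 'name' keys as in the JSON export; the sample rows are hypothetical.

```python
from collections import defaultdict


def find_ghcid_collisions(institutions):
    """Group institution names by GHCID and keep only GHCIDs used more than once."""
    groups = defaultdict(list)
    for inst in institutions:
        ghcid = inst.get("ghcid")
        if ghcid:  # skip records that have no GHCID yet
            groups[ghcid].append(inst.get("name"))
    return {ghcid: names for ghcid, names in groups.items() if len(names) > 1}


# Hypothetical sample: two records share one GHCID, one record has none.
sample = [
    {"name": "Library one", "ghcid": "A"},
    {"name": "Library two", "ghcid": "A"},
    {"name": "Library three", "ghcid": "B"},
    {"name": "Library four", "ghcid": None},
]

collisions = find_ghcid_collisions(sample)
# Surplus records: every record beyond the first for each colliding GHCID.
surplus = sum(len(names) - 1 for names in collisions.values())
```

Listing the names per colliding GHCID, rather than only counting, makes it easier to see which abbreviation patterns cause the collisions.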
Files Created
Version 2.0.0 Database
/data/unified/
├── glam_unified_database_v2.json (26 MB) ✅
├── glam_unified_database_v2.db (27 MB) ✅
└── PHASE2_COMPLETE_REPORT.md (this file)
Version 1.0.0 Database (Phase 1 - kept for comparison)
/data/unified/
├── glam_unified_database.json (2.5 MB)
├── glam_unified_database.db (20 KB)
└── UNIFIED_DATABASE_REPORT.md
Scripts
/scripts/
├── build_unified_database.py (v1 - Phase 1)
└── build_unified_database_v2.py (v2 - Phase 2) ✅
Usage Examples
SQLite Queries
# Total institutions
sqlite3 glam_unified_database_v2.db "SELECT COUNT(*) FROM institutions;"
# Count by country
sqlite3 glam_unified_database_v2.db "
SELECT country, COUNT(*) as count
FROM institutions
GROUP BY country
ORDER BY count DESC;
"
# Find Canadian universities
sqlite3 glam_unified_database_v2.db "
SELECT name, city
FROM institutions
WHERE source_country='canada'
AND institution_type='EDUCATION_PROVIDER'
LIMIT 10;
"
# Institutions with Wikidata
sqlite3 glam_unified_database_v2.db "
SELECT name, country
FROM institutions
WHERE has_wikidata=1
LIMIT 10;
"
Python Queries
import json
# Load the unified database (JSON export)
with open('data/unified/glam_unified_database_v2.json', 'r', encoding='utf-8') as f:
    db = json.load(f)
# Get metadata
print(f"Version: {db['metadata']['version']}")
print(f"Total: {db['metadata']['total_institutions']}")
# Find Danish archives
danish_archives = [
    inst for inst in db['institutions']
    if inst['source_country'] == 'denmark'
    and inst['institution_type'] == 'ARCHIVE'
]
print(f"Danish archives: {len(danish_archives)}")
# Count Canadian institutions that have a GHCID
canada_with_ghcid = sum(
    1 for inst in db['institutions']
    if inst['source_country'] == 'canada'
    and inst['ghcid']
)
print(f"Canadian institutions with GHCID: {canada_with_ghcid}")
Next Steps (Phase 3)
Immediate Priorities
- Generate Missing GHCIDs 🔄 HIGH
  - Belgium: 421 institutions
  - Netherlands: 153 institutions
  - Belarus: 167 institutions
  - Chile: 90 institutions
  - Target: +831 institutions with GHCIDs
- Resolve GHCID Duplicates 🔄 HIGH
  - 569 collisions detected
  - Implement Q-number collision resolution
  - Focus on Finnish library abbreviations (559 duplicates)
- Add Japan Dataset 🔄 MEDIUM
  - 12,065 institutions (18 MB file)
  - Requires streaming parser for large dataset
  - Would bring total to 25,656 institutions
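The report does not state the Japan file's format. If it were available as, or first converted to, JSON Lines (one record per line, which is an assumption), a streaming reader could be as small as the generator below. The function name iter_records and the sample records are hypothetical.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Parse what we can and report what we cannot.
            print(f"skipping malformed line {line_number}")


# An in-memory stand-in for an open file; a real file object works the same way.
sample = io.StringIO('{"name": "A"}\n\nnot json\n{"name": "B"}\n')
names = [record["name"] for record in iter_records(sample)]
```

Because the generator holds only one line at a time, memory use stays flat regardless of file size. A single large JSON array would instead need an incremental parser, which the standard json module does not provide directly.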
Secondary Priorities
- Wikidata Enrichment 🔄 MEDIUM
  - Canada: 0% → 30% (target 2,870 institutions)
  - Belgium: 0% → 60% (target 253 institutions)
  - Finland: 7.7% → 30% (target 245 institutions)
- Website Extraction 🔄 LOW
  - Canada: 0% → 50% (target 4,783 institutions)
  - Chile: 0% → 60% (target 54 institutions)
- RDF Export 🔄 LOW
  - Export unified database as Linked Open Data
  - Follow Denmark RDF export pattern
  - Align with 9 international ontologies
Achievements Summary
✅ Denmark parser fixed - 2,348 institutions integrated
✅ Canada parser fixed - 9,566 institutions integrated
✅ SQLite overflow fixed - 27 MB complete database
✅ Database grew 709% - 1,678 → 13,591 institutions
✅ GHCID coverage improved - 565 → 10,829 unique GHCIDs
✅ Multi-format export - JSON (26 MB) + SQLite (27 MB)
✅ Robust parsing - Handles repr strings, nested dicts, enums
Lessons Learned
Technical Challenges
- Schema Heterogeneity is Real
  - Denmark: Python repr strings in JSON
  - Canada: Nested dicts for enums
  - Solution: Flexible parsers with fallback logic
- SQLite Type Constraints Matter
  - Unsigned 64-bit integers above 2^63 - 1 do not fit SQLite INTEGER and need TEXT storage
  - Indexes critical for performance (13k+ records)
- Large Datasets Require Streaming
  - Canada (9.6k records) loaded fine in memory
  - Japan (12k records) may need streaming
Best Practices
✅ Always test with real data - Sample datasets hide format issues
✅ Graceful degradation - Parse what you can, log what you can't
✅ Comprehensive logging - Show progress per country
✅ Version control - Keep v1 for comparison, ship v2 as fix
Version: 2.0.0
Phase: 2 Complete ✅
Next Phase: 3 - GHCID generation + Japan integration
Maintained By: GLAM Data Extraction Project