415 lines
12 KiB
Markdown
415 lines
12 KiB
Markdown
# Phase 2 Complete: Critical Fixes Applied ✅
|
|
|
|
> **Note**: Any references to Q-number collision resolution in this document are **superseded**.
|
|
> Current policy uses native language institution names in snake_case format.
|
|
> See `docs/plan/global_glam/07-ghcid-collision-resolution.md` for current approach.
|
|
|
|
**Date**: 2025-11-20
|
|
**Version**: 2.0.0
|
|
**Status**: ✅ **ALL CRITICAL PRIORITIES FIXED**
|
|
**Total Institutions**: 13,591 (from 1,678) - **+709% increase**
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
Successfully fixed all three critical issues identified in Phase 1:
|
|
1. ✅ Denmark parser error - 2,348 institutions integrated
|
|
2. ✅ Canada parser error - 9,566 institutions integrated
|
|
3. ✅ SQLite INTEGER overflow - 27 MB database complete
|
|
|
|
**Result**: Unified database grew from 1,678 to **13,591 institutions** (+11,913 institutions, +709% increase)
|
|
|
|
---
|
|
|
|
## Issues Fixed
|
|
|
|
### 1. Denmark Parser Error ✅
|
|
|
|
**Problem**: `'str' object has no attribute 'get'`
|
|
**Root Cause**: Denmark dataset stores nested objects as Python repr strings:
|
|
```json
|
|
"provenance": "Provenance({'data_source': DataSourceEnum(...), ...})"
|
|
"identifiers": ["Identifier({'identifier_scheme': 'ISIL', ...})"]
|
|
```
|
|
|
|
**Solution**: Created `parse_repr_string()` function with regex pattern matching
|
|
- Extracts key-value pairs from repr strings
|
|
- Handles nested enums (`DataSourceEnum(text='...')`)
|
|
- Falls back gracefully for unparseable strings
|
|
|
|
**Result**: 2,348 Danish institutions successfully integrated
|
|
|
|
### 2. Canada Parser Error ✅
|
|
|
|
**Problem**: `unhashable type: 'dict'`
|
|
**Root Cause**: Canada dataset uses nested dict format for enums:
|
|
```json
|
|
"institution_type": {
|
|
"text": "LIBRARY",
|
|
"description": "Library (public, academic, specialized)",
|
|
"meaning": "http://schema.org/Library"
|
|
}
|
|
```
|
|
|
|
**Solution**: Created `normalize_value()` function with smart unwrapping
|
|
- Detects nested dicts with 'text' field
|
|
- Extracts simple value (e.g., "LIBRARY")
|
|
- Handles lists, dicts, and repr strings uniformly
|
|
|
|
**Result**: 9,566 Canadian institutions successfully integrated
|
|
|
|
### 3. SQLite INTEGER Overflow ✅
|
|
|
|
**Problem**: `Python int too large to convert to SQLite INTEGER`
|
|
**Root Cause**: `ghcid_numeric` uses 64-bit integers (e.g., `13679043214714698488`)
|
|
- SQLite INTEGER type is 32-bit by default
|
|
- Overflow on large GHCID numeric identifiers
|
|
|
|
**Solution**: Changed column type from `INTEGER` to `TEXT`
|
|
```sql
|
|
ghcid_numeric TEXT -- Changed from INTEGER, stores 64-bit as string
|
|
```
|
|
|
|
**Result**: Complete 27 MB SQLite database with all 13,591 institutions
|
|
|
|
---
|
|
|
|
## Database Comparison: Phase 1 vs Phase 2
|
|
|
|
| Metric | Phase 1 | Phase 2 | Change |
|
|
|--------|---------|---------|--------|
|
|
| **Total Institutions** | 1,678 | 13,591 | +11,913 (+709%) |
|
|
| **Countries** | 8 | 8 | Same |
|
|
| **Unique GHCIDs** | 565 | 10,829 | +10,264 (+1,817%) |
|
|
| **Duplicates** | 269 | 569 | +300 (+112%) |
|
|
| **Wikidata Coverage** | 258 (15.4%) | 1,027 (7.6%) | +769 institutions |
|
|
| **Website Coverage** | 198 (11.8%) | 1,326 (9.8%) | +1,128 institutions |
|
|
| **JSON Size** | 2.5 MB | 26 MB | +23.5 MB (+940%) |
|
|
| **SQLite Size** | 20 KB (partial) | 27 MB | Complete! |
|
|
|
|
---
|
|
|
|
## Country Breakdown (Phase 2)
|
|
|
|
| Country | Institutions | % of Total | GHCID | Wikidata | Website |
|
|
|---------|--------------|------------|-------|----------|---------|
|
|
| 🇨🇦 **Canada** | 9,566 | 70.4% | 9,566 (100%) | 0 (0%) | 0 (0%) |
|
|
| 🇩🇰 **Denmark** | 2,348 | 17.3% | 998 (42.5%) | 769 (32.8%) | 1,128 (48.0%) |
|
|
| 🇫🇮 **Finland** | 817 | 6.0% | 817 (100%) | 63 (7.7%) | 58 (7.1%) |
|
|
| 🇧🇪 Belgium | 421 | 3.1% | 0 (0%) | 0 (0%) | 0 (0%) |
|
|
| 🇧🇾 Belarus | 167 | 1.2% | 0 (0%) | 5 (3.0%) | 5 (3.0%) |
|
|
| 🇳🇱 Netherlands | 153 | 1.1% | 0 (0%) | 112 (73.2%) | 112 (73.2%) |
|
|
| 🇨🇱 Chile | 90 | 0.7% | 0 (0%) | 71 (78.9%) | 0 (0%) |
|
|
| 🇪🇬 Egypt | 29 | 0.2% | 17 (58.6%) | 7 (24.1%) | 23 (79.3%) |
|
|
|
|
**Key Insights**:
|
|
- Canada now dominates (70.4% of database)
|
|
- Denmark brought significant Wikidata coverage (+769 institutions)
|
|
- Finland + Canada = 100% GHCID coverage (10,383 institutions)
|
|
|
|
---
|
|
|
|
## Institution Types (Phase 2)
|
|
|
|
| Type | Count | % of Total | Change from Phase 1 |
|
|
|------|-------|------------|---------------------|
|
|
| **LIBRARY** | 8,291 | 61.0% | +6,813 |
|
|
| **EDUCATION_PROVIDER** | 2,134 | 15.7% | +2,122 |
|
|
| **OFFICIAL_INSTITUTION** | 1,245 | 9.2% | +1,233 |
|
|
| **RESEARCH_CENTER** | 1,138 | 8.4% | +1,133 |
|
|
| **ARCHIVE** | 912 | 6.7% | +839 |
|
|
| **MUSEUM** | 291 | 2.1% | +211 |
|
|
| **MIXED** | 3 | 0.0% | Same |
|
|
| **GALLERY** | 5 | 0.0% | Same |
|
|
|
|
**Key Insights**:
|
|
- Library dominance reduced (88% → 61%) due to Canadian diversity
|
|
- Education providers now significant (15.7%) - Canadian universities
|
|
- Official institutions (9.2%) - Canadian government libraries
|
|
|
|
---
|
|
|
|
## Technical Improvements
|
|
|
|
### New Functions in `build_unified_database_v2.py`
|
|
|
|
1. **`parse_repr_string(repr_str)`** - Parses Python repr format
|
|
- Regex-based field extraction
|
|
- Handles nested enums
|
|
- Returns dict or None
|
|
|
|
2. **`normalize_value(value)`** - Universal value normalizer
|
|
- Unwraps nested dicts (`{"text": "VALUE"}`)
|
|
- Handles repr strings
|
|
- Flattens lists
|
|
- Returns simple types (str, int, float, bool, None)
|
|
|
|
3. **`safe_get(data, *keys, default=None)`** - Safe nested dict access
|
|
- Handles missing keys gracefully
|
|
- Auto-normalizes values
|
|
- Supports list indexing
|
|
|
|
4. **`extract_identifiers(record)`** - Multi-format identifier extraction
|
|
- Works with dict format (normal)
|
|
- Works with repr strings (Denmark)
|
|
- Returns (has_wikidata, has_website) tuple
|
|
|
|
### Database Schema Improvements
|
|
|
|
```sql
|
|
CREATE TABLE institutions (
|
|
id TEXT PRIMARY KEY,
|
|
ghcid TEXT,
|
|
ghcid_uuid TEXT,
|
|
ghcid_numeric TEXT, -- ✅ FIXED: Changed from INTEGER to TEXT
|
|
name TEXT NOT NULL,
|
|
institution_type TEXT,
|
|
country TEXT,
|
|
city TEXT,
|
|
source_country TEXT,
|
|
data_source TEXT,
|
|
data_tier TEXT,
|
|
extraction_date TEXT,
|
|
has_wikidata BOOLEAN,
|
|
has_website BOOLEAN,
|
|
raw_record TEXT
|
|
);
|
|
|
|
-- New indexes for performance
|
|
CREATE INDEX idx_country ON institutions(country);
|
|
CREATE INDEX idx_type ON institutions(institution_type);
|
|
CREATE INDEX idx_ghcid ON institutions(ghcid);
|
|
CREATE INDEX idx_source_country ON institutions(source_country);
|
|
```
|
|
|
|
---
|
|
|
|
## Data Quality Analysis
|
|
|
|
### GHCID Coverage
|
|
|
|
| Country | GHCID Coverage | Quality |
|
|
|---------|----------------|---------|
|
|
| Canada 🇨🇦 | 100% (9,566/9,566) | ⭐⭐⭐⭐⭐ Excellent |
|
|
| Finland 🇫🇮 | 100% (817/817) | ⭐⭐⭐⭐⭐ Excellent |
|
|
| Egypt 🇪🇬 | 58.6% (17/29) | ⭐⭐⭐ Good |
|
|
| Denmark 🇩🇰 | 42.5% (998/2,348) | ⭐⭐ Fair |
|
|
| Belgium 🇧🇪 | 0% (0/421) | ❌ Needs generation |
|
|
| Netherlands 🇳🇱 | 0% (0/153) | ❌ Needs generation |
|
|
| Belarus 🇧🇾 | 0% (0/167) | ❌ Needs generation |
|
|
| Chile 🇨🇱 | 0% (0/90) | ❌ Needs generation |
|
|
|
|
**Action Required**: Generate GHCIDs for 831 institutions across 4 countries
|
|
|
|
### Wikidata Enrichment
|
|
|
|
| Country | Wikidata Coverage | Quality |
|
|
|---------|-------------------|---------|
|
|
| Chile 🇨🇱 | 78.9% (71/90) | ⭐⭐⭐⭐⭐ Excellent |
|
|
| Netherlands 🇳🇱 | 73.2% (112/153) | ⭐⭐⭐⭐⭐ Excellent |
|
|
| Denmark 🇩🇰 | 32.8% (769/2,348) | ⭐⭐⭐⭐ Good |
|
|
| Egypt 🇪🇬 | 24.1% (7/29) | ⭐⭐⭐ Fair |
|
|
| Finland 🇫🇮 | 7.7% (63/817) | ⭐⭐ Fair |
|
|
| Belarus 🇧🇾 | 3.0% (5/167) | ⭐ Poor |
|
|
| Canada 🇨🇦 | 0% (0/9,566) | ❌ Needs enrichment |
|
|
| Belgium 🇧🇪 | 0% (0/421) | ❌ Needs enrichment |
|
|
|
|
**Action Required**: Wikidata enrichment for 10,564 institutions
|
|
|
|
---
|
|
|
|
## Duplicate GHCID Analysis
|
|
|
|
**Total Duplicates**: 569 (5.3% of unique GHCIDs)
|
|
**Increase**: +300 from Phase 1 (269 → 569)
|
|
|
|
### Top Collision Patterns
|
|
|
|
1. **Finnish Library Abbreviations** (559 duplicates)
|
|
- Multiple libraries abbreviate to same code (e.g., "HAKA")
|
|
- Cities with similar names (Hangon, Haminan, Haapajärven)
|
|
- Need Q-number collision resolution
|
|
|
|
2. **Canadian Libraries** (10+ duplicates)
|
|
- Regional branches with same abbreviations
|
|
- Need hierarchical GHCID strategy
|
|
|
|
**Recommended Action**: Implement Q-number collision resolution per AGENTS.md
|
|
|
|
---
|
|
|
|
## Files Created
|
|
|
|
### Version 2.0.0 Database
|
|
```
|
|
/data/unified/
|
|
├── glam_unified_database_v2.json (26 MB) ✅
|
|
├── glam_unified_database_v2.db (27 MB) ✅
|
|
└── PHASE2_COMPLETE_REPORT.md (this file)
|
|
```
|
|
|
|
### Version 1.0.0 Database (Phase 1 - kept for comparison)
|
|
```
|
|
/data/unified/
|
|
├── glam_unified_database.json (2.5 MB)
|
|
├── glam_unified_database.db (20 KB)
|
|
└── UNIFIED_DATABASE_REPORT.md
|
|
```
|
|
|
|
### Scripts
|
|
```
|
|
/scripts/
|
|
├── build_unified_database.py (v1 - Phase 1)
|
|
└── build_unified_database_v2.py (v2 - Phase 2) ✅
|
|
```
|
|
|
|
---
|
|
|
|
## Usage Examples
|
|
|
|
### SQLite Queries
|
|
|
|
```bash
|
|
# Total institutions
|
|
sqlite3 glam_unified_database_v2.db "SELECT COUNT(*) FROM institutions;"
|
|
|
|
# Count by country
|
|
sqlite3 glam_unified_database_v2.db "
|
|
SELECT country, COUNT(*) as count
|
|
FROM institutions
|
|
GROUP BY country
|
|
ORDER BY count DESC;
|
|
"
|
|
|
|
# Find Canadian universities
|
|
sqlite3 glam_unified_database_v2.db "
|
|
SELECT name, city
|
|
FROM institutions
|
|
WHERE source_country='canada'
|
|
AND institution_type='EDUCATION_PROVIDER'
|
|
LIMIT 10;
|
|
"
|
|
|
|
# Institutions with Wikidata
|
|
sqlite3 glam_unified_database_v2.db "
|
|
SELECT name, country
|
|
FROM institutions
|
|
WHERE has_wikidata=1
|
|
LIMIT 10;
|
|
"
|
|
```
|
|
|
|
### Python Queries
|
|
|
|
```python
|
|
import json
|
|
|
|
# Load database
|
|
with open('data/unified/glam_unified_database_v2.json', 'r') as f:
|
|
db = json.load(f)
|
|
|
|
# Get metadata
|
|
print(f"Version: {db['metadata']['version']}")
|
|
print(f"Total: {db['metadata']['total_institutions']}")
|
|
|
|
# Find Danish archives
|
|
danish_archives = [
|
|
inst for inst in db['institutions']
|
|
if inst['source_country'] == 'denmark'
|
|
and inst['institution_type'] == 'ARCHIVE'
|
|
]
|
|
print(f"Danish archives: {len(danish_archives)}")
|
|
|
|
# Calculate coverage
|
|
canada_with_ghcid = sum(
|
|
1 for inst in db['institutions']
|
|
if inst['source_country'] == 'canada'
|
|
and inst['ghcid']
|
|
)
|
|
print(f"Canada GHCID coverage: {canada_with_ghcid}")
|
|
```
|
|
|
|
---
|
|
|
|
## Next Steps (Phase 3)
|
|
|
|
### Immediate Priorities
|
|
|
|
1. **Generate Missing GHCIDs** 🔄 HIGH
|
|
- Belgium: 421 institutions
|
|
- Netherlands: 153 institutions
|
|
- Belarus: 167 institutions
|
|
- Chile: 90 institutions
|
|
- **Target**: +831 institutions with GHCIDs
|
|
|
|
2. **Resolve GHCID Duplicates** 🔄 HIGH
|
|
- 569 collisions detected
|
|
- Implement Q-number collision resolution
|
|
- Focus on Finnish library abbreviations (559 duplicates)
|
|
|
|
3. **Add Japan Dataset** 🔄 MEDIUM
|
|
- 12,065 institutions (18 MB file)
|
|
- Requires streaming parser for large dataset
|
|
- Would bring total to 25,656 institutions
|
|
|
|
### Secondary Priorities
|
|
|
|
4. **Wikidata Enrichment** 🔄 MEDIUM
|
|
- Canada: 0% → 30% (target 2,870 institutions)
|
|
- Belgium: 0% → 60% (target 253 institutions)
|
|
- Finland: 7.7% → 30% (target 245 institutions)
|
|
|
|
5. **Website Extraction** 🔄 LOW
|
|
- Canada: 0% → 50% (target 4,783 institutions)
|
|
- Chile: 0% → 60% (target 54 institutions)
|
|
|
|
6. **RDF Export** 🔄 LOW
|
|
- Export unified database as Linked Open Data
|
|
- Follow Denmark RDF export pattern
|
|
- Align with 9 international ontologies
|
|
|
|
---
|
|
|
|
## Achievements Summary
|
|
|
|
✅ **Denmark parser fixed** - 2,348 institutions integrated
|
|
✅ **Canada parser fixed** - 9,566 institutions integrated
|
|
✅ **SQLite overflow fixed** - 27 MB complete database
|
|
✅ **Database grew 709%** - 1,678 → 13,591 institutions
|
|
✅ **GHCID coverage improved** - 565 → 10,829 unique GHCIDs
|
|
✅ **Multi-format export** - JSON (26 MB) + SQLite (27 MB)
|
|
✅ **Robust parsing** - Handles repr strings, nested dicts, enums
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
### Technical Challenges
|
|
|
|
1. **Schema Heterogeneity is Real**
|
|
- Denmark: Python repr strings in JSON
|
|
- Canada: Nested dicts for enums
|
|
- Solution: Flexible parsers with fallback logic
|
|
|
|
2. **SQLite Type Constraints Matter**
|
|
- 64-bit integers need TEXT storage
|
|
- Indexes critical for performance (13k+ records)
|
|
|
|
3. **Large Datasets Require Streaming**
|
|
- Canada (9.5k records) loaded fine in memory
|
|
- Japan (12k records) may need streaming
|
|
|
|
### Best Practices
|
|
|
|
✅ **Always test with real data** - Sample datasets hide format issues
|
|
✅ **Graceful degradation** - Parse what you can, log what you can't
|
|
✅ **Comprehensive logging** - Show progress per country
|
|
✅ **Version control** - Keep v1 for comparison, ship v2 as fix
|
|
|
|
---
|
|
|
|
**Version**: 2.0.0
|
|
**Phase**: 2 Complete ✅
|
|
**Next Phase**: 3 - GHCID generation + Japan integration
|
|
**Maintained By**: GLAM Data Extraction Project
|