glam/data/unified/QUICK_START_UNIFIED_DB.md
2025-11-21 22:12:33 +01:00


# Unified GLAM Database - Quick Start
**Last Updated**: 2025-11-20
**Database Version**: 1.0.0 (Phase 1)
**Total Institutions**: 1,678 across 8 countries
---
## Quick Access
### Database Files
```bash
# JSON format (2.5 MB, complete)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.json
# SQLite format (20 KB, incomplete: INTEGER overflow on ghcid_numeric, see Known Limitations)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.db
```
### Query Examples
#### Python (JSON)
```python
import json
# Load the database (UTF-8: institution names contain non-ASCII characters)
with open('data/unified/glam_unified_database.json', 'r', encoding='utf-8') as f:
    db = json.load(f)
# Get metadata
print(f"Total institutions: {db['metadata']['total_institutions']}")
print(f"Countries: {', '.join(db['metadata']['countries'])}")
# Find Finnish museums
finnish_museums = [
    inst for inst in db['institutions']
    if inst['source_country'] == 'finland'
    and inst['institution_type'] == 'MUSEUM'
]
print(f"Finnish museums: {len(finnish_museums)}")
# Get country statistics
for country, stats in db['country_stats'].items():
    print(f"{country}: {stats['total']} institutions ({stats['with_wikidata']} with Wikidata)")
```
#### SQLite (after fixing overflow)
```bash
# Count by country
sqlite3 data/unified/glam_unified_database.db \
"SELECT country, COUNT(*) FROM institutions GROUP BY country ORDER BY COUNT(*) DESC;"
# Find institutions with Wikidata
sqlite3 data/unified/glam_unified_database.db \
"SELECT name, country FROM institutions WHERE has_wikidata=1 LIMIT 10;"
# Search by institution type
sqlite3 data/unified/glam_unified_database.db \
"SELECT name, city FROM institutions WHERE institution_type='MUSEUM';"
```
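The same queries can also be run from Python with the standard-library `sqlite3` module. The following is a minimal sketch, not part of the project's scripts: the helper names `count_by_country` and `find_by_type` are hypothetical, and the demo rows are made up so the block runs without the real database file.

```python
import sqlite3


def count_by_country(conn: sqlite3.Connection) -> list:
    """Return (country, count) rows, largest group first."""
    return conn.execute(
        "SELECT country, COUNT(*) AS n FROM institutions "
        "GROUP BY country ORDER BY n DESC, country"
    ).fetchall()


def find_by_type(conn: sqlite3.Connection, institution_type: str, limit: int = 10) -> list:
    """Return (name, city) rows for one institution type, using bound parameters."""
    return conn.execute(
        "SELECT name, city FROM institutions "
        "WHERE institution_type = ? ORDER BY name LIMIT ?",
        (institution_type, limit),
    ).fetchall()


# Demo on an in-memory database with made-up rows (not real data).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE institutions ("
    "id TEXT PRIMARY KEY, name TEXT NOT NULL, "
    "institution_type TEXT, country TEXT, city TEXT)"
)
demo.executemany(
    "INSERT INTO institutions VALUES (?, ?, ?, ?, ?)",
    [
        ("ex-1", "Example Library", "LIBRARY", "FI", "Example City"),
        ("ex-2", "Example Museum", "MUSEUM", "FI", "Example City"),
        ("ex-3", "Sample Archive", "ARCHIVE", "NL", "Sample Town"),
    ],
)
print(count_by_country(demo))
print(find_by_type(demo, "MUSEUM"))
```

To query the real file, replace the in-memory connection with `sqlite3.connect("data/unified/glam_unified_database.db")`. Bound parameters (`?`) avoid quoting problems when the search value comes from user input.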
---
## Database Schema
### JSON Structure
```json
{
  "metadata": {
    "export_date": "2025-11-20T15:17:03+00:00",
    "total_institutions": 1678,
    "unique_ghcids": 565,
    "duplicates": 269,
    "countries": ["finland", "denmark", ...]
  },
  "country_stats": {
    "finland": {
      "total": 817,
      "with_ghcid": 817,
      "with_wikidata": 63,
      "with_website": 58,
      "by_type": {"LIBRARY": 789, "MUSEUM": 15, ...}
    }
  },
  "institutions": [
    {
      "id": "https://w3id.org/heritage/custodian/fi/...",
      "ghcid": "FI-A-A-L-ALKU-Q39176216",
      "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Alakylän kirjasto",
      "institution_type": "LIBRARY",
      "country": "FI",
      "city": "Alavi",
      "has_wikidata": true,
      "has_website": false,
      "raw_record": "{...full LinkML record...}"
    }
  ]
}
```
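The `raw_record` field is shown above as a quoted string, which suggests the full LinkML record is stored as JSON text nested inside the outer JSON. Assuming that is the case, it needs a second `json.loads` call before its fields can be read. A minimal sketch follows; the helper name `decode_raw_record` and the sample record are illustrative only.

```python
import json


def decode_raw_record(institution: dict) -> dict:
    """Return the nested record as a dict.

    Assumes raw_record is a JSON-encoded string; if it is already a dict
    (or missing), it is returned as-is (or as an empty dict).
    """
    raw = institution.get("raw_record")
    if isinstance(raw, str):
        return json.loads(raw)
    return raw or {}


# Made-up sample record, shaped like the structure above.
sample = {
    "name": "Example Library",
    "institution_type": "LIBRARY",
    "raw_record": json.dumps({"name": "Example Library", "identifiers": []}),
}
record = decode_raw_record(sample)
print(sorted(record.keys()))
```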
### SQLite Schema
```sql
CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric INTEGER, -- ⚠️ Overflow issue
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,
    has_website BOOLEAN,
    raw_record TEXT -- Full JSON record
);
CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT
);
```
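SQLite's `INTEGER` type is a signed 64-bit value, so anything above 2^63 - 1 cannot be stored in it, and Python's `sqlite3` module raises `OverflowError` when asked to bind such an integer. If `ghcid_numeric` is an unsigned 64-bit number (an assumption; this document does not say how it is generated), one possible workaround is to declare the column as `TEXT` and store the decimal string. The sketch below is not the project's actual fix, and the table names and sample value are made up.

```python
import sqlite3

SQLITE_INT_MAX = 2**63 - 1  # largest value a SQLite INTEGER can hold

conn = sqlite3.connect(":memory:")
big_value = 2**64 - 1  # made-up example of an unsigned 64-bit identifier

# 1. Reproduce the failure: binding an oversized int to an INTEGER column.
conn.execute("CREATE TABLE as_integer (id TEXT PRIMARY KEY, ghcid_numeric INTEGER)")
overflow_raised = False
try:
    conn.execute("INSERT INTO as_integer VALUES (?, ?)", ("ex-1", big_value))
except OverflowError:
    overflow_raised = True

# 2. Workaround sketch: a TEXT column holding the decimal string.
#    The column must be TEXT; an INTEGER column would coerce an oversized
#    numeric string to REAL and lose precision.
conn.execute("CREATE TABLE as_text (id TEXT PRIMARY KEY, ghcid_numeric TEXT)")
conn.execute("INSERT INTO as_text VALUES (?, ?)", ("ex-1", str(big_value)))
stored = conn.execute("SELECT ghcid_numeric FROM as_text").fetchone()[0]
restored = int(stored)

print(overflow_raised, restored == big_value)
```

The trade-off is that numeric comparisons and sorting on the column no longer work directly in SQL; values have to be converted back with `int()` on the Python side.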
---
## Statistics at a Glance
### Overall
- **Total Institutions**: 1,678
- **Unique GHCIDs**: 565 (33.7% of all institutions)
- **Wikidata Coverage**: 258 (15.4%)
- **Website Coverage**: 198 (11.8%)
### By Country
| Country | Count | GHCID | Wikidata | Tier |
|---------|-------|-------|----------|------|
| 🇫🇮 Finland | 817 | 100% | 7.7% | TIER_1 |
| 🇧🇪 Belgium | 421 | 0% | 0% | TIER_1 |
| 🇧🇾 Belarus | 167 | 0% | 3.0% | TIER_1 |
| 🇳🇱 Netherlands | 153 | 0% | 73.2% | TIER_1 |
| 🇨🇱 Chile | 90 | 0% | 78.9% | TIER_4 |
| 🇪🇬 Egypt | 29 | 58.6% | 24.1% | TIER_4 |
### By Institution Type
- Libraries: 1,478 (88.1%)
- Museums: 80 (4.8%)
- Archives: 73 (4.4%)
- Education Providers: 12 (0.7%)
- Official Institutions: 12 (0.7%)
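The coverage percentages in this section can be recomputed from the `institutions` list in the JSON export. The sketch below assumes each record carries the boolean flags `has_wikidata` and `has_website` shown in the JSON structure; the helper name `coverage` and the sample records are illustrative only.

```python
def coverage(institutions: list, flag: str) -> tuple:
    """Return (count, percent) of records where the boolean `flag` is true."""
    total = len(institutions)
    count = sum(1 for inst in institutions if inst.get(flag))
    percent = round(100 * count / total, 1) if total else 0.0
    return count, percent


# Made-up records, only to show the call shape.
sample = [
    {"name": "A", "has_wikidata": True, "has_website": False},
    {"name": "B", "has_wikidata": False, "has_website": False},
    {"name": "C", "has_wikidata": False, "has_website": True},
    {"name": "D", "has_wikidata": False},
]
print(coverage(sample, "has_wikidata"))
print(coverage(sample, "has_website"))
```

With the real export loaded as `db`, the call would be `coverage(db['institutions'], 'has_wikidata')`.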
---
## Known Limitations (Phase 1)
1. ⚠️ **Denmark excluded** (2,348 institutions) - parser error
2. ⚠️ **Canada excluded** (9,565 institutions) - nested dict error
3. ⚠️ **SQLite incomplete** - INTEGER overflow on ghcid_numeric
4. 🔍 **269 GHCID duplicates** - need collision resolution
5. 📝 **Missing GHCIDs** - Belgium, Netherlands, Belarus, and Chile have none (0%); Egypt is partial (58.6%)
**Phase 2 will fix these issues and bring the total to 13,591 institutions (1,678 + 2,348 + 9,565).**
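As a starting point for the collision resolution mentioned in item 4, duplicated GHCIDs can be listed from the JSON export with `collections.Counter`. This is a minimal sketch, assuming records without a GHCID have the `ghcid` key missing or empty; the helper name `duplicate_ghcids` and the sample values are hypothetical. The "surplus" figure counts every record beyond the first for each GHCID, which may or may not match how the `duplicates` metadata field is calculated.

```python
from collections import Counter


def duplicate_ghcids(institutions: list) -> dict:
    """Map each GHCID that occurs more than once to its number of occurrences."""
    counts = Counter(inst["ghcid"] for inst in institutions if inst.get("ghcid"))
    return {ghcid: n for ghcid, n in counts.items() if n > 1}


# Made-up records, only to show the call shape.
sample = [
    {"name": "A", "ghcid": "XX-EXAMPLE-1"},
    {"name": "B", "ghcid": "XX-EXAMPLE-1"},
    {"name": "C", "ghcid": "XX-EXAMPLE-2"},
    {"name": "D", "ghcid": None},
    {"name": "E"},
]
dupes = duplicate_ghcids(sample)
surplus = sum(n - 1 for n in dupes.values())  # records beyond the first per GHCID
print(dupes, surplus)
```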
---
## Rebuilding the Database
To rebuild with updated country datasets:
```bash
# Run the unification script
python3 scripts/build_unified_database.py
# Output will be in:
# - data/unified/glam_unified_database.json
# - data/unified/glam_unified_database.db
```
To add a new country dataset:
1. Edit `scripts/build_unified_database.py`
2. Add the country to the `COUNTRY_DATASETS` dict, with the path to its dataset
3. Run the script
4. Check `UNIFIED_DATABASE_REPORT.md` for results
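The steps above might look roughly like the sketch below. The real shape of `COUNTRY_DATASETS` in `scripts/build_unified_database.py` is not shown in this document, so the mapping of country key to file path, the country name, and the filename are all assumptions made for illustration. The helper `missing_datasets` is likewise hypothetical; it only checks that each configured path exists before a rebuild.

```python
from pathlib import Path

# Assumed shape: country key -> path to that country's dataset file.
# Both the key and the path are placeholders, not real project files.
COUNTRY_DATASETS = {
    "exampleland": Path("data/instances/exampleland/exampleland_institutions.yaml"),
}


def missing_datasets(datasets: dict) -> list:
    """Return the country keys whose dataset path does not exist on disk."""
    return sorted(country for country, path in datasets.items() if not Path(path).exists())


print(missing_datasets(COUNTRY_DATASETS))
```

Running a check like this before `python3 scripts/build_unified_database.py` would catch a mistyped path early, rather than partway through the build.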
---
## Documentation
- **Full Report**: `UNIFIED_DATABASE_REPORT.md` - Detailed statistics and analysis
- **Session Summary**: `SESSION_SUMMARY_20251120_FINLAND_UNIFIED.md` - Work completed in the 2025-11-20 session
- **Finland Report**: `data/finland_isil/FINLAND_ISIL_HARVEST_REPORT.md` - Finnish dataset details
- **Main Progress**: `PROGRESS.md` - Overall project status
---
## Support
For questions or issues:
- Check `UNIFIED_DATABASE_REPORT.md` for detailed documentation
- Review `AGENTS.md` for extraction guidelines
- See `PROGRESS.md` for project history
**Version**: 1.0.0 (Phase 1)
**Next Update**: Phase 2 (Denmark + Canada integration)