# Unified GLAM Database - Quick Start

**Last Updated**: 2025-11-20
**Database Version**: 1.0.0 (Phase 1)
**Total Institutions**: 1,678 across 8 countries

---

## Quick Access

### Database Files

```bash
# JSON format (2.5 MB, complete)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.json

# SQLite format (20 KB, partial due to overflow issue)
/Users/kempersc/apps/glam/data/unified/glam_unified_database.db
```

### Query Examples

#### Python (JSON)

```python
import json

# Load database (UTF-8 so names such as "Alakylän kirjasto" decode correctly)
with open('data/unified/glam_unified_database.json', 'r', encoding='utf-8') as f:
    db = json.load(f)

# Get metadata
print(f"Total institutions: {db['metadata']['total_institutions']}")
print(f"Countries: {', '.join(db['metadata']['countries'])}")

# Find Finnish museums
finnish_museums = [
    inst for inst in db['institutions']
    if inst['source_country'] == 'finland'
    and inst['institution_type'] == 'MUSEUM'
]
print(f"Finnish museums: {len(finnish_museums)}")

# Get country statistics
for country, stats in db['country_stats'].items():
    print(f"{country}: {stats['total']} institutions ({stats['with_wikidata']} with Wikidata)")
```

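The filter above can be generalized into a per-type tally. The following is a minimal sketch, assuming every record is a dict with the `institution_type` and `source_country` keys used in the query above; `count_by_type` is a hypothetical helper and the sample records are invented for illustration.

```python
from collections import Counter


def count_by_type(institutions, source_country=None):
    """Tally institution_type values, optionally for a single source_country."""
    return Counter(
        inst.get("institution_type", "UNKNOWN")
        for inst in institutions
        if source_country is None or inst.get("source_country") == source_country
    )


# Tiny invented sample; real records would come from db["institutions"].
sample = [
    {"institution_type": "LIBRARY", "source_country": "finland"},
    {"institution_type": "MUSEUM", "source_country": "finland"},
    {"institution_type": "LIBRARY", "source_country": "belgium"},
    {"source_country": "belgium"},  # record without an institution_type
]

finland_counts = count_by_type(sample, "finland")
all_counts = count_by_type(sample)
print(finland_counts.most_common())
```

Using `.get()` rather than direct indexing keeps the tally from failing on records that lack a field, which may matter for the countries with partial data.
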
#### SQLite (after fixing overflow)

```bash
# Count by country
sqlite3 data/unified/glam_unified_database.db \
  "SELECT country, COUNT(*) FROM institutions GROUP BY country ORDER BY COUNT(*) DESC;"

# Find institutions with Wikidata
sqlite3 data/unified/glam_unified_database.db \
  "SELECT name, country FROM institutions WHERE has_wikidata=1 LIMIT 10;"

# Search by institution type
sqlite3 data/unified/glam_unified_database.db \
  "SELECT name, city FROM institutions WHERE institution_type='MUSEUM';"
```

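The same kind of query can be issued from Python with the standard `sqlite3` module. This is a sketch only: it runs against an in-memory table that copies a few columns of the `institutions` table, the rows are invented, and `museums_in` is a hypothetical helper. Parameter placeholders (`?`) avoid building SQL by string concatenation.

```python
import sqlite3


def museums_in(conn, country_code):
    """Return (name, city) rows for museums in one country, sorted by name."""
    cur = conn.execute(
        "SELECT name, city FROM institutions "
        "WHERE institution_type = ? AND country = ? ORDER BY name",
        ("MUSEUM", country_code),
    )
    return cur.fetchall()


# In-memory stand-in holding only the columns this query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE institutions ("
    "id TEXT PRIMARY KEY, name TEXT NOT NULL, "
    "institution_type TEXT, country TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO institutions VALUES (?, ?, ?, ?, ?)",
    [
        ("ex-1", "Example Museum B", "MUSEUM", "FI", "Helsinki"),
        ("ex-2", "Example Library", "LIBRARY", "FI", "Turku"),
        ("ex-3", "Example Museum A", "MUSEUM", "FI", "Tampere"),
    ],
)

rows = museums_in(conn, "FI")
print(rows)
```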
---

## Database Schema

### JSON Structure

```json
{
  "metadata": {
    "export_date": "2025-11-20T15:17:03+00:00",
    "total_institutions": 1678,
    "unique_ghcids": 565,
    "duplicates": 269,
    "countries": ["finland", "denmark", ...]
  },
  "country_stats": {
    "finland": {
      "total": 817,
      "with_ghcid": 817,
      "with_wikidata": 63,
      "with_website": 58,
      "by_type": {"LIBRARY": 789, "MUSEUM": 15, ...}
    }
  },
  "institutions": [
    {
      "id": "https://w3id.org/heritage/custodian/fi/...",
      "ghcid": "FI-A-A-L-ALKU-Q39176216",
      "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Alakylän kirjasto",
      "institution_type": "LIBRARY",
      "country": "FI",
      "city": "Alavi",
      "has_wikidata": true,
      "has_website": false,
      "raw_record": "{...full LinkML record...}"
    }
  ]
}
```

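In the structure above, `raw_record` appears as a string rather than a nested object, so it likely needs a second decoding step before its fields can be read. The sketch below assumes the string is JSON-encoded; `decode_raw_record` is a hypothetical helper and the sample record is invented.

```python
import json


def decode_raw_record(inst):
    """Return the decoded raw_record, or None if it is absent or not valid JSON."""
    raw = inst.get("raw_record")
    if isinstance(raw, dict):
        return raw  # already decoded, nothing to do
    if not raw:
        return None
    try:
        return json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None


# Invented sample: the nested record is stored as a JSON string.
sample = {
    "name": "Example Library",
    "raw_record": json.dumps({"name": "Example Library", "identifiers": []}),
}

record = decode_raw_record(sample)
print(record["name"])
```

Returning `None` instead of raising keeps a bulk pass over all institutions from stopping at the first malformed record.
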
### SQLite Schema

```sql
CREATE TABLE institutions (
    id TEXT PRIMARY KEY,
    ghcid TEXT,
    ghcid_uuid TEXT,
    ghcid_numeric INTEGER,  -- ⚠️ Overflow issue
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT,
    city TEXT,
    source_country TEXT,
    data_source TEXT,
    data_tier TEXT,
    extraction_date TEXT,
    has_wikidata BOOLEAN,
    has_website BOOLEAN,
    raw_record TEXT  -- Full JSON record
);

CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT
);
```

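The overflow flagged on `ghcid_numeric` is consistent with values that exceed SQLite's signed 64-bit `INTEGER` range, although the exact cause is not stated here. Under that assumption, the sketch below reproduces the error with Python's `sqlite3` module and shows one possible workaround, storing the value as decimal text. The `demo` table and `store_numeric` helper are hypothetical and this is not necessarily the fix planned for Phase 2.

```python
import sqlite3

SQLITE_MAX_INT = 2**63 - 1  # largest value an SQLite INTEGER can hold

conn = sqlite3.connect(":memory:")
# Hypothetical variant of the column: TEXT instead of INTEGER.
conn.execute("CREATE TABLE demo (id TEXT PRIMARY KEY, ghcid_numeric TEXT)")


def store_numeric(conn, row_id, value):
    """Store an arbitrarily large integer as decimal text."""
    conn.execute("INSERT INTO demo VALUES (?, ?)", (row_id, str(value)))


big = 2**64 - 1  # invented example of an unsigned 64-bit identifier

# Binding an int above SQLITE_MAX_INT fails before the statement runs.
try:
    conn.execute("SELECT ?", (big,))
    overflowed = False
except OverflowError:
    overflowed = True

# The text round trip preserves the full value.
store_numeric(conn, "ex-1", big)
stored = conn.execute(
    "SELECT ghcid_numeric FROM demo WHERE id = ?", ("ex-1",)
).fetchone()[0]
roundtrip = int(stored)
print(overflowed, roundtrip == big)
```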
---

## Statistics at a Glance

### Overall

- **Total Institutions**: 1,678
- **Unique GHCIDs**: 565 (33.7%)
- **Wikidata Coverage**: 258 (15.4%)
- **Website Coverage**: 198 (11.8%)

### By Country

| Country | Institutions | GHCID Coverage | Wikidata Coverage | Data Tier |
|---------|--------------|----------------|-------------------|-----------|
| 🇫🇮 Finland | 817 | 100% | 7.7% | TIER_1 |
| 🇧🇪 Belgium | 421 | 0% | 0% | TIER_1 |
| 🇧🇾 Belarus | 167 | 0% | 3.0% | TIER_1 |
| 🇳🇱 Netherlands | 153 | 0% | 73.2% | TIER_1 |
| 🇨🇱 Chile | 90 | 0% | 78.9% | TIER_4 |
| 🇪🇬 Egypt | 29 | 58.6% | 24.1% | TIER_4 |

### By Institution Type

- Libraries: 1,478 (88.1%)
- Museums: 80 (4.8%)
- Archives: 73 (4.4%)
- Education Providers: 12 (0.7%)
- Official Institutions: 12 (0.7%)

---

## Known Limitations (Phase 1)

1. ⚠️ **Denmark excluded** (2,348 institutions) - parser error
2. ⚠️ **Canada excluded** (9,565 institutions) - nested dict error
3. ⚠️ **SQLite incomplete** - INTEGER overflow on ghcid_numeric
4. 🔍 **269 GHCID duplicates** - need collision resolution
5. 📝 **Missing GHCIDs** - Belgium, Netherlands, Belarus, Chile

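Limitation 4 concerns records that share a GHCID. One way to list such collisions from the JSON export is sketched below. It assumes each record carries the `ghcid` and `id` fields shown in the JSON structure; `find_ghcid_collisions` is a hypothetical helper, not part of the project's scripts, and the sample records are invented. How the figure of 269 was counted is not stated here, so this sketch may not reproduce it exactly.

```python
from collections import defaultdict


def find_ghcid_collisions(institutions):
    """Map each GHCID shared by more than one record to those records' ids."""
    by_ghcid = defaultdict(list)
    for inst in institutions:
        ghcid = inst.get("ghcid")
        if ghcid:  # skip records that have no GHCID yet
            by_ghcid[ghcid].append(inst.get("id"))
    return {ghcid: ids for ghcid, ids in by_ghcid.items() if len(ids) > 1}


# Invented sample records; real ones would come from db["institutions"].
sample = [
    {"id": "ex-1", "ghcid": "XX-EXAMPLE-1"},
    {"id": "ex-2", "ghcid": "XX-EXAMPLE-1"},  # collides with ex-1
    {"id": "ex-3", "ghcid": "XX-EXAMPLE-2"},
    {"id": "ex-4", "ghcid": None},            # missing GHCID, ignored
]

collisions = find_ghcid_collisions(sample)
print(collisions)
```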
**Phase 2 will fix these issues and bring total to 13,591 institutions.**

---

## Rebuilding the Database

To rebuild with updated country datasets:

```bash
# Run the unification script
python3 scripts/build_unified_database.py

# Output will be in:
# - data/unified/glam_unified_database.json
# - data/unified/glam_unified_database.db
```

To add a new country dataset:

1. Edit `scripts/build_unified_database.py`
2. Add the country to the `COUNTRY_DATASETS` dict with the path to its dataset
3. Run the script
4. Check `UNIFIED_DATABASE_REPORT.md` for results

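Step 2 can be pictured roughly as follows. The real layout of `COUNTRY_DATASETS` is not shown in this guide, so the sketch assumes a plain mapping from lowercase country name to dataset path. The paths, the `register_country` helper, and the `missing_paths` check are all hypothetical placeholders.

```python
from pathlib import Path

# Assumed shape only; the real dict lives in scripts/build_unified_database.py.
COUNTRY_DATASETS = {
    "finland": Path("data/example_finland/institutions.json"),  # placeholder path
}


def register_country(datasets, country, path):
    """Add a country under a normalized lowercase key, refusing duplicates."""
    key = country.strip().lower()
    if key in datasets:
        raise ValueError(f"{key!r} is already registered")
    datasets[key] = Path(path)
    return datasets


def missing_paths(datasets):
    """List countries whose dataset path does not exist on disk."""
    return sorted(c for c, p in datasets.items() if not Path(p).exists())


datasets = dict(COUNTRY_DATASETS)  # work on a copy
register_country(datasets, " Norway ", "data/example_norway/institutions.json")
print(sorted(datasets))
print("Missing on disk:", missing_paths(datasets))
```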
---

## Documentation

- **Full Report**: `UNIFIED_DATABASE_REPORT.md` - Detailed statistics and analysis
- **Session Summary**: `SESSION_SUMMARY_20251120_FINLAND_UNIFIED.md` - What we did today
- **Finland Report**: `data/finland_isil/FINLAND_ISIL_HARVEST_REPORT.md` - Finnish dataset details
- **Main Progress**: `PROGRESS.md` - Overall project status

---

## Support

For questions or issues:
- Check `UNIFIED_DATABASE_REPORT.md` for detailed documentation
- Review `AGENTS.md` for extraction guidelines
- See `PROGRESS.md` for project history

**Version**: 1.0.0 (Phase 1)
**Next Update**: Phase 2 (Denmark + Canada integration)