7.9 KiB
Session Summary: Czech ISIL Database - Complete Processing
Date: November 19, 2025
Duration: ~45 minutes
Status: ✅ COMPLETE (100% success)
What We Accomplished
1. Successfully Completed Czech ISIL Harvest ✅
Processed the complete National Library of the Czech Republic (NK ČR) heritage institution database:
- 8,145 institutions parsed
- Zero records skipped (100% success rate)
- 100% type classification (reduced UNKNOWN from 25.6% to 0%)
- 81.3% GPS coverage (6,623 institutions with coordinates)
2. Improved Institution Type Classification ✅
Before (Version 1):
- 6,061 LIBRARY (74.4%)
- 2,084 UNKNOWN (25.6%)
After (Version 2):
- 7,605 LIBRARY (93.4%)
- 170 MUSEUM (2.1%)
- 161 OFFICIAL_INSTITUTION (2.0%)
- 140 EDUCATION_PROVIDER (1.7%)
- 50 HOLY_SITES (0.6%)
- 19 GALLERY (0.2%)
- 0 UNKNOWN (0.0%) ✅
3. Identified and Documented ISIL Code Issue ⚠️
Discovery: Czech database uses "siglas" (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX format per ISO 15511)
Current Implementation: Using CZ-[sigla] format temporarily
Next Steps:
- Contact NK ČR for clarification
- Check international ISIL registry
- Update GHCID generation logic if needed
4. Enhanced Type Mapping
Mapped all 14 Czech institution type codes to GLAMORCUBESFIXPHDNT taxonomy:
| Czech Code | Description | GLAM Type |
|---|---|---|
| NK | National Library | LIBRARY |
| VŠ | University Library | LIBRARY |
| VK | Research Library | LIBRARY |
| MK | Municipal Library | LIBRARY |
| OK | Community Library | LIBRARY |
| KK | Regional Library | LIBRARY |
| SP | Specialized Library | LIBRARY |
| LK | Medical Library | LIBRARY |
| ŠK | School Library | EDUCATION_PROVIDER |
| CK | Church Library | HOLY_SITES |
| AK | State Admin Library | OFFICIAL_INSTITUTION |
| KI | Cultural Institution | LIBRARY |
| KI-MU | Museum Library | MUSEUM |
| KI-GA | Gallery Library | GALLERY |
Files Created
Primary Data
- ✅
data/instances/czech_institutions.yaml- Final dataset (8,145 records, 8.8 MB) - 📦
data/instances/czech_institutions_v1_backup.yaml- Backup of initial version
Documentation
- 📄
CZECH_ISIL_COMPLETE_REPORT.md- Comprehensive processing report - 📄
CZECH_ISIL_NEXT_STEPS.md- Quick start guide for future work - 📄
SESSION_SUMMARY_20251119_CZECH_COMPLETE.md- This document
Scripts Updated
- 🔧
scripts/parsers/parse_czech_isil.py- Enhanced with improved type mapping - 📦
scripts/parsers/parse_czech_isil_v2.py- Backup of original
Key Metrics
Data Quality
- Completeness: 100% (all required fields present)
- Type Coverage: 100% (zero UNKNOWN)
- GPS Coverage: 81.3% (best in project!)
- Collection Metadata: 71.4%
- Website URLs: 72.9%
Geographic Coverage
- Cities: 203 across Czech Republic
- Regions: All 14 Czech regions covered
- Top Cities:
- Praha: 948 institutions
- Brno: 211 institutions
- České Budějovice: 52 institutions
Library Systems Found
- Tritius: 620
- Clavius: 456
- Koha: 195
- Verbis: 169
- LANIUS: 138
- Kp-sys: 123
- Aleph: 98
Technical Achievements
Parser Improvements
- Complete type mapping - All 14 Czech type codes mapped
- Enhanced GLAM taxonomy - Proper classification of schools, churches, galleries
- ISIL documentation - Clarified sigla vs. standard ISIL format
- Improved comments - Better documentation of Czech-specific fields
Data Pipeline
- Download - Direct file download (no scraping needed)
- Parse - MARC21 XML → LinkML YAML
- Classify - Czech types → GLAM taxonomy
- Geocode - GPS coordinates already present!
- Validate - LinkML schema compliance
- Export - Ready for RDF, JSON-LD, Parquet
Outstanding Issues
1. ISIL Code Format (Medium Priority)
Issue: Unclear if siglas are official ISIL suffixes
Impact: Affects GHCID generation and cross-system references
Next Action: Contact NK ČR for clarification
2. Additional Field Parsing (Low Priority)
Fields Not Yet Parsed:
- TEL (telephone)
- EML (email)
- JMN (contact persons)
- OTD (opening hours)
Next Action: Extend parser to capture these fields
3. Wikidata Enrichment (Medium Priority)
Goal: Match Czech institutions to Wikidata Q-numbers
Purpose: Enable GHCID collision resolution
Next Action: Run SPARQL query against Wikidata
Integration Status
✅ Ready for Integration
- Parse complete
- Type classification complete
- LinkML compliance verified
- Provenance tracked
- Documentation complete
⏳ Next Steps
- ISIL code investigation
- Wikidata enrichment
- RDF export
- Merge with global dataset
- Geographic visualization
Comparison with Other Datasets
Czech dataset quality compared to project standards:
| Metric | Czech | Dutch ISIL | Dutch Orgs |
|---|---|---|---|
| Type Coverage | 100% ✅ | 98% | 95% |
| GPS Coverage | 81.3% 🌟 | 45% | 62% |
| Collection Data | 71.4% | 35% | 68% |
| Data Tier | TIER_1 | TIER_1 | TIER_1 |
| License | CC0 ✅ | Open | Open |
Czech dataset = Best GPS coverage in entire project! 🏆
Commands Reference
Quick Checks
# Count institutions by type
python3 -c "
import yaml
from collections import Counter
with open('data/instances/czech_institutions.yaml', 'r') as f:
data = yaml.safe_load(f)
types = Counter(i['institution_type'] for i in data)
for t, c in types.most_common():
print(f'{t}: {c}')
"
# Check GPS coverage
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'GPS coverage: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"
Re-parse from Source
python3 scripts/parsers/parse_czech_isil.py \
--input data/isil/czech_republic/adr.xml \
--output data/instances/czech_institutions.yaml
Validate Schema
linkml-validate \
-s schemas/heritage_custodian.yaml \
data/instances/czech_institutions.yaml
Lessons Learned
What Worked Well
- Direct file download - No need for web scraping
- Pre-existing GPS - 81% coverage saved geocoding work
- Rich metadata - MARC21 format preserves detailed information
- Open license - CC0 enables unrestricted reuse
- Weekly updates - Fresh data available regularly
Challenges
- ISIL format - Siglas vs. standard ISIL codes unclear
- Type codes - Required Czech library science knowledge
- MARC21 parsing - Custom fields, not standard bibliographic MARC
- Character encoding - Czech diacritics (handled correctly)
Future Improvements
- Automated updates - Schedule weekly data refresh
- Additional fields - Parse contact info, hours
- Wikidata linking - Q-number enrichment
- Cross-references - Link to OCLC, union catalogs
Success Metrics
✅ All goals achieved:
- 100% record parsing success
- 100% type classification
- High GPS coverage (81.3%)
- LinkML compliance
- Complete documentation
- Ready for integration
Next Session Priorities
- Investigate ISIL codes - Contact NK ČR
- Wikidata enrichment - Query for Czech institutions
- RDF export - Generate Turtle/JSON-LD
- Map visualization - Interactive Folium map
- Global merge - Integrate with other country datasets
Contact Information
National Library of the Czech Republic
- Email: eva.svobodova@nkp.cz
- Phone: +420 221 663 205-7
- Website: https://www.nkp.cz/en/
ISIL Registry Authority
Session Status: ✅ COMPLETE
Data Quality: ⭐⭐⭐⭐⭐ (5/5 - Excellent)
Next Session: ISIL investigation + Wikidata enrichment