glam/SESSION_SUMMARY_20251119_CZECH_COMPLETE.md
2025-11-19 23:25:22 +01:00

Session Summary: Czech ISIL Database - Complete Processing

Date: November 19, 2025
Duration: ~45 minutes
Status: COMPLETE (100% success)

What We Accomplished

1. Successfully Completed Czech ISIL Harvest

Processed the complete National Library of the Czech Republic (NK ČR) heritage institution database:

  • 8,145 institutions parsed
  • Zero records skipped (100% success rate)
  • 100% type classification (reduced UNKNOWN from 25.6% to 0%)
  • 81.3% GPS coverage (6,623 institutions with coordinates)

2. Improved Institution Type Classification

Before (Version 1):

  • 6,061 LIBRARY (74.4%)
  • 2,084 UNKNOWN (25.6%)

After (Version 2):

  • 7,605 LIBRARY (93.4%)
  • 170 MUSEUM (2.1%)
  • 161 OFFICIAL_INSTITUTION (2.0%)
  • 140 EDUCATION_PROVIDER (1.7%)
  • 50 HOLY_SITES (0.6%)
  • 19 GALLERY (0.2%)
  • 0 UNKNOWN (0.0%)
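
The percentages above follow directly from the counts. As a quick sanity check, using only the figures listed in this section, the following sketch confirms that the Version 2 counts sum to 8,145 and reproduce the reported shares when rounded to one decimal place:

```python
# Counts copied from the Version 2 breakdown above.
v2_counts = {
    "LIBRARY": 7605,
    "MUSEUM": 170,
    "OFFICIAL_INSTITUTION": 161,
    "EDUCATION_PROVIDER": 140,
    "HOLY_SITES": 50,
    "GALLERY": 19,
    "UNKNOWN": 0,
}

# Total number of classified institutions.
total = sum(v2_counts.values())

# Share of each type in percent, rounded to one decimal place as in the report.
shares = {t: round(c / total * 100, 1) for t, c in v2_counts.items()}

print(f"total: {total}")
for institution_type, share in shares.items():
    print(f"{institution_type}: {share}%")
```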

3. Identified and Documented ISIL Code Issue ⚠️

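Discovery: The Czech database identifies institutions by "siglas" (e.g., ABA000) rather than by full ISIL codes, which under ISO 15511 consist of a country prefix, a hyphen, and a unit identifier (i.e., a CZ-XXXXX format)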

Current Implementation: Using CZ-[sigla] format temporarily

Next Steps:

  • Contact NK ČR for clarification
  • Check international ISIL registry
  • Update GHCID generation logic if needed
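
The temporary CZ-[sigla] scheme can be sketched as below. This is a minimal illustration, not the parser's actual code: the function name `provisional_isil` is hypothetical, the sigla pattern (three letters plus three digits) is generalised from the single example ABA000, and the structural ISIL check (short alphabetic prefix, hyphen, identifier of up to 11 characters) is an assumption that should be verified against ISO 15511.

```python
import re

# Assumption: a sigla is three uppercase letters followed by three digits,
# generalised from the single example "ABA000" in this document.
SIGLA_RE = re.compile(r"^[A-Z]{3}\d{3}$")

# Assumption: rough structural check for an ISIL (prefix, hyphen, identifier).
# Verify the allowed characters and lengths against ISO 15511 before relying on it.
ISIL_RE = re.compile(r"^[A-Za-z]{1,4}-[A-Za-z0-9:/-]{1,11}$")


def provisional_isil(sigla: str, prefix: str = "CZ") -> str:
    """Build the temporary CZ-[sigla] identifier (hypothetical helper)."""
    sigla = sigla.strip().upper()
    if not SIGLA_RE.match(sigla):
        raise ValueError(f"unexpected sigla format: {sigla!r}")
    isil = f"{prefix}-{sigla}"
    if not ISIL_RE.match(isil):
        raise ValueError(f"not a structurally valid ISIL: {isil!r}")
    return isil


print(provisional_isil("ABA000"))
```

Keeping the construction in one function means that, if NK ČR confirms a different official format, only this helper and the GHCID logic that consumes it would need to change.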

4. Enhanced Type Mapping

Mapped all 14 Czech institution type codes to GLAMORCUBESFIXPHDNT taxonomy:

| Czech Code | Description | GLAM Type |
|------------|-------------|-----------|
| NK | National Library | LIBRARY |
|  | University Library | LIBRARY |
| VK | Research Library | LIBRARY |
| MK | Municipal Library | LIBRARY |
| OK | Community Library | LIBRARY |
| KK | Regional Library | LIBRARY |
| SP | Specialized Library | LIBRARY |
| LK | Medical Library | LIBRARY |
| ŠK | School Library | EDUCATION_PROVIDER |
| CK | Church Library | HOLY_SITES |
| AK | State Admin Library | OFFICIAL_INSTITUTION |
| KI | Cultural Institution | LIBRARY |
| KI-MU | Museum Library | MUSEUM |
| KI-GA | Gallery Library | GALLERY |
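
The mapping above lends itself to a plain lookup table. The sketch below is an illustration rather than the code in parse_czech_isil.py: the names `CZECH_TYPE_TO_GLAM` and `classify` are hypothetical, and the university-library row is left out because its Czech code is not given in this summary.

```python
# Mapping copied from the table above. The university-library row is omitted
# because its Czech code does not appear in this document.
CZECH_TYPE_TO_GLAM = {
    "NK": "LIBRARY",
    "VK": "LIBRARY",
    "MK": "LIBRARY",
    "OK": "LIBRARY",
    "KK": "LIBRARY",
    "SP": "LIBRARY",
    "LK": "LIBRARY",
    "ŠK": "EDUCATION_PROVIDER",
    "CK": "HOLY_SITES",
    "AK": "OFFICIAL_INSTITUTION",
    "KI": "LIBRARY",
    "KI-MU": "MUSEUM",
    "KI-GA": "GALLERY",
}


def classify(czech_code: str) -> str:
    """Return the GLAM type for a Czech type code, or UNKNOWN if unmapped."""
    return CZECH_TYPE_TO_GLAM.get(czech_code.strip(), "UNKNOWN")


print(classify("KI-MU"))
```

Falling back to UNKNOWN instead of raising keeps a re-parse from failing if the source ever introduces a new type code, while still making unmapped codes visible in the type counts.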

Files Created

Primary Data

  • data/instances/czech_institutions.yaml - Final dataset (8,145 records, 8.8 MB)
  • 📦 data/instances/czech_institutions_v1_backup.yaml - Backup of initial version

Documentation

  • 📄 CZECH_ISIL_COMPLETE_REPORT.md - Comprehensive processing report
  • 📄 CZECH_ISIL_NEXT_STEPS.md - Quick start guide for future work
  • 📄 SESSION_SUMMARY_20251119_CZECH_COMPLETE.md - This document

Scripts Updated

  • 🔧 scripts/parsers/parse_czech_isil.py - Enhanced with improved type mapping
  • 📦 scripts/parsers/parse_czech_isil_v2.py - Backup of original

Key Metrics

Data Quality

  • Completeness: 100% (all required fields present)
  • Type Coverage: 100% (zero UNKNOWN)
  • GPS Coverage: 81.3% (best in project!)
  • Collection Metadata: 71.4%
  • Website URLs: 72.9%

Geographic Coverage

  • Cities: 203 across the Czech Republic
  • Regions: All 14 Czech regions covered
  • Top Cities:
    • Praha: 948 institutions
    • Brno: 211 institutions
    • České Budějovice: 52 institutions

Library Systems Found

  • Tritius: 620
  • Clavius: 456
  • Koha: 195
  • Verbis: 169
  • LANIUS: 138
  • Kp-sys: 123
  • Aleph: 98

Technical Achievements

Parser Improvements

  1. Complete type mapping - All 14 Czech type codes mapped
  2. Enhanced GLAM taxonomy - Proper classification of schools, churches, galleries
  3. ISIL documentation - Clarified sigla vs. standard ISIL format
  4. Improved comments - Better documentation of Czech-specific fields

Data Pipeline

  1. Download - Direct file download (no scraping needed)
  2. Parse - MARC21 XML → LinkML YAML
  3. Classify - Czech types → GLAM taxonomy
  4. Geocode - GPS coordinates already present!
  5. Validate - LinkML schema compliance
  6. Export - Ready for RDF, JSON-LD, Parquet

Outstanding Issues

1. ISIL Code Format (Medium Priority)

Issue: It is unclear whether siglas are registered as official ISIL identifiers (the part after the CZ- prefix)
Impact: Affects GHCID generation and cross-system references
Next Action: Contact NK ČR for clarification

2. Additional Field Parsing (Low Priority)

Fields Not Yet Parsed:

  • TEL (telephone)
  • EML (email)
  • JMN (contact persons)
  • OTD (opening hours)

Next Action: Extend parser to capture these fields
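
One possible shape for that extension is sketched below. It rests on a loud assumption: that each of the four fields appears as an element whose tag name equals the field code. The real adr.xml layout is not shown in this summary and may differ (for example, it may use subfields), so the sample record, the `CONTACT_FIELDS` output keys and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Field codes from the list above, mapped to hypothetical output keys.
CONTACT_FIELDS = {
    "TEL": "telephone",
    "EML": "email",
    "JMN": "contact_persons",
    "OTD": "opening_hours",
}

# Hypothetical sample record; the real adr.xml structure may differ.
SAMPLE = """<record>
  <TEL>+420 000 000 000</TEL>
  <EML>info@example.org</EML>
  <EML>library@example.org</EML>
  <OTD>Mon-Fri 9-17</OTD>
</record>"""


def extract_contact_fields(record: ET.Element) -> Dict[str, List[str]]:
    """Collect the text of every TEL/EML/JMN/OTD element in a record."""
    result: Dict[str, List[str]] = {}
    for tag, key in CONTACT_FIELDS.items():
        values = []
        for element in record.iter(tag):
            # Join nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                values.append(text)
        if values:
            result[key] = values
    return result


contacts = extract_contact_fields(ET.fromstring(SAMPLE))
print(contacts)
```

Fields are returned as lists because a record may repeat a field (two email addresses in the sample), and absent fields are simply left out of the result.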

3. Wikidata Enrichment (Medium Priority)

Goal: Match Czech institutions to Wikidata Q-numbers
Purpose: Enable GHCID collision resolution
Next Action: Run SPARQL query against Wikidata
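
A first pass at that query might look like the sketch below. It asks the public Wikidata endpoint for items located in the Czech Republic (P17 = Q213) that carry an ISIL value (P791) and indexes the results by ISIL. The function names and the User-Agent string are hypothetical, and whether Wikidata stores Czech ISIL values in the CZ-[sigla] form is exactly the open question noted earlier, so the match rate is not known in advance.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

ENDPOINT = "https://query.wikidata.org/sparql"

# P17 = country, Q213 = Czech Republic, P791 = ISIL.
QUERY = """
SELECT ?item ?isil WHERE {
  ?item wdt:P17 wd:Q213 ;
        wdt:P791 ?isil .
}
"""


def fetch_bindings(query: str = QUERY) -> List[dict]:
    """Run the query and return the raw SPARQL JSON bindings (needs network)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical User-Agent; replace with real contact details.
            "User-Agent": "glam-czech-enrichment-sketch/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


def index_by_isil(bindings: List[dict]) -> Dict[str, str]:
    """Map each ISIL value to the Q-number at the end of the entity URI."""
    index: Dict[str, str] = {}
    for row in bindings:
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        index[row["isil"]["value"]] = qid
    return index
```

Matching then reduces to looking up each record's provisional CZ-[sigla] value in the result of `index_by_isil(fetch_bindings())`. Few or no hits would itself be useful evidence for the ISIL format investigation.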

Integration Status

Ready for Integration

  • Parse complete
  • Type classification complete
  • LinkML compliance verified
  • Provenance tracked
  • Documentation complete

Next Steps

  • ISIL code investigation
  • Wikidata enrichment
  • RDF export
  • Merge with global dataset
  • Geographic visualization

Comparison with Other Datasets

Czech dataset quality compared to project standards:

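| Metric | Czech | Dutch ISIL | Dutch Orgs |
|--------|-------|------------|------------|
| Type Coverage | 100% | 98% | 95% |
| GPS Coverage | 81.3% 🌟 | 45% | 62% |
| Collection Data | 71.4% | 35% | 68% |
| Data Tier | TIER_1 | TIER_1 | TIER_1 |
| License | CC0 | Open | Open |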

The Czech dataset has the best GPS coverage in the entire project! 🏆

Commands Reference

Quick Checks

# Count institutions by type
python3 -c "
import yaml
from collections import Counter
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
types = Counter(i['institution_type'] for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"

# Check GPS coverage
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
# Treat a missing or empty locations list as no GPS data
with_gps = sum(1 for i in data if any('latitude' in loc for loc in (i.get('locations') or [])))
print(f'GPS coverage: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"

Re-parse from Source

python3 scripts/parsers/parse_czech_isil.py \
  --input data/isil/czech_republic/adr.xml \
  --output data/instances/czech_institutions.yaml

Validate Schema

linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml

Lessons Learned

What Worked Well

  1. Direct file download - No need for web scraping
  2. Pre-existing GPS - 81% coverage saved geocoding work
  3. Rich metadata - MARC21 format preserves detailed information
  4. Open license - CC0 enables unrestricted reuse
  5. Weekly updates - Fresh data available regularly

Challenges

  1. ISIL format - Siglas vs. standard ISIL codes unclear
  2. Type codes - Required Czech library science knowledge
  3. MARC21 parsing - Custom fields, not standard bibliographic MARC
  4. Character encoding - Czech diacritics (handled correctly)

Future Improvements

  1. Automated updates - Schedule weekly data refresh
  2. Additional fields - Parse contact info, hours
  3. Wikidata linking - Q-number enrichment
  4. Cross-references - Link to OCLC, union catalogs

Success Metrics

All goals achieved:

  • 100% record parsing success
  • 100% type classification
  • High GPS coverage (81.3%)
  • LinkML compliance
  • Complete documentation
  • Ready for integration

Next Session Priorities

  1. Investigate ISIL codes - Contact NK ČR
  2. Wikidata enrichment - Query for Czech institutions
  3. RDF export - Generate Turtle/JSON-LD
  4. Map visualization - Interactive Folium map
  5. Global merge - Integrate with other country datasets

Contact Information

National Library of the Czech Republic

ISIL Registry Authority


Session Status: COMPLETE
Data Quality: 5/5 (Excellent)
Next Session: ISIL investigation + Wikidata enrichment