# Czech ISIL Database - Next Steps & Quick Start

## Current Status

- ✅ **COMPLETE**: 8,145 Czech institutions parsed with 100% type classification
- ⏳ **IN PROGRESS**: ISIL code format investigation
- 📋 **PLANNED**: Wikidata enrichment, RDF export, integration

## Files to Use

### Main Data File

- **`data/instances/czech_institutions.yaml`** - 8,145 records, LinkML-compliant
- Format: YAML list of HeritageCustodian objects
- Size: 8.8 MB
- Classification: 100% complete

### Source Data

- **`data/isil/czech_republic/adr.xml`** - Original MARC21 XML (27 MB)
- **`data/isil/czech_republic/adr.xml.gz`** - Compressed download (1.9 MB)
- Download URL: https://aleph.nkp.cz/data/adr.xml.gz
- Updated: Weekly (every Monday)

### Parser

- **`scripts/parsers/parse_czech_isil.py`** - MARC21 to LinkML converter
- Usage: `python3 scripts/parsers/parse_czech_isil.py`
- Version: 2.0 (improved type mapping)

## Immediate Next Steps

### 1. Investigate ISIL Code Format

**Issue**: The Czech database uses siglas (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX).

**Actions**:

```bash
# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
# 1. Are siglas the official ISIL suffixes for Czech institutions?
# 2. If not, how do siglas map to ISIL codes?
# 3. Can you provide an official ISIL registry for Czech libraries?
```

```bash
# Option B: Check the international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
```

```bash
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check whether the format is recognized by OCLC, Wikidata, etc.
```
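Option C can be partly automated. ISO 15511 limits an ISIL to 16 characters total (country prefix, hyphen, then a local suffix built from digits, basic Latin letters, `/`, `:`, and `-`), so a structural pre-check of the `CZ-<sigla>` hypothesis is cheap. The helper below is hypothetical; whether siglas really map 1:1 to ISIL suffixes is exactly what Option A should confirm with NK ČR.

```python
import re

# ISO 15511: an ISIL is at most 16 characters; the local suffix may use
# digits, unaccented Latin letters, and the characters "/", ":", "-".
SUFFIX_RE = re.compile(r'^[0-9A-Za-z/:\-]+$')

def candidate_isil(sigla, prefix="CZ"):
    """Return 'CZ-<sigla>' if it would be a structurally valid ISIL, else None.

    Hypothetical helper: assumes siglas become ISIL suffixes unchanged.
    """
    isil = f"{prefix}-{sigla}"
    if len(isil) <= 16 and SUFFIX_RE.match(sigla):
        return isil
    return None

print(candidate_isil("ABA000"))                    # CZ-ABA000
print(candidate_isil("THIS-SIGLA-IS-FAR-TOO-LONG"))  # None (exceeds 16 chars)
```

A pass of this check over all 8,145 siglas would show whether any record could not even structurally become an ISIL, regardless of what NK ČR answers.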
### 2. Wikidata Enrichment

**Goal**: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution

**Script Template**:

```python
# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_czech_libraries():
    """Query Wikidata for Czech libraries and institutions."""
    endpoint = "https://query.wikidata.org/sparql"
    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .      # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .                 # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }       # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }       # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames }  # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Run query
results = query_wikidata_czech_libraries()
print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

# Match to our dataset by name, city, or sigla
# Use fuzzy matching with rapidfuzz
```

**Usage**:

```bash
python3 scripts/enrich_czech_wikidata.py
```
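The fuzzy-matching step at the end of the template can be sketched as follows. This sketch uses the stdlib's `difflib` as a stand-in for rapidfuzz, and the labels and QID keys are placeholders, not verified Wikidata records:

```python
# Name-matching sketch: find the closest Wikidata label for each of our
# institution names. difflib stands in for rapidfuzz here.
from difflib import SequenceMatcher

def best_match(name, candidates, cutoff=0.85):
    """Return (qid, label, ratio) of the closest label, or None below cutoff."""
    scored = [
        (qid, label, SequenceMatcher(None, name.lower(), label.lower()).ratio())
        for qid, label in candidates.items()
    ]
    qid, label, ratio = max(scored, key=lambda t: t[2])
    return (qid, label, ratio) if ratio >= cutoff else None

wikidata_labels = {  # placeholder QIDs, for illustration only
    "Q_EXAMPLE_1": "Moravská zemská knihovna v Brně",
    "Q_EXAMPLE_2": "Městská knihovna v Praze",
}

print(best_match("Moravská zemská knihovna", wikidata_labels))
```

With rapidfuzz, `process.extractOne(name, candidates, score_cutoff=...)` plays the same role and is much faster over 8,145 × 5,000 comparisons; the 0.85 cutoff here is a starting point to tune against a hand-labelled sample, not a validated threshold.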
### 3. RDF Export

**Goal**: Export Czech institutions as RDF/Turtle for semantic web integration

**Script Template**:

```python
# scripts/export_czech_rdf.py
import os
import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")
g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])
    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))

    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export (make sure the output directory exists first)
os.makedirs('data/exports', exist_ok=True)
g.serialize('data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")
```

**Usage**:

```bash
python3 scripts/export_czech_rdf.py
```
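For reference, one exported record should come out roughly like this (the URI, label, and coordinates are illustrative, not taken from the dataset):

```turtle
@prefix glam: <https://w3id.org/heritage/ontology/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<https://w3id.org/heritage/custodian/CZ-ABA000> a glam:HeritageCustodian ;
    rdfs:label "Národní knihovna České republiky"@cs ;
    glam:institutionType "LIBRARY" ;
    glam:latitude 50.09 ;
    glam:longitude 14.42 .
```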
### 4. Geographic Visualization

**Goal**: Create an interactive map of Czech heritage institutions

**Script Template**:

```python
# scripts/visualize_czech_map.py
import os
import yaml
import folium

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on the Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers (popups render HTML, so use <br> for the line break)
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map
os.makedirs('data/exports', exist_ok=True)
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")
```

**Usage**:

```bash
python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html
```

## Integration with Global Dataset

### Merge Script Template

```python
# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if it exists; an empty file yields None, hence `or []`)
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        global_institutions = yaml.safe_load(f) or []
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID, keeping the first occurrence)
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
```

## Quick Checks

### Count by Type

```bash
python3 -c "
import yaml
from collections import Counter
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"
```

### Sample Record

```bash
python3 -c "
import yaml, json
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
print(json.dumps(data[0], indent=2, ensure_ascii=False))
"
```
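One more check is worth running before the merge step, since the merge script silently drops duplicate IDs: count how often each `id` occurs. A self-contained sketch, with an inline sample list standing in for the parsed YAML (the `heritage:` id prefix is hypothetical):

```python
# Duplicate-ID check; `data` stands in for yaml.safe_load(...) of
# czech_institutions.yaml (records are dicts with an 'id' key).
from collections import Counter

data = [
    {"id": "heritage:CZ-ABA000", "name": "A"},
    {"id": "heritage:CZ-BOA001", "name": "B"},
    {"id": "heritage:CZ-ABA000", "name": "A (duplicate)"},
]

counts = Counter(rec["id"] for rec in data)
dupes = {i: c for i, c in counts.items() if c > 1}
print(f"Duplicate IDs: {len(dupes)}")  # Duplicate IDs: 1
for i, c in dupes.items():
    print(f"  {i}: {c} occurrences")
```

Run against the real file, a non-zero result would mean the merge will discard records, so duplicates should be resolved in the parser first.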
### GPS Coverage

```bash
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"
```

## Data Refresh

To download and re-parse the latest data (updated weekly on Mondays):

```bash
# Download the latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz

# Re-parse
cd /Users/kempersc/apps/glam
python3 scripts/parsers/parse_czech_isil.py

# Compare line counts against the previous version
wc -l data/instances/czech_institutions_v1_backup.yaml data/instances/czech_institutions.yaml
```

## Validation

### LinkML Schema Validation

```bash
# Install linkml if needed
pip install linkml

# Validate Czech institutions against the schema
linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml
```

### Data Quality Checks

```bash
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]
print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
```

## Related Documentation

- **Complete Report**: `CZECH_ISIL_COMPLETE_REPORT.md`
- **Initial Harvest**: `CZECH_ISIL_HARVEST_SUMMARY.md`
- **Technical Analysis**: `data/isil/czech_republic/czech_isil_analysis.md`
- **README**: `data/isil/czech_republic/README.md`
- **Schema**: `schemas/heritage_custodian.yaml`
- **Agent Instructions**: `AGENTS.md`

## Contact Information
**National Library of the Czech Republic (NK ČR)**

- CASLIN Team: eva.svobodova@nkp.cz
- Phone: +420 221 663 205-7
- Address: Sodomkova 2/1146, 102 00 Praha 10
- Website: https://www.nkp.cz/en/

**International ISIL Registry**

- Authority: Danish Agency for Culture and Palaces
- Website: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil

---

**Last Updated**: November 19, 2025
**Data Version**: ADR database downloaded November 19, 2025
**Parser Version**: 2.0 (improved type mapping)