
Czech ISIL Database - Next Steps & Quick Start

Current Status

  • COMPLETE: 8,145 Czech institutions parsed with 100% type classification
  • IN PROGRESS: ISIL code format investigation
  • PLANNED: Wikidata enrichment, RDF export, integration

Files to Use

Main Data File

  • data/instances/czech_institutions.yaml - 8,145 records, LinkML-compliant
  • Format: YAML list of HeritageCustodian objects
  • Size: 8.8 MB
  • Classification: 100% complete

Source Data

  • data/isil/czech_republic/adr.xml - Original MARC21 XML (27 MB)
  • data/isil/czech_republic/adr.xml.gz - Compressed download (1.9 MB)
  • Download URL: https://aleph.nkp.cz/data/adr.xml.gz
  • Updated: Weekly (every Monday)

Parser

  • scripts/parsers/parse_czech_isil.py - MARC21 to LinkML converter
  • Usage: python3 scripts/parsers/parse_czech_isil.py
  • Version: 2.0 (improved type mapping)

Immediate Next Steps

1. Investigate ISIL Code Format

Issue: Czech database uses "siglas" (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX)

Actions:

# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
#   1. Are siglas the official ISIL suffixes for Czech institutions?
#   2. If not, how do siglas map to ISIL codes?
#   3. Can you provide an official ISIL registry for Czech libraries?
# Option B: Check international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check if format is recognized by OCLC, Wikidata, etc.
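Option C can be partially automated: ISO 15511 constrains an ISIL to at most 16 characters, with a 1-4 letter prefix, a hyphen, and a unit identifier drawn from letters, digits, `-`, `/`, and `:`. A sketch that builds candidate `CZ-` codes from siglas and checks that syntax (the regex and function name are illustrative, not an official validator, and a syntactic match does not mean the code is registered):

```python
import re

# Syntactic check only: matches the ISO 15511 shape (1-4 letter prefix,
# hyphen, then up to 11 chars from A-Z a-z 0-9 - / :). It does NOT
# confirm that the code exists in any ISIL registry.
ISIL_RE = re.compile(r'^[A-Z]{1,4}-[A-Za-z0-9:/\-]{1,11}$')

def sigla_to_candidate_isil(sigla):
    """Prefix a Czech sigla with 'CZ-' and validate the resulting syntax."""
    candidate = f"CZ-{sigla.strip()}"
    return candidate if ISIL_RE.fullmatch(candidate) else None

print(sigla_to_candidate_isil("ABA000"))  # CZ-ABA000
```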

2. Wikidata Enrichment

Goal: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution

Script Template:

# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_czech_libraries():
    """Query Wikidata for Czech libraries and institutions."""
    endpoint = "https://query.wikidata.org/sparql"
    
    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .  # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .              # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }   # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }   # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames } # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """
    
    # Wikidata's User-Agent policy asks clients to identify themselves;
    # the agent string below is an example -- adjust it for this project.
    sparql = SPARQLWrapper(endpoint, agent="glam-czech-isil/1.0")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Run query
results = query_wikidata_czech_libraries()
print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

# Match to our dataset by name, city, or sigla
# Use fuzzy matching with rapidfuzz

Usage:

python3 scripts/enrich_czech_wikidata.py
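The fuzzy name matching mentioned in the template can be sketched with the stdlib `difflib` (rapidfuzz offers faster equivalents); `best_match` and the 0.85 threshold are illustrative assumptions, not part of the pipeline:

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None if nothing
    clears the threshold. Stdlib stand-in for rapidfuzz matching."""
    if not candidates:
        return None
    score, label = max(
        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    )
    return label if score >= threshold else None

# Tolerates missing diacritics, a common mismatch between sources
print(best_match("narodni knihovna", ["Národní knihovna", "Městské muzeum"]))
# Národní knihovna
```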

3. RDF Export

Goal: Export Czech institutions as RDF/Turtle for semantic web integration

Script Template:

# scripts/export_czech_rdf.py
import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")

g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])
    
    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))
    
    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export
g.serialize('data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")

Usage:

python3 scripts/export_czech_rdf.py
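As a quick sanity check on the expected Turtle shape, one record can be rendered by hand. The namespaces mirror the script above; `to_turtle` is a throwaway helper and the sample record (built from the CZ-ABA000 example code) is illustrative:

```python
def to_turtle(inst):
    """Render one institution as Turtle, mirroring the triples the
    rdflib script emits (prefix declarations omitted for brevity)."""
    return "\n".join([
        f"<{inst['id']}> a glam:HeritageCustodian ;",
        f"    rdfs:label \"{inst['name']}\"@cs ;",
        f"    glam:institutionType \"{inst['institution_type']}\" .",
    ])

sample = {
    "id": "https://w3id.org/heritage/custodian/CZ-ABA000",
    "name": "Národní knihovna České republiky",
    "institution_type": "LIBRARY",
}
print(to_turtle(sample))
```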

4. Geographic Visualization

Goal: Create interactive map of Czech heritage institutions

Script Template:

# scripts/visualize_czech_map.py
import yaml
import folium
from collections import Counter

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")

Usage:

python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html
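The template imports `Counter` but never uses it; presumably it was intended for a per-type legend. A standalone sketch of that tally, with inline sample data:

```python
from collections import Counter

def type_counts(institutions):
    """Tally institutions per type, e.g. for a map legend."""
    return Counter(i.get('institution_type', 'UNKNOWN') for i in institutions)

sample = [
    {'institution_type': 'LIBRARY'},
    {'institution_type': 'LIBRARY'},
    {'institution_type': 'MUSEUM'},
    {},  # record without a type
]
print(type_counts(sample).most_common())
# [('LIBRARY', 2), ('MUSEUM', 1), ('UNKNOWN', 1)]
```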

Integration with Global Dataset

Merge Script Template

# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if exists)
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        global_institutions = yaml.safe_load(f)
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID)
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
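Note that the dedupe above keeps the first occurrence of each ID, so records already in the global file take precedence over incoming Czech ones. A minimal illustration of that behavior (`merge_unique` is a hypothetical refactor of the loop, not an existing script):

```python
def merge_unique(existing, incoming):
    """Merge two record lists by 'id'; first occurrence wins,
    so `existing` records take precedence over `incoming` ones."""
    seen, merged = set(), []
    for inst in existing + incoming:
        if inst['id'] not in seen:
            merged.append(inst)
            seen.add(inst['id'])
    return merged

old = [{'id': 'CUST-1', 'name': 'old name'}]
new = [{'id': 'CUST-1', 'name': 'new name'}, {'id': 'CUST-2', 'name': 'other'}]
print(merge_unique(old, new))  # CUST-1 keeps 'old name'; CUST-2 is appended
```

If fresher Czech records should win instead, swap the argument order.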

Quick Checks

Count by Type

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
from collections import Counter
types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"

Sample Record

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
import json
print(json.dumps(data[0], indent=2, ensure_ascii=False))
"

GPS Coverage

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"

Data Refresh

To download and re-parse the latest data (updated weekly on Mondays):

# Download latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz

# Re-parse
cd /Users/kempersc/apps/glam
python3 scripts/parsers/parse_czech_isil.py

# Compare line counts with the previous version
# (diffing two `wc -l` outputs always reports a difference, since wc prints the filename)
wc -l data/instances/czech_institutions_v1_backup.yaml data/instances/czech_institutions.yaml

Validation

LinkML Schema Validation

# Install linkml if needed
pip install linkml

# Validate Czech institutions against schema
linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml

Data Quality Checks

python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]

print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
Related Documents

  • Complete Report: CZECH_ISIL_COMPLETE_REPORT.md
  • Initial Harvest: CZECH_ISIL_HARVEST_SUMMARY.md
  • Technical Analysis: data/isil/czech_republic/czech_isil_analysis.md
  • README: data/isil/czech_republic/README.md
  • Schema: schemas/heritage_custodian.yaml
  • Agent Instructions: AGENTS.md

Contact Information

National Library of the Czech Republic (NK ČR)

International ISIL Registry


Last Updated: November 19, 2025
Data Version: ADR database downloaded November 19, 2025
Parser Version: 2.0 (improved type mapping)