
Czech ISIL Database - Next Steps & Quick Start

Current Status

  • COMPLETE: 8,145 Czech institutions parsed with 100% type classification
  • IN PROGRESS: ISIL code format investigation
  • PLANNED: Wikidata enrichment, RDF export, integration

Files to Use

Main Data File

  • data/instances/czech_institutions.yaml - 8,145 records, LinkML-compliant
  • Format: YAML list of HeritageCustodian objects
  • Size: 8.8 MB
  • Classification: 100% complete

Source Data

  • data/isil/czech_republic/adr.xml - Original MARC21 XML (27 MB)
  • data/isil/czech_republic/adr.xml.gz - Compressed download (1.9 MB)
  • Download URL: https://aleph.nkp.cz/data/adr.xml.gz
  • Updated: Weekly (every Monday)

Parser

  • scripts/parsers/parse_czech_isil.py - MARC21 to LinkML converter
  • Usage: python3 scripts/parsers/parse_czech_isil.py
  • Version: 2.0 (improved type mapping)

Immediate Next Steps

1. Investigate ISIL Code Format

Issue: Czech database uses "siglas" (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX)

Actions:

# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
#   1. Are siglas the official ISIL suffixes for Czech institutions?
#   2. If not, how do siglas map to ISIL codes?
#   3. Can you provide an official ISIL registry for Czech libraries?
# Option B: Check international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check if format is recognized by OCLC, Wikidata, etc.
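Option C can be partially automated: ISO 15511 constrains an ISIL to at most 16 characters, with a 1-4 letter prefix, a hyphen, and a unit identifier drawn from letters, digits, `-`, `/`, and `:`. A sketch that builds candidate `CZ-` codes from siglas and checks that syntax (the regex and function name are illustrative, not an official validator, and a syntactic match does not mean the code is registered):

```python
import re

# Syntactic check only: matches the ISO 15511 shape (1-4 letter prefix,
# hyphen, then up to 11 chars from A-Z a-z 0-9 - / :). It does NOT
# confirm that the code exists in any ISIL registry.
ISIL_RE = re.compile(r'^[A-Z]{1,4}-[A-Za-z0-9:/\-]{1,11}$')

def sigla_to_candidate_isil(sigla):
    """Prefix a Czech sigla with 'CZ-' and validate the resulting syntax."""
    candidate = f"CZ-{sigla.strip()}"
    return candidate if ISIL_RE.fullmatch(candidate) else None

print(sigla_to_candidate_isil("ABA000"))  # CZ-ABA000
```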

2. Wikidata Enrichment

Goal: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution

Script Template:

# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_czech_libraries():
    """Query Wikidata for Czech libraries and institutions."""
    endpoint = "https://query.wikidata.org/sparql"
    
    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .  # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .              # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }   # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }   # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames } # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """
    
    # Wikidata's User-Agent policy asks clients to identify themselves;
    # the agent string below is an example -- adjust it for this project.
    sparql = SPARQLWrapper(endpoint, agent="glam-czech-isil/1.0")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Run query
results = query_wikidata_czech_libraries()
print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

# Match to our dataset by name, city, or sigla
# Use fuzzy matching with rapidfuzz

Usage:

python3 scripts/enrich_czech_wikidata.py
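The fuzzy name matching mentioned in the template can be sketched with the stdlib `difflib` (rapidfuzz offers faster equivalents); `best_match` and the 0.85 threshold are illustrative assumptions, not part of the pipeline:

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None if nothing
    clears the threshold. Stdlib stand-in for rapidfuzz matching."""
    if not candidates:
        return None
    score, label = max(
        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    )
    return label if score >= threshold else None

# Tolerates missing diacritics, a common mismatch between sources
print(best_match("narodni knihovna", ["Národní knihovna", "Městské muzeum"]))
# Národní knihovna
```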

3. RDF Export

Goal: Export Czech institutions as RDF/Turtle for semantic web integration

Script Template:

# scripts/export_czech_rdf.py
import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")

g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])
    
    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))
    
    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export
g.serialize('data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")

Usage:

python3 scripts/export_czech_rdf.py
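As a quick sanity check on the expected Turtle shape, one record can be rendered by hand. The namespaces mirror the script above; `to_turtle` is a throwaway helper and the sample record (built from the CZ-ABA000 example code) is illustrative:

```python
def to_turtle(inst):
    """Render one institution as Turtle, mirroring the triples the
    rdflib script emits (prefix declarations omitted for brevity)."""
    return "\n".join([
        f"<{inst['id']}> a glam:HeritageCustodian ;",
        f"    rdfs:label \"{inst['name']}\"@cs ;",
        f"    glam:institutionType \"{inst['institution_type']}\" .",
    ])

sample = {
    "id": "https://w3id.org/heritage/custodian/CZ-ABA000",
    "name": "Národní knihovna České republiky",
    "institution_type": "LIBRARY",
}
print(to_turtle(sample))
```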

4. Geographic Visualization

Goal: Create interactive map of Czech heritage institutions

Script Template:

# scripts/visualize_czech_map.py
import yaml
import folium
from collections import Counter

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")

Usage:

python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html
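The template imports `Counter` but never uses it; presumably it was intended for a per-type legend. A standalone sketch of that tally, with inline sample data:

```python
from collections import Counter

def type_counts(institutions):
    """Tally institutions per type, e.g. for a map legend."""
    return Counter(i.get('institution_type', 'UNKNOWN') for i in institutions)

sample = [
    {'institution_type': 'LIBRARY'},
    {'institution_type': 'LIBRARY'},
    {'institution_type': 'MUSEUM'},
    {},  # record without a type
]
print(type_counts(sample).most_common())
# [('LIBRARY', 2), ('MUSEUM', 1), ('UNKNOWN', 1)]
```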

Integration with Global Dataset

Merge Script Template

# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if exists)
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        global_institutions = yaml.safe_load(f)
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID)
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
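Note that the dedupe above keeps the first occurrence of each ID, so records already in the global file take precedence over incoming Czech ones. A minimal illustration of that behavior (`merge_unique` is a hypothetical refactor of the loop, not an existing script):

```python
def merge_unique(existing, incoming):
    """Merge two record lists by 'id'; first occurrence wins,
    so `existing` records take precedence over `incoming` ones."""
    seen, merged = set(), []
    for inst in existing + incoming:
        if inst['id'] not in seen:
            merged.append(inst)
            seen.add(inst['id'])
    return merged

old = [{'id': 'CUST-1', 'name': 'old name'}]
new = [{'id': 'CUST-1', 'name': 'new name'}, {'id': 'CUST-2', 'name': 'other'}]
print(merge_unique(old, new))  # CUST-1 keeps 'old name'; CUST-2 is appended
```

If fresher Czech records should win instead, swap the argument order.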

Quick Checks

Count by Type

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
from collections import Counter
types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"

Sample Record

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
import json
print(json.dumps(data[0], indent=2, ensure_ascii=False))
"

GPS Coverage

python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"

Data Refresh

To download and re-parse the latest data (updated weekly on Mondays):

# Download latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz

# Re-parse
cd /Users/kempersc/apps/glam
python3 scripts/parsers/parse_czech_isil.py

# Compare line counts with the previous version
# (diffing two `wc -l` outputs always reports a difference, since wc prints the filename)
wc -l data/instances/czech_institutions_v1_backup.yaml data/instances/czech_institutions.yaml

Validation

LinkML Schema Validation

# Install linkml if needed
pip install linkml

# Validate Czech institutions against schema
linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml

Data Quality Checks

python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]

print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
Related Documents

  • Complete Report: CZECH_ISIL_COMPLETE_REPORT.md
  • Initial Harvest: CZECH_ISIL_HARVEST_SUMMARY.md
  • Technical Analysis: data/isil/czech_republic/czech_isil_analysis.md
  • README: data/isil/czech_republic/README.md
  • Schema: schemas/heritage_custodian.yaml
  • Agent Instructions: AGENTS.md

Contact Information

National Library of the Czech Republic (NK ČR)

International ISIL Registry


Last Updated: November 19, 2025
Data Version: ADR database downloaded November 19, 2025
Parser Version: 2.0 (improved type mapping)