glam/CZECH_ISIL_NEXT_STEPS.md

# Czech ISIL Database - Next Steps & Quick Start

## Current Status

- ✅ **COMPLETE**: 8,145 Czech institutions parsed with 100% type classification
- ⏳ **IN PROGRESS**: ISIL code format investigation
- 📋 **PLANNED**: Wikidata enrichment, RDF export, integration

## Files to Use

### Main Data File
- **`data/instances/czech_institutions.yaml`** - 8,145 records, LinkML-compliant
  - Format: YAML list of HeritageCustodian objects
  - Size: 8.8 MB
  - Classification: 100% complete

### Source Data
- **`data/isil/czech_republic/adr.xml`** - Original MARC21 XML (27 MB)
- **`data/isil/czech_republic/adr.xml.gz`** - Compressed download (1.9 MB)
  - Download URL: https://aleph.nkp.cz/data/adr.xml.gz
  - Updated: Weekly (every Monday)

### Parser
- **`scripts/parsers/parse_czech_isil.py`** - MARC21 to LinkML converter
  - Usage: `python3 scripts/parsers/parse_czech_isil.py`
  - Version: 2.0 (improved type mapping)

## Immediate Next Steps

### 1. Investigate ISIL Code Format

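**Issue**: The Czech database identifies institutions by sigla (e.g., ABA000) rather than by standard ISIL codes (e.g., CZ-XXXXX). It is not yet confirmed whether a sigla can serve directly as the ISIL suffix.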

**Actions**:

```bash
# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
#   1. Are siglas the official ISIL suffixes for Czech institutions?
#   2. If not, how do siglas map to ISIL codes?
#   3. Can you provide an official ISIL registry for Czech libraries?
```

```bash
# Option B: Check international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
```

```bash
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check if format is recognized by OCLC, Wikidata, etc.
```
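
Part of Option C can be automated before any registry lookup. The sketch below builds a candidate ISIL by prefixing a sigla with `CZ-` and checks only that the result is syntactically plausible under ISO 15511 as commonly summarized (at most 16 characters; letters, digits, `/`, `-`, `:`). The `CZ-` + sigla mapping is exactly the unconfirmed assumption under investigation, and the function names are hypothetical.

```python
import re

# Syntactic shape of an ISIL: an alphabetic prefix of 1-4 characters, a hyphen,
# then a unit identifier made of letters, digits, "/", "-" or ":".
# The 16-character overall limit is checked separately below.
ISIL_PATTERN = re.compile(r"^[A-Za-z]{1,4}-[A-Za-z0-9/:\-]+$")

def candidate_isil(sigla: str, prefix: str = "CZ") -> str:
    """Build a candidate ISIL from a sigla. ASSUMPTION: sigla == ISIL suffix."""
    return f"{prefix}-{sigla.strip().upper()}"

def is_plausible_isil(code: str) -> bool:
    """Check syntax only; this does not prove the code is registered anywhere."""
    return len(code) <= 16 and ISIL_PATTERN.fullmatch(code) is not None

for sigla in ["ABA000", "aba001 "]:
    code = candidate_isil(sigla)
    print(code, is_plausible_isil(code))
```

A syntactically valid candidate still needs to be confirmed by NK ČR or found in an external registry before it is stored as an identifier.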

### 2. Wikidata Enrichment

**Goal**: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution

**Script Template**:

```python
# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_czech_libraries():
    """Query Wikidata for Czech libraries and institutions."""
    endpoint = "https://query.wikidata.org/sparql"

    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .  # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .              # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }   # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }   # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames } # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """

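    # Wikimedia asks clients to send a descriptive User-Agent.
    # The agent string below is a placeholder; replace it with real contact details.
    sparql = SPARQLWrapper(endpoint, agent="glam-czech-enrichment/0.1 (contact: your-email@example.org)")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

if __name__ == "__main__":
    # Run query
    results = query_wikidata_czech_libraries()
    print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

    # TODO: match to our dataset by name, city, or sigla
    # (e.g., fuzzy matching with rapidfuzz)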
```

**Usage**:
```bash
python3 scripts/enrich_czech_wikidata.py
```
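
The matching step left as a comment in the template could look roughly like the following. The template suggests rapidfuzz; to stay dependency-free, this sketch substitutes the standard library's `difflib.SequenceMatcher`, which is slower and scores differently. The function names, the 0.85 threshold, and the sample labels are illustrative assumptions, and the QIDs are placeholders rather than real Wikidata identifiers.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())

def best_match(name, candidates, threshold=0.85):
    """Return (qid, score) for the closest label, or None if below threshold.

    `candidates` maps Wikidata QIDs to labels.
    """
    target = normalize(name)
    best = None
    for qid, label in candidates.items():
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if best is None or score > best[1]:
            best = (qid, score)
    return best if best and best[1] >= threshold else None

# Placeholder QIDs, not real Wikidata identifiers
wikidata_labels = {
    "Q_EXAMPLE_1": "Národní knihovna České republiky",
    "Q_EXAMPLE_2": "Moravská zemská knihovna",
}
print(best_match("Narodni knihovna Ceske republiky", wikidata_labels))
print(best_match("Městská knihovna Kolín", wikidata_labels))
```

Name similarity alone will confuse identically named municipal libraries in different towns, so a city comparison (or an ISIL/sigla match where Wikidata has one) is a sensible second condition before accepting a match.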

### 3. RDF Export

**Goal**: Export Czech institutions as RDF/Turtle for semantic web integration

**Script Template**:

```python
# scripts/export_czech_rdf.py
import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")

g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])

    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))

    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export
g.serialize('data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")
```

**Usage**:
```bash
python3 scripts/export_czech_rdf.py
```

### 4. Geographic Visualization

**Goal**: Create interactive map of Czech heritage institutions

**Script Template**:

```python
# scripts/visualize_czech_map.py
import yaml
import folium

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")
```

**Usage**:
```bash
python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html  # macOS; use xdg-open on Linux
```

## Integration with Global Dataset

### Merge Script Template

```python
# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if exists)
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        # safe_load returns None for an empty file, so fall back to an empty list
        global_institutions = yaml.safe_load(f) or []
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID)
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
```
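
Note that the template above keeps the first record seen for each ID, so re-running it after a data refresh leaves the older Czech records in place. If refreshed records should win instead, a keyed merge is one option. This is a minimal sketch, assuming every record carries an `id`; `merge_by_id` is a hypothetical helper and the sample records are made up.

```python
def merge_by_id(existing, incoming):
    """Merge two record lists keyed on 'id'; incoming records replace existing ones.

    Existing IDs keep their position; previously unseen IDs are appended.
    Returns the merged list and the number of incoming records whose ID
    was already present.
    """
    merged = {rec["id"]: rec for rec in existing}
    replaced = sum(1 for rec in incoming if rec["id"] in merged)
    for rec in incoming:
        merged[rec["id"]] = rec  # later record wins
    return list(merged.values()), replaced

global_records = [
    {"id": "https://example.org/a", "name": "Old name"},
    {"id": "https://example.org/b", "name": "Other"},
]
czech_records = [
    {"id": "https://example.org/a", "name": "New name"},
    {"id": "https://example.org/c", "name": "Added"},
]
merged, replaced = merge_by_id(global_records, czech_records)
print(f"Total: {len(merged)}, replaced: {replaced}")
```

Whether first-wins or last-wins is correct depends on whether the global file may contain manual edits that a refresh should not overwrite.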

## Quick Checks

### Count by Type
```bash
python3 -c "
import yaml
from collections import Counter
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"
```

### Sample Record
```bash
python3 -c "
import json
import yaml
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
print(json.dumps(data[0], indent=2, ensure_ascii=False))
"
```

### GPS Coverage
```bash
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"
```

## Data Refresh

To download and re-parse the latest data (updated weekly on Mondays):

```bash
# Download latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz

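# Re-parse (run from the repository root)
cd ../../..
python3 scripts/parsers/parse_czech_isil.py

# Compare line counts against the previous backup
# (redirecting into wc keeps file names out of the output, so only the counts are compared)
diff <(wc -l < data/instances/czech_institutions_v1_backup.yaml) <(wc -l < data/instances/czech_institutions.yaml)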
```
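
Line counts only show that the file changed size. Comparing record IDs shows which institutions were added or removed between the backup and the fresh parse. This is a sketch, assuming both files load (via `yaml.safe_load`, as in the scripts above) into lists of dicts with an `id` key; `diff_record_ids` is a hypothetical helper and the sample IDs are made up.

```python
def diff_record_ids(old_records, new_records):
    """Return (added, removed) ID lists, each sorted for stable output."""
    old_ids = {rec["id"] for rec in old_records}
    new_ids = {rec["id"] for rec in new_records}
    return sorted(new_ids - old_ids), sorted(old_ids - new_ids)

# Stand-ins for the backup and the freshly parsed file
old = [{"id": "inst-1"}, {"id": "inst-2"}]
new = [{"id": "inst-2"}, {"id": "inst-3"}]

added, removed = diff_record_ids(old, new)
print(f"Added: {added}")
print(f"Removed: {removed}")
```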

## Validation

### LinkML Schema Validation
```bash
# Install linkml if needed
pip install linkml

# Validate Czech institutions against schema
linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml
```

### Data Quality Checks
```bash
python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]

print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
```
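
Two further checks may be worth adding: duplicate IDs, which the merge step would silently collapse, and coordinates that fall outside a plausible range (for example, swapped latitude and longitude). The sketch below works on records already loaded into memory. The helper names are hypothetical, and the bounding box is a rough approximation of Czech territory chosen for illustration, not an authoritative boundary.

```python
from collections import Counter

# Approximate bounding box for the Czech Republic (an assumption for illustration)
LAT_RANGE = (48.5, 51.1)
LON_RANGE = (12.0, 18.9)

def duplicate_ids(records):
    """Return IDs that occur more than once."""
    counts = Counter(rec.get("id") for rec in records)
    return sorted(i for i, n in counts.items() if n > 1 and i is not None)

def out_of_bounds(records):
    """Return IDs of records with at least one location outside the box."""
    flagged = []
    for rec in records:
        for loc in rec.get("locations") or []:
            lat, lon = loc.get("latitude"), loc.get("longitude")
            if lat is None or lon is None:
                continue  # missing GPS is reported by the coverage check
            lat, lon = float(lat), float(lon)
            inside = (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
                      and LON_RANGE[0] <= lon <= LON_RANGE[1])
            if not inside:
                flagged.append(rec.get("id"))
                break
    return flagged

# Made-up sample records
sample = [
    {"id": "a", "locations": [{"latitude": 50.08, "longitude": 14.43}]},
    {"id": "b", "locations": [{"latitude": 14.43, "longitude": 50.08}]},  # swapped
    {"id": "a", "locations": []},
]
print("Duplicate IDs:", duplicate_ids(sample))
print("Out of bounds:", out_of_bounds(sample))
```

Flagged records are candidates for review, not automatic errors, since an institution listed in the Czech database could legitimately have a location abroad.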

## Related Documentation

- **Complete Report**: `CZECH_ISIL_COMPLETE_REPORT.md`
- **Initial Harvest**: `CZECH_ISIL_HARVEST_SUMMARY.md`
- **Technical Analysis**: `data/isil/czech_republic/czech_isil_analysis.md`
- **README**: `data/isil/czech_republic/README.md`
- **Schema**: `schemas/heritage_custodian.yaml`
- **Agent Instructions**: `AGENTS.md`

## Contact Information

**National Library of the Czech Republic (NK ČR)**
- CASLIN Team: eva.svobodova@nkp.cz
- Phone: +420 221 663 205-7
- Address: Sodomkova 2/1146, 102 00 Praha 10
- Website: https://www.nkp.cz/en/

**International ISIL Registry**
- Authority: Danish Agency for Culture and Palaces
- Website: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil

---

**Last Updated**: November 19, 2025
**Data Version**: ADR database downloaded November 19, 2025
**Parser Version**: 2.0 (improved type mapping)