# Czech ISIL Database - Next Steps & Quick Start
## Current Status
✅ **COMPLETE**: 8,145 Czech institutions parsed with 100% type classification
🔄 **IN PROGRESS**: ISIL code format investigation
📋 **PLANNED**: Wikidata enrichment, RDF export, integration
## Files to Use
### Main Data File
- **`data/instances/czech_institutions.yaml`** - 8,145 records, LinkML-compliant
- Format: YAML list of HeritageCustodian objects
- Size: 8.8 MB
- Classification: 100% complete
### Source Data
- **`data/isil/czech_republic/adr.xml`** - Original MARC21 XML (27 MB)
- **`data/isil/czech_republic/adr.xml.gz`** - Compressed download (1.9 MB)
- Download URL: https://aleph.nkp.cz/data/adr.xml.gz
- Updated: Weekly (every Monday)
### Parser
- **`scripts/parsers/parse_czech_isil.py`** - MARC21 to LinkML converter
- Usage: `python3 scripts/parsers/parse_czech_isil.py`
- Version: 2.0 (improved type mapping)
## Immediate Next Steps
### 1. Investigate ISIL Code Format
**Issue**: Czech database uses "siglas" (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX)
**Actions**:
```bash
# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
# 1. Are siglas the official ISIL suffixes for Czech institutions?
# 2. If not, how do siglas map to ISIL codes?
# 3. Can you provide an official ISIL registry for Czech libraries?
```
```bash
# Option B: Check international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
```
```bash
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check if format is recognized by OCLC, Wikidata, etc.
```
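If the siglas do turn out to be the official ISIL suffixes, candidate codes could be derived mechanically. A minimal sketch, assuming the `CZ-` country prefix and the three-letters-plus-three-digits sigla shape seen in the sample (`ABA000`); the function name and the regex are assumptions, not a confirmed spec:

```python
import re

# Sigla shape observed in the sample data (e.g. ABA000) - an assumption, not a spec
SIGLA_RE = re.compile(r"^[A-Z]{3}\d{3}$")

def sigla_to_candidate_isil(sigla: str) -> str:
    """Derive an unverified candidate ISIL code from a Czech sigla."""
    if not SIGLA_RE.match(sigla):
        raise ValueError(f"Unexpected sigla format: {sigla!r}")
    return f"CZ-{sigla}"

print(sigla_to_candidate_isil("ABA000"))  # CZ-ABA000
```

Any codes produced this way should be treated as candidates until NK ČR or the international registry confirms the mapping.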
### 2. Wikidata Enrichment
**Goal**: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution
**Script Template**:
```python
# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_czech_libraries():
    """Query Wikidata for Czech libraries and institutions."""
    endpoint = "https://query.wikidata.org/sparql"
    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .      # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .                 # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }       # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }       # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames }  # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Run query
results = query_wikidata_czech_libraries()
print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

# Match to our dataset by name, city, or sigla
# Use fuzzy matching with rapidfuzz
```
**Usage**:
```bash
python3 scripts/enrich_czech_wikidata.py
```
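The template above suggests rapidfuzz for the name-matching step; an equivalent stdlib sketch with `difflib` (the labels below are hypothetical examples, and `best_match` is an illustrative helper, not part of the project):

```python
from difflib import SequenceMatcher

def best_match(name, candidates, cutoff=0.8):
    """Return the (candidate, score) pair with the highest similarity, or None below cutoff."""
    scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio()) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= cutoff else None

# Hypothetical Wikidata labels vs. a name from the Czech dataset
wikidata_labels = ["Moravská zemská knihovna v Brně", "Národní knihovna České republiky"]
match = best_match("Moravská zemská knihovna", wikidata_labels)
if match:
    print(f"Matched '{match[0]}' with score {match[1]:.2f}")
```

In practice, name similarity should be combined with city or sigla comparison to avoid false positives between similarly named municipal libraries.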
### 3. RDF Export
**Goal**: Export Czech institutions as RDF/Turtle for semantic web integration
**Script Template**:
```python
# scripts/export_czech_rdf.py
import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")
g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])
    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))

    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export
g.serialize('data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")
```
**Usage**:
```bash
python3 scripts/export_czech_rdf.py
```
### 4. Geographic Visualization
**Goal**: Create interactive map of Czech heritage institutions
**Script Template**:
```python
# scripts/visualize_czech_map.py
import yaml
import folium

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on the Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")
```
**Usage**:
```bash
python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html
```
## Integration with Global Dataset
### Merge Script Template
```python
# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if it exists; an empty file loads as None)
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        global_institutions = yaml.safe_load(f) or []
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID, first occurrence wins)
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
```
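Note that the dedupe above keeps the first occurrence of each ID, so records already in the global file shadow freshly re-parsed Czech records. A minimal illustration with hypothetical IDs and names:

```python
# First-occurrence-wins dedupe, as in the merge template above
global_institutions = [{"id": "inst-1", "name": "Old name"}]
czech_institutions = [{"id": "inst-1", "name": "New name"}, {"id": "inst-2", "name": "Other"}]

merged = global_institutions + czech_institutions
seen_ids, unique = set(), []
for inst in merged:
    if inst["id"] not in seen_ids:
        unique.append(inst)
        seen_ids.add(inst["id"])

print([i["name"] for i in unique])  # ['Old name', 'Other']
```

If freshly re-parsed Czech data should win instead, extend with the Czech list first, or index records by ID into a dict and update it.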
## Quick Checks
### Count by Type
```bash
python3 -c "
import yaml
from collections import Counter
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"
```
### Sample Record
```bash
python3 -c "
import json
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
print(json.dumps(data[0], indent=2, ensure_ascii=False))
"
```
### GPS Coverage
```bash
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"
```
## Data Refresh
To download and re-parse the latest data (updated weekly on Mondays):
```bash
# Download latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz
# Re-parse
cd /Users/kempersc/apps/glam
python3 scripts/parsers/parse_czech_isil.py
# Compare record counts with the previous version
# (wc -l < file omits the filename, so identical counts produce no diff)
diff <(wc -l < data/instances/czech_institutions_v1_backup.yaml) <(wc -l < data/instances/czech_institutions.yaml)
```
## Validation
### LinkML Schema Validation
```bash
# Install linkml if needed
pip install linkml
# Validate Czech institutions against schema
linkml-validate \
-s schemas/heritage_custodian.yaml \
data/instances/czech_institutions.yaml
```
### Data Quality Checks
```bash
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]
print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
```
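One check the snippet above doesn't cover is duplicate IDs, which would silently collapse records in the merge step. A self-contained sketch with hypothetical records standing in for the parsed YAML list:

```python
from collections import Counter

# Hypothetical records standing in for the parsed YAML list
data = [{"id": "inst-1"}, {"id": "inst-2"}, {"id": "inst-1"}]

id_counts = Counter(i["id"] for i in data)
duplicates = {i: n for i, n in id_counts.items() if n > 1}
print(f"Duplicate IDs: {duplicates}")  # Duplicate IDs: {'inst-1': 2}
```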
## Related Documentation
- **Complete Report**: `CZECH_ISIL_COMPLETE_REPORT.md`
- **Initial Harvest**: `CZECH_ISIL_HARVEST_SUMMARY.md`
- **Technical Analysis**: `data/isil/czech_republic/czech_isil_analysis.md`
- **README**: `data/isil/czech_republic/README.md`
- **Schema**: `schemas/heritage_custodian.yaml`
- **Agent Instructions**: `AGENTS.md`
## Contact Information
**National Library of the Czech Republic (NK ČR)**
- CASLIN Team: eva.svobodova@nkp.cz
- Phone: +420 221 663 205-7
- Address: Sodomkova 2/1146, 102 00 Praha 10
- Website: https://www.nkp.cz/en/
**International ISIL Registry**
- Authority: Danish Agency for Culture and Palaces
- Website: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil
---
**Last Updated**: November 19, 2025
**Data Version**: ADR database downloaded November 19, 2025
**Parser Version**: 2.0 (improved type mapping)