# Czech ISIL Database - Next Steps & Quick Start

## Current Status

- ✅ **COMPLETE**: 8,145 Czech institutions parsed with 100% type classification
- ⏳ **IN PROGRESS**: ISIL code format investigation
- 📋 **PLANNED**: Wikidata enrichment, RDF export, integration

## Files to Use

### Main Data File
- **`data/instances/czech_institutions.yaml`** - 8,145 records, LinkML-compliant
  - Format: YAML list of HeritageCustodian objects
  - Size: 8.8 MB
  - Classification: 100% complete

### Source Data
- **`data/isil/czech_republic/adr.xml`** - Original MARC21 XML (27 MB)
- **`data/isil/czech_republic/adr.xml.gz`** - Compressed download (1.9 MB)
  - Download URL: https://aleph.nkp.cz/data/adr.xml.gz
  - Updated: Weekly (every Monday)

### Parser
- **`scripts/parsers/parse_czech_isil.py`** - MARC21 to LinkML converter
  - Usage: `python3 scripts/parsers/parse_czech_isil.py`
  - Version: 2.0 (improved type mapping)

## Immediate Next Steps

### 1. Investigate ISIL Code Format

**Issue**: The Czech database identifies institutions by "siglas" (e.g., ABA000) instead of standard ISIL codes (CZ-XXXXX)

**Actions**:

```bash
# Option A: Contact NK ČR directly
# Email: eva.svobodova@nkp.cz
# Questions:
# 1. Are siglas the official ISIL suffixes for Czech institutions?
# 2. If not, how do siglas map to ISIL codes?
# 3. Can you provide an official ISIL registry for Czech libraries?
```

```bash
# Option B: Check international ISIL registry
curl -s "https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil" | grep -i "czech\|CZ-"
```

```bash
# Option C: Test sample ISIL codes
# Try looking up CZ-ABA000 in various library registries
# Check if format is recognized by OCLC, Wikidata, etc.
```

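Before sending sample codes to external registries (Option C), it may help to confirm that a sigla with a `CZ-` prefix is at least syntactically plausible as an ISIL. The sketch below is a rough check only. It assumes the general ISIL shape from ISO 15511 (a prefix of up to 4 characters, a hyphen, and a unit identifier of up to 11 characters drawn from letters, digits, `/`, `:` and `-`, at most 16 characters in total). The function names are hypothetical, and whether `CZ-` plus the sigla is the official mapping is exactly the open question above.

```python
import re

# Assumed general ISIL shape (ISO 15511): prefix, hyphen, unit identifier.
# This checks syntax only; it says nothing about whether a code is registered.
ISIL_PATTERN = re.compile(r"^[A-Za-z0-9]{1,4}-[A-Za-z0-9:/-]{1,11}$")


def sigla_to_candidate_isil(sigla: str, prefix: str = "CZ") -> str:
    """Build a hypothetical ISIL code by prefixing a sigla with the country code."""
    return f"{prefix}-{sigla.strip().upper()}"


def is_wellformed_isil(code: str) -> bool:
    """Return True if the code fits the assumed ISIL syntax and length limit."""
    return len(code) <= 16 and ISIL_PATTERN.match(code) is not None


if __name__ == "__main__":
    for sigla in ["ABA000", " aba000 "]:
        candidate = sigla_to_candidate_isil(sigla)
        print(candidate, is_wellformed_isil(candidate))
```

A well-formed candidate is only a starting point for the lookups in Option C. Confirmation from NK ČR (Option A) is still needed before these codes are treated as ISIL identifiers.
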
### 2. Wikidata Enrichment

**Goal**: Match Czech institutions to Wikidata Q-numbers for GHCID collision resolution

**Script Template**:

```python
# scripts/enrich_czech_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON


def query_wikidata_czech_libraries():
    """Query Wikidata for libraries located in the Czech Republic."""
    endpoint = "https://query.wikidata.org/sparql"

    query = """
    SELECT ?item ?itemLabel ?isil ?viaf ?geonames WHERE {
      ?item wdt:P31/wdt:P279* wd:Q7075 .       # Instance of library (or subclass)
      ?item wdt:P17 wd:Q213 .                  # Country: Czech Republic
      OPTIONAL { ?item wdt:P791 ?isil }        # ISIL code
      OPTIONAL { ?item wdt:P214 ?viaf }        # VIAF ID
      OPTIONAL { ?item wdt:P1566 ?geonames }   # GeoNames ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
    }
    LIMIT 5000
    """

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


# Run query
results = query_wikidata_czech_libraries()
print(f"Found {len(results['results']['bindings'])} Czech libraries in Wikidata")

# Match to our dataset by name, city, or sigla
# Use fuzzy matching with rapidfuzz
```

**Usage**:
```bash
python3 scripts/enrich_czech_wikidata.py
```

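The matching step that the template leaves as a comment could look roughly like the sketch below. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, so the scores are not identical to what rapidfuzz would return. The helper names, the 0.85 threshold, and the placeholder Q-numbers are all assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def best_match(name, candidates, threshold=0.85):
    """Return (qid, score) for the most similar label, or None if below threshold."""
    best = None
    target = normalize(name)
    for qid, label in candidates.items():
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if best is None or score > best[1]:
            best = (qid, score)
    return best if best is not None and best[1] >= threshold else None


if __name__ == "__main__":
    # Placeholder keys, not real Wikidata Q-numbers.
    wikidata_labels = {
        "Q_EXAMPLE_1": "Národní knihovna České republiky",
        "Q_EXAMPLE_2": "Moravská zemská knihovna",
    }
    print(best_match("Narodni knihovna Ceske republiky", wikidata_labels))
```

In practice the `candidates` mapping would be built from the query result, where each binding carries the entity URI under `item` and the label under `itemLabel`. Name-only matching can confuse similarly named municipal libraries, so comparing the city as well is advisable before accepting a match.
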
### 3. RDF Export

**Goal**: Export Czech institutions as RDF/Turtle for semantic web integration

**Script Template**:

```python
# scripts/export_czech_rdf.py
from pathlib import Path

import yaml
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS, DCTERMS

# Load Czech institutions
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create RDF graph
g = Graph()
heritage = Namespace("https://w3id.org/heritage/custodian/")
glam = Namespace("https://w3id.org/heritage/ontology/")

g.bind("heritage", heritage)
g.bind("glam", glam)
g.bind("dcterms", DCTERMS)

for inst in institutions:
    inst_uri = URIRef(inst['id'])

    g.add((inst_uri, RDF.type, glam.HeritageCustodian))
    g.add((inst_uri, RDFS.label, Literal(inst['name'], lang='cs')))
    g.add((inst_uri, glam.institutionType, Literal(inst['institution_type'])))

    # Add locations
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            g.add((inst_uri, glam.latitude, Literal(loc['latitude'])))
            g.add((inst_uri, glam.longitude, Literal(loc['longitude'])))

# Export (create the output directory first in case it does not exist yet)
Path('data/exports').mkdir(parents=True, exist_ok=True)
g.serialize(destination='data/exports/czech_institutions.ttl', format='turtle')
print(f"✅ Exported {len(institutions)} institutions to RDF")
```

**Usage**:
```bash
python3 scripts/export_czech_rdf.py
```

### 4. Geographic Visualization

**Goal**: Create interactive map of Czech heritage institutions

**Script Template**:

```python
# scripts/visualize_czech_map.py
from pathlib import Path

import yaml
import folium

# Load data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    institutions = yaml.safe_load(f)

# Create map centered on Czech Republic
m = folium.Map(location=[49.8, 15.5], zoom_start=7)

# Color by type
colors = {
    'LIBRARY': 'blue',
    'MUSEUM': 'red',
    'GALLERY': 'green',
    'EDUCATION_PROVIDER': 'orange',
    'HOLY_SITES': 'purple',
    'OFFICIAL_INSTITUTION': 'darkblue',
}

# Add markers (types without an assigned color fall back to gray)
for inst in institutions:
    for loc in inst.get('locations', []):
        if 'latitude' in loc and 'longitude' in loc:
            folium.Marker(
                [loc['latitude'], loc['longitude']],
                popup=f"{inst['name']}<br>{inst['institution_type']}",
                icon=folium.Icon(color=colors.get(inst['institution_type'], 'gray'))
            ).add_to(m)

# Save map (create the output directory first in case it does not exist yet)
Path('data/exports').mkdir(parents=True, exist_ok=True)
m.save('data/exports/czech_institutions_map.html')
print("✅ Map saved to data/exports/czech_institutions_map.html")
```

**Usage**:
```bash
python3 scripts/visualize_czech_map.py
open data/exports/czech_institutions_map.html  # macOS; use xdg-open on Linux
```

## Integration with Global Dataset

### Merge Script Template

```python
# scripts/merge_czech_into_global.py
import yaml

# Load Czech data
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    czech_institutions = yaml.safe_load(f)

# Load global dataset (if exists); an empty file loads as None, so fall back to []
try:
    with open('data/instances/global_glam_institutions.yaml', 'r', encoding='utf-8') as f:
        global_institutions = yaml.safe_load(f) or []
except FileNotFoundError:
    global_institutions = []

# Add Czech institutions
global_institutions.extend(czech_institutions)

# Remove duplicates (by ID), keeping the first occurrence
seen_ids = set()
unique_institutions = []
for inst in global_institutions:
    if inst['id'] not in seen_ids:
        unique_institutions.append(inst)
        seen_ids.add(inst['id'])

# Save merged dataset
with open('data/instances/global_glam_institutions.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(unique_institutions, f, default_flow_style=False, allow_unicode=True)

print(f"✅ Merged {len(czech_institutions)} Czech institutions")
print(f"✅ Total global institutions: {len(unique_institutions)}")
```

## Quick Checks

### Count by Type
```bash
python3 -c "
import yaml
from collections import Counter

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

types = Counter(i.get('institution_type', 'UNKNOWN') for i in data)
for t, c in types.most_common():
    print(f'{t}: {c}')
"
```

### Sample Record
```bash
python3 -c "
import json
import yaml

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(json.dumps(data[0], indent=2, ensure_ascii=False))
"
```

### GPS Coverage
```bash
python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

with_gps = sum(1 for i in data if any('latitude' in l for l in i.get('locations', [])))
print(f'Institutions with GPS: {with_gps}/{len(data)} ({with_gps/len(data)*100:.1f}%)')
"
```

## Data Refresh

To download and re-parse the latest data (updated weekly on Mondays), run the following from the repository root:

```bash
# Download latest data
cd data/isil/czech_republic
curl -O https://aleph.nkp.cz/data/adr.xml.gz
gunzip -f adr.xml.gz

# Re-parse (return to the repository root first)
cd ../../..
python3 scripts/parsers/parse_czech_isil.py

# Compare line counts with the previous version
wc -l data/instances/czech_institutions_v1_backup.yaml data/instances/czech_institutions.yaml
```

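A line-count comparison shows only that the file size changed. To see which records were added, removed, or modified after a refresh, the two versions can be compared by `id`. The following is a minimal sketch with a hypothetical `diff_records` helper and inline sample records. In practice both lists would be loaded with `yaml.safe_load`, as in the other scripts in this guide.

```python
def diff_records(old, new):
    """Compare two lists of institution dicts by their 'id' field."""
    old_by_id = {rec["id"]: rec for rec in old}
    new_by_id = {rec["id"]: rec for rec in new}

    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    # A record counts as changed if any field differs between versions.
    changed = sorted(
        rec_id
        for rec_id in old_by_id.keys() & new_by_id.keys()
        if old_by_id[rec_id] != new_by_id[rec_id]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Illustrative records only; the ids and names are made up.
    previous = [
        {"id": "inst-1", "name": "Library A"},
        {"id": "inst-2", "name": "Library B"},
    ]
    current = [
        {"id": "inst-2", "name": "Library B (renamed)"},
        {"id": "inst-3", "name": "Museum C"},
    ]
    for kind, ids in diff_records(previous, current).items():
        print(f"{kind}: {ids}")
```

This assumes that `id` values are stable across parser runs. If the parser regenerates identifiers on every run, the comparison would need to key on the sigla instead.
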
## Validation

### LinkML Schema Validation
```bash
# Install linkml if needed
pip install linkml

# Validate Czech institutions against schema
linkml-validate \
  -s schemas/heritage_custodian.yaml \
  data/instances/czech_institutions.yaml
```

### Data Quality Checks
```bash
python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

# Check required fields
missing_name = [i['id'] for i in data if not i.get('name')]
missing_type = [i['id'] for i in data if not i.get('institution_type')]
missing_location = [i['id'] for i in data if not i.get('locations')]

print(f'Missing name: {len(missing_name)}')
print(f'Missing type: {len(missing_type)}')
print(f'Missing location: {len(missing_location)}')

# Check provenance
no_provenance = [i['id'] for i in data if not i.get('provenance')]
print(f'Missing provenance: {len(no_provenance)}')
"
```

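The checks above confirm that required fields are present, but not that coordinate values are plausible. A simple bounding-box test can flag swapped or mistyped coordinates. The sketch below is illustrative. The box limits are rough approximations of the Czech Republic's extent chosen for this example, not authoritative boundaries, and `out_of_bounds` is a hypothetical helper.

```python
# Approximate bounding box for the Czech Republic (assumed values, slightly padded).
CZ_BBOX = {"lat": (48.5, 51.1), "lon": (12.0, 18.9)}


def out_of_bounds(institutions, bbox=CZ_BBOX):
    """Return (id, latitude, longitude) for locations outside the bounding box."""
    suspects = []
    for inst in institutions:
        for loc in inst.get("locations", []):
            if "latitude" not in loc or "longitude" not in loc:
                continue  # missing coordinates are covered by the GPS coverage check
            lat, lon = float(loc["latitude"]), float(loc["longitude"])
            lat_ok = bbox["lat"][0] <= lat <= bbox["lat"][1]
            lon_ok = bbox["lon"][0] <= lon <= bbox["lon"][1]
            if not (lat_ok and lon_ok):
                suspects.append((inst["id"], lat, lon))
    return suspects


if __name__ == "__main__":
    # Illustrative records: the second has latitude and longitude swapped.
    sample = [
        {"id": "inst-1", "locations": [{"latitude": 50.08, "longitude": 14.42}]},
        {"id": "inst-2", "locations": [{"latitude": 14.42, "longitude": 50.08}]},
        {"id": "inst-3", "locations": [{"city": "Brno"}]},
    ]
    for inst_id, lat, lon in out_of_bounds(sample):
        print(f"Check coordinates for {inst_id}: {lat}, {lon}")
```

A flagged record is not necessarily wrong, since the registry may list institutions with addresses abroad, so results are best reviewed manually.
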
## Related Documentation

- **Complete Report**: `CZECH_ISIL_COMPLETE_REPORT.md`
- **Initial Harvest**: `CZECH_ISIL_HARVEST_SUMMARY.md`
- **Technical Analysis**: `data/isil/czech_republic/czech_isil_analysis.md`
- **README**: `data/isil/czech_republic/README.md`
- **Schema**: `schemas/heritage_custodian.yaml`
- **Agent Instructions**: `AGENTS.md`

## Contact Information

**National Library of the Czech Republic (NK ČR)**
- CASLIN Team: eva.svobodova@nkp.cz
- Phone: +420 221 663 205-7
- Address: Sodomkova 2/1146, 102 00 Praha 10
- Website: https://www.nkp.cz/en/

**International ISIL Registry**
- Authority: Danish Agency for Culture and Palaces
- Website: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil

---

- **Last Updated**: November 19, 2025
- **Data Version**: ADR database downloaded November 19, 2025
- **Parser Version**: 2.0 (improved type mapping)