# German ISIL Data - Quick Start Guide
## Files Location

```
/Users/kempersc/apps/glam/data/isil/germany/
├── german_isil_complete_20251119_134939.json   # Full dataset (37 MB)
├── german_isil_complete_20251119_134939.jsonl  # Line-delimited (24 MB)
├── german_isil_stats_20251119_134941.json      # Statistics (7.6 KB)
└── HARVEST_REPORT.md                           # Harvest report
```
## Quick Access Examples
### 1. Load Full Dataset in Python

```python
import json

# Load complete dataset
with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

print(f"Total records: {data['metadata']['total_records']}")
print(f"First institution: {data['records'][0]['name']}")
```
### 2. Stream Processing (JSONL)

```python
import json

# Process one record at a time (memory-efficient)
with open('german_isil_complete_20251119_134939.jsonl', 'r') as f:
    for line in f:
        record = json.loads(line)
        if record['address'].get('city') == 'Berlin':
            print(f"{record['name']} - {record['isil']}")
```
### 3. Find Records by ISIL

```bash
# Using jq (JSON query tool)
jq '.records[] | select(.isil == "DE-1")' german_isil_complete_20251119_134939.json

# Using grep (fast text search)
grep -i "staatsbibliothek" german_isil_complete_20251119_134939.jsonl
```
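For repeated lookups in Python, building an ISIL-keyed dict avoids rescanning the file each time; a minimal sketch, using two inline sample records in place of `data['records']` from the full dataset:

```python
# Build an ISIL -> record index for O(1) repeated lookups.
# Inline sample records stand in for data['records'] from the full file.
records = [
    {'isil': 'DE-1', 'name': 'Staatsbibliothek zu Berlin'},
    {'isil': 'DE-12', 'name': 'Bayerische Staatsbibliothek'},
]
by_isil = {r['isil']: r for r in records}
print(by_isil['DE-1']['name'])  # → Staatsbibliothek zu Berlin
```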
### 4. Extract All Libraries in Munich

```bash
jq '.records[] | select(.address.city == "München")' german_isil_complete_20251119_134939.json
```
### 5. Count Institutions by Region

```bash
jq '.records | group_by(.interloan_region) | map({region: .[0].interloan_region, count: length})' german_isil_complete_20251119_134939.json
```
### 6. Export to CSV

```python
import json
import csv

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

with open('german_isil.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ISIL', 'Name', 'City', 'Email', 'URL'])
    for record in data['records']:
        writer.writerow([
            record['isil'],
            record['name'],
            record['address'].get('city', ''),
            record['contact'].get('email', ''),
            record['urls'][0]['url'] if record['urls'] else ''
        ])
```
### 7. Filter by Institution Type

```python
import json

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Find all Max Planck Institute libraries
mpi_libraries = [
    r for r in data['records']
    if r['institution_type'] and r['institution_type'].startswith('MPI')
]
print(f"Found {len(mpi_libraries)} Max Planck Institute libraries")
```
### 8. Create Geographic Map

```python
import json
import folium

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Create map centered on Germany
m = folium.Map(location=[51.1657, 10.4515], zoom_start=6)

# Add markers for institutions with coordinates
for record in data['records']:
    lat = record['address'].get('latitude')
    lon = record['address'].get('longitude')
    if lat and lon:
        folium.Marker(
            location=[float(lat), float(lon)],
            popup=f"{record['name']}<br>{record['isil']}",
            tooltip=record['name']
        ).add_to(m)

m.save('german_isil_map.html')
```
### 9. Convert to LinkML YAML

```python
import json
import yaml

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Convert to GLAM project schema
glam_records = []
for record in data['records']:
    glam_record = {
        'id': f"https://w3id.org/heritage/custodian/de/{record['isil'].lower()}",
        'name': record['name'],
        'institution_type': 'LIBRARY',  # TODO: classify properly
        'locations': [{
            'city': record['address'].get('city'),
            'street_address': record['address'].get('street'),
            'postal_code': record['address'].get('postal_code'),
            'country': 'DE',
            'latitude': float(record['address']['latitude']) if record['address'].get('latitude') else None,
            'longitude': float(record['address']['longitude']) if record['address'].get('longitude') else None
        }],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': record['isil'],
            'identifier_url': f"https://sigel.staatsbibliothek-berlin.de/isil/{record['isil']}"
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': '2025-11-19T12:49:39Z'
        }
    }
    glam_records.append(glam_record)

# Save first 10 as example
with open('german_isil_linkml_example.yaml', 'w') as f:
    yaml.dump(glam_records[:10], f, allow_unicode=True, default_flow_style=False)
```
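A quick sanity check for the conversion: converted records should survive a YAML round-trip with umlauts intact. A sketch using a hypothetical converted record (PyYAML, as in the example above):

```python
import yaml

# Hypothetical converted record; 'name' contains an umlaut to exercise
# allow_unicode handling.
glam_record = {
    'id': 'https://w3id.org/heritage/custodian/de/de-12',
    'name': 'Bayerische Staatsbibliothek München',
}
dumped = yaml.dump([glam_record], allow_unicode=True, default_flow_style=False)
restored = yaml.safe_load(dumped)
print(restored[0]['name'])  # → Bayerische Staatsbibliothek München
```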
## Common Use Cases
### Find All Archives

```bash
jq '.records[] | select(.name | test("archiv"; "i"))' german_isil_complete_20251119_134939.json
```
### Find All Museums

```bash
jq '.records[] | select(.name | test("museum"; "i"))' german_isil_complete_20251119_134939.json
```
### Get Records with Email

```bash
jq '.records[] | select(.contact.email != null)' german_isil_complete_20251119_134939.json
```
### Extract URLs Only

```bash
jq -r '.records[].urls[]?.url' german_isil_complete_20251119_134939.json > german_isil_urls.txt
```
### Statistics by Federal State

```python
import json
from collections import Counter

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

states = Counter(r['address'].get('region') for r in data['records'])
for state, count in states.most_common(10):
    print(f"{state}: {count}")
```
## Integration with GLAM Project

See `HARVEST_REPORT.md` for detailed integration recommendations.

Key tasks:
- Map institution types to the GLAM taxonomy
- Generate GHCIDs for each institution
- Enrich with Wikidata Q-numbers
- Cross-reference with other German registries
- Export to RDF/Turtle for Linked Data
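The RDF/Turtle export task could start from simple string templating; a minimal sketch with an inline sample record (the schema.org predicates here are an illustrative choice, not the project's final vocabulary):

```python
# Emit one institution as Turtle; predicate choice (schema.org) is illustrative.
record = {
    'isil': 'DE-1',
    'name': 'Staatsbibliothek zu Berlin',
    'address': {'city': 'Berlin'},
}
subject = f"<https://w3id.org/heritage/custodian/de/{record['isil'].lower()}>"
turtle = "\n".join([
    "@prefix schema: <https://schema.org/> .",
    "",
    f"{subject} a schema:Library ;",
    f'    schema:name "{record["name"]}" ;',
    f'    schema:addressLocality "{record["address"]["city"]}" .',
])
print(turtle)
```

A real export would use an RDF library to handle escaping and serialization instead of hand-built strings.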
## Data Quality Notes
- ✅ 87% have coordinates - excellent for mapping
- ✅ 79% have URLs - good for web scraping
- ⚠️ 38% have emails - may need enrichment
- ⚠️ 97% lack institution type codes - requires classification
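The coverage figures above can be recomputed directly from the dataset; a sketch with a tiny inline sample standing in for the full record list:

```python
# Recompute field coverage; two inline sample records stand in for
# data['records'] from the full file.
records = [
    {'address': {'latitude': '52.52', 'longitude': '13.40'}, 'urls': []},
    {'address': {}, 'urls': [{'url': 'https://example.org'}]},
]
total = len(records)
with_coords = sum(
    1 for r in records
    if r['address'].get('latitude') and r['address'].get('longitude')
)
with_urls = sum(1 for r in records if r['urls'])
print(f"coordinates: {with_coords / total:.0%}, URLs: {with_urls / total:.0%}")
# → coordinates: 50%, URLs: 50%
```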
## Need Help?

See the GLAM project documentation:

- `/Users/kempersc/apps/glam/AGENTS.md` - AI agent instructions
- `/Users/kempersc/apps/glam/docs/` - Schema and design docs
- `/Users/kempersc/apps/glam/scripts/` - Example parsers and converters