glam/data/isil/germany/QUICK_START.md
2025-11-19 23:25:22 +01:00

German ISIL Data - Quick Start Guide

Files Location

/Users/kempersc/apps/glam/data/isil/germany/
├── german_isil_complete_20251119_134939.json    # Full dataset (37 MB)
├── german_isil_complete_20251119_134939.jsonl   # Line-delimited (24 MB)
├── german_isil_stats_20251119_134941.json       # Statistics (7.6 KB)
└── HARVEST_REPORT.md                             # Detailed harvest report

Quick Access Examples

1. Load Full Dataset in Python

import json

# Load complete dataset
with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Total records: {data['metadata']['total_records']}")
print(f"First institution: {data['records'][0]['name']}")

2. Stream Processing (JSONL)

import json

# Process one record at a time (memory-efficient)
with open('german_isil_complete_20251119_134939.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        if record['address'].get('city') == 'Berlin':
            print(f"{record['name']} - {record['isil']}")

3. Find Records by ISIL

# Using jq (JSON query tool)
jq '.records[] | select(.isil == "DE-1")' german_isil_complete_20251119_134939.json

# Using grep on the JSONL file (fast, case-insensitive text search, here by name fragment rather than ISIL)
grep -i "staatsbibliothek" german_isil_complete_20251119_134939.jsonl
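
For repeated lookups, an in-memory index keyed by ISIL avoids rescanning the file for every query. The following is a minimal sketch, assuming each JSONL line is one record with an `isil` field as in the examples above; `build_isil_index` is a hypothetical helper name, not part of the GLAM project.

```python
import json
from typing import Dict, Iterable


def build_isil_index(lines: Iterable[str]) -> Dict[str, dict]:
    """Map each ISIL code to its record, reading one JSON object per line.

    Hypothetical helper: assumes every record carries an 'isil' field.
    """
    index: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        isil = record.get('isil')
        if isil:
            index[isil] = record  # a later duplicate overwrites an earlier one
    return index


# Usage with the JSONL file from this directory:
# with open('german_isil_complete_20251119_134939.jsonl', 'r', encoding='utf-8') as f:
#     index = build_isil_index(f)
# print(index['DE-1']['name'])
```

The function takes any iterable of lines, so it works on an open file handle as well as on a list of strings in tests.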

4. Extract All Institutions in Munich

jq '.records[] | select(.address.city == "München")' german_isil_complete_20251119_134939.json

5. Count Institutions by Region

jq '.records | group_by(.interloan_region) | map({region: .[0].interloan_region, count: length})' german_isil_complete_20251119_134939.json

6. Export to CSV

import json
import csv

with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

with open('german_isil.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ISIL', 'Name', 'City', 'Email', 'URL'])
    
    for record in data['records']:
        writer.writerow([
            record['isil'],
            record['name'],
            record['address'].get('city', ''),
            record['contact'].get('email', ''),
            record['urls'][0]['url'] if record.get('urls') else ''  # tolerate a missing or empty 'urls' list
        ])

7. Filter by Institution Type

import json

with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Find all Max Planck Institute libraries
mpi_libraries = [
    r for r in data['records']
    if (r.get('institution_type') or '').startswith('MPI')  # type code is absent for most records
]

print(f"Found {len(mpi_libraries)} Max Planck Institute libraries")

8. Create Geographic Map

import json
import folium

with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Create map centered on Germany
m = folium.Map(location=[51.1657, 10.4515], zoom_start=6)

# Add markers for institutions with coordinates
for record in data['records']:
    lat = record['address'].get('latitude')
    lon = record['address'].get('longitude')
    
    if lat and lon:
        folium.Marker(
            location=[float(lat), float(lon)],
            popup=f"{record['name']}<br>{record['isil']}",
            tooltip=record['name']
        ).add_to(m)

m.save('german_isil_map.html')

9. Convert to LinkML YAML

import json
import yaml

with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to GLAM project schema
glam_records = []

for record in data['records']:
    glam_record = {
        'id': f"https://w3id.org/heritage/custodian/de/{record['isil'].lower()}",
        'name': record['name'],
        'institution_type': 'LIBRARY',  # TODO: classify properly
        'locations': [{
            'city': record['address'].get('city'),
            'street_address': record['address'].get('street'),
            'postal_code': record['address'].get('postal_code'),
            'country': 'DE',
            'latitude': float(record['address']['latitude']) if record['address'].get('latitude') else None,
            'longitude': float(record['address']['longitude']) if record['address'].get('longitude') else None
        }],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': record['isil'],
            'identifier_url': f"https://sigel.staatsbibliothek-berlin.de/isil/{record['isil']}"
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': '2025-11-19T12:49:39Z'
        }
    }
    glam_records.append(glam_record)

# Save first 10 as example
with open('german_isil_linkml_example.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(glam_records[:10], f, allow_unicode=True, default_flow_style=False)

Common Use Cases

Find All Archives

jq '.records[] | select(.name | test("archiv"; "i"))' german_isil_complete_20251119_134939.json

Find All Museums

jq '.records[] | select(.name | test("museum"; "i"))' german_isil_complete_20251119_134939.json

Get Records with Email

jq '.records[] | select(.contact.email != null)' german_isil_complete_20251119_134939.json

Extract URLs Only

jq -r '.records[].urls[]?.url' german_isil_complete_20251119_134939.json > german_isil_urls.txt

Statistics by Federal State

import json
from collections import Counter

with open('german_isil_complete_20251119_134939.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

states = Counter(r['address'].get('region') for r in data['records'])
for state, count in states.most_common(10):
    print(f"{state}: {count}")

Integration with GLAM Project

See HARVEST_REPORT.md for detailed integration recommendations.

Key Tasks:

  1. Map institution types to GLAMORCUBESFIXPHDNT taxonomy
  2. Generate GHCIDs for each institution
  3. Enrich with Wikidata Q-numbers
  4. Cross-reference with other German registries
  5. Export to RDF/Turtle for Linked Data
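
As a possible starting point for task 1, institution types could be guessed from names, in the same spirit as the "archiv" and "museum" name filters above. This is only a sketch under stated assumptions: the labels `ARCHIVE`, `MUSEUM`, and `UNKNOWN` and the helper `guess_institution_type` are hypothetical placeholders, and the real taxonomy codes should come from the GLAM schema docs.

```python
import re

# Ordered keyword rules: the first matching pattern wins.
# Labels other than 'LIBRARY' are assumed placeholders, not confirmed taxonomy codes.
_RULES = [
    (re.compile(r'archiv', re.IGNORECASE), 'ARCHIVE'),
    (re.compile(r'museum', re.IGNORECASE), 'MUSEUM'),
    (re.compile(r'biblioth|bücherei', re.IGNORECASE), 'LIBRARY'),
]


def guess_institution_type(name, default='UNKNOWN'):
    """Return a coarse type label based on keywords in the institution name."""
    for pattern, label in _RULES:
        if pattern.search(name or ''):
            return label
    return default


# Usage:
# for record in data['records']:
#     record_type = guess_institution_type(record['name'])
```

Name matching is a heuristic: a museum library would be labelled by whichever rule comes first, so results need manual review before they are treated as authoritative.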

Data Quality Notes

  • 87% have coordinates - excellent for mapping
  • 79% have URLs - good for web scraping
  • ⚠️ 38% have emails - may need enrichment
  • ⚠️ 97% lack institution type codes - requires classification
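
These percentages can be rechecked against the dataset with a small coverage report. The sketch below assumes the field layout used in the examples above (`address.latitude`, `address.longitude`, `urls`, `contact.email`, `institution_type`); `field_coverage` is a hypothetical helper name.

```python
def field_coverage(records):
    """Return the percentage of records that have each field populated."""
    records = list(records)
    total = len(records)

    def has_coordinates(r):
        address = r.get('address') or {}
        return bool(address.get('latitude')) and bool(address.get('longitude'))

    def has_email(r):
        return bool((r.get('contact') or {}).get('email'))

    checks = {
        'coordinates': has_coordinates,
        'urls': lambda r: bool(r.get('urls')),
        'email': has_email,
        'institution_type': lambda r: bool(r.get('institution_type')),
    }

    report = {}
    for label, check in checks.items():
        hits = sum(1 for r in records if check(r))
        # Avoid division by zero on an empty dataset
        report[label] = round(100 * hits / total, 1) if total else 0.0
    return report


# Usage:
# for label, pct in field_coverage(data['records']).items():
#     print(f"{label}: {pct}%")
```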

Need Help?

See the GLAM project documentation:

  • /Users/kempersc/apps/glam/AGENTS.md - AI agent instructions
  • /Users/kempersc/apps/glam/docs/ - Schema and design docs
  • /Users/kempersc/apps/glam/scripts/ - Example parsers and converters