# German ISIL Data - Quick Start Guide
## Files Location
```
/Users/kempersc/apps/glam/data/isil/germany/
├── german_isil_complete_20251119_134939.json    # Full dataset (37 MB)
├── german_isil_complete_20251119_134939.jsonl   # Line-delimited (24 MB)
├── german_isil_stats_20251119_134941.json       # Statistics (7.6 KB)
└── HARVEST_REPORT.md                            # Harvest report
```
## Quick Access Examples
### 1. Load Full Dataset in Python
```python
import json

# Load complete dataset
with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

print(f"Total records: {data['metadata']['total_records']}")
print(f"First institution: {data['records'][0]['name']}")
```
### 2. Stream Processing (JSONL)
```python
import json

# Process one record at a time (memory-efficient)
with open('german_isil_complete_20251119_134939.jsonl', 'r') as f:
    for line in f:
        record = json.loads(line)
        if record['address'].get('city') == 'Berlin':
            print(f"{record['name']} - {record['isil']}")
```
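If you need repeated lookups rather than a single pass, the JSONL file can be loaded once into an ISIL-keyed dictionary. A minimal sketch, using two inline sample lines in place of the real file:

```python
import json

# Build an ISIL-keyed index once, then look records up in O(1)
# instead of rescanning the JSONL file for every query.
# The two sample lines below stand in for the real JSONL records.
sample_lines = [
    '{"isil": "DE-1", "name": "Staatsbibliothek zu Berlin", "address": {"city": "Berlin"}}',
    '{"isil": "DE-12", "name": "Bayerische Staatsbibliothek", "address": {"city": "M\\u00fcnchen"}}',
]

index = {record['isil']: record for record in map(json.loads, sample_lines)}

print(index['DE-1']['name'])  # → Staatsbibliothek zu Berlin
```

Against the real file, replace `sample_lines` with the open file handle; the index costs roughly as much memory as loading the full JSON.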
### 3. Find Records by ISIL
```bash
# Using jq (JSON query tool)
jq '.records[] | select(.isil == "DE-1")' german_isil_complete_20251119_134939.json

# Using grep (fast text search)
grep -i "staatsbibliothek" german_isil_complete_20251119_134939.jsonl
```
### 4. Extract All Libraries in Munich
```bash
jq '.records[] | select(.address.city == "München")' german_isil_complete_20251119_134939.json
```
### 5. Count Institutions by Region
```bash
jq '.records | group_by(.interloan_region) | map({region: .[0].interloan_region, count: length})' german_isil_complete_20251119_134939.json
```
### 6. Export to CSV
```python
import json
import csv

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

with open('german_isil.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ISIL', 'Name', 'City', 'Email', 'URL'])

    for record in data['records']:
        writer.writerow([
            record['isil'],
            record['name'],
            record['address'].get('city', ''),
            record['contact'].get('email', ''),
            record['urls'][0]['url'] if record['urls'] else ''
        ])
```
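An equivalent using `csv.DictWriter` keeps each column name next to its value, which is easier to extend when adding fields later. A sketch over one hypothetical sample record matching the field layout used throughout this guide, writing to an in-memory buffer instead of a file:

```python
import csv
import io

# Sample record (hypothetical, same shape as the dataset records)
sample_records = [
    {'isil': 'DE-1', 'name': 'Staatsbibliothek zu Berlin',
     'address': {'city': 'Berlin'}, 'contact': {}, 'urls': []},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['ISIL', 'Name', 'City', 'Email', 'URL'])
writer.writeheader()
for r in sample_records:
    writer.writerow({
        'ISIL': r['isil'],
        'Name': r['name'],
        'City': r['address'].get('city', ''),
        'Email': r['contact'].get('email', ''),
        'URL': r['urls'][0]['url'] if r['urls'] else '',
    })

print(buf.getvalue())
```

For the real export, swap `buf` for the `open('german_isil.csv', ...)` handle and `sample_records` for `data['records']`.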
### 7. Filter by Institution Type
```python
import json

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Find all Max Planck Institute libraries
mpi_libraries = [
    r for r in data['records']
    if r['institution_type'] and r['institution_type'].startswith('MPI')
]

print(f"Found {len(mpi_libraries)} Max Planck Institute libraries")
```
### 8. Create Geographic Map
```python
import json
import folium

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Create map centered on Germany
m = folium.Map(location=[51.1657, 10.4515], zoom_start=6)

# Add markers for institutions with coordinates
for record in data['records']:
    lat = record['address'].get('latitude')
    lon = record['address'].get('longitude')

    if lat and lon:
        folium.Marker(
            location=[float(lat), float(lon)],
            popup=f"{record['name']}<br>{record['isil']}",
            tooltip=record['name']
        ).add_to(m)

m.save('german_isil_map.html')
```
### 9. Convert to LinkML YAML
```python
import json
import yaml

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Convert to GLAM project schema
glam_records = []

for record in data['records']:
    glam_record = {
        'id': f"https://w3id.org/heritage/custodian/de/{record['isil'].lower()}",
        'name': record['name'],
        'institution_type': 'LIBRARY',  # TODO: classify properly
        'locations': [{
            'city': record['address'].get('city'),
            'street_address': record['address'].get('street'),
            'postal_code': record['address'].get('postal_code'),
            'country': 'DE',
            'latitude': float(record['address']['latitude']) if record['address'].get('latitude') else None,
            'longitude': float(record['address']['longitude']) if record['address'].get('longitude') else None
        }],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': record['isil'],
            'identifier_url': f"https://sigel.staatsbibliothek-berlin.de/isil/{record['isil']}"
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': '2025-11-19T12:49:39Z'
        }
    }
    glam_records.append(glam_record)

# Save first 10 as example
with open('german_isil_linkml_example.yaml', 'w') as f:
    yaml.dump(glam_records[:10], f, allow_unicode=True, default_flow_style=False)
```
## Common Use Cases
### Find All Archives
```bash
jq '.records[] | select(.name | test("archiv"; "i"))' german_isil_complete_20251119_134939.json
```
### Find All Museums
```bash
jq '.records[] | select(.name | test("museum"; "i"))' german_isil_complete_20251119_134939.json
```
### Get Records with Email
```bash
jq '.records[] | select(.contact.email != null)' german_isil_complete_20251119_134939.json
```
### Extract URLs Only
```bash
jq -r '.records[].urls[]?.url' german_isil_complete_20251119_134939.json > german_isil_urls.txt
```
### Statistics by Federal State
```python
import json
from collections import Counter

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

states = Counter(r['address'].get('region') for r in data['records'])
for state, count in states.most_common(10):
    print(f"{state}: {count}")
```
## Integration with GLAM Project
See `HARVEST_REPORT.md` for detailed integration recommendations.
**Key Tasks**:
1. Map institution types to GLAMORCUBESFIXPHDNT taxonomy
2. Generate GHCIDs for each institution
3. Enrich with Wikidata Q-numbers
4. Cross-reference with other German registries
5. Export to RDF/Turtle for Linked Data
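The Turtle export can be sketched without extra dependencies by emitting triples directly. The subject IRI below reuses the w3id pattern from example 9; the `schema.org` predicates are illustrative assumptions, not the project vocabulary:

```python
# Sketch: emit Turtle triples for one hypothetical record.
# The predicates are illustrative; swap in the real project vocabulary.
record = {'isil': 'DE-1', 'name': 'Staatsbibliothek zu Berlin'}

subject = f"<https://w3id.org/heritage/custodian/de/{record['isil'].lower()}>"
turtle = "\n".join([
    "@prefix schema: <https://schema.org/> .",
    "",
    f"{subject} a schema:Library ;",
    f'    schema:name "{record["name"]}"@de ;',
    f'    schema:identifier "{record["isil"]}" .',
])

print(turtle)
```

For production use, a library such as `rdflib` would handle escaping and serialization formats; hand-built strings like this are only safe for quick inspection.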
## Data Quality Notes
- ✅ 87% have coordinates - excellent for mapping
- ✅ 79% have URLs - good for web scraping
- ⚠️ 38% have emails - may need enrichment
- ⚠️ 97% lack institution type codes - requires classification
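These coverage figures can be recomputed directly from the dataset. A sketch, with two inline sample records standing in for `data['records']` from the full JSON file:

```python
# Sketch: recompute field-coverage percentages. The two sample records
# are hypothetical stand-ins for data['records'].
records = [
    {'address': {'latitude': '52.52', 'longitude': '13.37'},
     'urls': [{'url': 'https://example.org'}], 'contact': {}},
    {'address': {}, 'urls': [], 'contact': {'email': 'info@example.org'}},
]

total = len(records)
coverage = {
    'coordinates': sum(1 for r in records
                       if r['address'].get('latitude') and r['address'].get('longitude')),
    'urls': sum(1 for r in records if r['urls']),
    'emails': sum(1 for r in records if r['contact'].get('email')),
}

for field, count in coverage.items():
    print(f"{field}: {100 * count / total:.0f}%")
```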
## Need Help?
See the GLAM project documentation:
- `/Users/kempersc/apps/glam/AGENTS.md` - AI agent instructions
- `/Users/kempersc/apps/glam/docs/` - Schema and design docs
- `/Users/kempersc/apps/glam/scripts/` - Example parsers and converters