# German ISIL Data - Quick Start Guide
## Files Location
```
/Users/kempersc/apps/glam/data/isil/germany/
├── german_isil_complete_20251119_134939.json # Full dataset (37 MB)
├── german_isil_complete_20251119_134939.jsonl # Line-delimited (24 MB)
├── german_isil_stats_20251119_134941.json # Statistics (7.6 KB)
└── HARVEST_REPORT.md # Harvest report
```
## Quick Access Examples
### 1. Load Full Dataset in Python
```python
import json

# Load the complete dataset into memory
with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

print(f"Total records: {data['metadata']['total_records']}")
print(f"First institution: {data['records'][0]['name']}")
```
### 2. Stream Processing (JSONL)
```python
import json

# Process one record at a time (memory-efficient)
with open('german_isil_complete_20251119_134939.jsonl', 'r') as f:
    for line in f:
        record = json.loads(line)
        if record['address'].get('city') == 'Berlin':
            print(f"{record['name']} - {record['isil']}")
```
### 3. Find Records by ISIL
```bash
# Using jq (JSON query tool)
jq '.records[] | select(.isil == "DE-1")' german_isil_complete_20251119_134939.json
# Using grep (fast text search)
grep -i "staatsbibliothek" german_isil_complete_20251119_134939.jsonl
```
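For repeated lookups, building an in-memory ISIL index once is faster than rescanning the file each time. A minimal sketch, assuming only the `isil` field shown in the examples above:

```python
import json

def build_isil_index(jsonl_path):
    """Build a dict mapping ISIL -> full record for O(1) lookups."""
    index = {}
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            index[record['isil']] = record
    return index

# Usage:
# index = build_isil_index('german_isil_complete_20251119_134939.jsonl')
# print(index['DE-1']['name'])
```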
### 4. Extract All Libraries in Munich
```bash
jq '.records[] | select(.address.city == "München")' german_isil_complete_20251119_134939.json
```
### 5. Count Institutions by Region
```bash
jq '.records | group_by(.interloan_region) | map({region: .[0].interloan_region, count: length})' german_isil_complete_20251119_134939.json
```
### 6. Export to CSV
```python
import json
import csv

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

with open('german_isil.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ISIL', 'Name', 'City', 'Email', 'URL'])
    for record in data['records']:
        writer.writerow([
            record['isil'],
            record['name'],
            record['address'].get('city', ''),
            record['contact'].get('email', ''),
            record['urls'][0]['url'] if record['urls'] else ''
        ])
```
### 7. Filter by Institution Type
```python
import json

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Find all Max Planck Institute libraries
mpi_libraries = [
    r for r in data['records']
    if r['institution_type'] and r['institution_type'].startswith('MPI')
]
print(f"Found {len(mpi_libraries)} Max Planck Institute libraries")
```
### 8. Create Geographic Map
```python
import json
import folium

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Create a map centered on Germany
m = folium.Map(location=[51.1657, 10.4515], zoom_start=6)

# Add markers for institutions that have coordinates
for record in data['records']:
    lat = record['address'].get('latitude')
    lon = record['address'].get('longitude')
    if lat and lon:
        folium.Marker(
            location=[float(lat), float(lon)],
            popup=f"{record['name']}<br>{record['isil']}",
            tooltip=record['name']
        ).add_to(m)

m.save('german_isil_map.html')
```
### 9. Convert to LinkML YAML
```python
import json
import yaml

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Convert to the GLAM project schema
glam_records = []
for record in data['records']:
    glam_record = {
        'id': f"https://w3id.org/heritage/custodian/de/{record['isil'].lower()}",
        'name': record['name'],
        'institution_type': 'LIBRARY',  # TODO: classify properly
        'locations': [{
            'city': record['address'].get('city'),
            'street_address': record['address'].get('street'),
            'postal_code': record['address'].get('postal_code'),
            'country': 'DE',
            'latitude': float(record['address']['latitude']) if record['address'].get('latitude') else None,
            'longitude': float(record['address']['longitude']) if record['address'].get('longitude') else None
        }],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': record['isil'],
            'identifier_url': f"https://sigel.staatsbibliothek-berlin.de/isil/{record['isil']}"
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': '2025-11-19T12:49:39Z'
        }
    }
    glam_records.append(glam_record)

# Save the first 10 records as an example
with open('german_isil_linkml_example.yaml', 'w') as f:
    yaml.dump(glam_records[:10], f, allow_unicode=True, default_flow_style=False)
```
## Common Use Cases
### Find All Archives
```bash
jq '.records[] | select(.name | test("archiv"; "i"))' german_isil_complete_20251119_134939.json
```
### Find All Museums
```bash
jq '.records[] | select(.name | test("museum"; "i"))' german_isil_complete_20251119_134939.json
```
### Get Records with Email
```bash
jq '.records[] | select(.contact.email != null)' german_isil_complete_20251119_134939.json
```
### Extract URLs Only
```bash
jq -r '.records[].urls[]?.url' german_isil_complete_20251119_134939.json > german_isil_urls.txt
```
### Statistics by Federal State
```python
import json
from collections import Counter

with open('german_isil_complete_20251119_134939.json', 'r') as f:
    data = json.load(f)

# Count records per federal state (stored in address.region)
states = Counter(r['address'].get('region') for r in data['records'])
for state, count in states.most_common(10):
    print(f"{state}: {count}")
```
## Integration with GLAM Project
See `HARVEST_REPORT.md` for detailed integration recommendations.
**Key Tasks**:
1. Map institution types to the GLAM project taxonomy
2. Generate GHCIDs for each institution
3. Enrich with Wikidata Q-numbers
4. Cross-reference with other German registries
5. Export to RDF/Turtle for Linked Data
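For task 3, Wikidata can be queried by ISIL in batches. A minimal sketch that only builds the SPARQL query string; it assumes Wikidata's ISIL property is P791 (verify before use), and the resulting query would be sent to the public SPARQL endpoint:

```python
def wikidata_isil_query(isils):
    """Build a SPARQL query resolving ISILs to Wikidata Q-numbers.

    Assumes Wikidata property P791 (ISIL); confirm before running
    against https://query.wikidata.org/sparql with format=json.
    """
    values = ' '.join(f'"{i}"' for i in isils)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )

# Example:
# print(wikidata_isil_query(['DE-1', 'DE-12']))
```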
## Data Quality Notes
- ✅ 87% have coordinates - excellent for mapping
- ✅ 79% have URLs - good for web scraping
- ⚠️ 38% have emails - may need enrichment
- ⚠️ 97% lack institution type codes - requires classification
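The coverage figures above can be recomputed from the dataset. A minimal sketch, using the field names shown throughout this guide:

```python
import json

def coverage_stats(records):
    """Return the percentage of records with coordinates, URLs, and emails."""
    n = len(records)

    def pct(predicate):
        return round(100 * sum(1 for r in records if predicate(r)) / n, 1)

    return {
        'coordinates': pct(lambda r: r['address'].get('latitude')
                           and r['address'].get('longitude')),
        'urls': pct(lambda r: bool(r.get('urls'))),
        'emails': pct(lambda r: bool(r['contact'].get('email'))),
    }

# Usage:
# with open('german_isil_complete_20251119_134939.json') as f:
#     print(coverage_stats(json.load(f)['records']))
```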
## Need Help?
See the GLAM project documentation:
- `/Users/kempersc/apps/glam/AGENTS.md` - AI agent instructions
- `/Users/kempersc/apps/glam/docs/` - Schema and design docs
- `/Users/kempersc/apps/glam/scripts/` - Example parsers and converters