glam/data/isil/germany/HARVEST_REPORT.md
2025-11-19 23:25:22 +01:00

# German ISIL Database Harvest Report
**Date**: November 19, 2025
**Harvester**: OpenCode + MCP Wikidata Tools
**Status**: ✅ **COMPLETE**
## Overview
This harvest retrieved the complete German ISIL (International Standard Identifier for Libraries and Related Organizations) database, which is maintained by the Staatsbibliothek zu Berlin and was queried through the SRU endpoint hosted by the Deutsche Nationalbibliothek.
## Data Source
- **Provider**: Staatsbibliothek zu Berlin (German ISIL Agency)
- **Website**: https://sigel.staatsbibliothek-berlin.de/
- **API Protocol**: SRU 1.1 (Search/Retrieve via URL)
- **API Endpoint**: https://services.dnb.de/sru/bib
- **Data Format**: PicaPlus-XML (parsed to JSON)
- **License**: CC0 1.0 Universal (Public Domain)
## Coverage
The German ISIL database is the **authoritative registry** for heritage institutions in Germany with ISIL codes. It covers:
- **Libraries** (Bibliotheken)
  - Public libraries
  - Academic libraries
  - Research libraries
  - Special libraries
- **Archives** (Archive)
  - State archives
  - City archives
  - Corporate archives
  - Personal archives
- **Museums** (Museen)
  - Art museums
  - History museums
  - Science museums
  - Technical museums
- **Related Organizations**
  - Documentation centers
  - Research institutes with libraries
  - Heritage societies
## Harvest Statistics
### Total Records
- **16,979 institutions** with German ISIL codes (DE-*)
### Data Completeness
- **87.0%** have street addresses (14,765 records)
- **79.4%** have URLs (13,483 records)
- **79.1%** have phone numbers (13,429 records)
- **87.0%** have geographic coordinates (14,771 records)
- **37.8%** have email addresses (6,420 records)
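
The percentages above can be recomputed from the harvested records. The following is a minimal sketch, assuming the field names shown in the Record Schema section; the helper names `pct`, `completeness`, and `CHECKS` are illustrative and not taken from the harvester script.

```python
from typing import Any, Callable

Record = dict[str, Any]


def _field(record: Record, section: str, key: str) -> Any:
    """Return record[section][key], tolerating missing or null sections."""
    return (record.get(section) or {}).get(key)


# Assumed completeness criteria; the harvester's own rules may differ.
CHECKS: dict[str, Callable[[Record], bool]] = {
    "street": lambda r: bool(_field(r, "address", "street")),
    "url": lambda r: bool(r.get("urls")),
    "phone": lambda r: bool(_field(r, "contact", "phone")),
    "coordinates": lambda r: bool(_field(r, "address", "latitude"))
    and bool(_field(r, "address", "longitude")),
    "email": lambda r: bool(_field(r, "contact", "email")),
}


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place; 0.0 for an empty dataset."""
    return round(100.0 * count / total, 1) if total else 0.0


def completeness(records: list[Record]) -> dict[str, tuple[int, float]]:
    """Map each check name to (matching record count, percentage)."""
    total = len(records)
    result: dict[str, tuple[int, float]] = {}
    for name, check in CHECKS.items():
        count = sum(1 for record in records if check(record))
        result[name] = (count, pct(count, total))
    return result
```

As a sanity check on the reported figures, `pct(14765, 16979)` and `pct(14771, 16979)` both round to 87.0.
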
### Geographic Distribution (Top 10 Regions)
1. **NRW** (North Rhine-Westphalia): 1,503 institutions
2. **BAW** (Baden-Württemberg): 1,295 institutions
3. **BAY** (Bavaria): 1,204 institutions
4. **HES** (Hesse): 659 institutions
5. **BER** (Berlin): 614 institutions
6. **NIE** (Lower Saxony): 598 institutions
7. **HAM** (Hamburg): 450 institutions
8. **SAX** (Saxony): 397 institutions
9. **SAA** (Saxony-Anhalt): 308 institutions
10. **THU** (Thuringia): 249 institutions
**Note**: 9,654 records (56.9%) have no interloan region code assigned.
### Institution Types
The database uses institutional codes (e.g., SBBPK, TUM, UBM) rather than standardized type classifications. The most common codes include:
- **University libraries** (UB-*): 50+ institutions
- **State libraries**: 10+ institutions
- **Max Planck Institute libraries** (MPI-*): 20+ institutions
- **Federal agency libraries** (B-*): 15+ institutions
- **University of Applied Sciences libraries** (FH-*): 80+ institutions
**Note**: 16,408 records (96.6%) have no institution type code in the database.
## Output Files
### 1. Complete Dataset (JSON)
**File**: `german_isil_complete_20251119_134939.json`
**Size**: 37 MB
**Format**: Structured JSON with metadata header
**Structure**:
```json
{
  "metadata": {
    "source": "German ISIL Database (Staatsbibliothek zu Berlin)",
    "harvest_date": "2025-11-19T12:49:39Z",
    "total_records": 16979,
    "license": "CC0 1.0 Universal"
  },
  "records": [...]
}
```
### 2. Complete Dataset (JSONL)
**File**: `german_isil_complete_20251119_134939.jsonl`
**Size**: 24 MB
**Format**: JSON Lines (one record per line)
**Use case**: Stream processing, database imports, line-by-line analysis
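
A minimal sketch of line-by-line processing of the JSONL file, assuming each non-empty line holds one record with the fields from the Record Schema section. The function names are illustrative.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_region(lines: Iterable[str]) -> Counter:
    """Count records per interloan region; records without a code go under '(none)'."""
    counts: Counter = Counter()
    for record in iter_records(lines):
        counts[record.get("interloan_region") or "(none)"] += 1
    return counts


# Usage (file name as listed above):
#   with open("german_isil_complete_20251119_134939.jsonl", encoding="utf-8") as fh:
#       print(count_by_region(fh).most_common(10))
```

Because a file handle is itself an iterable of lines, memory use stays roughly constant regardless of file size.
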
### 3. Statistics Summary
**File**: `german_isil_stats_20251119_134941.json`
**Size**: 7.6 KB
**Format**: JSON
**Contains**:
- Total record counts
- Data completeness percentages
- Institution type distribution
- Geographic distribution by interloan region
## Record Schema
Each record contains:
```jsonc
{
  "isil": "DE-1",                        // ISIL identifier
  "name": "Staatsbibliothek zu Berlin",  // Official name
  "alternative_names": [],               // Alternative forms
  "institution_type": "SBBPK",           // Institution code (optional)
  "address": {
    "street": "Unter den Linden 8",
    "city": "Berlin",
    "postal_code": "10117",
    "country": "DE",
    "region": "Berlin",
    "latitude": "52.51755",
    "longitude": "13.39162"
  },
  "contact": {
    "phone": "+49-30-2 66-433888",
    "fax": "+49-30-2 66-333701",
    "email": "info@sbb.spk-berlin.de"
  },
  "urls": [
    {
      "url": "http://staatsbibliothek-berlin.de",
      "type": "A",
      "label": null
    }
  ],
  "parent_org": null,                    // Parent institution (if branch)
  "interloan_region": "BER",             // Interloan region code
  "notes": "...",                        // Collection descriptions, etc.
  "raw_pica": {...}                      // Full PICA+ data structure
}
```
## Data Quality
### Strengths
- ✅ **Authoritative source** - Official German ISIL registry
- ✅ **Complete coverage** - All German ISIL-registered institutions
- ✅ **High geographic coverage** - 87% have coordinates
- ✅ **Rich contact data** - Phone numbers and URLs for about 79% of records
- ✅ **Well-structured addresses** - Standardized format
- ✅ **Public domain license** - No restrictions on reuse
### Limitations
- ⚠️ **Limited type classification** - 96.6% have no institution type code
- ⚠️ **No English translations** - Names and descriptions in German only
- ⚠️ **Incomplete interloan data** - 56.9% have no region assignment
- ⚠️ **Low email coverage** - Only 37.8% have email addresses
- ⚠️ **No historical data** - No founding or closure dates
## API Access Methods
The German ISIL database offers three API access methods:
### 1. SRU (Search/Retrieve via URL)
**Endpoint**: https://services.dnb.de/sru/bib
**Protocol**: SRU 1.1
**Formats**: PicaPlus-XML, RDF/XML
**Query Language**: CQL (Common Query Language)
**Example**:
```bash
curl "https://services.dnb.de/sru/bib?version=1.1&operation=searchRetrieve&query=isil%3DDE-1&recordSchema=PicaPlus-xml&maximumRecords=1"
```
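
The same request can be assembled programmatically. This is a minimal sketch using only the Python standard library; `build_sru_url` is an illustrative name, and `startRecord` (a standard SRU paging parameter that the curl example omits) is added here on the assumption that the endpoint honours it.

```python
from urllib.parse import urlencode

SRU_ENDPOINT = "https://services.dnb.de/sru/bib"


def build_sru_url(query: str, start_record: int = 1, maximum_records: int = 100) -> str:
    """Build a searchRetrieve URL; the CQL query is percent-encoded by urlencode."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "recordSchema": "PicaPlus-xml",
        "startRecord": start_record,        # 1-based offset (assumed to be supported)
        "maximumRecords": maximum_records,
    }
    return f"{SRU_ENDPOINT}?{urlencode(params)}"


# Usage: build_sru_url("isil=DE-1", maximum_records=1)
```
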
### 2. JSON-API
**Endpoint**: https://isil.staatsbibliothek-berlin.de/api/org.jsonld
**Format**: JSON-LD
**Query Language**: CQL
**Example**:
```bash
curl "https://isil.staatsbibliothek-berlin.de/api/org.jsonld?q=ort%3DBerlin&size=10"
```
### 3. Linked Data Service
**Endpoint**: https://ld.zdb-services.de/resource/organisations/<ISIL>
**Formats**: RDF/XML, Turtle, JSON-LD, HTML
**Protocol**: Content negotiation (303 redirects)
**Example**:
```bash
curl -H "Accept: application/rdf+xml" "https://ld.zdb-services.de/resource/organisations/DE-1"
```
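
Content negotiation can also be done from Python. The sketch below assumes the usual media types for the listed formats; the `FORMATS` mapping and function names are illustrative, and the request is only sent when `fetch_organisation` is called.

```python
from urllib.request import Request, urlopen

LD_BASE = "https://ld.zdb-services.de/resource/organisations/"

# Assumed media types for the formats listed above.
FORMATS = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "jsonld": "application/ld+json",
    "html": "text/html",
}


def build_ld_request(isil: str, fmt: str = "turtle") -> Request:
    """Prepare a GET request whose Accept header selects the representation."""
    return Request(LD_BASE + isil, headers={"Accept": FORMATS[fmt]})


def fetch_organisation(isil: str, fmt: str = "turtle", timeout: float = 30.0) -> str:
    """Send the request; urlopen follows the 303 redirect to the chosen document."""
    with urlopen(build_ld_request(isil, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage: fetch_organisation("DE-1", "rdfxml")
```
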
## Integration with GLAM Project
### Recommended Next Steps
1. **Parse and Convert to LinkML**
   - Map PICA+ fields to `HeritageCustodian` schema
   - Classify institutions using GLAMORCUBESFIXPHDNT taxonomy
   - Assign data tier: **TIER_1_AUTHORITATIVE**
2. **Enrich with Wikidata**
   - Query Wikidata for matching Q-numbers
   - Add founding dates, collection information
   - Link to parent organizations
3. **Cross-reference with Other Sources**
   - Compare with International ISIL Registry
   - Link to Museum Digital (museum-digital.de)
   - Connect with Archive Portal Germany (archivportal-d.de)
4. **Generate GHCIDs**
   - Create persistent identifiers for each institution
   - Use format: `DE-[REGION]-[CITY]-[TYPE]-[ABBR]`
   - Link ISIL codes as `Identifier` records
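
The GHCID format in step 4 could be assembled along these lines. This is only a sketch: how the region, city, type, and abbreviation components are derived is not specified in this report, so the function takes them as already-derived inputs, and the normalisation rule (uppercase ASCII letters and digits only) is an assumption.

```python
import re


def _normalise(part: str) -> str:
    """Uppercase a component and drop everything except ASCII letters and digits (assumed rule)."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", part).upper()
    if not cleaned:
        raise ValueError(f"empty GHCID component: {part!r}")
    return cleaned


def build_ghcid(region: str, city: str, type_code: str, abbr: str, country: str = "DE") -> str:
    """Join the components as COUNTRY-REGION-CITY-TYPE-ABBR."""
    parts = (country, region, city, type_code, abbr)
    return "-".join(_normalise(part) for part in parts)


# Usage with hypothetical component values:
#   build_ghcid("by", "Aug", "A", "StA")
```

Note that the assumed rule strips umlauts and other non-ASCII characters, so a real implementation would probably transliterate them first.
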
## Harvester Implementation
**Script**: `scripts/scrapers/harvest_german_isil_sru.py`
**Features**:
- ✅ Batch processing (100 records per request)
- ✅ Rate limiting (1 second delay between requests)
- ✅ Automatic retry on failure (3 attempts)
- ✅ Progress tracking
- ✅ Error handling
- ✅ Multiple output formats (JSON, JSONL)
- ✅ Complete PICA+ field parsing
**Performance**:
- **Total time**: ~3 minutes
- **Records/second**: ~94
- **Requests**: 170 (batch size 100)
- **No errors or failed requests**
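
The batching and retry behaviour listed above can be sketched as follows. This is not the code of `harvest_german_isil_sru.py`; the function names are illustrative, and the sketch only shows how 16,979 records at 100 per request yield 170 requests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def page_starts(total_records: int, batch_size: int = 100) -> list[int]:
    """Return the 1-based start offset of every batch needed to cover all records."""
    return list(range(1, total_records + 1, batch_size))


def with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, pausing `delay` seconds between tries."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(f"request failed after {attempts} attempts") from last_error


# Usage: for start in page_starts(16979): with_retry(lambda: download(start))
# where download() is a hypothetical function that fetches one batch.
```
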
## Example Records
### Example 1: Staatsbibliothek zu Berlin
```json
{
  "isil": "DE-1",
  "name": "Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, Haus Unter den Linden",
  "alternative_names": ["Berlin SBB Haus Unter d.Linden"],
  "institution_type": "SBBPK",
  "address": {
    "street": "Unter den Linden 8",
    "city": "Berlin",
    "postal_code": "10117",
    "country": "DE",
    "region": "Berlin",
    "latitude": "52.51755",
    "longitude": "13.39162"
  }
}
```
### Example 2: Stadtarchiv Augsburg
```json
{
  "isil": "DE-Aug9",
  "name": "Stadtarchiv Augsburg",
  "alternative_names": ["Augsburg Stadtarchiv"],
  "address": {
    "street": "Zur Kammgarnspinnerei 11",
    "city": "Augsburg",
    "postal_code": "86153",
    "country": "DE",
    "region": "Bayern",
    "latitude": "48.36337",
    "longitude": "10.91350"
  },
  "interloan_region": "BAY"
}
```
## Comparison with Other National ISIL Registries
| Country | Registry | Records | API | Coverage |
|---------|----------|---------|-----|----------|
| **Germany** | Staatsbibliothek zu Berlin | **16,979** | ✅ SRU, JSON, Linked Data | **Comprehensive** |
| Netherlands | KB | ~1,400 | ✅ CSV | Libraries, archives |
| Austria | OBVSG | ~3,000 | ✅ Search | Libraries only |
| Switzerland | Swiss National Library | ~1,500 | ✅ Search | Libraries, archives |
| France | ABES | ~5,000 | ✅ API | Academic libraries |
| UK | British Library | ~4,000 | ⚠️ (cyber attack) | Libraries |
**Of the registries compared here, Germany's is the largest by record count and covers the broadest range of institution types.**
## References
### Documentation
- ISIL Registry Homepage: https://sigel.staatsbibliothek-berlin.de/
- SRU API Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/sru
- JSON API Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/json-api
- Linked Data Service: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/linked-data-service
- PICA+ Format Specification: https://sigel.staatsbibliothek-berlin.de/vergabe/adressenformat
### Standards
- ISO 15511:2019 - International Standard Identifier for Libraries and Related Organizations (ISIL)
- SRU 1.1 - Search/Retrieve via URL (Library of Congress)
- PICA+ - Library cataloging format (OCLC)
- CC0 1.0 Universal - Public Domain Dedication
### Contact
- **ISIL Registration Authority** (international): isil@slks.dk
- **German ISIL Agency**: Staatsbibliothek zu Berlin
- **Technical Contact**: Carsten Klee (carsten.klee@sbb.spk-berlin.de)
- **Phone**: +49 30 266 434402
## License
The harvested data is licensed under **CC0 1.0 Universal (Public Domain Dedication)**.
You are free to:
- ✅ Copy, modify, distribute the data
- ✅ Use for commercial purposes
- ✅ Use without attribution (though attribution is appreciated)
**Attribution** (optional but recommended):
```
Data source: German ISIL Database, Staatsbibliothek zu Berlin
Retrieved: November 19, 2025
License: CC0 1.0 Universal
```
---
**Generated by**: OpenCode + MCP Wikidata Tools
**Report Date**: November 19, 2025
**Report Version**: 1.0