glam/data/isil/germany/EXECUTION_GUIDE.md
2025-11-19 23:25:22 +01:00

# German Archive Completion - Execution Guide
**Status**: Ready to execute (pending DDB API key)
**Goal**: 100% German archive coverage (~25,000-27,000 total institutions)
---
## Quick Start
### Prerequisites (10 minutes)
1. **Get DDB API Key**:
   - Visit: https://www.deutsche-digitale-bibliothek.de/
   - Register → Verify email → Log in
   - Navigate to "Meine DDB" → Generate API key
   - Copy the key
2. **Install Dependencies** (if needed):
```bash
pip install requests rapidfuzz
```
### Execution (3-4 hours)
#### Step 1: Configure API Key (1 minute)
Edit the API harvester script:
```bash
nano scripts/scrapers/harvest_archivportal_d_api.py
```
Replace the placeholder on line 21:
```python
API_KEY = "YOUR_API_KEY_HERE"
```
With your actual key:
```python
API_KEY = "ddb_abc123xyz456..."
```
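As an alternative to hardcoding the key in the script, the key could be read from the environment. The following is a minimal sketch only: the variable name `DDB_API_KEY` and the helper `load_api_key` are hypothetical, and the harvester script is not known to support them.

```python
import os

# Hypothetical environment variable name; the harvester script is not
# known to read it, so this is an illustration only.
ENV_VAR = "DDB_API_KEY"
PLACEHOLDER = "YOUR_API_KEY_HERE"


def load_api_key(env=os.environ):
    """Return the API key from the environment, stripped of whitespace."""
    key = env.get(ENV_VAR, "").strip()
    # Reject a missing key and the unedited placeholder alike.
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {ENV_VAR} to your DDB API key before harvesting")
    return key
```

With this approach, `API_KEY = load_api_key()` would replace the hardcoded assignment, and the key stays out of version control.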
#### Step 2: Run API Harvest (1-2 hours)
```bash
cd /Users/kempersc/apps/glam  # adjust to your local checkout of the repository
python3 scripts/scrapers/harvest_archivportal_d_api.py
```
**Expected output**:
- `data/isil/germany/archivportal_d_api_TIMESTAMP.json` (~10,000-20,000 archives)
- `data/isil/germany/archivportal_d_api_stats_TIMESTAMP.json` (statistics)
#### Step 3: Cross-Reference with ISIL (1 hour)
```bash
python3 scripts/scrapers/merge_archivportal_isil.py
```
**Expected output**:
- `data/isil/germany/merged_matched_TIMESTAMP.json` (overlapping archives)
- `data/isil/germany/merged_new_discoveries_TIMESTAMP.json` (new archives)
- `data/isil/germany/merged_isil_only_TIMESTAMP.json` (ISIL-only institutions)
- `data/isil/germany/merged_stats_TIMESTAMP.json` (overlap statistics)
#### Step 4: Create Unified Dataset (1 hour)
```bash
python3 scripts/scrapers/create_german_unified_dataset.py
```
**Expected output**:
- `data/isil/germany/german_unified_TIMESTAMP.json` (~25,000-27,000 institutions)
- `data/isil/germany/german_unified_TIMESTAMP.jsonl` (line-delimited format)
---
## What Each Script Does
### 1. `harvest_archivportal_d_api.py`
**Purpose**: Fetch all German archives from the Deutsche Digitale Bibliothek (DDB) API
**How it works**:
- Connects to DDB REST API with your authentication key
- Queries `sector=arc_archives` to get archives only
- Fetches in batches of 100 records
- Respects rate limits (0.5s delay between requests)
- Saves raw JSON with metadata
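The batching and delay described above can be sketched as a generic paging loop. This is not the script's actual code: `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the response shape (`numberOfResults`, `results`) is an assumption for illustration.

```python
import time
from typing import Callable, Dict, List

BATCH_SIZE = 100      # records per request, as described above
REQUEST_DELAY = 0.5   # seconds between requests


def harvest_all(fetch_page: Callable[[int, int], Dict],
                batch_size: int = BATCH_SIZE,
                delay: float = REQUEST_DELAY) -> List[Dict]:
    """Collect all records by requesting pages until the total is reached.

    `fetch_page(offset, rows)` is a hypothetical callable that returns
    {"numberOfResults": int, "results": [...]} for one page.
    """
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        results = page.get("results", [])
        records.extend(results)
        offset += batch_size
        # Stop on an empty page or once the reported total is covered.
        if not results or offset >= page.get("numberOfResults", 0):
            break
        time.sleep(delay)  # pause between requests to respect rate limits
    return records


# Stand-in for the real API: 250 fake records served in pages.
FAKE = [{"id": f"archive-{i}"} for i in range(250)]


def fake_fetch(offset: int, rows: int) -> Dict:
    return {"numberOfResults": len(FAKE), "results": FAKE[offset:offset + rows]}
```

Calling `harvest_all(fake_fetch, delay=0)` walks through three pages and returns all 250 stand-in records.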
**Output format**:
```json
{
"metadata": {
"source": "Archivportal-D via DDB API",
"total_archives": 12345,
"harvest_date": "2025-11-19T..."
},
"archives": [
{
"id": "archive-unique-id",
"name": "Stadtarchiv Köln",
"location": "Köln",
"federal_state": "Nordrhein-Westfalen",
"archive_type": "Kommunalarchiv",
"isil": "DE-KN",
"latitude": "50.9375",
"longitude": "6.9603",
"profile_url": "https://www.archivportal-d.de/item/..."
}
]
}
```
**Estimated time**: 1-2 hours (depends on total archive count)
---
### 2. `merge_archivportal_isil.py`
**Purpose**: Cross-reference Archivportal-D archives with the ISIL registry
**Matching strategy**:
1. **ISIL exact match**: Match archives by ISIL code (high confidence)
2. **Fuzzy name+city**: Match by institution name + location (threshold: 85% similarity)
**How it works**:
- Loads ISIL data (16,979 institutions)
- Loads Archivportal-D data (from previous harvest)
- Matches by ISIL code first (30-50% expected overlap)
- Fuzzy matches remaining by name + city
- Identifies new discoveries (archives without ISIL codes)
- Identifies ISIL-only institutions (not in Archivportal)
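The two-pass matching can be sketched as follows. To stay dependency-free, this sketch uses the standard library's `difflib.SequenceMatcher` in place of `rapidfuzz`, so its scores will differ from the real script's. The ISIL record field names (`name`, `city`) and the function names are assumptions.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 85  # percent similarity, as in the matching strategy above


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in percent (0-100)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_archive(archive: dict, isil_records: list) -> tuple:
    """Return (isil_record, method), or (None, 'new_discovery') if unmatched."""
    # Pass 1: exact ISIL code match (high confidence).
    code = archive.get("isil")
    if code:
        for rec in isil_records:
            if rec.get("isil") == code:
                return rec, "isil_exact"
    # Pass 2: fuzzy match on the combined "name city" string.
    key = f"{archive.get('name', '')} {archive.get('location', '')}"
    best, best_score = None, 0.0
    for rec in isil_records:
        score = similarity(key, f"{rec.get('name', '')} {rec.get('city', '')}")
        if score > best_score:
            best, best_score = rec, score
    if best is not None and best_score >= FUZZY_THRESHOLD:
        return best, "fuzzy_name_city"
    return None, "new_discovery"
```

Comparing every archive against every ISIL record is quadratic; for tens of thousands of records, grouping candidates by city first would likely be needed.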
**Output categories**:
- **Matched**: Archives in both sources (high quality, cross-validated)
- **New discoveries**: Archives only in Archivportal-D (need ISIL assignment)
- **ISIL-only**: Institutions in ISIL but not in Archivportal-D (libraries/museums)
**Estimated time**: 1 hour
---
### 3. `create_german_unified_dataset.py`
**Purpose**: Combine all sources into a single comprehensive dataset
**Data integration**:
1. **Matched institutions**: ISIL data (Tier 1) enriched with Archivportal-D metadata
2. **ISIL-only**: All ISIL records not in Archivportal (libraries, museums, smaller archives)
3. **New discoveries**: Archivportal-D archives not in ISIL registry
**Enrichment process**:
- ISIL data is authoritative (Tier 1)
- Add Archivportal-D fields where missing:
  - Archive type/subtype
  - Federal state
  - Coordinates (if better/missing)
  - Profile URLs
  - Thumbnails
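The "ISIL is authoritative, fill gaps from Archivportal-D" rule can be sketched as below. The field mapping follows the sample schemas in this guide and is an assumption about the real scripts; `enrich` is a hypothetical helper. Thumbnails and the "better coordinates" case are left out.

```python
# Archivportal-D fields copied onto the ISIL record only when missing.
# The field names mirror the sample schemas in this guide; the real
# scripts may use different ones.
ENRICH_FIELDS = {
    "archive_type": "institution_subtype",
    "federal_state": "federal_state",
    "latitude": "latitude",
    "longitude": "longitude",
    "profile_url": "archivportal_url",
}


def enrich(isil_record: dict, portal_record: dict) -> dict:
    """Return a copy of the ISIL record with gaps filled from Archivportal-D."""
    merged = dict(isil_record)  # never mutate the authoritative input
    added = False
    for src, dst in ENRICH_FIELDS.items():
        value = portal_record.get(src)
        if value and not merged.get(dst):  # an existing ISIL value wins
            merged[dst] = value
            added = True
    merged["data_sources"] = ["ISIL", "Archivportal-D"]
    merged["enriched_from_archivportal"] = added
    merged["data_tier"] = "TIER_1_AUTHORITATIVE"
    return merged
```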
**Output fields** (unified schema):
```json
{
"id": "unique-id",
"institution_name": "Stadtarchiv München",
"city": "München",
"federal_state": "Bayern",
"institution_type": "ARCHIVE",
"institution_subtype": "Kommunalarchiv",
"isil_code": "DE-M212",
"latitude": "48.1351",
"longitude": "11.5820",
"archivportal_id": "...",
"archivportal_url": "https://...",
"data_sources": ["ISIL", "Archivportal-D"],
"enriched_from_archivportal": true,
"data_tier": "TIER_1_AUTHORITATIVE"
}
```
**Estimated time**: 1 hour
---
## Expected Results
### Coverage Breakdown
| Source | Count | Percentage |
|--------|-------|------------|
| **ISIL-only** (libraries, museums) | ~14,000 | 56% |
| **Matched** (cross-validated archives) | ~3,000-5,000 | 12-20% |
| **New discoveries** (archives without ISIL) | ~7,000-10,000 | 28-40% |
| **TOTAL** | **~25,000-27,000** | **100%** |
### Institution Types
| Type | Estimated Count | Source |
|------|----------------|--------|
| **LIBRARY** | ~8,000-10,000 | ISIL |
| **ARCHIVE** | ~12,000-15,000 | ISIL + Archivportal-D |
| **MUSEUM** | ~3,000-4,000 | ISIL |
| **OTHER** | ~1,000-2,000 | ISIL |
### Data Quality
| Metric | Expected | Notes |
|--------|----------|-------|
| **With ISIL codes** | ~17,000 (68%) | All ISIL + some Archivportal |
| **With coordinates** | ~22,000 (88%) | High geocoding coverage |
| **With websites** | ~13,000 (52%) | ISIL provides URLs |
| **Needing ISIL** | ~7,000-10,000 (28-40%) | New archive discoveries |
---
## Troubleshooting
### Issue: API Key Invalid
**Error**: `401 Unauthorized` or `403 Forbidden`
**Solutions**:
- Verify that the API key was copied correctly (no spaces or newlines)
- Check that the key is active in your DDB account settings
- Ensure the request uses the `Bearer` token format: `Authorization: Bearer {key}`
### Issue: No Results Returned
**Error**: `numberOfResults: 0`
**Solutions**:
- Verify API endpoint: `https://api.deutsche-digitale-bibliothek.de/search`
- Check query parameters: `sector=arc_archives`
- Try broader query: `query=*`
- Check DDB API status page
### Issue: Rate Limited
**Error**: `429 Too Many Requests`
**Solutions**:
- Increase `REQUEST_DELAY` from 0.5s to 1.0s or 2.0s
- Reduce `BATCH_SIZE` from 100 to 50
- Wait 5-10 minutes before retrying
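Instead of adjusting the delay by hand, a retry with doubling delays is a common way to handle `429` responses. The sketch below is a generic illustration and not part of the harvester script: `RateLimited` and `with_backoff` are hypothetical names, and the caller is assumed to raise `RateLimited` when the API answers with status 429.

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied request function on HTTP 429."""


def with_backoff(request, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `request()` and retry with doubling delays when rate limited."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last retry
            sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the retry logic can be exercised without actually waiting.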
### Issue: Merge Script Fails
**Error**: `FileNotFoundError: No Archivportal-D data found`
**Solution**:
- Run Step 2 (API harvest) first
- Verify that a file matching `data/isil/germany/archivportal_d_api_*.json` exists
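Note that the pattern `archivportal_d_api_*.json` also matches the statistics file `archivportal_d_api_stats_TIMESTAMP.json`. A minimal sketch for locating the newest harvest file while skipping the stats file is shown below. `latest_harvest_file` is a hypothetical helper, and it assumes the timestamp sorts lexicographically (for example `YYYYMMDD_HHMMSS`, as in the existing ISIL file name).

```python
from pathlib import Path
from typing import Optional


def latest_harvest_file(data_dir: str) -> Optional[Path]:
    """Return the newest harvest file by name, skipping the stats files."""
    candidates = [
        p for p in Path(data_dir).glob("archivportal_d_api_*.json")
        # The same pattern also matches archivportal_d_api_stats_*.json.
        if not p.name.startswith("archivportal_d_api_stats_")
    ]
    # Assumes TIMESTAMP sorts lexicographically (e.g. YYYYMMDD_HHMMSS).
    return max(candidates, key=lambda p: p.name) if candidates else None
```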
### Issue: Fuzzy Matching Too Strict
**Symptom**: Too few matches in merge results
**Solution**:
- Edit `merge_archivportal_isil.py`
- Lower `FUZZY_THRESHOLD` from 85 to 75
- Re-run merge script
---
## Validation Checklist
After completing all steps, verify:
- [ ] **API harvest**: 10,000-20,000 archives fetched
- [ ] **Federal states**: All 16 German states represented
- [ ] **ISIL overlap**: 30-50% of archives have ISIL codes
- [ ] **Coordinates**: 80%+ of archives geocoded
- [ ] **New discoveries**: 7,000-10,000 archives without ISIL
- [ ] **Unified dataset**: ~25,000-27,000 total institutions
- [ ] **Duplicates**: < 1% (check by ISIL code)
- [ ] **Data tiers**: TIER_1 (ISIL), TIER_2 (Archivportal-D)
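Several of these checks can be computed directly from the harvested records. The sketch below assumes records shaped like the harvester's sample output (`isil`, `federal_state`, `latitude`, `longitude`); `validate` is a hypothetical helper, not one of the project scripts.

```python
from collections import Counter


def validate(records: list) -> dict:
    """Compute a few of the checklist metrics for a list of archive dicts."""
    total = len(records)
    codes = [r["isil"] for r in records if r.get("isil")]
    # Every repeat of an ISIL code beyond its first use counts as a duplicate.
    duplicates = sum(n - 1 for n in Counter(codes).values() if n > 1)
    geocoded = sum(1 for r in records if r.get("latitude") and r.get("longitude"))
    states = {r["federal_state"] for r in records if r.get("federal_state")}
    return {
        "total": total,
        "federal_states": len(states),
        "isil_share": len(codes) / total if total else 0.0,
        "geocoded_share": geocoded / total if total else 0.0,
        "duplicate_share": duplicates / total if total else 0.0,
    }
```

Against the checklist, a passing harvest would show `federal_states == 16`, `isil_share` between 0.3 and 0.5, `geocoded_share` of at least 0.8, and `duplicate_share` below 0.01.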
---
## Next Steps (After Completion)
### Immediate (Documentation)
1. **Create harvest report**: `data/isil/germany/GERMAN_UNIFIED_REPORT.md`
2. **Update progress tracker**: Add German completion to `data/isil/HARVEST_PROGRESS_SUMMARY.md`
3. **Document statistics**: Archive types, federal state distribution, etc.
### Short-term (Data Processing)
4. **Convert to LinkML**: Transform JSON into LinkML YAML instances
5. **Generate GHCIDs**: Create persistent identifiers for all institutions
6. **Geocode missing**: Fill in coordinates for remaining ~12%
7. **ISIL assignment**: Propose ISIL codes for new discoveries
### Medium-term (Integration)
8. **Export formats**: Generate RDF, CSV, Parquet, SQLite
9. **Wikidata enrichment**: Query for Q-numbers, VIAF IDs
10. **Quality validation**: Check for data anomalies, outliers
11. **Provenance tracking**: Add extraction metadata, confidence scores
### Long-term (Project Impact)
12. **Move to next country**: Czech Republic, Austria, or France
13. **Archive completeness**: Apply same strategy to other countries
14. **Priority 1 completion**: Target all 36 Priority 1 countries
---
## Project Impact
### Before German Completion
- **Total records**: 25,436
- **Progress**: 26.2% of 97,000 target
- **German coverage**: 16,979 (mostly libraries/museums, ~30% archives)
### After German Completion
- **Total records**: ~35,000-40,000
- **Progress**: ~40% of 97,000 target
- **German coverage**: ~25,000-27,000 (100% archives, all sectors)
### Milestone Achievement
- 🇩🇪 **First country with 100% archive coverage**
- 📈 **Project progress: +15% in one session**
- 🎯 **Archive completeness model** for other countries
- 🔬 **Methodology proven** for national portals
---
## Timeline Summary
| Phase | Time | Status |
|-------|------|--------|
| **Planning & Strategy** | 5 hours | Complete |
| **DDB API Registration** | 10 min | Pending |
| **API Harvest** | 1-2 hours | Ready |
| **Cross-Reference** | 1 hour | Ready |
| **Unified Dataset** | 1 hour | Ready |
| **Documentation** | 1 hour | Ready |
| **TOTAL** | **~9 hours** | **90% complete** |
---
## Key Files Reference
### Scripts (Created)
- `scripts/scrapers/harvest_archivportal_d_api.py`
- `scripts/scrapers/merge_archivportal_isil.py`
- `scripts/scrapers/create_german_unified_dataset.py`
### Data (Existing)
- `data/isil/germany/german_isil_complete_20251119_134939.json`
### Data (To Be Created)
- `data/isil/germany/archivportal_d_api_TIMESTAMP.json` 🔄
- `data/isil/germany/merged_matched_TIMESTAMP.json` 🔄
- `data/isil/germany/german_unified_TIMESTAMP.json` 🔄
### Documentation (Existing)
- `data/isil/germany/NEXT_SESSION_QUICK_START.md`
- `data/isil/germany/COMPLETENESS_PLAN.md`
- `data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md`
---
## Contact & Support
**DDB Support**: https://www.deutsche-digitale-bibliothek.de/content/contact
**API Documentation**: https://api.deutsche-digitale-bibliothek.de/ (requires login)
**Archivportal-D**: https://www.archivportal-d.de/kontakt
---
**Ready to start?** 🚀
1. Get your DDB API key (10 minutes)
2. Run the three scripts in order
3. 🇩🇪 Germany 100% complete!