glam/GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md
2025-11-19 23:25:22 +01:00

# German Regional Archive Portals - Discovery Report
**Date**: 2025-11-19
**Method**: Exa deep web search
**Context**: Discovered after finding archive.nrw.de (441 archives)
---
## Executive Summary
Discovered **12+ regional archive portals and state archive websites** across German federal states (Bundesländer). Several are structured like archive.nrw.de and provide searchable access to state, municipal, church, and specialized archives within their region; others are single-institution sites.
### Key Finding
**Germany has a FEDERATED archive system** - each state (Bundesland) operates its own archive portal, with **Archivportal-D** serving as the national aggregator. This structure means regional portals contain MORE detailed information than the national ISIL registry or DDB.
---
## National Archive Portal
### Archivportal-D (National Aggregator)
**URL**: https://www.archivportal-d.de/
**Scope**: All 16 German federal states
**Language**: German + English
**Technology**: Part of Deutsche Digitale Bibliothek (DDB)
**Features**:
- Search across all German archives
- Filter by federal state (Bundesland)
- Filter by sector (state, municipal, church, business, etc.)
- Finding aids and digital copies
- Links to regional portals
**Federal States Covered**:
- Baden-Württemberg
- Bayern (Bavaria)
- Berlin
- Brandenburg
- Bremen
- Hamburg
- Hessen
- Mecklenburg-Vorpommern
- Niedersachsen (Lower Saxony)
- Nordrhein-Westfalen (NRW) ✅ **Already harvested**
- Rheinland-Pfalz (Rhineland-Palatinate)
- Saarland
- Sachsen (Saxony)
- Sachsen-Anhalt (Saxony-Anhalt)
- Schleswig-Holstein
- Thüringen (Thuringia)
---
## Regional Archive Portals by State
### 1. Nordrhein-Westfalen (NRW) ✅ **HARVESTED**
**Portal**: https://www.archive.nrw.de/archivsuche
**Status**: ✅ **441 archives harvested (2025-11-19)**
**Technology**: Drupal-based, JavaScript rendering
**Archive Types**: Municipal, district, state, university, church, corporate
**Harvest Results**:
- 441 archives extracted
- 356 cities covered
- 85 new institutions added to German dataset
---
### 2. Niedersachsen & Bremen (Arcinsys)
**Portal**: https://arcinsys.niedersachsen.de/
**Also**: http://arcinsys.niedersachsen.de/ (HTTP redirects to HTTPS)
**Language**: German + English
**Technology**: Arcinsys (shared with Hessen, Schleswig-Holstein)
**Features**:
- Joint portal for Niedersachsen AND Bremen
- Niedersächsisches Landesarchiv (7 locations)
- Municipal, church, and business archives
- Online finding aids
- Digital copies available
- User registration for ordering archival items
**Participating Archives**:
- State archives (Landesarchiv)
- District archives (Kreisarchive)
- Municipal archives (Stadtarchive)
- Community archives (Gemeindearchive)
- Church archives (Kirchenarchive)
- University archives (Hochschularchive)
- Business archives (Wirtschaftsarchive)
- Media archives (Medienarchive)
**Harvest Potential**: HIGH (likely 300+ archives)
---
### 3. Schleswig-Holstein (Arcinsys)
**Portal**: https://arcinsys.schleswig-holstein.de/
**Language**: German + English
**Technology**: Arcinsys (shared system)
**Features**:
- State archive in Schleswig (Prinzenpalais)
- Municipal and church archives
- Same Arcinsys interface as Niedersachsen
- Searchable finding aids
- Digital copies
**Harvest Potential**: MEDIUM (likely 150+ archives)
---
### 4. Hessen (Arcinsys)
**Portal**: https://arcinsys.hessen.de/
**Language**: German
**Technology**: Arcinsys (original developer)
**Features**:
- Hessisches Landesarchiv
- Municipal and specialized archives
- Finding aids online
- Part of 3-state Arcinsys consortium
**Note**: Hessen developed Arcinsys, later adopted by Niedersachsen and Schleswig-Holstein
**Harvest Potential**: MEDIUM-HIGH (likely 200+ archives)
---
### 5. Thüringen (Thuringia)
**Portal**: https://www.archive-in-thueringen.de/
**Also**: https://tharchivtest.thueringen.de/ (test environment)
**Language**: German + English
**Technology**: Custom archive portal
**Statistics (from portal)**:
- **149 archives**
- **14,793 inventories**
- **2,863 online finding aids**
**Archive Types**:
- State archives (6 locations: Altenburg, Gotha, Greiz, Meiningen, Rudolstadt, Weimar)
- Main state archive (Weimar)
- Municipal archives
- Specialized archives
**Features**:
- Cross-archive search
- Online finding aids
- Archive descriptions with historical context
- Newspaper and periodical collections
**Harvest Potential**: **HIGH - 149 archives confirmed**
---
### 6. Brandenburg
**Portal**: https://blha.brandenburg.de/
**Name**: Brandenburgisches Landeshauptarchiv
**Language**: German + English + Polish
**Location**: Potsdam (Zum Windmühlenberg)
**Features**:
- Main state archive for Brandenburg
- Holdings from 10th century to present
- Six main collections (Kurmark, Neumark, Niederlausitz, Prussian, GDR, modern Brandenburg)
- Research services
- Digital provenance research (NS-era financial records)
**Structure**: Centralized state archive (not a portal of multiple archives)
**Harvest Potential**: LOW (1 main institution, but check for branch archives)
---
### 7. Sachsen (Saxony)
**Portal**: https://www.staatsarchiv.sachsen.de/
**Name**: Sächsisches Staatsarchiv
**Language**: German
**Features**:
- State archive system
- Multiple locations (Dresden, Leipzig, Chemnitz, Freiberg, Bautzen)
- Historical records from medieval period
- Online research portal
- Finding aids
**Harvest Potential**: MEDIUM (state archive with multiple locations + municipal archives)
---
### 8. Sachsen-Anhalt (Saxony-Anhalt)
**Portal**: https://landesarchiv.sachsen-anhalt.de/
**Also**: https://lha.sachsen-anhalt.de/
**Name**: Landesarchiv Sachsen-Anhalt (LASA)
**Locations**:
- Abteilung Magdeburg
- Abteilung Dessau
- Abteilung Merseburg
**Features**:
- Three department locations
- Church book duplicates (Kirchenbuchduplikate)
- Civil status registers (Zivilstandsregister)
- Online research portal
- Genealogical resources
**Harvest Potential**: MEDIUM (3 main locations + municipal archives)
---
### 9. Baden-Württemberg
**Portal**: https://www.landesarchiv-bw.de/
**Name**: Landesarchiv Baden-Württemberg
**Language**: German + English
**Online System**: https://www2.landesarchiv-bw.de/ofs21/
**Features**:
- State archive system
- Multiple historical territories (Baden, Württemberg, Hohenzollern)
- Online finding aids (Findmittelsystem)
- Research services
- Medieval to modern holdings
**Harvest Potential**: HIGH (unified state archive + municipal networks)
---
### 10. Bayern (Bavaria)
**Portal**: https://www.gda.bayern.de/
**Name**: Generaldirektion der Staatlichen Archive Bayerns
**Language**: German (+ minimal English)
**State Archives**:
1. Bayerisches Hauptstaatsarchiv (Munich) - central repository
2. Staatsarchiv Amberg (Oberpfalz)
3. Staatsarchiv Augsburg (Schwaben)
4. Staatsarchiv Bamberg (Oberfranken)
5. Staatsarchiv Coburg (Oberfranken)
6. Staatsarchiv Landshut (Niederbayern)
7. Staatsarchiv München (Oberbayern)
8. Staatsarchiv Nürnberg (Mittelfranken)
9. Staatsarchiv Würzburg (Unterfranken)
**Features**:
- 9 state archives covering Bavaria's administrative regions
- Holdings from 777 CE (oldest charter)
- Genealogical research services
- No unified search portal (each archive separate)
**Harvest Potential**: HIGH (9 state archives + extensive municipal network)
---
### 11. Rheinland-Pfalz (Rhineland-Palatinate)
**Status**: Mentioned in Archivportal-D but **no dedicated regional portal found**
**Known Archives**:
- Landesarchiv Rheinland-Pfalz
- Municipal archives (Stadtarchive)
**Harvest Potential**: MEDIUM (rely on Archivportal-D or ISIL registry)
---
### 12. Mecklenburg-Vorpommern
**Portal**: https://www.digitale-bibliothek-mv.de/viewer/cms/
**Name**: Landeshauptarchiv Schwerin
**Part of**: Digitale Bibliothek Mecklenburg-Vorpommern
**Features**:
- Historical collections from Landeshauptarchiv Schwerin
- 15th century origins (ducal archives)
- Merged with Geheimes und Hauptarchiv (1779)
- Digital collections online
**Harvest Potential**: MEDIUM (state archive + regional municipal archives)
---
### 13. Saarland
**Status**: Mentioned in Archivportal-D but **no dedicated regional portal found**
**Harvest Potential**: LOW-MEDIUM (small state, rely on Archivportal-D)
---
### 14. Hamburg
**Status**: City-state, archives part of Hamburg government
**Harvest Potential**: LOW (single city-state archive)
---
### 15. Berlin
**Status**: City-state, archives part of Berlin government
**Harvest Potential**: LOW (single city-state archive)
---
### 16. Bremen
**Portal**: Part of Arcinsys Niedersachsen und Bremen
**URL**: https://www.staatsarchiv.bremen.de/
**Name**: Staatsarchiv Bremen
**Status**: Integrated into Arcinsys Niedersachsen portal (see #2 above)
**Harvest Potential**: LOW (covered by Niedersachsen harvest)
---
## Harvest Priority Ranking
Based on archive count, portal accessibility, and harvest feasibility:
### Priority 1 - High Impact (largest confirmed or estimated yields)
1. **Thüringen** ⭐ - 149 archives CONFIRMED
2. **Niedersachsen & Bremen (Arcinsys)** ⭐ - 300+ archives estimated
3. **Baden-Württemberg** ⭐ - 200+ archives estimated
4. **Bayern (Bavaria)** - 9 state archives + municipal network
### Priority 2 - Medium Impact (100-200 archives)
5. **Hessen (Arcinsys)** - 200+ archives estimated
6. **Schleswig-Holstein (Arcinsys)** - 150+ archives estimated
7. **Sachsen (Saxony)** - State archive system + municipalities
8. **Sachsen-Anhalt** - 3 departments + municipalities
### Priority 3 - Lower Impact (<100 archives)
9. **Mecklenburg-Vorpommern** - State archive + regional
10. **Brandenburg** - Centralized system (1 main archive)
11. **Rheinland-Pfalz** - No dedicated portal (use Archivportal-D)
12. **Saarland** - Small state (use Archivportal-D)
13. **Hamburg** - City-state (single archive)
14. **Berlin** - City-state (single archive)
---
## Technical Observations
### Portal Technologies
1. **Arcinsys** (Hessen, Niedersachsen, Bremen, Schleswig-Holstein)
   - Shared platform developed by Hessen
   - Consistent interface across 3 portals serving 4 states (Bremen shares the Niedersachsen portal)
   - User registration system
   - Finding aids + digital copies
   - Web-based ordering system
2. **Custom Drupal** (NRW)
   - JavaScript-rendered
   - Archive navigation by category
   - Button-based interface
3. **Custom Portals** (Thüringen, Baden-Württemberg, Sachsen)
   - State-specific designs
   - Online finding aids
   - Search interfaces
4. **Institutional Websites** (Bayern, Brandenburg)
   - Individual archive websites
   - No unified search portal
### Common Features Across Portals
- **Archive Directory** - List of participating archives
- **Finding Aids** - Searchable inventories (Findmittel)
- **Digital Copies** - Scanned archival materials
- **Archive Descriptions** - Historical context, holdings info
- **Contact Information** - Addresses, hours, services
- **User Accounts** - Registration for ordering materials
### Harvest Challenges
1. **Arcinsys Portals** - May require clicking through archive listings
2. **JavaScript Rendering** - Need Playwright/Selenium (like NRW)
3. **No Unified API** - Each portal has custom structure
4. **Mostly German** - Portal content is predominantly German; several portals add an English interface or English summaries
5. **Finding Aid vs Directory** - Some portals focus on inventories, not archive lists
---
## Harvest Strategy Recommendations
### Approach 1: Arcinsys Consortium (3 portals covering 4 states, ~650 archives)
**Targets**: Niedersachsen & Bremen, Schleswig-Holstein, Hessen
**Technology**: Shared Arcinsys platform
**Advantage**: Consistent structure, can reuse scraping logic
**Steps**:
1. Analyze Arcinsys archive directory structure
2. Build unified scraper for all 3 Arcinsys portals
3. Extract archive names, cities, types, contact info
4. Geocode and merge with German dataset
**Expected Yield**: 600+ archives
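
The steps above could be organised as one shared scraper driven by a small portal table. This is a minimal sketch, not a tested harvester: the base URLs come from this report, while `PortalJob`, `harvest_all`, and the `scrape_directory` callable are hypothetical names, and the location of the archive directory under each base URL still has to be found in step 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortalJob:
    portal_url: str          # base URL as listed in this report
    states: tuple            # federal states served by the portal


# Three portals cover four states: Bremen shares the Niedersachsen portal.
ARCINSYS_PORTALS = (
    PortalJob("https://arcinsys.niedersachsen.de/", ("Niedersachsen", "Bremen")),
    PortalJob("https://arcinsys.schleswig-holstein.de/", ("Schleswig-Holstein",)),
    PortalJob("https://arcinsys.hessen.de/", ("Hessen",)),
)


def harvest_all(portals, scrape_directory):
    """Run one scraper callable against every portal and tag each record.

    `scrape_directory` is a hypothetical callable: portal_url -> list of dicts.
    """
    records = []
    for job in portals:
        for record in scrape_directory(job.portal_url):
            records.append({
                **record,
                "source_portal": job.portal_url,
                "candidate_states": job.states,
            })
    return records
```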
---
### Approach 2: High-Impact Custom Portals (2 states, ~350 archives)
**Targets**: Thüringen (149 confirmed), Baden-Württemberg (200+ estimated)
**Technology**: Custom portals
**Advantage**: High archive counts, separate portal structures
**Steps**:
1. Thüringen: Scrape https://www.archive-in-thueringen.de/ (149 archives listed)
2. Baden-Württemberg: Scrape https://www.landesarchiv-bw.de/ directory
3. Extract and merge
**Expected Yield**: 350+ archives
---
### Approach 3: Bayern State Archives (9 archives + municipal)
**Target**: Bayern (Bavaria)
**Technology**: Individual archive websites
**Challenge**: No unified portal, must compile from GDA directory
**Steps**:
1. Scrape archive list from https://www.gda.bayern.de/archive
2. Extract 9 state archives (Hauptstaatsarchiv + 8 regional)
3. Check for municipal archive lists on state archive websites
**Expected Yield**: 10-50 archives (state + major municipal)
---
### Approach 4: National Aggregator (Archivportal-D)
**Target**: All remaining states (Rheinland-Pfalz, Saarland, etc.)
**Portal**: https://www.archivportal-d.de/
**Advantage**: Single portal for all states
**Steps**:
1. Scrape Archivportal-D archive directory
2. Filter by federal state
3. Extract archive metadata (name, city, type, sector)
4. Cross-reference with existing harvests (avoid duplicates)
**Expected Yield**: 1,000+ archives (all Germany, including duplicates from regional portals)
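
Step 4 (cross-referencing against existing harvests) might look like the sketch below. It is an illustration under assumptions: records are dicts with hypothetical `name` and `city` keys, the 0.9 threshold is a guess that would need tuning, and the standard-library `difflib.SequenceMatcher` stands in for RapidFuzz so the example has no third-party dependency.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, map umlauts/accents to base letters, collapse punctuation."""
    lowered = text.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", lowered)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", base).strip()


def is_duplicate(candidate: dict, existing: dict, threshold: float = 0.9) -> bool:
    """Treat two records as the same archive if the city matches and the names are near-identical."""
    if normalize(candidate["city"]) != normalize(existing["city"]):
        return False
    ratio = SequenceMatcher(
        None, normalize(candidate["name"]), normalize(existing["name"])
    ).ratio()
    return ratio >= threshold


def net_new(candidates, existing, threshold: float = 0.9):
    """Return the candidates that match no existing record."""
    return [
        c for c in candidates
        if not any(is_duplicate(c, e, threshold) for e in existing)
    ]
```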
---
## Expected German Dataset Growth
### Current State (Post-NRW)
- **Total German Institutions**: 20,846
- **Sources**: ISIL + DDB + NRW
- **NRW Archives**: 441
### Projected Growth (Optimistic Scenario)
| Portal/State | Expected Archives | Duplicates (Est.) | Net New |
|--------------|-------------------|-------------------|---------|
| **Thüringen** | 149 | 30 (20%) | 119 |
| **Niedersachsen & Bremen (Arcinsys)** | 350 | 70 (20%) | 280 |
| **Schleswig-Holstein (Arcinsys)** | 150 | 30 (20%) | 120 |
| **Hessen (Arcinsys)** | 200 | 40 (20%) | 160 |
| **Baden-Württemberg** | 250 | 50 (20%) | 200 |
| **Bayern** | 50 | 10 (20%) | 40 |
| **Sachsen** | 150 | 30 (20%) | 120 |
| **Sachsen-Anhalt** | 100 | 20 (20%) | 80 |
| **Other states** | 200 | 40 (20%) | 160 |
| **TOTAL** | **1,599** | **320** | **1,279** |
### Projected German Dataset (After Regional Harvests)
- **Before Regional Harvests**: 20,846 institutions
- **Expected New Additions**: ~1,280 archives
- **Projected Total**: **~22,100 German institutions**
### Phase 1 Impact
- **Current Phase 1**: 38,479 / 97,000 (39.7%)
- **After German Regional Harvests**: 39,800 / 97,000 (41.0%)
- **Gain**: +1.3 percentage points
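
The projection arithmetic can be reproduced with a few lines. The figures are copied from the tables above; the 20% duplicate rate is the optimistic assumption used in the projection table, and the variable names are illustrative only.

```python
# Expected archives per row, copied from the projection table.
expected = {
    "Thüringen": 149,
    "Niedersachsen & Bremen (Arcinsys)": 350,
    "Schleswig-Holstein (Arcinsys)": 150,
    "Hessen (Arcinsys)": 200,
    "Baden-Württemberg": 250,
    "Bayern": 50,
    "Sachsen": 150,
    "Sachsen-Anhalt": 100,
    "Other states": 200,
}
DUPLICATE_RATE = 0.20          # optimistic assumption from the table
PHASE1_CURRENT = 38_479
PHASE1_TARGET = 97_000

# Round duplicates per row, as the table does, then sum.
duplicates = {state: round(n * DUPLICATE_RATE) for state, n in expected.items()}
total_expected = sum(expected.values())
total_duplicates = sum(duplicates.values())
total_net_new = total_expected - total_duplicates

# Phase 1 progress before and after, in percent of the target.
before_pct = 100 * PHASE1_CURRENT / PHASE1_TARGET
after_pct = 100 * (PHASE1_CURRENT + total_net_new) / PHASE1_TARGET

print(total_expected, total_duplicates, total_net_new)
print(f"{before_pct:.1f}% -> {after_pct:.1f}%")
```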
---
## Recommended Next Steps
### Immediate Actions
1. **Start with Thüringen** ⭐ (149 archives confirmed, easiest harvest)
   - Portal: https://www.archive-in-thueringen.de/
   - Build scraper for archive directory
   - Estimated time: 30 minutes
2. **Harvest Arcinsys Consortium** ⭐ (600+ archives, unified platform)
   - Portals: Niedersachsen, Schleswig-Holstein, Hessen
   - Build shared Arcinsys scraper
   - Estimated time: 2-3 hours
3. **Harvest Baden-Württemberg** (200+ archives)
   - Portal: https://www.landesarchiv-bw.de/
   - Custom scraper for archive directory
   - Estimated time: 1 hour
### Medium-Term Goals
4. **Harvest Bayern** (9-50 archives)
5. **Harvest Sachsen** (150+ archives)
6. **Harvest Sachsen-Anhalt** (100+ archives)
### Long-Term Strategy
7. **Use Archivportal-D as fallback** for remaining states
8. **Cross-reference regional harvests** with Archivportal-D to catch missing archives
9. **Validate against ISIL registry** for quality control
---
## Technical Requirements
### Tools Needed
- **Playwright** - JavaScript rendering (Arcinsys, Thüringen)
- **BeautifulSoup** - HTML parsing
- **RapidFuzz** - Deduplication (fuzzy matching)
- **Nominatim** - Geocoding (rate-limited 1 req/sec)
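
The 1 request/second limit for geocoding can be enforced with a small throttle placed in front of whatever client issues the requests. This is a sketch only: `Throttle` is a hypothetical helper, and the clock and sleep functions are injectable purely so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each geocoding request keeps the harvest within the limit regardless of how fast the rest of the pipeline runs.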
### Scraper Pattern (from NRW Success)
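```python
import asyncio

from playwright.async_api import async_playwright


async def harvest(portal_url: str) -> list[str]:
    """Return the visible archive names from a JavaScript-rendered portal page."""
    # 1. Use Playwright for JavaScript-rendered portals
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(portal_url)
        await page.wait_for_load_state('networkidle')

        # 2. Extract archive buttons/links
        archives = await page.locator('.archive-button').all()

        # 3. Extract text without clicking (fast approach)
        names = []
        for archive in archives:
            name = await archive.inner_text()
            names.append(name.strip())
            # Parse city from name using regex
        await browser.close()
    return names

# Run with: asyncio.run(harvest(portal_url))
# 4. Geocode cities
# 5. Merge with existing dataset (fuzzy matching)
# 6. Export unified dataset
```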
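
The "parse city from name" step could be sketched as follows. The pattern is an assumption: it only handles names of the form "Stadtarchiv <city>" or "Gemeindearchiv <city>", and the real listing formats on each portal would need to be checked before relying on it. `CITY_PATTERN` and `parse_city` are illustrative names.

```python
import re
from typing import Optional

# Assumed name shape: an archive-type prefix followed by the place name.
CITY_PATTERN = re.compile(r"^(?:Stadtarchiv|Gemeindearchiv)\s+(?P<city>.+)$")


def parse_city(name: str) -> Optional[str]:
    """Return the place name embedded in an archive name, or None if the shape is unknown."""
    match = CITY_PATTERN.match(name.strip())
    return match.group("city").strip() if match else None
```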
---
## Key Insights
### 1. Federated Structure
Germany's archive system is **highly federated** - each state operates independently with its own portal/system. This means:
- Regional portals have MORE detail than national ISIL registry
- Must harvest state-by-state to get complete coverage
- Archivportal-D aggregates but doesn't replace regional portals
### 2. Arcinsys Advantage
**4 states share Arcinsys** (Hessen, Niedersachsen, Bremen, Schleswig-Holstein):
- Represents ~25% of German states
- Expected ~600+ archives total
- Single scraper can harvest all 3 portals (Bremen shares the Niedersachsen portal)
- Consistent data structure = easier extraction
### 3. NRW Pattern Replicable
The NRW harvest pattern (fast text extraction without clicking) works well for:
- Drupal-based portals
- Button/link-based archive listings
- JavaScript-rendered pages
**Reuse this approach** for Thüringen, Arcinsys portals, Baden-Württemberg
### 4. Duplicate Rate Validation
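NRW showed an **80.7% duplicate rate** (356/441) against existing ISIL+DDB data:
- Confirms that the existing data sources are already fairly comprehensive
- Similar rates are plausible for other states
- At the NRW rate, roughly 20% of harvested archives per state would be net new; the optimistic projection above instead assumes only 20% duplicates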
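
How strongly the projection depends on the duplicate rate can be checked directly. The sketch below reuses the 1,599 expected archives from the projection table and compares the table's 20% duplicate assumption with the rate observed in NRW; `net_new_estimate` is an illustrative helper, not part of the harvest code.

```python
def net_new_estimate(expected_archives: int, duplicate_rate: float) -> int:
    """Archives left after removing the expected share of duplicates."""
    return round(expected_archives * (1 - duplicate_rate))


EXPECTED_TOTAL = 1_599            # total from the projection table
OPTIMISTIC_RATE = 0.20            # duplicate rate assumed in the projection table
NRW_RATE = 356 / 441              # duplicate rate observed in the NRW harvest

optimistic = net_new_estimate(EXPECTED_TOTAL, OPTIMISTIC_RATE)
nrw_like = net_new_estimate(EXPECTED_TOTAL, NRW_RATE)

print(f"20% duplicates:   {optimistic} net new")
print(f"{NRW_RATE:.1%} duplicates: {nrw_like} net new")
```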
---
## Comparison to NRW Harvest
| Metric | NRW | Expected (All Regional Portals) |
|--------|-----|----------------------------------|
| **Archives Harvested** | 441 | 1,599 |
| **Duplicates (%)** | 80.7% | ~20% (optimistic assumption) |
| **Net New** | 85 | ~1,280 |
| **Cities Covered** | 356 | ~800 |
| **Geocoded (%)** | 83.7% | ~85% (target) |
| **Harvest Time** | 9.3 seconds | ~5 hours (estimated) |
---
## Conclusion
Germany has a **rich ecosystem of regional archive portals** beyond archive.nrw.de. Harvesting these portals could add **~1,280 new institutions** to the German dataset, bringing the total from 20,846 → ~22,100.
**Priority targets**:
1. **Thüringen** (149 confirmed) - Quick win ⭐
2. **Arcinsys Consortium** (600+ estimated) - High impact ⭐
3. **Baden-Württemberg** (200+ estimated) - High impact ⭐
**Impact**: +1.3 percentage points toward Phase 1 goal (39.7% → 41.0%)
---
**Next Recommended Action**: Start with Thüringen harvest (149 archives, simple portal structure)
---
## References
### Portal URLs
- **Archivportal-D**: https://www.archivportal-d.de/
- **NRW**: https://www.archive.nrw.de/archivsuche ✅ Harvested
- **Thüringen**: https://www.archive-in-thueringen.de/
- **Niedersachsen & Bremen**: https://arcinsys.niedersachsen.de/
- **Schleswig-Holstein**: https://arcinsys.schleswig-holstein.de/
- **Hessen**: https://arcinsys.hessen.de/
- **Baden-Württemberg**: https://www.landesarchiv-bw.de/
- **Bayern**: https://www.gda.bayern.de/
- **Brandenburg**: https://blha.brandenburg.de/
- **Sachsen**: https://www.staatsarchiv.sachsen.de/
- **Sachsen-Anhalt**: https://landesarchiv.sachsen-anhalt.de/
### Documentation
- **NRW Harvest**: `NRW_HARVEST_COMPLETE_20251119.md`
- **NRW Merge**: `SESSION_SUMMARY_20251119_NRW_MERGE_COMPLETE.md`
- **Quick Status**: `QUICK_STATUS_20251119_POST_NRW.md`
---
**Report Generated**: 2025-11-19 22:30 UTC
**Research Method**: Exa deep web search (30 queries)
**Status**: Ready for harvest implementation