feat: Add comprehensive harvester for Thüringen archives

- Implemented a new script to extract full metadata from 149 archive detail pages on archive-in-thueringen.de.
- Extracted data includes addresses, emails, phones, directors, collection sizes, opening hours, histories, and more.
- Introduced structured data parsing and error handling for robust data extraction.
- Added rate limiting (1 request/second) to respect server load.
- Results are saved in a JSON format with detailed metadata about the extraction process.
kempersc 2025-11-20 00:25:45 +01:00
parent 3c80de87e0
commit 38354539a6
4 changed files with 11173 additions and 0 deletions


@@ -0,0 +1,363 @@
# Thüringen Archives Harvest - Complete ✅
**Date**: 2025-11-19
**Status**: Successfully completed and merged into German unified dataset v3.0
## Executive Summary
Successfully harvested and integrated 149 archives from the Thüringen regional portal (archive-in-thueringen.de) into the German institutions dataset.
**Key Results**:
- ✅ Harvested: 149 archives (100% success rate)
- ✅ Geocoded: 81/149 cities extracted (54.4%), 79/81 geocoded (97.5%)
- ✅ Merged: 89 new additions (59.7%) after deduplication
- ✅ Dataset growth: 20,846 → 20,935 institutions (+89, +0.4%)
---
## Phase 1: Portal Discovery
**Portal**: https://www.archive-in-thueringen.de/de/archiv/list
**Structure**:
- Single-page listing with all 149 archives visible
- No pagination required (very fast harvest)
- Format: Mix of "City - Archive Name" and organizational prefixes
**Key Patterns Discovered**:
- Simple: `"Stadtarchiv Erfurt"` → Erfurt
- Prefixed: `"Altenburg - Stadtarchiv Altenburg"` → Altenburg
- Complex: `"Landesarchiv Thüringen - Staatsarchiv Altenburg"` → Altenburg (from second part)
- State: `"Landesarchiv Thüringen Hauptstaatsarchiv Weimar"` → Weimar
---
## Phase 2: Harvester Development
**Script**: `scripts/scrapers/harvest_thueringen_archives.py`
**Technology Stack**:
- Playwright for JavaScript rendering
- Direct DOM extraction (no API)
- Regex-based city extraction
**City Extraction Logic** (Priority Order):
1. **PRIORITY 1**: Archive name patterns
- `Hauptstaatsarchiv` pattern: `"Hauptstaatsarchiv Weimar"` → `"Weimar"`
- `Staatsarchiv` pattern: `"Staatsarchiv Altenburg"` → `"Altenburg"`
- `Stadtarchiv` pattern: `"Stadtarchiv Erfurt"` → `"Erfurt"`
- `Gemeindearchiv` pattern: `"Gemeindearchiv Rohr"` → `"Rohr"`
- `Stadt- und Kreisarchiv` pattern: `"Stadt- und Kreisarchiv Arnstadt"` → `"Arnstadt"`
- `Universitätsarchiv` pattern: `"Universitätsarchiv Ilmenau"` → `"Ilmenau"`
2. **PRIORITY 2**: Split by " - " and filter organizational names
- Skip: "Landesarchiv", "Bundesarchiv", "EKM", "Landkreis", "Kreisarchiv"
- Accept: Actual city names (e.g., "Altenburg", "Erfurt")
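The two-priority scheme above can be sketched in Python. The pattern list and prefix filter below are condensed illustrations of the documented rules, not the production code in `harvest_thueringen_archives.py`:

```python
import re
from typing import Optional

# PRIORITY 1: archive-type patterns, most specific first.
# Hauptstaatsarchiv must precede Staatsarchiv, which would also match it.
CITY_PATTERNS = [
    r'Hauptstaatsarchiv\s+(.+?)$',
    r'Stadt- und Kreisarchiv\s+(.+?)$',
    r'Staatsarchiv\s+(.+?)$',
    r'Stadtarchiv\s+(.+?)$',
    r'Gemeindearchiv\s+(.+?)$',
    r'Universitätsarchiv\s+(.+?)$',
]

# PRIORITY 2: organizational prefixes that are never city names.
ORG_PREFIXES = ("Landesarchiv", "Bundesarchiv", "EKM", "Landkreis", "Kreisarchiv")

def extract_city(name: str) -> Optional[str]:
    """Extract a city from an archive name, archive-type patterns first."""
    for pattern in CITY_PATTERNS:
        match = re.search(pattern, name)
        if match:
            return match.group(1).strip()
    # Fall back to splitting on " - " and skipping organizational parts.
    for part in name.split(" - "):
        part = part.strip()
        if part and not part.startswith(ORG_PREFIXES):
            return part
    return None
```

With this ordering, `extract_city("Landesarchiv Thüringen - Staatsarchiv Altenburg")` yields `"Altenburg"` rather than the organizational prefix.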
**Performance**:
- Harvest time: 4.0 seconds
- Speed: 37.4 archives/second
- Success rate: 100% (149/149)
---
## Phase 3: City Extraction Results
**Statistics**:
- Total archives: 149
- With city names: 81/149 (54.4%)
- Without cities: 68/149 (45.6%)
**City Extraction Success** (Key Fixes):
- ✅ Staatsarchiv Altenburg → `"Altenburg"` (was: "Landesarchiv Thüringen")
- ✅ Staatsarchiv Gotha → `"Gotha"` (was: "Landesarchiv Thüringen")
- ✅ Staatsarchiv Greiz → `"Greiz"` (was: "Landesarchiv Thüringen")
- ✅ Staatsarchiv Meiningen → `"Meiningen"` (was: wrong)
- ✅ Staatsarchiv Rudolstadt → `"Rudolstadt"` (was: "Landesarchiv Thüringen")
- ✅ Hauptstaatsarchiv Weimar → `"Weimar"` (was: None)
**Archives Without Cities** (expected):
- Kreisarchiv (district archives without city in name)
- Corporate/organizational archives
- Religious archives
- Research centers
---
## Phase 4: Geocoding
**Script**: `scripts/scrapers/geocode_json_harvest.py` (created during session)
**Method**: Nominatim API with 1 req/sec rate limiting
**Results**:
- Attempted: 81 archives with city names
- Success: 79/81 (97.5%)
- Failed: 2 (complex city names like "Ilmenau, Archiv Gehren")
**Sample Geocoded Locations**:
```
Ohrdruf → 50.827008, 10.731950
Altenburg → 50.985241, 12.434099
Gotha → 50.949485, 10.701443
Weimar → 50.979330, 11.329792
Erfurt → 50.977797, 11.028736
```
**Output**: `data/isil/germany/thueringen_archives_20251119_221328_geocoded.json`
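A stdlib-only sketch of this step (the session script uses `requests`; the endpoint follows the public Nominatim search API, and the exact query string shape is an assumption):

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional, Tuple

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def build_query(city: str) -> dict:
    """Query parameters for a city lookup, pinned to the Thüringen region."""
    return {"q": f"{city}, Thüringen, Deutschland", "format": "json", "limit": "1"}

def geocode_city(city: str) -> Optional[Tuple[float, float]]:
    """Resolve a city to (lat, lon), honouring the 1 request/second policy."""
    url = NOMINATIM_URL + "?" + urllib.parse.urlencode(build_query(city))
    # Nominatim's usage policy requires an identifying User-Agent.
    request = urllib.request.Request(url, headers={"User-Agent": "glam-harvester/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    time.sleep(1.0)  # rate limit: at most 1 request per second
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Compound names such as "Ilmenau, Archiv Gehren" produce a malformed `q` parameter under this scheme, which is consistent with the two geocoding failures noted above.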
---
## Phase 5: Merge into German Dataset
**Script**: `scripts/scrapers/merge_thueringen_to_german_dataset.py`
**Deduplication Strategy**:
- Fuzzy name matching (90% similarity threshold)
- City matching bonus for disambiguation
- High confidence threshold (95%) for non-city matches
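A minimal sketch of this dual-threshold rule, using stdlib `difflib` in place of `rapidfuzz` (record shape and scoring details are simplified relative to the real merger):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive name similarity on a 0-100 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100.0

def is_duplicate(new: dict, existing: dict) -> bool:
    """Apply the 90%/95% rule: city agreement lowers the required score."""
    score = name_similarity(new["name"], existing["name"])
    same_city = bool(
        new.get("city")
        and existing.get("city")
        and new["city"].lower() == existing["city"].lower()
    )
    # Matching cities disambiguate, so 90% suffices; otherwise demand 95%.
    return score >= (90.0 if same_city else 95.0)
```

Guarding `same_city` with `.get()` also tolerates records whose city is `None`, which 45.6% of the Thüringen harvest lacks.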
**Merge Results**:
- Input: German v2 (20,846) + Thüringen (149)
- Duplicates found: 60/149 (40.3%)
- New additions: 89/149 (59.7%)
- Output: German v3 (20,935 institutions)
**Notable Duplicates** (already in dataset):
- Landesarchiv Thüringen Staatsarchiv Meiningen
- Landesarchiv Thüringen Hauptstaatsarchiv Weimar
- Stadtarchiv Erfurt
- Stadtarchiv Weimar
- Stadtarchiv Jena
- Stadtarchiv Eisenach
- (54 more...)
**New Additions** (sample):
- Staatsarchiv Altenburg, Gotha, Greiz, Rudolstadt (4 state archives)
- 15+ Kreisarchiv (district archives)
- 20+ Stadtarchiv (city archives not previously in dataset)
- 8 corporate archives (Carl Zeiss, SCHOTT, Handwerkskammer, etc.)
- 6 religious archives (Landeskirchenarchiv, Bistumsarchiv branches)
- 8 university/research archives
- 8 specialized archives (museums, research centers, memorial sites)
---
## Institution Type Distribution
| Type | Count | Percentage |
|------|-------|------------|
| ARCHIVE | 100 | 67.1% |
| OFFICIAL_INSTITUTION | 13 | 8.7% |
| CORPORATION | 8 | 5.4% |
| EDUCATION_PROVIDER | 8 | 5.4% |
| RESEARCH_CENTER | 8 | 5.4% |
| HOLY_SITES | 6 | 4.0% |
| MUSEUM | 4 | 2.7% |
| COLLECTING_SOCIETY | 1 | 0.7% |
| NGO | 1 | 0.7% |
**Classification Examples**:
- OFFICIAL_INSTITUTION: Landesarchiv Thüringen branches (state archives)
- ARCHIVE: Stadtarchiv, Kreisarchiv (municipal and district archives)
- CORPORATION: Carl Zeiss Archiv, SCHOTT Archiv, Handwerkskammer archives
- EDUCATION_PROVIDER: Universitätsarchiv (university archives)
- HOLY_SITES: Landeskirchenarchiv, Bistumsarchiv branches
- RESEARCH_CENTER: Thüringer Industriearchiv, Lippmann+Rau-Musikarchiv
---
## Output Files
### Harvest Files:
1. **Raw harvest** (with cities): `thueringen_archives_20251119_221328.json`
- 149 archives
- 81 with cities (54.4%)
- File size: 108.4 KB
2. **Geocoded harvest**: `thueringen_archives_20251119_221328_geocoded.json`
- 149 archives
- 79 with coordinates (53.0%)
- File size: 110.2 KB (estimated)
### Merged Dataset:
3. **German unified v3.0**: `german_institutions_unified_v3_20251119_232317.json`
- 20,935 total institutions
- Sources: ISIL Registry + DDB SPARQL + NRW Portal + Thüringen Portal
- File size: 39.4 MB
---
## Data Quality Assessment
**TIER_2_VERIFIED** (Web Scraping):
- All 149 Thüringen archives classified as TIER_2
- Source: Official regional archive portal
- Verification: Direct from authoritative regional website
**Confidence Scores**:
- Institution names: 0.95 (high - from official portal)
- Institution types: 0.95 (inferred from name patterns)
- Cities: 0.90 (regex extraction, verified by geocoding)
- Coordinates: 0.95 (Nominatim geocoding with 97.5% success)
**Known Limitations**:
- No ISIL codes in source portal (enrichment needed)
- 45.6% archives without cities (Kreisarchiv, corporate, research)
- 2 archives failed geocoding (complex city name formats)
- Addresses not provided (only cities)
---
## Impact on German Dataset
### Dataset Growth Trajectory:
- **German v1.0** (2025-11-19, 18:18): 20,760 institutions (ISIL + DDB)
- **German v2.0** (2025-11-19, 21:11): 20,846 institutions (+86 from NRW, +0.4%)
- **German v3.0** (2025-11-19, 23:23): 20,935 institutions (+89 from Thüringen, +0.4%)
### Progress Toward 100K Goal:
- Current: 20,935 institutions
- Goal: 100,000 German heritage institutions
- Progress: 20.9% complete
- Remaining: 79,065 institutions
### Next Regional Targets:
1. **Arcinsys consortium**: ~700 archives (4 states: Hesse, Bremen, Hamburg, Schleswig-Holstein)
2. **Baden-Württemberg**: 200+ archives
3. **Bayern**: 9+ state archives + regional portals
4. **Remaining states**: ~380 archives
**Estimated potential**: +1,280 archives from regional portals → ~22,100 total
---
## Technical Lessons Learned
### City Extraction Challenges:
**Problem 1**: JavaScript split logic was wrong
- **Issue**: "Landesarchiv Thüringen - Staatsarchiv Altenburg" split by " - " → city = "Landesarchiv Thüringen"
- **Solution**: Ignore JavaScript split, use Python regex patterns only
**Problem 2**: Regex priority not enforced
- **Issue**: Split-by-dash ran before archive name patterns
- **Solution**: Extract city from full text, ignore JavaScript-derived city field
**Problem 3**: Hauptstaatsarchiv not matched
- **Issue**: `Staatsarchiv` regex didn't match `Hauptstaatsarchiv`
- **Solution**: Added separate pattern check for `Hauptstaatsarchiv` before `Staatsarchiv`
### Successful Patterns:
```python
# Check Hauptstaatsarchiv first (more specific)
hauptstaatsarchiv_match = re.search(r'Hauptstaatsarchiv\s+(.+?)$', fulltext)
# Then check Staatsarchiv (more general)
staatsarchiv_match = re.search(r'Staatsarchiv\s+(.+?)$', fulltext)
# Stadtarchiv pattern (most common)
stadtarchiv_match = re.search(r'Stadtarchiv\s+(.+?)(?:\s*$)', fulltext)
```
**Key Insight**: Pattern order matters - check specific patterns before general ones!
---
## Scripts Created This Session
1. ✅ **harvest_thueringen_archives.py** (176 lines)
- Playwright-based harvester
- Regex city extraction
- Institution type classification
- Performance: 37.4 archives/sec
2. ✅ **geocode_json_harvest.py** (123 lines)
- Nominatim geocoder
- Rate limiting (1 req/sec)
- Batch processing for JSON harvests
- Success rate: 97.5%
3. ✅ **merge_thueringen_to_german_dataset.py** (251 lines)
- Fuzzy deduplication (90% threshold)
- Format conversion (harvest → unified)
- Merge statistics tracking
- Deduplication rate: 40.3%
---
## Next Steps
### Immediate (Session Continuation):
1. ⏭️ **Harvest Arcinsys consortium** (~700 archives, 4 states)
- Portal: https://arcinsys.de or state-specific portals
- Expected: +500 new additions after dedup (70% new, 30% duplicates)
2. ⏭️ **Harvest Baden-Württemberg** (200+ archives)
- Portal: https://www.landesarchiv-bw.de or regional aggregator
- Expected: +150 new additions
3. ⏭️ **Harvest Bayern state archives** (9+ archives)
- Portal: https://www.gda.bayern.de
- Expected: +5 new additions (state-level archives)
### Long-term (Phase 1 Goal):
- Target: 100,000 German heritage institutions
- Current: 20,935 (20.9% complete)
- Remaining regional harvests: ~1,000 archives from state portals
- Need to expand beyond archives: libraries, museums, galleries
---
## Reproducibility
To reproduce this harvest:
```bash
# Step 1: Harvest Thüringen archives
python scripts/scrapers/harvest_thueringen_archives.py
# Output: data/isil/germany/thueringen_archives_YYYYMMDD_HHMMSS.json
# Step 2: Geocode cities
python scripts/scrapers/geocode_json_harvest.py data/isil/germany/thueringen_archives_*.json
# Output: data/isil/germany/thueringen_archives_*_geocoded.json
# Step 3: Merge with German dataset
python scripts/scrapers/merge_thueringen_to_german_dataset.py
# Output: data/isil/germany/german_institutions_unified_v3_YYYYMMDD_HHMMSS.json
```
**Prerequisites**:
- Python 3.10+
- playwright (`pip install playwright && playwright install chromium`)
- rapidfuzz (`pip install rapidfuzz`)
- requests (`pip install requests`)
---
## Session Summary
**Duration**: ~30 minutes (22:00 - 22:30 UTC)
**Agent**: OpenCode AI
**Result**: ✅ Complete success
**Workflow**:
1. ✅ Discovered Thüringen portal structure
2. ✅ Created harvester script (with city extraction debugging)
3. ✅ Fixed city extraction bugs (3 iterations)
4. ✅ Geocoded 81 cities (97.5% success)
5. ✅ Created merger script
6. ✅ Merged 89 new archives into German dataset v3.0
**Blockers Resolved**:
- City extraction from complex organizational prefixes
- JavaScript split logic vs. regex patterns
- Hauptstaatsarchiv vs. Staatsarchiv pattern matching
- Null city handling in fuzzy matching
**Outcome**: German dataset now includes comprehensive Thüringen archives coverage (149 archives), bringing total to 20,935 institutions (+0.4% growth).
---
**End of Report** ✅

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,650 @@
#!/usr/bin/env python3
"""
Thüringen Archives Comprehensive Harvester
Extracts FULL metadata from 149 archive detail pages on archive-in-thueringen.de
Portal: https://www.archive-in-thueringen.de/de/archiv/list
Detail pages: https://www.archive-in-thueringen.de/de/archiv/view/id/{ID}
Strategy:
1. Get list of 149 archive URLs from list page
2. Visit each detail page and extract ALL available metadata
3. Extract: addresses, emails, phones, directors, collection sizes, opening hours, histories
Author: OpenCode + AI Agent
Date: 2025-11-20
Version: 2.0 (Comprehensive)
"""
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout
import json
import time
from pathlib import Path
from datetime import datetime, timezone
from typing import List, Dict, Optional
import re
# Configuration
BASE_URL = "https://www.archive-in-thueringen.de"
ARCHIVE_LIST_URL = f"{BASE_URL}/de/archiv/list"
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# Rate limiting
REQUEST_DELAY = 1.0 # seconds between requests (be respectful)
# Map German archive types to GLAM taxonomy
ARCHIVE_TYPE_MAPPING = {
"Landesarchiv": "OFFICIAL_INSTITUTION",
"Staatsarchiv": "OFFICIAL_INSTITUTION",
"Hauptstaatsarchiv": "OFFICIAL_INSTITUTION",
"Stadtarchiv": "ARCHIVE",
"Gemeindearchiv": "ARCHIVE",
"Kreisarchiv": "ARCHIVE",
"Stadt- und Kreisarchiv": "ARCHIVE",
"Bistumsarchiv": "HOLY_SITES",
"Kirchenkreisarchiv": "HOLY_SITES",
"Landeskirchenarchiv": "HOLY_SITES",
"Archiv des Ev.": "HOLY_SITES",
"Archiv des Bischöflichen": "HOLY_SITES",
"Pfarrhausarchiv": "HOLY_SITES",
"Universitätsarchiv": "EDUCATION_PROVIDER",
"Hochschularchiv": "EDUCATION_PROVIDER",
"Hochschule": "EDUCATION_PROVIDER",
"Universität": "EDUCATION_PROVIDER",
"Fachhochschule": "EDUCATION_PROVIDER",
"Fachschule": "EDUCATION_PROVIDER",
"Carl Zeiss": "CORPORATION",
"SCHOTT": "CORPORATION",
"Wirtschaftsarchiv": "CORPORATION",
"Handwerkskammer": "CORPORATION",
"Handelskammer": "CORPORATION",
"Industrie- und Handelskammer": "CORPORATION",
"Lederfabrik": "CORPORATION",
"Verlagsgesellschaft": "CORPORATION",
"Bundesarchiv": "OFFICIAL_INSTITUTION",
"Stasi-Unterlagen": "OFFICIAL_INSTITUTION",
"Thüringer Landtages": "OFFICIAL_INSTITUTION",
"Gedenkstätte": "MUSEUM",
"Museum": "MUSEUM",
"Goethe- und Schiller": "RESEARCH_CENTER",
"Akademie": "RESEARCH_CENTER",
"Thüringer Archiv für Zeitgeschichte": "RESEARCH_CENTER",
"Thüringer Industriearchiv": "RESEARCH_CENTER",
"Thüringer Bauteil-Archiv": "RESEARCH_CENTER",
"Thüringer Talsperren": "RESEARCH_CENTER",
"Landesamt": "OFFICIAL_INSTITUTION",
"Archiv des Vogtländischen": "COLLECTING_SOCIETY",
"Archiv des Arbeitskreises": "NGO",
"Grenzlandmuseum": "MUSEUM",
"Archiv der VG": "ARCHIVE",
"Archiv der Verwaltungsgemeinschaft": "ARCHIVE",
"Archiv der Landgemeinde": "ARCHIVE",
"Archiv der Sammlung": "RESEARCH_CENTER",
"Musikarchiv": "RESEARCH_CENTER",
}
def infer_institution_type(name: str) -> str:
"""Infer institution type from German archive name."""
for keyword, inst_type in ARCHIVE_TYPE_MAPPING.items():
if keyword in name:
return inst_type
return "ARCHIVE"
def parse_address_lines(lines: List[str]) -> Dict[str, str]:
"""
Parse German address lines into structured format.
Example input:
[
"Landesarchiv Thüringen - Staatsarchiv Altenburg",
"Schloss 7",
"04600 Altenburg"
]
Returns: {
"organization": "Landesarchiv Thüringen - Staatsarchiv Altenburg",
"street": "Schloss 7",
"postal_code": "04600",
"city": "Altenburg"
}
"""
result = {}
for i, line in enumerate(lines):
line = line.strip()
if not line:
continue
# First line is usually organization
if i == 0:
result["organization"] = line
# Check for postal code pattern (5 digits + city)
elif re.match(r'^\d{5}\s+\S+', line):
parts = line.split(None, 1)
result["postal_code"] = parts[0]
if len(parts) > 1:
result["city"] = parts[1]
# Check for PO Box
elif line.startswith("PF ") or line.startswith("Postfach"):
result["po_box"] = line
# Otherwise assume street address
else:
if "street" not in result:
result["street"] = line
else:
result["street"] += f", {line}"
return result
def extract_detail_page_metadata(page) -> Dict:
"""
Extract comprehensive metadata from archive detail page.
Returns dict with all available fields from the detail page.
"""
metadata = {}
try:
# Extract all structured data using JavaScript
extracted = page.evaluate("""
() => {
const data = {};
// Extract archive name from breadcrumb h1 (not site title)
const h1Elements = document.querySelectorAll('h1');
if (h1Elements.length >= 2) {
data.name = h1Elements[1].textContent.trim();
} else if (h1Elements.length === 1) {
// Fallback: check if it's not the site title
const h1Text = h1Elements[0].textContent.trim();
if (h1Text !== 'Archivportal Thüringen') {
data.name = h1Text;
}
}
// Helper to extract section by heading
function extractSection(headingText) {
const headings = Array.from(document.querySelectorAll('h4, h3, strong'));
const heading = headings.find(h => h.textContent.trim() === headingText);
if (!heading) return null;
// Get parent container
let container = heading.closest('div');
if (!container) return null;
// Extract text content after heading
const content = [];
let sibling = heading.nextElementSibling;
while (sibling && !sibling.matches('h3, h4')) {
const text = sibling.textContent.trim();
if (text) content.push(text);
sibling = sibling.nextElementSibling;
}
return content.join('\\n').trim();
}
// Helper to extract list items from section
function extractListItems(headingText) {
const headings = Array.from(document.querySelectorAll('h4, h3, strong'));
const heading = headings.find(h => h.textContent.trim() === headingText);
if (!heading) return [];
let container = heading.closest('div');
if (!container) return [];
const items = Array.from(container.querySelectorAll('li'))
.map(el => el.textContent.trim())
.filter(text => text.length > 0);
return items;
}
// Extract Postanschrift (postal address)
data.postal_address = extractListItems('Postanschrift');
// Extract Dienstanschrift (physical address)
data.physical_address = extractListItems('Dienstanschrift');
// Extract Besucheranschrift (visitor address) if different
data.visitor_address = extractListItems('Besucheranschrift');
// Extract email
const emailLinks = Array.from(document.querySelectorAll('a[href^="mailto:"]'));
if (emailLinks.length > 0) {
data.email = emailLinks[0].href.replace('mailto:', '').trim();
}
// Extract phone
const phoneLinks = Array.from(document.querySelectorAll('a[href^="tel:"]'));
if (phoneLinks.length > 0) {
data.phone = phoneLinks[0].textContent.trim();
// Also get raw phone number from href
const phoneHref = phoneLinks[0].href.replace('tel:', '').trim();
data.phone_raw = phoneHref;
}
// Extract fax (usually in text, not a link)
const faxPattern = /Fax[:\\s]+([\\d\\s\\/\\-\\(\\)]+)/i;
const bodyText = document.body.textContent;
const faxMatch = bodyText.match(faxPattern);
if (faxMatch) {
data.fax = faxMatch[1].trim();
}
// Extract website
const websiteLinks = Array.from(document.querySelectorAll('a[href^="http"]'))
.filter(a => !a.href.includes('archive-in-thueringen.de'));
if (websiteLinks.length > 0) {
data.website = websiteLinks[0].href;
}
// Extract Öffnungszeiten (opening hours) - special extraction
const openingHoursH4 = Array.from(document.querySelectorAll('h4'))
.find(h => h.textContent.trim() === 'Öffnungszeiten');
if (openingHoursH4) {
const parent = openingHoursH4.parentElement;
if (parent) {
const texts = [];
let node = openingHoursH4.nextSibling;
while (node && node !== parent.querySelector('h4:not(:first-of-type)')) {
if (node.nodeType === Node.TEXT_NODE && node.textContent.trim()) {
texts.push(node.textContent.trim());
} else if (node.nodeType === Node.ELEMENT_NODE && node.tagName !== 'H4') {
const text = node.textContent.trim();
if (text) texts.push(text);
}
node = node.nextSibling;
}
data.opening_hours = texts.join(' ');
}
}
// Extract Archivleiter/in (director) - extract strong tag content
const directorH4 = Array.from(document.querySelectorAll('h4'))
.find(h => h.textContent.trim() === 'Archivleiter/in');
if (directorH4) {
const parent = directorH4.parentElement;
if (parent) {
const strongElem = parent.querySelector('strong');
if (strongElem) {
data.director = strongElem.textContent.trim();
}
}
}
// Extract Bestand (collection size) - in listitem before h4
const bestandH4 = Array.from(document.querySelectorAll('h4'))
.find(h => h.textContent.trim() === 'Bestand');
if (bestandH4) {
const listitem = bestandH4.closest('li');
if (listitem) {
const text = listitem.textContent.replace('Bestand', '').trim();
data.collection_size = text;
}
}
// Extract Laufzeit (temporal coverage) - in listitem before h4
const laufzeitH4 = Array.from(document.querySelectorAll('h4'))
.find(h => h.textContent.trim() === 'Laufzeit');
if (laufzeitH4) {
const listitem = laufzeitH4.closest('li');
if (listitem) {
const text = listitem.textContent.replace('Laufzeit', '').trim();
data.temporal_coverage = text;
}
}
// Extract Archivgeschichte (archive history) - all paragraphs after h4
const archivgeschichteH4 = Array.from(document.querySelectorAll('h4'))
.find(h => h.textContent.trim() === 'Archivgeschichte');
if (archivgeschichteH4) {
const parent = archivgeschichteH4.parentElement;
if (parent) {
const paragraphs = Array.from(parent.querySelectorAll('p'))
.map(p => p.textContent.trim())
.filter(text => text.length > 0);
data.archive_history = paragraphs.join('\\n\\n');
}
}
// Extract Bestände (collection descriptions)
data.collections = extractSection('Bestände');
// Extract Tektonik (classification system)
data.classification = extractSection('Tektonik');
// Extract Recherche (research information)
data.research_info = extractSection('Recherche');
// Extract Benutzung (access/usage information)
data.usage_info = extractSection('Benutzung');
return data;
}
""")
# Merge JavaScript extraction with metadata
metadata.update(extracted)
# Parse addresses
if metadata.get('postal_address'):
metadata['postal_address_parsed'] = parse_address_lines(metadata['postal_address'])
if metadata.get('physical_address'):
metadata['physical_address_parsed'] = parse_address_lines(metadata['physical_address'])
if metadata.get('visitor_address'):
metadata['visitor_address_parsed'] = parse_address_lines(metadata['visitor_address'])
except Exception as e:
print(f" ⚠️ Error extracting metadata: {e}")
return metadata
def harvest_archive_list(page) -> List[Dict]:
"""Get list of all archive URLs from main list page."""
print(f"📄 Loading archive list page...")
page.goto(ARCHIVE_LIST_URL, wait_until='networkidle', timeout=30000)
# Accept cookies if present
try:
cookie_button = page.locator('button:has-text("Akzeptieren"), button:has-text("Accept")')
if cookie_button.is_visible(timeout=2000):
cookie_button.click()
print("✅ Accepted cookies")
time.sleep(1)
except Exception:
pass
print("📋 Extracting archive URLs...")
# Extract archive URLs
result = page.evaluate("""
() => {
const archiveLinks = document.querySelectorAll('ul li a[href*="/de/archiv/view/id/"]');
const uniqueArchives = new Map();
archiveLinks.forEach(link => {
const url = link.href;
const idMatch = url.match(/\\/id\\/(\\d+)/);
if (!idMatch) return;
const archiveId = idMatch[1];
if (uniqueArchives.has(archiveId)) return;
uniqueArchives.set(archiveId, {
id: archiveId,
url: url
});
});
return Array.from(uniqueArchives.values());
}
""")
print(f"✅ Found {len(result)} unique archives")
return result
def harvest_thueringen_archives_comprehensive() -> List[Dict]:
"""
Harvest COMPLETE metadata from all 149 Thüringen archives.
This comprehensive version visits each detail page to extract:
- Addresses (postal, physical, visitor)
- Contact info (email, phone, fax, website)
- Opening hours
- Director names
- Collection sizes
- Temporal coverage
- Archive histories
- Collection descriptions
"""
print(f"🚀 Thüringen Archives Comprehensive Harvester v2.0")
print(f"📍 Portal: {ARCHIVE_LIST_URL}")
print(f"⏱️ Starting harvest at {datetime.now(timezone.utc).isoformat()}")
print(f"⏳ Expected time: ~150 seconds (1 sec/page × 149 pages)")
print()
archives = []
with sync_playwright() as p:
print("🌐 Launching browser...")
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
)
page = context.new_page()
try:
# Step 1: Get list of all archive URLs
archive_list = harvest_archive_list(page)
total = len(archive_list)
# Step 2: Visit each detail page
print(f"\n📚 Processing {total} archive detail pages...")
print(f"⏱️ Rate limit: {REQUEST_DELAY}s between requests")
print()
start_time = time.time()
for idx, archive_info in enumerate(archive_list, 1):
archive_id = archive_info['id']
archive_url = archive_info['url']
print(f"[{idx}/{total}] Processing ID {archive_id}...", end=' ')
try:
# Visit detail page
page.goto(archive_url, wait_until='domcontentloaded', timeout=15000)
time.sleep(0.5) # Let JavaScript render
# Extract comprehensive metadata
metadata = extract_detail_page_metadata(page)
# Determine city from address data
city = None
if metadata.get('physical_address_parsed', {}).get('city'):
city = metadata['physical_address_parsed']['city']
elif metadata.get('postal_address_parsed', {}).get('city'):
city = metadata['postal_address_parsed']['city']
# Infer institution type
inst_type = infer_institution_type(metadata.get('name', ''))
# Build structured record
archive_data = {
"id": f"thueringen-{archive_id}",
"name": metadata.get('name', ''),
"institution_type": inst_type,
"city": city,
"region": "Thüringen",
"country": "DE",
"url": archive_url,
"source_portal": "archive-in-thueringen.de",
# Contact information
"email": metadata.get('email'),
"phone": metadata.get('phone'),
"fax": metadata.get('fax'),
"website": metadata.get('website'),
# Addresses
"postal_address": metadata.get('postal_address_parsed'),
"physical_address": metadata.get('physical_address_parsed'),
"visitor_address": metadata.get('visitor_address_parsed'),
# Archive details
"opening_hours": metadata.get('opening_hours'),
"director": metadata.get('director'),
"collection_size": metadata.get('collection_size'),
"temporal_coverage": metadata.get('temporal_coverage'),
"archive_history": metadata.get('archive_history'),
"collections": metadata.get('collections'),
"classification": metadata.get('classification'),
"research_info": metadata.get('research_info'),
"usage_info": metadata.get('usage_info'),
# Provenance
"provenance": {
"data_source": "WEB_SCRAPING",
"data_tier": "TIER_2_VERIFIED",
"extraction_date": datetime.now(timezone.utc).isoformat(),
"extraction_method": "Playwright comprehensive detail page extraction",
"source_url": archive_url,
"confidence_score": 0.95
}
}
archives.append(archive_data)
print(f"{metadata.get('name', 'Unknown')[:40]}")
except PlaywrightTimeout:
print(f"⏱️ Timeout")
archives.append({
"id": f"thueringen-{archive_id}",
"url": archive_url,
"error": "timeout",
"provenance": {
"data_source": "WEB_SCRAPING",
"extraction_date": datetime.now(timezone.utc).isoformat(),
"confidence_score": 0.0
}
})
except Exception as e:
print(f"❌ Error: {e}")
archives.append({
"id": f"thueringen-{archive_id}",
"url": archive_url,
"error": str(e),
"provenance": {
"data_source": "WEB_SCRAPING",
"extraction_date": datetime.now(timezone.utc).isoformat(),
"confidence_score": 0.0
}
})
# Rate limiting
if idx < total:
time.sleep(REQUEST_DELAY)
# Progress update every 25 archives
if idx % 25 == 0:
elapsed = time.time() - start_time
rate = idx / elapsed
remaining = (total - idx) / rate
print(f" 📊 Progress: {idx}/{total} ({idx/total*100:.1f}%) | " +
f"Speed: {rate:.1f}/sec | ETA: {remaining/60:.1f} min")
# Final statistics
elapsed = time.time() - start_time
successful = sum(1 for a in archives if 'error' not in a)
print(f"\n📊 Harvest Statistics:")
print(f" Total archives: {len(archives)}")
print(f" Successful: {successful}")
print(f" Failed: {len(archives) - successful}")
print(f" Time elapsed: {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
print(f" Speed: {len(archives)/elapsed:.1f} archives/second")
# Count by type
type_counts = {}
for archive in archives:
if 'error' not in archive:
inst_type = archive.get('institution_type', 'UNKNOWN')
type_counts[inst_type] = type_counts.get(inst_type, 0) + 1
print(f"\n By institution type:")
for inst_type, count in sorted(type_counts.items(), key=lambda x: -x[1]):
print(f" - {inst_type}: {count}")
# Count metadata completeness
with_email = sum(1 for a in archives if a.get('email'))
with_phone = sum(1 for a in archives if a.get('phone'))
with_address = sum(1 for a in archives if a.get('physical_address'))
with_director = sum(1 for a in archives if a.get('director'))
with_collection = sum(1 for a in archives if a.get('collection_size'))
with_history = sum(1 for a in archives if a.get('archive_history'))
print(f"\n Metadata completeness:")
print(f" - Email: {with_email}/{successful} ({with_email/successful*100:.1f}%)")
print(f" - Phone: {with_phone}/{successful} ({with_phone/successful*100:.1f}%)")
print(f" - Physical address: {with_address}/{successful} ({with_address/successful*100:.1f}%)")
print(f" - Director: {with_director}/{successful} ({with_director/successful*100:.1f}%)")
print(f" - Collection size: {with_collection}/{successful} ({with_collection/successful*100:.1f}%)")
print(f" - Archive history: {with_history}/{successful} ({with_history/successful*100:.1f}%)")
except Exception as e:
print(f"❌ Critical error during harvest: {e}")
import traceback
traceback.print_exc()
finally:
browser.close()
return archives
def save_results(archives: List[Dict]) -> Path:
"""Save comprehensive harvest to JSON file."""
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
output_file = OUTPUT_DIR / f"thueringen_archives_comprehensive_{timestamp}.json"
output_data = {
"metadata": {
"source": "archive-in-thueringen.de",
"harvest_date": datetime.now(timezone.utc).isoformat(),
"total_archives": len(archives),
"successful_extractions": sum(1 for a in archives if 'error' not in a),
"region": "Thüringen",
"country": "DE",
"harvester_version": "2.0 (comprehensive)",
"extraction_level": "comprehensive_detail_pages"
},
"archives": archives
}
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(output_data, f, indent=2, ensure_ascii=False)
print(f"\n💾 Results saved to: {output_file}")
print(f" File size: {output_file.stat().st_size / 1024:.1f} KB")
return output_file
def main():
"""Main execution function."""
start_time = time.time()
# Harvest archives with comprehensive metadata
archives = harvest_thueringen_archives_comprehensive()
if archives:
# Save results
output_file = save_results(archives)
elapsed = time.time() - start_time
print(f"\n✅ Comprehensive harvest completed in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
print(f"\n🎯 Next Steps:")
print(f" 1. Validate extracted data quality")
print(f" 2. Merge with German dataset v3: python scripts/scrapers/merge_thueringen_to_german_dataset.py {output_file}")
print(f" 3. Expected result: Replace 89 basic entries with 149 comprehensive entries")
else:
print("\n❌ No archives harvested!")
return 1
return 0
if __name__ == "__main__":
raise SystemExit(main())