glam/data/isil/SESSION_SUMMARY_20251119_HARVEST_CONTINUATION.md
2025-11-19 23:25:22 +01:00

Session Summary: Global ISIL Harvest Continuation

Date: November 19, 2025, 13:30-14:30 CET
Duration: 1 hour
Agent: OpenCode AI


What We Accomplished

1. Successfully Completed German ISIL Harvest

  • Records: 16,979 German heritage institutions
  • Method: SRU 1.1 protocol from Deutsche Nationalbibliothek
  • Performance: ~3 minutes total, 170 batch requests, 100% success rate
  • Data Quality: Excellent (87% with geocoded addresses, 79% with websites)

2. Verified Existing Swiss ISIL Data

  • Records: 2,379 Swiss + Liechtenstein institutions (already harvested Nov 18)
  • Method: Web scraping from Swiss National Library ISIL directory
  • Duration: 33 minutes (previous session)
  • Data Quality: Very good (80.8% with ISIL codes, 49.1% with phone numbers)

3. Created Comprehensive Progress Tracking

  • Master Plan: MASTER_HARVEST_PLAN.md - Strategy for 36 countries
  • Progress Summary: HARVEST_PROGRESS_SUMMARY.md - Current status (26.2% complete)
  • Country Reports: Detailed harvest documentation per country

Key Statistics

Overall Progress

  • Completed: 7 countries, 25,436 records (26.2%)
  • In Progress: 2 countries, ~5,000 records (5.2%)
  • Planned: 27 countries, ~66,564 records (68.6%)
  • Total Target: 36 countries, ~97,000 institutions

Recent Harvests

  1. Germany (Nov 19): 16,979 institutions - Tier 1 quality
  2. Switzerland (Nov 18): 2,379 institutions - Tier 1 quality

Data Volumes

  • Germany: 37 MB JSON, 24 MB JSONL
  • Switzerland: 1.3 MB JSON, CSV available
  • Total: ~41 MB of structured ISIL data

Files Created This Session

Documentation

  • /data/isil/HARVEST_PROGRESS_SUMMARY.md - Comprehensive progress report
  • /data/isil/germany/HARVEST_REPORT.md - German harvest details
  • /data/isil/germany/QUICK_START.md - Usage examples
  • /data/isil/germany/README.md - Executive summary

Data Files

  • /data/isil/germany/german_isil_complete_20251119_134939.json (37 MB)
  • /data/isil/germany/german_isil_complete_20251119_134939.jsonl (24 MB)
  • /data/isil/germany/german_isil_stats_20251119_134941.json (7.6 KB)

Scripts

  • /scripts/scrapers/harvest_german_isil_sru.py - Production harvester
  • /scripts/scrapers/harvest_swiss_isil.py - Swiss scraper template

What We Discovered

Swiss Data Already Harvested

  • We were about to start harvesting Switzerland when we discovered it was already complete!
  • Previous session (Nov 18) had scraped all 2,379 institutions in 33 minutes
  • Saved 30+ minutes by checking existing data first

German ISIL Registry Structure

  • Very well structured - Uses PICA+ XML format
  • Rich metadata - Includes geocoding, contact info, parent organizations
  • Fast API - SRU protocol allows batch fetching (100 records/request)
  • Excellent documentation - Clear field mappings in PICA format
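
As a rough illustration of why the PICA+ XML structure is easy to work with, the sketch below flattens one record into a dictionary keyed by field tag and subfield code. The namespace URI, the tags (`008H`, `029A`), the subfield codes, and the sample values are assumptions made for illustration; check them against the registry's own field documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed PICA XML namespace; verify against an actual response.
PICA_NS = "info:srw/schema/5/picaXML-v1.0"

# Illustrative sample record (tags, codes and values are hypothetical).
SAMPLE = f"""
<record xmlns="{PICA_NS}">
  <datafield tag="008H"><subfield code="e">DE-1</subfield></datafield>
  <datafield tag="029A"><subfield code="a">Example Library</subfield></datafield>
</record>
""".strip()


def pica_to_dict(xml_text):
    """Flatten a PICA+ record into {'<tag>$<code>': [values, ...]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for datafield in root.iter(f"{{{PICA_NS}}}datafield"):
        tag = datafield.get("tag")
        for sub in datafield.iter(f"{{{PICA_NS}}}subfield"):
            key = f"{tag}${sub.get('code')}"
            fields.setdefault(key, []).append(sub.text or "")
    return fields


record = pica_to_dict(SAMPLE)
print(record)
```

Keeping every value in a list preserves repeated subfields, which library formats commonly allow.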

Switzerland ISIL Registry Characteristics

  • Web-based only - No API, requires scraping
  • Detailed pages - Rich institution descriptions
  • Multi-lingual - German, French, Italian, English
  • Good coverage - Includes archives, libraries, museums, documentation centers

Next Steps (Priority Order)

Immediate (This Week)

  1. Czech Republic - Implement Z39.50 harvester for ~3,000 institutions
  2. Denmark - Investigate registry access, harvest ~900 institutions
  3. Fix Swiss ISIL Extraction - Extract ISIL codes from URLs (currently not captured)
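
One possible shape for the Swiss ISIL fix is a regular expression applied to each institution URL. This is only a sketch: the URL layouts below are hypothetical, and the pattern assumes a two-letter country prefix, a hyphen, and up to 11 letters, digits, or hyphens, which is a simplification of the full ISIL syntax.

```python
import re
from urllib.parse import unquote

# Simplified ISIL pattern (assumption): two-letter prefix, hyphen, 1-11 chars.
ISIL_RE = re.compile(
    r"(?<![A-Za-z0-9])([A-Z]{2}-[A-Za-z0-9-]{1,11})(?![A-Za-z0-9-])"
)


def isil_from_url(url):
    """Return the first ISIL-like token found in a URL, or None."""
    match = ISIL_RE.search(unquote(url))
    return match.group(1) if match else None


# Hypothetical URL layouts, for illustration only.
print(isil_from_url("https://example.org/isil/CH-000001-5"))
print(isil_from_url("https://example.org/institution?isil=LI-000123-4&lang=de"))
print(isil_from_url("https://example.org/about"))
```

Returning `None` rather than raising keeps the extraction usable inside a bulk pass, where records without a code in the URL can be counted and reviewed separately.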

Short-term (Weeks 2-3)

  1. France - SUDOC portal harvest (~5,000 institutions)
  2. Italy - ICCU/SBN API integration (~8,000 institutions)
  3. Austria - Complete full scrape (~3,000 institutions, currently 10 samples)

Medium-term (Week 4)

  1. Data Quality - Geocode Swiss addresses, validate German data
  2. Wikidata Enrichment - Cross-reference all institutions with Wikidata
  3. LinkML Conversion - Transform all data to GLAM project schema

Long-term (Weeks 5-16)

  1. Phase 2: Southern Europe (Spain, Portugal, Greece, Croatia, Serbia, Slovenia)
  2. Phase 3: Eastern Europe (Romania, Hungary, Slovakia, Ukraine, Baltics)
  3. Phase 4: Global expansion (Australia, New Zealand, South Korea, South Africa)

Technical Insights

SRU Protocol (Germany) - Best Practice

Key advantages:

  • Batch fetching: 100 records per request
  • Standard protocol: Library industry standard
  • XML parsing: Structured, predictable format
  • Error handling: Built-in diagnostics
  • Performance: ~94 records/second
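
The batch arithmetic behind the German harvest can be sketched as follows. The parameter names (`version`, `operation`, `query`, `startRecord`, `maximumRecords`) are standard SRU 1.1 request parameters; the base URL and query string are placeholders, not the actual values used by `harvest_german_isil_sru.py`.

```python
from math import ceil
from urllib.parse import urlencode

# Hypothetical endpoint and query; substitute the real registry values.
BASE_URL = "https://sru.example.org/isil"
QUERY = "isil=DE-*"


def sru_batch_urls(total_records, batch_size=100):
    """Yield one searchRetrieve URL per batch (SRU startRecord is 1-based)."""
    for batch in range(ceil(total_records / batch_size)):
        params = {
            "version": "1.1",
            "operation": "searchRetrieve",
            "query": QUERY,
            "startRecord": batch * batch_size + 1,
            "maximumRecords": batch_size,
        }
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(sru_batch_urls(16_979))
print(len(urls))   # number of batch requests
print(urls[-1])    # final, partially filled batch
```

With 16,979 records and 100 records per request, this yields the 170 batch requests reported for the German harvest.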

Web Scraping (Switzerland) - Reliable but Slower

Considerations:

  • Rate limiting: 2 seconds per request
  • Pagination: 96 pages @ 25 records/page
  • Detail pages: Individual fetches per institution
  • Performance: ~1.2 records/second
  • Politeness: Essential for long-term access
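
The rate limiting and pagination described above could be implemented along these lines. This is a minimal sketch, not the code in `harvest_swiss_isil.py`; the class and function names are made up for illustration, and the clock and sleep functions are injectable so the limiter can be tested without real waiting.

```python
import time
from math import ceil


class PoliteLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def page_count(total_records, per_page=25):
    """Number of listing pages needed to cover all records."""
    return ceil(total_records / per_page)


print(page_count(2_379))
```

A scraper would call `limiter.wait()` immediately before each listing-page or detail-page request, so the delay applies uniformly to both.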

Data Quality Hierarchy

  1. Tier 1 (Authoritative): Official registries (Germany, Switzerland)
  2. Tier 2 (Verified): Institutional websites (crawl4ai)
  3. Tier 3 (Crowd-sourced): Wikidata, OSM
  4. Tier 4 (Inferred): NLP extraction from conversations
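
One way to make the hierarchy actionable when the same field is available from several sources is to encode the tiers as an ordered enum and prefer the lowest tier number. The names below are illustrative and are not taken from the GLAM schema.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTHORITATIVE = 1   # official registries
    VERIFIED = 2        # institutional websites
    CROWD_SOURCED = 3   # Wikidata, OSM
    INFERRED = 4        # NLP extraction


def best_value(candidates):
    """Pick the non-empty value from the most authoritative tier.

    candidates is a list of (Tier, value) pairs; returns None if all are empty.
    """
    usable = [(tier, value) for tier, value in candidates if value]
    if not usable:
        return None
    return min(usable, key=lambda pair: pair[0])[1]


print(best_value([
    (Tier.CROWD_SOURCED, "https://wiki.example.org/lib"),
    (Tier.AUTHORITATIVE, "https://registry.example.org/lib"),
]))
```

Empty values are skipped, so a lower-quality source can still fill a field the registry leaves blank.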

Lessons Learned

Check Before Harvesting

  • Always verify if data already exists before starting a new harvest
  • We almost re-scraped Switzerland unnecessarily
  • Saved 30+ minutes by checking /data/isil/switzerland/ first (the original scrape took 33 minutes)

SRU Protocol is Ideal for Libraries

  • Deutsche Nationalbibliothek provides excellent API access
  • Standard protocol = reusable code for other countries
  • Czech Republic also uses Z39.50/ALEPH (similar protocol family)

Documentation is Critical

  • Creating harvest reports during/after harvest saves time later
  • Quick-start guides help future users understand the data
  • Statistics files provide instant insights without parsing JSON

Batch Checkpoints for Long Harvests

  • The Swiss scraper saved a batch file every 50 institutions
  • This allowed the harvest to resume after interruptions
  • The German harvest finished too quickly (3 minutes) to need checkpoints
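
The checkpoint pattern can be sketched as below: write one file per batch of 50 records and skip batches that already exist on restart. The file naming scheme and function names are assumptions for illustration; the Swiss scraper's actual layout may differ.

```python
import json
from pathlib import Path

BATCH_SIZE = 50


def save_batch(out_dir, batch_no, records):
    """Write a batch via a temporary file so a crash never leaves a partial batch."""
    path = Path(out_dir) / f"batch_{batch_no:04d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)


def completed_batches(out_dir):
    """Return the set of batch numbers already saved in out_dir."""
    return {int(p.stem.split("_")[1]) for p in Path(out_dir).glob("batch_*.json")}


def harvest(ids, fetch, out_dir):
    """Fetch records in batches, skipping any batch already on disk."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    done = completed_batches(out_dir)
    for start in range(0, len(ids), BATCH_SIZE):
        batch_no = start // BATCH_SIZE
        if batch_no in done:
            continue  # resume: this batch was saved by an earlier run
        records = [fetch(i) for i in ids[start:start + BATCH_SIZE]]
        save_batch(out_dir, batch_no, records)
```

Here `fetch` stands in for whatever function retrieves a single institution; an interrupted run loses at most one batch of work.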

Questions Addressed

Q: What did we do so far?

A: Harvested 25,436 institutions from 7 countries (26.2% of global target). Most recently completed Germany (16,979 records) and verified Switzerland (2,379 records).

Q: We already fetched Swiss data, right?

A: Yes! Swiss ISIL data was harvested on Nov 18 (2,379 records in 33 minutes). We discovered this before starting an unnecessary re-scrape.

Q: What's next?

A: Continue Phase 1 with Czech Republic (~3,000 records via Z39.50), Denmark (~900 records), France (~5,000 records), and Italy (~8,000 records).


Data Quality Summary

Germany (DE) - Tier 1

  • 16,979 institutions
  • 87% geocoded addresses
  • 79% with websites
  • 79% with phone numbers
  • 38% with email addresses
  • Full PICA+ metadata

Switzerland (CH + LI) - Tier 1

  • 2,379 institutions
  • 80.8% with ISIL codes
  • 49.1% with phone numbers
  • 41.4% with email addresses
  • 39.3% with websites
  • ⚠️ Only 4.9% with physical addresses (needs geocoding)
  • ⚠️ ISIL codes not yet extracted from URLs

Performance Metrics

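| Country | Records | Time | Rate | Method |
|---|---|---|---|---|
| Germany | 16,979 | 3 min | 94 rec/s | SRU API |
| Switzerland | 2,379 | 33 min | 1.2 rec/s | Web scraping |
| Average | 9,679 | 18 min | 47.6 rec/s | Mixed |

Note: the Average rate is the unweighted mean of the two per-country rates. The combined throughput (19,358 records in 36 minutes) is about 9 rec/s.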

Estimated Time to 97,000 Records

  • At SRU speed (94 rec/s): ~13 minutes for remaining 71,564 records
  • At scraping speed (1.2 rec/s): ~16.5 hours
  • Realistic estimate (mixed methods): 40-60 hours of harvest time
  • Calendar time (with development): 4 months (16 weeks)
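
The pure-throughput estimates can be reproduced with a few lines of arithmetic. This sketch covers transfer time only, using the rates and record counts reported in this summary; it ignores development effort, retries, and per-registry differences.

```python
def eta_seconds(records, rate_per_second):
    """Time needed to fetch `records` at a constant rate."""
    return records / rate_per_second


TARGET = 97_000
COMPLETED = 25_436
remaining = TARGET - COMPLETED

sru_minutes = eta_seconds(remaining, 94) / 60
scrape_hours = eta_seconds(remaining, 1.2) / 3600

# Combined throughput of the two measured harvests (records / seconds).
combined_rate = (16_979 + 2_379) / ((3 + 33) * 60)

print(f"Remaining records: {remaining:,}")
print(f"At SRU speed:      {sru_minutes:.1f} min")
print(f"At scraping speed: {scrape_hours:.1f} h")
print(f"Combined rate:     {combined_rate:.1f} rec/s")
```

The two bounds are far apart, so the mix of SRU-style APIs and scraped registries among the remaining countries dominates the estimate.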

Integration with GLAM Project

Data Transformation Pipeline

  1. Harvest (complete) → Raw JSON/JSONL files
  2. Normalize (next) → Standardize field names, types
  3. Geocode (next) → Add lat/lon for all addresses
  4. Enrich (next) → Wikidata Q-numbers, institution types
  5. LinkML Convert (next) → Transform to HeritageCustodian schema
  6. GHCID Generate (next) → Create persistent identifiers
  7. RDF Export (final) → Publish as Linked Open Data

Schema Mapping

| ISIL Field | LinkML Field | Status |
|---|---|---|
| isil | identifiers[].identifier_value | |
| name | name | |
| alternative_names | alternative_names | |
| address | locations[].street_address | |
| contact.phone | locations[].phone | |
| contact.email | locations[].email | |
| urls[].url | digital_platforms[].platform_url | 🔄 |
| institution_type | institution_type (GLAMORCUBESFIXPHDNT) | 🔄 |
Documentation

  • Master Plan: /data/isil/MASTER_HARVEST_PLAN.md
  • Progress Summary: /data/isil/HARVEST_PROGRESS_SUMMARY.md
  • This Session: /data/isil/SESSION_SUMMARY_20251119_HARVEST_CONTINUATION.md

Data Directories

  • Germany: /data/isil/germany/
  • Switzerland: /data/isil/switzerland/
  • All countries: /data/isil/

Scripts

  • German harvester: /scripts/scrapers/harvest_german_isil_sru.py
  • Swiss harvester: /scripts/scrapers/harvest_swiss_isil.py
  • All scrapers: /scripts/scrapers/


End of Session

Status: Phase 1 in progress (26.2% complete)
Next Session: Czech Republic harvest + Swiss ISIL code extraction
Estimated Next Milestone: 35% complete after Czech + Denmark harvests


Session ended: November 19, 2025, 14:30 CET
Total active time: 1 hour
Records added: 16,979 (Germany)
Records verified: 2,379 (Switzerland)