glam/data/isil/MASTER_HARVEST_PLAN.md
2025-11-19 23:25:22 +01:00


Global ISIL Harvest Master Plan

Goal: Harvest all accessible national ISIL registries worldwide
Status: In Progress
Started: November 19, 2025


Priority Matrix

Priority 1: APIs Available (High Automation Potential)

| Country | Records | API Type | Status / Priority |
| --- | --- | --- | --- |
| 🇩🇪 Germany | 16,979 | SRU, JSON, LD | COMPLETE (archives pending) |
| 🇨🇭 Switzerland | 2,379 | HTML + Open Data | COMPLETE |
| 🇨🇿 Czech Republic | 8,694 | ALEPH (Z39.50) + API | COMPLETE |
| 🇳🇱 Netherlands | ~1,400 | CSV | COMPLETE |
| 🇧🇪 Belgium | 438 | Search | COMPLETE |
| 🇨🇦 Canada | ~5,000 | Search API | 🔄 Queue |
| 🇦🇺 Australia | ~4,000 | Search API | 🔄 Queue |
| 🇯🇵 Japan | ~6,000 | CSV List | 🔄 Queue |
| 🇳🇴 Norway | ~1,200 | Search | 🔄 Queue |
| 🇫🇮 Finland | ~1,000 | Search | 🔄 Queue |

Priority 2: Scraping Required (Medium Automation)

| Country | Records | Access | Status / Priority |
| --- | --- | --- | --- |
| 🇦🇹 Austria | ~3,000 | JavaScript search | 🔄 Queue |
| 🇮🇹 Italy | ~10,000 | Search portal | 🔄 Queue |
| 🇫🇷 France | ~5,000 | SUDOC search | 🔄 Queue |
| 🇪🇸 Spain (GTB) | ~2,000 | Search list | 🔄 Queue |
| 🇭🇺 Hungary | ~1,500 | PDF list | 🔄 Queue |
| 🇭🇷 Croatia | ~500 | List | 🔄 Queue |
| 🇸🇮 Slovenia | ~300 | List | 🔄 Queue |
| 🇸🇰 Slovakia | ~400 | List | 🔄 Queue |

Priority 3: Manual/Limited Access

| Country | Records | Access | Status / Priority |
| --- | --- | --- | --- |
| 🇬🇧 UK | ~4,000 | ⚠️ Cyber attack | ⏸️ Wait |
| 🇷🇺 Russia | ~3,000 | Search | 🔄 Queue |
| 🇺🇸 USA (LoC) | ~10,000 | MARC list | 🔄 Queue |
| 🇨🇾 Cyprus | ~50 | List | 🔄 Queue |
| 🇷🇴 Romania | ~1,000 | Search | 🔄 Queue |
| 🇧🇬 Bulgaria | ~500 | Search | 🔄 Queue |
| 🇪🇬 Egypt | ~300 | Search | 🔄 Queue |
| 🇮🇱 Israel | ~500 | List | 🔄 Queue |

Priority 4: No Public Access (Contact Required)

| Country | Records | Access | Status / Priority |
| --- | --- | --- | --- |
| 🇦🇷 Argentina | ~1,000 | Contact IRAM | ⏸️ |
| 🇧🇾 Belarus | ~500 | No API | ⏸️ |
| 🇮🇷 Iran | ~300 | No API | ⏸️ |
| 🇰🇷 South Korea | ~1,500 | No API | ⏸️ |
| 🇰🇿 Kazakhstan | ~200 | No API | ⏸️ |
| 🇲🇩 Moldova | ~100 | No API | ⏸️ |
| 🇳🇵 Nepal | n/a | No API | ⏸️ |
| 🇶🇦 Qatar | ~100 | Search | ⏸️ |

Harvest Strategy

Phase 1: Core European Registries (Week 1)

  • Germany (16,979) - COMPLETE (archives pending DDB API)
  • Switzerland (2,379) - COMPLETE
  • Czech Republic (8,694) - COMPLETE (largest single-country dataset)
  • Belgium (438) - COMPLETE
  • Netherlands (~1,400) - COMPLETE
  • 🔄 Austria (~3,000) - Next Priority
  • 🔄 France (~5,000) - Queue

Target: ~30,000 records
Achieved: 27,053 records (90% - pending German archives)

Phase 2: English-Speaking Countries (Week 2)

  • Canada (~5,000)
  • Australia (~4,000)
  • USA (~10,000)
  • New Zealand (~1,000)

Target: ~20,000 records

Phase 3: Additional European (Week 3)

  • Italy (~10,000)
  • Belgium (~1,500)
  • Norway (~1,200)
  • Finland (~1,000)
  • Denmark (data already available)
  • Hungary (~1,500)
  • Croatia (~500)

Target: ~16,000 records

Phase 4: Asia & Global (Week 4)

  • Japan (~6,000)
  • Israel (~500)
  • Russia (~3,000)
  • Egypt (~300)
  • Others

Target: ~10,000 records


Total Expected Coverage

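| Category | Countries | Records | Harvested | Status |
| --- | --- | --- | --- | --- |
| Completed | 5 | 27,053+ | 27,053 | 100% |
| High Priority | 4 | ~18,000 | 0 | 0% |
| Medium Priority | 8 | ~25,000 | 0 | 0% |
| Low Priority | 10 | ~15,000 | 0 | 0% |
| No Access | 8 | ~5,000 | 0 | ⏸️ 0% |
| TOTAL | 36 | ~97,000 | 27,053 | 27.9% |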
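
The percentages in this plan follow from simple ratios of harvested to expected totals. A minimal sketch of that arithmetic, using only the figures stated in this plan (the helper name `percent` is illustrative):

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Figures taken from the TOTAL row and the status summary of this plan.
print(percent(27_053, 97_000))  # share of expected records harvested: 27.9
print(percent(5, 36))           # share of countries completed: 13.9
```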

Technical Approach

Method 1: SRU Protocol (Germany, Czech Republic)

  • Standard library protocol
  • XML responses (PicaPlus, MARC)
  • Batch processing
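
The SRU approach above might be sketched roughly as follows. This is an illustration, not the code in `harvest_german_isil_sru.py`: the base URL, the CQL query, and the `PicaPlus-xml` schema name are placeholders and must be taken from each registry's own SRU documentation. The request parameters and the response namespace are those of SRU 1.1.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of SRU 1.1 response envelopes.
SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def build_sru_url(base_url, query, start_record=1, maximum_records=100,
                  record_schema="PicaPlus-xml"):  # schema name is an assumption
    """Build an SRU 1.1 searchRetrieve URL for one batch of records."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start_record,        # SRU record positions are 1-based
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    return f"{base_url}?{urlencode(params)}"


def parse_sru_response(xml_text):
    """Return (total_hits, recordData elements) from an SRU response body."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("srw:numberOfRecords", default="0",
                              namespaces=SRW_NS))
    records = root.findall("srw:records/srw:record/srw:recordData", SRW_NS)
    return total, records


def batch_starts(total, batch_size):
    """List the startRecord value of every batch needed to cover all hits."""
    return list(range(1, total + 1, batch_size))
```

A harvester would request the first batch, read the total from `numberOfRecords`, and then iterate over `batch_starts(total, batch_size)` to fetch the remaining batches.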

Method 2: JSON/REST APIs (Switzerland, Netherlands)

  • Modern web APIs
  • JSON responses
  • Pagination
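
Paginated JSON harvesting could look something like the sketch below. The `offset`/`limit` query parameters and the `records` response key are assumptions made for illustration; real registries name these differently, and some use cursors or page numbers instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode its body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_paginated(base_url, page_size=100, fetch=fetch_json):
    """Yield records page by page until the API returns an empty page."""
    offset = 0
    while True:
        # Hypothetical paging parameters; adapt to the actual API.
        url = f"{base_url}?{urlencode({'offset': offset, 'limit': page_size})}"
        page = fetch(url)
        records = page.get("records", [])
        if not records:
            break
        yield from records
        offset += len(records)
```

Passing `fetch` as a parameter keeps the paging logic testable without network access.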

Method 3: Web Scraping (Austria, Italy, France)

  • Playwright browser automation
  • JavaScript rendering
  • Rate limiting
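
For the rate-limiting part of the scraping approach, one simple option is to enforce a minimum interval between requests. The class below is a sketch using only the standard library; the name `RateLimiter` and the chosen interval are illustrative, and each site's terms of use should determine the actual delay.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a Playwright-based scraper, `limiter.wait()` would be called before each `page.goto(...)` so that page loads are spaced out.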

Method 4: Bulk Downloads (Japan, USA)

  • CSV/Excel files
  • MARC records
  • FTP/HTTP downloads
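
Once a bulk CSV file has been downloaded, parsing it might look like the sketch below. The column names in `column_map` are hypothetical; each registry's file has its own headers, so the mapping has to be set per source.

```python
import csv
import io


def parse_isil_csv(text, column_map):
    """Map each CSV row to a dict keyed by our own field names.

    column_map maps target field names to the source file's column headers.
    """
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        record = {field: (row.get(column) or "").strip()
                  for field, column in column_map.items()}
        if record.get("isil"):  # skip rows that have no identifier
            records.append(record)
    return records


# Example with made-up data and headers.
sample = "ISIL,Name,City\nXX-0001, Example Library ,Exampletown\n,No Code,Elsewhere\n"
rows = parse_isil_csv(sample, {"isil": "ISIL", "name": "Name", "city": "City"})
```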

Success Criteria

Data Quality Metrics

  • ISIL identifier (100%)
  • Institution name (100%)
  • Address/location (>80%)
  • Contact info (>50%)
  • Institution type (>30%)
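
These completeness targets can be checked mechanically after each harvest. The sketch below assumes records are plain dicts with the field names shown; those names are illustrative and not taken from the project's schema.

```python
# Minimum completeness per field, from the targets listed above.
THRESHOLDS = {
    "isil": 1.00,
    "name": 1.00,
    "address": 0.80,
    "contact": 0.50,
    "institution_type": 0.30,
}


def completeness(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: sum(1 for r in records if r.get(field)) / total
            for field in fields}


def failing_fields(records, thresholds=THRESHOLDS):
    """List the fields whose completeness misses its target."""
    rates = completeness(records, thresholds)
    failing = []
    for field, minimum in thresholds.items():
        # 100% targets must be met exactly; the others are strict (">80%").
        ok = rates[field] >= 1.0 if minimum >= 1.0 else rates[field] > minimum
        if not ok:
            failing.append(field)
    return sorted(failing)
```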

Coverage Metrics

  • Europe: >90% of registries
  • Americas: >70% of registries
  • Asia-Pacific: >50% of registries
  • Africa/Middle East: >30% of registries

Performance Metrics

  • Average harvest time: <5 minutes per country
  • Success rate: >95%
  • API compliance: 100%

Next Actions

Immediate (Today)

  1. Complete Germany harvest
  2. 🔄 Start Switzerland harvest (Open Data + web scraping)
  3. 🔄 Start Czech Republic harvest (Z39.50 protocol)

Short-term (This Week)

  1. Austria (JavaScript scraping)
  2. France (SUDOC portal)
  3. Canada (Library and Archives Canada)
  4. Australia (National Library)

Medium-term (Next Week)

  1. Italy (ICCU portal)
  2. Belgium (Royal Library)
  3. Norway (National Library)
  4. Japan (National Diet Library)

Resources & Tools

Harvester Scripts

  • harvest_german_isil_sru.py - SRU protocol harvester
  • 🔄 harvest_swiss_isil.py - Swiss Open Data + scraper
  • 🔄 harvest_czech_isil.py - Z39.50 protocol
  • 🔄 harvest_generic_isil.py - Generic web scraper

Data Formats

  • PicaPlus-XML (Germany)
  • MARC21 (USA, Canada)
  • JSON/JSON-LD (Switzerland, Netherlands)
  • CSV (Japan, Hungary)
  • RDF/Turtle (Linked Data services)

APIs & Protocols

  • SRU 1.1 (Search/Retrieve via URL)
  • Z39.50 (Library protocol)
  • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
  • REST/JSON APIs
  • Linked Data (RDF, SPARQL)

Output Standards

All harvested data will be:

  1. Parsed to consistent JSON structure
  2. Validated against schema
  3. Enriched with geocoding
  4. Mapped to LinkML HeritageCustodian schema
  5. Exported to multiple formats:
    • JSON (structured)
    • JSONL (line-delimited)
    • CSV (spreadsheet)
    • RDF/Turtle (Linked Data)
    • Parquet (data warehousing)
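
As a rough illustration of the export step, the sketch below writes already-normalized records to JSONL and CSV using only the standard library. The record fields are assumptions; RDF/Turtle and Parquet output would need additional libraries and are not shown.

```python
import csv
import io
import json


def to_jsonl(records):
    """Serialize records as line-delimited JSON, one object per line."""
    return "".join(json.dumps(r, ensure_ascii=False, sort_keys=True) + "\n"
                   for r in records)


def to_csv(records, fieldnames):
    """Serialize records as CSV; missing fields are written as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore")  # drop fields not listed
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Keeping both serializers as pure functions over the same list of dicts makes it easier to guarantee that every export format carries the same records.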

License & Attribution

All data remains under original licenses:

  • Germany: CC0 1.0 (Public Domain)
  • Switzerland: Open Data (likely CC0, to be confirmed)
  • Netherlands: Open Data
  • Others: Check per-registry

Attribution will be maintained in metadata for all records.
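
One way to carry attribution with each record is to attach a small provenance block at harvest time. The structure and key names below (`provenance`, `source`, `license`, `harvested_at`) are assumptions for illustration, not the project's actual schema.

```python
from datetime import datetime, timezone


def with_provenance(record, source, license_name, harvested_at=None):
    """Return a copy of the record with source and license metadata attached."""
    harvested_at = harvested_at or datetime.now(timezone.utc)
    enriched = dict(record)  # shallow copy; the input record is not modified
    enriched["provenance"] = {
        "source": source,
        "license": license_name,
        "harvested_at": harvested_at.isoformat(),
    }
    return enriched
```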


Status: 🔄 IN PROGRESS
Phase: 1 of 4 - 90% COMPLETE
Completion: 5/36 countries (14%)
Records: 27,053 / ~97,000 (28%)
Last Updated: November 19, 2025


Maintained by: OpenCode + MCP Wikidata Tools