# Global ISIL Harvest Master Plan

**Goal**: Harvest all accessible national ISIL registries worldwide  
**Status**: In Progress  
**Started**: November 19, 2025

---

## Priority Matrix

### Priority 1: APIs Available (High Automation Potential)

| Country | Records | API Type | Status | Priority |
|---------|---------|----------|--------|----------|
| 🇩🇪 **Germany** | 16,979 | SRU, JSON, LD | ✅ **COMPLETE** (archives pending) | ⭐⭐⭐⭐⭐ |
| 🇨🇭 **Switzerland** | 2,379 | HTML + Open Data | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇨🇿 **Czech Republic** | 8,694 | ALEPH (Z39.50) + API | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇳🇱 Netherlands | ~1,400 | CSV | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇧🇪 Belgium | 438 | Search | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇨🇦 Canada | ~5,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇦🇺 Australia | ~4,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇯🇵 Japan | ~6,000 | CSV List | 🔄 Queue | ⭐⭐⭐ |
| 🇳🇴 Norway | ~1,200 | Search | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇮 Finland | ~1,000 | Search | 🔄 Queue | ⭐⭐⭐ |

### Priority 2: Scraping Required (Medium Automation)

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇹 Austria | ~3,000 | JavaScript search | 🔄 Queue | ⭐⭐⭐ |
| 🇮🇹 Italy | ~10,000 | Search portal | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇷 France | ~5,000 | SUDOC search | 🔄 Queue | ⭐⭐⭐ |
| 🇪🇸 Spain (GTB) | ~2,000 | Search list | 🔄 Queue | ⭐⭐ |
| 🇭🇺 Hungary | ~1,500 | PDF list | 🔄 Queue | ⭐⭐ |
| 🇭🇷 Croatia | ~500 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇮 Slovenia | ~300 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇰 Slovakia | ~400 | List | 🔄 Queue | ⭐⭐ |

### Priority 3: Manual/Limited Access

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇬🇧 UK | ~4,000 | ⚠️ Cyber attack | ⏸️ Wait | ⭐⭐ |
| 🇷🇺 Russia | ~3,000 | Search | 🔄 Queue | ⭐ |
| 🇺🇸 USA (LoC) | ~10,000 | MARC list | 🔄 Queue | ⭐⭐⭐ |
| 🇨🇾 Cyprus | ~50 | List | 🔄 Queue | ⭐ |
| 🇷🇴 Romania | ~1,000 | Search | 🔄 Queue | ⭐ |
| 🇧🇬 Bulgaria | ~500 | Search | 🔄 Queue | ⭐ |
| 🇪🇬 Egypt | ~300 | Search | 🔄 Queue | ⭐ |
| 🇮🇱 Israel | ~500 | List | 🔄 Queue | ⭐ |

### Priority 4: No Public Access (Contact Required)

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇷 Argentina | ~1,000 | Contact IRAM | ⏸️ | ⭐ |
| 🇧🇾 Belarus | ~500 | No API | ⏸️ | ⭐ |
| 🇮🇷 Iran | ~300 | No API | ⏸️ | ⭐ |
| 🇰🇷 South Korea | ~1,500 | No API | ⏸️ | ⭐ |
| 🇰🇿 Kazakhstan | ~200 | No API | ⏸️ | ⭐ |
| 🇲🇩 Moldova | ~100 | No API | ⏸️ | ⭐ |
| 🇳🇵 Nepal | n/a | No API | ⏸️ | ⭐ |
| 🇶🇦 Qatar | ~100 | Search | ⏸️ | ⭐ |

---

## Harvest Strategy

### Phase 1: Core European Registries (Week 1)

- ✅ **Germany (16,979)** - **COMPLETE** (archives pending DDB API)
- ✅ **Switzerland (2,379)** - **COMPLETE**
- ✅ **Czech Republic (8,694)** - **COMPLETE** (largest single-country dataset)
- ✅ **Belgium (438)** - **COMPLETE**
- ✅ **Netherlands (~1,400)** - **COMPLETE**
- 🔄 Austria (~3,000) - Next Priority
- 🔄 France (~5,000) - Queue

**Target**: ~30,000 records  
**Achieved**: **27,053 records** (90% - pending German archives)

### Phase 2: English-Speaking Countries (Week 2)

- Canada (~5,000)
- Australia (~4,000)
- USA (~10,000)
- New Zealand (~1,000)

**Target**: ~20,000 records

### Phase 3: Additional European (Week 3)

- Italy (~10,000)
- Belgium (~1,500)
- Norway (~1,200)
- Finland (~1,000)
- Denmark (✅ have data)
- Hungary (~1,500)
- Croatia (~500)

**Target**: ~16,000 records

### Phase 4: Asia & Global (Week 4)

- Japan (~6,000)
- Israel (~500)
- Russia (~3,000)
- Egypt (~300)
- Others

**Target**: ~10,000 records

---

## Total Expected Coverage

| Category | Countries | Records | Harvested | Status |
|----------|-----------|---------|-----------|--------|
| **Completed** | 5 | 27,053+ | 27,053 | ✅ 100% |
| **High Priority** | 4 | ~18,000 | 0 | ⏳ 0% |
| **Medium Priority** | 8 | ~25,000 | 0 | ⏳ 0% |
| **Low Priority** | 10 | ~15,000 | 0 | ⏳ 0% |
| **No Access** | 8 | ~5,000 | 0 | ⏸️ 0% |
| **TOTAL** | 36 | ~97,000 | 27,053 | **27.9%** |

---

## Technical Approach

### Method 1: SRU Protocol (Germany, Czech Rep)

- Standard library protocol
- XML responses (PicaPlus, MARC)
- Batch processing

### Method 2: JSON/REST APIs (Switzerland, Netherlands)

- Modern web APIs
- JSON responses
- Pagination

### Method 3: Web Scraping (Austria, Italy, France)

- Playwright browser automation
- JavaScript rendering
- Rate limiting

### Method 4: Bulk Downloads (Japan, USA)

- CSV/Excel files
- MARC records
- FTP/HTTP downloads

---

## Success Criteria

### Data Quality Metrics

- ✅ ISIL identifier (100%)
- ✅ Institution name (100%)
- ✅ Address/location (>80%)
- ✅ Contact info (>50%)
- ✅ Institution type (>30%)

### Coverage Metrics

- ✅ Europe: >90% of registries
- ✅ Americas: >70% of registries
- ✅ Asia-Pacific: >50% of registries
- ✅ Africa/Middle East: >30% of registries

### Performance Metrics

- Average harvest time: <5 minutes per country
- Success rate: >95%
- API compliance: 100%

---

## Next Actions

### Immediate (Today)

1. ✅ Complete Germany harvest
2. 🔄 Start Switzerland harvest (Open Data + web scraping)
3. 🔄 Start Czech Republic harvest (Z39.50 protocol)

### Short-term (This Week)

1. Austria (JavaScript scraping)
2. France (SUDOC portal)
3. Canada (Library and Archives Canada)
4. Australia (National Library)

### Medium-term (Next Week)

1. Italy (ICCU portal)
2. Belgium (Royal Library)
3. Norway (National Library)
4. Japan (National Diet Library)

---

## Resources & Tools

### Harvester Scripts

- ✅ `harvest_german_isil_sru.py` - SRU protocol harvester
- 🔄 `harvest_swiss_isil.py` - Swiss Open Data + scraper
- 🔄 `harvest_czech_isil.py` - Z39.50 protocol
- 🔄 `harvest_generic_isil.py` - Generic web scraper

### Data Formats

- PicaPlus-XML (Germany)
- MARC21 (USA, Canada)
- JSON/JSON-LD (Switzerland, Netherlands)
- CSV (Japan, Hungary)
- RDF/Turtle (Linked Data services)

### APIs & Protocols

- SRU 1.1 (Search/Retrieve via URL)
- Z39.50 (Library protocol)
- OAI-PMH (Open Archives Initiative)
- REST/JSON APIs
- Linked Data (RDF, SPARQL)

---

## Output Standards

All harvested data will be:

1. **Parsed** to consistent JSON structure
2. **Validated** against schema
3. **Enriched** with geocoding
4. **Mapped** to LinkML HeritageCustodian schema
5. **Exported** to multiple formats:
   - JSON (structured)
   - JSONL (line-delimited)
   - CSV (spreadsheet)
   - RDF/Turtle (Linked Data)
   - Parquet (data warehousing)

---

## License & Attribution

All data remains under original licenses:

- **Germany**: CC0 1.0 (Public Domain)
- **Switzerland**: Open Data (CC0 likely)
- **Netherlands**: Open Data
- Others: Check per-registry

Attribution will be maintained in metadata for all records.

---

**Status**: 🔄 **IN PROGRESS**  
**Phase**: 1 of 4 - **90% COMPLETE**  
**Completion**: 5/36 countries (14%)  
**Records**: **27,053** / ~97,000 (28%)  
**Last Updated**: November 19, 2025

---

*Maintained by: OpenCode + MCP Wikidata Tools*
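
The batch processing listed under Method 1 (SRU Protocol) could be sketched roughly as below. This is a minimal illustration, not the code in `harvest_german_isil_sru.py`: the function name `sru_page_urls`, the endpoint `https://sru.example.org/isil`, and the query string are all assumptions. Only the request parameters (`version`, `operation`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) come from the SRU 1.1 protocol itself.

```python
from urllib.parse import urlencode


def sru_page_urls(base_url, query, total_records, batch_size=100,
                  record_schema=None):
    """Yield one SRU 1.1 searchRetrieve URL per batch of records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # SRU numbers records from 1, so the first batch starts at startRecord=1.
    for start in range(1, total_records + 1, batch_size):
        params = {
            "version": "1.1",
            "operation": "searchRetrieve",
            "query": query,
            "startRecord": start,
            "maximumRecords": batch_size,
        }
        if record_schema:
            # Schema names differ per registry; pass whatever the server lists.
            params["recordSchema"] = record_schema
        yield f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and query, for illustration only.
urls = list(sru_page_urls("https://sru.example.org/isil", "isil=DE-*", 250))
```

Generating the URLs separately from fetching them keeps the paging logic testable offline and leaves room to add rate limiting between requests.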
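
The thresholds under Data Quality Metrics can be checked mechanically once records are parsed. The sketch below assumes each record is a dict and uses hypothetical field names (`isil`, `name`, `address`, `contact`, `institution_type`); the real schema may differ. The sample records are invented placeholders, not harvested data.

```python
# Target completeness per field, in percent, taken from Data Quality Metrics.
# Field names are hypothetical stand-ins for the real schema.
TARGETS = {
    "isil": 100.0,
    "name": 100.0,
    "address": 80.0,
    "contact": 50.0,
    "institution_type": 30.0,
}


def completeness(records, field):
    """Return the percentage of records with a non-empty value for field."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return 100.0 * filled / len(records)


def quality_report(records):
    """Map each field to (completeness rounded to 0.1, target met?)."""
    report = {}
    for field, target in TARGETS.items():
        pct = completeness(records, field)
        # 100% targets must be met exactly; the others are strict ">" limits.
        ok = pct == 100.0 if target == 100.0 else pct > target
        report[field] = (round(pct, 1), ok)
    return report


# Invented sample records, for illustration only.
sample = [
    {"isil": "XX-0001", "name": "Library A", "address": "Main St 1",
     "contact": "a@example.org", "institution_type": "library"},
    {"isil": "XX-0002", "name": "Archive B", "address": "Side St 2",
     "contact": "b@example.org", "institution_type": "archive"},
    {"isil": "XX-0003", "name": "Museum C", "address": "Park Rd 3",
     "contact": "c@example.org"},
    {"isil": "XX-0004", "name": "Library D"},
]
report = quality_report(sample)
```

In this sample the address field reaches 75%, which misses the >80% target, while every other field passes.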
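
Two of the export formats named under Output Standards, JSONL and CSV, need only the Python standard library. The following is a sketch under assumptions: the helper names `to_jsonl` and `to_csv` and the record fields are hypothetical, and RDF/Turtle and Parquet would need third-party libraries that are not shown here.

```python
import csv
import io
import json


def to_jsonl(records):
    """Serialize records as line-delimited JSON, one object per line."""
    # ensure_ascii=False keeps names such as "Zürich" readable in the output.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records, fieldnames):
    """Serialize records as CSV with a header row and a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({key: record.get(key, "") for key in fieldnames})
    return buffer.getvalue()


# Invented sample records, for illustration only.
records = [
    {"isil": "XX-0001", "name": "Example Library", "city": "Zürich"},
    {"isil": "XX-0002", "name": "Example Archive"},
]
jsonl_text = to_jsonl(records)
csv_text = to_csv(records, ["isil", "name", "city"])
```

Returning strings rather than writing files directly keeps both helpers easy to test; a caller can write the result to disk with an explicit UTF-8 encoding.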