Global ISIL Harvest Master Plan
Goal: Harvest all accessible national ISIL registries worldwide
Status: In Progress
Started: November 19, 2025
Priority Matrix
Priority 1: APIs Available (High Automation Potential)
| Country | Records | API Type | Status | Priority |
|---|---|---|---|---|
| 🇩🇪 Germany | 16,979 | SRU, JSON, LD | ✅ COMPLETE (archives pending) | ⭐⭐⭐⭐⭐ |
| 🇨🇭 Switzerland | 2,379 | HTML + Open Data | ✅ COMPLETE | ⭐⭐⭐⭐⭐ |
| 🇨🇿 Czech Republic | 8,694 | ALEPH (Z39.50) + API | ✅ COMPLETE | ⭐⭐⭐⭐⭐ |
| 🇳🇱 Netherlands | ~1,400 | CSV | ✅ COMPLETE | ⭐⭐⭐⭐ |
| 🇧🇪 Belgium | 438 | Search | ✅ COMPLETE | ⭐⭐⭐⭐ |
| 🇨🇦 Canada | ~5,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇦🇺 Australia | ~4,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇯🇵 Japan | ~6,000 | CSV List | 🔄 Queue | ⭐⭐⭐ |
| 🇳🇴 Norway | ~1,200 | Search | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇮 Finland | ~1,000 | Search | 🔄 Queue | ⭐⭐⭐ |
Priority 2: Scraping Required (Medium Automation)
| Country | Records | Access | Status | Priority |
|---|---|---|---|---|
| 🇦🇹 Austria | ~3,000 | JavaScript search | 🔄 Queue | ⭐⭐⭐ |
| 🇮🇹 Italy | ~10,000 | Search portal | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇷 France | ~5,000 | SUDOC search | 🔄 Queue | ⭐⭐⭐ |
| 🇪🇸 Spain (GTB) | ~2,000 | Search list | 🔄 Queue | ⭐⭐ |
| 🇭🇺 Hungary | ~1,500 | PDF list | 🔄 Queue | ⭐⭐ |
| 🇭🇷 Croatia | ~500 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇮 Slovenia | ~300 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇰 Slovakia | ~400 | List | 🔄 Queue | ⭐⭐ |
Priority 3: Manual/Limited Access
| Country | Records | Access | Status | Priority |
|---|---|---|---|---|
| 🇬🇧 UK | ~4,000 | ⚠️ Cyber attack | ⏸️ Wait | ⭐⭐ |
| 🇷🇺 Russia | ~3,000 | Search | 🔄 Queue | ⭐ |
| 🇺🇸 USA (LoC) | ~10,000 | MARC list | 🔄 Queue | ⭐⭐⭐ |
| 🇨🇾 Cyprus | ~50 | List | 🔄 Queue | ⭐ |
| 🇷🇴 Romania | ~1,000 | Search | 🔄 Queue | ⭐ |
| 🇧🇬 Bulgaria | ~500 | Search | 🔄 Queue | ⭐ |
| 🇪🇬 Egypt | ~300 | Search | 🔄 Queue | ⭐ |
| 🇮🇱 Israel | ~500 | List | 🔄 Queue | ⭐ |
Priority 4: No Public Access (Contact Required)
| Country | Records | Access | Status | Priority |
|---|---|---|---|---|
| 🇦🇷 Argentina | ~1,000 | Contact IRAM | ⏸️ | ⭐ |
| 🇧🇾 Belarus | ~500 | No API | ⏸️ | ⭐ |
| 🇮🇷 Iran | ~300 | No API | ⏸️ | ⭐ |
| 🇰🇷 South Korea | ~1,500 | No API | ⏸️ | ⭐ |
| 🇰🇿 Kazakhstan | ~200 | No API | ⏸️ | ⭐ |
| 🇲🇩 Moldova | ~100 | No API | ⏸️ | ⭐ |
| 🇳🇵 Nepal | n/a | No API | ⏸️ | ⭐ |
| 🇶🇦 Qatar | ~100 | Search | ⏸️ | ⭐ |
Harvest Strategy
Phase 1: Core European Registries (Week 1)
✅ Germany (16,979) - COMPLETE (archives pending DDB API)
✅ Switzerland (2,379) - COMPLETE
✅ Czech Republic (8,694) - COMPLETE (largest single-country dataset)
✅ Belgium (438) - COMPLETE
✅ Netherlands (~1,400) - COMPLETE
🔄 Austria (~3,000) - Next Priority
🔄 France (~5,000) - Queue
Target: ~30,000 records
Achieved: 27,053 records (90% - pending German archives)
Phase 2: English-Speaking Countries (Week 2)
- Canada (~5,000)
- Australia (~4,000)
- USA (~10,000)
- New Zealand (~1,000)
Target: ~20,000 records
Phase 3: Additional European (Week 3)
- Italy (~10,000)
- Belgium (~1,500)
- Norway (~1,200)
- Finland (~1,000)
- Denmark (✅ have data)
- Hungary (~1,500)
- Croatia (~500)
Target: ~16,000 records
Phase 4: Asia & Global (Week 4)
- Japan (~6,000)
- Israel (~500)
- Russia (~3,000)
- Egypt (~300)
- Others
Target: ~10,000 records
Total Expected Coverage
| Category | Countries | Records | Harvested | Status |
|---|---|---|---|---|
| Completed | 5 | 27,053+ | 27,053 | ✅ 100% |
| High Priority | 4 | ~18,000 | 0 | ⏳ 0% |
| Medium Priority | 8 | ~25,000 | 0 | ⏳ 0% |
| Low Priority | 10 | ~15,000 | 0 | ⏳ 0% |
| No Access | 8 | ~5,000 | 0 | ⏸️ 0% |
| TOTAL | 36 | ~97,000 | 27,053 | 27.9% |
Technical Approach
Method 1: SRU Protocol (Germany, Czech Republic)
- Standard library protocol
- XML responses (PicaPlus, MARC)
- Batch processing
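As a rough illustration of the batch processing above, the sketch below builds SRU 1.1 `searchRetrieve` URLs and reads the total hit count from a response. The base URL, the CQL query, and the record schema name are placeholders chosen for this example, not the actual endpoints or index names of any registry; check each registry's SRU `explain` response before relying on them.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by SRU 1.1 response elements.
SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def build_sru_url(base_url, cql_query, record_schema,
                  start_record=1, maximum_records=100):
    """Build one searchRetrieve request URL for a single batch."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,        # SRU record positions start at 1
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    return f"{base_url}?{urlencode(params)}"


def batch_starts(total, batch_size):
    """Return the startRecord value of every batch needed to cover `total` hits."""
    return list(range(1, total + 1, batch_size))


def parse_number_of_records(xml_text):
    """Read <srw:numberOfRecords> from a searchRetrieve response."""
    root = ET.fromstring(xml_text)
    node = root.find("srw:numberOfRecords", SRW_NS)
    return int(node.text) if node is not None else 0


# Hand-written sample response (not real registry output).
SAMPLE_RESPONSE = (
    '<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">'
    "<srw:version>1.1</srw:version>"
    "<srw:numberOfRecords>250</srw:numberOfRecords>"
    "</srw:searchRetrieveResponse>"
)

if __name__ == "__main__":
    total = parse_number_of_records(SAMPLE_RESPONSE)
    for start in batch_starts(total, 100):
        # Hypothetical endpoint, query and schema, for illustration only.
        print(build_sru_url("https://sru.example.org/isil", "isil=XX*",
                            "marcxml", start_record=start))
```

Keeping URL construction separate from fetching makes the batching logic easy to test without touching the network.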
Method 2: JSON/REST APIs (Switzerland, Netherlands)
- Modern web APIs
- JSON responses
- Pagination
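Pagination over a JSON API could look roughly like the sketch below. It assumes an offset/limit style of paging; the real registries may use page numbers or cursors instead. `fetch_page` is a hypothetical callable that the caller supplies (for example a thin wrapper around an HTTP client), so the loop itself stays independent of any particular API.

```python
from typing import Callable, Dict, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List[Dict]],
               page_size: int = 100) -> Iterator[Dict]:
    """Yield records one by one until the source returns a short or empty page."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            break
        offset += page_size


def make_fake_fetch(records: List[Dict]) -> Callable[[int, int], List[Dict]]:
    """In-memory stand-in for an API call, used here for demonstration."""
    def fetch(offset: int, limit: int) -> List[Dict]:
        return records[offset:offset + limit]
    return fetch


if __name__ == "__main__":
    # Placeholder identifiers, not real ISILs.
    fake = [{"isil": f"XX-{i:04d}"} for i in range(250)]
    harvested = list(iter_pages(make_fake_fetch(fake), page_size=100))
    print(len(harvested))
```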
Method 3: Web Scraping (Austria, Italy, France)
- Playwright browser automation
- JavaScript rendering
- Rate limiting
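The rate-limiting part of this method can be sketched independently of the browser automation. The class below is a minimal example that enforces a minimum interval between requests; the interval value and the `RateLimiter` name are assumptions for illustration, and the clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time


class RateLimiter:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    limiter = RateLimiter(min_interval=0.2)  # illustrative value only
    for page in range(3):
        limiter.wait()
        print("would request page", page)
```

Calling `limiter.wait()` before each page load keeps the scraper within whatever request rate the target site tolerates.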
Method 4: Bulk Downloads (Japan, USA)
- CSV/Excel files
- MARC records
- FTP/HTTP downloads
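For CSV-based bulk downloads, parsing might look like the following sketch. The column names `isil` and `name` and the sample rows are invented for this example; actual registry files will use their own headers, which is why the column names are parameters.

```python
import csv
import io


def parse_isil_csv(text, isil_column="isil", name_column="name"):
    """Parse CSV text into minimal records, skipping rows without an identifier."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        isil = (row.get(isil_column) or "").strip()
        if not isil:
            continue  # a row without an ISIL cannot be used
        records.append({
            "isil": isil,
            "name": (row.get(name_column) or "").strip(),
        })
    return records


# Made-up sample data with placeholder identifiers.
SAMPLE_CSV = (
    "isil,name\n"
    "XX-0001,Example Library\n"
    ",Row Without Identifier\n"
    "XX-0002, Example Archive \n"
)

if __name__ == "__main__":
    for record in parse_isil_csv(SAMPLE_CSV):
        print(record)
```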
Success Criteria
Data Quality Metrics
- ✅ ISIL identifier (100%)
- ✅ Institution name (100%)
- ✅ Address/location (>80%)
- ✅ Contact info (>50%)
- ✅ Institution type (>30%)
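One way to check these thresholds is to compute, per field, the share of records in which the field is filled. The sketch below assumes hypothetical field names (`isil`, `name`, `address`, `contact`, `institution_type`) in the parsed records; the threshold values are taken from the list above, with 100% treated as "every record" and the others as strict lower bounds.

```python
# Minimum share of records that must have each field filled.
THRESHOLDS = {
    "isil": 1.00,
    "name": 1.00,
    "address": 0.80,
    "contact": 0.50,
    "institution_type": 0.30,
}


def completeness(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return filled / len(records)


def quality_report(records, thresholds=THRESHOLDS):
    """Compare per-field completeness against the configured thresholds."""
    report = {}
    for field, minimum in thresholds.items():
        ratio = completeness(records, field)
        # 100% targets must be met exactly; ">N%" targets must be exceeded.
        ok = ratio >= minimum if minimum >= 1.0 else ratio > minimum
        report[field] = {"ratio": ratio, "minimum": minimum, "ok": ok}
    return report


if __name__ == "__main__":
    # Placeholder records for demonstration.
    sample = [
        {"isil": "XX-0001", "name": "A", "address": "Street 1", "contact": "a@example.org"},
        {"isil": "XX-0002", "name": "B", "address": "", "contact": ""},
    ]
    for field, result in quality_report(sample).items():
        print(field, result)
```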
Coverage Metrics
- ✅ Europe: >90% of registries
- ✅ Americas: >70% of registries
- ✅ Asia-Pacific: >50% of registries
- ✅ Africa/Middle East: >30% of registries
Performance Metrics
- Average harvest time: <5 minutes per country
- Success rate: >95%
- API compliance: 100%
Next Actions
Immediate (Today)
- ✅ Complete Germany harvest
- 🔄 Start Switzerland harvest (Open Data + web scraping)
- 🔄 Start Czech Republic harvest (Z39.50 protocol)
Short-term (This Week)
- Austria (JavaScript scraping)
- France (SUDOC portal)
- Canada (Library and Archives Canada)
- Australia (National Library)
Medium-term (Next Week)
- Italy (ICCU portal)
- Belgium (Royal Library)
- Norway (National Library)
- Japan (National Diet Library)
Resources & Tools
Harvester Scripts
- ✅ harvest_german_isil_sru.py - SRU protocol harvester
- 🔄 harvest_swiss_isil.py - Swiss Open Data + scraper
- 🔄 harvest_czech_isil.py - Z39.50 protocol
- 🔄 harvest_generic_isil.py - Generic web scraper
Data Formats
- PicaPlus-XML (Germany)
- MARC21 (USA, Canada)
- JSON/JSON-LD (Switzerland, Netherlands)
- CSV (Japan, Hungary)
- RDF/Turtle (Linked Data services)
APIs & Protocols
- SRU 1.1 (Search/Retrieve via URL)
- Z39.50 (Library protocol)
- OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
- REST/JSON APIs
- Linked Data (RDF, SPARQL)
Output Standards
All harvested data will be:
- Parsed to consistent JSON structure
- Validated against schema
- Enriched with geocoding
- Mapped to LinkML HeritageCustodian schema
- Exported to multiple formats:
  - JSON (structured)
  - JSONL (line-delimited)
  - CSV (spreadsheet)
  - RDF/Turtle (Linked Data)
  - Parquet (data warehousing)
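Two of these export targets, JSONL and CSV, can be produced with the standard library alone, as in the sketch below. The record fields, including the `source_license` key used to carry attribution, are hypothetical and do not reflect the actual HeritageCustodian schema; RDF/Turtle and Parquet would need additional libraries and are not shown.

```python
import csv
import io
import json


def to_jsonl(records):
    """Serialize records as line-delimited JSON, one object per line."""
    return "".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n"
        for record in records
    )


def to_csv(records, fieldnames):
    """Serialize selected fields as CSV; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder records; identifiers and field names are illustrative.
    records = [
        {"isil": "XX-0001", "name": "Example Library", "source_license": "CC0-1.0"},
        {"isil": "XX-0002", "name": "Example Archive"},
    ]
    print(to_jsonl(records), end="")
    print(to_csv(records, ["isil", "name", "source_license"]), end="")
```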
License & Attribution
All data remains under original licenses:
- Germany: CC0 1.0 (Public Domain)
- Switzerland: Open Data (likely CC0, to be confirmed)
- Netherlands: Open Data
- Others: Check per-registry
Attribution will be maintained in metadata for all records.
Status: 🔄 IN PROGRESS
Phase: 1 of 4 - 90% COMPLETE
Completion: 5/36 countries (14%)
Records: 27,053 / ~97,000 (28%)
Last Updated: November 19, 2025
Maintained by: OpenCode + MCP Wikidata Tools