# Global ISIL Harvest Master Plan

**Goal**: Harvest all accessible national ISIL registries worldwide
**Status**: In Progress
**Started**: November 19, 2025

---

## Priority Matrix

### Priority 1: APIs Available (High Automation Potential)

| Country | Records | API Type | Status | Priority |
|---------|---------|----------|--------|----------|
| 🇩🇪 **Germany** | 16,979 | SRU, JSON, LD | ✅ **COMPLETE** (archives pending) | ⭐⭐⭐⭐⭐ |
| 🇨🇭 **Switzerland** | 2,379 | HTML + Open Data | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇨🇿 **Czech Republic** | 8,694 | ALEPH (Z39.50) + API | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇳🇱 Netherlands | ~1,400 | CSV | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇧🇪 Belgium | 438 | Search | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇨🇦 Canada | ~5,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇦🇺 Australia | ~4,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇯🇵 Japan | ~6,000 | CSV List | 🔄 Queue | ⭐⭐⭐ |
| 🇳🇴 Norway | ~1,200 | Search | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇮 Finland | ~1,000 | Search | 🔄 Queue | ⭐⭐⭐ |

### Priority 2: Scraping Required (Medium Automation)

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇹 Austria | ~3,000 | JavaScript search | 🔄 Queue | ⭐⭐⭐ |
| 🇮🇹 Italy | ~10,000 | Search portal | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇷 France | ~5,000 | SUDOC search | 🔄 Queue | ⭐⭐⭐ |
| 🇪🇸 Spain (GTB) | ~2,000 | Search list | 🔄 Queue | ⭐⭐ |
| 🇭🇺 Hungary | ~1,500 | PDF list | 🔄 Queue | ⭐⭐ |
| 🇭🇷 Croatia | ~500 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇮 Slovenia | ~300 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇰 Slovakia | ~400 | List | 🔄 Queue | ⭐⭐ |

### Priority 3: Manual/Limited Access

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇬🇧 UK | ~4,000 | ⚠️ Cyber attack | ⏸️ Wait | ⭐⭐ |
| 🇷🇺 Russia | ~3,000 | Search | 🔄 Queue | ⭐ |
| 🇺🇸 USA (LoC) | ~10,000 | MARC list | 🔄 Queue | ⭐⭐⭐ |
| 🇨🇾 Cyprus | ~50 | List | 🔄 Queue | ⭐ |
| 🇷🇴 Romania | ~1,000 | Search | 🔄 Queue | ⭐ |
| 🇧🇬 Bulgaria | ~500 | Search | 🔄 Queue | ⭐ |
| 🇪🇬 Egypt | ~300 | Search | 🔄 Queue | ⭐ |
| 🇮🇱 Israel | ~500 | List | 🔄 Queue | ⭐ |

### Priority 4: No Public Access (Contact Required)

| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇷 Argentina | ~1,000 | Contact IRAM | ⏸️ | ⭐ |
| 🇧🇾 Belarus | ~500 | No API | ⏸️ | ⭐ |
| 🇮🇷 Iran | ~300 | No API | ⏸️ | ⭐ |
| 🇰🇷 South Korea | ~1,500 | No API | ⏸️ | ⭐ |
| 🇰🇿 Kazakhstan | ~200 | No API | ⏸️ | ⭐ |
| 🇲🇩 Moldova | ~100 | No API | ⏸️ | ⭐ |
| 🇳🇵 Nepal | n/a | No API | ⏸️ | ⭐ |
| 🇶🇦 Qatar | ~100 | Search | ⏸️ | ⭐ |

---

## Harvest Strategy

### Phase 1: Core European Registries (Week 1)

- ✅ **Germany (16,979)** - **COMPLETE** (archives pending DDB API)
- ✅ **Switzerland (2,379)** - **COMPLETE**
- ✅ **Czech Republic (8,694)** - **COMPLETE** (second-largest dataset after Germany)
- ✅ **Belgium (438)** - **COMPLETE**
- ✅ **Netherlands (~1,400)** - **COMPLETE**
- 🔄 Austria (~3,000) - Next Priority
- 🔄 France (~5,000) - Queue

**Target**: ~30,000 records
**Achieved**: **27,053 records** (90% - pending German archives)

### Phase 2: English-Speaking Countries (Week 2)

- Canada (~5,000)
- Australia (~4,000)
- USA (~10,000)
- New Zealand (~1,000)

**Target**: ~20,000 records

### Phase 3: Additional European (Week 3)

- Italy (~10,000)
- Belgium (~1,500)
- Norway (~1,200)
- Finland (~1,000)
- Denmark (✅ have data)
- Hungary (~1,500)
- Croatia (~500)

**Target**: ~16,000 records

### Phase 4: Asia & Global (Week 4)

- Japan (~6,000)
- Israel (~500)
- Russia (~3,000)
- Egypt (~300)
- Others

**Target**: ~10,000 records

---

## Total Expected Coverage

| Category | Countries | Records | Harvested | Status |
|----------|-----------|---------|-----------|--------|
| **Completed** | 5 | 27,053+ | 27,053 | ✅ 100% |
| **High Priority** | 4 | ~18,000 | 0 | ⏳ 0% |
| **Medium Priority** | 8 | ~25,000 | 0 | ⏳ 0% |
| **Low Priority** | 10 | ~15,000 | 0 | ⏳ 0% |
| **No Access** | 8 | ~5,000 | 0 | ⏸️ 0% |
| **TOTAL** | 36 | ~97,000 | 27,053 | **27.9%** |

---

## Technical Approach

### Method 1: SRU Protocol (Germany, Czech Republic)

- Standard search protocol for library catalogues
- XML responses (PicaPlus, MARC)
- Batch processing

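
As a rough illustration of the batch processing listed above, the sketch below builds SRU 1.1 `searchRetrieve` URLs in fixed-size batches. The parameter names (`version`, `operation`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) come from the SRU specification; the endpoint URL and the helper names are placeholders, not the code of the actual harvester.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the registry's real SRU base URL.
SRU_BASE_URL = "https://example.org/sru"


def build_sru_url(query, start_record=1, maximum_records=100,
                  record_schema=None, base_url=SRU_BASE_URL):
    """Build an SRU 1.1 searchRetrieve URL for one batch of records."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start_record,        # SRU record positions are 1-based
        "maximumRecords": maximum_records,  # batch size requested from the server
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return f"{base_url}?{urlencode(params)}"


def batch_urls(query, total_records, batch_size=100, **kwargs):
    """Yield one URL per batch until total_records is covered."""
    for start in range(1, total_records + 1, batch_size):
        yield build_sru_url(query, start_record=start,
                            maximum_records=batch_size, **kwargs)
```

In practice the total record count would be read from the first response rather than passed in, and servers may cap `maximumRecords` below the requested value.
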
### Method 2: JSON/REST APIs (Switzerland, Netherlands)

- Modern web APIs
- JSON responses
- Pagination

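
Pagination can be sketched generically as follows. This assumes a response shape in which each page carries a list of records and a link to the next page; the key names `results` and `next` are assumptions for illustration, and each real registry API will need its own mapping.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch one page and decode it as JSON (standard library only)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_records(first_url, fetch=fetch_json,
                 items_key="results", next_key="next"):
    """Yield records page by page, following the next-page link until it is absent."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get(items_key, [])
        url = page.get(next_key)  # None or missing key ends the loop
```

Passing `fetch` as a parameter keeps the loop testable without network access: any callable that maps a URL to a decoded page will do.
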
### Method 3: Web Scraping (Austria, Italy, France)

- Playwright browser automation
- JavaScript rendering
- Rate limiting

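
The rate-limiting point can be illustrated with a small helper that enforces a minimum delay between requests. This is a minimal sketch, not the scraper's actual code; the class name and the injectable `clock` and `sleep` parameters are illustrative choices that make the behaviour easy to test.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # no request has been made yet

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A scraper would call `limiter.wait()` immediately before each page load, so the delay applies regardless of how long parsing the previous page took.
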
### Method 4: Bulk Downloads (Japan, USA)

- CSV/Excel files
- MARC records
- FTP/HTTP downloads

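
For the CSV case, a downloaded list can be turned into uniform records with the standard `csv` module. The column headers and target field names in `COLUMN_MAP` below are hypothetical; each registry's file will have its own headers, so treat this as a sketch under that assumption.

```python
import csv
import io

# Hypothetical mapping from source column headers to harvester field names.
COLUMN_MAP = {"ISIL": "isil", "Name": "name", "City": "city"}


def parse_csv_records(text, column_map=COLUMN_MAP):
    """Parse CSV text into dicts, renaming columns and stripping whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        record = {
            target: (row.get(source) or "").strip()
            for source, target in column_map.items()
        }
        if record.get("isil"):  # skip rows without an identifier
            records.append(record)
    return records
```
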
---

## Success Criteria

### Data Quality Metrics

- ✅ ISIL identifier (100%)
- ✅ Institution name (100%)
- ✅ Address/location (>80%)
- ✅ Contact info (>50%)
- ✅ Institution type (>30%)

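
One way to check these targets against a harvested batch is to compute per-field completeness. In the sketch below the field names are illustrative (the real record schema may name them differently), and a field passes when its share of non-empty values meets or exceeds the target.

```python
# Minimum share of records that must carry each field (targets from the list above).
# Field names are assumptions for illustration.
THRESHOLDS = {
    "isil": 1.00,
    "name": 1.00,
    "address": 0.80,
    "contact": 0.50,
    "institution_type": 0.30,
}


def completeness(records, fields=THRESHOLDS):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field)) / total
        for field in fields
    }


def failing_fields(records, thresholds=THRESHOLDS):
    """List the fields whose completeness falls below the target."""
    scores = completeness(records, thresholds)
    return [f for f, minimum in thresholds.items() if scores[f] < minimum]
```
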
### Coverage Metrics

- ✅ Europe: >90% of registries
- ✅ Americas: >70% of registries
- ✅ Asia-Pacific: >50% of registries
- ✅ Africa/Middle East: >30% of registries

### Performance Metrics

- Average harvest time: <5 minutes per country
- Success rate: >95%
- API compliance: 100%

---

## Next Actions

### Immediate (Today)

1. ✅ Complete Germany harvest
2. ✅ Complete Switzerland harvest (Open Data + web scraping)
3. ✅ Complete Czech Republic harvest (Z39.50 protocol)

### Short-term (This Week)

1. Austria (JavaScript scraping)
2. France (SUDOC portal)
3. Canada (Library and Archives Canada)
4. Australia (National Library)

### Medium-term (Next Week)

1. Italy (ICCU portal)
2. Belgium (Royal Library)
3. Norway (National Library)
4. Japan (National Diet Library)

---

## Resources & Tools

### Harvester Scripts

- ✅ `harvest_german_isil_sru.py` - SRU protocol harvester
- 🔄 `harvest_swiss_isil.py` - Swiss Open Data + scraper
- 🔄 `harvest_czech_isil.py` - Z39.50 protocol
- 🔄 `harvest_generic_isil.py` - Generic web scraper

### Data Formats

- PicaPlus-XML (Germany)
- MARC21 (USA, Canada)
- JSON/JSON-LD (Switzerland, Netherlands)
- CSV (Japan, Hungary)
- RDF/Turtle (Linked Data services)

### APIs & Protocols

- SRU 1.1 (Search/Retrieve via URL)
- Z39.50 (Library protocol)
- OAI-PMH (Open Archives Initiative)
- REST/JSON APIs
- Linked Data (RDF, SPARQL)

---

## Output Standards

All harvested data will be:

1. **Parsed** to consistent JSON structure
2. **Validated** against schema
3. **Enriched** with geocoding
4. **Mapped** to LinkML HeritageCustodian schema
5. **Exported** to multiple formats:
   - JSON (structured)
   - JSONL (line-delimited)
   - CSV (spreadsheet)
   - RDF/Turtle (Linked Data)
   - Parquet (data warehousing)

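
The export step can be sketched for the three text formats that need only the Python standard library. RDF/Turtle and Parquet would require third-party packages and are left out here; the function names are illustrative, not the project's actual exporter.

```python
import csv
import io
import json


def to_json(records):
    """Serialize all records as one indented JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_jsonl(records):
    """Serialize one JSON object per line (line-delimited JSON)."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records, fieldnames):
    """Serialize records as CSV; keys not listed in fieldnames are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

`ensure_ascii=False` keeps institution names with diacritics readable in the output instead of escaping them.
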
---

## License & Attribution

All data remains under original licenses:

- **Germany**: CC0 1.0 (Public Domain)
- **Switzerland**: Open Data (CC0 likely)
- **Netherlands**: Open Data
- Others: Check per-registry

Attribution will be maintained in metadata for all records.

---

**Status**: 🔄 **IN PROGRESS**
**Phase**: 1 of 4 - **90% COMPLETE**
**Completion**: 5/36 countries (14%)
**Records**: **27,053** / ~97,000 (28%)
**Last Updated**: November 19, 2025

---

*Maintained by: OpenCode + MCP Wikidata Tools*