# Global ISIL Harvest Master Plan
**Goal**: Harvest all accessible national ISIL registries worldwide
**Status**: In Progress
**Started**: November 19, 2025

---
## Priority Matrix
### Priority 1: APIs Available (High Automation Potential)
| Country | Records | API Type | Status | Priority |
|---------|---------|----------|--------|----------|
| 🇩🇪 **Germany** | 16,979 | SRU, JSON, LD | ✅ **COMPLETE** (archives pending) | ⭐⭐⭐⭐⭐ |
| 🇨🇭 **Switzerland** | 2,379 | HTML + Open Data | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇨🇿 **Czech Republic** | 8,694 | ALEPH (Z39.50) + API | ✅ **COMPLETE** | ⭐⭐⭐⭐⭐ |
| 🇳🇱 Netherlands | ~1,400 | CSV | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇧🇪 Belgium | 438 | Search | ✅ **COMPLETE** | ⭐⭐⭐⭐ |
| 🇨🇦 Canada | ~5,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇦🇺 Australia | ~4,000 | Search API | 🔄 Queue | ⭐⭐⭐⭐ |
| 🇯🇵 Japan | ~6,000 | CSV List | 🔄 Queue | ⭐⭐⭐ |
| 🇳🇴 Norway | ~1,200 | Search | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇮 Finland | ~1,000 | Search | 🔄 Queue | ⭐⭐⭐ |
### Priority 2: Scraping Required (Medium Automation)
| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇹 Austria | ~3,000 | JavaScript search | 🔄 Queue | ⭐⭐⭐ |
| 🇮🇹 Italy | ~10,000 | Search portal | 🔄 Queue | ⭐⭐⭐ |
| 🇫🇷 France | ~5,000 | SUDOC search | 🔄 Queue | ⭐⭐⭐ |
| 🇪🇸 Spain (GTB) | ~2,000 | Search list | 🔄 Queue | ⭐⭐ |
| 🇭🇺 Hungary | ~1,500 | PDF list | 🔄 Queue | ⭐⭐ |
| 🇭🇷 Croatia | ~500 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇮 Slovenia | ~300 | List | 🔄 Queue | ⭐⭐ |
| 🇸🇰 Slovakia | ~400 | List | 🔄 Queue | ⭐⭐ |
### Priority 3: Manual/Limited Access
| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇬🇧 UK | ~4,000 | ⚠️ Cyber attack | ⏸️ Wait | ⭐⭐ |
| 🇷🇺 Russia | ~3,000 | Search | 🔄 Queue | ⭐ |
| 🇺🇸 USA (LoC) | ~10,000 | MARC list | 🔄 Queue | ⭐⭐⭐ |
| 🇨🇾 Cyprus | ~50 | List | 🔄 Queue | ⭐ |
| 🇷🇴 Romania | ~1,000 | Search | 🔄 Queue | ⭐ |
| 🇧🇬 Bulgaria | ~500 | Search | 🔄 Queue | ⭐ |
| 🇪🇬 Egypt | ~300 | Search | 🔄 Queue | ⭐ |
| 🇮🇱 Israel | ~500 | List | 🔄 Queue | ⭐ |
### Priority 4: No Public Access (Contact Required)
| Country | Records | Access | Status | Priority |
|---------|---------|--------|--------|----------|
| 🇦🇷 Argentina | ~1,000 | Contact IRAM | ⏸️ | ⭐ |
| 🇧🇾 Belarus | ~500 | No API | ⏸️ | ⭐ |
| 🇮🇷 Iran | ~300 | No API | ⏸️ | ⭐ |
| 🇰🇷 South Korea | ~1,500 | No API | ⏸️ | ⭐ |
| 🇰🇿 Kazakhstan | ~200 | No API | ⏸️ | ⭐ |
| 🇲🇩 Moldova | ~100 | No API | ⏸️ | ⭐ |
| 🇳🇵 Nepal | n/a | No API | ⏸️ | ⭐ |
| 🇶🇦 Qatar | ~100 | Search | ⏸️ | ⭐ |
---
## Harvest Strategy
### Phase 1: Core European Registries (Week 1)
- ✅ **Germany (16,979)** - **COMPLETE** (archives pending DDB API)
- ✅ **Switzerland (2,379)** - **COMPLETE**
- ✅ **Czech Republic (8,694)** - **COMPLETE** (second-largest dataset harvested so far)
- ✅ **Belgium (438)** - **COMPLETE**
- ✅ **Netherlands (~1,400)** - **COMPLETE**
- 🔄 Austria (~3,000) - Next priority
- 🔄 France (~5,000) - Queue

**Target**: ~30,000 records
**Achieved**: **27,053 records** (90% of target; German archives pending)
### Phase 2: English-Speaking Countries (Week 2)
- Canada (~5,000)
- Australia (~4,000)
- USA (~10,000)
- New Zealand (~1,000)

**Target**: ~20,000 records
### Phase 3: Additional European (Week 3)
- Italy (~10,000)
- Belgium (~1,500)
- Norway (~1,200)
- Finland (~1,000)
- Denmark (✅ have data)
- Hungary (~1,500)
- Croatia (~500)

**Target**: ~16,000 records
### Phase 4: Asia & Global (Week 4)
- Japan (~6,000)
- Israel (~500)
- Russia (~3,000)
- Egypt (~300)
- Others

**Target**: ~10,000 records

---
## Total Expected Coverage
| Category | Countries | Records | Harvested | Status |
|----------|-----------|---------|-----------|--------|
| **Completed** | 5 | 27,053+ | 27,053 | ✅ 100% |
| **High Priority** | 4 | ~18,000 | 0 | ⏳ 0% |
| **Medium Priority** | 8 | ~25,000 | 0 | ⏳ 0% |
| **Low Priority** | 10 | ~15,000 | 0 | ⏳ 0% |
| **No Access** | 8 | ~5,000 | 0 | ⏸️ 0% |
| **TOTAL** | 36 | ~97,000 | 27,053 | **27.9%** |
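
The percentages in the TOTAL row follow from simple ratios. As a minimal sketch, using only the figures from that row (the variable and function names are illustrative, not part of any harvester script):

```python
# Figures copied from the TOTAL row of the coverage table.
harvested_records = 27_053
expected_records = 97_000   # approximate expected total
completed_countries = 5
total_countries = 36


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


record_coverage = percent(harvested_records, expected_records)
country_coverage = percent(completed_countries, total_countries)

print(f"Records harvested: {record_coverage}%")
print(f"Countries completed: {country_coverage}%")
```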
---
## Technical Approach
### Method 1: SRU Protocol (Germany, Czech Republic)
- Standard library protocol
- XML responses (PicaPlus, MARC)
- Batch processing
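
A batch harvest over SRU could be sketched roughly as follows. The request parameters (`version`, `operation`, `query`, `startRecord`, `maximumRecords`) are the standard SRU 1.1 `searchRetrieve` parameters; the base URL and the query string used with these helpers are placeholders, not the endpoints used by the actual harvesters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of SRU 1.1 response elements.
SRU_NS = "http://www.loc.gov/zing/srw/"


def build_sru_url(base_url, query, start_record=1, maximum_records=100):
    """Build a searchRetrieve URL for one batch of records."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start_record,       # SRU record positions are 1-based
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


def batch_starts(total_records, batch_size):
    """Return the startRecord value of every batch needed to cover the result set."""
    return list(range(1, total_records + 1, batch_size))


def count_records(response_xml):
    """Read numberOfRecords from a searchRetrieve response (0 if absent)."""
    root = ET.fromstring(response_xml)
    node = root.find(f"{{{SRU_NS}}}numberOfRecords")
    return int(node.text) if node is not None and node.text else 0
```

A harvester would typically issue one small request, read `numberOfRecords`, and then loop over `batch_starts(...)` to fetch the remaining batches.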
### Method 2: JSON/REST APIs (Switzerland, Netherlands)
- Modern web APIs
- JSON responses
- Pagination
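
Pagination over a JSON API can be kept independent of any particular registry by passing in the function that fetches one page. The sketch below assumes offset/limit paging; `fetch_page` is a hypothetical callable, and real APIs may page by cursor or page number instead.

```python
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int, int], list],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield records page by page until the source is exhausted.

    fetch_page(offset, limit) must return a list of at most `limit` records.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break                      # empty page: nothing left
        yield from page
        if len(page) < page_size:
            break                      # short page: this was the last one
        offset += page_size
```

In a real harvester, `fetch_page` would wrap the HTTP request and JSON decoding for one registry.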
### Method 3: Web Scraping (Austria, Italy, France)
- Playwright browser automation
- JavaScript rendering
- Rate limiting
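
The rate-limiting part of this method can be sketched without any browser dependency. The `Throttle` class below is a hypothetical helper that enforces a minimum interval between requests; the Playwright page loads themselves are left out, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each page load keeps the scraper at or below one request per `min_interval` seconds.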
### Method 4: Bulk Downloads (Japan, USA)
- CSV/Excel files
- MARC records
- FTP/HTTP downloads
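
Once a bulk file has been downloaded, parsing is usually the simplest step. The following is a minimal sketch for a CSV list; the column names `isil` and `name` are assumptions, since each registry publishes its own headers.

```python
import csv
import io


def parse_isil_csv(text, isil_column="isil", name_column="name"):
    """Parse CSV text into a list of {"isil", "name"} records.

    Rows without an ISIL identifier are skipped.
    """
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        isil = (row.get(isil_column) or "").strip()
        if not isil:
            continue  # no identifier: not usable as a registry record
        name = (row.get(name_column) or "").strip()
        records.append({"isil": isil, "name": name})
    return records
```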
---
## Success Criteria
### Data Quality Metrics
- ✅ ISIL identifier (100%)
- ✅ Institution name (100%)
- ✅ Address/location (>80%)
- ✅ Contact info (>50%)
- ✅ Institution type (>30%)
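
These thresholds can be checked mechanically after each harvest. The sketch below assumes records are dictionaries and uses hypothetical field names (`isil`, `name`, `address`, `contact`, `institution_type`); only the threshold values come from the list above.

```python
# Minimum share of records that must carry each field.
THRESHOLDS = {
    "isil": 1.00,
    "name": 1.00,
    "address": 0.80,
    "contact": 0.50,
    "institution_type": 0.30,
}


def completeness(records, field):
    """Share of records (0.0 to 1.0) with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return filled / len(records)


def quality_report(records):
    """Map each field to a (share, passed) tuple."""
    report = {}
    for field, minimum in THRESHOLDS.items():
        share = completeness(records, field)
        # 100% targets must be met fully; the others are strict ">" targets.
        passed = share >= 1.0 if minimum >= 1.0 else share > minimum
        report[field] = (share, passed)
    return report
```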
### Coverage Metrics
- ✅ Europe: >90% of registries
- ✅ Americas: >70% of registries
- ✅ Asia-Pacific: >50% of registries
- ✅ Africa/Middle East: >30% of registries
### Performance Metrics
- Average harvest time: <5 minutes per country
- Success rate: >95%
- API compliance: 100%
---
## Next Actions
### Immediate (Today)
1. ✅ Complete Germany harvest
2. ✅ Complete Switzerland harvest (Open Data + web scraping)
3. ✅ Complete Czech Republic harvest (Z39.50 protocol)
### Short-term (This Week)
1. Austria (JavaScript scraping)
2. France (SUDOC portal)
3. Canada (Library and Archives Canada)
4. Australia (National Library)
### Medium-term (Next Week)
1. Italy (ICCU portal)
2. Belgium (Royal Library)
3. Norway (National Library)
4. Japan (National Diet Library)
---
## Resources & Tools
### Harvester Scripts
- ✅ `harvest_german_isil_sru.py` - SRU protocol harvester
- ✅ `harvest_swiss_isil.py` - Swiss Open Data + scraper
- ✅ `harvest_czech_isil.py` - Z39.50 protocol
- 🔄 `harvest_generic_isil.py` - Generic web scraper
### Data Formats
- PicaPlus-XML (Germany)
- MARC21 (USA, Canada)
- JSON/JSON-LD (Switzerland, Netherlands)
- CSV (Japan, Hungary)
- RDF/Turtle (Linked Data services)
### APIs & Protocols
- SRU 1.1 (Search/Retrieve via URL)
- Z39.50 (Library protocol)
- OAI-PMH (Open Archives Initiative)
- REST/JSON APIs
- Linked Data (RDF, SPARQL)
---
## Output Standards
All harvested data will be:

1. **Parsed** to a consistent JSON structure
2. **Validated** against a schema
3. **Enriched** with geocoding
4. **Mapped** to the LinkML HeritageCustodian schema
5. **Exported** to multiple formats:
   - JSON (structured)
   - JSONL (line-delimited)
   - CSV (spreadsheet)
   - RDF/Turtle (Linked Data)
   - Parquet (data warehousing)
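
Two of the export formats need nothing beyond the standard library. As a rough sketch, assuming each parsed record is a flat dictionary (the field names passed in are illustrative, and the RDF/Turtle and Parquet exports would need additional libraries not shown here):

```python
import csv
import io
import json


def to_jsonl(records):
    """Serialize records as line-delimited JSON, one object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records, fieldnames):
    """Serialize records as CSV with a header row.

    Missing fields are written as empty cells; extra fields are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()
```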
---
## License & Attribution
All data remains under its original license:

- **Germany**: CC0 1.0 (Public Domain)
- **Switzerland**: Open Data (CC0 likely)
- **Netherlands**: Open Data
- Others: Check per registry

Attribution will be maintained in metadata for all records.

---
**Status**: 🔄 **IN PROGRESS**
**Phase**: 1 of 4 - **90% COMPLETE**
**Completion**: 5/36 countries (14%)
**Records**: **27,053** / ~97,000 (28%)
**Last Updated**: November 19, 2025

---
*Maintained by: OpenCode + MCP Wikidata Tools*