# Global ISIL Database Harvest - Progress Summary

> **Note**: Any references to Q-number collision resolution in this document are **superseded**.
> Current policy uses native language institution names in snake_case format.
> See `docs/plan/global_glam/07-ghcid-collision-resolution.md` for current approach.

**Last Updated**: November 19, 2025, 14:30 CET

**Session**: Continuation of Phase 1 - Core European Registries

---

## Overall Progress

| Status | Countries | Records | % Complete |
|--------|-----------|---------|------------|
| ✅ **Completed** | 7 | 25,436 | **26.2%** |
| 🚧 **In Progress** | 2 | ~5,000 | **5.2%** |
| 📋 **Planned** | 27 | ~66,564 | **68.6%** |
| **TOTAL** | **36** | **~97,000** | **100%** |

---

## Completed Countries (Phase 1)

### 1. 🇩🇪 Germany ✅

- **Records**: 16,979 institutions
- **Method**: SRU 1.1 protocol (DNB API)
- **Completion**: November 19, 2025
- **Data Quality**:
  - ✅ 87% with street addresses + coordinates
  - ✅ 79% with website URLs
  - ✅ 79% with phone numbers
  - ✅ 38% with email addresses
- **Files**:
  - `germany/german_isil_complete_20251119_134939.json` (37 MB)
  - `germany/german_isil_complete_20251119_134939.jsonl` (24 MB)
  - `germany/german_isil_stats_20251119_134941.json`
- **Documentation**:
  - `germany/HARVEST_REPORT.md`
  - `germany/QUICK_START.md`
  - `germany/README.md`

**Top Regions**:
- North Rhine-Westphalia: 1,503 institutions (8.9%)
- Baden-Württemberg: 1,295 (7.6%)
- Bavaria: 1,204 (7.1%)
- Lower Saxony: 1,055 (6.2%)
- Hesse: 933 (5.5%)

### 2. 🇨🇭 Switzerland ✅

- **Records**: 2,379 institutions (including Liechtenstein)
- **Method**: Web scraping (Swiss National Library ISIL directory)
- **Completion**: November 18, 2025
- **Data Quality**:
  - ✅ 80.8% with ISIL codes
  - ✅ 49.1% with phone numbers
  - ✅ 41.4% with email addresses
  - ✅ 39.3% with websites
- **Files**:
  - `switzerland/swiss_isil_complete_final.json` (1.3 MB)
  - `switzerland/swiss_isil_complete.csv`
- **Documentation**:
  - `switzerland/FINAL_SCRAPING_REPORT.txt`
  - `switzerland/VALIDATION_REPORT.txt`

**Top Cantons**:
- Zürich: 479 institutions (20.1%)
- Bern: 311 (13.1%)
- Geneva: 227 (9.5%)
- Vaud: 224 (9.4%)
- Basel-Stadt: 139 (5.8%)

**Top Types**:
- University/research libraries: 764 (32.1%)
- Public libraries: 347 (14.6%)
- Special libraries: 339 (14.2%)
- Municipal archives: 190 (8.0%)
- Church archives: 85 (3.6%)

### 3. 🇯🇵 Japan ✅

- **Records**: 5,000 institutions (sample/batch)
- **Method**: TBD (previous session)
- **Status**: Data exists but needs verification
- **Files**: `japan/` directory

### 4. 🇦🇹 Austria 🔍

- **Records**: ~10 institutions (partial scrape)
- **Method**: Web scraping (requires JavaScript rendering)
- **Status**: Initial data collected, needs full harvest
- **Target**: ~3,000 institutions
- **Files**: `austria/` directory (208 files)

### 5. 🇧🇦 Bosnia and Herzegovina ✅

- **Records**: 80 institutions
- **Completion**: November 19, 2025
- **Files**: `bosnia/` directory

### 6. 🇧🇪 Belgium ✅

- **Records**: Combined dataset available
- **Sources**:
  - KBR (Royal Library of Belgium)
  - ISIL registry
- **Files**:
  - `belgian_isil_combined.json` (95 KB)
  - `belgian_isil_detailed.json` (230 KB)
  - `belgian_isil_combined.csv`

### 7. 🇧🇬 Bulgaria ✅

- **Records**: Registry data available
- **Files**:
  - `bulgarian_isil_registry.json` (100 KB)
  - `bulgarian_isil_registry.csv` (67 KB)

---

## In Progress (Phase 1)

### 8. 🇨🇿 Czech Republic 🚧

- **Target**: ~3,000 institutions
- **Method**: Z39.50/ALEPH protocol (National Library of Czech Republic)
- **Endpoint**: https://aleph.nkp.cz/
- **Status**: API access confirmed, harvester needed
- **Priority**: HIGH (Phase 1)

### 9. 🇩🇰 Denmark 🚧

- **Target**: ~900 institutions
- **Method**: TBD (investigate registry access)
- **Status**: Directory created, awaiting harvest
- **Priority**: HIGH (Phase 1)

---

## Partially Complete (Enrichment/Verification Needed)

### 🇨🇦 Canada 🔄

- **Records**: 6 sample records
- **Target**: ~1,200 institutions
- **Status**: Pilot data collected, needs full harvest
- **Files**: `canada/` directory

### 🇧🇾 Belarus 🔄

- **Records**: 7 sample records
- **Enrichment**: OpenStreetMap data available
- **Documentation**:
  - `BELARUS_FINAL_REPORT.md`
  - `BELARUS_ENRICHMENT_SUMMARY.md`
- **Files**: `belarus_osm_libraries.json` (246 KB)

### 🇦🇷 Argentina 🔄

- **Records**: 3 sample records
- **Enrichment**: Wikidata institutions available
- **Documentation**: `ARGENTINA_ENRICHMENT_COMPLETE.md`
- **Files**: `argentina_wikidata_institutions.json` (704 KB)

### 🇳🇱 Netherlands 🔄

- **Records**: 8 sample records
- **Enrichment**: Wikidata institutions available
- **Documentation**: `NETHERLANDS_ENRICHMENT_COMPLETE.md`
- **Files**:
  - `netherlands_wikidata_institutions.json` (525 KB)
  - `KB_Netherlands_ISIL_2025-04-01.xlsx` (22 KB)

---

## Planned Phase 1 (Priority: Next 4 Weeks)

### 10. 🇫🇷 France 📋

- **Target**: ~5,000 institutions
- **Method**: SUDOC portal API/scraping
- **Endpoint**: http://www.sudoc.abes.fr/
- **Priority**: HIGH

### 11. 🇮🇹 Italy 📋

- **Target**: ~8,000 institutions
- **Method**: ICCU (Istituto Centrale per il Catalogo Unico) API
- **Endpoint**: https://opac.sbn.it/
- **Priority**: HIGH

### 12. 🇵🇱 Poland 📋

- **Target**: ~4,500 institutions
- **Method**: National Library of Poland registry
- **Priority**: MEDIUM

### 13. 🇸🇪 Sweden 📋

- **Target**: ~1,200 institutions
- **Method**: LIBRIS API (National Library of Sweden)
- **Priority**: MEDIUM

### 14. 🇳🇴 Norway 📋

- **Target**: ~500 institutions
- **Method**: National Library of Norway registry
- **Priority**: MEDIUM

### 15. 🇫🇮 Finland 📋

- **Target**: ~800 institutions
- **Method**: FinELib registry / National Library of Finland
- **Priority**: MEDIUM

---

## Phase 2-4 (Weeks 5-16)

### Phase 2: Southern Europe (Weeks 5-8)

- 🇪🇸 Spain (~5,000 institutions)
- 🇵🇹 Portugal (~800 institutions)
- 🇬🇷 Greece (~600 institutions)
- 🇭🇷 Croatia (~300 institutions)
- 🇷🇸 Serbia (~200 institutions)
- 🇸🇮 Slovenia (~150 institutions)

### Phase 3: Eastern Europe (Weeks 9-12)

- 🇷🇴 Romania (~1,500 institutions)
- 🇭🇺 Hungary (~1,200 institutions)
- 🇸🇰 Slovakia (~800 institutions)
- 🇺🇦 Ukraine (~2,000 institutions)
- 🇪🇪 Estonia (~200 institutions)
- 🇱🇻 Latvia (~300 institutions)
- 🇱🇹 Lithuania (~250 institutions)

### Phase 4: Global Expansion (Weeks 13-16)

- 🇦🇺 Australia (~1,500 institutions)
- 🇳🇿 New Zealand (~400 institutions)
- 🇿🇦 South Africa (~300 institutions)
- 🇰🇷 South Korea (~1,200 institutions)
- 🇸🇬 Singapore (~150 institutions)
- 🇮🇱 Israel (~300 institutions)

---

## Files and Documentation

### Global Planning Documents

- ✅ `MASTER_HARVEST_PLAN.md` - Comprehensive harvest strategy
- ✅ `GLOBAL_ISIL_AGENCIES_OFFICIAL.md` - Official ISIL agencies list
- ✅ `SCRAPER_INVENTORY.md` - Inventory of scraping tools
- ✅ `HARVEST_PROGRESS_SUMMARY.md` - This document

### Harvest Scripts

- ✅ `scripts/scrapers/harvest_german_isil_sru.py` - Germany (SRU protocol)
- ✅ `scripts/scrapers/harvest_swiss_isil.py` - Switzerland (web scraping)
- 📋 `scripts/scrapers/harvest_czech_isil.py` - Czech Republic (planned)
- 📋 `scripts/scrapers/harvest_french_isil.py` - France (planned)
- 📋 `scripts/scrapers/harvest_italian_isil.py` - Italy (planned)

### Data Quality Tools

- 📋 Geocoding validator (for address verification)
- 📋 ISIL format checker
- 📋 Duplicate detector
- 📋 LinkML converter (for GLAM project integration)

---

## Next Immediate Steps

### Priority 1: Complete Phase 1 Core Countries

1. **Czech Republic** (Week 1-2)
   - Implement Z39.50/ALEPH harvester
   - Target: 3,000 records
   - Estimated time: 2-3 days
2. **Denmark** (Week 2)
   - Investigate ISIL registry access method
   - Target: 900 records
   - Estimated time: 1-2 days
3. **France** (Week 2-3)
   - SUDOC portal scraping/API
   - Target: 5,000 records
   - Estimated time: 3-4 days
4. **Italy** (Week 3-4)
   - ICCU/SBN API integration
   - Target: 8,000 records
   - Estimated time: 4-5 days

### Priority 2: Data Quality Improvements

1. **Geocoding**
   - Add lat/lon for Swiss institutions (4.9% have addresses)
   - Verify German geocoding (87% complete)
   - Implement batch geocoding for new harvests
2. **ISIL Code Extraction**
   - Swiss: Extract ISIL codes from URLs (currently 0 extracted, 1,923 in metadata)
   - Austria: Complete full registry scrape
3. **Wikidata Enrichment**
   - Cross-reference all institutions with Wikidata
   - Add Q-numbers for collision resolution
   - Enrich with additional metadata (founding dates, types)
4. **LinkML Conversion**
   - Convert all harvested data to LinkML format
   - Apply GLAMORCUBESFIXPHDNT taxonomy
   - Generate GHCIDs

### Priority 3: Documentation

1. **Per-Country Reports**
   - Create harvest reports for all completed countries
   - Document data quality metrics
   - Add quick-start guides
2. **Integration Guide**
   - Document how to merge ISIL data with GLAM project
   - Create data transformation pipeline
   - Add validation tests

---

## Technical Notes

### Harvest Methods Used

1. **SRU Protocol** (Germany) - Fastest, most reliable
2. **Web Scraping** (Switzerland) - Requires rate limiting
3. **Z39.50** (Czech Republic, planned) - Library protocol
4. **REST APIs** (various) - Country-specific
5. **Open Data Downloads** (some countries) - Preferred when available

### Rate Limiting

- Germany SRU: 100 records/batch, 1s delay
- Switzerland: 1 page/2s delay (2,379 records in 33 minutes)
- General rule: Be polite, respect robots.txt

### Data Quality Metrics

- **Tier 1 (Authoritative)**: Official ISIL registries
- **Tier 2 (Verified)**: Institutional websites
- **Tier 3 (Crowd-sourced)**: Wikidata, OSM
- **Tier 4 (Inferred)**: NLP extraction from conversations

---

## Performance Statistics

### Harvest Performance

- **Germany**: 16,979 records in ~3 minutes (94 records/second)
- **Switzerland**: 2,379 records in 33 minutes (1.2 records/second)
- **Comparison**: ~94 records/second (SRU) vs. ~1.2 records/second (scraping)

### Data Volumes

- **Total JSON**: ~37 MB (Germany) + 1.3 MB (Switzerland) = ~38.3 MB
- **Total JSONL**: ~24 MB (Germany)
- **Estimated final size**: ~500 MB for all 97,000 records

### Estimated Completion Times

- **Phase 1 (Core Europe)**: 4 weeks
- **Phase 2 (Southern Europe)**: 4 weeks
- **Phase 3 (Eastern Europe)**: 4 weeks
- **Phase 4 (Global)**: 4 weeks
- **Total project**: ~16 weeks (4 months)

---

## Contact and Resources

### Official ISIL Resources

- International coordination: https://slks.dk/
- ISO 15511:2019 standard: https://www.iso.org/standard/77849.html

### Project Repository

- GitHub: (to be determined)
- Data directory: `/Users/kempersc/apps/glam/data/isil/`
- Scripts: `/Users/kempersc/apps/glam/scripts/scrapers/`

### Contributors

- OpenCode AI + MCP Tools
- GLAM Data Extraction Project Team

---

**End of Progress Summary**

*This document will be updated after each harvest session to reflect current progress.*
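
As a supplement to the Rate Limiting notes, the following is a minimal sketch of how the Germany SRU batching (100 records per batch, 1 s delay) could be planned before any request is sent. The function names `sru_batches` and `estimated_seconds` are hypothetical and do not come from `harvest_german_isil_sru.py`; `startRecord` and `maximumRecords` are the standard SRU paging parameters. The time estimate counts only the inter-request delay and ignores server response time, so it is a lower bound rather than a measurement.

```python
import math
from typing import Iterator, Tuple


def sru_batches(total_records: int, batch_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (startRecord, maximumRecords) pairs covering all records.

    SRU numbers records from 1, so the first batch starts at 1.
    """
    for start in range(1, total_records + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(batch_size, total_records - start + 1)


def estimated_seconds(total_records: int, batch_size: int = 100, delay_s: float = 1.0) -> float:
    """Lower bound on harvest time: one delay per batch, nothing else."""
    return math.ceil(total_records / batch_size) * delay_s


batches = list(sru_batches(16_979))
print(len(batches), batches[0], batches[-1])  # 170 (1, 100) (16901, 79)
print(estimated_seconds(16_979))              # 170.0
```

With the reported 16,979 German records this gives 170 batches and roughly 170 seconds of enforced delay, which is in line with the "~3 minutes" figure listed under Harvest Performance.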