glam/data/isil/HARVEST_PROGRESS_SUMMARY.md
2025-11-30 23:30:29 +01:00

11 KiB

Global ISIL Database Harvest - Progress Summary

Note

: Any references to Q-number collision resolution in this document are superseded. Current policy uses native language institution names in snake_case format. See docs/plan/global_glam/07-ghcid-collision-resolution.md for current approach.

Last Updated: November 19, 2025, 14:30 CET
Session: Continuation of Phase 1 - Core European Registries


Overall Progress

Status Countries Records % Complete
Completed 7 25,436 26.2%
🚧 In Progress 2 ~5,000 5.2%
📋 Planned 27 ~66,564 68.6%
TOTAL 36 ~97,000 100%

Completed Countries (Phase 1)

1. 🇩🇪 Germany

  • Records: 16,979 institutions
  • Method: SRU 1.1 protocol (DNB API)
  • Completion: November 19, 2025
  • Data Quality:
    • 87% with street addresses + coordinates
    • 79% with website URLs
    • 79% with phone numbers
    • 38% with email addresses
  • Files:
    • germany/german_isil_complete_20251119_134939.json (37 MB)
    • germany/german_isil_complete_20251119_134939.jsonl (24 MB)
    • germany/german_isil_stats_20251119_134941.json
  • Documentation:
    • germany/HARVEST_REPORT.md
    • germany/QUICK_START.md
    • germany/README.md

Top Regions:

  • North Rhine-Westphalia: 1,503 institutions (8.9%)
  • Baden-Württemberg: 1,295 (7.6%)
  • Bavaria: 1,204 (7.1%)
  • Lower Saxony: 1,055 (6.2%)
  • Hesse: 933 (5.5%)

2. 🇨🇭 Switzerland

  • Records: 2,379 institutions (including Liechtenstein)
  • Method: Web scraping (Swiss National Library ISIL directory)
  • Completion: November 18, 2025
  • Data Quality:
    • 80.8% with ISIL codes
    • 49.1% with phone numbers
    • 41.4% with email addresses
    • 39.3% with websites
  • Files:
    • switzerland/swiss_isil_complete_final.json (1.3 MB)
    • switzerland/swiss_isil_complete.csv
  • Documentation:
    • switzerland/FINAL_SCRAPING_REPORT.txt
    • switzerland/VALIDATION_REPORT.txt

Top Cantons:

  • Zürich: 479 institutions (20.1%)
  • Bern: 311 (13.1%)
  • Geneva: 227 (9.5%)
  • Vaud: 224 (9.4%)
  • Basel-Stadt: 139 (5.8%)

Top Types:

  • University/research libraries: 764 (32.1%)
  • Public libraries: 347 (14.6%)
  • Special libraries: 339 (14.2%)
  • Municipal archives: 190 (8.0%)
  • Church archives: 85 (3.6%)

3. 🇯🇵 Japan

  • Records: 5,000 institutions (sample/batch)
  • Method: TBD (previous session)
  • Status: Data exists but needs verification
  • Files: japan/ directory

4. 🇦🇹 Austria 🔍

  • Records: ~10 institutions (partial scrape)
  • Method: Web scraping (requires JavaScript rendering)
  • Status: Initial data collected, needs full harvest
  • Target: ~3,000 institutions
  • Files: austria/ directory (208 files)

5. 🇧🇦 Bosnia and Herzegovina

  • Records: 80 institutions
  • Completion: November 19, 2025
  • Files: bosnia/ directory

6. 🇧🇪 Belgium

  • Records: Combined dataset available
  • Sources:
    • KBR (Royal Library of Belgium)
    • ISIL registry
  • Files:
    • belgian_isil_combined.json (95 KB)
    • belgian_isil_detailed.json (230 KB)
    • belgian_isil_combined.csv

7. 🇧🇬 Bulgaria

  • Records: Registry data available
  • Files:
    • bulgarian_isil_registry.json (100 KB)
    • bulgarian_isil_registry.csv (67 KB)

In Progress (Phase 1)

8. 🇨🇿 Czech Republic 🚧

  • Target: ~3,000 institutions
  • Method: Z39.50/ALEPH protocol (National Library of Czech Republic)
  • Endpoint: https://aleph.nkp.cz/
  • Status: API access confirmed, harvester needed
  • Priority: HIGH (Phase 1)

9. 🇩🇰 Denmark 🚧

  • Target: ~900 institutions
  • Method: TBD (investigate registry access)
  • Status: Directory created, awaiting harvest
  • Priority: HIGH (Phase 1)

Partially Complete (Enrichment/Verification Needed)

🇨🇦 Canada 🔄

  • Records: 6 sample records
  • Target: ~1,200 institutions
  • Status: Pilot data collected, needs full harvest
  • Files: canada/ directory

🇧🇾 Belarus 🔄

  • Records: 7 sample records
  • Enrichment: OpenStreetMap data available
  • Documentation:
    • BELARUS_FINAL_REPORT.md
    • BELARUS_ENRICHMENT_SUMMARY.md
  • Files: belarus_osm_libraries.json (246 KB)

🇦🇷 Argentina 🔄

  • Records: 3 sample records
  • Enrichment: Wikidata institutions available
  • Documentation: ARGENTINA_ENRICHMENT_COMPLETE.md
  • Files: argentina_wikidata_institutions.json (704 KB)

🇳🇱 Netherlands 🔄

  • Records: 8 sample records
  • Enrichment: Wikidata institutions available
  • Documentation: NETHERLANDS_ENRICHMENT_COMPLETE.md
  • Files:
    • netherlands_wikidata_institutions.json (525 KB)
    • KB_Netherlands_ISIL_2025-04-01.xlsx (22 KB)

Planned Phase 1 (Priority: Next 4 Weeks)

10. 🇫🇷 France 📋

11. 🇮🇹 Italy 📋

  • Target: ~8,000 institutions
  • Method: ICCU (Istituto Centrale per il Catalogo Unico) API
  • Endpoint: https://opac.sbn.it/
  • Priority: HIGH

12. 🇵🇱 Poland 📋

  • Target: ~4,500 institutions
  • Method: National Library of Poland registry
  • Priority: MEDIUM

13. 🇸🇪 Sweden 📋

  • Target: ~1,200 institutions
  • Method: LIBRIS API (National Library of Sweden)
  • Priority: MEDIUM

14. 🇳🇴 Norway 📋

  • Target: ~500 institutions
  • Method: National Library of Norway registry
  • Priority: MEDIUM

15. 🇫🇮 Finland 📋

  • Target: ~800 institutions
  • Method: FinELib registry / National Library of Finland
  • Priority: MEDIUM

Phase 2-4 (Weeks 5-16)

Phase 2: Southern Europe (Weeks 5-8)

  • 🇪🇸 Spain (~5,000 institutions)
  • 🇵🇹 Portugal (~800 institutions)
  • 🇬🇷 Greece (~600 institutions)
  • 🇭🇷 Croatia (~300 institutions)
  • 🇷🇸 Serbia (~200 institutions)
  • 🇸🇮 Slovenia (~150 institutions)

Phase 3: Eastern Europe (Weeks 9-12)

  • 🇷🇴 Romania (~1,500 institutions)
  • 🇭🇺 Hungary (~1,200 institutions)
  • 🇸🇰 Slovakia (~800 institutions)
  • 🇺🇦 Ukraine (~2,000 institutions)
  • 🇪🇪 Estonia (~200 institutions)
  • 🇱🇻 Latvia (~300 institutions)
  • 🇱🇹 Lithuania (~250 institutions)

Phase 4: Global Expansion (Weeks 13-16)

  • 🇦🇺 Australia (~1,500 institutions)
  • 🇳🇿 New Zealand (~400 institutions)
  • 🇿🇦 South Africa (~300 institutions)
  • 🇰🇷 South Korea (~1,200 institutions)
  • 🇸🇬 Singapore (~150 institutions)
  • 🇮🇱 Israel (~300 institutions)

Files and Documentation

Global Planning Documents

  • MASTER_HARVEST_PLAN.md - Comprehensive harvest strategy
  • GLOBAL_ISIL_AGENCIES_OFFICIAL.md - Official ISIL agencies list
  • SCRAPER_INVENTORY.md - Inventory of scraping tools
  • HARVEST_PROGRESS_SUMMARY.md - This document

Harvest Scripts

  • scripts/scrapers/harvest_german_isil_sru.py - Germany (SRU protocol)
  • scripts/scrapers/harvest_swiss_isil.py - Switzerland (web scraping)
  • 📋 scripts/scrapers/harvest_czech_isil.py - Czech Republic (planned)
  • 📋 scripts/scrapers/harvest_french_isil.py - France (planned)
  • 📋 scripts/scrapers/harvest_italian_isil.py - Italy (planned)

Data Quality Tools

  • 📋 Geocoding validator (for address verification)
  • 📋 ISIL format checker
  • 📋 Duplicate detector
  • 📋 LinkML converter (for GLAM project integration)

Next Immediate Steps

Priority 1: Complete Phase 1 Core Countries

  1. Czech Republic (Week 1-2)

    • Implement Z39.50/ALEPH harvester
    • Target: 3,000 records
    • Estimated time: 2-3 days
  2. Denmark (Week 2)

    • Investigate ISIL registry access method
    • Target: 900 records
    • Estimated time: 1-2 days
  3. France (Week 2-3)

    • SUDOC portal scraping/API
    • Target: 5,000 records
    • Estimated time: 3-4 days
  4. Italy (Week 3-4)

    • ICCU/SBN API integration
    • Target: 8,000 records
    • Estimated time: 4-5 days

Priority 2: Data Quality Improvements

  1. Geocoding

    • Add lat/lon for Swiss institutions (4.9% have addresses)
    • Verify German geocoding (87% complete)
    • Implement batch geocoding for new harvests
  2. ISIL Code Extraction

    • Swiss: Extract ISIL codes from URLs (currently 0 extracted, 1,923 in metadata)
    • Austria: Complete full registry scrape
  3. Wikidata Enrichment

    • Cross-reference all institutions with Wikidata
    • Add Q-numbers for collision resolution
    • Enrich with additional metadata (founding dates, types)
  4. LinkML Conversion

    • Convert all harvested data to LinkML format
    • Apply GLAMORCUBESFIXPHDNT taxonomy
    • Generate GHCIDs

Priority 3: Documentation

  1. Per-Country Reports

    • Create harvest reports for all completed countries
    • Document data quality metrics
    • Add quick-start guides
  2. Integration Guide

    • Document how to merge ISIL data with GLAM project
    • Create data transformation pipeline
    • Add validation tests

Technical Notes

Harvest Methods Used

  1. SRU Protocol (Germany) - Fastest, most reliable
  2. Web Scraping (Switzerland) - Requires rate limiting
  3. Z39.50 (Czech Republic, planned) - Library protocol
  4. REST APIs (various) - Country-specific
  5. Open Data Downloads (some countries) - Preferred when available

Rate Limiting

  • Germany SRU: 100 records/batch, 1s delay
  • Switzerland: 1 page/2s delay (2,379 records in 33 minutes)
  • General rule: Be polite, respect robots.txt

Data Quality Metrics

  • Tier 1 (Authoritative): Official ISIL registries
  • Tier 2 (Verified): Institutional websites
  • Tier 3 (Crowd-sourced): Wikidata, OSM
  • Tier 4 (Inferred): NLP extraction from conversations

Performance Statistics

Harvest Performance

  • Germany: 16,979 records in ~3 minutes (94 records/second)
  • Switzerland: 2,379 records in 33 minutes (1.2 records/second)
  • Average: ~47 records/second (SRU) vs. 1 record/second (scraping)

Data Volumes

  • Total JSON: ~40 MB (Germany) + 1.3 MB (Switzerland) = 41.3 MB
  • Total JSONL: ~24 MB (Germany)
  • Estimated final size: ~500 MB for all 97,000 records

Estimated Completion Times

  • Phase 1 (Core Europe): 4 weeks
  • Phase 2 (Southern Europe): 4 weeks
  • Phase 3 (Eastern Europe): 4 weeks
  • Phase 4 (Global): 4 weeks
  • Total project: ~16 weeks (4 months)

Contact and Resources

Official ISIL Resources

Project Repository

  • GitHub: (to be determined)
  • Data directory: /Users/kempersc/apps/glam/data/isil/
  • Scripts: /Users/kempersc/apps/glam/scripts/scrapers/

Contributors

  • OpenCode AI + MCP Tools
  • GLAM Data Extraction Project Team

End of Progress Summary

This document will be updated after each harvest session to reflect current progress.