glam/docs/sessions/SESSION_2025-11-17_ISIL_GLOBAL_MAPPING.md
2025-11-19 23:25:22 +01:00


Session Summary: Global ISIL Registry Mapping

Date: 2025-11-17
Focus: Mapping all global ISIL registries and preparing scraper inventory


Accomplishments Today

1. Completed Extractions

Belarus ISIL Registry:

  • 167 institutions extracted
  • Files: /data/isil/BY/belarus_isil_all.csv and .json

KB Netherlands Public Libraries:

  • 153 institutions extracted from Excel file (dated 2025-04-01)
  • Source: https://www.bibliotheeknetwerk.nl/
  • Parser: scripts/scrapers/parse_kb_netherlands_isil.py
  • Files: /data/isil/NL/kb_netherlands_public_libraries.csv and .json

Running totals, including the previous session:

  • Belgium (KBR): 438 institutions
  • Netherlands: 153 public libraries
  • Belarus: 167 institutions

Major Discovery: Official ISIL Registration Authority Directory

Found the authoritative global directory of ISIL agencies, maintained by the ISIL Registration Authority in Denmark:

Source: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil

Summary Statistics:

  • 41 national agencies listed
  • 30+ countries (73%) with searchable/accessible registries
  • 8 countries without public access (AR, GL, IR, KZ, MD, NP, partial KR)
  • 5 non-national agencies (EUR, GTB, OCLC, ZDB)

Documentation Created: /data/isil/GLOBAL_ISIL_AGENCIES_OFFICIAL.md


Key Findings

Denmark Registry Structure

  • Direct CSV/Excel exports available at https://vip.dbc.dk/librarylists/
  • Categories: Public libraries (main + branches), Research libraries (main + branches)
  • Public interface documented but extraction not yet performed

Japan Registry Structure

  • Comprehensive Excel/CSV downloads with English translations
  • 4 institutional types separated:
    1. Public libraries (~4,000 estimated)
    2. NDL + university + special libraries (~2,000 estimated)
    3. Museums (newly added Oct 2022)
    4. Archives (newly added Oct 2022)
  • Files dated 2025-11-05 (very recent!)
  • Source: http://www.ndl.go.jp/en/library/isil/index.html

Registry Access Status

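Tier 1: Direct CSV/Excel downloads (highest priority):

  • 🇩🇰 Denmark - https://vip.dbc.dk/librarylists/
  • 🇯🇵 Japan - National Diet Library (Excel/CSV downloads)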

Tier 2: Searchable interfaces (scraping required):

  • 🇺🇸 USA - Library of Congress MARC Organizations (thousands)
  • 🇩🇪 Germany - Staatsbibliothek zu Berlin Sigel (~10,000)
  • 🇫🇷 France - SUDOC
  • 🇮🇹 Italy - ICCU Anagrafe
  • 🇨🇦 Canada - Library and Archives Canada
  • 🇦🇺 Australia - National Library
  • 🇨🇭 Switzerland - Swiss National Library
  • 🇦🇹 Austria - Austrian Library Network
  • Plus 15+ more countries

Tier 3: Unknown/Contact required:

  • 🇦🇷 Argentina, 🇬🇱 Greenland, 🇮🇷 Iran, 🇰🇿 Kazakhstan, 🇲🇩 Moldova, 🇳🇵 Nepal

Blocked:

  • 🇬🇧 UK - British Library offline due to cyber attack (Oct 2023)

Next Steps (Priority Order)

Priority 1: Easy Wins (Direct Downloads)

  1. Denmark - Execute scraper for https://vip.dbc.dk/librarylists/

    • Expected: 500+ public libraries, 200+ research libraries
    • Method: Direct CSV download + parsing
  2. Japan - Download all 4 Excel files

    • Expected: 6,000+ institutions total
    • Method: Direct Excel download + parsing (similar to KB Netherlands)
    • Files available: Libraries (2 files), Museums, Archives, Deleted ISIL

Priority 2: Large Datasets (Scraping Required)

  1. USA - Library of Congress

  2. Germany - Staatsbibliothek zu Berlin

  3. France - SUDOC

Priority 3: Medium Datasets

  1. Canada - https://sigles-symbols.bac-lac.gc.ca/eng/Search
  2. Australia - http://www.nla.gov.au/apps/ilrs
  3. Switzerland - https://www.isil.nb.admin.ch/en/
  4. Austria - https://www.isil.at/
  5. Czech Republic - https://aleph.nkp.cz/...

Files Created/Modified

New Documents:

  1. /data/isil/GLOBAL_ISIL_AGENCIES_OFFICIAL.md - Comprehensive directory of 41 national agencies
  2. /docs/sessions/SESSION_2025-11-17_ISIL_GLOBAL_MAPPING.md - This summary

New Parsers:

  1. /scripts/scrapers/parse_kb_netherlands_isil.py - Excel parser for KB Netherlands

New Data Files:

  1. /data/isil/BY/belarus_isil_all.csv + .json (167 institutions)
  2. /data/isil/NL/kb_netherlands_public_libraries.csv + .json (153 institutions)
  3. /data/isil/KB_Netherlands_ISIL_2025-04-01.xlsx (source file)

Technical Notes

Parser Design Patterns

Excel Parsing (KB Netherlands, Japan):

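import openpyxl

# excel_file, header_row and data_start_row are set per registry;
# standardize_fields() maps a raw row onto the standardized output schema.
workbook = openpyxl.load_workbook(excel_file, data_only=True)  # cached values, not formulas
sheet = workbook.active
# Handle multi-row headers: header_row is the 1-based row that holds the column names
headers = [cell.value for cell in sheet[header_row]]
# Parse data rows, skipping completely empty ones
for row in sheet.iter_rows(min_row=data_start_row, values_only=True):
    if all(value is None for value in row):
        continue
    institution = standardize_fields(row, headers)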
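
The "multi-row headers" case can be handled by collapsing the header rows into a single list of column names before parsing. The following is a minimal sketch, not the code used in parse_kb_netherlands_isil.py: the function name merge_header_rows and the " / " separator are illustrative assumptions. It relies on the fact that openpyxl reports every cell of a merged range except the top-left one as None.

```python
def merge_header_rows(header_rows, sep=" / "):
    """Collapse several header rows into one list of column names.

    header_rows: sequence of tuples, top header row first (hypothetical input,
    e.g. the first two rows of a sheet read with values_only=True).
    """
    if not header_rows:
        return []
    width = max(len(row) for row in header_rows)
    filled = []
    for index, row in enumerate(header_rows):
        cells = list(row) + [None] * (width - len(row))
        if index < len(header_rows) - 1:
            # Group labels in upper rows span several columns, so carry the
            # last seen label to the right over the empty (merged) cells.
            last = None
            for position, value in enumerate(cells):
                if value is None:
                    cells[position] = last
                else:
                    last = value
        filled.append(cells)

    names = []
    for column in zip(*filled):
        parts = []
        for value in column:
            text = "" if value is None else str(value).strip()
            if text and text not in parts:
                parts.append(text)
        names.append(sep.join(parts))
    return names
```

With openpyxl, the input could be collected as list(sheet.iter_rows(min_row=1, max_row=2, values_only=True)). One limitation of this sketch: an upper-row cell that is genuinely empty (rather than merged) also inherits the label to its left, so the result should be checked against the actual sheet.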

HTML Table Scraping (Belarus, Denmark):

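from bs4 import BeautifulSoup
import requests

response = requests.get(url, timeout=30)
response.raise_for_status()  # fail fast on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')  # first table on the page; None if absent
rows = table.find_all('tr') if table else []
# One list of cell texts per row (header cells included)
cells = [[c.get_text(strip=True) for c in tr.find_all(['td', 'th'])] for tr in rows]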

CSV Direct Download (Denmark, Japan):

import pandas as pd

# dtype=str keeps postal codes and identifiers as text (no dropped leading zeros)
df = pd.read_csv(url_or_file, encoding='utf-8', dtype=str)
# or use csv.DictReader for pure stdlib
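
The pure-stdlib alternative mentioned in the comment above might look like the sketch below. It assumes a CSV with a single header row; the function name read_registry_rows is hypothetical and not part of the existing scripts.

```python
import csv


def read_registry_rows(lines):
    """Read CSV text (a file object or any iterable of lines) into dicts.

    Whitespace around headers and values is stripped, and rows whose
    cells are all empty are skipped.
    """
    reader = csv.DictReader(lines)
    rows = []
    for record in reader:
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in record.items()
            if key is not None  # surplus cells beyond the header are dropped
        }
        if any(cleaned.values()):
            rows.append(cleaned)
    return rows
```

A downloaded file would be opened with open(path, newline="", encoding="utf-8-sig") and passed in directly; the utf-8-sig codec tolerates a byte-order mark, which Excel-generated CSV exports often carry.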

Standardized Output Schema

All parsers output consistent fields:

  • isil_code (required)
  • name (required)
  • city (optional)
  • region/province (optional)
  • country (required)
  • postal_code (optional)
  • street_address (optional)
  • registry (data source)
  • source_url (registry homepage)
  • notes (additional info)
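
The schema above can be expressed as a small data class that enforces the three required fields. This is a sketch under stated assumptions rather than the parsers' actual code: the Institution class, the column_map argument, and the extended signature of standardize_fields are illustrative, and "region/province" is shortened to region.

```python
from dataclasses import dataclass, asdict
from typing import Optional

REQUIRED_FIELDS = ("isil_code", "name", "country")


@dataclass
class Institution:
    isil_code: str = ""
    name: str = ""
    country: str = ""
    city: Optional[str] = None
    region: Optional[str] = None  # "region/province" in the schema list
    postal_code: Optional[str] = None
    street_address: Optional[str] = None
    registry: Optional[str] = None
    source_url: Optional[str] = None
    notes: Optional[str] = None

    def __post_init__(self):
        # Reject records that lack any required field.
        for field_name in REQUIRED_FIELDS:
            value = getattr(self, field_name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing required field: {field_name}")
            setattr(self, field_name, value.strip())

    def to_record(self):
        """Return a plain dict suitable for CSV/JSON export."""
        return asdict(self)


def standardize_fields(row, headers, column_map, country,
                       registry=None, source_url=None):
    """Map one raw row onto the schema.

    column_map maps schema field names to source column headers
    (hypothetical example: {"isil_code": "Code", "name": "Name"}).
    """
    raw = dict(zip(headers, row))
    values = {}
    for target, source_column in column_map.items():
        value = raw.get(source_column)
        if value is not None and str(value).strip():
            values[target] = str(value).strip()
    values.setdefault("country", country)
    values.setdefault("registry", registry)
    values.setdefault("source_url", source_url)
    return Institution(**values)
```

Keeping the per-registry differences in column_map means each new parser only has to supply a header mapping, while validation stays in one place.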

Statistics Summary

Institutions Extracted to Date:

  • Belgium: 438
  • Belarus: 167
  • Netherlands: 153
  • Total: 758 institutions

Potential Dataset Size (if all Tier 1 extracted):

  • Denmark: ~700
  • Japan: ~6,000
  • Netherlands (complete): ~1,500 (with research libraries)
  • Belgium: 438
  • Belarus: 167
  • Total potential: ~9,000 institutions from 5 countries
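
The totals in this section can be re-derived from the per-country figures. The short check below uses only the numbers listed above; the Denmark, Japan, and complete-Netherlands values are the estimates from this summary, not measured counts.

```python
# Counts already extracted (from "Institutions Extracted to Date").
extracted = {"Belgium": 438, "Belarus": 167, "Netherlands": 153}

# Estimated counts if every Tier 1 source is fully extracted.
tier1_potential = {
    "Denmark": 700,
    "Japan": 6000,
    "Netherlands (complete)": 1500,
    "Belgium": 438,
    "Belarus": 167,
}

extracted_total = sum(extracted.values())
potential_total = sum(tier1_potential.values())

print(f"Extracted to date: {extracted_total}")
print(f"Tier 1 potential:  {potential_total}")
```

The extracted total comes to 758, matching the figure above, and the Tier 1 sum is 8,805, which the summary rounds to ~9,000.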

Global Potential (if all 41 national agencies extracted):

  • Estimated 50,000-100,000+ institutions worldwide
  • USA alone: 10,000+
  • Germany alone: 10,000+
  • France: 3,000+
  • Italy: 2,000+

References

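Primary Sources:

  • ISIL Registration Authority: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil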

Registries Documented:

  • 41 national agencies
  • 5 non-national agencies
  • 46 total ISIL allocation agencies

Tools Used:

  • requests + BeautifulSoup for web scraping
  • openpyxl for Excel parsing
  • csv module for CSV exports
  • json module for JSON exports
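
The paired .csv and .json outputs can be produced with the two stdlib modules listed above. This is a minimal sketch, assuming records are plain dicts keyed by the standardized schema fields; the export_records function and the FIELDS constant are hypothetical names, and "region/province" is written as region.

```python
import csv
import json
from pathlib import Path

# Column order for the CSV export (assumed to mirror the schema list).
FIELDS = [
    "isil_code", "name", "city", "region", "country",
    "postal_code", "street_address", "registry", "source_url", "notes",
]


def export_records(records, base_path):
    """Write records to <base_path>.csv and <base_path>.json.

    Returns the two output paths. Keys outside FIELDS are ignored in
    the CSV, and missing keys are written as empty cells.
    """
    records = list(records)
    base = Path(base_path)
    base.parent.mkdir(parents=True, exist_ok=True)
    csv_path = base.with_suffix(".csv")
    json_path = base.with_suffix(".json")

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

    with json_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-Latin institution names readable.
        json.dump(records, handle, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

For example, export_records(rows, "/data/isil/BY/belarus_isil_all") would write the two Belarus files named earlier. Because with_suffix replaces anything after the last dot, base names containing dots should be avoided.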

Next Session Priorities

  1. Extract Denmark ISIL registry (direct CSV download)
  2. Extract Japan ISIL registry (4 Excel files)
  3. Create scraper for USA Library of Congress
  4. Create scraper for Germany Staatsbibliothek Berlin
  5. Investigate Rijksarchief Belgium (archives) - 0 records from previous attempt

Goal: Achieve 15,000+ institutions from 10 countries by end of week


Session End: 2025-11-17 14:02 UTC