
Global ISIL Registry Scraper Inventory

Version: 1.0
Date: 2025-11-17
Purpose: Technical inventory of data extraction methods for global ISIL registries
Status: Reconnaissance complete, extraction pending


Overview

This document provides detailed technical specifications for extracting heritage institution data from 46 ISIL registries worldwide. For each registry, we document:

  • Access method (direct download, API, web scraping)
  • Data format and structure
  • Estimated record count
  • Required tools and complexity
  • Priority tier for extraction

Total Estimated Records: 50,000-100,000+ institutions globally


Extraction Priority Tiers

Tier 1: Direct Downloads (Easy Wins)

Large datasets with CSV/Excel/JSON downloads requiring minimal processing.

Tier 2: JSON/XML APIs

Programmatic access via REST APIs, SPARQL endpoints, or SRU interfaces.

Tier 3: Web Scraping Required

HTML-based search interfaces requiring BeautifulSoup/Selenium.

Tier 4: Contact Required

No public access; requires direct contact with registry agency.


Tier 1: Direct Downloads (9 Registries)

🇩🇰 Denmark - Danish Agency for Culture and Palaces

Priority: HIGH
URL: https://vip.dbc.dk/librarylists/
Access Method: Direct CSV download (4 separate files)

Available Downloads:

  1. Folkebiblioteker (væsener) - Public libraries (main branches)

  2. Folkebiblioteker (alle betjeningssteder) - Public libraries (all service points)

  3. Forskningsbiblioteker (væsener) - Research libraries (main branches)

  4. Forskningsbiblioteker (alle betjeningssteder) - Research libraries (all service points)

Fields Available:

  • Biblioteksnummer (Library number/ISIL)
  • Navn (Name)
  • Adresse (Address)
  • Postnummer (Postal code)
  • By (City)
  • Telefon (Phone)
  • Email
  • Leder (Director/Manager)

Estimated Records: 700+ institutions
Data Format: CSV (UTF-8)
Required Tools: pandas, requests
Complexity: EASY
Institution Types: Libraries only

Implementation Notes:

  • 4 separate CSV files need to be merged
  • May contain duplicate entries (main branches vs. service points)
  • Danish text fields (no English translation)
  • Direct downloads, no rate limiting concerns
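
A minimal merge sketch for the four downloads, assuming the files share the field list above. The download URLs are placeholders and must be confirmed on vip.dbc.dk; dedupe on Biblioteksnummer handles the main-branch vs. service-point overlap:

```python
import io

import pandas as pd
import requests

# Hypothetical direct links -- confirm the real ones on
# https://vip.dbc.dk/librarylists/ before running.
DOWNLOAD_URLS = [
    "https://vip.dbc.dk/librarylists/folkebiblioteker-vaesener.csv",
    "https://vip.dbc.dk/librarylists/folkebiblioteker-alle.csv",
    "https://vip.dbc.dk/librarylists/forskningsbiblioteker-vaesener.csv",
    "https://vip.dbc.dk/librarylists/forskningsbiblioteker-alle.csv",
]

def merge_library_lists(frames: list) -> pd.DataFrame:
    """Concatenate the four lists, keeping one row per Biblioteksnummer."""
    merged = pd.concat(frames, ignore_index=True)
    # Main-branch files come first in DOWNLOAD_URLS, so keep="first"
    # prefers main branches over duplicate service-point rows.
    return merged.drop_duplicates(subset="Biblioteksnummer", keep="first")

def fetch_all() -> pd.DataFrame:
    frames = []
    for url in DOWNLOAD_URLS:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        frames.append(pd.read_csv(io.StringIO(resp.text)))
    return merge_library_lists(frames)
```

If the "all service points" files should be kept in full, drop the dedupe step and add a branch-type column instead.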

🇯🇵 Japan - National Diet Library

Priority: HIGH
URL: http://www.ndl.go.jp/en/library/isil/index.html#isillist
Access Method: Direct CSV/Excel download (5 datasets)

Available Downloads (Updated 2025-11-05):

  1. Public Libraries

  2. NDL, University Libraries, Special Libraries

  3. Museums

  4. Archives

  5. Deleted ISIL

ISIL Structure:

  • Country code: JP-
  • Type identifier: 1 (Library), 2 (Museum), 3 (Archive)
  • Institution ID: 6 digits (e.g., JP-1123456, JP-2234567, JP-3345678)

Fields Available:

  • ISIL
  • Institution name (English)
  • Institution name (Japanese)
  • Institution name (Japanese kana)
  • Postal code
  • Address
  • Telephone
  • Fax
  • URL
  • Registration date (since April 2020)
  • Last update date (since April 2020)

Estimated Total Records: 6,000+ institutions
Data Format: CSV and Excel (choose one)
Required Tools: pandas, requests, openpyxl (for Excel)
Complexity: EASY
Institution Types: Libraries, Museums, Archives
Update Frequency: Regular (files dated 2025-11-05)

Implementation Notes:

  • Very recent data (November 2025)
  • English field names available
  • 5 separate files need to be merged
  • Deleted ISIL records useful for tracking institutional changes
  • Public domain license (free to use)
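
The type digit in the ISIL structure above can drive a simple classifier when the five files are merged; function name and mapping are illustrative:

```python
# JP- ISILs encode the institution type in the first digit of the
# local part: 1 = Library, 2 = Museum, 3 = Archive (see above).
JP_TYPE_BY_DIGIT = {"1": "LIBRARY", "2": "MUSEUM", "3": "ARCHIVE"}

def jp_institution_type(isil: str) -> str:
    """Return the institution type encoded in a JP- ISIL, or UNKNOWN."""
    if not isil.startswith("JP-") or len(isil) < 4:
        return "UNKNOWN"
    return JP_TYPE_BY_DIGIT.get(isil[3], "UNKNOWN")
```

The decoded type can also be cross-checked against which source file a record came from, as a cheap consistency test.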

🇳🇱 Netherlands - Koninklijke Bibliotheek (KB)

Priority: MEDIUM
URL: https://www.bibliotheeknetwerk.nl/landelijke-digitale-infrastructuur-ldi/isil-codes
Access Method: Direct Excel download

Status: ALREADY EXTRACTED (153 public libraries)

Available Download:

  • Public Libraries List: Excel file from bibliotheeknetwerk.nl
  • File: /data/isil/NL/kb_netherlands_public_libraries.{csv,json}
  • Parser: /scripts/scrapers/parse_kb_netherlands_isil.py

Fields Available:

  • Instellingsnaam (Institution name)
  • Plaatsnaam (City name)
  • ISIL code

Extracted Records: 153 institutions
Data Format: Excel (with header in row 3)
Complexity: EASY
Institution Types: Public libraries only
Completion Date: 2025-11-17

Implementation Notes:

  • Excel header in row 3 (not standard row 1)
  • Parser already implemented and tested
  • Limited coverage (public libraries only, no research/special libraries)
  • May need additional source for complete Netherlands coverage
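
Since the header sits in spreadsheet row 3, a loader has to skip the first two rows. A sketch, assuming the Dutch column names from the field list above:

```python
import pandas as pd

def load_kb_isil(path: str) -> pd.DataFrame:
    # Spreadsheet row 3 holds the header, i.e. zero-based index 2.
    df = pd.read_excel(path, header=2)
    # Column names taken from the documented field list; verify
    # against the actual file.
    return df[["Instellingsnaam", "Plaatsnaam", "ISIL code"]]
```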

🇧🇾 Belarus - National Library of Belarus

Priority: LOW (already extracted)
URL: https://nlb.by/en/for-librarians/international-standard-identifier-for-libraries-and-related-organizations-isil/list-of-libraries-organizations-of-the-republic-of-belarus-and-their-isil-codes

Status: ALREADY EXTRACTED (167 institutions)

Available Download:

  • HTML table with all ISIL codes
  • File: /data/isil/BY/belarus_isil_all.{csv,json}
  • Parser: Already exists from previous session

Fields Extracted:

  • ISIL code
  • Institution name
  • Region code (embedded in ISIL: HO, MI, VI, MA, HM, BR, HR)

Extracted Records: 167 institutions
Regional Breakdown:

  • HO (Homyel): 29
  • MI (Minsk): 26
  • VI (Vitsyebsk): 25
  • MA (Mahilyow): 25
  • HM (Hrodna): 23
  • BR (Brest): 20
  • HR (Hrodna variant): 19

Complexity: EASY
Institution Types: Libraries and related organizations
Completion Date: 2025-11-17


🇧🇪 Belgium - Royal Library of Belgium (KBR)

Priority: LOW (already extracted)
URL: http://isil.kbr.be/

Status: ALREADY EXTRACTED (438 institutions)

Extracted Records: 438 institutions
Completion Date: Previous session
Institution Types: Libraries, archives, museums


🇨🇾 Cyprus - Cyprus University of Technology Library

Priority: LOW
URL: http://library.cut.ac.cy/en/node/1194
Access Method: Direct list (HTML or downloadable file)

Estimated Records: 50-100 institutions
Complexity: EASY
Institution Types: Libraries and archives

Implementation Notes:

  • Small dataset, straightforward extraction
  • Likely HTML table or PDF list
  • Low priority due to small size

🇱🇺 Luxembourg - Archives nationales de Luxembourg

Priority: LOW
URL: http://www.anlux.public.lu/fr/nous-connaitre/reseaux-internationaux/ISIL.html
Access Method: Direct list (HTML)

Estimated Records: 20-50 institutions
Complexity: EASY
Institution Types: Archives and libraries

Implementation Notes:

  • Very small dataset
  • Likely single HTML page with table
  • French language

🇭🇺 Hungary - National Széchényi Library

Priority: MEDIUM
URL: https://ki.oszk.hu/dokumentumtar/isil-azonositok
Access Method: Direct list (likely HTML or downloadable file)

Estimated Records: 200-500 institutions
Complexity: EASY
Institution Types: Libraries

Implementation Notes:

  • Medium-sized dataset
  • Hungarian language interface
  • Need to verify file format (HTML vs. downloadable)

🇮🇱 Israel - National Library of Israel

Priority: MEDIUM
URL: http://nli.org.il
Access Method: Direct list (location TBD)

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries and archives

Implementation Notes:

  • Need to locate exact ISIL list URL on NLI website
  • Likely Hebrew and English versions
  • May require navigation through site structure

Tier 2: JSON/XML APIs (3 Registries)

🇩🇪 Germany - Staatsbibliothek zu Berlin (Sigel Database)

Priority: HIGH
URL: https://sigel.staatsbibliothek-berlin.de/
Access Method: JSON-API, SRU, Linked Data Service

API Endpoints:

  1. JSON-API (Primary)

  2. SRU (Secondary)

  3. Linked Data Service (RDF)

JSON-API Features:

  • Pagination: size parameter (default: 10, max: TBD)
  • Page selection: page parameter (default: 1)
  • Compact format: compact=1 for PICA-Plus records
  • JSONP support: callback parameter for cross-domain requests
  • Search syntax: nam=keyword AND ort=city (field-specific search)
  • URL encoding required: Use %3D for =, %20 for spaces

Search Fields:

  • nam - Institution name
  • ort - City/Location
  • Additional fields documented in API docs

Response Structure:

{
  "id": "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld?q=...",
  "type": "Collection",
  "totalItems": 12345,
  "freetextQuery": "search terms",
  "member": [
    {
      "id": "https://sigel.staatsbibliothek-berlin.de/org/DE-12",
      "type": ["Library", "Museum", "Archive", "Other"],
      "identifier": "DE-12",
      "location": "Berlin",
      "addressRegion": "DE-BE",
      "seeAlso": "https://sigel.staatsbibliothek-berlin.de/org/DE-12.rdf",
      "data": { /* PICA-Plus record */ }
    }
  ],
  "view": {
    "id": "current page URL",
    "type": "PartialCollectionView",
    "first": "first page URL",
    "last": "last page URL",
    "next": "next page URL",
    "previous": "previous page URL",
    "totalItems": 10,
    "pageIndex": 1,
    "numberOfPages": 1235,
    "offset": 0,
    "limit": 10
  }
}

Estimated Records: 10,000+ institutions
Data Format: JSON-LD with PICA-Plus embedded
Required Tools: requests, json
Complexity: MEDIUM (pagination required)
Institution Types: Libraries, Archives, Museums, Other
Rate Limiting: Unknown (implement polite scraping with delays)

Implementation Strategy:

  1. Start with broad search (all institutions)
  2. Paginate through results using size=100 (test max size)
  3. Extract JSON-LD records
  4. Parse PICA-Plus data for additional fields
  5. Store in standardized CSV/JSON format
  6. Optionally fetch RDF for Linked Data integration
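
The steps above can be sketched against the documented q/size/page parameters. The /api/org.jsonld path is inferred from the "id" field of the sample response and should be verified against the API docs; requests URL-encodes query parameters automatically, covering the %3D/%20 note above:

```python
import time

import requests

# Endpoint inferred from the sample response's "id" field -- verify
# against the JSON-API documentation before relying on it.
API_URL = "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld"

def extract_members(page_json: dict) -> tuple:
    """Return (member records, whether more pages remain) for one page."""
    members = page_json.get("member", [])
    view = page_json.get("view", {})
    has_more = view.get("pageIndex", 1) < view.get("numberOfPages", 1)
    return members, has_more

def fetch_all(query: str = "*", size: int = 100, delay: float = 1.0) -> list:
    """Paginate through the full result set with a polite delay."""
    records, page = [], 1
    while True:
        resp = requests.get(API_URL,
                            params={"q": query, "size": size, "page": page},
                            timeout=30)
        resp.raise_for_status()
        members, has_more = extract_members(resp.json())
        records.extend(members)
        if not has_more:
            return records
        page += 1
        time.sleep(delay)  # polite scraping, rate limits unknown
```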

Implementation Notes:

  • Very comprehensive dataset (~10,000 institutions)
  • Well-documented JSON-API (beta status)
  • Format change announced for September 2025 (check blog post)
  • Linked Data Service provides RDF/Turtle for semantic web integration
  • PICA-Plus format requires additional parsing
  • Contact: Carsten Klee (carsten.klee@sbb.spk-berlin.de, +49 30 266 434402)

🇦🇹 Austria - Die Österreichische Bibliothekenverbund und Service GmbH (OBVSG)

Priority: MEDIUM
URL: https://www.isil.at/
Access Method: Searchable interface (check for API)

Estimated Records: 500-1,000 institutions
Complexity: MEDIUM
Institution Types: Libraries

Implementation Notes:

  • Need to investigate if API exists
  • May require web scraping if no API
  • Austrian library network consortium

🇪🇬 Egypt - ENSTINET

Priority: LOW
URL: http://cdex5.sti.sci.eg/dev60cgi/f60cgi?form=DEGL_WEB_SEARCH&userid=degl/degl@sti
Access Method: Search interface (unclear structure)

Estimated Records: 100-500 institutions
Complexity: HARD (interface unclear)
Institution Types: Libraries and research centers

Implementation Notes:

  • Legacy CGI interface (f60cgi)
  • May require authentication (userid parameter)
  • Unclear if publicly accessible
  • Low priority due to complexity

Tier 3: Web Scraping Required (17 Registries)

🇺🇸 United States - Library of Congress

Priority: HIGH
URL: http://www.loc.gov/marc/organizations/
Access Method: Search interface + individual record pages

Search URL: http://www.loc.gov/marc/organizations/org-search.php

Structure:

  • Search interface: Keyword or code search
  • Individual records: Separate pages per organization
  • No bulk download: Must scrape search results + detail pages
  • Pagination: Unknown (test with large result sets)

Code Structure:

  • US codes: Geographic prefix + institution code (e.g., DLC, TxU, NNopo)
  • ISIL format: US-{MARC_CODE} (e.g., US-DLC for Library of Congress)
  • Non-US codes: Also included but with country prefixes

Fields Available:

  • MARC organization code
  • ISIL code (US- prefix added)
  • Organization name
  • Address (street, city, state, postal code)
  • Country (for non-US organizations)
  • Status (valid/obsolete)

Estimated Records: 37,000+ codes (33,000 valid, 4,000+ invalid/obsolete)
Geographic Coverage: Primarily USA, some non-US institutions
Data Format: HTML (requires scraping)
Required Tools: requests, BeautifulSoup4, json
Complexity: HARD (pagination + detail pages)
Institution Types: Libraries, archives, museums, corporations

Implementation Strategy:

  1. Option A: Paginated search scraping

    • Submit broad search (empty query or "*")
    • Parse result pages
    • Extract codes and basic info
    • Follow links to detail pages for full address
  2. Option B: Systematic code generation

    • Generate all possible state + code combinations
    • Test each code via search
    • Extract valid records
    • (Likely inefficient - 50,000+ possible combinations)
  3. Option C: Contact LoC for bulk data

    • Email: ndmso@loc.gov
    • Request: Full MARC organization codes database
    • Justification: Academic research, cultural heritage mapping

Rate Limiting:

  • No explicit rate limit mentioned
  • Implement polite scraping: 1 request per 2 seconds
  • Expect 24-48 hours for full extraction (37,000 records)

Implementation Notes:

  • Largest ISIL registry (37,000+ codes)
  • Historical codes preserved (non-unique obsolete codes)
  • Non-US institutions included (subset of global coverage)
  • MARC format requires understanding of subunit structure
  • Contact option available if scraping blocked
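
A sketch of the result-parsing half of Option A. The selector and link pattern are guesses and must be checked against the live org-search.php markup, and the US- prefix only applies to US organizations:

```python
from bs4 import BeautifulSoup

def parse_result_links(html: str) -> list:
    """Extract MARC codes from result-page links (selector is a guess)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.select("a[href*='org-code']"):  # hypothetical link pattern
        code = a.get_text(strip=True)
        if code:
            # US- prefix is correct only for US organizations; non-US
            # codes in the registry already carry country prefixes.
            rows.append({"marc_code": code, "isil_code": f"US-{code}"})
    return rows
```

Detail pages would then be fetched per code, with the 1-request-per-2-seconds pacing noted above.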

Alternative Agencies:

  • Canada: Library and Archives Canada (separate registry)
  • UK: British Library (offline since October 2023 cyber attack)
  • Germany: Staatsbibliothek zu Berlin (separate registry)
  • Estonia: Eesti Rahvusraamatukogu (separate registry)

🇨🇦 Canada - Library and Archives Canada

Priority: HIGH
URL: https://sigles-symbols.bac-lac.gc.ca/eng/Search
Access Method: Search interface

Estimated Records: 2,000-5,000 institutions
Complexity: HARD
Institution Types: Libraries, archives

Implementation Notes:

  • Bilingual (English/French)
  • Search interface scraping required
  • May have provincial breakdowns
  • Test for bulk export option

🇦🇺 Australia - National Library of Australia

Priority: MEDIUM
URL: http://www.nla.gov.au/apps/ilrs
Access Method: Search interface

Estimated Records: 1,000-2,000 institutions
Complexity: HARD
Institution Types: Libraries

Implementation Notes:

  • ILRS = Interlibrary Loan and Resource Sharing system
  • May require Selenium if JavaScript-heavy
  • Check for CSV export option

🇨🇭 Switzerland - Swiss National Library

Priority: MEDIUM
URL: https://www.isil.nb.admin.ch/en/
Access Method: Searchable interface

Estimated Records: 500-1,000 institutions
Complexity: MEDIUM
Institution Types: Libraries, archives

Implementation Notes:

  • Multilingual (German, French, Italian, English)
  • Check for API or bulk download
  • Swiss federal government site (.admin.ch domain)

🇨🇿 Czech Republic - National Library of the Czech Republic

Priority: MEDIUM
URL: https://aleph.nkp.cz/F/?func=file&file_name=find-b&CON_LNG=ENG&local_base=adr
Access Method: ALEPH library system search interface

Estimated Records: 500-1,000 institutions
Complexity: HARD (ALEPH system)
Institution Types: Libraries

Implementation Notes:

  • ALEPH system (library catalog software)
  • Complex URL structure with session management
  • May require cookie handling
  • English interface available (CON_LNG=ENG)

🇫🇷 France - Agence Bibliographique de l'Enseignement Supérieur (ABES)

Priority: HIGH
URL: http://www.sudoc.abes.fr/
Access Method: SUDOC catalog search

Status: ⚠️ URLs in official directory may be outdated (404 errors)
Alternative: https://www.abes.fr/ (ABES main site)

Estimated Records: 3,000+ institutions (university and research libraries)
Complexity: HARD
Institution Types: Academic libraries, research centers

Implementation Notes:

  • SUDOC = Système Universitaire de Documentation
  • Covers French higher education libraries
  • May have API (check ABES documentation)
  • French language interface
  • Need to verify correct ISIL list URL

🇮🇹 Italy - Istituto Centrale per il Catalogo Unico (ICCU)

Priority: HIGH
URL: http://anagrafe.iccu.sbn.it/
Access Method: Anagrafe search interface

Estimated Records: 5,000+ institutions
Complexity: HARD
Institution Types: Libraries, archives

Implementation Notes:

  • Anagrafe = registry/directory
  • SBN = Servizio Bibliotecario Nazionale (National Library Service)
  • Large dataset covering Italian libraries
  • May require Italian language parsing

🇧🇬 Bulgaria - National Library of Bulgaria

Priority: LOW
URL: http://www.nationallibrary.bg/wp/?page_id=5686
Access Method: Search interface

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries


🇫🇮 Finland - National Library of Finland

Priority: MEDIUM
URL: http://isil.kansalliskirjasto.fi/en/
Access Method: Search interface

Estimated Records: 500-1,000 institutions
Complexity: MEDIUM
Institution Types: Libraries

Implementation Notes:

  • English interface available
  • Check for API or bulk download option

🇭🇷 Croatia - Nacionalna i sveučilišna knjižnica u Zagrebu (National and University Library in Zagreb)

Priority: LOW
URL: https://nsk.hr/en/
Access Method: ISIL list mentioned but location unclear

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries

Implementation Notes:

  • Need to locate ISIL list on NSK website
  • Croatian language primary, English available

🇳🇴 Norway - National Library of Norway

Priority: MEDIUM
URL: http://www.nb.no/baser/bibliotek/english.html
Access Method: Search interface

Estimated Records: 500-1,000 institutions
Complexity: MEDIUM
Institution Types: Libraries


🇳🇿 New Zealand - National Library of New Zealand

Priority: LOW
URL: http://natlib.govt.nz/
Access Method: Search mentioned but location unclear

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries

Implementation Notes:

  • Need to locate ISIL search on NLNZ website

🇶🇦 Qatar - Qatar National Library

Priority: LOW
URL: http://www.qnl.qa/find-answers/other-libraries/
Access Method: Search interface

Estimated Records: 20-50 institutions
Complexity: EASY
Institution Types: Libraries

Implementation Notes:

  • Small dataset
  • English and Arabic interfaces

🇷🇴 Romania - National Library of Romania

Priority: LOW
URL: https://bibnat.ro/
Access Method: Search interface (location unclear)

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries


🇷🇺 Russia - Russian National Public Library for Science and Technology

Priority: MEDIUM
URL: http://www.gpntb.ru/rfid1/searchdb.php
Access Method: Search interface

Estimated Records: 1,000-2,000 institutions
Complexity: HARD
Institution Types: Libraries, scientific institutions

Implementation Notes:

  • Russian language interface
  • RFID-related URL suggests technical database
  • May require Cyrillic text handling

🇸🇮 Slovenia - National and University Library

Priority: LOW
URL: https://www.nuk.uni-lj.si/eng/node/543
Access Method: ISIL list mentioned

Estimated Records: 100-200 institutions
Complexity: EASY
Institution Types: Libraries


🇸🇰 Slovakia - Slovak National Library

Priority: LOW
URL: http://www.snk.sk/en/information-for/libraries-and-librarians/isil.html
Access Method: ISIL list mentioned

Estimated Records: 200-500 institutions
Complexity: MEDIUM
Institution Types: Libraries


Tier 4: Contact Required (8 Agencies)

🇦🇷 Argentina - IRAM

Priority: LOW
URL: http://www.iram.org.ar/
Status: No public ISIL list available

Estimated Records: 500-1,000 institutions
Contact Required: Yes
Institution Types: Libraries, archives

Implementation Notes:

  • IRAM = Argentine Standardization and Certification Institute
  • May provide list upon request
  • Spanish language

🇬🇱 Greenland - Central and Public Library of Greenland

Priority: LOW
URL: https://www.katak.gl/
Status: No public ISIL list

Estimated Records: 5-10 institutions
Contact Required: Yes
Institution Types: Libraries


🇮🇷 Iran - National Library and Archives of Iran

Priority: LOW
Status: No URL provided, no public list

Estimated Records: 100-500 institutions
Contact Required: Yes
Institution Types: Libraries, archives

Implementation Notes:

  • May require Persian language communication
  • Political/access challenges possible

🇰🇿 Kazakhstan - National Library of Kazakhstan

Priority: LOW
URL: https://www.nlrk.kz/index.php?lang=en
Status: No public ISIL list

Estimated Records: 100-300 institutions
Contact Required: Yes
Institution Types: Libraries


🇲🇩 Moldova - National Library of Moldova

Priority: LOW
URL: http://bnrm.md/
Status: No public ISIL list

Estimated Records: 50-100 institutions
Contact Required: Yes
Institution Types: Libraries


🇳🇵 Nepal

Priority: LOW
Status: No agency listed in official ISIL directory

Estimated Records: Unknown
Contact Required: Yes - contact Danish Agency for information
Institution Types: Unknown


🇰🇷 South Korea - National Library of Korea

Priority: MEDIUM
URL: https://www.nl.go.kr/
Status: Main NLK site doesn't list ISIL directory

Alternative Registry: https://librarian.nl.go.kr/ (discovered via Exa search)

Estimated Records: 1,000-2,000 institutions
Contact Required: Investigate librarian.nl.go.kr before contacting
Institution Types: Libraries

Implementation Notes:

  • Secondary registry exists at librarian subdomain
  • Need to verify if this is official ISIL registry
  • Korean language interface
  • May have English option

🇬🇧 United Kingdom - British Library

Priority: HIGH (when restored)
URL: https://www.bl.uk/help/get-a-marc-organization-code (currently 404)
Status: ⚠️ OFFLINE - Cyber attack October 2023

Estimated Records: 5,000+ institutions
Contact Required: Wait for service restoration
Institution Types: Libraries, archives, museums

Implementation Notes:

  • British Library systems offline since October 2023 cyber attack
  • ISIL/MARC code directory unavailable
  • Monitor BL announcements for restoration
  • Largest UK heritage collection metadata
  • Critical for UK coverage but currently blocked

Non-National Registries (5 Agencies)

EUR - European Organizations

Priority: LOW
URL: https://www.eui.eu/Documents/Research/HistoricalArchivesofEU/ISIL/isil-directory.pdf
Access Method: Direct PDF download

Estimated Records: 20-50 institutions
Complexity: EASY
Institution Types: European Union organizations

Implementation Notes:

  • PDF format requires text extraction
  • Covers EU institutional libraries
  • Small specialized dataset

GTB - GTBib (Spanish-language interlibrary loan)

Priority: LOW
URL: https://directorio.gtbib.net/lista_isil
Access Method: Direct list

Estimated Records: 100-300 institutions
Complexity: EASY
Institution Types: Spanish-language libraries

Implementation Notes:

  • Spanish language
  • Interlibrary loan network
  • May overlap with national registries (ES, MX, AR, etc.)

O - OCLC (RFID tags)

Priority: LOW
URL: http://www.oclc.org/contacts/libraries/
Access Method: Search interface

Estimated Records: N/A (special RFID encoding scheme, not a traditional ISIL registry)
Complexity: MEDIUM
Institution Types: OCLC member libraries

Implementation Notes:

  • Letter "O" used for OCLC RFID technical encoding
  • Not a traditional national registry
  • Overlaps with US and other national codes

OCLC - OCLC WorldCat Symbol

Priority: MEDIUM
URL: http://www.oclc.org/contacts/libraries/
Access Method: Search interface

Estimated Records: 10,000+ institutions (global)
Complexity: HARD
Institution Types: WorldCat member libraries

Implementation Notes:

  • OCLC WorldCat symbols used globally
  • Different from ISIL but related
  • Massive dataset, may overlap with national registries
  • Useful for cross-referencing and enrichment

ZDB - Zeitschriftendatenbank (German Serials Database)

Priority: MEDIUM
URL: https://zdb-katalog.de/index.xhtml
Access Method: Search interface

Estimated Records: 3,000+ institutions
Complexity: HARD
Institution Types: Libraries holding serials

Implementation Notes:

  • German serials/journals database
  • Managed by Staatsbibliothek zu Berlin
  • May have API (check ZDB documentation)
  • Overlaps with DE- national codes

Extraction Recommendations

Phase 1: Quick Wins (Tier 1 Downloads)

Timeline: 1-2 days
Expected Yield: 7,000-8,000 records

  1. Denmark - 4 CSV downloads (700 records)
  2. Japan - 5 CSV/Excel downloads (6,000+ records)
  3. Hungary - Direct list (200-500 records)
  4. Israel - Direct list (200-500 records)
  5. Cyprus - Direct list (50-100 records)
  6. Luxembourg - Direct list (20-50 records)

Tools Required: pandas, requests, openpyxl
Risk: Low
Effort: Low


Phase 2: Major APIs (Tier 2)

Timeline: 3-5 days
Expected Yield: 10,000-12,000 records

  1. Germany - JSON-API (10,000+ records)
  2. Austria - Search/API (500-1,000 records)

Tools Required: requests, json, pagination logic
Risk: Medium (rate limiting, API changes)
Effort: Medium


Phase 3: High-Value Scraping (Tier 3 Priority)

Timeline: 2-3 weeks
Expected Yield: 50,000+ records

  1. USA - Library of Congress (37,000+ records) - PRIORITY
  2. Canada - Library and Archives Canada (2,000-5,000 records)
  3. France - SUDOC (3,000+ records)
  4. Italy - ICCU Anagrafe (5,000+ records)

Tools Required: requests, BeautifulSoup4, Selenium (if needed)
Risk: High (blocking, CAPTCHA, rate limiting)
Effort: High


Phase 4: Remaining Scraping (Tier 3 Secondary)

Timeline: 2-4 weeks
Expected Yield: 10,000-15,000 records

  1. Australia, Switzerland, Czech Republic, Finland, Norway, Bulgaria, Croatia, New Zealand, Qatar, Romania, Russia, Slovenia, Slovakia

Tools Required: Same as Phase 3
Risk: Medium-High
Effort: Medium-High


Phase 5: Contact-Based (Tier 4)

Timeline: Ongoing (3-6 months for responses)
Expected Yield: 5,000-10,000 records

  1. UK - Wait for British Library restoration (5,000+ records)
  2. South Korea - Investigate librarian.nl.go.kr (1,000-2,000 records)
  3. Argentina, Greenland, Iran, Kazakhstan, Moldova, Nepal - Contact agencies

Tools Required: Email templates, institutional affiliation
Risk: High (may not receive data)
Effort: Low (but long wait times)


Technical Implementation Guide

Standard Parser Template

import io
import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup

class ISILRegistryScraper:
    def __init__(self, country_code, registry_name, base_url):
        self.country_code = country_code
        self.registry_name = registry_name
        self.base_url = base_url
        self.headers = {
            'User-Agent': 'GLAM Heritage Institution Scraper/1.0 (Academic Research)'
        }
        self.records = []
    
    def fetch_csv(self, url):
        """Fetch CSV file from direct download URL"""
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        return pd.read_csv(io.StringIO(response.text))
    
    def fetch_json_api(self, url, params=None):
        """Fetch data from JSON API with pagination"""
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        return response.json()
    
    def scrape_html_table(self, url):
        """Scrape HTML table from webpage"""
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Parse table logic here
        return soup
    
    def standardize_record(self, raw_record):
        """Convert raw record to standard format"""
        return {
            'isil_code': raw_record.get('isil_code'),
            'name': raw_record.get('name'),
            'city': raw_record.get('city'),
            'region': raw_record.get('region'),
            'country': self.country_code,
            'postal_code': raw_record.get('postal_code'),
            'address': raw_record.get('address'),
            'phone': raw_record.get('phone'),
            'email': raw_record.get('email'),
            'url': raw_record.get('url'),
            'institution_type': raw_record.get('type'),
            'registry': self.registry_name,
            'source_url': self.base_url,
            'extraction_date': datetime.now().isoformat()
        }
    
    def export_csv(self, output_path):
        """Export records to CSV"""
        df = pd.DataFrame(self.records)
        df.to_csv(output_path, index=False, encoding='utf-8')
    
    def export_json(self, output_path):
        """Export records to JSON with metadata"""
        output = {
            'metadata': {
                'country': self.country_code,
                'registry': self.registry_name,
                'source_url': self.base_url,
                'extraction_date': datetime.now().isoformat(),
                'record_count': len(self.records)
            },
            'records': self.records
        }
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output, f, indent=2, ensure_ascii=False)

Rate Limiting Best Practices

import time
from datetime import datetime

class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None
    
    def wait(self):
        """Wait if necessary to respect rate limit"""
        if self.last_request:
            elapsed = (datetime.now() - self.last_request).total_seconds()
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = datetime.now()

# Usage (fetch_page, process_data and total_pages are defined elsewhere)
limiter = RateLimiter(requests_per_second=1)

for page in range(1, total_pages + 1):
    limiter.wait()
    data = fetch_page(page)
    process_data(data)

Pagination Handler

import time

import requests

def paginate_api(base_url, page_size=100, max_pages=None):
    """Generic pagination handler for APIs"""
    page = 1
    all_records = []
    
    while True:
        params = {'page': page, 'size': page_size}
        response = requests.get(base_url, params=params)
        response.raise_for_status()
        data = response.json()
        
        records = data.get('member', [])
        if not records:
            break
        
        all_records.extend(records)
        
        # Check for last page
        view = data.get('view', {})
        if page >= view.get('numberOfPages', page):
            break
        
        if max_pages and page >= max_pages:
            break
        
        page += 1
        time.sleep(1)  # Polite scraping
    
    return all_records

Required Python Packages

# Core dependencies
pip install pandas requests beautifulsoup4 openpyxl lxml

# For complex scraping
pip install selenium webdriver-manager

# For API interactions
pip install urllib3

# For JSON/CSV handling
pip install jsonlines

# For progress tracking
pip install tqdm

# For parallel processing (optional)
pip install joblib

Output Format Specification

CSV Output Schema

isil_code,name,city,region,country,postal_code,address,phone,email,url,institution_type,registry,source_url,extraction_date
JP-1001234,"Tokyo Metropolitan Library","Tokyo","13","JP","100-0001","1-1 Minami-Osawa, Hachioji-shi","+81-42-123-4567","info@library.metro.tokyo.jp","https://www.library.metro.tokyo.jp","LIBRARY","National Diet Library","https://www.ndl.go.jp/en/library/isil/","2025-11-17T10:30:00Z"

JSON Output Schema

{
  "metadata": {
    "country": "JP",
    "country_name": "Japan",
    "registry": "National Diet Library",
    "registry_agency": "National Diet Library (NDL)",
    "source_url": "https://www.ndl.go.jp/en/library/isil/",
    "extraction_date": "2025-11-17T10:30:00Z",
    "extraction_method": "Direct CSV download",
    "record_count": 6234,
    "institution_types": ["LIBRARY", "MUSEUM", "ARCHIVE"],
    "data_license": "Public Domain (per NDL)",
    "notes": "Includes public libraries, university libraries, museums, and archives"
  },
  "records": [
    {
      "isil_code": "JP-1001234",
      "name": "Tokyo Metropolitan Library",
      "name_japanese": "東京都立図書館",
      "name_kana": "とうきょうとりつとしょかん",
      "city": "Tokyo",
      "region": "13",
      "country": "JP",
      "postal_code": "100-0001",
      "address": "1-1 Minami-Osawa, Hachioji-shi, Tokyo",
      "phone": "+81-42-123-4567",
      "fax": "+81-42-123-4568",
      "email": "info@library.metro.tokyo.jp",
      "url": "https://www.library.metro.tokyo.jp",
      "institution_type": "LIBRARY",
      "registration_date": "2011-10-15",
      "last_update_date": "2024-03-20",
      "registry": "National Diet Library",
      "source_url": "https://www.ndl.go.jp/en/library/isil/",
      "extraction_date": "2025-11-17T10:30:00Z"
    }
  ]
}

Quality Assurance Checklist

Pre-Extraction

  • Verify source URL is accessible
  • Check for API documentation
  • Test sample queries/downloads
  • Review robots.txt and terms of service
  • Implement rate limiting (1 req/sec default)
  • Set up error logging

During Extraction

  • Monitor for HTTP errors (429, 403, 503)
  • Log failed requests for retry
  • Validate ISIL code format (country code + structure)
  • Check for duplicate records
  • Handle encoding issues (UTF-8)
  • Track progress (records extracted / estimated total)
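
For the HTTP-error monitoring and retry logging above, a minimal backoff helper; the retryable status set and schedule are illustrative defaults:

```python
import logging
import time

import requests

RETRYABLE = {429, 500, 502, 503}

def backoff_delays(attempts: int, base: float = 2.0) -> list:
    """Exponential schedule: base, 2*base, 4*base, ... for the retries."""
    return [base * 2 ** i for i in range(attempts - 1)]

def fetch_with_retry(url: str, attempts: int = 4, base: float = 2.0):
    """GET with retries on transient errors; log and re-raise on failure."""
    for delay in backoff_delays(attempts, base) + [None]:
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in RETRYABLE:
                raise requests.HTTPError(f"HTTP {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            if delay is None:  # final attempt exhausted
                logging.error("giving up on %s: %s", url, exc)
                raise
            time.sleep(delay)
```

Permanently failed URLs end up in the error log, matching the "log failed requests for retry" item above.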

Post-Extraction

  • Validate record count against estimate
  • Check for required fields (ISIL code, name, country)
  • Detect malformed ISIL codes
  • Cross-reference with existing datasets
  • Generate extraction report (metadata)
  • Export to CSV and JSON formats
  • Document any issues or anomalies

Data Quality Metrics

Record Completeness Score

Field        Weight  Required
ISIL code    20%     Yes
Name         20%     Yes
City         15%     No
Country      10%     Yes
Postal code  5%      No
Address      10%     No
Phone        5%      No
Email        5%      No
URL          10%     No

Completeness = (Present fields × Weight) / Total Weight
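
The formula above as a small scorer; field keys follow the CSV schema in this document and the weights sum to 100:

```python
# Weights mirror the completeness table above.
FIELD_WEIGHTS = {
    "isil_code": 20, "name": 20, "city": 15, "country": 10,
    "postal_code": 5, "address": 10, "phone": 5, "email": 5, "url": 10,
}

def completeness(record: dict) -> float:
    """Weighted share of populated fields, between 0 and 1."""
    total = sum(FIELD_WEIGHTS.values())
    present = sum(w for f, w in FIELD_WEIGHTS.items() if record.get(f))
    return present / total
```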

ISIL Code Validation

import re

def validate_isil_code(isil_code, country_code):
    """Validate ISIL code format"""
    # Basic format: CC-XXXXX (country code + dash + local institution code).
    # Simplification: ISO 15511 also allows '-', ':' and '/' in the local
    # part; this check covers the common alphanumeric case.
    pattern = rf'^{country_code}-[A-Za-z0-9]+$'
    
    if not re.match(pattern, isil_code):
        return False, "Invalid format"
    
    # ISO 15511 caps an ISIL at 16 characters in total
    if len(isil_code) < 4 or len(isil_code) > 16:
        return False, "Invalid length"
    
    return True, "Valid"

# Example
validate_isil_code("JP-1001234", "JP")  # (True, "Valid")
validate_isil_code("INVALID", "JP")      # (False, "Invalid format")

Contact Information for Support

ISIL Registration Authority (Denmark)

  • Organization: Danish Agency for Culture and Palaces (see Denmark entry above)

Germany (Staatsbibliothek zu Berlin)

  • Contact: Carsten Klee
  • Email: carsten.klee@sbb.spk-berlin.de
  • Phone: +49 30 266 434402

USA (Library of Congress)

  • Organization: Network Development and MARC Standards Office
  • Email: ndmso@loc.gov
  • Fax: +1-202-707-0115
  • Mailing Address: 101 Independence Avenue, S.E., Washington, D.C. 20540-4402

Next Steps for Implementation

Immediate Actions (Week 1)

  1. Complete reconnaissance (DONE)
  2. Create extraction scripts for Denmark (Tier 1)
  3. Create extraction scripts for Japan (Tier 1)
  4. Test Germany JSON-API (Tier 2)
  5. Document findings and update inventory

Short-term (Weeks 2-4)

  1. Extract all Tier 1 registries (7,000-8,000 records)
  2. Implement Germany API scraper (10,000 records)
  3. Begin USA Library of Congress scraper (37,000 records)
  4. Set up quality assurance pipeline
  5. Create unified dataset (CSV + JSON)

Medium-term (Months 2-3)

  1. Complete Tier 2 and high-priority Tier 3 scraping
  2. Cross-reference with existing GLAM datasets (Wikidata, OSM)
  3. Enrich records with geocoding (lat/lon)
  4. Generate statistics and visualizations
  5. Publish dataset and documentation

Long-term (Months 4-6)

  1. Contact Tier 4 agencies for bulk data
  2. Monitor UK British Library for restoration
  3. Implement automated update pipeline
  4. Create public API for GLAM dataset
  5. Integrate with LinkML schema and RDF export

Change Log

Version  Date        Changes
1.0      2025-11-17  Initial scraper inventory created

End of Document