Global ISIL Registry Scraper Inventory
Version: 1.0
Date: 2025-11-17
Purpose: Technical inventory of data extraction methods for global ISIL registries
Status: Reconnaissance complete, extraction pending
Overview
This document provides detailed technical specifications for extracting heritage institution data from 46 ISIL registries worldwide. For each registry, we document:
- Access method (direct download, API, web scraping)
- Data format and structure
- Estimated record count
- Required tools and complexity
- Priority tier for extraction
Total Estimated Records: 50,000-100,000+ institutions globally
Extraction Priority Tiers
Tier 1: Direct Downloads (Easy Wins)
Large datasets with CSV/Excel/JSON downloads requiring minimal processing.
Tier 2: JSON/XML APIs
Programmatic access via REST APIs, SPARQL endpoints, or SRU interfaces.
Tier 3: Web Scraping Required
HTML-based search interfaces requiring BeautifulSoup/Selenium.
Tier 4: Contact Required
No public access; requires direct contact with registry agency.
Tier 1: Direct Downloads (9 Registries)
🇩🇰 Denmark - Danish Agency for Culture and Palaces
Priority: ⭐⭐⭐ HIGH
URL: https://vip.dbc.dk/librarylists/
Access Method: Direct CSV download (4 separate files)
Available Downloads:
- Folkebiblioteker (væsener) - Public libraries (main branches)
  - Download: https://vip.dbc.dk/librarylists/publiclibraries/download
  - Format: CSV
- Folkebiblioteker (alle betjeningssteder) - Public libraries (all service points)
  - Download: https://vip.dbc.dk/librarylists/publicbranches/download
  - Format: CSV
- Forskningsbiblioteker (væsener) - Research libraries (main branches)
  - Download: https://vip.dbc.dk/librarylists/ffulibraries/download
  - Format: CSV
- Forskningsbiblioteker (alle betjeningssteder) - Research libraries (all service points)
  - Download: https://vip.dbc.dk/librarylists/ffubranches/download
  - Format: CSV
Fields Available:
- Biblioteksnummer (Library number/ISIL)
- Navn (Name)
- Adresse (Address)
- Postnummer (Postal code)
- By (City)
- Telefon (Phone)
- Leder (Director/Manager)
Estimated Records: 700+ institutions
Data Format: CSV (UTF-8)
Required Tools: pandas, requests
Complexity: ⭐ EASY
Institution Types: Libraries only
Implementation Notes:
- 4 separate CSV files need to be merged
- May contain duplicate entries (main branches vs. service points)
- Danish text fields (no English translation)
- Direct downloads, no rate limiting concerns
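The merge-and-deduplicate step described in the notes above could be sketched as follows. This is a minimal standard-library sketch that operates on CSV text already downloaded. The `Biblioteksnummer` column name comes from the field list above, but the exact header spelling and the comma delimiter are assumptions to verify against the real files, and `merge_library_lists` is a hypothetical helper name.

```python
import csv
import io


def merge_library_lists(csv_texts, key="Biblioteksnummer", delimiter=","):
    """Merge several CSV documents, keeping the first row seen for each key.

    The key column name and the delimiter are assumptions; adjust them to
    the headers actually present in the downloaded files.
    """
    merged = {}
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        for row in reader:
            library_id = (row.get(key) or "").strip()
            if not library_id:
                continue  # Skip rows without a library number
            # setdefault keeps the earliest occurrence, so pass the
            # main-branch files before the service-point files
            merged.setdefault(library_id, row)
    return list(merged.values())
```

Passing the main-branch files first means a main-branch row wins whenever the same library number also appears in a service-point file.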
🇯🇵 Japan - National Diet Library
Priority: ⭐⭐⭐ HIGH
URL: http://www.ndl.go.jp/en/library/isil/index.html#isillist
Access Method: Direct CSV/Excel download (5 datasets)
Available Downloads (Updated 2025-11-05):
- Public Libraries
  - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.csv (1.1MB)
  - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.xlsx (492KB)
  - Estimated: 2,500-3,000 records
- NDL, University Libraries, Special Libraries
  - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.csv (554KB)
  - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.xlsx (288KB)
  - Estimated: 1,500-2,000 records
- Museums
  - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.csv (789KB)
  - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.xlsx (420KB)
  - Estimated: 2,000-2,500 records
- Archives
  - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.csv (20KB)
  - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.xlsx (20KB)
  - Estimated: 50-100 records
- Deleted ISIL
  - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.csv (18KB)
  - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.xlsx (19KB)
  - Estimated: 100-200 records
ISIL Structure:
- Country code: JP-
- Type identifier: 1 (Library), 2 (Museum), 3 (Archive)
- Institution ID: 6 digits (e.g., JP-1123456, JP-2234567, JP-3345678)
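The structure above (country prefix, one type digit, six-digit institution ID) can be checked and decoded with a small regular expression. The sketch below assumes exactly that layout; `classify_jp_isil` is a hypothetical helper, not part of any NDL tooling.

```python
import re

# Type digits as documented above: 1 = Library, 2 = Museum, 3 = Archive
JP_TYPES = {"1": "Library", "2": "Museum", "3": "Archive"}

# "JP-", one type digit, then a six-digit institution ID
JP_ISIL = re.compile(r"^JP-([123])(\d{6})$")


def classify_jp_isil(code):
    """Return the institution type for a Japanese ISIL, or None if malformed."""
    match = JP_ISIL.match(code.strip())
    if match is None:
        return None
    return JP_TYPES[match.group(1)]
```

Codes that return `None` are worth logging rather than dropping, since they may indicate a format the description above does not cover.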
Fields Available:
- ISIL
- Institution name (English)
- Institution name (Japanese)
- Institution name (Japanese kana)
- Postal code
- Address
- Telephone
- Fax
- URL
- Registration date (since April 2020)
- Last update date (since April 2020)
Estimated Total Records: 6,000+ institutions
Data Format: CSV and Excel (choose one)
Required Tools: pandas, requests, openpyxl (for Excel)
Complexity: ⭐ EASY
Institution Types: Libraries, Museums, Archives
Update Frequency: Regular (files dated 2025-11-05)
Implementation Notes:
- Very recent data (November 2025)
- English field names available
- 5 separate files need to be merged
- Deleted ISIL records useful for tracking institutional changes
- Public domain license (free to use)
🇳🇱 Netherlands - Koninklijke Bibliotheek (KB)
Priority: ⭐⭐ MEDIUM
URL: https://www.bibliotheeknetwerk.nl/landelijke-digitale-infrastructuur-ldi/isil-codes
Access Method: Direct Excel download
Status: ✅ ALREADY EXTRACTED (153 public libraries)
Available Download:
- Public Libraries List: Excel file from bibliotheeknetwerk.nl
- File: /data/isil/NL/kb_netherlands_public_libraries.{csv,json}
- Parser: /scripts/scrapers/parse_kb_netherlands_isil.py
Fields Available:
- Instellingsnaam (Institution name)
- Plaatsnaam (City name)
- ISIL code
Extracted Records: 153 institutions
Data Format: Excel (with header in row 3)
Complexity: ⭐ EASY
Institution Types: Public libraries only
Completion Date: 2025-11-17
Implementation Notes:
- Excel header in row 3 (not standard row 1)
- Parser already implemented and tested
- Limited coverage (public libraries only, no research/special libraries)
- May need additional source for complete Netherlands coverage
🇧🇾 Belarus - National Library of Belarus
Priority: ⭐ LOW (already extracted)
URL: https://nlb.by/en/for-librarians/international-standard-identifier-for-libraries-and-related-organizations-isil/list-of-libraries-organizations-of-the-republic-of-belarus-and-their-isil-codes
Status: ✅ ALREADY EXTRACTED (167 institutions)
Available Download:
- HTML table with all ISIL codes
- File: /data/isil/BY/belarus_isil_all.{csv,json}
- Parser: Already exists from previous session
Fields Extracted:
- ISIL code
- Institution name
- Region code (embedded in ISIL: HO, MI, VI, MA, HM, BR, HR)
Extracted Records: 167 institutions
Regional Breakdown:
- HO (Homyel): 29
- MI (Minsk): 26
- VI (Vitsyebsk): 25
- MA (Mahilyow): 25
- HM (Minsk City): 23
- BR (Brest): 20
- HR (Hrodna): 19
Complexity: ⭐ EASY
Institution Types: Libraries and related organizations
Completion Date: 2025-11-17
🇧🇪 Belgium - Royal Library of Belgium (KBR)
Priority: ⭐ LOW (already extracted)
URL: http://isil.kbr.be/
Status: ✅ ALREADY EXTRACTED (438 institutions)
Extracted Records: 438 institutions
Completion Date: Previous session
Institution Types: Libraries, archives, museums
🇨🇾 Cyprus - Cyprus University of Technology Library
Priority: ⭐ LOW
URL: http://library.cut.ac.cy/en/node/1194
Access Method: Direct list (HTML or downloadable file)
Estimated Records: 50-100 institutions
Complexity: ⭐ EASY
Institution Types: Libraries and archives
Implementation Notes:
- Small dataset, straightforward extraction
- Likely HTML table or PDF list
- Low priority due to small size
🇱🇺 Luxembourg - Archives nationales de Luxembourg
Priority: ⭐ LOW
URL: http://www.anlux.public.lu/fr/nous-connaitre/reseaux-internationaux/ISIL.html
Access Method: Direct list (HTML)
Estimated Records: 20-50 institutions
Complexity: ⭐ EASY
Institution Types: Archives and libraries
Implementation Notes:
- Very small dataset
- Likely single HTML page with table
- French language
🇭🇺 Hungary - National Széchényi Library
Priority: ⭐⭐ MEDIUM
URL: https://ki.oszk.hu/dokumentumtar/isil-azonositok
Access Method: Direct list (likely HTML or downloadable file)
Estimated Records: 200-500 institutions
Complexity: ⭐ EASY
Institution Types: Libraries
Implementation Notes:
- Medium-sized dataset
- Hungarian language interface
- Need to verify file format (HTML vs. downloadable)
🇮🇱 Israel - National Library of Israel
Priority: ⭐⭐ MEDIUM
URL: http://nli.org.il
Access Method: Direct list (location TBD)
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries and archives
Implementation Notes:
- Need to locate exact ISIL list URL on NLI website
- Likely Hebrew and English versions
- May require navigation through site structure
Tier 2: JSON/XML APIs (3 Registries)
🇩🇪 Germany - Staatsbibliothek zu Berlin (Sigel Database)
Priority: ⭐⭐⭐ HIGH
URL: https://sigel.staatsbibliothek-berlin.de/
Access Method: JSON-API, SRU, Linked Data Service
API Endpoints:
- JSON-API (Primary)
  - Base URL: https://sigel.staatsbibliothek-berlin.de/api/org.jsonld
  - Search: /api/org.jsonld?q={search_terms}
  - Single record: /api/org/{ISIL}.jsonld
  - Example: https://sigel.staatsbibliothek-berlin.de/api/org/DE-12.jsonld
- SRU (Secondary)
  - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/sru
  - Standard library catalog protocol
- Linked Data Service (RDF)
  - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/linked-data-service
  - RDF/Turtle format available
JSON-API Features:
- Pagination: size parameter (default: 10, max: TBD)
- Page selection: page parameter (default: 1)
- Compact format: compact=1 for PICA-Plus records
- JSONP support: callback parameter for cross-domain requests
- Search syntax: nam=keyword AND ort=city (field-specific search)
- URL encoding required: Use %3D for =, %20 for spaces
Search Fields:
- nam - Institution name
- ort - City/Location
- Additional fields documented in API docs
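The encoding rule above (`=` as `%3D`, spaces as `%20`) is easy to get wrong by hand, so a small URL builder may help. This sketch uses only the standard library; the `q`, `size`, and `page` parameter names come from the feature list above, while `build_search_url` itself is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld"


def build_search_url(query, size=10, page=1):
    """Build a search URL with '=' encoded as %3D and spaces as %20."""
    params = {"q": query, "size": size, "page": page}
    # quote_via=quote percent-encodes spaces as %20; the default
    # (quote_plus) would emit '+' instead
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("nam=keyword AND ort=city", size=100))
```

This prints `https://sigel.staatsbibliothek-berlin.de/api/org.jsonld?q=nam%3Dkeyword%20AND%20ort%3Dcity&size=100&page=1`.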
Response Structure:
{
  "id": "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld?q=...",
  "type": "Collection",
  "totalItems": 12345,
  "freetextQuery": "search terms",
  "member": [
    {
      "id": "https://sigel.staatsbibliothek-berlin.de/org/DE-12",
      "type": ["Library", "Museum", "Archive", "Other"],
      "identifier": "DE-12",
      "location": "Berlin",
      "addressRegion": "DE-BE",
      "seeAlso": "https://sigel.staatsbibliothek-berlin.de/org/DE-12.rdf",
      "data": { /* PICA-Plus record */ }
    }
  ],
  "view": {
    "id": "current page URL",
    "type": "PartialCollectionView",
    "first": "first page URL",
    "last": "last page URL",
    "next": "next page URL",
    "previous": "previous page URL",
    "totalItems": 10,
    "pageIndex": 1,
    "numberOfPages": 1235,
    "offset": 0,
    "limit": 10
  }
}
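A response shaped like the structure above can be reduced to flat rows before the PICA-Plus payload is parsed. The sketch below relies only on the keys shown in that structure; mapping `location` to `city` and `addressRegion` to `region` is an assumption about how the fields line up with the standard schema, and both function names are hypothetical.

```python
def flatten_members(response):
    """Reduce a search response to one flat dict per institution."""
    rows = []
    for member in response.get("member", []):
        types = member.get("type", [])
        if isinstance(types, str):
            types = [types]  # Tolerate a single string instead of a list
        rows.append({
            "isil_code": member.get("identifier"),
            "city": member.get("location"),
            "region": member.get("addressRegion"),
            "institution_type": ";".join(types),
            "source_url": member.get("id"),
        })
    return rows


def next_page_url(response):
    """Return the URL of the next page, or None when there is none."""
    return response.get("view", {}).get("next")
```

Following `view.next` until it is absent is an alternative to counting pages, and it keeps working if the page size is capped by the server.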
Estimated Records: 10,000+ institutions
Data Format: JSON-LD with PICA-Plus embedded
Required Tools: requests, json
Complexity: ⭐⭐ MEDIUM (pagination required)
Institution Types: Libraries, Archives, Museums, Other
Rate Limiting: Unknown (implement polite scraping with delays)
Implementation Strategy:
- Start with broad search (all institutions)
- Paginate through results using size=100 (test max size)
- Extract JSON-LD records
- Parse PICA-Plus data for additional fields
- Store in standardized CSV/JSON format
- Optionally fetch RDF for Linked Data integration
Implementation Notes:
- Very comprehensive dataset (~10,000 institutions)
- Well-documented JSON-API (beta status)
- Format change announced for September 2025 (check blog post)
- Linked Data Service provides RDF/Turtle for semantic web integration
- PICA-Plus format requires additional parsing
- Contact: Carsten Klee (carsten.klee@sbb.spk-berlin.de, +49 30 266 434402)
🇦🇹 Austria - Die Österreichische Bibliothekenverbund und Service GmbH
Priority: ⭐⭐ MEDIUM
URL: https://www.isil.at/
Access Method: Searchable interface (check for API)
Estimated Records: 500-1,000 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
Implementation Notes:
- Need to investigate if API exists
- May require web scraping if no API
- Austrian library network consortium
🇪🇬 Egypt - ENSTINET
Priority: ⭐ LOW
URL: http://cdex5.sti.sci.eg/dev60cgi/f60cgi?form=DEGL_WEB_SEARCH&userid=degl/degl@sti
Access Method: Search interface (unclear structure)
Estimated Records: 100-500 institutions
Complexity: ⭐⭐⭐ HARD (interface unclear)
Institution Types: Libraries and research centers
Implementation Notes:
- Legacy CGI interface (f60cgi)
- May require authentication (userid parameter)
- Unclear if publicly accessible
- Low priority due to complexity
Tier 3: Web Scraping Required (17 Registries)
🇺🇸 United States - Library of Congress
Priority: ⭐⭐⭐ HIGH
URL: http://www.loc.gov/marc/organizations/
Access Method: Search interface + individual record pages
Search URL: http://www.loc.gov/marc/organizations/org-search.php
Structure:
- Search interface: Keyword or code search
- Individual records: Separate pages per organization
- No bulk download: Must scrape search results + detail pages
- Pagination: Unknown (test with large result sets)
Code Structure:
- US codes: Geographic prefix + institution code (e.g., DLC, TxU, NNopo)
- ISIL format: US-{MARC_CODE} (e.g., US-DLC for Library of Congress)
- Non-US codes: Also included but with country prefixes
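The US mapping above is a plain prefix operation, sketched below. `marc_code_to_isil` is a hypothetical helper and covers US codes only; how non-US MARC codes map to ISILs is not specified here and is left out.

```python
def marc_code_to_isil(marc_code):
    """Prefix a US MARC organization code with 'US-' to form its ISIL."""
    code = marc_code.strip()
    if not code:
        raise ValueError("empty MARC organization code")
    if code.upper().startswith("US-"):
        # Already prefixed: normalize the prefix case, keep the code as-is
        return "US-" + code[3:]
    return "US-" + code
```

The code part is left in its original mixed case (for example `TxU`), since the examples above are case-sensitive.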
Fields Available:
- MARC organization code
- ISIL code (US- prefix added)
- Organization name
- Address (street, city, state, postal code)
- Country (for non-US organizations)
- Status (valid/obsolete)
Estimated Records: 37,000+ codes (33,000 valid, 4,000+ invalid/obsolete)
Geographic Coverage: Primarily USA, some non-US institutions
Data Format: HTML (requires scraping)
Required Tools: requests, BeautifulSoup4, json
Complexity: ⭐⭐⭐ HARD (pagination + detail pages)
Institution Types: Libraries, archives, museums, corporations
Implementation Strategy:
- Option A: Paginated search scraping
  - Submit broad search (empty query or "*")
  - Parse result pages
  - Extract codes and basic info
  - Follow links to detail pages for full address
- Option B: Systematic code generation
  - Generate all possible state + code combinations
  - Test each code via search
  - Extract valid records
  - (Likely inefficient - 50,000+ possible combinations)
- Option C: Contact LoC for bulk data
  - Email: ndmso@loc.gov
  - Request: Full MARC organization codes database
  - Justification: Academic research, cultural heritage mapping
Rate Limiting:
- No explicit rate limit mentioned
- Implement polite scraping: 1 request per 2 seconds
- Expect 24-48 hours for full extraction (37,000 records)
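The duration estimate above can be sanity-checked with simple arithmetic. The sketch assumes one detail-page request per code and strictly sequential requests; search-result pages, retries, and parsing time come on top, which is why the lower bound it prints sits below the 24-48 hour range.

```python
def estimate_hours(request_count, seconds_per_request):
    """Return the minimum wall-clock hours for sequential, throttled requests."""
    return request_count * seconds_per_request / 3600


detail_pages = 37_000  # One detail page per code, per the estimate above
print(round(estimate_hours(detail_pages, 2.0), 1))  # 20.6
```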
Implementation Notes:
- Largest ISIL registry (37,000+ codes)
- Historical codes preserved (non-unique obsolete codes)
- Non-US institutions included (subset of global coverage)
- MARC format requires understanding of subunit structure
- Contact option available if scraping blocked
Alternative Agencies:
- Canada: Library and Archives Canada (separate registry)
- UK: British Library (offline since October 2023 cyber attack)
- Germany: Staatsbibliothek zu Berlin (separate registry)
- Estonia: Eesti Rahvusraamatukogu (separate registry)
🇨🇦 Canada - Library and Archives Canada
Priority: ⭐⭐⭐ HIGH
URL: https://sigles-symbols.bac-lac.gc.ca/eng/Search
Access Method: Search interface
Estimated Records: 2,000-5,000 institutions
Complexity: ⭐⭐⭐ HARD
Institution Types: Libraries, archives
Implementation Notes:
- Bilingual (English/French)
- Search interface scraping required
- May have provincial breakdowns
- Test for bulk export option
🇦🇺 Australia - National Library of Australia
Priority: ⭐⭐ MEDIUM
URL: http://www.nla.gov.au/apps/ilrs
Access Method: Search interface
Estimated Records: 1,000-2,000 institutions
Complexity: ⭐⭐⭐ HARD
Institution Types: Libraries
Implementation Notes:
- ILRS = Interlibrary Loan and Resource Sharing system
- May require Selenium if JavaScript-heavy
- Check for CSV export option
🇨🇭 Switzerland - Swiss National Library
Priority: ⭐⭐ MEDIUM
URL: https://www.isil.nb.admin.ch/en/
Access Method: Searchable interface
Estimated Records: 500-1,000 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries, archives
Implementation Notes:
- Multilingual (German, French, Italian, English)
- Check for API or bulk download
- Swiss federal government site (.admin.ch domain)
🇨🇿 Czech Republic - National Library of the Czech Republic
Priority: ⭐⭐ MEDIUM
URL: https://aleph.nkp.cz/F/?func=file&file_name=find-b&CON_LNG=ENG&local_base=adr
Access Method: ALEPH library system search interface
Estimated Records: 500-1,000 institutions
Complexity: ⭐⭐⭐ HARD (ALEPH system)
Institution Types: Libraries
Implementation Notes:
- ALEPH system (library catalog software)
- Complex URL structure with session management
- May require cookie handling
- English interface available (CON_LNG=ENG)
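If the ALEPH interface does tie results to a session, cookies have to be carried from the first request to the following ones. The sketch below shows one way to do that with the standard library; whether this particular site needs it is still unverified, and `build_cookie_opener` is a hypothetical helper. With the `requests` library, a `requests.Session()` object plays the same role.

```python
import http.cookiejar
import urllib.request

USER_AGENT = "GLAM Heritage Institution Scraper/1.0 (Academic Research)"


def build_cookie_opener():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Sent with every request made through this opener
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar
```

Every `opener.open(url)` call made through the same opener then reuses whatever cookies earlier responses set, so the search page should be fetched first and result pages afterwards.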
🇫🇷 France - Agence Bibliographique de l'Enseignement Superieur (ABES)
Priority: ⭐⭐⭐ HIGH
URL: http://www.sudoc.abes.fr/
Access Method: SUDOC catalog search
Status: ⚠️ URLs in official directory may be outdated (404 errors)
Alternative: https://www.abes.fr/ (ABES main site)
Estimated Records: 3,000+ institutions (university and research libraries)
Complexity: ⭐⭐⭐ HARD
Institution Types: Academic libraries, research centers
Implementation Notes:
- SUDOC = Système Universitaire de Documentation
- Covers French higher education libraries
- May have API (check ABES documentation)
- French language interface
- Need to verify correct ISIL list URL
🇮🇹 Italy - Istituto Centrale per il Catalogo Unico (ICCU)
Priority: ⭐⭐⭐ HIGH
URL: http://anagrafe.iccu.sbn.it/
Access Method: Anagrafe search interface
Estimated Records: 5,000+ institutions
Complexity: ⭐⭐⭐ HARD
Institution Types: Libraries, archives
Implementation Notes:
- Anagrafe = registry/directory
- SBN = Servizio Bibliotecario Nazionale (National Library Service)
- Large dataset covering Italian libraries
- May require Italian language parsing
🇧🇬 Bulgaria - National Library of Bulgaria
Priority: ⭐ LOW
URL: http://www.nationallibrary.bg/wp/?page_id=5686
Access Method: Search interface
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
🇫🇮 Finland - National Library of Finland
Priority: ⭐⭐ MEDIUM
URL: http://isil.kansalliskirjasto.fi/en/
Access Method: Search interface
Estimated Records: 500-1,000 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
Implementation Notes:
- English interface available
- Check for API or bulk download option
🇭🇷 Croatia - Nacionalna i sveučilišna knjižnica u Zagrebu
Priority: ⭐ LOW
URL: https://nsk.hr/en/
Access Method: ISIL list mentioned but location unclear
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
Implementation Notes:
- Need to locate ISIL list on NSK website
- Croatian language primary, English available
🇳🇴 Norway - National Library of Norway
Priority: ⭐⭐ MEDIUM
URL: http://www.nb.no/baser/bibliotek/english.html
Access Method: Search interface
Estimated Records: 500-1,000 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
🇳🇿 New Zealand - National Library of New Zealand
Priority: ⭐ LOW
URL: http://natlib.govt.nz/
Access Method: Search mentioned but location unclear
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
Implementation Notes:
- Need to locate ISIL search on NLNZ website
🇶🇦 Qatar - Qatar National Library
Priority: ⭐ LOW
URL: http://www.qnl.qa/find-answers/other-libraries/
Access Method: Search interface
Estimated Records: 20-50 institutions
Complexity: ⭐ EASY
Institution Types: Libraries
Implementation Notes:
- Small dataset
- English and Arabic interfaces
🇷🇴 Romania - National Library of Romania
Priority: ⭐ LOW
URL: https://bibnat.ro/
Access Method: Search interface (location unclear)
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
🇷🇺 Russia - Russian National Public Library for Science and Technology
Priority: ⭐⭐ MEDIUM
URL: http://www.gpntb.ru/rfid1/searchdb.php
Access Method: Search interface
Estimated Records: 1,000-2,000 institutions
Complexity: ⭐⭐⭐ HARD
Institution Types: Libraries, scientific institutions
Implementation Notes:
- Russian language interface
- RFID-related URL suggests technical database
- May require Cyrillic text handling
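Cyrillic handling usually comes down to picking the right byte encoding. The sketch below tries a short list of candidates in order; the list itself (UTF-8, Windows-1251, KOI8-R) is an assumption about what this site might serve, not something confirmed for it, and `decode_cyrillic` is a hypothetical helper.

```python
def decode_cyrillic(raw, encodings=("utf-8", "cp1251", "koi8-r")):
    """Decode bytes by trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes
    return raw.decode("utf-8", errors="replace")
```

Single-byte encodings rarely raise errors, so a KOI8-R page would decode as Windows-1251 without complaint and come out garbled. When the server declares a charset in its headers or in the page itself, that declaration should take precedence over this guesswork.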
🇸🇮 Slovenia - National and University Library
Priority: ⭐ LOW
URL: https://www.nuk.uni-lj.si/eng/node/543
Access Method: ISIL list mentioned
Estimated Records: 100-200 institutions
Complexity: ⭐ EASY
Institution Types: Libraries
🇸🇰 Slovakia - Slovak National Library
Priority: ⭐ LOW
URL: http://www.snk.sk/en/information-for/libraries-and-librarians/isil.html
Access Method: ISIL list mentioned
Estimated Records: 200-500 institutions
Complexity: ⭐⭐ MEDIUM
Institution Types: Libraries
Tier 4: Contact Required (8 Agencies)
🇦🇷 Argentina - IRAM
Priority: ⭐ LOW
URL: http://www.iram.org.ar/
Status: No public ISIL list available
Estimated Records: 500-1,000 institutions
Contact Required: Yes
Institution Types: Libraries, archives
Implementation Notes:
- IRAM = Argentine Standardization and Certification Institute
- May provide list upon request
- Spanish language
🇬🇱 Greenland - Central and Public Library of Greenland
Priority: ⭐ LOW
URL: https://www.katak.gl/
Status: No public ISIL list
Estimated Records: 5-10 institutions
Contact Required: Yes
Institution Types: Libraries
🇮🇷 Iran - National Library and Archives of Iran
Priority: ⭐ LOW
Status: No URL provided, no public list
Estimated Records: 100-500 institutions
Contact Required: Yes
Institution Types: Libraries, archives
Implementation Notes:
- May require Persian language communication
- Political/access challenges possible
🇰🇿 Kazakhstan - National Library of Kazakhstan
Priority: ⭐ LOW
URL: https://www.nlrk.kz/index.php?lang=en
Status: No public ISIL list
Estimated Records: 100-300 institutions
Contact Required: Yes
Institution Types: Libraries
🇲🇩 Moldova - National Library of Moldova
Priority: ⭐ LOW
URL: http://bnrm.md/
Status: No public ISIL list
Estimated Records: 50-100 institutions
Contact Required: Yes
Institution Types: Libraries
🇳🇵 Nepal
Priority: ⭐ LOW
Status: No agency listed in official ISIL directory
Estimated Records: Unknown
Contact Required: Yes - contact Danish Agency for information
Institution Types: Unknown
🇰🇷 South Korea - National Library of Korea
Priority: ⭐⭐ MEDIUM
URL: https://www.nl.go.kr/
Status: Main NLK site doesn't list ISIL directory
Alternative Registry: https://librarian.nl.go.kr/ (discovered via Exa search)
Estimated Records: 1,000-2,000 institutions
Contact Required: Investigate librarian.nl.go.kr before contacting
Institution Types: Libraries
Implementation Notes:
- Secondary registry exists at librarian subdomain
- Need to verify if this is official ISIL registry
- Korean language interface
- May have English option
🇬🇧 United Kingdom - British Library
Priority: ⭐⭐⭐ HIGH (when restored)
URL: https://www.bl.uk/help/get-a-marc-organization-code (currently 404)
Status: ⚠️ OFFLINE - Cyber attack October 2023
Estimated Records: 5,000+ institutions
Contact Required: Wait for service restoration
Institution Types: Libraries, archives, museums
Implementation Notes:
- British Library systems offline since October 2023 cyber attack
- ISIL/MARC code directory unavailable
- Monitor BL announcements for restoration
- Largest UK heritage collection metadata
- Critical for UK coverage but currently blocked
Non-National Registries (5 Agencies)
EUR - European Organizations
Priority: ⭐ LOW
URL: https://www.eui.eu/Documents/Research/HistoricalArchivesofEU/ISIL/isil-directory.pdf
Access Method: Direct PDF download
Estimated Records: 20-50 institutions
Complexity: ⭐ EASY
Institution Types: European Union organizations
Implementation Notes:
- PDF format requires text extraction
- Covers EU institutional libraries
- Small specialized dataset
GTB - GTBib (Spanish-language interlibrary loan)
Priority: ⭐ LOW
URL: https://directorio.gtbib.net/lista_isil
Access Method: Direct list
Estimated Records: 100-300 institutions
Complexity: ⭐ EASY
Institution Types: Spanish-language libraries
Implementation Notes:
- Spanish language
- Interlibrary loan network
- May overlap with national registries (ES, MX, AR, etc.)
O - OCLC (RFID tags)
Priority: ⭐ LOW
URL: http://www.oclc.org/contacts/libraries/
Access Method: Search interface
Estimated Records: Special encoding system (not traditional ISIL)
Complexity: ⭐⭐ MEDIUM
Institution Types: OCLC member libraries
Implementation Notes:
- Letter "O" used for OCLC RFID technical encoding
- Not a traditional national registry
- Overlaps with US and other national codes
OCLC - OCLC WorldCat Symbol
Priority: ⭐⭐ MEDIUM
URL: http://www.oclc.org/contacts/libraries/
Access Method: Search interface
Estimated Records: 10,000+ institutions (global)
Complexity: ⭐⭐⭐ HARD
Institution Types: WorldCat member libraries
Implementation Notes:
- OCLC WorldCat symbols used globally
- Different from ISIL but related
- Massive dataset, may overlap with national registries
- Useful for cross-referencing and enrichment
ZDB - Zeitschriftendatenbank (German Serials Database)
Priority: ⭐⭐ MEDIUM
URL: https://zdb-katalog.de/index.xhtml
Access Method: Search interface
Estimated Records: 3,000+ institutions
Complexity: ⭐⭐⭐ HARD
Institution Types: Libraries holding serials
Implementation Notes:
- German serials/journals database
- Managed by Staatsbibliothek zu Berlin
- May have API (check ZDB documentation)
- Overlaps with DE- national codes
Extraction Recommendations
Phase 1: Quick Wins (Tier 1 Downloads)
Timeline: 1-2 days
Expected Yield: 7,000-8,000 records
- Denmark - 4 CSV downloads (700 records)
- Japan - 5 CSV/Excel downloads (6,000+ records)
- Hungary - Direct list (200-500 records)
- Israel - Direct list (200-500 records)
- Cyprus - Direct list (50-100 records)
- Luxembourg - Direct list (20-50 records)
Tools Required: pandas, requests, openpyxl
Risk: Low
Effort: Low
Phase 2: Major APIs (Tier 2)
Timeline: 3-5 days
Expected Yield: 10,000-12,000 records
- Germany - JSON-API (10,000+ records)
- Austria - Search/API (500-1,000 records)
Tools Required: requests, json, pagination logic
Risk: Medium (rate limiting, API changes)
Effort: Medium
Phase 3: High-Value Scraping (Tier 3 Priority)
Timeline: 2-3 weeks
Expected Yield: 50,000+ records
- USA - Library of Congress (37,000+ records) - PRIORITY
- Canada - Library and Archives Canada (2,000-5,000 records)
- France - SUDOC (3,000+ records)
- Italy - ICCU Anagrafe (5,000+ records)
Tools Required: requests, BeautifulSoup4, Selenium (if needed)
Risk: High (blocking, CAPTCHA, rate limiting)
Effort: High
Phase 4: Remaining Scraping (Tier 3 Secondary)
Timeline: 2-4 weeks
Expected Yield: 5,000-11,000 records (sum of the per-registry estimates above)
- Australia, Switzerland, Czech Republic, Finland, Norway, Bulgaria, Croatia, New Zealand, Qatar, Romania, Russia, Slovenia, Slovakia
Tools Required: Same as Phase 3
Risk: Medium-High
Effort: Medium-High
Phase 5: Contact-Based (Tier 4)
Timeline: Ongoing (3-6 months for responses)
Expected Yield: 5,000-10,000 records
- UK - Wait for British Library restoration (5,000+ records)
- South Korea - Investigate librarian.nl.go.kr (1,000-2,000 records)
- Argentina, Greenland, Iran, Kazakhstan, Moldova, Nepal - Contact agencies
Tools Required: Email templates, institutional affiliation
Risk: High (may not receive data)
Effort: Low (but long wait times)
Technical Implementation Guide
Standard Parser Template
import io
import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup


class ISILRegistryScraper:
    def __init__(self, country_code, registry_name, base_url):
        self.country_code = country_code
        self.registry_name = registry_name
        self.base_url = base_url
        self.headers = {
            'User-Agent': 'GLAM Heritage Institution Scraper/1.0 (Academic Research)'
        }
        self.records = []

    def fetch_csv(self, url):
        """Fetch CSV file from direct download URL"""
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        # io.StringIO lets pandas read the downloaded text like a file
        return pd.read_csv(io.StringIO(response.text))

    def fetch_json_api(self, url, params=None):
        """Fetch one page of results from a JSON API"""
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        return response.json()

    def scrape_html_table(self, url):
        """Scrape HTML table from webpage"""
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Parse table logic here
        return soup

    def standardize_record(self, raw_record):
        """Convert raw record to standard format"""
        return {
            'isil_code': raw_record.get('isil_code'),
            'name': raw_record.get('name'),
            'city': raw_record.get('city'),
            'region': raw_record.get('region'),
            'country': self.country_code,
            'postal_code': raw_record.get('postal_code'),
            'address': raw_record.get('address'),
            'phone': raw_record.get('phone'),
            'email': raw_record.get('email'),
            'url': raw_record.get('url'),
            'institution_type': raw_record.get('type'),
            'registry': self.registry_name,
            'source_url': self.base_url,
            'extraction_date': datetime.now().isoformat()
        }

    def export_csv(self, output_path):
        """Export records to CSV"""
        df = pd.DataFrame(self.records)
        df.to_csv(output_path, index=False, encoding='utf-8')

    def export_json(self, output_path):
        """Export records to JSON with metadata"""
        output = {
            'metadata': {
                'country': self.country_code,
                'registry': self.registry_name,
                'source_url': self.base_url,
                'extraction_date': datetime.now().isoformat(),
                'record_count': len(self.records)
            },
            'records': self.records
        }
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output, f, indent=2, ensure_ascii=False)
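`standardize_record` expects keys such as `isil_code` and `name`, while registries publish their own column names, so each parser needs a renaming step first. The sketch below shows one way to express it, using the Danish field names listed in the Denmark section. The exact CSV header spelling is an assumption, as is whether `Biblioteksnummer` already carries the `DK-` prefix; `rename_fields` and `DENMARK_FIELD_MAP` are hypothetical names.

```python
# Danish source column -> standard schema key. Source names are taken from
# the Denmark field list; the exact CSV header spelling is an assumption.
DENMARK_FIELD_MAP = {
    "Biblioteksnummer": "isil_code",
    "Navn": "name",
    "Adresse": "address",
    "Postnummer": "postal_code",
    "By": "city",
    "Telefon": "phone",
}


def rename_fields(raw_row, field_map):
    """Return a dict keyed by standard names; unmapped or empty columns are dropped."""
    return {
        standard: raw_row[source].strip()
        for source, standard in field_map.items()
        if raw_row.get(source)
    }
```

The renamed dict can then be passed to `standardize_record`, which fills any missing keys with `None`.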
Rate Limiting Best Practices
import time


class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None  # Monotonic timestamp of the previous request

    def wait(self):
        """Wait if necessary to respect rate limit"""
        if self.last_request is not None:
            # time.monotonic() is unaffected by system clock adjustments
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()


# Usage (total_pages, fetch_page and process_data are placeholders
# to be supplied by the individual scraper)
limiter = RateLimiter(requests_per_second=1)
for page in range(1, total_pages + 1):
    limiter.wait()
    data = fetch_page(page)
    process_data(data)
Pagination Handler
import time

import requests


def paginate_api(base_url, page_size=100, max_pages=None):
    """Generic pagination handler for APIs"""
    page = 1
    all_records = []
    while True:
        params = {'page': page, 'size': page_size}
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
        data = response.json()
        records = data.get('member', [])
        if not records:
            break
        all_records.extend(records)
        # Check for last page; if the page count is missing, keep going
        # until a page comes back empty
        total_pages = data.get('view', {}).get('numberOfPages')
        if total_pages is not None and page >= total_pages:
            break
        if max_pages and page >= max_pages:
            break
        page += 1
        time.sleep(1)  # Polite scraping
    return all_records
Required Python Packages
# Core dependencies
pip install pandas requests beautifulsoup4 openpyxl lxml
# For complex scraping
pip install selenium webdriver-manager
# For API interactions
pip install urllib3
# For JSON/CSV handling
pip install jsonlines
# For progress tracking
pip install tqdm
# For parallel processing (optional)
pip install joblib
Output Format Specification
CSV Output Schema
isil_code,name,city,region,country,postal_code,address,phone,email,url,institution_type,registry,source_url,extraction_date
JP-1001234,"Tokyo Metropolitan Library","Tokyo","13","JP","100-0001","1-1 Minami-Osawa, Hachioji-shi","+81-42-123-4567","info@library.metro.tokyo.jp","https://www.library.metro.tokyo.jp","LIBRARY","National Diet Library","https://www.ndl.go.jp/en/library/isil/","2025-11-17T10:30:00Z"
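Writing this schema with a fixed column order keeps files from different registries comparable. The sketch below uses the standard library `csv` module and the column names from the header row above; `records_to_csv` is a hypothetical helper. It quotes only fields that need it, so its output is equivalent to the sample row above but not byte-identical to it.

```python
import csv
import io

COLUMNS = [
    "isil_code", "name", "city", "region", "country", "postal_code",
    "address", "phone", "email", "url", "institution_type", "registry",
    "source_url", "extraction_date",
]


def records_to_csv(records):
    """Serialize record dicts to CSV text in the schema's column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        extrasaction="ignore",  # Drop keys outside the schema (e.g. fax)
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # Missing keys become empty cells
    return buffer.getvalue()
```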
JSON Output Schema
{
  "metadata": {
    "country": "JP",
    "country_name": "Japan",
    "registry": "National Diet Library",
    "registry_agency": "National Diet Library (NDL)",
    "source_url": "https://www.ndl.go.jp/en/library/isil/",
    "extraction_date": "2025-11-17T10:30:00Z",
    "extraction_method": "Direct CSV download",
    "record_count": 6234,
    "institution_types": ["LIBRARY", "MUSEUM", "ARCHIVE"],
    "data_license": "Public Domain (per NDL)",
    "notes": "Includes public libraries, university libraries, museums, and archives"
  },
  "records": [
    {
      "isil_code": "JP-1001234",
      "name": "Tokyo Metropolitan Library",
      "name_japanese": "東京都立図書館",
      "name_kana": "とうきょうとりつとしょかん",
      "city": "Tokyo",
      "region": "13",
      "country": "JP",
      "postal_code": "100-0001",
      "address": "1-1 Minami-Osawa, Hachioji-shi, Tokyo",
      "phone": "+81-42-123-4567",
      "fax": "+81-42-123-4568",
      "email": "info@library.metro.tokyo.jp",
      "url": "https://www.library.metro.tokyo.jp",
      "institution_type": "LIBRARY",
      "registration_date": "2011-10-15",
      "last_update_date": "2024-03-20",
      "registry": "National Diet Library",
      "source_url": "https://www.ndl.go.jp/en/library/isil/",
      "extraction_date": "2025-11-17T10:30:00Z"
    }
  ]
}
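A minimal sketch of how the metadata envelope could be assembled so that `record_count` and `institution_types` are derived from the records rather than typed by hand. The function name `build_output` and its parameters are hypothetical, and only a subset of the metadata keys shown above is filled in.

```python
import json
from datetime import datetime, timezone

def build_output(records, country, registry, source_url, extraction_method):
    """Wrap extracted records in a metadata envelope (subset of the schema)."""
    extraction_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "metadata": {
            "country": country,
            "registry": registry,
            "source_url": source_url,
            "extraction_date": extraction_date,
            "extraction_method": extraction_method,
            # Derived values, so they cannot drift from the actual data
            "record_count": len(records),
            "institution_types": sorted(
                {r["institution_type"] for r in records if r.get("institution_type")}
            ),
        },
        "records": records,
    }

def to_json(document):
    """Serialise with ensure_ascii=False so native-script names stay readable."""
    return json.dumps(document, ensure_ascii=False, indent=2)
```

Keeping `ensure_ascii=False` matters for fields such as `name_japanese`, which would otherwise be written as `\uXXXX` escapes.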
Quality Assurance Checklist
Pre-Extraction
- Verify source URL is accessible
- Check for API documentation
- Test sample queries/downloads
- Review robots.txt and terms of service
- Implement rate limiting (1 req/sec default)
- Set up error logging
During Extraction
- Monitor for HTTP errors (429, 403, 503)
- Log failed requests for retry
- Validate ISIL code format (country code + structure)
- Check for duplicate records
- Handle encoding issues (UTF-8)
- Track progress (records extracted / estimated total)
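A hedged sketch of the error monitoring and retry logging described above: 429 and 503 are retried with exponential backoff, while 403 is logged and abandoned because retrying is unlikely to help. The `fetch` callable is an assumption of this sketch (any function returning a `(status_code, body)` pair), which keeps the retry logic independent of the HTTP library.

```python
import time

RETRYABLE_STATUS = {429, 503}

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry on 429/503 with exponential backoff.

    Returns (body, failures): body is None if every attempt failed, and
    failures lists (url, status) for each unsuccessful attempt so that
    it can be written to the retry log.
    """
    failures = []
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body, failures
        failures.append((url, status))
        if status not in RETRYABLE_STATUS:
            break  # e.g. 403: do not hammer a server that refuses access
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None, failures
```

With `requests`, a suitable `fetch` could be a small wrapper that calls `requests.get(url, timeout=30)` and returns `(response.status_code, response.text)`.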
Post-Extraction
- Validate record count against estimate
- Check for required fields (ISIL code, name, country)
- Detect malformed ISIL codes
- Cross-reference with existing datasets
- Generate extraction report (metadata)
- Export to CSV and JSON formats
- Document any issues or anomalies
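Several of the post-extraction checks can be combined into one small report, sketched below. The function name `qa_report` and the report keys are hypothetical; the required fields are the three named in the checklist.

```python
from collections import Counter

REQUIRED_FIELDS = ("isil_code", "name", "country")

def qa_report(records, estimated_total=None):
    """Summarise required-field gaps, duplicate ISIL codes and count vs. estimate."""
    # (record index, field name) for every missing or empty required field
    missing = [
        (index, field)
        for index, record in enumerate(records)
        for field in REQUIRED_FIELDS
        if not record.get(field)
    ]
    counts = Counter(r["isil_code"] for r in records if r.get("isil_code"))
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    report = {
        "record_count": len(records),
        "missing_required": missing,
        "duplicate_isil_codes": duplicates,
    }
    if estimated_total:
        # Share of the estimated registry size that was actually extracted
        report["coverage"] = len(records) / estimated_total
    return report
```

A coverage value far below 1.0 usually points at truncated pagination, while a value well above it suggests duplicates or an outdated estimate.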
Data Quality Metrics
Record Completeness Score
| Field | Weight | Required |
|---|---|---|
| ISIL code | 20% | Yes |
| Name | 20% | Yes |
| City | 15% | No |
| Country | 10% | Yes |
| Postal code | 5% | No |
| Address | 10% | No |
| Phone | 5% | No |
| Email | 5% | No |
| URL | 10% | No |
Completeness = (Sum of weights of the fields present in the record) / (Total weight of all fields, i.e. 100%)
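As a worked example, the score can be computed as below. The dictionary keys are assumed to match the column names of the CSV Output Schema, and `completeness_score` is an illustrative name. A record carrying only the three required fields scores (20 + 20 + 10) / 100 = 0.5.

```python
# Weights in percent, keyed by the CSV schema column names.
FIELD_WEIGHTS = {
    "isil_code": 20,
    "name": 20,
    "city": 15,
    "country": 10,
    "postal_code": 5,
    "address": 10,
    "phone": 5,
    "email": 5,
    "url": 10,
}

def completeness_score(record):
    """Return the weighted share (0.0 to 1.0) of fields that are present.

    A field counts as present when its value is non-empty after
    stripping whitespace.
    """
    present = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if str(record.get(field) or "").strip()
    )
    return present / sum(FIELD_WEIGHTS.values())

# Only the required fields are filled in
minimal = {"isil_code": "JP-1001234", "name": "Tokyo Metropolitan Library", "country": "JP"}
print(completeness_score(minimal))  # 0.5
```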
ISIL Code Validation
import re

def validate_isil_code(isil_code, country_code):
    """Validate ISIL code format. Returns (is_valid, message)."""
    # Basic format: CC-XXXXX (country code + hyphen + institution code).
    # ISO 15511 also permits "/", "-" and ":" in the institution code.
    pattern = rf'{re.escape(country_code)}-[A-Za-z0-9:/-]+'
    if not re.fullmatch(pattern, isil_code):
        return False, "Invalid format"
    # Check total length: at least "CC-X" (4), at most 16 per ISO 15511
    if len(isil_code) < 4 or len(isil_code) > 16:
        return False, "Invalid length"
    return True, "Valid"

# Example
validate_isil_code("JP-1001234", "JP")  # (True, "Valid")
validate_isil_code("INVALID", "JP")     # (False, "Invalid format")
Contact Information for Support
ISIL Registration Authority (Denmark)
- Organization: Danish Agency for Culture and Palaces
- Email: ISIL@slks.dk
- Website: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil
Germany (Staatsbibliothek zu Berlin)
- Contact: Carsten Klee
- Email: carsten.klee@sbb.spk-berlin.de
- Phone: +49 30 266 434402
- Topic: JSON-API questions, ISIL allocation
USA (Library of Congress)
- Organization: Network Development and MARC Standards Office
- Email: ndmso@loc.gov
- Fax: +1-202-707-0115
- Mailing Address: 101 Independence Avenue, S.E., Washington, D.C. 20540-4402
Next Steps for Implementation
Immediate Actions (Week 1)
- ✅ Complete reconnaissance (DONE)
- Create extraction scripts for Denmark (Tier 1)
- Create extraction scripts for Japan (Tier 1)
- Test Germany JSON-API (Tier 2)
- Document findings and update inventory
Short-term (Weeks 2-4)
- Extract all Tier 1 registries (7,000-8,000 records)
- Implement Germany API scraper (10,000 records)
- Begin USA Library of Congress scraper (37,000 records)
- Set up quality assurance pipeline
- Create unified dataset (CSV + JSON)
Medium-term (Months 2-3)
- Complete Tier 2 and high-priority Tier 3 scraping
- Cross-reference with existing GLAM datasets (Wikidata, OSM)
- Enrich records with geocoding (lat/lon)
- Generate statistics and visualizations
- Publish dataset and documentation
Long-term (Months 4-6)
- Contact Tier 4 agencies for bulk data
- Monitor UK British Library for restoration
- Implement automated update pipeline
- Create public API for GLAM dataset
- Integrate with LinkML schema and RDF export
Change Log
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-11-17 | Initial scraper inventory created |
References
- ISO 15511:2019: International Standard Identifier for Libraries and Related Organizations
- ISIL Registration Authority: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil
- GLAM Project Documentation: /docs/AGENTS.md, /data/isil/GLOBAL_ISIL_AGENCIES_OFFICIAL.md
- Session Summary: /docs/sessions/SESSION_2025-11-17_ISIL_GLOBAL_MAPPING.md
End of Document