# Global ISIL Registry Scraper Inventory

**Version**: 1.0
**Date**: 2025-11-17
**Purpose**: Technical inventory of data extraction methods for global ISIL registries
**Status**: Reconnaissance complete, extraction pending

---

## Overview

This document provides detailed technical specifications for extracting heritage institution data from 46 ISIL registries worldwide. For each registry, we document:

- Access method (direct download, API, web scraping)
- Data format and structure
- Estimated record count
- Required tools and complexity
- Priority tier for extraction

**Total Estimated Records**: 50,000-100,000+ institutions globally

---

## Extraction Priority Tiers

### Tier 1: Direct Downloads (Easy Wins)
Large datasets with CSV/Excel/JSON downloads requiring minimal processing.

### Tier 2: JSON/XML APIs
Programmatic access via REST APIs, SPARQL endpoints, or SRU interfaces.

### Tier 3: Web Scraping Required
HTML-based search interfaces requiring BeautifulSoup/Selenium.

### Tier 4: Contact Required
No public access; requires direct contact with registry agency.

---

## Tier 1: Direct Downloads (9 Registries)

### 🇩🇰 Denmark - Danish Agency for Culture and Palaces

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://vip.dbc.dk/librarylists/
**Access Method**: Direct CSV download (4 separate files)

**Available Downloads**:

1. **Folkebiblioteker (væsener)** - Public libraries (main branches)
   - Download: https://vip.dbc.dk/librarylists/publiclibraries/download
   - Format: CSV
2. **Folkebiblioteker (alle betjeningssteder)** - Public libraries (all service points)
   - Download: https://vip.dbc.dk/librarylists/publicbranches/download
   - Format: CSV
3. **Forskningsbiblioteker (væsener)** - Research libraries (main branches)
   - Download: https://vip.dbc.dk/librarylists/ffulibraries/download
   - Format: CSV
4. **Forskningsbiblioteker (alle betjeningssteder)** - Research libraries (all service points)
   - Download: https://vip.dbc.dk/librarylists/ffubranches/download
   - Format: CSV

**Fields Available**:
- Biblioteksnummer (Library number/ISIL)
- Navn (Name)
- Adresse (Address)
- Postnummer (Postal code)
- By (City)
- Telefon (Phone)
- Email
- Leder (Director/Manager)

**Estimated Records**: 700+ institutions
**Data Format**: CSV (UTF-8)
**Required Tools**: `pandas`, `requests`
**Complexity**: ⭐ EASY
**Institution Types**: Libraries only

**Implementation Notes**:
- 4 separate CSV files need to be merged
- May contain duplicate entries (main branches vs. service points)
- Danish text fields (no English translation)
- Direct downloads, no rate limiting concerns

---

### 🇯🇵 Japan - National Diet Library

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.ndl.go.jp/en/library/isil/index.html#isillist
**Access Method**: Direct CSV/Excel download (5 datasets)

**Available Downloads** (Updated 2025-11-05):

1. **Public Libraries**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.csv (1.1MB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.xlsx (492KB)
   - Estimated: 2,500-3,000 records
2. **NDL, University Libraries, Special Libraries**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.csv (554KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.xlsx (288KB)
   - Estimated: 1,500-2,000 records
3. **Museums**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.csv (789KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.xlsx (420KB)
   - Estimated: 2,000-2,500 records
4. **Archives**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.csv (20KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.xlsx (20KB)
   - Estimated: 50-100 records
5. **Deleted ISIL**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.csv (18KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.xlsx (19KB)
   - Estimated: 100-200 records

**ISIL Structure**:
- Country code: `JP-`
- Type identifier: `1` (Library), `2` (Museum), `3` (Archive)
- Institution ID: 6 digits (e.g., `JP-1123456`, `JP-2234567`, `JP-3345678`)

**Fields Available**:
- ISIL
- Institution name (English)
- Institution name (Japanese)
- Institution name (Japanese kana)
- Postal code
- Address
- Telephone
- Fax
- URL
- Registration date (since April 2020)
- Last update date (since April 2020)

**Estimated Total Records**: 6,000+ institutions
**Data Format**: CSV and Excel (choose one)
**Required Tools**: `pandas`, `requests`, `openpyxl` (for Excel)
**Complexity**: ⭐ EASY
**Institution Types**: Libraries, Museums, Archives
**Update Frequency**: Regular (files dated 2025-11-05)

**Implementation Notes**:
- Very recent data (November 2025)
- English field names available
- 5 separate files need to be merged
- Deleted ISIL records useful for tracking institutional changes
- Public domain license (free to use)

---

### 🇳🇱 Netherlands - Koninklijke Bibliotheek (KB)

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.bibliotheeknetwerk.nl/landelijke-digitale-infrastructuur-ldi/isil-codes
**Access Method**: Direct Excel download
**Status**: ✅ **ALREADY EXTRACTED** (153 public libraries)

**Available Download**:
- **Public Libraries List**: Excel file from bibliotheeknetwerk.nl
- **File**: `/data/isil/NL/kb_netherlands_public_libraries.{csv,json}`
- **Parser**: `/scripts/scrapers/parse_kb_netherlands_isil.py`
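
The extracted Dutch list could be mapped onto the inventory's standard record fields roughly as follows. This is a minimal sketch, not the existing parser: it assumes the extracted CSV keeps the Dutch column headers of the source Excel file (`Instellingsnaam`, `Plaatsnaam`, `ISIL code`), and the helper name `load_nl_records` and the sample rows are hypothetical.

```python
import csv
import io

# Assumed column headers, taken from the source Excel file's field names;
# the extracted CSV may use different names.
NAME_COL = "Instellingsnaam"
CITY_COL = "Plaatsnaam"
ISIL_COL = "ISIL code"


def load_nl_records(csv_text):
    """Map rows of the Dutch public-library list onto standard record fields."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        isil = (row.get(ISIL_COL) or "").strip()
        if not isil.startswith("NL-"):
            # Skip blank or malformed rows rather than guessing a code.
            continue
        records.append({
            "isil_code": isil,
            "name": (row.get(NAME_COL) or "").strip(),
            "city": (row.get(CITY_COL) or "").strip(),
            "country": "NL",
        })
    return records


# Hypothetical sample rows for illustration only (not real registry entries).
SAMPLE = (
    "Instellingsnaam,Plaatsnaam,ISIL code\n"
    "Voorbeeld Bibliotheek,Voorbeeldstad,NL-0000000001\n"
    "Lege rij,,\n"
)

if __name__ == "__main__":
    for record in load_nl_records(SAMPLE):
        print(record)
```

Rows without an `NL-` code are dropped instead of repaired, so a mismatch between the loaded count and the 153 extracted institutions would point to a header or formatting problem worth checking by hand.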
**Fields Available**:
- Instellingsnaam (Institution name)
- Plaatsnaam (City name)
- ISIL code

**Extracted Records**: 153 institutions
**Data Format**: Excel (with header in row 3)
**Complexity**: ⭐ EASY
**Institution Types**: Public libraries only
**Completion Date**: 2025-11-17

**Implementation Notes**:
- Excel header in row 3 (not standard row 1)
- Parser already implemented and tested
- Limited coverage (public libraries only, no research/special libraries)
- May need additional source for complete Netherlands coverage

---

### 🇧🇾 Belarus - National Library of Belarus

**Priority**: ⭐ LOW (already extracted)
**URL**: https://nlb.by/en/for-librarians/international-standard-identifier-for-libraries-and-related-organizations-isil/list-of-libraries-organizations-of-the-republic-of-belarus-and-their-isil-codes
**Status**: ✅ **ALREADY EXTRACTED** (167 institutions)

**Available Download**:
- HTML table with all ISIL codes
- **File**: `/data/isil/BY/belarus_isil_all.{csv,json}`
- **Parser**: Already exists from previous session

**Fields Extracted**:
- ISIL code
- Institution name
- Region code (embedded in ISIL: HO, MI, VI, MA, HM, BR, HR)

**Extracted Records**: 167 institutions

**Regional Breakdown** (region codes follow ISO 3166-2:BY):
- HO (Homyel): 29
- MI (Minsk Region): 26
- VI (Vitsyebsk): 25
- MA (Mahilyow): 25
- HM (Minsk City, "Horad Minsk"): 23
- BR (Brest): 20
- HR (Hrodna): 19

**Complexity**: ⭐ EASY
**Institution Types**: Libraries and related organizations
**Completion Date**: 2025-11-17

---

### 🇧🇪 Belgium - Royal Library of Belgium (KBR)

**Priority**: ⭐ LOW (already extracted)
**URL**: http://isil.kbr.be/
**Status**: ✅ **ALREADY EXTRACTED** (438 institutions)
**Extracted Records**: 438 institutions
**Completion Date**: Previous session
**Institution Types**: Libraries, archives, museums

---

### 🇨🇾 Cyprus - Cyprus University of Technology Library

**Priority**: ⭐ LOW
**URL**: http://library.cut.ac.cy/en/node/1194
**Access Method**: Direct list (HTML or downloadable file)
**Estimated Records**: 50-100
institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries and archives

**Implementation Notes**:
- Small dataset, straightforward extraction
- Likely HTML table or PDF list
- Low priority due to small size

---

### 🇱🇺 Luxembourg - Archives nationales de Luxembourg

**Priority**: ⭐ LOW
**URL**: http://www.anlux.public.lu/fr/nous-connaitre/reseaux-internationaux/ISIL.html
**Access Method**: Direct list (HTML)
**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Archives and libraries

**Implementation Notes**:
- Very small dataset
- Likely single HTML page with table
- French language

---

### 🇭🇺 Hungary - National Széchényi Library

**Priority**: ⭐⭐ MEDIUM
**URL**: https://ki.oszk.hu/dokumentumtar/isil-azonositok
**Access Method**: Direct list (likely HTML or downloadable file)
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

**Implementation Notes**:
- Medium-sized dataset
- Hungarian language interface
- Need to verify file format (HTML vs. downloadable)

---

### 🇮🇱 Israel - National Library of Israel

**Priority**: ⭐⭐ MEDIUM
**URL**: http://nli.org.il
**Access Method**: Direct list (location TBD)
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries and archives

**Implementation Notes**:
- Need to locate exact ISIL list URL on NLI website
- Likely Hebrew and English versions
- May require navigation through site structure

---

## Tier 2: JSON/XML APIs (3 Registries)

### 🇩🇪 Germany - Staatsbibliothek zu Berlin (Sigel Database)

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://sigel.staatsbibliothek-berlin.de/
**Access Method**: JSON-API, SRU, Linked Data Service

**API Endpoints**:

1. **JSON-API (Primary)**
   - Base URL: `https://sigel.staatsbibliothek-berlin.de/api/org.jsonld`
   - Search: `/api/org.jsonld?q={search_terms}`
   - Single record: `/api/org/{ISIL}.jsonld`
   - Example: https://sigel.staatsbibliothek-berlin.de/api/org/DE-12.jsonld
2. **SRU (Secondary)**
   - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/sru
   - Standard library catalog protocol
3. **Linked Data Service (RDF)**
   - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/linked-data-service
   - RDF/Turtle format available

**JSON-API Features**:
- **Pagination**: `size` parameter (default: 10, max: TBD)
- **Page selection**: `page` parameter (default: 1)
- **Compact format**: `compact=1` for PICA-Plus records
- **JSONP support**: `callback` parameter for cross-domain requests
- **Search syntax**: `nam=keyword AND ort=city` (field-specific search)
- **URL encoding required**: Use `%3D` for `=`, `%20` for spaces

**Search Fields**:
- `nam` - Institution name
- `ort` - City/Location
- Additional fields documented in API docs

**Response Structure**:

```json
{
  "id": "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld?q=...",
  "type": "Collection",
  "totalItems": 12345,
  "freetextQuery": "search terms",
  "member": [
    {
      "id": "https://sigel.staatsbibliothek-berlin.de/org/DE-12",
      "type": ["Library", "Museum", "Archive", "Other"],
      "identifier": "DE-12",
      "location": "Berlin",
      "addressRegion": "DE-BE",
      "seeAlso": "https://sigel.staatsbibliothek-berlin.de/org/DE-12.rdf",
      "data": { /* PICA-Plus record */ }
    }
  ],
  "view": {
    "id": "current page URL",
    "type": "PartialCollectionView",
    "first": "first page URL",
    "last": "last page URL",
    "next": "next page URL",
    "previous": "previous page URL",
    "totalItems": 10,
    "pageIndex": 1,
    "numberOfPages": 1235,
    "offset": 0,
    "limit": 10
  }
}
```

**Estimated Records**: 10,000+ institutions
**Data Format**: JSON-LD with PICA-Plus embedded
**Required Tools**: `requests`, `json`
**Complexity**: ⭐⭐ MEDIUM (pagination required)
**Institution Types**: Libraries, Archives, Museums, Other
**Rate Limiting**: Unknown (implement polite scraping with delays)

**Implementation Strategy**:
1. Start with broad search (all institutions)
2. Paginate through results using `size=100` (test max size)
3. Extract JSON-LD records
4. Parse PICA-Plus data for additional fields
5. Store in standardized CSV/JSON format
6. Optionally fetch RDF for Linked Data integration

**Implementation Notes**:
- Very comprehensive dataset (~10,000 institutions)
- Well-documented JSON-API (beta status)
- Format change announced for September 2025 (check blog post)
- Linked Data Service provides RDF/Turtle for semantic web integration
- PICA-Plus format requires additional parsing
- Contact: Carsten Klee (carsten.klee@sbb.spk-berlin.de, +49 30 266 434402)

---

### 🇦🇹 Austria - Die Österreichische Bibliothekenverbund und Service GmbH

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.isil.at/
**Access Method**: Searchable interface (check for API)
**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to investigate if API exists
- May require web scraping if no API
- Austrian library network consortium

---

### 🇪🇬 Egypt - ENSTINET

**Priority**: ⭐ LOW
**URL**: http://cdex5.sti.sci.eg/dev60cgi/f60cgi?form=DEGL_WEB_SEARCH&userid=degl/degl@sti
**Access Method**: Search interface (unclear structure)
**Estimated Records**: 100-500 institutions
**Complexity**: ⭐⭐⭐ HARD (interface unclear)
**Institution Types**: Libraries and research centers

**Implementation Notes**:
- Legacy CGI interface (f60cgi)
- May require authentication (userid parameter)
- Unclear if publicly accessible
- Low priority due to complexity

---

## Tier 3: Web Scraping Required (17 Registries)

### 🇺🇸 United States - Library of Congress

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.loc.gov/marc/organizations/
**Access Method**: Search interface + individual record pages
**Search URL**: http://www.loc.gov/marc/organizations/org-search.php

**Structure**:
- **Search interface**: Keyword or code search
- **Individual records**: Separate pages per organization
- **No bulk download**: Must scrape
  search results + detail pages
- **Pagination**: Unknown (test with large result sets)

**Code Structure**:
- **US codes**: Geographic prefix + institution code (e.g., `DLC`, `TxU`, `NNopo`)
- **ISIL format**: `US-{MARC_CODE}` (e.g., `US-DLC` for Library of Congress)
- **Non-US codes**: Also included but with country prefixes

**Fields Available**:
- MARC organization code
- ISIL code (US- prefix added)
- Organization name
- Address (street, city, state, postal code)
- Country (for non-US organizations)
- Status (valid/obsolete)

**Estimated Records**: 37,000+ codes (33,000 valid, 4,000+ invalid/obsolete)
**Geographic Coverage**: Primarily USA, some non-US institutions
**Data Format**: HTML (requires scraping)
**Required Tools**: `requests`, `BeautifulSoup4`, `json`
**Complexity**: ⭐⭐⭐ HARD (pagination + detail pages)
**Institution Types**: Libraries, archives, museums, corporations

**Implementation Strategy**:

1. **Option A**: Paginated search scraping
   - Submit broad search (empty query or "*")
   - Parse result pages
   - Extract codes and basic info
   - Follow links to detail pages for full address
2. **Option B**: Systematic code generation
   - Generate all possible state + code combinations
   - Test each code via search
   - Extract valid records
   - (Likely inefficient - 50,000+ possible combinations)
3. **Option C**: Contact LoC for bulk data
   - Email: ndmso@loc.gov
   - Request: Full MARC organization codes database
   - Justification: Academic research, cultural heritage mapping

**Rate Limiting**:
- No explicit rate limit mentioned
- Implement polite scraping: 1 request per 2 seconds
- Expect 24-48 hours for full extraction (37,000 records)

**Implementation Notes**:
- Largest ISIL registry (37,000+ codes)
- Historical codes preserved (non-unique obsolete codes)
- Non-US institutions included (subset of global coverage)
- MARC format requires understanding of subunit structure
- Contact option available if scraping blocked

**Alternative Agencies**:
- **Canada**: Library and Archives Canada (separate registry)
- **UK**: British Library (offline since October 2023 cyber attack)
- **Germany**: Staatsbibliothek zu Berlin (separate registry)
- **Estonia**: Eesti Rahvusraamatukogu (separate registry)

---

### 🇨🇦 Canada - Library and Archives Canada

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
**Access Method**: Search interface
**Estimated Records**: 2,000-5,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, archives

**Implementation Notes**:
- Bilingual (English/French)
- Search interface scraping required
- May have provincial breakdowns
- Test for bulk export option

---

### 🇦🇺 Australia - National Library of Australia

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.nla.gov.au/apps/ilrs
**Access Method**: Search interface
**Estimated Records**: 1,000-2,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries

**Implementation Notes**:
- ILRS = Interlibrary Loan and Resource Sharing system
- May require Selenium if JavaScript-heavy
- Check for CSV export option

---

### 🇨🇭 Switzerland - Swiss National Library

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.isil.nb.admin.ch/en/
**Access Method**: Searchable interface
**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution
Types**: Libraries, archives

**Implementation Notes**:
- Multilingual (German, French, Italian, English)
- Check for API or bulk download
- Swiss federal government site (.admin.ch domain)

---

### 🇨🇿 Czech Republic - National Library of the Czech Republic

**Priority**: ⭐⭐ MEDIUM
**URL**: https://aleph.nkp.cz/F/?func=file&file_name=find-b&CON_LNG=ENG&local_base=adr
**Access Method**: ALEPH library system search interface
**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐⭐ HARD (ALEPH system)
**Institution Types**: Libraries

**Implementation Notes**:
- ALEPH system (library catalog software)
- Complex URL structure with session management
- May require cookie handling
- English interface available (CON_LNG=ENG)

---

### 🇫🇷 France - Agence Bibliographique de l'Enseignement Superieur (ABES)

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.sudoc.abes.fr/
**Access Method**: SUDOC catalog search
**Status**: ⚠️ URLs in official directory may be outdated (404 errors)
**Alternative**: https://www.abes.fr/ (ABES main site)
**Estimated Records**: 3,000+ institutions (university and research libraries)
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Academic libraries, research centers

**Implementation Notes**:
- SUDOC = Système Universitaire de Documentation
- Covers French higher education libraries
- May have API (check ABES documentation)
- French language interface
- Need to verify correct ISIL list URL

---

### 🇮🇹 Italy - Istituto Centrale per il Catalogo Unico (ICCU)

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://anagrafe.iccu.sbn.it/
**Access Method**: Anagrafe search interface
**Estimated Records**: 5,000+ institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, archives

**Implementation Notes**:
- Anagrafe = registry/directory
- SBN = Servizio Bibliotecario Nazionale (National Library Service)
- Large dataset covering Italian libraries
- May require Italian language parsing

---

### 🇧🇬 Bulgaria - National Library of Bulgaria

**Priority**: ⭐ LOW
**URL**:
http://www.nationallibrary.bg/wp/?page_id=5686
**Access Method**: Search interface
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇫🇮 Finland - National Library of Finland

**Priority**: ⭐⭐ MEDIUM
**URL**: http://isil.kansalliskirjasto.fi/en/
**Access Method**: Search interface
**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- English interface available
- Check for API or bulk download option

---

### 🇭🇷 Croatia - Nacionalna i sveučilišna knjižnica u Zagrebu

**Priority**: ⭐ LOW
**URL**: https://nsk.hr/en/
**Access Method**: ISIL list mentioned but location unclear
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to locate ISIL list on NSK website
- Croatian language primary, English available

---

### 🇳🇴 Norway - National Library of Norway

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.nb.no/baser/bibliotek/english.html
**Access Method**: Search interface
**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇳🇿 New Zealand - National Library of New Zealand

**Priority**: ⭐ LOW
**URL**: http://natlib.govt.nz/
**Access Method**: Search mentioned but location unclear
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to locate ISIL search on NLNZ website

---

### 🇶🇦 Qatar - Qatar National Library

**Priority**: ⭐ LOW
**URL**: http://www.qnl.qa/find-answers/other-libraries/
**Access Method**: Search interface
**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

**Implementation Notes**:
- Small dataset
- English and Arabic interfaces

---

### 🇷🇴 Romania - National Library of Romania

**Priority**: ⭐ LOW
**URL**: https://bibnat.ro/
**Access Method**: Search
interface (location unclear)
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇷🇺 Russia - Russian National Public Library for Science and Technology

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.gpntb.ru/rfid1/searchdb.php
**Access Method**: Search interface
**Estimated Records**: 1,000-2,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, scientific institutions

**Implementation Notes**:
- Russian language interface
- RFID-related URL suggests technical database
- May require Cyrillic text handling

---

### 🇸🇮 Slovenia - National and University Library

**Priority**: ⭐ LOW
**URL**: https://www.nuk.uni-lj.si/eng/node/543
**Access Method**: ISIL list mentioned
**Estimated Records**: 100-200 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

---

### 🇸🇰 Slovakia - Slovak National Library

**Priority**: ⭐ LOW
**URL**: http://www.snk.sk/en/information-for/libraries-and-librarians/isil.html
**Access Method**: ISIL list mentioned
**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

## Tier 4: Contact Required (8 Agencies)

### 🇦🇷 Argentina - IRAM

**Priority**: ⭐ LOW
**URL**: http://www.iram.org.ar/
**Status**: No public ISIL list available
**Estimated Records**: 500-1,000 institutions
**Contact Required**: Yes
**Institution Types**: Libraries, archives

**Implementation Notes**:
- IRAM = Argentine Standardization and Certification Institute
- May provide list upon request
- Spanish language

---

### 🇬🇱 Greenland - Central and Public Library of Greenland

**Priority**: ⭐ LOW
**URL**: https://www.katak.gl/
**Status**: No public ISIL list
**Estimated Records**: 5-10 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇮🇷 Iran - National Library and Archives of Iran

**Priority**: ⭐ LOW
**Status**: No URL provided, no public list
**Estimated Records**: 100-500 institutions
**Contact Required**: Yes
**Institution Types**: Libraries, archives

**Implementation Notes**:
- May require Persian language communication
- Political/access challenges possible

---

### 🇰🇿 Kazakhstan - National Library of Kazakhstan

**Priority**: ⭐ LOW
**URL**: https://www.nlrk.kz/index.php?lang=en
**Status**: No public ISIL list
**Estimated Records**: 100-300 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇲🇩 Moldova - National Library of Moldova

**Priority**: ⭐ LOW
**URL**: http://bnrm.md/
**Status**: No public ISIL list
**Estimated Records**: 50-100 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇳🇵 Nepal

**Priority**: ⭐ LOW
**Status**: No agency listed in official ISIL directory
**Estimated Records**: Unknown
**Contact Required**: Yes - contact Danish Agency for information
**Institution Types**: Unknown

---

### 🇰🇷 South Korea - National Library of Korea

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.nl.go.kr/
**Status**: Main NLK site doesn't list ISIL directory
**Alternative Registry**: https://librarian.nl.go.kr/ (discovered via Exa search)
**Estimated Records**: 1,000-2,000 institutions
**Contact Required**: Investigate librarian.nl.go.kr before contacting
**Institution Types**: Libraries

**Implementation Notes**:
- Secondary registry exists at librarian subdomain
- Need to verify if this is official ISIL registry
- Korean language interface
- May have English option

---

### 🇬🇧 United Kingdom - British Library

**Priority**: ⭐⭐⭐ HIGH (when restored)
**URL**: https://www.bl.uk/help/get-a-marc-organization-code (currently 404)
**Status**: ⚠️ **OFFLINE** - Cyber attack October 2023
**Estimated Records**: 5,000+ institutions
**Contact Required**: Wait for service restoration
**Institution Types**: Libraries, archives, museums

**Implementation Notes**:
- British Library systems offline since October 2023 cyber attack
- ISIL/MARC code directory unavailable
- Monitor BL announcements for restoration
- Largest UK heritage
  collection metadata
- Critical for UK coverage but currently blocked

---

## Non-National Registries (5 Agencies)

### EUR - European Organizations

**Priority**: ⭐ LOW
**URL**: https://www.eui.eu/Documents/Research/HistoricalArchivesofEU/ISIL/isil-directory.pdf
**Access Method**: Direct PDF download
**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: European Union organizations

**Implementation Notes**:
- PDF format requires text extraction
- Covers EU institutional libraries
- Small specialized dataset

---

### GTB - GTBib (Spanish-language interlibrary loan)

**Priority**: ⭐ LOW
**URL**: https://directorio.gtbib.net/lista_isil
**Access Method**: Direct list
**Estimated Records**: 100-300 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Spanish-language libraries

**Implementation Notes**:
- Spanish language
- Interlibrary loan network
- May overlap with national registries (ES, MX, AR, etc.)

---

### O - OCLC (RFID tags)

**Priority**: ⭐ LOW
**URL**: http://www.oclc.org/contacts/libraries/
**Access Method**: Search interface
**Estimated Records**: Special encoding system (not traditional ISIL)
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: OCLC member libraries

**Implementation Notes**:
- Letter "O" used for OCLC RFID technical encoding
- Not a traditional national registry
- Overlaps with US and other national codes

---

### OCLC - OCLC WorldCat Symbol

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.oclc.org/contacts/libraries/
**Access Method**: Search interface
**Estimated Records**: 10,000+ institutions (global)
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: WorldCat member libraries

**Implementation Notes**:
- OCLC WorldCat symbols used globally
- Different from ISIL but related
- Massive dataset, may overlap with national registries
- Useful for cross-referencing and enrichment

---

### ZDB - Zeitschriftendatenbank (German Serials Database)

**Priority**: ⭐⭐ MEDIUM
**URL**: https://zdb-katalog.de/index.xhtml
**Access
Method**: Search interface
**Estimated Records**: 3,000+ institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries holding serials

**Implementation Notes**:
- German serials/journals database
- Managed by Staatsbibliothek zu Berlin
- May have API (check ZDB documentation)
- Overlaps with DE- national codes

---

## Extraction Recommendations

### Phase 1: Quick Wins (Tier 1 Downloads)

**Timeline**: 1-2 days
**Expected Yield**: 7,000-8,000 records

1. **Denmark** - 4 CSV downloads (700 records)
2. **Japan** - 5 CSV/Excel downloads (6,000+ records)
3. **Hungary** - Direct list (200-500 records)
4. **Israel** - Direct list (200-500 records)
5. **Cyprus** - Direct list (50-100 records)
6. **Luxembourg** - Direct list (20-50 records)

**Tools Required**: `pandas`, `requests`, `openpyxl`
**Risk**: Low
**Effort**: Low

---

### Phase 2: Major APIs (Tier 2)

**Timeline**: 3-5 days
**Expected Yield**: 10,000-12,000 records

1. **Germany** - JSON-API (10,000+ records)
2. **Austria** - Search/API (500-1,000 records)

**Tools Required**: `requests`, `json`, pagination logic
**Risk**: Medium (rate limiting, API changes)
**Effort**: Medium

---

### Phase 3: High-Value Scraping (Tier 3 Priority)

**Timeline**: 2-3 weeks
**Expected Yield**: 50,000+ records

1. **USA** - Library of Congress (37,000+ records) - **PRIORITY**
2. **Canada** - Library and Archives Canada (2,000-5,000 records)
3. **France** - SUDOC (3,000+ records)
4. **Italy** - ICCU Anagrafe (5,000+ records)

**Tools Required**: `requests`, `BeautifulSoup4`, `Selenium` (if needed)
**Risk**: High (blocking, CAPTCHA, rate limiting)
**Effort**: High

---

### Phase 4: Remaining Scraping (Tier 3 Secondary)

**Timeline**: 2-4 weeks
**Expected Yield**: 10,000-15,000 records

1. Australia, Switzerland, Czech Republic, Finland, Norway, Bulgaria, Croatia, New Zealand, Qatar, Romania, Russia, Slovenia, Slovakia

**Tools Required**: Same as Phase 3
**Risk**: Medium-High
**Effort**: Medium-High

---

### Phase 5: Contact-Based (Tier 4)

**Timeline**: Ongoing (3-6 months for responses)
**Expected Yield**: 5,000-10,000 records

1. **UK** - Wait for British Library restoration (5,000+ records)
2. **South Korea** - Investigate librarian.nl.go.kr (1,000-2,000 records)
3. Argentina, Greenland, Iran, Kazakhstan, Moldova, Nepal - Contact agencies

**Tools Required**: Email templates, institutional affiliation
**Risk**: High (may not receive data)
**Effort**: Low (but long wait times)

---

## Technical Implementation Guide

### Standard Parser Template

```python
import io
import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup


class ISILRegistryScraper:
    def __init__(self, country_code, registry_name, base_url):
        self.country_code = country_code
        self.registry_name = registry_name
        self.base_url = base_url
        self.headers = {
            'User-Agent': 'GLAM Heritage Institution Scraper/1.0 (Academic Research)'
        }
        self.records = []

    def fetch_csv(self, url):
        """Fetch CSV file from direct download URL"""
        response = requests.get(url, headers=self.headers, timeout=30)
        response.raise_for_status()
        # Pass raw bytes so pandas decodes them as UTF-8 instead of relying
        # on the encoding that requests guesses from the HTTP headers.
        return pd.read_csv(io.BytesIO(response.content))

    def fetch_json_api(self, url, params=None):
        """Fetch a single JSON response (pagination is handled by the caller)"""
        response = requests.get(url, params=params, headers=self.headers, timeout=30)
        response.raise_for_status()
        return response.json()

    def scrape_html_table(self, url):
        """Scrape HTML table from webpage"""
        response = requests.get(url, headers=self.headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Parse table logic here
        return soup

    def standardize_record(self, raw_record):
        """Convert raw record to standard format"""
        return {
            'isil_code': raw_record.get('isil_code'),
            'name': raw_record.get('name'),
            'city': raw_record.get('city'),
            'region': raw_record.get('region'),
            'country': self.country_code,
            'postal_code': raw_record.get('postal_code'),
            'address': raw_record.get('address'),
            'phone': raw_record.get('phone'),
            'email': raw_record.get('email'),
            'url': raw_record.get('url'),
            'institution_type': raw_record.get('type'),
            'registry': self.registry_name,
            'source_url': self.base_url,
            'extraction_date': datetime.now().isoformat()
        }

    def export_csv(self, output_path):
        """Export records to CSV"""
        df = pd.DataFrame(self.records)
        df.to_csv(output_path, index=False, encoding='utf-8')

    def export_json(self, output_path):
        """Export records to JSON with metadata"""
        output = {
            'metadata': {
                'country': self.country_code,
                'registry': self.registry_name,
                'source_url': self.base_url,
                'extraction_date': datetime.now().isoformat(),
                'record_count': len(self.records)
            },
            'records': self.records
        }
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output, f, indent=2, ensure_ascii=False)
```

### Rate Limiting Best Practices

```python
import time
from datetime import datetime


class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None

    def wait(self):
        """Wait if necessary to respect rate limit"""
        if self.last_request:
            elapsed = (datetime.now() - self.last_request).total_seconds()
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = datetime.now()


# Usage (total_pages, fetch_page and process_data are placeholders
# to be supplied by the registry-specific scraper)
limiter = RateLimiter(requests_per_second=1)
for page in range(1, total_pages + 1):
    limiter.wait()
    data = fetch_page(page)
    process_data(data)
```

### Pagination Handler

```python
import time

import requests


def paginate_api(base_url, page_size=100, max_pages=None):
    """Generic pagination handler for APIs"""
    page = 1
    all_records = []
    while True:
        params = {'page': page, 'size': page_size}
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        records = data.get('member', [])
        if not records:
            break
        all_records.extend(records)
        # Check for last page
        view = data.get('view', {})
        if page >= view.get('numberOfPages', page):
            break
        if max_pages and page >= max_pages:
            break
        page += 1
        time.sleep(1)  # Polite scraping
    return all_records
```

---

## Required Python Packages

```bash
# Core dependencies
pip install pandas requests beautifulsoup4 openpyxl lxml

# For complex scraping
pip install selenium webdriver-manager

# For API interactions
pip install urllib3

# For JSON/CSV handling
pip install jsonlines

# For progress tracking
pip install tqdm

# For parallel processing (optional)
pip install joblib
```

---

## Output Format Specification

### CSV Output Schema

```csv
isil_code,name,city,region,country,postal_code,address,phone,email,url,institution_type,registry,source_url,extraction_date
JP-1001234,"Tokyo Metropolitan Library","Tokyo","13","JP","100-0001","1-1 Minami-Osawa, Hachioji-shi","+81-42-123-4567","info@library.metro.tokyo.jp","https://www.library.metro.tokyo.jp","LIBRARY","National Diet Library","https://www.ndl.go.jp/en/library/isil/","2025-11-17T10:30:00Z"
```

### JSON Output Schema

```json
{
  "metadata": {
    "country": "JP",
    "country_name": "Japan",
    "registry": "National Diet Library",
    "registry_agency": "National Diet Library (NDL)",
    "source_url": "https://www.ndl.go.jp/en/library/isil/",
    "extraction_date": "2025-11-17T10:30:00Z",
    "extraction_method": "Direct CSV download",
    "record_count": 6234,
    "institution_types": ["LIBRARY", "MUSEUM", "ARCHIVE"],
    "data_license": "Public Domain (per NDL)",
    "notes": "Includes public libraries, university libraries, museums, and archives"
  },
  "records": [
    {
      "isil_code": "JP-1001234",
      "name": "Tokyo Metropolitan Library",
      "name_japanese": "東京都立図書館",
      "name_kana": "とうきょうとりつとしょかん",
      "city": "Tokyo",
      "region": "13",
      "country": "JP",
      "postal_code": "100-0001",
      "address": "1-1 Minami-Osawa, Hachioji-shi, Tokyo",
      "phone": "+81-42-123-4567",
      "fax": "+81-42-123-4568",
      "email": "info@library.metro.tokyo.jp",
      "url": "https://www.library.metro.tokyo.jp",
      "institution_type": "LIBRARY",
      "registration_date": "2011-10-15",
      "last_update_date": "2024-03-20",
      "registry": "National Diet Library",
      "source_url": "https://www.ndl.go.jp/en/library/isil/",
      "extraction_date": "2025-11-17T10:30:00Z"
    }
  ]
}
```

---

## Quality Assurance Checklist

### Pre-Extraction
- [ ] Verify source URL is accessible
- [ ] Check for API documentation
- [ ] Test sample queries/downloads
- [ ] Review robots.txt and terms of service
- [ ] Implement rate limiting (1 req/sec default)
- [ ] Set up error logging

### During Extraction
- [ ] Monitor for HTTP errors (429, 403, 503)
- [ ] Log failed requests for retry
- [ ] Validate ISIL code format (country code + structure)
- [ ] Check for duplicate records
- [ ] Handle encoding issues (UTF-8)
- [ ] Track progress (records extracted / estimated total)

### Post-Extraction
- [ ] Validate record count against estimate
- [ ] Check for required fields (ISIL code, name, country)
- [ ] Detect malformed ISIL codes
- [ ] Cross-reference with existing datasets
- [ ] Generate extraction report (metadata)
- [ ] Export to CSV and JSON formats
- [ ] Document any issues or anomalies

---

## Data Quality Metrics

### Record Completeness Score

| Field | Weight | Required |
|-------|--------|----------|
| ISIL code | 20% | Yes |
| Name | 20% | Yes |
| City | 15% | No |
| Country | 10% | Yes |
| Postal code | 5% | No |
| Address | 10% | No |
| Phone | 5% | No |
| Email | 5% | No |
| URL | 10% | No |

**Completeness = (Sum of the weights of all present fields) / Total Weight**

### ISIL Code Validation

```python
import re


def validate_isil_code(isil_code, country_code):
    """Validate ISIL code format"""
    # Basic format: CC-XXXXX (country code + dash + institution code).
    # The institution code may contain letters, digits, "-", "/" and ":"
    # (e.g., German museum codes of the form DE-MUS-...).
    pattern = rf'^{re.escape(country_code)}-[A-Za-z0-9:/-]+$'
    if not re.match(pattern, isil_code):
        return False, "Invalid format"
    # Check length (an ISIL is at most 16 characters in total)
    if len(isil_code) < 4 or len(isil_code) > 16:
        return False, "Invalid length"
    return True, "Valid"


# Example
validate_isil_code("JP-1001234", "JP") # (True, "Valid") validate_isil_code("INVALID", "JP") # (False, "Invalid format") ``` --- ## Contact Information for Support ### ISIL Registration Authority (Denmark) - **Organization**: Danish Agency for Culture and Palaces - **Email**: ISIL@slks.dk - **Website**: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil ### Germany (Staatsbibliothek zu Berlin) - **Contact**: Carsten Klee - **Email**: carsten.klee@sbb.spk-berlin.de - **Phone**: +49 30 266 434402 - **Topic**: JSON-API questions, ISIL allocation ### USA (Library of Congress) - **Organization**: Network Development and MARC Standards Office - **Email**: ndmso@loc.gov - **Fax**: +1-202-707-0115 - **Mailing Address**: 101 Independence Avenue, S.E., Washington, D.C. 20540-4402 --- ## Next Steps for Implementation ### Immediate Actions (Week 1) 1. ✅ Complete reconnaissance (DONE) 2. Create extraction scripts for Denmark (Tier 1) 3. Create extraction scripts for Japan (Tier 1) 4. Test Germany JSON-API (Tier 2) 5. Document findings and update inventory ### Short-term (Weeks 2-4) 1. Extract all Tier 1 registries (7,000-8,000 records) 2. Implement Germany API scraper (10,000 records) 3. Begin USA Library of Congress scraper (37,000 records) 4. Set up quality assurance pipeline 5. Create unified dataset (CSV + JSON) ### Medium-term (Months 2-3) 1. Complete Tier 2 and high-priority Tier 3 scraping 2. Cross-reference with existing GLAM datasets (Wikidata, OSM) 3. Enrich records with geocoding (lat/lon) 4. Generate statistics and visualizations 5. Publish dataset and documentation ### Long-term (Months 4-6) 1. Contact Tier 4 agencies for bulk data 2. Monitor UK British Library for restoration 3. Implement automated update pipeline 4. Create public API for GLAM dataset 5. 
Integrate with LinkML schema and RDF export --- ## Change Log | Version | Date | Changes | |---------|------|---------| | 1.0 | 2025-11-17 | Initial scraper inventory created | --- ## References - **ISO 15511:2019**: International Standard Identifier for Libraries and Related Organizations - **ISIL Registration Authority**: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil - **GLAM Project Documentation**: `/docs/AGENTS.md`, `/data/isil/GLOBAL_ISIL_AGENCIES_OFFICIAL.md` - **Session Summary**: `/docs/sessions/SESSION_2025-11-17_ISIL_GLOBAL_MAPPING.md` --- **End of Document**