1403 lines
41 KiB
Markdown
# Global ISIL Registry Scraper Inventory

**Version**: 1.0
**Date**: 2025-11-17
**Purpose**: Technical inventory of data extraction methods for global ISIL registries
**Status**: Reconnaissance complete, extraction pending

---

## Overview

This document provides detailed technical specifications for extracting heritage institution data from 46 ISIL registries worldwide. For each registry, we document:
- Access method (direct download, API, web scraping)
- Data format and structure
- Estimated record count
- Required tools and complexity
- Priority tier for extraction
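
The per-registry attributes listed above could be captured in a small record type so that the inventory is machine-readable. The following is only a sketch: the class name, field names, and enum values are illustrative assumptions, not an existing schema in this project. The Denmark values are taken from its entry in this document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AccessMethod(Enum):
    DIRECT_DOWNLOAD = "direct download"
    API = "api"
    WEB_SCRAPING = "web scraping"
    CONTACT_REQUIRED = "contact required"


@dataclass
class RegistryEntry:
    """One inventory row; field names are hypothetical, chosen for illustration."""
    country: str
    agency: str
    access_method: AccessMethod
    data_format: str
    estimated_min: Optional[int]          # lower bound of the record estimate
    estimated_max: Optional[int] = None   # None when the estimate is open-ended ("700+")
    required_tools: List[str] = field(default_factory=list)
    complexity_stars: int = 1             # 1 = easy, 3 = hard
    priority_stars: int = 1               # 1 = low, 3 = high
    tier: int = 1                         # 1-4, matching the tiers in this document


denmark = RegistryEntry(
    country="DK",
    agency="Danish Agency for Culture and Palaces",
    access_method=AccessMethod.DIRECT_DOWNLOAD,
    data_format="CSV (UTF-8)",
    estimated_min=700,
    required_tools=["pandas", "requests"],
    complexity_stars=1,
    priority_stars=3,
    tier=1,
)
print(denmark.access_method.value, denmark.estimated_min)
```
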

**Total Estimated Records**: 50,000-100,000+ institutions globally

---

## Extraction Priority Tiers

### Tier 1: Direct Downloads (Easy Wins)
Large datasets with CSV/Excel/JSON downloads requiring minimal processing.

### Tier 2: JSON/XML APIs
Programmatic access via REST APIs, SPARQL endpoints, or SRU interfaces.

### Tier 3: Web Scraping Required
HTML-based search interfaces requiring BeautifulSoup/Selenium.

### Tier 4: Contact Required
No public access; requires direct contact with registry agency.

---

## Tier 1: Direct Downloads (9 Registries)

### 🇩🇰 Denmark - Danish Agency for Culture and Palaces

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://vip.dbc.dk/librarylists/
**Access Method**: Direct CSV download (4 separate files)

**Available Downloads**:
1. **Folkebiblioteker (væsener)** - Public libraries (main branches)
   - Download: https://vip.dbc.dk/librarylists/publiclibraries/download
   - Format: CSV

2. **Folkebiblioteker (alle betjeningssteder)** - Public libraries (all service points)
   - Download: https://vip.dbc.dk/librarylists/publicbranches/download
   - Format: CSV

3. **Forskningsbiblioteker (væsener)** - Research libraries (main branches)
   - Download: https://vip.dbc.dk/librarylists/ffulibraries/download
   - Format: CSV

4. **Forskningsbiblioteker (alle betjeningssteder)** - Research libraries (all service points)
   - Download: https://vip.dbc.dk/librarylists/ffubranches/download
   - Format: CSV

**Fields Available**:
- Biblioteksnummer (Library number/ISIL)
- Navn (Name)
- Adresse (Address)
- Postnummer (Postal code)
- By (City)
- Telefon (Phone)
- Email
- Leder (Director/Manager)

**Estimated Records**: 700+ institutions
**Data Format**: CSV (UTF-8)
**Required Tools**: `pandas`, `requests`
**Complexity**: ⭐ EASY
**Institution Types**: Libraries only

**Implementation Notes**:
- 4 separate CSV files need to be merged
- May contain duplicate entries (main branches vs. service points)
- Danish text fields (no English translation)
- Direct downloads, no rate limiting concerns
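
The merge-and-deduplicate step noted above might look roughly like the sketch below. It uses only the standard library and works on already-downloaded CSV text. The column names `Biblioteksnummer`, `Navn`, and `By` are assumed to appear verbatim in the file headers, the delimiter is an assumption that should be checked against the real files, and the sample rows are invented placeholders rather than real registry data.

```python
import csv
import io


def merge_library_lists(csv_texts, key="Biblioteksnummer", delimiter=","):
    """Merge several CSV exports, keeping the first row seen per library number."""
    merged = {}
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        for row in reader:
            number = (row.get(key) or "").strip()
            # Skip rows without a key; keep the earliest occurrence of each number
            if number and number not in merged:
                merged[number] = row
    return list(merged.values())


# Invented sample rows standing in for two of the four downloads
main_branches = "Biblioteksnummer,Navn,By\n100001,Example Main Library,Aarhus\n"
service_points = (
    "Biblioteksnummer,Navn,By\n"
    "100001,Example Main Library (service point row),Aarhus\n"
    "100002,Example Branch Library,Aarhus\n"
)

# Main-branch files are passed first so their rows win when a number repeats
records = merge_library_lists([main_branches, service_points])
print(len(records))
```
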

---

### 🇯🇵 Japan - National Diet Library

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.ndl.go.jp/en/library/isil/index.html#isillist
**Access Method**: Direct CSV/Excel download (5 datasets)

**Available Downloads** (Updated 2025-11-05):

1. **Public Libraries**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.csv (1.1MB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_public_20251105_E.xlsx (492KB)
   - Estimated: 2,500-3,000 records

2. **NDL, University Libraries, Special Libraries**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.csv (554KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_libraries_20251105_E.xlsx (288KB)
   - Estimated: 1,500-2,000 records

3. **Museums**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.csv (789KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_museums_20251105_E.xlsx (420KB)
   - Estimated: 2,000-2,500 records

4. **Archives**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.csv (20KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_archives_20251105_E.xlsx (20KB)
   - Estimated: 50-100 records

5. **Deleted ISIL**
   - CSV: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.csv (18KB)
   - Excel: https://www.ndl.go.jp/en/library/isil/__icsFiles/afieldfile/2025/11/05/isil_deleted_20251105_E.xlsx (19KB)
   - Estimated: 100-200 records

**ISIL Structure**:
- Country code: `JP-`
- Type identifier: `1` (Library), `2` (Museum), `3` (Archive)
- Institution ID: 6 digits (e.g., `JP-1123456`, `JP-2234567`, `JP-3345678`)
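
The structure described above (country prefix, one type digit, six-digit institution ID) can be checked with a short regular expression. This is a minimal sketch; the function name and the returned dictionary keys are illustrative choices, and the pattern assumes no variants beyond what is listed here.

```python
import re

# JP- prefix, one type digit (1-3), then a six-digit institution ID
JP_ISIL_RE = re.compile(r"^JP-([123])(\d{6})$")
TYPE_BY_DIGIT = {"1": "Library", "2": "Museum", "3": "Archive"}


def parse_jp_isil(isil):
    """Split a Japanese ISIL into its parts; return None if it does not match."""
    match = JP_ISIL_RE.match(isil.strip())
    if match is None:
        return None
    type_digit, institution_id = match.groups()
    return {
        "isil": match.group(0),
        "institution_type": TYPE_BY_DIGIT[type_digit],
        "institution_id": institution_id,
    }


for code in ("JP-1123456", "JP-2234567", "JP-3345678", "JP-9000000"):
    print(code, parse_jp_isil(code))
```

Codes that fail the pattern return `None`, which makes it easy to log unexpected formats instead of silently misclassifying them.
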

**Fields Available**:
- ISIL
- Institution name (English)
- Institution name (Japanese)
- Institution name (Japanese kana)
- Postal code
- Address
- Telephone
- Fax
- URL
- Registration date (since April 2020)
- Last update date (since April 2020)

**Estimated Total Records**: 6,000+ institutions
**Data Format**: CSV and Excel (choose one)
**Required Tools**: `pandas`, `requests`, `openpyxl` (for Excel)
**Complexity**: ⭐ EASY
**Institution Types**: Libraries, Museums, Archives
**Update Frequency**: Regular (files dated 2025-11-05)

**Implementation Notes**:
- Very recent data (November 2025)
- English field names available
- 5 separate files need to be merged
- Deleted ISIL records useful for tracking institutional changes
- Public domain license (free to use)

---

### 🇳🇱 Netherlands - Koninklijke Bibliotheek (KB)

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.bibliotheeknetwerk.nl/landelijke-digitale-infrastructuur-ldi/isil-codes
**Access Method**: Direct Excel download

**Status**: ✅ **ALREADY EXTRACTED** (153 public libraries)

**Available Download**:
- **Public Libraries List**: Excel file from bibliotheeknetwerk.nl
- **File**: `/data/isil/NL/kb_netherlands_public_libraries.{csv,json}`
- **Parser**: `/scripts/scrapers/parse_kb_netherlands_isil.py`

**Fields Available**:
- Instellingsnaam (Institution name)
- Plaatsnaam (City name)
- ISIL code

**Extracted Records**: 153 institutions
**Data Format**: Excel (with header in row 3)
**Complexity**: ⭐ EASY
**Institution Types**: Public libraries only
**Completion Date**: 2025-11-17

**Implementation Notes**:
- Excel header in row 3 (not standard row 1)
- Parser already implemented and tested
- Limited coverage (public libraries only, no research/special libraries)
- May need additional source for complete Netherlands coverage
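
To illustrate the row-3 header quirk noted above, here is a small standard-library sketch. It is not the existing parser at `/scripts/scrapers/parse_kb_netherlands_isil.py`. It assumes the sheet has already been read into a sequence of row tuples (for example via `openpyxl`'s `iter_rows(values_only=True)`), and the sample rows, including the placeholder ISIL, are invented. With `pandas`, the same effect is typically achieved by passing `header=2` to `read_excel`.

```python
def rows_to_records(rows, header_row=3):
    """Build dict records from sheet rows whose header sits on a 1-based row number."""
    rows = list(rows)
    header = [str(cell).strip() for cell in rows[header_row - 1]]
    records = []
    for row in rows[header_row:]:
        # Skip fully empty rows, which spreadsheets often contain as padding
        if all(cell is None or str(cell).strip() == "" for cell in row):
            continue
        records.append(dict(zip(header, row)))
    return records


# Invented rows mimicking a sheet with two preamble rows before the header
sample_rows = [
    ("Example sheet title", None, None),
    (None, None, None),
    ("Instellingsnaam", "Plaatsnaam", "ISIL code"),
    ("Voorbeeld Bibliotheek", "Utrecht", "NL-0000000000"),
    (None, None, None),
]

records = rows_to_records(sample_rows)
print(records)
```
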

---

### 🇧🇾 Belarus - National Library of Belarus

**Priority**: ⭐ LOW (already extracted)
**URL**: https://nlb.by/en/for-librarians/international-standard-identifier-for-libraries-and-related-organizations-isil/list-of-libraries-organizations-of-the-republic-of-belarus-and-their-isil-codes

**Status**: ✅ **ALREADY EXTRACTED** (167 institutions)

**Available Download**:
- HTML table with all ISIL codes
- **File**: `/data/isil/BY/belarus_isil_all.{csv,json}`
- **Parser**: Already exists from previous session

**Fields Extracted**:
- ISIL code
- Institution name
- Region code (embedded in ISIL: HO, MI, VI, MA, HM, BR, HR)

**Extracted Records**: 167 institutions
**Regional Breakdown**:
- HO (Homyel): 29
- MI (Minsk Region): 26
- VI (Vitsyebsk): 25
- MA (Mahilyow): 25
- HM (Horad Minsk, the city of Minsk): 23
- BR (Brest): 20
- HR (Hrodna): 19
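
A breakdown like the one above can be recomputed from the extracted codes. The sketch below assumes the two-letter region code follows the `BY-` prefix directly (for example `BY-HM0000`); that layout is an assumption to verify against `/data/isil/BY/belarus_isil_all.csv`, and the sample codes are placeholders. The `REPORTED` dictionary simply restates the counts listed in this document so their total can be checked against the 167 extracted records.

```python
import re
from collections import Counter

# Assumed layout: "BY-" followed by a two-letter region code
REGION_RE = re.compile(r"^BY-([A-Z]{2})")


def count_by_region(isils):
    """Tally ISIL codes by embedded region code; unmatched codes go to UNKNOWN."""
    counts = Counter()
    for isil in isils:
        match = REGION_RE.match(isil.strip().upper())
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts


# Placeholder codes, not real registry entries
sample = ["BY-HM0000", "BY-HM0001", "BY-BR0000", "not-an-isil"]
counts = count_by_region(sample)
print(dict(counts))

# Counts as reported in this inventory; the total should equal 167
REPORTED = {"HO": 29, "MI": 26, "VI": 25, "MA": 25, "HM": 23, "BR": 20, "HR": 19}
print(sum(REPORTED.values()))
```
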

**Complexity**: ⭐ EASY
**Institution Types**: Libraries and related organizations
**Completion Date**: 2025-11-17

---

### 🇧🇪 Belgium - Royal Library of Belgium (KBR)

**Priority**: ⭐ LOW (already extracted)
**URL**: http://isil.kbr.be/

**Status**: ✅ **ALREADY EXTRACTED** (438 institutions)

**Extracted Records**: 438 institutions
**Completion Date**: Previous session
**Institution Types**: Libraries, archives, museums

---

### 🇨🇾 Cyprus - Cyprus University of Technology Library

**Priority**: ⭐ LOW
**URL**: http://library.cut.ac.cy/en/node/1194
**Access Method**: Direct list (HTML or downloadable file)

**Estimated Records**: 50-100 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries and archives

**Implementation Notes**:
- Small dataset, straightforward extraction
- Likely HTML table or PDF list
- Low priority due to small size

---

### 🇱🇺 Luxembourg - Archives nationales de Luxembourg

**Priority**: ⭐ LOW
**URL**: http://www.anlux.public.lu/fr/nous-connaitre/reseaux-internationaux/ISIL.html
**Access Method**: Direct list (HTML)

**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Archives and libraries

**Implementation Notes**:
- Very small dataset
- Likely single HTML page with table
- French language

---

### 🇭🇺 Hungary - National Széchényi Library

**Priority**: ⭐⭐ MEDIUM
**URL**: https://ki.oszk.hu/dokumentumtar/isil-azonositok
**Access Method**: Direct list (likely HTML or downloadable file)

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

**Implementation Notes**:
- Medium-sized dataset
- Hungarian language interface
- Need to verify file format (HTML vs. downloadable)

---

### 🇮🇱 Israel - National Library of Israel

**Priority**: ⭐⭐ MEDIUM
**URL**: http://nli.org.il
**Access Method**: Direct list (location TBD)

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries and archives

**Implementation Notes**:
- Need to locate exact ISIL list URL on NLI website
- Likely Hebrew and English versions
- May require navigation through site structure

---

## Tier 2: JSON/XML APIs (3 Registries)

### 🇩🇪 Germany - Staatsbibliothek zu Berlin (Sigel Database)

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://sigel.staatsbibliothek-berlin.de/
**Access Method**: JSON-API, SRU, Linked Data Service

**API Endpoints**:

1. **JSON-API (Primary)**
   - Base URL: `https://sigel.staatsbibliothek-berlin.de/api/org.jsonld`
   - Search: `/api/org.jsonld?q={search_terms}`
   - Single record: `/api/org/{ISIL}.jsonld`
   - Example: https://sigel.staatsbibliothek-berlin.de/api/org/DE-12.jsonld

2. **SRU (Secondary)**
   - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/sru
   - Standard library catalog protocol

3. **Linked Data Service (RDF)**
   - Documentation: https://sigel.staatsbibliothek-berlin.de/schnittstellen/api/linked-data-service
   - RDF/Turtle format available

**JSON-API Features**:
- **Pagination**: `size` parameter (default: 10, max: TBD)
- **Page selection**: `page` parameter (default: 1)
- **Compact format**: `compact=1` for PICA-Plus records
- **JSONP support**: `callback` parameter for cross-domain requests
- **Search syntax**: `nam=keyword AND ort=city` (field-specific search)
- **URL encoding required**: Use `%3D` for `=`, `%20` for spaces
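
The encoding rule above is easy to get wrong by hand, so a helper that builds the search URL may be useful. This is a sketch using only `urllib.parse`; the function name is hypothetical, and `size=100` in the usage line is an untested value since the maximum page size is still TBD.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld"


def build_search_url(query, size=10, page=1):
    """Return a search URL with '=' encoded as %3D and spaces as %20."""
    params = {"q": query, "size": size, "page": page}
    # quote_via=quote yields %20 for spaces (the default quote_plus would emit '+')
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = build_search_url("nam=Staatsbibliothek AND ort=Berlin", size=100)
print(url)
```

Passing `quote_via=quote` matters here: the default for `urlencode` encodes spaces as `+`, which does not match the `%20` convention stated above.
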

**Search Fields**:
- `nam` - Institution name
- `ort` - City/Location
- Additional fields documented in API docs

**Response Structure**:
```json
{
  "id": "https://sigel.staatsbibliothek-berlin.de/api/org.jsonld?q=...",
  "type": "Collection",
  "totalItems": 12345,
  "freetextQuery": "search terms",
  "member": [
    {
      "id": "https://sigel.staatsbibliothek-berlin.de/org/DE-12",
      "type": ["Library", "Museum", "Archive", "Other"],
      "identifier": "DE-12",
      "location": "Berlin",
      "addressRegion": "DE-BE",
      "seeAlso": "https://sigel.staatsbibliothek-berlin.de/org/DE-12.rdf",
      "data": { /* PICA-Plus record */ }
    }
  ],
  "view": {
    "id": "current page URL",
    "type": "PartialCollectionView",
    "first": "first page URL",
    "last": "last page URL",
    "next": "next page URL",
    "previous": "previous page URL",
    "totalItems": 10,
    "pageIndex": 1,
    "numberOfPages": 1235,
    "offset": 0,
    "limit": 10
  }
}
```
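
Given a response shaped like the structure above, the flat fields of each `member` can be pulled out as follows. This is a sketch that relies only on the keys shown in this document; the output key names are illustrative, the sample page is a trimmed stand-in rather than a real API response, and the PICA-Plus `data` payload is deliberately left unparsed.

```python
def extract_members(page):
    """Flatten the 'member' entries of one response page into simple dicts."""
    records = []
    for member in page.get("member", []):
        types = member.get("type", [])
        if isinstance(types, str):  # tolerate a single type given as a string
            types = [types]
        records.append({
            "isil": member.get("identifier"),
            "types": list(types),
            "city": member.get("location"),
            "region": member.get("addressRegion"),
            "rdf_url": member.get("seeAlso"),
        })
    return records


# Trimmed stand-in for one page, using only keys documented above
sample_page = {
    "type": "Collection",
    "totalItems": 1,
    "member": [
        {
            "identifier": "DE-12",
            "type": ["Library"],
            "location": "Berlin",
            "addressRegion": "DE-BE",
            "seeAlso": "https://sigel.staatsbibliothek-berlin.de/org/DE-12.rdf",
            "data": {},
        }
    ],
}

rows = extract_members(sample_page)
print(rows)
```
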

**Estimated Records**: 10,000+ institutions
**Data Format**: JSON-LD with PICA-Plus embedded
**Required Tools**: `requests`, `json`
**Complexity**: ⭐⭐ MEDIUM (pagination required)
**Institution Types**: Libraries, Archives, Museums, Other
**Rate Limiting**: Unknown (implement polite scraping with delays)

**Implementation Strategy**:
1. Start with broad search (all institutions)
2. Paginate through results using `size=100` (test max size)
3. Extract JSON-LD records
4. Parse PICA-Plus data for additional fields
5. Store in standardized CSV/JSON format
6. Optionally fetch RDF for Linked Data integration
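
Steps 1 to 3 of the strategy above could be sketched as a generator that follows the `view.next` link until it runs out. The fetch function is injected so the loop can be exercised without network access; in real use it might be something like `lambda u: requests.get(u, timeout=30).json()`. The function name, the one-second default delay, and the fake two-page dataset are all assumptions made for illustration.

```python
import time


def iter_members(fetch_json, start_url, delay_seconds=1.0, max_pages=None):
    """Yield every 'member' record, following view.next links page by page."""
    url = start_url
    pages_seen = 0
    while url:
        page = fetch_json(url)
        for member in page.get("member", []):
            yield member
        pages_seen += 1
        if max_pages is not None and pages_seen >= max_pages:
            break
        next_url = page.get("view", {}).get("next")
        if not next_url or next_url == url:  # stop on the last page or a self-link
            break
        url = next_url
        time.sleep(delay_seconds)  # polite delay between requests


# Fake two-page API used only to demonstrate the loop
FAKE_PAGES = {
    "page-1": {"member": [{"identifier": "DE-1"}], "view": {"next": "page-2"}},
    "page-2": {"member": [{"identifier": "DE-2"}], "view": {}},
}

isils = [m["identifier"] for m in iter_members(FAKE_PAGES.__getitem__, "page-1", delay_seconds=0)]
print(isils)
```

The `max_pages` guard is handy while probing the maximum `size` value, since it caps a test run at a handful of requests.
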

**Implementation Notes**:
- Very comprehensive dataset (~10,000 institutions)
- Well-documented JSON-API (beta status)
- Format change announced for September 2025 (check blog post)
- Linked Data Service provides RDF/Turtle for semantic web integration
- PICA-Plus format requires additional parsing
- Contact: Carsten Klee (carsten.klee@sbb.spk-berlin.de, +49 30 266 434402)

---

### 🇦🇹 Austria - Die Österreichische Bibliothekenverbund und Service GmbH

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.isil.at/
**Access Method**: Searchable interface (check for API)

**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to investigate if API exists
- May require web scraping if no API
- Austrian library network consortium

---

### 🇪🇬 Egypt - ENSTINET

**Priority**: ⭐ LOW
**URL**: http://cdex5.sti.sci.eg/dev60cgi/f60cgi?form=DEGL_WEB_SEARCH&userid=degl/degl@sti
**Access Method**: Search interface (unclear structure)

**Estimated Records**: 100-500 institutions
**Complexity**: ⭐⭐⭐ HARD (interface unclear)
**Institution Types**: Libraries and research centers

**Implementation Notes**:
- Legacy CGI interface (f60cgi)
- May require authentication (userid parameter)
- Unclear if publicly accessible
- Low priority due to complexity

---

## Tier 3: Web Scraping Required (17 Registries)

### 🇺🇸 United States - Library of Congress

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.loc.gov/marc/organizations/
**Access Method**: Search interface + individual record pages

**Search URL**: http://www.loc.gov/marc/organizations/org-search.php

**Structure**:
- **Search interface**: Keyword or code search
- **Individual records**: Separate pages per organization
- **No bulk download**: Must scrape search results + detail pages
- **Pagination**: Unknown (test with large result sets)

**Code Structure**:
- **US codes**: Geographic prefix + institution code (e.g., `DLC`, `TxU`, `NNopo`)
- **ISIL format**: `US-{MARC_CODE}` (e.g., `US-DLC` for Library of Congress)
- **Non-US codes**: Also included but with country prefixes
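
The `US-{MARC_CODE}` mapping above is simple enough to express directly. The sketch below covers US codes only; how non-US codes with their own country prefixes should be handled is not specified here, so they are left out. Function names are illustrative, and the original letter case of the MARC code is preserved rather than normalized.

```python
def marc_code_to_isil(marc_code):
    """Prefix a US MARC organization code with 'US-' to form its ISIL."""
    code = marc_code.strip()
    if not code:
        raise ValueError("empty MARC organization code")
    return f"US-{code}"


def isil_to_marc_code(isil):
    """Strip the 'US-' prefix; return None for codes that are not US ISILs."""
    isil = isil.strip()
    if not isil.startswith("US-") or len(isil) == 3:
        return None
    return isil[3:]


for code in ("DLC", "TxU", "NNopo"):
    print(code, "->", marc_code_to_isil(code))
```
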

**Fields Available**:
- MARC organization code
- ISIL code (US- prefix added)
- Organization name
- Address (street, city, state, postal code)
- Country (for non-US organizations)
- Status (valid/obsolete)

**Estimated Records**: 37,000+ codes (33,000 valid, 4,000+ invalid/obsolete)
**Geographic Coverage**: Primarily USA, some non-US institutions
**Data Format**: HTML (requires scraping)
**Required Tools**: `requests`, `BeautifulSoup4`, `json`
**Complexity**: ⭐⭐⭐ HARD (pagination + detail pages)
**Institution Types**: Libraries, archives, museums, corporations

**Implementation Strategy**:
1. **Option A**: Paginated search scraping
   - Submit broad search (empty query or "*")
   - Parse result pages
   - Extract codes and basic info
   - Follow links to detail pages for full address

2. **Option B**: Systematic code generation
   - Generate all possible state + code combinations
   - Test each code via search
   - Extract valid records
   - (Likely inefficient - 50,000+ possible combinations)

3. **Option C**: Contact LoC for bulk data
   - Email: ndmso@loc.gov
   - Request: Full MARC organization codes database
   - Justification: Academic research, cultural heritage mapping

**Rate Limiting**:
- No explicit rate limit mentioned
- Implement polite scraping: 1 request per 2 seconds
- Expect 24-48 hours for full extraction (37,000 records)
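
As a rough check on the figures above: 37,000 requests at one request per 2 seconds take about 20.6 hours, and roughly double that if each record needs both a search hit and a detail page, which is in line with the 24-48 hour range. The sketch below shows that arithmetic together with a minimal throttle. The class name and its injectable `clock`/`sleep` parameters are illustrative choices, not an existing utility in this project.

```python
import time


def estimated_hours(request_count, seconds_per_request=2.0):
    """Wall-clock hours needed at a fixed request interval."""
    return request_count * seconds_per_request / 3600


class Throttle:
    """Ensure at least min_interval seconds pass between successive calls to wait()."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


print(round(estimated_hours(37_000), 1))      # detail pages only: 20.6
print(round(estimated_hours(2 * 37_000), 1))  # search + detail pages: 41.1
```

In a scraper loop, `throttle.wait()` would be called immediately before each HTTP request.
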

**Implementation Notes**:
- Largest ISIL registry (37,000+ codes)
- Historical codes preserved (non-unique obsolete codes)
- Non-US institutions included (subset of global coverage)
- MARC format requires understanding of subunit structure
- Contact option available if scraping blocked

**Alternative Agencies**:
- **Canada**: Library and Archives Canada (separate registry)
- **UK**: British Library (offline since October 2023 cyber attack)
- **Germany**: Staatsbibliothek zu Berlin (separate registry)
- **Estonia**: Eesti Rahvusraamatukogu (separate registry)

---

### 🇨🇦 Canada - Library and Archives Canada

**Priority**: ⭐⭐⭐ HIGH
**URL**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
**Access Method**: Search interface

**Estimated Records**: 2,000-5,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, archives

**Implementation Notes**:
- Bilingual (English/French)
- Search interface scraping required
- May have provincial breakdowns
- Test for bulk export option

---

### 🇦🇺 Australia - National Library of Australia

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.nla.gov.au/apps/ilrs
**Access Method**: Search interface

**Estimated Records**: 1,000-2,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries

**Implementation Notes**:
- ILRS = Australian Interlibrary Resource Sharing Directory
- May require Selenium if JavaScript-heavy
- Check for CSV export option

---

### 🇨🇭 Switzerland - Swiss National Library

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.isil.nb.admin.ch/en/
**Access Method**: Searchable interface

**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries, archives

**Implementation Notes**:
- Multilingual (German, French, Italian, English)
- Check for API or bulk download
- Swiss federal government site (.admin.ch domain)

---

### 🇨🇿 Czech Republic - National Library of the Czech Republic

**Priority**: ⭐⭐ MEDIUM
**URL**: https://aleph.nkp.cz/F/?func=file&file_name=find-b&CON_LNG=ENG&local_base=adr
**Access Method**: ALEPH library system search interface

**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐⭐ HARD (ALEPH system)
**Institution Types**: Libraries

**Implementation Notes**:
- ALEPH system (library catalog software)
- Complex URL structure with session management
- May require cookie handling
- English interface available (CON_LNG=ENG)

---

### 🇫🇷 France - Agence Bibliographique de l'Enseignement Supérieur (ABES)

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://www.sudoc.abes.fr/
**Access Method**: SUDOC catalog search

**Status**: ⚠️ URLs in official directory may be outdated (404 errors)
**Alternative**: https://www.abes.fr/ (ABES main site)

**Estimated Records**: 3,000+ institutions (university and research libraries)
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Academic libraries, research centers

**Implementation Notes**:
- SUDOC = Système Universitaire de Documentation
- Covers French higher education libraries
- May have API (check ABES documentation)
- French language interface
- Need to verify correct ISIL list URL

---

### 🇮🇹 Italy - Istituto Centrale per il Catalogo Unico (ICCU)

**Priority**: ⭐⭐⭐ HIGH
**URL**: http://anagrafe.iccu.sbn.it/
**Access Method**: Anagrafe search interface

**Estimated Records**: 5,000+ institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, archives

**Implementation Notes**:
- Anagrafe = registry/directory
- SBN = Servizio Bibliotecario Nazionale (National Library Service)
- Large dataset covering Italian libraries
- May require Italian language parsing

---

### 🇧🇬 Bulgaria - National Library of Bulgaria

**Priority**: ⭐ LOW
**URL**: http://www.nationallibrary.bg/wp/?page_id=5686
**Access Method**: Search interface

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇫🇮 Finland - National Library of Finland

**Priority**: ⭐⭐ MEDIUM
**URL**: http://isil.kansalliskirjasto.fi/en/
**Access Method**: Search interface

**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- English interface available
- Check for API or bulk download option

---

### 🇭🇷 Croatia - Nacionalna i sveučilišna knjižnica u Zagrebu

**Priority**: ⭐ LOW
**URL**: https://nsk.hr/en/
**Access Method**: ISIL list mentioned but location unclear

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to locate ISIL list on NSK website
- Croatian language primary, English available

---

### 🇳🇴 Norway - National Library of Norway

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.nb.no/baser/bibliotek/english.html
**Access Method**: Search interface

**Estimated Records**: 500-1,000 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇳🇿 New Zealand - National Library of New Zealand

**Priority**: ⭐ LOW
**URL**: http://natlib.govt.nz/
**Access Method**: Search mentioned but location unclear

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

**Implementation Notes**:
- Need to locate ISIL search on NLNZ website

---

### 🇶🇦 Qatar - Qatar National Library

**Priority**: ⭐ LOW
**URL**: http://www.qnl.qa/find-answers/other-libraries/
**Access Method**: Search interface

**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

**Implementation Notes**:
- Small dataset
- English and Arabic interfaces

---

### 🇷🇴 Romania - National Library of Romania

**Priority**: ⭐ LOW
**URL**: https://bibnat.ro/
**Access Method**: Search interface (location unclear)

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

### 🇷🇺 Russia - Russian National Public Library for Science and Technology

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.gpntb.ru/rfid1/searchdb.php
**Access Method**: Search interface

**Estimated Records**: 1,000-2,000 institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries, scientific institutions

**Implementation Notes**:
- Russian language interface
- RFID-related URL suggests technical database
- May require Cyrillic text handling

---

### 🇸🇮 Slovenia - National and University Library

**Priority**: ⭐ LOW
**URL**: https://www.nuk.uni-lj.si/eng/node/543
**Access Method**: ISIL list mentioned

**Estimated Records**: 100-200 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Libraries

---

### 🇸🇰 Slovakia - Slovak National Library

**Priority**: ⭐ LOW
**URL**: http://www.snk.sk/en/information-for/libraries-and-librarians/isil.html
**Access Method**: ISIL list mentioned

**Estimated Records**: 200-500 institutions
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: Libraries

---

## Tier 4: Contact Required (8 Agencies)

### 🇦🇷 Argentina - IRAM

**Priority**: ⭐ LOW
**URL**: http://www.iram.org.ar/
**Status**: No public ISIL list available

**Estimated Records**: 500-1,000 institutions
**Contact Required**: Yes
**Institution Types**: Libraries, archives

**Implementation Notes**:
- IRAM = Argentine Standardization and Certification Institute
- May provide list upon request
- Spanish language

---

### 🇬🇱 Greenland - Central and Public Library of Greenland

**Priority**: ⭐ LOW
**URL**: https://www.katak.gl/
**Status**: No public ISIL list

**Estimated Records**: 5-10 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇮🇷 Iran - National Library and Archives of Iran

**Priority**: ⭐ LOW
**Status**: No URL provided, no public list

**Estimated Records**: 100-500 institutions
**Contact Required**: Yes
**Institution Types**: Libraries, archives

**Implementation Notes**:
- May require Persian language communication
- Political/access challenges possible

---

### 🇰🇿 Kazakhstan - National Library of Kazakhstan

**Priority**: ⭐ LOW
**URL**: https://www.nlrk.kz/index.php?lang=en
**Status**: No public ISIL list

**Estimated Records**: 100-300 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇲🇩 Moldova - National Library of Moldova

**Priority**: ⭐ LOW
**URL**: http://bnrm.md/
**Status**: No public ISIL list

**Estimated Records**: 50-100 institutions
**Contact Required**: Yes
**Institution Types**: Libraries

---

### 🇳🇵 Nepal

**Priority**: ⭐ LOW
**Status**: No agency listed in official ISIL directory

**Estimated Records**: Unknown
**Contact Required**: Yes - contact the ISIL Registration Authority (Danish Agency for Culture and Palaces) for information
**Institution Types**: Unknown

---

### 🇰🇷 South Korea - National Library of Korea

**Priority**: ⭐⭐ MEDIUM
**URL**: https://www.nl.go.kr/
**Status**: Main NLK site doesn't list ISIL directory

**Alternative Registry**: https://librarian.nl.go.kr/ (discovered via Exa search)

**Estimated Records**: 1,000-2,000 institutions
**Contact Required**: Investigate librarian.nl.go.kr before contacting
**Institution Types**: Libraries

**Implementation Notes**:
- Secondary registry exists at librarian subdomain
- Need to verify if this is official ISIL registry
- Korean language interface
- May have English option

---

### 🇬🇧 United Kingdom - British Library

**Priority**: ⭐⭐⭐ HIGH (when restored)
**URL**: https://www.bl.uk/help/get-a-marc-organization-code (currently 404)
**Status**: ⚠️ **OFFLINE** - Cyber attack October 2023

**Estimated Records**: 5,000+ institutions
**Contact Required**: Wait for service restoration
**Institution Types**: Libraries, archives, museums

**Implementation Notes**:
- British Library systems offline since October 2023 cyber attack
- ISIL/MARC code directory unavailable
- Monitor BL announcements for restoration
- Largest UK heritage collection metadata
- Critical for UK coverage but currently blocked

---

## Non-National Registries (5 Agencies)

### EUR - European Organizations

**Priority**: ⭐ LOW
**URL**: https://www.eui.eu/Documents/Research/HistoricalArchivesofEU/ISIL/isil-directory.pdf
**Access Method**: Direct PDF download

**Estimated Records**: 20-50 institutions
**Complexity**: ⭐ EASY
**Institution Types**: European Union organizations

**Implementation Notes**:
- PDF format requires text extraction
- Covers EU institutional libraries
- Small specialized dataset

---

### GTB - GTBib (Spanish-language interlibrary loan)

**Priority**: ⭐ LOW
**URL**: https://directorio.gtbib.net/lista_isil
**Access Method**: Direct list

**Estimated Records**: 100-300 institutions
**Complexity**: ⭐ EASY
**Institution Types**: Spanish-language libraries

**Implementation Notes**:
- Spanish language
- Interlibrary loan network
- May overlap with national registries (ES, MX, AR, etc.)

---

### O - OCLC (RFID tags)

**Priority**: ⭐ LOW
**URL**: http://www.oclc.org/contacts/libraries/
**Access Method**: Search interface

**Estimated Records**: Special encoding system (not traditional ISIL)
**Complexity**: ⭐⭐ MEDIUM
**Institution Types**: OCLC member libraries

**Implementation Notes**:
- Letter "O" used for OCLC RFID technical encoding
- Not a traditional national registry
- Overlaps with US and other national codes

---

### OCLC - OCLC WorldCat Symbol

**Priority**: ⭐⭐ MEDIUM
**URL**: http://www.oclc.org/contacts/libraries/
**Access Method**: Search interface

**Estimated Records**: 10,000+ institutions (global)
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: WorldCat member libraries

**Implementation Notes**:
- OCLC WorldCat symbols used globally
- Different from ISIL but related
- Massive dataset, may overlap with national registries
- Useful for cross-referencing and enrichment

---

### ZDB - Zeitschriftendatenbank (German Serials Database)

**Priority**: ⭐⭐ MEDIUM
**URL**: https://zdb-katalog.de/index.xhtml
**Access Method**: Search interface

**Estimated Records**: 3,000+ institutions
**Complexity**: ⭐⭐⭐ HARD
**Institution Types**: Libraries holding serials

**Implementation Notes**:
- German serials/journals database
- Managed by Staatsbibliothek zu Berlin
- May have API (check ZDB documentation)
- Overlaps with DE- national codes

---

## Extraction Recommendations

### Phase 1: Quick Wins (Tier 1 Downloads)
**Timeline**: 1-2 days
**Expected Yield**: 7,000-8,000 records

1. **Denmark** - 4 CSV downloads (700 records)
2. **Japan** - 5 CSV/Excel downloads (6,000+ records)
3. **Hungary** - Direct list (200-500 records)
4. **Israel** - Direct list (200-500 records)
5. **Cyprus** - Direct list (50-100 records)
6. **Luxembourg** - Direct list (20-50 records)

**Tools Required**: `pandas`, `requests`, `openpyxl`
**Risk**: Low
**Effort**: Low

---

### Phase 2: Major APIs (Tier 2)
**Timeline**: 3-5 days
**Expected Yield**: 10,000-12,000 records

1. **Germany** - JSON-API (10,000+ records)
2. **Austria** - Search/API (500-1,000 records)

**Tools Required**: `requests`, `json`, pagination logic
**Risk**: Medium (rate limiting, API changes)
**Effort**: Medium

---

### Phase 3: High-Value Scraping (Tier 3 Priority)
**Timeline**: 2-3 weeks
**Expected Yield**: 50,000+ records

1. **USA** - Library of Congress (37,000+ records) - **PRIORITY**
2. **Canada** - Library and Archives Canada (2,000-5,000 records)
3. **France** - SUDOC (3,000+ records)
4. **Italy** - ICCU Anagrafe (5,000+ records)

**Tools Required**: `requests`, `BeautifulSoup4`, `Selenium` (if needed)
**Risk**: High (blocking, CAPTCHA, rate limiting)
**Effort**: High

---

### Phase 4: Remaining Scraping (Tier 3 Secondary)
**Timeline**: 2-4 weeks
**Expected Yield**: 10,000-15,000 records

1. Australia, Switzerland, Czech Republic, Finland, Norway, Bulgaria, Croatia, New Zealand, Qatar, Romania, Russia, Slovenia, Slovakia

**Tools Required**: Same as Phase 3
**Risk**: Medium-High
**Effort**: Medium-High

---

### Phase 5: Contact-Based (Tier 4)
**Timeline**: Ongoing (3-6 months for responses)
**Expected Yield**: 5,000-10,000 records

1. **UK** - Wait for British Library restoration (5,000+ records)
2. **South Korea** - Investigate librarian.nl.go.kr (1,000-2,000 records)
3. Argentina, Greenland, Iran, Kazakhstan, Moldova, Nepal - Contact agencies

**Tools Required**: Email templates, institutional affiliation
**Risk**: High (may not receive data)
**Effort**: Low (but long wait times)

---

## Technical Implementation Guide

### Standard Parser Template

```python
import io
import json
from datetime import datetime, timezone

import pandas as pd
import requests
from bs4 import BeautifulSoup


def utc_timestamp():
    """Return the current UTC time in ISO 8601 form, e.g. 2025-11-17T10:30:00Z"""
    return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


class ISILRegistryScraper:
    def __init__(self, country_code, registry_name, base_url, timeout=30):
        self.country_code = country_code
        self.registry_name = registry_name
        self.base_url = base_url
        self.timeout = timeout  # seconds; avoids hanging on unresponsive servers
        self.headers = {
            'User-Agent': 'GLAM Heritage Institution Scraper/1.0 (Academic Research)'
        }
        self.records = []

    def fetch_csv(self, url, encoding='utf-8'):
        """Fetch CSV file from direct download URL"""
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        response.raise_for_status()
        # Decode the raw bytes explicitly: requests may guess the wrong charset
        # when a text/csv response does not declare one
        return pd.read_csv(io.BytesIO(response.content), encoding=encoding)

    def fetch_json_api(self, url, params=None):
        """Fetch one response from a JSON API (pagination is left to the caller)"""
        response = requests.get(url, params=params, headers=self.headers,
                                timeout=self.timeout)
        response.raise_for_status()
        return response.json()

    def scrape_html_table(self, url):
        """Fetch a webpage and return the parsed HTML"""
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Registry-specific table parsing goes here
        return soup

    def standardize_record(self, raw_record):
        """Convert raw record to standard format"""
        return {
            'isil_code': raw_record.get('isil_code'),
            'name': raw_record.get('name'),
            'city': raw_record.get('city'),
            'region': raw_record.get('region'),
            'country': self.country_code,
            'postal_code': raw_record.get('postal_code'),
            'address': raw_record.get('address'),
            'phone': raw_record.get('phone'),
            'email': raw_record.get('email'),
            'url': raw_record.get('url'),
            'institution_type': raw_record.get('type'),
            'registry': self.registry_name,
            'source_url': self.base_url,
            'extraction_date': utc_timestamp()
        }

    def export_csv(self, output_path):
        """Export records to CSV"""
        df = pd.DataFrame(self.records)
        df.to_csv(output_path, index=False, encoding='utf-8')

    def export_json(self, output_path):
        """Export records to JSON with metadata"""
        output = {
            'metadata': {
                'country': self.country_code,
                'registry': self.registry_name,
                'source_url': self.base_url,
                'extraction_date': utc_timestamp(),
                'record_count': len(self.records)
            },
            'records': self.records
        }
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output, f, indent=2, ensure_ascii=False)
```

### Rate Limiting Best Practices

```python
import time


class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None

    def wait(self):
        """Wait if necessary to respect rate limit"""
        if self.last_request is not None:
            # A monotonic clock is unaffected by system clock adjustments
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()


# Usage (total_pages, fetch_page and process_data are placeholders
# supplied by the registry-specific scraper)
limiter = RateLimiter(requests_per_second=1)

for page in range(1, total_pages + 1):
    limiter.wait()
    data = fetch_page(page)
    process_data(data)
```

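Rate limiting covers only part of polite scraping; the pre-extraction checklist also asks for a robots.txt review. The sketch below shows one way to automate that check with the standard library's `urllib.robotparser`. The helper name `build_robots_checker` and the sample robots.txt content are illustrative assumptions, not part of any registry's actual configuration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "GLAM Heritage Institution Scraper/1.0 (Academic Research)"


def build_robots_checker(robots_txt, user_agent=USER_AGENT):
    """Parse robots.txt text and return (allowed, crawl_delay).

    allowed(url) reports whether user_agent may fetch the URL;
    crawl_delay is the Crawl-delay value in seconds, or None if absent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


# Hypothetical robots.txt content; in practice, download it from the
# registry host (for example with requests) before building the checker
sample_robots = """User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

allowed, crawl_delay = build_robots_checker(sample_robots)
print(allowed("https://example.org/list"))      # True
print(allowed("https://example.org/admin/x"))   # False
print(crawl_delay)                              # 5
```

When a `Crawl-delay` is present, it can be used to set the limiter, for example `RateLimiter(requests_per_second=1 / crawl_delay)`.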
### Pagination Handler

```python
import time

import requests


def paginate_api(base_url, page_size=100, max_pages=None):
    """Generic pagination handler for APIs"""
    page = 1
    all_records = []

    while True:
        params = {'page': page, 'size': page_size}
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
        data = response.json()

        # 'member' and 'view' follow the response layout assumed here;
        # adjust the keys for APIs that structure results differently
        records = data.get('member', [])
        if not records:
            break

        all_records.extend(records)

        # Check for last page; if the API reports no page count,
        # keep going until an empty page is returned
        total_pages = data.get('view', {}).get('numberOfPages')
        if total_pages is not None and page >= total_pages:
            break

        if max_pages and page >= max_pages:
            break

        page += 1
        time.sleep(1)  # Polite scraping

    return all_records
```

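The handler above stops at the first HTTP error. For long paginated runs, transient 429 and 503 responses are usually worth retrying with a growing delay. The following is a minimal sketch of that idea under stated assumptions: `fetch_with_retry`, its parameters, and the simulated server are hypothetical, and `send` stands for any zero-argument callable that returns an object with a `status_code` attribute (such as `lambda: requests.get(url, timeout=30)`).

```python
import time
from types import SimpleNamespace

RETRYABLE_STATUS = {429, 503}  # 403 usually signals blocking, so it is not retried


def fetch_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out.

    Returns (response, failures), where failures is a list of
    (attempt, status_code) pairs suitable for a retry log.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE_STATUS:
            return response, failures
        failures.append((attempt, response.status_code))
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return response, failures


# Simulated server (no network): fails twice, then succeeds
statuses = iter([503, 429, 200])
waits = []
response, failures = fetch_with_retry(
    lambda: SimpleNamespace(status_code=next(statuses)),
    sleep=waits.append,  # record the delays instead of sleeping
)
print(response.status_code, failures, waits)
# 200 [(1, 503), (2, 429)] [1.0, 2.0]
```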
---

## Required Python Packages

```bash
# Core dependencies
pip install pandas requests beautifulsoup4 openpyxl lxml

# For complex scraping
pip install selenium webdriver-manager

# For API interactions
pip install urllib3

# For JSON/CSV handling
pip install jsonlines

# For progress tracking
pip install tqdm

# For parallel processing (optional)
pip install joblib
```

---

## Output Format Specification

### CSV Output Schema

```csv
isil_code,name,city,region,country,postal_code,address,phone,email,url,institution_type,registry,source_url,extraction_date
JP-1001234,"Tokyo Metropolitan Library","Tokyo","13","JP","100-0001","1-1 Minami-Osawa, Hachioji-shi","+81-42-123-4567","info@library.metro.tokyo.jp","https://www.library.metro.tokyo.jp","LIBRARY","National Diet Library","https://www.ndl.go.jp/en/library/isil/","2025-11-17T10:30:00Z"
```

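A `pandas` export takes its column order from whatever keys the records happen to contain. To guarantee the header shown above, the columns can be pinned explicitly. The sketch below uses the standard library's `csv.DictWriter`; the function name `records_to_csv` and the `CSV_COLUMNS` constant are illustrative, with the column list copied from the schema header.

```python
import csv
import io

# Column order copied from the CSV output schema above
CSV_COLUMNS = [
    "isil_code", "name", "city", "region", "country", "postal_code",
    "address", "phone", "email", "url", "institution_type", "registry",
    "source_url", "extraction_date",
]


def records_to_csv(records, out):
    """Write records to a file-like object using the fixed schema order."""
    writer = csv.DictWriter(
        out,
        fieldnames=CSV_COLUMNS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # registry-specific extras (e.g. fax) are dropped
    )
    writer.writeheader()
    writer.writerows(records)


# Example with a partial record; 'fax' is not part of the CSV schema
buffer = io.StringIO()
records_to_csv(
    [{"isil_code": "JP-1001234", "name": "Tokyo Metropolitan Library",
      "country": "JP", "fax": "+81-42-123-4568"}],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` and `encoding="utf-8"`, as the `csv` module documentation recommends.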
### JSON Output Schema

```json
{
  "metadata": {
    "country": "JP",
    "country_name": "Japan",
    "registry": "National Diet Library",
    "registry_agency": "National Diet Library (NDL)",
    "source_url": "https://www.ndl.go.jp/en/library/isil/",
    "extraction_date": "2025-11-17T10:30:00Z",
    "extraction_method": "Direct CSV download",
    "record_count": 6234,
    "institution_types": ["LIBRARY", "MUSEUM", "ARCHIVE"],
    "data_license": "Public Domain (per NDL)",
    "notes": "Includes public libraries, university libraries, museums, and archives"
  },
  "records": [
    {
      "isil_code": "JP-1001234",
      "name": "Tokyo Metropolitan Library",
      "name_japanese": "東京都立図書館",
      "name_kana": "とうきょうとりつとしょかん",
      "city": "Tokyo",
      "region": "13",
      "country": "JP",
      "postal_code": "100-0001",
      "address": "1-1 Minami-Osawa, Hachioji-shi, Tokyo",
      "phone": "+81-42-123-4567",
      "fax": "+81-42-123-4568",
      "email": "info@library.metro.tokyo.jp",
      "url": "https://www.library.metro.tokyo.jp",
      "institution_type": "LIBRARY",
      "registration_date": "2011-10-15",
      "last_update_date": "2024-03-20",
      "registry": "National Diet Library",
      "source_url": "https://www.ndl.go.jp/en/library/isil/",
      "extraction_date": "2025-11-17T10:30:00Z"
    }
  ]
}
```

---

## Quality Assurance Checklist

### Pre-Extraction
- [ ] Verify source URL is accessible
- [ ] Check for API documentation
- [ ] Test sample queries/downloads
- [ ] Review robots.txt and terms of service
- [ ] Implement rate limiting (1 req/sec default)
- [ ] Set up error logging

### During Extraction
- [ ] Monitor for HTTP errors (429, 403, 503)
- [ ] Log failed requests for retry
- [ ] Validate ISIL code format (country code + structure)
- [ ] Check for duplicate records
- [ ] Handle encoding issues (UTF-8)
- [ ] Track progress (records extracted / estimated total)

### Post-Extraction
- [ ] Validate record count against estimate
- [ ] Check for required fields (ISIL code, name, country)
- [ ] Detect malformed ISIL codes
- [ ] Cross-reference with existing datasets
- [ ] Generate extraction report (metadata)
- [ ] Export to CSV and JSON formats
- [ ] Document any issues or anomalies

---

## Data Quality Metrics

### Record Completeness Score

| Field | Weight | Required |
|-------|--------|----------|
| ISIL code | 20% | Yes |
| Name | 20% | Yes |
| City | 15% | No |
| Country | 10% | Yes |
| Postal code | 5% | No |
| Address | 10% | No |
| Phone | 5% | No |
| Email | 5% | No |
| URL | 10% | No |

**Completeness = (sum of the weights of all present fields) / (total weight)**

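
The score can be computed directly from the weights in the table. This is a minimal sketch, assuming the table's fields map onto the record keys used in the standard output schema; the names `FIELD_WEIGHTS` and `completeness_score` are illustrative. A field counts as present when its value is non-empty after stripping whitespace.

```python
# Weights (in percent) from the completeness table, keyed by the
# record field names of the standard output schema
FIELD_WEIGHTS = {
    "isil_code": 20,
    "name": 20,
    "city": 15,
    "country": 10,
    "postal_code": 5,
    "address": 10,
    "phone": 5,
    "email": 5,
    "url": 10,
}


def completeness_score(record, weights=FIELD_WEIGHTS):
    """Return the weighted share of present fields, between 0.0 and 1.0."""
    present = sum(
        weight
        for field, weight in weights.items()
        if str(record.get(field) or "").strip()
    )
    return present / sum(weights.values())


record = {
    "isil_code": "JP-1001234",
    "name": "Tokyo Metropolitan Library",
    "city": "Tokyo",
    "country": "JP",
    "email": "   ",  # whitespace only, so it does not count as present
}
print(completeness_score(record))  # 0.65
```

In this example the present fields are ISIL code (20), name (20), city (15), and country (10), giving 65 out of a total weight of 100.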
### ISIL Code Validation

```python
import re

# ISO 15511 limits an ISIL to 16 characters in total
ISIL_MAX_LENGTH = 16


def validate_isil_code(isil_code, country_code):
    """Validate ISIL code format"""
    # Basic format: CC-XXXXX (country code + hyphen + institution identifier).
    # The identifier may contain letters, digits, and the marks "/", "-" and ":"
    pattern = rf'{re.escape(country_code)}-[A-Za-z0-9/:-]+'

    if not re.fullmatch(pattern, isil_code):
        return False, "Invalid format"

    # Check total length, including the prefix and the hyphen
    if len(isil_code) > ISIL_MAX_LENGTH:
        return False, "Invalid length"

    return True, "Valid"


# Example
validate_isil_code("JP-1001234", "JP")  # (True, "Valid")
validate_isil_code("INVALID", "JP")     # (False, "Invalid format")
```

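Format validation handles one code at a time; the post-extraction checklist also calls for checks across the whole batch. The sketch below covers two of them, required fields and duplicate ISIL codes. The function name `audit_records` and the sample records are hypothetical; the required fields are the three marked "Yes" in the completeness table.

```python
from collections import Counter

# Fields marked as required in the completeness table
REQUIRED_FIELDS = ("isil_code", "name", "country")


def audit_records(records):
    """Return (missing, duplicates) for a batch of standardized records.

    missing:    list of (record index, [absent required fields])
    duplicates: sorted list of ISIL codes that occur more than once
    """
    missing = []
    codes = Counter()
    for index, record in enumerate(records):
        absent = [
            field for field in REQUIRED_FIELDS
            if not str(record.get(field) or "").strip()
        ]
        if absent:
            missing.append((index, absent))
        code = str(record.get("isil_code") or "").strip()
        if code:
            codes[code] += 1
    duplicates = sorted(code for code, count in codes.items() if count > 1)
    return missing, duplicates


# Hypothetical sample batch
batch = [
    {"isil_code": "JP-1001234", "name": "Library A", "country": "JP"},
    {"isil_code": "JP-1001234", "name": "Library A (copy)", "country": "JP"},
    {"isil_code": "JP-1005678", "name": "", "country": "JP"},
]
missing, duplicates = audit_records(batch)
print(missing)     # [(2, ['name'])]
print(duplicates)  # ['JP-1001234']
```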
---

## Contact Information for Support

### ISIL Registration Authority (Denmark)
- **Organization**: Danish Agency for Culture and Palaces
- **Email**: ISIL@slks.dk
- **Website**: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil

### Germany (Staatsbibliothek zu Berlin)
- **Contact**: Carsten Klee
- **Email**: carsten.klee@sbb.spk-berlin.de
- **Phone**: +49 30 266 434402
- **Topic**: JSON-API questions, ISIL allocation

### USA (Library of Congress)
- **Organization**: Network Development and MARC Standards Office
- **Email**: ndmso@loc.gov
- **Fax**: +1-202-707-0115
- **Mailing Address**: 101 Independence Avenue, S.E., Washington, D.C. 20540-4402

---

## Next Steps for Implementation

### Immediate Actions (Week 1)
1. ✅ Complete reconnaissance (DONE)
2. Create extraction scripts for Denmark (Tier 1)
3. Create extraction scripts for Japan (Tier 1)
4. Test Germany JSON-API (Tier 2)
5. Document findings and update inventory

### Short-term (Weeks 2-4)
1. Extract all Tier 1 registries (7,000-8,000 records)
2. Implement Germany API scraper (10,000 records)
3. Begin USA Library of Congress scraper (37,000 records)
4. Set up quality assurance pipeline
5. Create unified dataset (CSV + JSON)

### Medium-term (Months 2-3)
1. Complete Tier 2 and high-priority Tier 3 scraping
2. Cross-reference with existing GLAM datasets (Wikidata, OSM)
3. Enrich records with geocoding (lat/lon)
4. Generate statistics and visualizations
5. Publish dataset and documentation

### Long-term (Months 4-6)
1. Contact Tier 4 agencies for bulk data
2. Monitor UK British Library for restoration
3. Implement automated update pipeline
4. Create public API for GLAM dataset
5. Integrate with LinkML schema and RDF export

---

## Change Log

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-17 | Initial scraper inventory created |

---

## References

- **ISO 15511:2019**: International Standard Identifier for Libraries and Related Organizations
- **ISIL Registration Authority**: https://slks.dk/english/work-areas/libraries-and-literature/library-standards/isil
- **GLAM Project Documentation**: `/docs/AGENTS.md`, `/data/isil/GLOBAL_ISIL_AGENCIES_OFFICIAL.md`
- **Session Summary**: `/docs/sessions/SESSION_2025-11-17_ISIL_GLOBAL_MAPPING.md`

---

**End of Document**