# Czech ISIL Investigation - Wikidata Extraction Results

**Date**: 2025-11-20
**Task**: Priority 2, Task 6 - Extract ISIL codes from Wikidata
**Status**: ✅ COMPLETE (minimal results)

---

## Executive Summary

**Wikidata ISIL Coverage for Czech Institutions: MINIMAL**

Queried all 6,719 Czech institutions that have Wikidata Q-numbers for ISIL codes (property P791).

**Results**:
- **ISIL codes found**: 2 (0.03% of the 6,719 institutions queried)
- **In our dataset**: 1 (Národní technická knihovna - CZ-PrSTK)
- **Not in our dataset**: 1 (Jihočeská vědecká knihovna - OCLC-CVK)

**Coverage**: 1 of 8,694 institutions (0.01%) - essentially zero

**Conclusion**: Wikidata is NOT a viable source for Czech ISIL codes. Must contact NK ČR directly.

---

## Detailed Findings

### SPARQL Query Results

**Query**: Czech heritage institutions (museums, libraries, archives, galleries) with ISIL codes

```sparql
SELECT ?item ?itemLabel ?isil WHERE {
  VALUES ?type { wd:Q33506 wd:Q7075 wd:Q166118 wd:Q1007870 }
  ?item wdt:P31/wdt:P279* ?type .
  ?item wdt:P17 wd:Q213 .
  ?item wdt:P791 ?isil .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
}
```

**Results**: 2 institutions

| Institution | Wikidata | ISIL Code | In Dataset |
|-------------|----------|-----------|------------|
| Národní technická knihovna | Q630893 | CZ-PrSTK | ✅ Yes |
| Jihočeská vědecká knihovna | Q20057526 | OCLC-CVK | ❌ No |

### Why So Few ISIL Codes in Wikidata?

**Hypothesis 1: ISIL codes rarely added to Wikidata**
- ISIL property (P791) was created on Wikidata in 2013
- Low priority for Wikidata editors (not visible to end users)
- Most Wikidata editing focuses on website, location, and founding date

**Hypothesis 2: Czech institutions lack ISIL codes entirely**
- The ISIL system in the Czech Republic may be underdeveloped
- NK ČR may not have assigned codes to most institutions
- Small libraries/archives may not qualify for ISIL codes

**Hypothesis 3: Data entry gap**
- NK ČR has an ISIL registry, but it has not been published to Wikidata
- No bulk import project (unlike VIAF, which has better Wikidata coverage)

### Comparison to Original Expectation

**Expected**: 306 ISIL codes (from initial enrichment script comment)
**Actual**: 2 ISIL codes
**Discrepancy**: 99.3% fewer than expected

**Root cause**: The original enrichment script's SPARQL query wrapped the ISIL pattern in `OPTIONAL { ?item wdt:P791 ?isil }`, so it returned a row for every matching item even when `?isil` was unbound. The 306 figure counted VIAF IDs, not ISIL codes.

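The root cause can be illustrated with a small sketch. It assumes the script parsed the standard SPARQL JSON result format, in which an unbound variable is simply absent from a row; the sample rows, the placeholder item ids `Q111` and `Q222`, the VIAF value, and the helper `count_bound` are all hypothetical and not taken from the enrichment script. Counting rows overstates the ISIL total, while counting rows where `isil` is present gives the real figure.

```python
# Hypothetical bindings shaped like a SPARQL JSON response.
# With OPTIONAL, a row is returned even when ?isil is unbound;
# the "isil" key is then missing from that row.
sample_bindings = [
    {"item": {"value": "http://www.wikidata.org/entity/Q630893"},
     "isil": {"value": "CZ-PrSTK"}},
    {"item": {"value": "http://www.wikidata.org/entity/Q111"},   # placeholder id
     "viaf": {"value": "000000000"}},                            # placeholder VIAF
    {"item": {"value": "http://www.wikidata.org/entity/Q222"}},  # placeholder id
]


def count_bound(bindings, variable):
    """Count rows in which the given SPARQL variable is actually bound."""
    return sum(1 for row in bindings if variable in row)


total_rows = len(sample_bindings)                  # what a naive count reports
isil_rows = count_bound(sample_bindings, "isil")   # rows that really carry an ISIL
print(f"rows returned: {total_rows}, rows with ISIL: {isil_rows}")
```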

---

## Script Performance

### Batch Query Efficiency

**Processed**: 6,719 Q-numbers in 135 batches (50 Q-numbers per batch)
**Time**: ~2 minutes
**Rate**: ~56 Q-numbers/second
**API calls**: 135 SPARQL queries

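The batching figures above can be reproduced with a short sketch. This is not the code of `scripts/extract_isil_from_wikidata.py`; the helpers `batches` and `build_isil_query` and the placeholder Q-numbers are assumptions used only to show how 6,719 ids split into 135 queries of at most 50 ids each. No network request is made here.

```python
from typing import Iterator, List

BATCH_SIZE = 50  # batch size reported above


def batches(qids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` Q-numbers."""
    for start in range(0, len(qids), size):
        yield qids[start:start + size]


def build_isil_query(qids: List[str]) -> str:
    """Build one SPARQL query that asks for P791 (ISIL) on a batch of items."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


# Placeholder ids standing in for the 6,719 real Q-numbers.
qids = [f"Q{n}" for n in range(1, 6720)]
queries = [build_isil_query(batch) for batch in batches(qids)]
print(len(queries))  # 135
```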
### Code Quality

✅ **Batch processing**: Avoids URL length limits
✅ **Error handling**: Graceful fallback on API errors
✅ **Provenance tracking**: Records enrichment history
✅ **Statistics**: Clear before/after reporting

---

## Current Czech Dataset Status

### Identifier Coverage (After Wikidata ISIL Extraction)

| Identifier | Count | Coverage |
|------------|-------|----------|
| **Wikidata Q-numbers** | 6,719 | 77.3% |
| **GPS coordinates** | 6,623 | 76.2% |
| **VIAF IDs** | 306 | 3.5% |
| **ISIL codes** | 1 | **0.01%** ← CRITICAL GAP |

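The coverage column follows directly from the counts and the dataset size of 8,694 institutions. A minimal sketch of the arithmetic (the dictionary and the `coverage` helper are illustrative names, not part of the project scripts):

```python
TOTAL_INSTITUTIONS = 8_694  # size of the Czech dataset

counts = {
    "Wikidata Q-numbers": 6_719,
    "GPS coordinates": 6_623,
    "VIAF IDs": 306,
    "ISIL codes": 1,
}


def coverage(count: int, total: int = TOTAL_INSTITUTIONS) -> float:
    """Return coverage as a percentage of all institutions."""
    return 100 * count / total


for name, count in counts.items():
    pct = coverage(count)
    # Use two decimals for very small shares so they do not round to 0.0%.
    digits = 2 if pct < 0.1 else 1
    print(f"{name}: {count:,} ({pct:.{digits}f}%)")
```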
### Implication

**ISIL codes are the bottleneck** for:
- ✅ GHCID generation (requires ISIL or Q-number for collision resolution)
- ✅ Library community interoperability
- ✅ Cross-system citation
- ✅ Cataloging standards compliance

**Without ISIL codes**, Czech institutions are:
- ❌ Not discoverable via ISIL.org search
- ❌ Not linkable to international library systems
- ❌ Missing standardized identifiers for archival citation

---

## Next Steps - Contact NK ČR

### Option 1: Email NK ČR ISIL Registry Team ⭐ RECOMMENDED

**Contact**: Národní knihovna České republiky (National Library of the Czech Republic)
**Department**: ISIL agency (likely part of cataloging/standards division)

**Email draft**:

```
Subject: Žádost o přístup k registru ISIL kódů pro výzkumný projekt

Dobrý den,

Jsem výzkumný pracovník na projektu mapování paměťových institucí po celém světě.
Chtěl bych požádat o přístup k registru ISIL kódů pro české knihovny, archivy a muzea.

V současné době máme dataset 8 694 českých institucí (knihovny z ADR, archivy z ARON),
ale chybí nám ISIL kódy pro většinu z nich.

Bylo by možné získat:
1. Úplný export registru ISIL kódů (CSV, Excel, nebo jiný formát)
2. Pokyny, jak požádat o ISIL kódy pro instituce, které je ještě nemají

Dataset bude publikován jako otevřená data (CC0 licence) pro výzkumnou komunitu.

Děkuji za pomoc,
[Vaše jméno]

---

Subject: Request for access to ISIL code registry for research project

Dear Sir/Madam,

I am a researcher working on a project to map heritage institutions globally.
I would like to request access to the ISIL code registry for Czech libraries,
archives, and museums.

Currently, we have a dataset of 8,694 Czech institutions (libraries from ADR,
archives from ARON), but ISIL codes are missing for most of them.

Would it be possible to obtain:
1. A complete export of the ISIL code registry (CSV, Excel, or other format)
2. Instructions on how to request ISIL codes for institutions that don't have them yet

The dataset will be published as open data (CC0 license) for the research community.

Thank you for your assistance,
[Your name]
```

**Contact info**:
- Website: https://www.nkp.cz
- Email: info@nkp.cz (general inquiries)
- ISIL page: https://www.nkp.cz/sluzby/sluzby-knihovnam/isil (check for contact)

**Expected outcome** (rough estimates):
- 60% chance of response
- Possible export of 500-2,000 ISIL codes
- Coverage: 0.01% → 5-20%

### Option 2: Query ISIL.org Directly

**Website**: https://isil.org

**Search strategy**:
1. Manual search by country (Czech Republic)
2. Filter by institution type
3. Scrape results (with rate limiting)

**Expected outcome**:
- 100-500 ISIL codes
- Coverage: 0.01% → 1-5%
- Time: 2-3 hours manual work

### Option 3: Generate Provisional ISIL Codes

**Format**: `CZ-[City][InstitutionCode]`

**Process**:
1. Check NK ČR ISIL format conventions
2. Generate codes for institutions without them
3. Mark as `provisional_isil: true` in metadata
4. Submit to NK ČR for official registration

**Risk**: Codes may conflict with existing assignments

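Steps 2 and 3 of the process above could be sketched as follows. This is a minimal illustration, not NK ČR's actual convention: the function `provisional_isil`, its arguments, and the placeholder inputs are assumptions. The validation pattern reflects one reading of ISO 15511 (at most 16 characters in total; letters, digits, `/`, `-`, `:`) and should be checked against the standard before use. The conflict check addresses the risk noted above, but only against codes already known locally, and it compares case-insensitively as a conservative choice.

```python
import re

# Assumed shape: two-letter country prefix, hyphen, unit identifier.
# "CZ-" takes 3 of the assumed 16-character maximum, leaving 13.
ISIL_PATTERN = re.compile(r"^[A-Z]{2}-[A-Za-z0-9/:\-]{1,13}$")


def provisional_isil(city_abbrev: str, institution_code: str, existing: set) -> dict:
    """Build a provisional CZ-[City][InstitutionCode] record, or raise ValueError."""
    candidate = f"CZ-{city_abbrev}{institution_code}"
    if not ISIL_PATTERN.match(candidate):
        raise ValueError(f"not a plausible ISIL: {candidate!r}")
    if candidate.lower() in {code.lower() for code in existing}:
        raise ValueError(f"conflicts with an existing code: {candidate!r}")
    # Flag the code so it is never mistaken for an official assignment.
    return {"isil": candidate, "provisional_isil": True}


existing_codes = {"CZ-PrSTK"}  # the one ISIL code currently in the dataset
record = provisional_isil("Br", "XYZ", existing_codes)  # placeholder inputs
print(record)
```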

---

## Recommendation

**Priority order**:

1. **Email NK ČR** ⭐ (15 min, high value)
2. **Wait 1-2 weeks** for response
3. **Query ISIL.org** (2-3 hours, medium value)
4. **Generate provisional codes** (last resort, requires NK ČR approval)

**Parallel work while waiting**:
- Generate GHCIDs using Wikidata Q-numbers (6,719 institutions)
- GHCIDs will use Q-numbers for collision resolution (not ISIL codes)
- ISIL codes can be added to GHCIDs later when available

---

## Files

### Scripts
- **`scripts/extract_isil_from_wikidata.py`** - Wikidata ISIL extraction (complete)
- **`scripts/enrich_czech_wikidata.py`** - Wikidata enrichment (complete)

### Data
- **`data/instances/czech_unified.yaml`** - 8,694 institutions (unchanged, 1 ISIL code)
- **`data/instances/czech_unified_isil.yaml`** - Same as above (no new ISIL codes)

### Documentation
- **`CZECH_ISIL_WIKIDATA_EXTRACTION.md`** (this file)

---

## Decision Point

**Question**: Should we continue with Task 6 (contact NK ČR), or move to GHCID generation?

**Option A**: Contact NK ČR now (requires manual email action by user)
**Option B**: Skip to GHCID generation using Wikidata Q-numbers (automated)

**Recommendation**: **Option A** - Email NK ČR, then proceed with Option B while waiting for response.

---

**Status**: Task 6 investigation complete. Awaiting decision on next action.