glam/CZECH_ISIL_WIKIDATA_EXTRACTION.md
2025-11-21 22:12:33 +01:00

# Czech ISIL Investigation - Wikidata Extraction Results
- **Date**: 2025-11-20
- **Task**: Priority 2, Task 6 - Extract ISIL codes from Wikidata
- **Status**: ✅ COMPLETE (minimal results)

---
## Executive Summary
**Wikidata ISIL Coverage for Czech Institutions: MINIMAL**

We queried all 6,719 Czech institutions that have Wikidata Q-numbers for ISIL codes (property P791).

**Results**:
- **ISIL codes found**: 2 (0.03% of the 6,719 institutions queried)
- **In our dataset**: 1 (Národní technická knihovna - CZ-PrSTK)
- **Not in our dataset**: 1 (Jihočeská vědecká knihovna - OCLC-CVK)

**Coverage**: 1 of 8,694 institutions in the full dataset (0.01%), essentially zero.

**Conclusion**: Wikidata is NOT a viable source for Czech ISIL codes. The codes must be requested from NK ČR directly.

---
## Detailed Findings
### SPARQL Query Results
**Query**: Czech heritage institutions (museums, libraries, archives, galleries) with ISIL codes

```sparql
SELECT ?item ?itemLabel ?isil WHERE {
  # Heritage institution types: museum, library, archive, art gallery
  VALUES ?type { wd:Q33506 wd:Q7075 wd:Q166118 wd:Q1007870 }
  ?item wdt:P31/wdt:P279* ?type .   # instance of one of the types or a subclass
  ?item wdt:P17 wd:Q213 .           # country: Czech Republic
  ?item wdt:P791 ?isil .            # ISIL code, required (not OPTIONAL)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
}
```

**Results**: 2 institutions

| Institution | Wikidata | ISIL Code | In Dataset |
|-------------|----------|-----------|------------|
| Národní technická knihovna | Q630893 | CZ-PrSTK | ✅ Yes |
| Jihočeská vědecká knihovna | Q20057526 | OCLC-CVK | ❌ No |
### Why So Few ISIL Codes in Wikidata?
**Hypothesis 1: ISIL codes rarely added to Wikidata**
- ISIL property (P791) was added to Wikidata in 2013
- Low priority for Wikidata editors (not visible to end users)
- Most Wikidata editing effort goes to website, location, and founding date

**Hypothesis 2: Czech institutions lack ISIL codes entirely**
- ISIL system in Czech Republic may be underdeveloped
- NK ČR may not have assigned codes to most institutions
- Small libraries/archives may not qualify for ISIL codes

**Hypothesis 3: Data entry gap**
- NK ČR has an ISIL registry, but it has not been published on Wikidata
- No bulk import project (unlike VIAF, which has better Wikidata coverage)

### Comparison to Original Expectation
- **Expected**: 306 ISIL codes (from a comment in the initial enrichment script)
- **Actual**: 2 ISIL codes
- **Discrepancy**: 99.3% fewer than expected
- **Root cause**: The original enrichment script's SPARQL query used `OPTIONAL { ?item wdt:P791 ?isil }`, so it returned a row for every matched institution even when `?isil` was unbound. The figure of 306 counted VIAF IDs, not ISIL codes.

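
To make the root cause concrete, here is a minimal sketch of how a row count can overstate ISIL coverage when the ISIL pattern is `OPTIONAL`. It assumes results in the standard SPARQL JSON results shape, where an unbound variable is simply absent from a row. The function name, the variable names (`viaf`, `isil`), and the sample rows are illustrative assumptions, not taken from the enrichment script; the placeholder values are not real data.

```python
def count_bound(bindings, var):
    """Count result rows in which the SPARQL variable `var` is bound."""
    return sum(1 for row in bindings if var in row)


# Hypothetical sample: three matched items, all with a VIAF ID,
# only one with an ISIL code. Everything except Q630893 / CZ-PrSTK
# is a placeholder, not real data.
sample_bindings = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q630893"},
        "viaf": {"type": "literal", "value": "VIAF-PLACEHOLDER-1"},
        "isil": {"type": "literal", "value": "CZ-PrSTK"},
    },
    {
        "item": {"type": "uri", "value": "http://example.org/placeholder-2"},
        "viaf": {"type": "literal", "value": "VIAF-PLACEHOLDER-2"},
        # no "isil" key: the OPTIONAL pattern did not match
    },
    {
        "item": {"type": "uri", "value": "http://example.org/placeholder-3"},
        "viaf": {"type": "literal", "value": "VIAF-PLACEHOLDER-3"},
    },
]

total_rows = len(sample_bindings)                  # what a naive row count reports
with_viaf = count_bound(sample_bindings, "viaf")   # rows that carry a VIAF ID
with_isil = count_bound(sample_bindings, "isil")   # rows that actually carry an ISIL code

print(f"rows={total_rows} viaf={with_viaf} isil={with_isil}")
# prints: rows=3 viaf=3 isil=1
```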
---
## Script Performance
### Batch Query Efficiency
- **Processed**: 6,719 Q-numbers in 135 batches (50 Q-numbers per batch)
- **Time**: ~2 minutes
- **Rate**: ~56 Q-numbers/second
- **API calls**: 135 SPARQL queries
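
The batching arithmetic above can be sketched as follows. This is an assumption-laden illustration, not the code in `scripts/extract_isil_from_wikidata.py`: the helper names `chunked` and `build_isil_query` are hypothetical, the Q-numbers are generated placeholders, and the query text is one plausible way to restrict a lookup to a batch of items with a `VALUES` clause.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of `ids`, each holding at most `size` entries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_isil_query(qids):
    """Build a SPARQL query that looks up P791 (ISIL) for one batch of Q-numbers."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


# Placeholder Q-numbers standing in for the 6,719 real ones.
qids = [f"Q{n}" for n in range(1, 6720)]

batches = list(chunked(qids, size=50))
queries = [build_isil_query(batch) for batch in batches]

print(len(batches), len(batches[-1]))
# prints: 135 19
```

Keeping each query to 50 items keeps the request short, which is the stated reason for batching (URL length limits); 134 full batches plus one batch of 19 account for all 6,719 Q-numbers.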
### Code Quality
- **Batch processing**: Avoids URL length limits
- **Error handling**: Graceful fallback on API errors
- **Provenance tracking**: Records enrichment history
- **Statistics**: Clear before/after reporting

---
## Current Czech Dataset Status
### Identifier Coverage (After Wikidata ISIL Extraction)
| Identifier | Count | Coverage |
|------------|-------|----------|
| **Wikidata Q-numbers** | 6,719 | 77.3% |
| **GPS coordinates** | 6,623 | 76.2% |
| **VIAF IDs** | 306 | 3.5% |
| **ISIL codes** | 1 | **0.01%** ← CRITICAL GAP |
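
The coverage column can be reproduced from the counts with a few lines of Python. This is only a sketch of the arithmetic, using the counts from the table and the dataset size of 8,694; the function name `coverage_percent` is an assumption.

```python
TOTAL_INSTITUTIONS = 8694

# Counts copied from the table above.
counts = {
    "Wikidata Q-numbers": 6719,
    "GPS coordinates": 6623,
    "VIAF IDs": 306,
    "ISIL codes": 1,
}


def coverage_percent(count, total=TOTAL_INSTITUTIONS, digits=1):
    """Return `count` as a percentage of `total`, rounded to `digits` places."""
    return round(100 * count / total, digits)


for name, count in counts.items():
    # Use two decimal places for very small counts so they do not round to 0.0.
    digits = 2 if count < 10 else 1
    print(f"{name}: {count} ({coverage_percent(count, digits=digits)}%)")
```

Running it gives 77.3%, 76.2%, 3.5%, and 0.01%, matching the table.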
### Implication
**ISIL codes are the bottleneck** for:
- ✅ GHCID generation (requires ISIL or Q-number for collision resolution)
- ✅ Library community interoperability
- ✅ Cross-system citation
- ✅ Cataloging standards compliance

**Without ISIL codes**, Czech institutions are:
- ❌ Not discoverable via ISIL.org search
- ❌ Not linkable to international library systems
- ❌ Missing standardized identifiers for archival citation
---
## Next Steps - Contact NK ČR
### Option 1: Email NK ČR ISIL Registry Team ⭐ RECOMMENDED
- **Contact**: Národní knihovna České republiky (NK ČR, National Library of the Czech Republic)
- **Department**: ISIL agency (likely part of the cataloging/standards division)

**Email draft**:
```
Subject: Žádost o přístup k registru ISIL kódů pro výzkumný projekt
Dobrý den,
Jsem výzkumný pracovník na projektu mapování paměťových institucí po celém světě.
Chtěl bych požádat o přístup k registru ISIL kódů pro české knihovny, archivy a muzea.
V současné době máme dataset 8 694 českých institucí (knihovny z ADR, archivy z ARON),
ale chybí nám ISIL kódy pro většinu z nich.
Bylo by možné získat:
1. Úplný export registru ISIL kódů (CSV, Excel, nebo jiný formát)
2. Pokyny, jak požádat o ISIL kódy pro instituce, které je ještě nemají
Dataset bude publikován jako otevřená data (CC0 licence) pro výzkumnou komunitu.
Děkuji za pomoc,
[Vaše jméno]
---
Subject: Request for access to ISIL code registry for research project
Dear Sir/Madam,
I am a researcher working on a project to map heritage institutions globally.
I would like to request access to the ISIL code registry for Czech libraries,
archives, and museums.
Currently, we have a dataset of 8,694 Czech institutions (libraries from ADR,
archives from ARON), but ISIL codes are missing for most of them.
Would it be possible to obtain:
1. A complete export of the ISIL code registry (CSV, Excel, or other format)
2. Instructions on how to request ISIL codes for institutions that don't have them yet
The dataset will be published as open data (CC0 license) for the research community.
Thank you for your assistance,
[Your name]
```
**Contact info**:
- Website: https://www.nkp.cz
- Email: info@nkp.cz (general inquiries)
- ISIL page: https://www.nkp.cz/sluzby/sluzby-knihovnam/isil (check for a contact address)

**Expected outcome** (rough estimates):
- ~60% chance of a response
- Possible export of 500-2,000 ISIL codes
- Coverage: 0.01% → 5-20%
### Option 2: Query ISIL.org Directly
**Website**: https://isil.org

**Search strategy**:
1. Manual search by country (Czech Republic)
2. Filter by institution type
3. Scrape results (with rate limiting)

**Expected outcome**:
- 100-500 ISIL codes
- Coverage: 0.01% → 1-5%
- Time: 2-3 hours manual work
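
For the rate-limiting part of step 3, a minimal sketch is shown below. It only spaces out iterations; the actual page fetching is left out on purpose, because the site's search interface is not described here. The helper name `rate_limited` and its parameters are assumptions, and the `sleep`/`clock` arguments exist only so the helper can be tested without real waiting.

```python
import time


def rate_limited(items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items so that consecutive yields are at least `min_interval` seconds apart."""
    last = None
    for item in items:
        if last is not None:
            remaining = min_interval - (clock() - last)
            if remaining > 0:
                sleep(remaining)  # wait out the rest of the interval
        last = clock()
        yield item


# Demo with a zero interval so it runs instantly; the page names are placeholders.
for page in rate_limited(["page-1", "page-2", "page-3"], min_interval=0.0):
    print("would fetch", page)
```

In real use, `min_interval` would be set to a polite delay (for example a second or more) and the loop body would perform the request.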
### Option 3: Generate Provisional ISIL Codes
**Format**: `CZ-[City][InstitutionCode]`

**Process**:
1. Check NK ČR ISIL format conventions
2. Generate codes for institutions without them
3. Mark as `provisional_isil: true` in metadata
4. Submit to NK ČR for official registration

**Risk**: Generated codes may conflict with existing assignments

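
Steps 2 and 3, together with the conflict risk, can be sketched as below. This is a hypothetical illustration only: the function name, the way the city and institution parts are supplied, and the example values `"Br"` and `"XYZ"` are assumptions, and the real conventions would have to come from NK ČR (step 1).

```python
def make_provisional_isil(city_code, institution_code, existing_codes):
    """Build a code in the CZ-[City][InstitutionCode] format and flag it as provisional.

    Raises ValueError if the code is already present in `existing_codes`.
    """
    code = f"CZ-{city_code}{institution_code}"
    if code in existing_codes:
        raise ValueError(f"{code} conflicts with an existing assignment")
    return {"isil": code, "provisional_isil": True}


# The one ISIL code currently in the dataset.
existing = {"CZ-PrSTK"}

# Hypothetical city and institution parts, for illustration only.
record = make_provisional_isil("Br", "XYZ", existing)
print(record)
# prints: {'isil': 'CZ-BrXYZ', 'provisional_isil': True}
```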
---
## Recommendation
**Priority order**:
1. **Email NK ČR** ⭐ (15 min, high value)
2. **Wait 1-2 weeks** for a response
3. **Query ISIL.org** (2-3 hours, medium value)
4. **Generate provisional codes** (last resort, requires NK ČR approval)

**Parallel work while waiting**:
- Generate GHCIDs using Wikidata Q-numbers (6,719 institutions)
- GHCIDs will use Q-numbers for collision resolution (not ISIL codes)
- ISIL codes can be added to GHCIDs later, when available
---
## Files
### Scripts
- **`scripts/extract_isil_from_wikidata.py`** - Wikidata ISIL extraction (complete)
- **`scripts/enrich_czech_wikidata.py`** - Wikidata enrichment (complete)
### Data
- **`data/instances/czech_unified.yaml`** - 8,694 institutions (unchanged, 1 ISIL code)
- **`data/instances/czech_unified_isil.yaml`** - Same as above (no new ISIL codes)
### Documentation
- **`CZECH_ISIL_WIKIDATA_EXTRACTION.md`** (this file)
---
## Decision Point
**Question**: Should we continue with Task 6 (contact NK ČR), or move to GHCID generation?

- **Option A**: Contact NK ČR now (requires manual email action by user)
- **Option B**: Skip to GHCID generation using Wikidata Q-numbers (automated)

**Recommendation**: **Option A** - Email NK ČR, then proceed with Option B while waiting for a response.

---
**Status**: Task 6 investigation complete. Awaiting decision on next action.