glam/CZECH_ISIL_WIKIDATA_EXTRACTION.md
2025-11-21 22:12:33 +01:00

Czech ISIL Investigation - Wikidata Extraction Results

Date: 2025-11-20
Task: Priority 2, Task 6 - Extract ISIL codes from Wikidata
Status: COMPLETE (minimal results)


Executive Summary

Wikidata ISIL Coverage for Czech Institutions: MINIMAL

We queried Wikidata for ISIL codes (property P791) on all 6,719 Czech institutions in our dataset that have Q-numbers.

Results:

  • ISIL codes found: 2 (0.03%)
  • In our dataset: 1 (Národní technická knihovna - CZ-PrSTK)
  • Not in our dataset: 1 (Jihočeská vědecká knihovna - OCLC-CVK)

Coverage: 1 of 8,694 institutions (0.01%) - essentially zero

Conclusion: Wikidata is NOT a viable source for Czech ISIL codes. Must contact NK ČR directly.


Detailed Findings

SPARQL Query Results

Query: Czech heritage institutions (museums, libraries, archives, galleries) with ISIL codes

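SELECT ?item ?itemLabel ?isil WHERE {
  VALUES ?type { wd:Q33506 wd:Q7075 wd:Q166118 wd:Q1007870 }  # museum, library, archive, art gallery
  ?item wdt:P31/wdt:P279* ?type .  # instance of the type or any of its subclasses
  ?item wdt:P17 wd:Q213 .          # country: Czech Republic
  ?item wdt:P791 ?isil .           # ISIL code is required here (not OPTIONAL)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "cs,en" }
}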

Results: 2 institutions

| Institution | Wikidata | ISIL Code | In Dataset |
| --- | --- | --- | --- |
| Národní technická knihovna | Q630893 | CZ-PrSTK | Yes |
| Jihočeská vědecká knihovna | Q20057526 | OCLC-CVK | No |
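The two codes returned by the query do not share a prefix: CZ-PrSTK carries the Czech country prefix, while OCLC-CVK does not. A minimal sketch of how such values could be split and flagged, assuming only the general `prefix-identifier` shape of an ISIL code. The function names `split_isil` and `has_czech_prefix` are hypothetical and not part of the project's scripts.

```python
def split_isil(code: str) -> tuple[str, str]:
    """Split a code into (prefix, unit identifier) at the first hyphen."""
    prefix, sep, unit = code.partition("-")
    if not sep or not prefix or not unit:
        raise ValueError(f"not a prefix-identifier code: {code!r}")
    return prefix, unit


def has_czech_prefix(code: str) -> bool:
    """True when the code uses the 'CZ' country prefix."""
    return split_isil(code)[0] == "CZ"


found = ["CZ-PrSTK", "OCLC-CVK"]  # the two values returned by the query
for code in found:
    label = "Czech prefix" if has_czech_prefix(code) else "other prefix"
    print(code, label)
```

A check like this only classifies the prefix; it does not confirm that a code is registered with any agency.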

Why So Few ISIL Codes in Wikidata?

Hypothesis 1: ISIL codes rarely added to Wikidata

  • ISIL property (P791) has existed in Wikidata since 2013, but is sparsely populated for Czech institutions
  • Low priority for Wikidata editors (not visible to end users)
  • Most Wikidata focus on: website, location, founding date

Hypothesis 2: Czech institutions lack ISIL codes entirely

  • ISIL system in Czech Republic may be underdeveloped
  • NK ČR may not have assigned codes to most institutions
  • Small libraries/archives may not qualify for ISIL codes

Hypothesis 3: Data entry gap

  • NK ČR maintains an ISIL registry, but it has not been published to Wikidata
  • No bulk import project exists (unlike VIAF, which has better Wikidata coverage)

Comparison to Original Expectation

Expected: 306 ISIL codes (from initial enrichment script comment)
Actual: 2 ISIL codes
Discrepancy: 99.3% fewer than expected

Root cause: The original enrichment script's SPARQL query wrapped the ISIL pattern in OPTIONAL { ?item wdt:P791 ?isil }, so it returned rows even when ?isil was unbound. The figure of 306 was the count of VIAF IDs, not ISIL codes.
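The effect of OPTIONAL can be illustrated on the JSON shape that a SPARQL endpoint returns, where an unbound variable is simply absent from a binding. This is a minimal sketch with made-up rows: the Q-numbers, VIAF values, and the ISIL value are placeholders rather than project data, and `count_bound` is a hypothetical helper.

```python
# Each binding is a dict keyed by variable name; unbound variables are
# omitted, as in the SPARQL 1.1 JSON results format.
bindings = [
    {"item": {"value": "Q1"}, "viaf": {"value": "111"}},  # ?isil unbound
    {"item": {"value": "Q2"}, "viaf": {"value": "222"}},  # ?isil unbound
    {
        "item": {"value": "Q3"},
        "viaf": {"value": "333"},
        "isil": {"value": "CZ-PLACEHOLDER"},  # the only row with ?isil bound
    },
]


def count_bound(rows: list[dict], var: str) -> int:
    """Count rows in which the given variable is actually bound."""
    return sum(1 for row in rows if var in row)


rows_returned = len(bindings)              # what a naive row count reports
isil_found = count_bound(bindings, "isil")  # what should be reported
print(rows_returned, isil_found)
```

Counting rows measures how many items matched the required patterns; counting bound values measures how many of them carry the optional identifier.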


Script Performance

Batch Query Efficiency

Processed: 6,719 Q-numbers in 135 batches (50 Q-numbers per batch)
Time: ~2 minutes
Rate: ~56 Q-numbers/second
API calls: 135 SPARQL queries
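The batch arithmetic above can be reproduced with a short sketch. The query shape built by `build_query` is an assumption (the actual script's query is not shown here), and the Q-numbers are generated placeholders.

```python
from typing import Iterator

BATCH_SIZE = 50  # batch size reported above


def batches(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query(qids: list[str]) -> str:
    """Build one SPARQL query that asks for P791 on a batch of items."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?isil WHERE { "
        f"VALUES ?item {{ {values} }} "
        "?item wdt:P791 ?isil . }"
    )


qids = [f"Q{n}" for n in range(1, 6720)]  # 6,719 placeholder Q-numbers
queries = [build_query(batch) for batch in batches(qids)]
print(len(queries))
```

Keeping each VALUES clause to 50 items keeps the query short enough to send as a URL parameter, which is the limit the batching is meant to avoid.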

Code Quality

Batch processing: Avoids URL length limits
Error handling: Graceful fallback on API errors
Provenance tracking: Records enrichment history
Statistics: Clear before/after reporting


Current Czech Dataset Status

Identifier Coverage (After Wikidata ISIL Extraction)

| Identifier | Count | Coverage |
| --- | --- | --- |
| Wikidata Q-numbers | 6,719 | 77.3% |
| GPS coordinates | 6,623 | 76.2% |
| VIAF IDs | 306 | 3.5% |
| ISIL codes | 1 | 0.01% (CRITICAL GAP) |
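The coverage figures follow directly from the counts and the dataset size of 8,694 institutions. A small sketch that recomputes them (the `coverage` helper is illustrative only):

```python
TOTAL_INSTITUTIONS = 8694

counts = {
    "Wikidata Q-numbers": 6719,
    "GPS coordinates": 6623,
    "VIAF IDs": 306,
    "ISIL codes": 1,
}


def coverage(count: int, total: int = TOTAL_INSTITUTIONS) -> float:
    """Return coverage as a percentage of the whole dataset."""
    return 100 * count / total


for name, count in counts.items():
    print(f"{name}: {coverage(count):.2f}%")
```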

Implication

ISIL codes are the bottleneck for:

  • GHCID generation (requires ISIL or Q-number for collision resolution)
  • Library community interoperability
  • Cross-system citation
  • Cataloging standards compliance

Without ISIL codes, Czech institutions are:

  • Not discoverable via ISIL.org search
  • Not linkable to international library systems
  • Missing standardized identifiers for archival citation

Next Steps

Option 1: Contact NK ČR

Contact: Národní knihovna České republiky (National Library of the Czech Republic)
Department: ISIL agency (likely part of cataloging/standards division)

Email draft:

Subject: Žádost o přístup k registru ISIL kódů pro výzkumný projekt

Dobrý den,

Jsem výzkumný pracovník na projektu mapování paměťových institucí po celém světě.
Chtěl bych požádat o přístup k registru ISIL kódů pro české knihovny, archivy a muzea.

V současné době máme dataset 8 694 českých institucí (knihovny z ADR, archivy z ARON), 
ale chybí nám ISIL kódy pro většinu z nich. 

Bylo by možné získat:
1. Úplný export registru ISIL kódů (CSV, Excel, nebo jiný formát)
2. Pokyny, jak požádat o ISIL kódy pro instituce, které je ještě nemají

Dataset bude publikován jako otevřená data (CC0 licence) pro výzkumnou komunitu.

Děkuji za pomoc,
[Vaše jméno]

---

Subject: Request for access to ISIL code registry for research project

Dear Sir/Madam,

I am a researcher working on a project to map heritage institutions globally. 
I would like to request access to the ISIL code registry for Czech libraries, 
archives, and museums.

Currently, we have a dataset of 8,694 Czech institutions (libraries from ADR, 
archives from ARON), but ISIL codes are missing for most of them.

Would it be possible to obtain:
1. A complete export of the ISIL code registry (CSV, Excel, or other format)
2. Instructions on how to request ISIL codes for institutions that don't have them yet

The dataset will be published as open data (CC0 license) for the research community.

Thank you for your assistance,
[Your name]

Contact info:

Expected outcome:

  • Estimated 60% chance of a response
  • Possible export of 500-2,000 ISIL codes
  • Coverage: 0.01% → 5-20%

Option 2: Query ISIL.org Directly

Website: https://isil.org

Search strategy:

  1. Manual search by country (Czech Republic)
  2. Filter by institution type
  3. Scrape results (with rate limiting)
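Step 3 depends on spacing requests out. The sketch below shows one way to enforce a minimum interval between requests. It is generic: it makes no assumption about how ISIL.org pages are structured, the `fetch_all` name and its parameters are hypothetical, and the URLs are placeholders. The HTTP call is injected so the pacing logic can be exercised without network access.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, str]:
    """Fetch each URL, leaving at least `min_interval` seconds between requests."""
    results: dict[str, str] = {}
    last_request: Optional[float] = None
    for url in urls:
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)  # pause only for the remaining part of the interval
        last_request = clock()
        results[url] = fetch(url)
    return results


# Demo with a stub instead of a real HTTP request (placeholder URLs).
pages = ["https://example.org/page/1", "https://example.org/page/2"]
stub = lambda url: f"<html>{url}</html>"
print(len(fetch_all(pages, fetch=stub, min_interval=0.0)))
```

In real use, `fetch` would wrap an HTTP client call, and the interval should respect whatever usage terms the site publishes.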

Expected outcome:

  • 100-500 ISIL codes
  • Coverage: 0.01% → 1-5%
  • Time: 2-3 hours manual work

Option 3: Generate Provisional ISIL Codes

Format: CZ-[City][InstitutionCode]

Process:

  1. Check NK ČR ISIL format conventions
  2. Generate codes for institutions without them
  3. Mark as provisional_isil: true in metadata
  4. Submit to NK ČR for official registration

Risk: Codes may conflict with existing assignments
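The process and its collision risk can be sketched as follows. The abbreviation rule used here (two-letter city prefix plus an upper-case institution code) is an assumption inferred from the single known example, CZ-PrSTK; the real NK ČR conventions from step 1 would replace it. The function names are hypothetical.

```python
import re
import unicodedata


def ascii_slug(text: str) -> str:
    """Strip diacritics and non-alphanumerics, e.g. 'Plzeň' -> 'Plzen'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Za-z0-9]", "", stripped)


def provisional_isil(city: str, institution_code: str, existing: set[str]) -> dict:
    """Build a provisional code and refuse it if it is already assigned."""
    city_part = ascii_slug(city)[:2].title()        # assumed city abbreviation
    code_part = ascii_slug(institution_code).upper()
    if not city_part or not code_part:
        raise ValueError("city and institution code must not be empty")
    candidate = f"CZ-{city_part}{code_part}"
    if candidate in existing:
        raise ValueError(f"{candidate} conflicts with an existing assignment")
    return {"isil": candidate, "provisional_isil": True}


known = {"CZ-PrSTK"}  # the one ISIL code currently in the dataset
print(provisional_isil("Plzeň", "xyz", known))
```

Checking against known codes only catches conflicts with codes already in the dataset, which is why the final step of submitting to NK ČR remains necessary.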


Recommendation

Priority order:

  1. Email NK ČR (15 min, high value)
  2. Wait 1-2 weeks for response
  3. Query ISIL.org (2-3 hours, medium value)
  4. Generate provisional codes (last resort, requires NK ČR approval)

Parallel work while waiting:

  • Generate GHCIDs using Wikidata Q-numbers (6,719 institutions)
  • GHCID will use Q-numbers as collision resolution (not ISIL codes)
  • Can add ISIL codes to GHCID later when available

Files

Scripts

  • scripts/extract_isil_from_wikidata.py - Wikidata ISIL extraction (complete)
  • scripts/enrich_czech_wikidata.py - Wikidata enrichment (complete)

Data

  • data/instances/czech_unified.yaml - 8,694 institutions (unchanged, 1 ISIL code)
  • data/instances/czech_unified_isil.yaml - Same as above (no new ISIL codes)

Documentation

  • CZECH_ISIL_WIKIDATA_EXTRACTION.md (this file)

Decision Point

Question: Should we continue with Task 6 (contact NK ČR), or move to GHCID generation?

Option A: Contact NK ČR now (requires manual email action by user)
Option B: Skip to GHCID generation using Wikidata Q-numbers (automated)

Recommendation: Option A - Email NK ČR, then proceed with Option B while waiting for response.


Status: Task 6 investigation complete. Awaiting decision on next action.