# Belarus ISIL Enrichment - Next Session Quick Start

**Status:** Phase 1 Complete (Data Collection + Sample Enrichment)
**Next Phase:** Phase 2 (Fuzzy Matching + Full Dataset Generation)
## What We Have

### ✅ Complete ISIL Registry (154 institutions)
- File: `data/isil/belarus_isil_complete_dataset.md`
- Format: Markdown tables with ISIL codes + names
- Coverage: All 7 regions

### ✅ Wikidata Entities (32 Belarusian libraries)
- 5 matched to ISIL codes
- 27 candidates for fuzzy matching
- Includes VIAF IDs, websites, coordinates

### ✅ OpenStreetMap Data (575 library locations)
- File: `data/isil/belarus_osm_libraries.json`
- 8 with Wikidata links
- 201 with contact info (phone, email, address, hours)

### ✅ Sample LinkML Dataset (10 enriched records)
- File: `data/instances/belarus_isil_enriched.yaml`
- Schema: heritage_custodian.yaml v0.2.1
- Data tiers: TIER_1 (ISIL) + TIER_3 (Wikidata/OSM)
## What We Need to Do

### Priority 1: Fuzzy Name Matching 🔴

**Goal:** Match the remaining 149 ISIL institutions to Wikidata/OSM entities

**Approach:**
- Extract institution names from the ISIL registry
- Normalize names (lowercase, remove punctuation, transliteration)
- Compare against:
  - 27 Wikidata entities without ISIL codes
  - 575 OSM library entries
- Use the `rapidfuzz` library for similarity scoring
- Threshold: >85% match confidence
- Manually review borderline cases (80-85%)

**Expected Matches:**
- Wikidata: ~15-20 additional matches
- OSM: ~50-80 additional matches with coordinates/contact info
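The normalization step in the approach above can be sketched as follows. Note the `CYR_TO_LAT` table is a deliberately partial illustration; a real run would need a full Belarusian/Russian transliteration table (e.g. based on GOST 7.79), and the exact rules are an assumption, not a decided convention.

```python
import re
import unicodedata

# Partial Cyrillic-to-Latin map, for illustration only
CYR_TO_LAT = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'і': 'i', 'и': 'i', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n',
    'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u',
}

def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, transliterate, drop punctuation."""
    name = unicodedata.normalize('NFKD', name.lower())
    # Remove combining marks left over from NFKD decomposition
    name = ''.join(ch for ch in name if not unicodedata.combining(ch))
    name = ''.join(CYR_TO_LAT.get(ch, ch) for ch in name)
    name = re.sub(r'[^\w\s]', ' ', name)    # punctuation -> spaces
    return re.sub(r'\s+', ' ', name).strip()
```

Feeding both sides of the comparison through `normalize_name()` before scoring should make the 85% threshold far more meaningful for transliterated names.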
**Code Template:**

```python
from rapidfuzz import fuzz
import json

# Load ISIL names
isil_names = {
    'BY-HM0001': 'Republican Scientific and Technical Library',
    'BY-BR0000': 'Brest Regional Library named after Maxim Gorky',
    # ... 152 more
}

# Load OSM/Wikidata candidates
with open('data/isil/belarus_osm_libraries.json', encoding='utf-8') as f:
    osm_data = json.load(f)
wikidata_candidates = [...]  # From previous SPARQL query

# Fuzzy match each ISIL name against every OSM entry
for isil, name in isil_names.items():
    best_match = None
    best_score = 0
    for osm_entry in osm_data['elements']:
        osm_name = osm_entry['tags'].get('name', '')
        score = fuzz.ratio(name.lower(), osm_name.lower())
        if score > best_score:
            best_score = score
            best_match = osm_entry
    if best_score > 85:
        print(f"MATCH: {isil} -> {best_match['tags']['name']} ({best_score}%)")
```
### Priority 2: Full LinkML Dataset Generation 🔴

**Goal:** Convert all 154 ISIL institutions to LinkML YAML format

**Input:**
- ISIL registry (154 institutions)
- Fuzzy match results (from Priority 1)

**Output:**
- File: `data/instances/belarus_complete.yaml`
- Format: LinkML heritage_custodian.yaml v0.2.1
- Include enrichment where available
**Code Template:**

```python
from datetime import datetime, timezone

institutions = []   # Load from ISIL registry
enrichments = {}    # Load from fuzzy matching results

output_lines = []
output_lines.append("# Belarus Complete ISIL Dataset (LinkML)")
output_lines.append("# Schema: heritage_custodian.yaml v0.2.1")
output_lines.append(f"# Generated: {datetime.now(timezone.utc).isoformat()}")
output_lines.append("---\n")

for inst in institutions:
    isil = inst['isil']
    enrichment = enrichments.get(isil, {})

    # Generate LinkML record
    output_lines.append(f"- id: https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}")
    output_lines.append(f"  name: {inst['name']}")
    output_lines.append("  institution_type: LIBRARY")

    # Add locations
    output_lines.append("  locations:")
    output_lines.append(f"    - city: {inst['city']}")
    output_lines.append(f"      region: {inst['region']}")
    output_lines.append("      country: BY")
    if enrichment.get('coords'):
        lat, lon = enrichment['coords']
        output_lines.append(f"      latitude: {lat}")
        output_lines.append(f"      longitude: {lon}")

    # Add identifiers
    output_lines.append("  identifiers:")
    output_lines.append("    - identifier_scheme: ISIL")
    output_lines.append(f"      identifier_value: {isil}")
    output_lines.append(f"      identifier_url: https://isil.org/{isil}")
    if enrichment.get('wikidata'):
        wd = enrichment['wikidata']
        output_lines.append("    - identifier_scheme: Wikidata")
        output_lines.append(f"      identifier_value: {wd}")
        output_lines.append(f"      identifier_url: https://www.wikidata.org/wiki/{wd}")

    # Add provenance
    output_lines.append("  provenance:")
    output_lines.append("    data_source: CSV_REGISTRY")
    output_lines.append("    data_tier: TIER_1_AUTHORITATIVE")
    output_lines.append(f"    extraction_date: \"{datetime.now(timezone.utc).isoformat()}\"")
    output_lines.append(f"    confidence_score: {0.95 if enrichment else 0.90}")
    output_lines.append("")  # Blank line between records

# Write to file
with open('data/instances/belarus_complete.yaml', 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))
```
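One caveat with the string-concatenation template: institution names containing `:`, `#`, or quotes would produce invalid YAML. A safer alternative is to build plain dicts and let PyYAML handle escaping. This is a sketch assuming records shaped like the template above, not a verified dump of the heritage_custodian.yaml schema:

```python
import yaml

# Hypothetical record built from one ISIL registry row; field names
# follow the template above, values are illustrative
record = {
    'id': 'https://w3id.org/heritage/custodian/by/byhm0001',
    'name': 'Republican Scientific and Technical Library',
    'institution_type': 'LIBRARY',
    'locations': [{'city': 'Minsk', 'region': 'Minsk City', 'country': 'BY'}],
}

# safe_dump quotes problematic scalars automatically, which the
# hand-built f-string approach does not
text = yaml.safe_dump([record], allow_unicode=True, sort_keys=False)
print(text)
```

`allow_unicode=True` keeps Cyrillic names readable instead of escaping them, and `sort_keys=False` preserves the field order defined above.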
### Priority 3: RDF/JSON-LD Export 🟡

**Goal:** Convert the LinkML YAML to Linked Open Data formats

**Tools:** `linkml-convert` command-line tool

**Commands:**

```bash
# Install linkml if needed
pip install linkml

# Convert to RDF Turtle
linkml-convert \
  --schema schemas/heritage_custodian.yaml \
  --output data/rdf/belarus_complete.ttl \
  --input-format yaml \
  --output-format turtle \
  data/instances/belarus_complete.yaml

# Convert to JSON-LD
linkml-convert \
  --schema schemas/heritage_custodian.yaml \
  --output data/jsonld/belarus_complete.jsonld \
  --input-format yaml \
  --output-format json-ld \
  data/instances/belarus_complete.yaml
```

**Output Files:**
- `data/rdf/belarus_complete.ttl` (RDF Turtle)
- `data/jsonld/belarus_complete.jsonld` (JSON-LD)
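A quick sanity check on the JSON-LD export can be done with the standard library alone. The inline sample stands in for `data/jsonld/belarus_complete.jsonld`; in practice, `json.load()` the exported file. Whether `linkml-convert` nests records under a top-level `@graph` key depends on the schema, so treat that shape as an assumption:

```python
import json

# Stand-in for the exported JSON-LD document (assumed @graph layout)
sample = """
{
  "@context": {"schema": "http://schema.org/"},
  "@graph": [
    {"@id": "https://w3id.org/heritage/custodian/by/byhm0000"},
    {"@id": "https://w3id.org/heritage/custodian/by/byhm0001"}
  ]
}
"""

doc = json.loads(sample)
ids = [node['@id'] for node in doc.get('@graph', [])]
assert len(ids) == len(set(ids)), "duplicate @id values in export"
print(f"{len(ids)} records, all @id values unique")
```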
## Quick Commands for Next Session

### Load existing data

```python
import json
import yaml

# Load OSM data
with open('data/isil/belarus_osm_libraries.json', 'r', encoding='utf-8') as f:
    osm_data = json.load(f)

# Load ISIL registry (needs markdown parsing; see below)
with open('data/isil/belarus_isil_complete_dataset.md', 'r', encoding='utf-8') as f:
    isil_md = f.read()

# Load sample enriched records
with open('data/instances/belarus_isil_enriched.yaml', 'r', encoding='utf-8') as f:
    sample_records = yaml.safe_load(f)
```

### Check progress

```bash
# Count files
ls -l data/isil/belarus*
ls -l data/instances/belarus*

# View sample
head -100 data/instances/belarus_isil_enriched.yaml

# Check OSM data size
wc -l data/isil/belarus_osm_libraries.json
```
## ISIL Registry Parsing

The ISIL names need to be extracted from the markdown file. Here's the structure:

```markdown
### Minsk City (25 institutions)

| ISIL Code | Institution Name |
|-----------|------------------|
| BY-HM0000 | National Library of Belarus |
| BY-HM0001 | Republican Scientific and Technical Library |
...
```
Parsing code:

```python
import re

def parse_isil_registry(md_file):
    """Extract ISIL codes and names from markdown."""
    with open(md_file, 'r', encoding='utf-8') as f:
        content = f.read()

    institutions = []
    current_region = None

    # Region headers look like "### Minsk City (25 institutions)"
    region_pattern = r'### (.+) \((\d+) institutions\)'
    # Table rows look like "| BY-HM0000 | National Library of Belarus |"
    table_row_pattern = r'\| (BY-[A-Z]{2}\d{4}) \| (.+) \|'

    for line in content.split('\n'):
        # Check for region header
        region_match = re.match(region_pattern, line)
        if region_match:
            current_region = region_match.group(1)
            continue
        # Check for table row
        row_match = re.match(table_row_pattern, line)
        if row_match and current_region:
            institutions.append({
                'isil': row_match.group(1),
                'name': row_match.group(2).strip(),
                'region': current_region,
                'country': 'BY',
            })
    return institutions

# Usage
institutions = parse_isil_registry('data/isil/belarus_isil_complete_dataset.md')
print(f"Parsed {len(institutions)} institutions")
```
## Known Wikidata Matches

These 5 ISIL codes already have Wikidata enrichment:

```python
wikidata_matches = {
    'BY-HM0000': {
        'wikidata': 'Q948470',
        'viaf': '163025395',
        'website': 'https://www.nlb.by/',
        'coords': (53.931421, 27.645844),
    },
    'BY-HM0008': {
        'wikidata': 'Q2091093',
        'website': 'http://preslib.org.by/',
        'coords': (53.8960, 27.5466),
    },
    'BY-HM0005': {
        'wikidata': 'Q3918424',
        'viaf': '125518437',
        'website': 'https://csl.bas-net.by/',
        'coords': (53.9201455, 27.6000568),
    },
    'BY-MI0000': {
        'wikidata': 'Q16145114',
        'website': 'http://pushlib.org.by/',
        'coords': (53.9150869, 27.5879206),
    },
    'BY-HR0000': {
        'wikidata': 'Q13030528',
        'website': 'http://grodnolib.by/',
        'coords': (53.6806128, 23.8388116),
    },
}
```
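These confirmed matches should seed the `enrichments` dict that the Priority 2 generator reads, with fuzzy-match output merged in on top without overwriting confirmed values. A merge sketch (the seed is trimmed to two keys here, and `fuzzy_results` is an illustrative stand-in for Priority 1 output):

```python
# Seed with confirmed matches (use the full wikidata_matches dict above)
enrichments = {
    'BY-HM0000': {'wikidata': 'Q948470', 'coords': (53.931421, 27.645844)},
    # ... remaining four entries from wikidata_matches
}

# Stand-in for Priority 1 output; values are illustrative only
fuzzy_results = {
    'BY-HM0000': {'coords': (0.0, 0.0), 'website': 'http://example.com'},
}

for isil, data in fuzzy_results.items():
    # Confirmed values (right-hand dict) win on key collisions
    enrichments[isil] = {**data, **enrichments.get(isil, {})}
```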
## Validation Checklist

Before completing the dataset:

- [ ] All 154 ISIL institutions have LinkML records
- [ ] Schema validation passes (`linkml-validate`)
- [ ] At least 30% enrichment rate (coordinates, websites, or Wikidata)
- [ ] Provenance metadata complete for all records
- [ ] RDF/Turtle export validates
- [ ] JSON-LD export validates
- [ ] No duplicate ISIL codes
- [ ] All geographic regions represented
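The last two checklist items are easy to automate over the parsed registry. A minimal QA pass, where `institutions` is the output of `parse_isil_registry()` above (the three-entry list and region labels here are stand-ins for the real 154):

```python
from collections import Counter

# Stand-in for parse_isil_registry() output
institutions = [
    {'isil': 'BY-HM0000', 'region': 'Minsk City'},
    {'isil': 'BY-HM0001', 'region': 'Minsk City'},
    {'isil': 'BY-BR0000', 'region': 'Brest Region'},
]

# Checklist: no duplicate ISIL codes
codes = Counter(inst['isil'] for inst in institutions)
duplicates = [isil for isil, n in codes.items() if n > 1]
assert not duplicates, f"Duplicate ISIL codes: {duplicates}"

# Checklist: all geographic regions represented (expect 7 on the real data)
regions = {inst['region'] for inst in institutions}
print(f"{len(institutions)} records, {len(regions)} regions, no duplicate ISILs")
```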
## Estimated Time

- Fuzzy matching: 2 hours (includes manual review)
- Full dataset generation: 1 hour
- RDF/JSON-LD export: 30 minutes
- Validation & QA: 30 minutes

**Total:** ~4 hours to completion
## Files to Create

- `scripts/belarus_fuzzy_matcher.py` - Fuzzy matching script
- `scripts/belarus_linkml_generator.py` - Full dataset generator
- `data/instances/belarus_complete.yaml` - 154 LinkML records
- `data/rdf/belarus_complete.ttl` - RDF Turtle export
- `data/jsonld/belarus_complete.jsonld` - JSON-LD export
- `data/isil/BELARUS_FINAL_REPORT.md` - Completion report
## Contact for Questions

**National Library of Belarus**
- Email: inbox@nlb.by
- Phone: (+375 17) 368 37 37
- Website: https://nlb.by/

**ISIL International Agency**
- Website: https://isil.org/
- Email: via website contact form

---

**Last Updated:** November 18, 2025
**Session Owner:** kempersc
**Next Session:** TBD (fuzzy matching phase)