# Belarus ISIL Enrichment - Next Session Quick Start

**Status**: Phase 1 Complete (Data Collection + Sample Enrichment)
**Next Phase**: Phase 2 (Fuzzy Matching + Full Dataset Generation)
---
## What We Have

✅ **Complete ISIL Registry** (154 institutions)
- File: `data/isil/belarus_isil_complete_dataset.md`
- Format: Markdown tables with ISIL codes + names
- Coverage: All 7 regions

✅ **Wikidata Entities** (32 Belarusian libraries)
- 5 matched to ISIL codes
- 27 candidates for fuzzy matching
- Includes VIAF IDs, websites, coordinates

✅ **OpenStreetMap Data** (575 library locations)
- File: `data/isil/belarus_osm_libraries.json`
- 8 with Wikidata links
- 201 with contact info (phone, email, address, hours)

✅ **Sample LinkML Dataset** (10 enriched records)
- File: `data/instances/belarus_isil_enriched.yaml`
- Schema: heritage_custodian.yaml v0.2.1
- Data tiers: TIER_1 (ISIL) + TIER_3 (Wikidata/OSM)

---
## What We Need to Do

### Priority 1: Fuzzy Name Matching 🔴

**Goal**: Match the remaining 149 ISIL institutions to Wikidata/OSM entities

**Approach**:
1. Extract institution names from the ISIL registry
2. Normalize names (lowercase, remove punctuation, transliterate)
3. Compare against:
   - 27 Wikidata entities without ISIL codes
   - 575 OSM library entries
4. Use the `rapidfuzz` library for similarity scoring
5. Threshold: >85% match confidence
6. Manual review for borderline cases (80-85%)

**Expected Matches**:
- Wikidata: ~15-20 additional matches
- OSM: ~50-80 additional matches with coordinates/contact info
**Code Template**:

```python
import json

from rapidfuzz import fuzz

# Load ISIL names (parsed from the registry; see "ISIL Registry Parsing" below)
isil_names = {
    'BY-HM0001': 'Republican Scientific and Technical Library',
    'BY-BR0000': 'Brest Regional Library named after Maxim Gorky',
    # ... 152 more
}

# Load OSM/Wikidata candidates
with open('data/isil/belarus_osm_libraries.json', encoding='utf-8') as f:
    osm_data = json.load(f)
wikidata_candidates = [...]  # From previous SPARQL query

# Fuzzy match each ISIL name against every OSM entry
for isil, name in isil_names.items():
    best_match = None
    best_score = 0

    for osm_entry in osm_data['elements']:
        osm_name = osm_entry['tags'].get('name', '')
        score = fuzz.ratio(name.lower(), osm_name.lower())

        if score > best_score:
            best_score = score
            best_match = osm_entry

    if best_score > 85:
        print(f"MATCH: {isil} -> {best_match['tags']['name']} ({best_score:.1f}%)")
```
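Step 2 of the approach (name normalization) can be sketched as below. `normalize_name` and the `CYR2LAT` table are hypothetical helpers, not part of any existing script, and the transliteration table is deliberately minimal; a dedicated library (e.g. `transliterate`) would cover more cases and transliteration conventions:

```python
import re
import unicodedata

# Minimal Cyrillic -> Latin transliteration table (sketch only; includes the
# Belarusian-specific letters і and ў)
CYR2LAT = str.maketrans({
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', 'ё': 'e',
    'ж': 'zh', 'з': 'z', 'і': 'i', 'и': 'i', 'й': 'i', 'к': 'k', 'л': 'l',
    'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't',
    'у': 'u', 'ў': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts', 'ч': 'ch',
    'ш': 'sh', 'щ': 'shch', 'ы': 'y', 'ь': '', 'э': 'e', 'ю': 'yu', 'я': 'ya',
})

def normalize_name(name: str) -> str:
    """Lowercase, transliterate Cyrillic, strip punctuation and extra spaces."""
    name = unicodedata.normalize('NFC', name).lower()
    name = name.translate(CYR2LAT)
    name = re.sub(r'[^a-z0-9 ]+', ' ', name)   # drop punctuation
    return re.sub(r'\s+', ' ', name).strip()   # collapse whitespace
```

Normalizing both sides before `fuzz.ratio` makes scores comparable across the Latin-script ISIL registry and the mostly Cyrillic OSM `name` tags.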

---

### Priority 2: Full LinkML Dataset Generation 🔴

**Goal**: Convert all 154 ISIL institutions to LinkML YAML format

**Input**:
- ISIL registry (154 institutions)
- Fuzzy match results (from Priority 1)

**Output**:
- File: `data/instances/belarus_complete.yaml`
- Format: LinkML heritage_custodian.yaml v0.2.1
- Include enrichment where available
**Code Template**:

```python
from datetime import datetime, timezone

institutions = []  # Load from the ISIL registry
enrichments = {}   # Load from fuzzy matching results

output_lines = []
output_lines.append("# Belarus Complete ISIL Dataset (LinkML)")
output_lines.append("# Schema: heritage_custodian.yaml v0.2.1")
output_lines.append(f"# Generated: {datetime.now(timezone.utc).isoformat()}")
output_lines.append("---\n")

for inst in institutions:
    isil = inst['isil']
    enrichment = enrichments.get(isil, {})

    # Generate LinkML record
    output_lines.append(f"- id: https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}")
    output_lines.append(f"  name: {inst['name']}")
    output_lines.append("  institution_type: LIBRARY")

    # Add locations
    output_lines.append("  locations:")
    output_lines.append(f"    - city: {inst['city']}")
    output_lines.append(f"      region: {inst['region']}")
    output_lines.append("      country: BY")

    if enrichment.get('coords'):
        lat, lon = enrichment['coords']
        output_lines.append(f"      latitude: {lat}")
        output_lines.append(f"      longitude: {lon}")

    # Add identifiers
    output_lines.append("  identifiers:")
    output_lines.append("    - identifier_scheme: ISIL")
    output_lines.append(f"      identifier_value: {isil}")
    output_lines.append(f"      identifier_url: https://isil.org/{isil}")

    if enrichment.get('wikidata'):
        wd = enrichment['wikidata']
        output_lines.append("    - identifier_scheme: Wikidata")
        output_lines.append(f"      identifier_value: {wd}")
        output_lines.append(f"      identifier_url: https://www.wikidata.org/wiki/{wd}")

    # Add provenance
    output_lines.append("  provenance:")
    output_lines.append("    data_source: CSV_REGISTRY")
    output_lines.append("    data_tier: TIER_1_AUTHORITATIVE")
    output_lines.append(f"    extraction_date: \"{datetime.now(timezone.utc).isoformat()}\"")
    output_lines.append(f"    confidence_score: {0.95 if enrichment else 0.90}")

    output_lines.append("")  # Blank line between records

# Write to file
with open('data/instances/belarus_complete.yaml', 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))
```
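One caveat with building YAML by string concatenation: an institution name containing `:`, `#`, or quotes would produce invalid or misparsed YAML. A cheap safeguard, relying on the fact that JSON string literals are valid YAML scalars, is to quote every name through `json.dumps`. The `yaml_str` helper below is a sketch, not part of the existing generator:

```python
import json

def yaml_str(value: str) -> str:
    """Quote a string so it is always a safe YAML scalar.

    JSON string literals are valid YAML, so json.dumps handles names
    containing colons, quotes, or leading/trailing spaces.
    """
    return json.dumps(value, ensure_ascii=False)

# e.g. inside the generation loop:
# output_lines.append(f"  name: {yaml_str(inst['name'])}")
```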

---

### Priority 3: RDF/JSON-LD Export 🟡

**Goal**: Convert LinkML YAML to Linked Open Data formats

**Tools**: `linkml-convert` command-line tool

**Commands**:

```bash
# Install linkml if needed
pip install linkml

# Convert to RDF Turtle
linkml-convert \
  --schema schemas/heritage_custodian.yaml \
  --output data/rdf/belarus_complete.ttl \
  --input-format yaml \
  --output-format turtle \
  data/instances/belarus_complete.yaml

# Convert to JSON-LD
linkml-convert \
  --schema schemas/heritage_custodian.yaml \
  --output data/jsonld/belarus_complete.jsonld \
  --input-format yaml \
  --output-format json-ld \
  data/instances/belarus_complete.yaml
```

**Output Files**:
- `data/rdf/belarus_complete.ttl` (RDF Turtle)
- `data/jsonld/belarus_complete.jsonld` (JSON-LD)

---
## Quick Commands for Next Session

### Load existing data

```python
import json

import yaml

# Load OSM data
with open('data/isil/belarus_osm_libraries.json', 'r', encoding='utf-8') as f:
    osm_data = json.load(f)

# Load ISIL registry (needs markdown parsing; see "ISIL Registry Parsing")
with open('data/isil/belarus_isil_complete_dataset.md', 'r', encoding='utf-8') as f:
    isil_md = f.read()

# Load sample enriched records
with open('data/instances/belarus_isil_enriched.yaml', 'r', encoding='utf-8') as f:
    sample_records = yaml.safe_load(f)
```
### Check progress

```bash
# Count files
ls -l data/isil/belarus*
ls -l data/instances/belarus*

# View sample
head -100 data/instances/belarus_isil_enriched.yaml

# Check OSM data size
wc -l data/isil/belarus_osm_libraries.json
```

---
## ISIL Registry Parsing

The ISIL names need to be extracted from the markdown file. Here's the structure:

```
### Minsk City (25 institutions)

| ISIL Code | Institution Name |
|-----------|------------------|
| BY-HM0000 | National Library of Belarus |
| BY-HM0001 | Republican Scientific and Technical Library |
...
```

**Parsing code**:

```python
import re

def parse_isil_registry(md_file):
    """Extract ISIL codes and names from markdown."""
    with open(md_file, 'r', encoding='utf-8') as f:
        content = f.read()

    institutions = []
    current_region = None

    # Patterns for region headers and table rows
    region_pattern = r'### (.+) \((\d+) institutions\)'
    table_row_pattern = r'\| (BY-[A-Z]{2}\d{4}) \| (.+) \|'

    for line in content.split('\n'):
        # Check for region header
        region_match = re.match(region_pattern, line)
        if region_match:
            current_region = region_match.group(1)
            continue

        # Check for table row
        row_match = re.match(table_row_pattern, line)
        if row_match and current_region:
            isil_code = row_match.group(1)
            name = row_match.group(2).strip()

            institutions.append({
                'isil': isil_code,
                'name': name,
                'region': current_region,
                'country': 'BY'
            })

    return institutions

# Usage
institutions = parse_isil_registry('data/isil/belarus_isil_complete_dataset.md')
print(f"Parsed {len(institutions)} institutions")
```

---

## Known Wikidata Matches

These 5 ISIL codes already have Wikidata enrichment:

```python
wikidata_matches = {
    'BY-HM0000': {
        'wikidata': 'Q948470',
        'viaf': '163025395',
        'website': 'https://www.nlb.by/',
        'coords': (53.931421, 27.645844)
    },
    'BY-HM0008': {
        'wikidata': 'Q2091093',
        'website': 'http://preslib.org.by/',
        'coords': (53.8960, 27.5466)
    },
    'BY-HM0005': {
        'wikidata': 'Q3918424',
        'viaf': '125518437',
        'website': 'https://csl.bas-net.by/',
        'coords': (53.9201455, 27.6000568)
    },
    'BY-MI0000': {
        'wikidata': 'Q16145114',
        'website': 'http://pushlib.org.by/',
        'coords': (53.9150869, 27.5879206)
    },
    'BY-HR0000': {
        'wikidata': 'Q13030528',
        'website': 'http://grodnolib.by/',
        'coords': (53.6806128, 23.8388116)
    }
}
```

---

## Validation Checklist

Before completing the dataset:

- [ ] All 154 ISIL institutions have LinkML records
- [ ] Schema validation passes (`linkml-validate`)
- [ ] At least 30% enrichment rate (coordinates, websites, or Wikidata)
- [ ] Provenance metadata complete for all records
- [ ] RDF/Turtle export validates
- [ ] JSON-LD export validates
- [ ] No duplicate ISIL codes
- [ ] All geographic regions represented
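The duplicate-ISIL check can be automated with a few lines of stdlib Python. `find_duplicate_isils` is a sketch; it assumes the `institutions` list of dicts produced by `parse_isil_registry`:

```python
from collections import Counter

def find_duplicate_isils(institutions):
    """Return any ISIL codes that appear more than once, sorted."""
    counts = Counter(inst['isil'] for inst in institutions)
    return sorted(code for code, n in counts.items() if n > 1)

# Usage:
# dupes = find_duplicate_isils(institutions)
# assert not dupes, f"Duplicate ISIL codes: {dupes}"
```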

---

## Estimated Time

- **Fuzzy matching**: 2 hours (includes manual review)
- **Full dataset generation**: 1 hour
- **RDF/JSON-LD export**: 30 minutes
- **Validation & QA**: 30 minutes

**Total**: ~4 hours to completion

---

## Files to Create

1. `scripts/belarus_fuzzy_matcher.py` - Fuzzy matching script
2. `scripts/belarus_linkml_generator.py` - Full dataset generator
3. `data/instances/belarus_complete.yaml` - 154 LinkML records
4. `data/rdf/belarus_complete.ttl` - RDF Turtle export
5. `data/jsonld/belarus_complete.jsonld` - JSON-LD export
6. `data/isil/BELARUS_FINAL_REPORT.md` - Completion report
---

## Contact for Questions

**National Library of Belarus**
- Email: inbox@nlb.by
- Phone: (+375 17) 368 37 37
- Website: https://nlb.by/

**ISIL International Agency**
- Website: https://isil.org/
- Email: Via website contact form

---

**Last Updated**: November 18, 2025
**Session Owner**: kempersc
**Next Session**: TBD (fuzzy matching phase)