glam/data/isil/BELARUS_NEXT_SESSION.md
2025-11-19 23:25:22 +01:00

# Belarus ISIL Enrichment - Next Session Quick Start
**Status**: Phase 1 Complete (Data Collection + Sample Enrichment)
**Next Phase**: Phase 2 (Fuzzy Matching + Full Dataset Generation)
---
## What We Have
**Complete ISIL Registry** (154 institutions)
- File: `data/isil/belarus_isil_complete_dataset.md`
- Format: Markdown tables with ISIL codes + names
- Coverage: All 7 regions
**Wikidata Entities** (32 Belarusian libraries)
- 5 matched to ISIL codes
- 27 candidates for fuzzy matching
- Includes VIAF IDs, websites, coordinates
**OpenStreetMap Data** (575 library locations)
- File: `data/isil/belarus_osm_libraries.json`
- 8 with Wikidata links
- 201 with contact info (phone, email, address, hours)
**Sample LinkML Dataset** (10 enriched records)
- File: `data/instances/belarus_isil_enriched.yaml`
- Schema: heritage_custodian.yaml v0.2.1
- Data tiers: TIER_1 (ISIL) + TIER_3 (Wikidata/OSM)
---
## What We Need to Do
### Priority 1: Fuzzy Name Matching 🔴
**Goal**: Match remaining 149 ISIL institutions to Wikidata/OSM entities
**Approach**:
1. Extract institution names from ISIL registry
2. Normalize names (lowercase, remove punctuation, transliteration)
3. Compare against:
- 27 Wikidata entities without ISIL codes
- 575 OSM library entries
4. Use `rapidfuzz` library for similarity scoring
5. Threshold: >85% match confidence
6. Manual review for borderline cases (80-85%)
**Expected Matches**:
- Wikidata: ~15-20 additional matches
- OSM: ~50-80 additional matches with coordinates/contact info
**Code Template**:
```python
from rapidfuzz import fuzz
import json

# Load ISIL names (extracted from the registry; see parsing section below)
isil_names = {
    'BY-HM0001': 'Republican Scientific and Technical Library',
    'BY-BR0000': 'Brest Regional Library named after Maxim Gorky',
    # ... 152 more
}

# Load OSM/Wikidata candidates
with open('data/isil/belarus_osm_libraries.json', encoding='utf-8') as f:
    osm_data = json.load(f)
wikidata_candidates = [...]  # From previous SPARQL query

# Fuzzy match each ISIL name against the OSM entries
for isil, name in isil_names.items():
    best_match = None
    best_score = 0
    for osm_entry in osm_data['elements']:
        osm_name = osm_entry.get('tags', {}).get('name', '')
        score = fuzz.ratio(name.lower(), osm_name.lower())
        if score > best_score:
            best_score = score
            best_match = osm_entry
    if best_score > 85:
        print(f"MATCH: {isil} -> {best_match['tags']['name']} ({best_score}%)")
```
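Step 2 of the approach (normalization) is not covered by the template above. A minimal sketch of a normalizer; the Cyrillic transliteration table here is a hypothetical subset for illustration, and a real run should use a full BGN/PCGN or ISO 9 table (or a dedicated transliteration library):

```python
import re
import unicodedata

# Illustrative subset only; extend to the full Cyrillic alphabet for real use
CYRILLIC_TO_LATIN = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'и': 'i', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o',
    'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u',
}

def normalize_name(name: str) -> str:
    """Lowercase, transliterate Cyrillic, strip punctuation, collapse whitespace."""
    name = unicodedata.normalize('NFC', name).lower()
    name = ''.join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)
    name = re.sub(r'[^\w\s]', ' ', name)      # punctuation -> spaces
    return re.sub(r'\s+', ' ', name).strip()  # collapse runs of whitespace
```

Feeding normalized names into `fuzz.ratio` on both sides should make the 85% threshold far more reliable for names recorded in Belarusian/Russian in OSM but in English in the ISIL registry.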
---
### Priority 2: Full LinkML Dataset Generation 🔴
**Goal**: Convert all 154 ISIL institutions to LinkML YAML format
**Input**:
- ISIL registry (154 institutions)
- Fuzzy match results (from Priority 1)
**Output**:
- File: `data/instances/belarus_complete.yaml`
- Format: LinkML heritage_custodian.yaml v0.2.1
- Include enrichment where available
**Code Template**:
```python
from datetime import datetime, timezone

institutions = []  # Load from ISIL registry
enrichments = {}   # Load from fuzzy matching results

output_lines = []
output_lines.append("# Belarus Complete ISIL Dataset (LinkML)")
output_lines.append("# Schema: heritage_custodian.yaml v0.2.1")
output_lines.append(f"# Generated: {datetime.now(timezone.utc).isoformat()}")
output_lines.append("---\n")

for inst in institutions:
    isil = inst['isil']
    enrichment = enrichments.get(isil, {})

    # Generate LinkML record
    output_lines.append(f"- id: https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}")
    output_lines.append(f"  name: {inst['name']}")
    output_lines.append("  institution_type: LIBRARY")

    # Add locations
    output_lines.append("  locations:")
    output_lines.append(f"    - city: {inst['city']}")
    output_lines.append(f"      region: {inst['region']}")
    output_lines.append("      country: BY")
    if enrichment.get('coords'):
        lat, lon = enrichment['coords']
        output_lines.append(f"      latitude: {lat}")
        output_lines.append(f"      longitude: {lon}")

    # Add identifiers
    output_lines.append("  identifiers:")
    output_lines.append("    - identifier_scheme: ISIL")
    output_lines.append(f"      identifier_value: {isil}")
    output_lines.append(f"      identifier_url: https://isil.org/{isil}")
    if enrichment.get('wikidata'):
        wd = enrichment['wikidata']
        output_lines.append("    - identifier_scheme: Wikidata")
        output_lines.append(f"      identifier_value: {wd}")
        output_lines.append(f"      identifier_url: https://www.wikidata.org/wiki/{wd}")

    # Add provenance
    output_lines.append("  provenance:")
    output_lines.append("    data_source: CSV_REGISTRY")
    output_lines.append("    data_tier: TIER_1_AUTHORITATIVE")
    output_lines.append(f"    extraction_date: \"{datetime.now(timezone.utc).isoformat()}\"")
    output_lines.append(f"    confidence_score: {0.95 if enrichment else 0.90}")
    output_lines.append("")  # Blank line between records

# Write to file
with open('data/instances/belarus_complete.yaml', 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))
```
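An alternative worth considering: build each record as a plain dict and serialize with `yaml.safe_dump`, which handles quoting and escaping automatically (an institution name containing a colon or quote would silently break the string-concatenation template above). A minimal sketch using the same field names; the sample institution is illustrative only:

```python
from datetime import datetime, timezone
import yaml

def build_record(inst: dict, enrichment: dict) -> dict:
    """Assemble one LinkML custodian record as a plain dict."""
    isil = inst['isil']
    location = {'city': inst.get('city'), 'region': inst['region'], 'country': 'BY'}
    if enrichment.get('coords'):
        location['latitude'], location['longitude'] = enrichment['coords']
    return {
        'id': f"https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}",
        'name': inst['name'],
        'institution_type': 'LIBRARY',
        'locations': [location],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': isil,
            'identifier_url': f'https://isil.org/{isil}',
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': datetime.now(timezone.utc).isoformat(),
            'confidence_score': 0.95 if enrichment else 0.90,
        },
    }

# Illustrative sample record
sample = build_record(
    {'isil': 'BY-HM0000', 'name': 'National Library of Belarus',
     'city': 'Minsk', 'region': 'Minsk City'},
    {'coords': (53.931421, 27.645844)},
)
print(yaml.safe_dump([sample], allow_unicode=True, sort_keys=False))
```

Whether the dict shape above matches the schema exactly should be confirmed with `linkml-validate` against `heritage_custodian.yaml` before committing to it.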
---
### Priority 3: RDF/JSON-LD Export 🟡
**Goal**: Convert LinkML YAML to Linked Open Data formats
**Tools**: `linkml-convert` command-line tool
**Commands**:
```bash
# Install linkml if needed
pip install linkml

# Convert to RDF Turtle
linkml-convert \
    --schema schemas/heritage_custodian.yaml \
    --output data/rdf/belarus_complete.ttl \
    --input-format yaml \
    --output-format turtle \
    data/instances/belarus_complete.yaml

# Convert to JSON-LD
linkml-convert \
    --schema schemas/heritage_custodian.yaml \
    --output data/jsonld/belarus_complete.jsonld \
    --input-format yaml \
    --output-format json-ld \
    data/instances/belarus_complete.yaml
```
**Output Files**:
- `data/rdf/belarus_complete.ttl` (RDF Turtle)
- `data/jsonld/belarus_complete.jsonld` (JSON-LD)
---
## Quick Commands for Next Session
### Load existing data
```python
import json
import yaml

# Load OSM data
with open('data/isil/belarus_osm_libraries.json', 'r', encoding='utf-8') as f:
    osm_data = json.load(f)

# Load ISIL registry (needs to be parsed from markdown)
with open('data/isil/belarus_isil_complete_dataset.md', 'r', encoding='utf-8') as f:
    isil_md = f.read()

# Load sample enriched records
with open('data/instances/belarus_isil_enriched.yaml', 'r', encoding='utf-8') as f:
    sample_records = yaml.safe_load(f)
```
### Check progress
```bash
# Count files
ls -l data/isil/belarus*
ls -l data/instances/belarus*
# View sample
head -100 data/instances/belarus_isil_enriched.yaml
# Check OSM data size
wc -l data/isil/belarus_osm_libraries.json
```
---
## ISIL Registry Parsing
The ISIL names need to be extracted from the markdown file. Here's the structure:
```
### Minsk City (25 institutions)
| ISIL Code | Institution Name |
|-----------|------------------|
| BY-HM0000 | National Library of Belarus |
| BY-HM0001 | Republican Scientific and Technical Library |
...
```
**Parsing code**:
```python
import re

def parse_isil_registry(md_file):
    """Extract ISIL codes, names, and regions from the markdown registry."""
    with open(md_file, 'r', encoding='utf-8') as f:
        content = f.read()

    institutions = []
    current_region = None

    # Patterns for region headers and table rows
    region_pattern = r'### (.+) \((\d+) institutions\)'
    table_row_pattern = r'\| (BY-[A-Z]{2}\d{4}) \| (.+) \|'

    for line in content.split('\n'):
        # Check for region header
        region_match = re.match(region_pattern, line)
        if region_match:
            current_region = region_match.group(1)
            continue
        # Check for table row
        row_match = re.match(table_row_pattern, line)
        if row_match and current_region:
            institutions.append({
                'isil': row_match.group(1),
                'name': row_match.group(2).strip(),
                'region': current_region,
                'country': 'BY',
            })
    return institutions

# Usage
institutions = parse_isil_registry('data/isil/belarus_isil_complete_dataset.md')
print(f"Parsed {len(institutions)} institutions")
```
---
## Known Wikidata Matches
These 5 ISIL codes already have Wikidata enrichment:
```python
wikidata_matches = {
    'BY-HM0000': {
        'wikidata': 'Q948470',
        'viaf': '163025395',
        'website': 'https://www.nlb.by/',
        'coords': (53.931421, 27.645844),
    },
    'BY-HM0008': {
        'wikidata': 'Q2091093',
        'website': 'http://preslib.org.by/',
        'coords': (53.8960, 27.5466),
    },
    'BY-HM0005': {
        'wikidata': 'Q3918424',
        'viaf': '125518437',
        'website': 'https://csl.bas-net.by/',
        'coords': (53.9201455, 27.6000568),
    },
    'BY-MI0000': {
        'wikidata': 'Q16145114',
        'website': 'http://pushlib.org.by/',
        'coords': (53.9150869, 27.5879206),
    },
    'BY-HR0000': {
        'wikidata': 'Q13030528',
        'website': 'http://grodnolib.by/',
        'coords': (53.6806128, 23.8388116),
    },
}
```
---
## Validation Checklist
Before completing the dataset:
- [ ] All 154 ISIL institutions have LinkML records
- [ ] Schema validation passes (`linkml-validate`)
- [ ] At least 30% enrichment rate (coordinates, websites, or Wikidata)
- [ ] Provenance metadata complete for all records
- [ ] RDF/Turtle export validates
- [ ] JSON-LD export validates
- [ ] No duplicate ISIL codes
- [ ] All geographic regions represented
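Several of these checks (duplicates, region coverage, enrichment rate) can be automated over the parsed `institutions` list. A sketch, assuming the same field names as the registry parser and fuzzy-match templates above; the enrichment fields (`coords`, `website`, `wikidata`) are assumed to have been merged onto each record:

```python
from collections import Counter

EXPECTED_TOTAL = 154
EXPECTED_REGIONS = 7

def qa_report(institutions):
    """Summarize dataset-level QA checks before export."""
    codes = [i['isil'] for i in institutions]
    duplicates = sorted(c for c, n in Counter(codes).items() if n > 1)
    enriched = sum(
        1 for i in institutions
        if i.get('coords') or i.get('website') or i.get('wikidata')
    )
    return {
        'count_ok': len(institutions) == EXPECTED_TOTAL,
        'duplicates': duplicates,
        'regions_covered': len({i['region'] for i in institutions}),
        'regions_ok': len({i['region'] for i in institutions}) == EXPECTED_REGIONS,
        'enrichment_rate': enriched / len(institutions) if institutions else 0.0,
    }
```

Schema validation itself still goes through `linkml-validate`; this only covers the dataset-level items on the list.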
---
## Estimated Time
- **Fuzzy matching**: 2 hours (includes manual review)
- **Full dataset generation**: 1 hour
- **RDF/JSON-LD export**: 30 minutes
- **Validation & QA**: 30 minutes
**Total**: ~4 hours to completion
---
## Files to Create
1. `scripts/belarus_fuzzy_matcher.py` - Fuzzy matching script
2. `scripts/belarus_linkml_generator.py` - Full dataset generator
3. `data/instances/belarus_complete.yaml` - 154 LinkML records
4. `data/rdf/belarus_complete.ttl` - RDF Turtle export
5. `data/jsonld/belarus_complete.jsonld` - JSON-LD export
6. `data/isil/BELARUS_FINAL_REPORT.md` - Completion report
---
## Contact for Questions
**National Library of Belarus**
- Email: inbox@nlb.by
- Phone: (+375 17) 368 37 37
- Website: https://nlb.by/
**ISIL International Agency**
- Website: https://isil.org/
- Email: Via website contact form
---
**Last Updated**: November 18, 2025
**Session Owner**: kempersc
**Next Session**: TBD (fuzzy matching phase)