glam/data/isil/BELARUS_NEXT_SESSION.md
2025-11-19 23:25:22 +01:00

# Belarus ISIL Enrichment - Next Session Quick Start
**Status**: Phase 1 Complete (Data Collection + Sample Enrichment)
**Next Phase**: Phase 2 (Fuzzy Matching + Full Dataset Generation)
---
## What We Have
**Complete ISIL Registry** (154 institutions)
- File: `data/isil/belarus_isil_complete_dataset.md`
- Format: Markdown tables with ISIL codes + names
- Coverage: All 7 regions
**Wikidata Entities** (32 Belarusian libraries)
- 5 matched to ISIL codes
- 27 candidates for fuzzy matching
- Includes VIAF IDs, websites, coordinates
**OpenStreetMap Data** (575 library locations)
- File: `data/isil/belarus_osm_libraries.json`
- 8 with Wikidata links
- 201 with contact info (phone, email, address, hours)
**Sample LinkML Dataset** (10 enriched records)
- File: `data/instances/belarus_isil_enriched.yaml`
- Schema: heritage_custodian.yaml v0.2.1
- Data tiers: TIER_1 (ISIL) + TIER_3 (Wikidata/OSM)
---
## What We Need to Do
### Priority 1: Fuzzy Name Matching 🔴
**Goal**: Match remaining 149 ISIL institutions to Wikidata/OSM entities
**Approach**:
1. Extract institution names from ISIL registry
2. Normalize names (lowercase, remove punctuation, transliteration)
3. Compare against:
- 27 Wikidata entities without ISIL codes
- 575 OSM library entries
4. Use `rapidfuzz` library for similarity scoring
5. Threshold: >85% match confidence
6. Manual review for borderline cases (80-85%)
**Expected Matches**:
- Wikidata: ~15-20 additional matches
- OSM: ~50-80 additional matches with coordinates/contact info
**Code Template**:
```python
from rapidfuzz import fuzz
import json

# Load ISIL names (extracted from the registry; see parsing section below)
isil_names = {
    'BY-HM0001': 'Republican Scientific and Technical Library',
    'BY-BR0000': 'Brest Regional Library named after Maxim Gorky',
    # ... 152 more
}

# Load OSM/Wikidata candidates
with open('data/isil/belarus_osm_libraries.json', encoding='utf-8') as f:
    osm_data = json.load(f)
wikidata_candidates = [...]  # From previous SPARQL query

# Fuzzy match each ISIL name against the OSM entries
for isil, name in isil_names.items():
    best_match = None
    best_score = 0
    for osm_entry in osm_data['elements']:
        osm_name = osm_entry.get('tags', {}).get('name', '')
        score = fuzz.ratio(name.lower(), osm_name.lower())
        if score > best_score:
            best_score = score
            best_match = osm_entry
    if best_score > 85:
        print(f"MATCH: {isil} -> {best_match['tags']['name']} ({best_score}%)")
```
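Step 2 of the approach (normalization) is not covered by the template above. A minimal sketch of a normalizer; the Cyrillic transliteration table here is a hypothetical subset for illustration, and a real run should use a full BGN/PCGN or ISO 9 table (or a dedicated transliteration library):

```python
import re
import unicodedata

# Illustrative subset only; extend to the full Cyrillic alphabet for real use
CYRILLIC_TO_LATIN = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'и': 'i', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o',
    'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u',
}

def normalize_name(name: str) -> str:
    """Lowercase, transliterate Cyrillic, strip punctuation, collapse whitespace."""
    name = unicodedata.normalize('NFC', name).lower()
    name = ''.join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)
    name = re.sub(r'[^\w\s]', ' ', name)      # punctuation -> spaces
    return re.sub(r'\s+', ' ', name).strip()  # collapse runs of whitespace
```

Feeding normalized names into `fuzz.ratio` on both sides should make the 85% threshold far more reliable for names recorded in Belarusian/Russian in OSM but in English in the ISIL registry.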
---
### Priority 2: Full LinkML Dataset Generation 🔴
**Goal**: Convert all 154 ISIL institutions to LinkML YAML format
**Input**:
- ISIL registry (154 institutions)
- Fuzzy match results (from Priority 1)
**Output**:
- File: `data/instances/belarus_complete.yaml`
- Format: LinkML heritage_custodian.yaml v0.2.1
- Include enrichment where available
**Code Template**:
```python
from datetime import datetime, timezone

institutions = []  # Load from ISIL registry
enrichments = {}   # Load from fuzzy matching results

output_lines = []
output_lines.append("# Belarus Complete ISIL Dataset (LinkML)")
output_lines.append("# Schema: heritage_custodian.yaml v0.2.1")
output_lines.append(f"# Generated: {datetime.now(timezone.utc).isoformat()}")
output_lines.append("---\n")

for inst in institutions:
    isil = inst['isil']
    enrichment = enrichments.get(isil, {})

    # Generate LinkML record
    output_lines.append(f"- id: https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}")
    output_lines.append(f"  name: {inst['name']}")
    output_lines.append("  institution_type: LIBRARY")

    # Add locations
    output_lines.append("  locations:")
    output_lines.append(f"    - city: {inst['city']}")
    output_lines.append(f"      region: {inst['region']}")
    output_lines.append("      country: BY")
    if enrichment.get('coords'):
        lat, lon = enrichment['coords']
        output_lines.append(f"      latitude: {lat}")
        output_lines.append(f"      longitude: {lon}")

    # Add identifiers
    output_lines.append("  identifiers:")
    output_lines.append("    - identifier_scheme: ISIL")
    output_lines.append(f"      identifier_value: {isil}")
    output_lines.append(f"      identifier_url: https://isil.org/{isil}")
    if enrichment.get('wikidata'):
        wd = enrichment['wikidata']
        output_lines.append("    - identifier_scheme: Wikidata")
        output_lines.append(f"      identifier_value: {wd}")
        output_lines.append(f"      identifier_url: https://www.wikidata.org/wiki/{wd}")

    # Add provenance
    output_lines.append("  provenance:")
    output_lines.append("    data_source: CSV_REGISTRY")
    output_lines.append("    data_tier: TIER_1_AUTHORITATIVE")
    output_lines.append(f"    extraction_date: \"{datetime.now(timezone.utc).isoformat()}\"")
    output_lines.append(f"    confidence_score: {0.95 if enrichment else 0.90}")
    output_lines.append("")  # Blank line between records

# Write to file
with open('data/instances/belarus_complete.yaml', 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))
```
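An alternative worth considering: build each record as a plain dict and serialize with `yaml.safe_dump`, which handles quoting and escaping automatically (an institution name containing a colon or quote would silently break the string-concatenation template above). A minimal sketch using the same field names; the sample institution is illustrative only:

```python
from datetime import datetime, timezone
import yaml

def build_record(inst: dict, enrichment: dict) -> dict:
    """Assemble one LinkML custodian record as a plain dict."""
    isil = inst['isil']
    location = {'city': inst.get('city'), 'region': inst['region'], 'country': 'BY'}
    if enrichment.get('coords'):
        location['latitude'], location['longitude'] = enrichment['coords']
    return {
        'id': f"https://w3id.org/heritage/custodian/by/{isil.lower().replace('-', '')}",
        'name': inst['name'],
        'institution_type': 'LIBRARY',
        'locations': [location],
        'identifiers': [{
            'identifier_scheme': 'ISIL',
            'identifier_value': isil,
            'identifier_url': f'https://isil.org/{isil}',
        }],
        'provenance': {
            'data_source': 'CSV_REGISTRY',
            'data_tier': 'TIER_1_AUTHORITATIVE',
            'extraction_date': datetime.now(timezone.utc).isoformat(),
            'confidence_score': 0.95 if enrichment else 0.90,
        },
    }

# Illustrative sample record
sample = build_record(
    {'isil': 'BY-HM0000', 'name': 'National Library of Belarus',
     'city': 'Minsk', 'region': 'Minsk City'},
    {'coords': (53.931421, 27.645844)},
)
print(yaml.safe_dump([sample], allow_unicode=True, sort_keys=False))
```

Whether the dict shape above matches the schema exactly should be confirmed with `linkml-validate` against `heritage_custodian.yaml` before committing to it.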
---
### Priority 3: RDF/JSON-LD Export 🟡
**Goal**: Convert LinkML YAML to Linked Open Data formats
**Tools**: `linkml-convert` command-line tool
**Commands**:
```bash
# Install linkml if needed
pip install linkml

# Convert to RDF Turtle
linkml-convert \
    --schema schemas/heritage_custodian.yaml \
    --output data/rdf/belarus_complete.ttl \
    --input-format yaml \
    --output-format turtle \
    data/instances/belarus_complete.yaml

# Convert to JSON-LD
linkml-convert \
    --schema schemas/heritage_custodian.yaml \
    --output data/jsonld/belarus_complete.jsonld \
    --input-format yaml \
    --output-format json-ld \
    data/instances/belarus_complete.yaml
```
**Output Files**:
- `data/rdf/belarus_complete.ttl` (RDF Turtle)
- `data/jsonld/belarus_complete.jsonld` (JSON-LD)
---
## Quick Commands for Next Session
### Load existing data
```python
import json
import yaml

# Load OSM data
with open('data/isil/belarus_osm_libraries.json', 'r', encoding='utf-8') as f:
    osm_data = json.load(f)

# Load ISIL registry (needs to be parsed from markdown)
with open('data/isil/belarus_isil_complete_dataset.md', 'r', encoding='utf-8') as f:
    isil_md = f.read()

# Load sample enriched records
with open('data/instances/belarus_isil_enriched.yaml', 'r', encoding='utf-8') as f:
    sample_records = yaml.safe_load(f)
```
### Check progress
```bash
# Count files
ls -l data/isil/belarus*
ls -l data/instances/belarus*
# View sample
head -100 data/instances/belarus_isil_enriched.yaml
# Check OSM data size
wc -l data/isil/belarus_osm_libraries.json
```
---
## ISIL Registry Parsing
The ISIL names need to be extracted from the markdown file. Here's the structure:
```
### Minsk City (25 institutions)
| ISIL Code | Institution Name |
|-----------|------------------|
| BY-HM0000 | National Library of Belarus |
| BY-HM0001 | Republican Scientific and Technical Library |
...
```
**Parsing code**:
```python
import re

def parse_isil_registry(md_file):
    """Extract ISIL codes, names, and regions from the markdown registry."""
    with open(md_file, 'r', encoding='utf-8') as f:
        content = f.read()

    institutions = []
    current_region = None

    # Patterns for region headers and table rows
    region_pattern = r'### (.+) \((\d+) institutions\)'
    table_row_pattern = r'\| (BY-[A-Z]{2}\d{4}) \| (.+) \|'

    for line in content.split('\n'):
        # Check for region header
        region_match = re.match(region_pattern, line)
        if region_match:
            current_region = region_match.group(1)
            continue
        # Check for table row
        row_match = re.match(table_row_pattern, line)
        if row_match and current_region:
            institutions.append({
                'isil': row_match.group(1),
                'name': row_match.group(2).strip(),
                'region': current_region,
                'country': 'BY',
            })
    return institutions

# Usage
institutions = parse_isil_registry('data/isil/belarus_isil_complete_dataset.md')
print(f"Parsed {len(institutions)} institutions")
```
---
## Known Wikidata Matches
These 5 ISIL codes already have Wikidata enrichment:
```python
wikidata_matches = {
    'BY-HM0000': {
        'wikidata': 'Q948470',
        'viaf': '163025395',
        'website': 'https://www.nlb.by/',
        'coords': (53.931421, 27.645844),
    },
    'BY-HM0008': {
        'wikidata': 'Q2091093',
        'website': 'http://preslib.org.by/',
        'coords': (53.8960, 27.5466),
    },
    'BY-HM0005': {
        'wikidata': 'Q3918424',
        'viaf': '125518437',
        'website': 'https://csl.bas-net.by/',
        'coords': (53.9201455, 27.6000568),
    },
    'BY-MI0000': {
        'wikidata': 'Q16145114',
        'website': 'http://pushlib.org.by/',
        'coords': (53.9150869, 27.5879206),
    },
    'BY-HR0000': {
        'wikidata': 'Q13030528',
        'website': 'http://grodnolib.by/',
        'coords': (53.6806128, 23.8388116),
    },
}
```
---
## Validation Checklist
Before completing the dataset:
- [ ] All 154 ISIL institutions have LinkML records
- [ ] Schema validation passes (`linkml-validate`)
- [ ] At least 30% enrichment rate (coordinates, websites, or Wikidata)
- [ ] Provenance metadata complete for all records
- [ ] RDF/Turtle export validates
- [ ] JSON-LD export validates
- [ ] No duplicate ISIL codes
- [ ] All geographic regions represented
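Several of these checks (duplicates, region coverage, enrichment rate) can be automated over the parsed `institutions` list. A sketch, assuming the same field names as the registry parser and fuzzy-match templates above; the enrichment fields (`coords`, `website`, `wikidata`) are assumed to have been merged onto each record:

```python
from collections import Counter

EXPECTED_TOTAL = 154
EXPECTED_REGIONS = 7

def qa_report(institutions):
    """Summarize dataset-level QA checks before export."""
    codes = [i['isil'] for i in institutions]
    duplicates = sorted(c for c, n in Counter(codes).items() if n > 1)
    enriched = sum(
        1 for i in institutions
        if i.get('coords') or i.get('website') or i.get('wikidata')
    )
    return {
        'count_ok': len(institutions) == EXPECTED_TOTAL,
        'duplicates': duplicates,
        'regions_covered': len({i['region'] for i in institutions}),
        'regions_ok': len({i['region'] for i in institutions}) == EXPECTED_REGIONS,
        'enrichment_rate': enriched / len(institutions) if institutions else 0.0,
    }
```

Schema validation itself still goes through `linkml-validate`; this only covers the dataset-level items on the list.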
---
## Estimated Time
- **Fuzzy matching**: 2 hours (includes manual review)
- **Full dataset generation**: 1 hour
- **RDF/JSON-LD export**: 30 minutes
- **Validation & QA**: 30 minutes
**Total**: ~4 hours to completion
---
## Files to Create
1. `scripts/belarus_fuzzy_matcher.py` - Fuzzy matching script
2. `scripts/belarus_linkml_generator.py` - Full dataset generator
3. `data/instances/belarus_complete.yaml` - 154 LinkML records
4. `data/rdf/belarus_complete.ttl` - RDF Turtle export
5. `data/jsonld/belarus_complete.jsonld` - JSON-LD export
6. `data/isil/BELARUS_FINAL_REPORT.md` - Completion report
---
## Contact for Questions
**National Library of Belarus**
- Email: inbox@nlb.by
- Phone: (+375 17) 368 37 37
- Website: https://nlb.by/
**ISIL International Agency**
- Website: https://isil.org/
- Email: Via website contact form
---
**Last Updated**: November 18, 2025
**Session Owner**: kempersc
**Next Session**: TBD (fuzzy matching phase)