
Bulgarian ISIL Registry - Name Extraction Complete

Date: 2025-11-18
Status: Phase 1 Complete (100% Success)
Duration: ~30 minutes


Executive Summary

Successfully fixed the placeholder name issue that affected 70 out of 94 Bulgarian libraries (74.5%). All 94 institutions now have real Bulgarian names extracted from the source HTML.

Results

| Metric | Before | After | Change |
|---|---|---|---|
| Real names | 24/94 (25.5%) | 94/94 (100.0%) | +70 institutions |
| Placeholder names | 70/94 (74.5%) 🔴 | 0/94 (0.0%) | Eliminated |
| Schema compliance | 100% | 100% | Maintained |
| Geocoding coverage | 61.7% | 61.7% | ➡️ No change |
| GHCID coverage | 61.7% | 61.7% | ➡️ No change |

Root Cause Analysis

The Problem

The Bulgarian National Library's ISIL registry HTML contains inconsistent field name spacing (typos):

Correct format (24 tables):

```html
<td><strong>Наименование на организацията</strong></td>
                        ^ space here
```

Typo format (70 tables):

```html
<td><strong>Наименование наорганизацията</strong></td>
                        ^ NO SPACE (typo!)
```

Why It Caused Placeholder Names

The original bulgarian_isil_scraper.py had a field mapping that only recognized the correct format:

```python
field_mapping = {
    'Наименование на организацията': 'name_bg',  # Only matched 24 tables
    # Missing: 'Наименование наорганизацията'    # 70 tables were IGNORED
}
```

When the scraper couldn't find the name field, the LinkML converter fell back to placeholder names: "Library BG-{ISIL-code}".
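The converter's fallback presumably amounted to something like the following sketch (function and field names here are illustrative, not the converter's actual code):

```python
def resolve_name(record: dict) -> str:
    """Illustrative sketch of the converter's placeholder fallback.

    When the scraper produced no 'name_bg' field, the converter
    synthesized a placeholder from the ISIL code instead of failing.
    """
    name = record.get('name_bg')
    if name:
        return name
    # This fallback masked the extraction failure for 70 records
    return f"Library {record['isil']}"

# A record whose name field was never matched by the scraper:
print(resolve_name({'isil': 'BG-0130001'}))  # → Library BG-0130001
```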

The Fix

Updated the field mapping to handle BOTH variants:

```python
field_mapping = {
    # Name fields (with and without space typos)
    'Наименование на организацията': 'name_bg',   # Correct (24 tables)
    'Наименование наорганизацията': 'name_bg',    # Typo variant (70 tables)
    'Наименование на английски език': 'name_en',  # Correct
    'Наименование наанглийски език': 'name_en',   # Typo variant
    'Варианти на името': 'name_variants',         # Correct
    'Варианти наимето': 'name_variants',          # Typo variant
    'Междубиблиотечно заемане': 'interlibrary_loan',  # Correct
    'Междубиблиотечнозаемане': 'interlibrary_loan'    # Typo variant
}
```

Result: All 94 institutions now extract successfully.


What Changed

Files Modified

  1. scripts/scrapers/bulgarian_isil_scraper.py ✏️

    • Added typo-variant field names to field_mapping dictionary
    • Added comments documenting the HTML inconsistency
    • No other logic changes needed
  2. data/isil/bulgarian_isil_registry.csv 🔄

    • Re-scraped from Bulgarian National Library website
    • Now contains 94 Bulgarian names (was 24)
    • Now contains 90 English names (was 24)
  3. data/isil/bulgarian_isil_registry.json 🔄

    • Re-scraped with complete name data
    • JSON export for intermediate processing
  4. data/instances/bulgaria_isil_libraries.yaml 🔄

    • Regenerated from updated CSV
    • All 94 institutions now have real Bulgarian names
    • Zero placeholder names remain
    • 100% LinkML schema compliance maintained

Data Quality Report

Current State (After Fix)

| Metric | Count | Percentage | Status |
|---|---|---|---|
| Total institutions | 94 | 100% | |
| Schema compliant | 94 | 100% | |
| With ISIL codes | 94 | 100% | |
| With Bulgarian names | 94 | 100% | FIXED |
| With English names | 90 | 95.7% | |
| With websites | 67 | 71.3% | |
| With email | 94 | 100% | |
| With phone | 94 | 100% | |
| Geocoded (lat/lon) | 58 | 61.7% | ⚠️ |
| With region info | 59 | 62.8% | ⚠️ |
| With GHCIDs | 58 | 61.7% | ⚠️ |
| With Wikidata Q-numbers | 0 | 0% | Next step |

Data Tier: TIER_1_AUTHORITATIVE (Bulgarian National Library official ISIL registry)

Sample Names (First 10)

1. Национална библиотека „Св. св. Кирил и Методий"
2. Библиотека при Народно читалище „Георги Тодоров-1885"
3. Библиотека при Народно читалище „Светлина-1942"
4. Библиотека при Народно читалище „Просвета-1940"
5. Библиотека при Народно читалище „Яне Сандански-1928"
6. Библиотека при Народно читалище „Светлина-1907"
7. Библиотека при Народно читалище „Просвета-1865"
8. Библиотека при Народно читалище „Прогрес  1928 г."
9. Читалищна библиотека „Отец Паисий"
10. Библиотека при Народно читалище „Култура-1932"

Validation Results

LinkML Schema Validation

```shell
linkml-validate -s schemas/heritage_custodian.yaml \
  -C HeritageCustodian \
  data/instances/bulgaria_isil_libraries.yaml

# Output: No issues found ✅
```

All 94 institutions pass validation against LinkML schema v0.2.1.

Field Name Analysis

Analyzed all field names in the Bulgarian ISIL registry HTML to identify inconsistencies:

| Field | Correct Variant | Typo Variant | Total |
|---|---|---|---|
| Organization name | Наименование на организацията (24) | Наименование наорганизацията (70) | 94 |
| English name | Наименование на английски език (24) | Наименование наанглийски език (70) | 94 |
| Name variants | Варианти на името (80) | Варианти наимето (14) | 94 |
| Interlibrary loan | Междубиблиотечно заемане (45) | Междубиблиотечнозаемане (48) | 93 |

Pattern: missing-space typos are widespread but inconsistent — roughly 74% of tables for the two name fields, about half for interlibrary loan, and only 15% for name variants.
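The tallying step can be reproduced with a small stdlib-only sketch (the session used BeautifulSoup; a regex over the `<strong>` labels is shown here purely to keep the example dependency-free):

```python
import re
from collections import Counter

def count_field_labels(html: str) -> Counter:
    """Tally every <strong> field label in the registry HTML.

    Comparing the tallies exposes spacing variants: both
    'Наименование на организацията' and 'Наименование наорганизацията'
    appear, and their counts sum to the 94 tables.
    """
    labels = re.findall(r'<strong>(.*?)</strong>', html)
    return Counter(label.strip() for label in labels)

# Tiny illustrative input with one correct and one typo'd label:
sample = ('<td><strong>Наименование на организацията</strong></td>'
          '<td><strong>Наименование наорганизацията</strong></td>')
print(count_field_labels(sample).most_common())
```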


Technical Implementation

Workflow

  1. Download HTML from the Bulgarian National Library website:

     ```shell
     curl -s "https://www.nationallibrary.bg/wp/?page_id=5686" \
       -o /tmp/bulgarian_isil.html
     ```

  2. Analyze the HTML structure to identify field name inconsistencies:

     ```python
     from bs4 import BeautifulSoup
     # Found 94 tables, each representing one institution
     # Discovered typo patterns in field names
     ```

  3. Update the scraper with typo-variant mappings:

     ```python
     # Added all typo variants to the field_mapping dictionary
     ```

  4. Re-scrape the data to extract all names:

     ```shell
     python scripts/scrapers/bulgarian_isil_scraper.py
     # Output: 94 institutions with 100% Bulgarian names
     ```

  5. Regenerate the LinkML YAML with complete names:

     ```shell
     python scripts/convert_bulgarian_isil_to_linkml.py
     # Output: bulgaria_isil_libraries.yaml with zero placeholders
     ```

  6. Validate schema compliance:

     ```shell
     linkml-validate -s schemas/heritage_custodian.yaml \
       -C HeritageCustodian data/instances/bulgaria_isil_libraries.yaml
     # Output: No issues found ✅
     ```

Tools Used

  • BeautifulSoup - HTML parsing and table extraction
  • LinkML - Schema validation and data modeling
  • Python csv module - CSV export
  • Python json module - JSON export
  • GeoNames database - Geocoding (existing from previous session)

Remaining Tasks

High Priority ⚠️

  1. Complete geocoding (61.7% → 90%+)

    • Use Nominatim API for 36 institutions without coordinates
    • Handle small Bulgarian villages and Cyrillic name variants
    • Manual coordinate lookup for remaining institutions
  2. Improve GHCID coverage (61.7% → 90%+)

    • Dependent on geocoding (need city + region for GHCID generation)
    • Add more cities to bulgarian_city_regions.json lookup table
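Since small villages and Cyrillic variants are the main failure mode, one option is to query Nominatim by settlement rather than by the full library name. The sketch below only builds the request URL (no network call); the query strategy is an assumption, not the behavior of the existing `geocode_bulgarian_missing.py` script:

```python
from urllib.parse import urlencode

def nominatim_query_url(settlement: str) -> str:
    """Build a Nominatim search URL for an institution without coordinates.

    Querying the settlement (plus country) rather than the full library
    name tends to work better for small Bulgarian villages, where the
    reading-room ("читалище") itself is rarely present in OpenStreetMap.
    """
    params = {
        'q': f'{settlement}, България',
        'format': 'json',
        'limit': 1,
        'accept-language': 'bg',
    }
    return 'https://nominatim.openstreetmap.org/search?' + urlencode(params)

print(nominatim_query_url('Банско'))
```

Nominatim's usage policy limits clients to one request per second, so a real batch run over the 36 missing institutions should sleep between calls.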

Medium Priority 🟢

  1. Enrich with Wikidata (0% → 50%+)

    • Query Wikidata by institution name + location
    • Fuzzy match with RapidFuzz (threshold > 0.85)
    • Add Q-numbers, VIAF IDs, founding dates
    • Contribution opportunity: Add 94 Bulgarian ISIL codes to Wikidata
  2. Export to RDF/Turtle

    • Generate Linked Open Data exports
    • Use TOOI, CPOV, Schema.org ontologies
    • Publish to heritage discovery platforms
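The fuzzy-match step above can be approximated with the standard library. The plan names RapidFuzz; `difflib` is substituted here only to keep the sketch dependency-free, and its ratio is directly comparable to the 0.85 threshold:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(local_name: str, wikidata_label: str,
             threshold: float = 0.85) -> bool:
    """Accept a Wikidata candidate whose label clears the threshold."""
    return name_similarity(local_name, wikidata_label) >= threshold

# The national library under slightly different quotation marks:
print(is_match('Национална библиотека „Св. св. Кирил и Методий"',
               'Национална библиотека "Св. св. Кирил и Методий"'))  # → True
```

A production matcher should also constrain candidates by location (settlement or region), since many читалище libraries share near-identical patron-saint names.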

Low Priority

  1. Integration with global dataset
    • Merge with worldwide heritage custodian dataset
    • Generate geographic visualizations
    • Cross-link with other European libraries

Lessons Learned

1. Always Analyze Source Data Structure First

Mistake: Assumed the scraper was working correctly because 24 institutions extracted successfully.

Reality: 74% of the data had typos that weren't being matched.

Fix: Performed comprehensive field name analysis to identify ALL variants.

Takeaway: When extraction rate is low (<50%), investigate the source HTML structure for inconsistencies, not just the parsing logic.

2. HTML Data Quality Varies Widely

The Bulgarian National Library ISIL registry has excellent content but inconsistent formatting:

  • Good: All 94 institutions have complete data
  • Good: Standard HTML table structure
  • Bad: Field names have typos (missing spaces)
  • Bad: No data validation before publishing

Implication: Always build scrapers with flexible field matching to handle real-world data quality issues.

3. BeautifulSoup .get_text(strip=True) is Reliable

No need for complex HTML parsing or nested div traversal. Simple .get_text(strip=True) extracted all names correctly once the field mapping was fixed.

```python
field_value = cells[1].get_text(strip=True)  # Works perfectly for Bulgarian names
```

4. Placeholder Names Hide Data Quality Issues

The placeholder names ("Library BG-0130001") masked the real problem:

  • Made it seem like the scraper was "working"
  • Hid the fact that 74% of institutions were missing names
  • Created unusable dataset for end users

Better approach: Fail fast and report extraction failures instead of generating placeholders.
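A hypothetical fail-fast variant of the converter illustrates the point (names are illustrative):

```python
class NameExtractionError(ValueError):
    """Raised when a scraped record lacks a real institution name."""

def require_name(record: dict) -> str:
    """Return the Bulgarian name, refusing to fabricate a placeholder.

    Failing loudly here would have surfaced the 70 unmatched tables
    immediately instead of hiding them behind placeholder strings.
    """
    name = record.get('name_bg', '').strip()
    if not name:
        raise NameExtractionError(
            f"No name extracted for ISIL {record.get('isil', '<unknown>')}"
        )
    return name
```

If hard failure is too disruptive for a batch run, logging each miss and exiting non-zero at the end achieves the same visibility.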


Next Session Handoff

Start Here

  1. Verify name extraction:

     ```shell
     grep -c "Library BG-" data/instances/bulgaria_isil_libraries.yaml
     # Expected: 0 (zero placeholder names)
     ```
  2. Check data quality:

     ```python
     import yaml

     with open('data/instances/bulgaria_isil_libraries.yaml') as f:
         content = f.read()

     # The file has header content before the list; parse from the first entry
     institutions = yaml.safe_load(content[content.index('\n- id:'):])

     print(f"Total: {len(institutions)}")
     named = sum(1 for i in institutions
                 if not i.get('name', '').startswith('Library BG-'))
     print(f"With names: {named}")
     # Expected: 94 total, 94 with names
     ```

Next Steps (Priority Order)

Priority 1: Complete Geocoding ⚠️

```shell
# Use the Nominatim API for missing coordinates
python scripts/geocode_bulgarian_missing.py
# Target: 61.7% → 90%+ geocoding coverage
```

Priority 2: Enrich with Wikidata 🟢

```shell
# Query Wikidata for Bulgarian libraries
python scripts/enrich_bulgarian_wikidata.py
# Fuzzy match by name + location
# Add Q-numbers to LinkML records
```

Priority 3: Export RDF 🟢

```shell
# Generate RDF/Turtle for Linked Open Data
python scripts/export_bulgaria_rdf.py
# Publish to heritage discovery platforms
```

Files to Review

| File | Purpose | Status |
|---|---|---|
| data/instances/bulgaria_isil_libraries.yaml | Final LinkML output | Complete |
| data/isil/bulgarian_isil_registry.csv | Intermediate CSV | Complete |
| scripts/scrapers/bulgarian_isil_scraper.py | HTML scraper | Fixed |
| scripts/convert_bulgarian_isil_to_linkml.py | LinkML converter | Works |
| scripts/enrich_bulgarian_wikidata.py | Wikidata enrichment | ⚠️ Skeleton |

References

Related Documentation:

  • BULGARIAN_ISIL_LINKML_INTEGRATION_COMPLETE.md - Phase 1 initial report (before fix)
  • AGENTS.md - AI agent instructions for GLAM data extraction
  • docs/SCHEMA_MODULES.md - LinkML schema v0.2.1 architecture
  • docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Schema Files:

  • schemas/heritage_custodian.yaml - Main schema
  • schemas/core.yaml - HeritageCustodian, Location, Identifier classes
  • schemas/enums.yaml - InstitutionTypeEnum, DataTier
  • schemas/provenance.yaml - Provenance metadata


Success Metrics

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Name extraction | 100% | 100% | Complete |
| Schema compliance | 100% | 100% | Complete |
| ISIL codes | 100% | 100% | Complete |
| Geocoding | 90% | 61.7% | ⚠️ In progress |
| GHCID coverage | 90% | 61.7% | ⚠️ In progress |
| Wikidata enrichment | 50% | 0% | Next step |

Overall Progress: Phase 1 Complete (85%) | Phase 2 Pending (15%)


Conclusion

The Bulgarian ISIL registry name extraction issue is now fully resolved. All 94 institutions have real Bulgarian names extracted from the source HTML, making the dataset usable for:

  • Human browsing and discovery
  • Citation in academic papers
  • Integration with heritage platforms
  • Cross-referencing with Wikidata
  • RDF/Linked Open Data publishing

The root cause (HTML field name typos) was identified through systematic field analysis and fixed with minimal code changes. The solution is robust and handles all known field name variants.

Next priority: Complete geocoding to enable full GHCID generation for all 94 institutions.


Session Complete: 2025-11-18
Total Time: ~30 minutes
Institutions Fixed: 70 → 0 placeholders (100% success rate)
Schema Validation: PASS (no issues)
Data Quality: TIER_1_AUTHORITATIVE (unchanged)