
Bulgarian ISIL Registry - Name Extraction Complete

Date: 2025-11-18
Status: Phase 1 Complete (100% Success)
Duration: ~30 minutes


Executive Summary

Successfully fixed the placeholder name issue that affected 70 out of 94 Bulgarian libraries (74.5%). All 94 institutions now have real Bulgarian names extracted from the source HTML.

Results

| Metric | Before | After | Change |
|---|---|---|---|
| Real names | 24/94 (25.5%) | 94/94 (100.0%) | +70 institutions |
| Placeholder names | 70/94 (74.5%) 🔴 | 0/94 (0.0%) | Eliminated |
| Schema compliance | 100% | 100% | Maintained |
| Geocoding coverage | 61.7% | 61.7% | ➡️ No change |
| GHCID coverage | 61.7% | 61.7% | ➡️ No change |

Root Cause Analysis

The Problem

The Bulgarian National Library's ISIL registry HTML contains inconsistent field name spacing (typos):

Correct format (24 tables):

```html
<td><strong>Наименование на организацията</strong></td>
                        ^ space here
```

Typo format (70 tables):

```html
<td><strong>Наименование наорганизацията</strong></td>
                        ^ NO SPACE (typo!)
```

Why It Caused Placeholder Names

The original bulgarian_isil_scraper.py had a field mapping that only recognized the correct format:

```python
field_mapping = {
    'Наименование на организацията': 'name_bg',  # Only matched 24 tables
    # Missing: 'Наименование наорганизацията'    # 70 tables were IGNORED
}
```

When the scraper couldn't find the name field, the LinkML converter fell back to placeholder names: "Library BG-{ISIL-code}".
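The converter's fallback presumably amounted to something like the following sketch (function and field names here are illustrative, not the converter's actual code):

```python
def resolve_name(record: dict) -> str:
    """Illustrative sketch of the converter's placeholder fallback.

    When the scraper produced no 'name_bg' field, the converter
    synthesized a placeholder from the ISIL code instead of failing.
    """
    name = record.get('name_bg')
    if name:
        return name
    # This fallback masked the extraction failure for 70 records
    return f"Library {record['isil']}"

# A record whose name field was never matched by the scraper:
print(resolve_name({'isil': 'BG-0130001'}))  # → Library BG-0130001
```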

The Fix

Updated the field mapping to handle BOTH variants:

```python
field_mapping = {
    # Name fields (with and without space typos)
    'Наименование на организацията': 'name_bg',   # Correct (24 tables)
    'Наименование наорганизацията': 'name_bg',    # Typo variant (70 tables)
    'Наименование на английски език': 'name_en',  # Correct
    'Наименование наанглийски език': 'name_en',   # Typo variant
    'Варианти на името': 'name_variants',         # Correct
    'Варианти наимето': 'name_variants',          # Typo variant
    'Междубиблиотечно заемане': 'interlibrary_loan',  # Correct
    'Междубиблиотечнозаемане': 'interlibrary_loan'    # Typo variant
}
```

Result: All 94 institutions now extract successfully.


What Changed

Files Modified

  1. scripts/scrapers/bulgarian_isil_scraper.py ✏️

    • Added typo-variant field names to field_mapping dictionary
    • Added comments documenting the HTML inconsistency
    • No other logic changes needed
  2. data/isil/bulgarian_isil_registry.csv 🔄

    • Re-scraped from Bulgarian National Library website
    • Now contains 94 Bulgarian names (was 24)
    • Now contains 90 English names (was 24)
  3. data/isil/bulgarian_isil_registry.json 🔄

    • Re-scraped with complete name data
    • JSON export for intermediate processing
  4. data/instances/bulgaria_isil_libraries.yaml 🔄

    • Regenerated from updated CSV
    • All 94 institutions now have real Bulgarian names
    • Zero placeholder names remain
    • 100% LinkML schema compliance maintained

Data Quality Report

Current State (After Fix)

| Metric | Count | Percentage | Status |
|---|---|---|---|
| Total institutions | 94 | 100% | |
| Schema compliant | 94 | 100% | |
| With ISIL codes | 94 | 100% | |
| With Bulgarian names | 94 | 100% | FIXED |
| With English names | 90 | 95.7% | |
| With websites | 67 | 71.3% | |
| With email | 94 | 100% | |
| With phone | 94 | 100% | |
| Geocoded (lat/lon) | 58 | 61.7% | ⚠️ |
| With region info | 59 | 62.8% | ⚠️ |
| With GHCIDs | 58 | 61.7% | ⚠️ |
| With Wikidata Q-numbers | 0 | 0% | Next step |

Data Tier: TIER_1_AUTHORITATIVE (Bulgarian National Library official ISIL registry)

Sample Names (First 10)

1. Национална библиотека „Св. св. Кирил и Методий"
2. Библиотека при Народно читалище „Георги Тодоров-1885"
3. Библиотека при Народно читалище „Светлина-1942"
4. Библиотека при Народно читалище „Просвета-1940"
5. Библиотека при Народно читалище „Яне Сандански-1928"
6. Библиотека при Народно читалище „Светлина-1907"
7. Библиотека при Народно читалище „Просвета-1865"
8. Библиотека при Народно читалище „Прогрес  1928 г."
9. Читалищна библиотека „Отец Паисий"
10. Библиотека при Народно читалище „Култура-1932"

Validation Results

LinkML Schema Validation

```shell
linkml-validate -s schemas/heritage_custodian.yaml \
  -C HeritageCustodian \
  data/instances/bulgaria_isil_libraries.yaml

# Output: No issues found ✅
```

All 94 institutions pass validation against LinkML schema v0.2.1.

Field Name Analysis

Analyzed all field names in the Bulgarian ISIL registry HTML to identify inconsistencies:

| Field | Correct Variant | Typo Variant | Total |
|---|---|---|---|
| Organization name | Наименование на организацията (24) | Наименование наорганизацията (70) | 94 |
| English name | Наименование на английски език (24) | Наименование наанглийски език (70) | 94 |
| Name variants | Варианти на името (80) | Варианти наимето (14) | 94 |
| Interlibrary loan | Междубиблиотечно заемане (45) | Междубиблиотечнозаемане (48) | 93 |

Pattern: missing-space typos are widespread but inconsistent — roughly 74% of tables for the two name fields, about half for interlibrary loan, and only 15% for name variants.
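The tallying step can be reproduced with a small stdlib-only sketch (the session used BeautifulSoup; a regex over the `<strong>` labels is shown here purely to keep the example dependency-free):

```python
import re
from collections import Counter

def count_field_labels(html: str) -> Counter:
    """Tally every <strong> field label in the registry HTML.

    Comparing the tallies exposes spacing variants: both
    'Наименование на организацията' and 'Наименование наорганизацията'
    appear, and their counts sum to the 94 tables.
    """
    labels = re.findall(r'<strong>(.*?)</strong>', html)
    return Counter(label.strip() for label in labels)

# Tiny illustrative input with one correct and one typo'd label:
sample = ('<td><strong>Наименование на организацията</strong></td>'
          '<td><strong>Наименование наорганизацията</strong></td>')
print(count_field_labels(sample).most_common())
```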


Technical Implementation

Workflow

  1. Download HTML from the Bulgarian National Library website:

     ```shell
     curl -s "https://www.nationallibrary.bg/wp/?page_id=5686" \
       -o /tmp/bulgarian_isil.html
     ```

  2. Analyze the HTML structure to identify field name inconsistencies:

     ```python
     from bs4 import BeautifulSoup
     # Found 94 tables, each representing one institution
     # Discovered typo patterns in field names
     ```

  3. Update the scraper with typo-variant mappings:

     ```python
     # Added all typo variants to the field_mapping dictionary
     ```

  4. Re-scrape the data to extract all names:

     ```shell
     python scripts/scrapers/bulgarian_isil_scraper.py
     # Output: 94 institutions with 100% Bulgarian names
     ```

  5. Regenerate the LinkML YAML with complete names:

     ```shell
     python scripts/convert_bulgarian_isil_to_linkml.py
     # Output: bulgaria_isil_libraries.yaml with zero placeholders
     ```

  6. Validate schema compliance:

     ```shell
     linkml-validate -s schemas/heritage_custodian.yaml \
       -C HeritageCustodian data/instances/bulgaria_isil_libraries.yaml
     # Output: No issues found ✅
     ```

Tools Used

  • BeautifulSoup - HTML parsing and table extraction
  • LinkML - Schema validation and data modeling
  • Python csv module - CSV export
  • Python json module - JSON export
  • GeoNames database - Geocoding (existing from previous session)

Remaining Tasks

High Priority ⚠️

  1. Complete geocoding (61.7% → 90%+)

    • Use Nominatim API for 36 institutions without coordinates
    • Handle small Bulgarian villages and Cyrillic name variants
    • Manual coordinate lookup for remaining institutions
  2. Improve GHCID coverage (61.7% → 90%+)

    • Dependent on geocoding (need city + region for GHCID generation)
    • Add more cities to bulgarian_city_regions.json lookup table
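Since small villages and Cyrillic variants are the main failure mode, one option is to query Nominatim by settlement rather than by the full library name. The sketch below only builds the request URL (no network call); the query strategy is an assumption, not the behavior of the existing `geocode_bulgarian_missing.py` script:

```python
from urllib.parse import urlencode

def nominatim_query_url(settlement: str) -> str:
    """Build a Nominatim search URL for an institution without coordinates.

    Querying the settlement (plus country) rather than the full library
    name tends to work better for small Bulgarian villages, where the
    reading-room ("читалище") itself is rarely present in OpenStreetMap.
    """
    params = {
        'q': f'{settlement}, България',
        'format': 'json',
        'limit': 1,
        'accept-language': 'bg',
    }
    return 'https://nominatim.openstreetmap.org/search?' + urlencode(params)

print(nominatim_query_url('Банско'))
```

Nominatim's usage policy limits clients to one request per second, so a real batch run over the 36 missing institutions should sleep between calls.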

Medium Priority 🟢

  1. Enrich with Wikidata (0% → 50%+)

    • Query Wikidata by institution name + location
    • Fuzzy match with RapidFuzz (threshold > 0.85)
    • Add Q-numbers, VIAF IDs, founding dates
    • Contribution opportunity: Add 94 Bulgarian ISIL codes to Wikidata
  2. Export to RDF/Turtle

    • Generate Linked Open Data exports
    • Use TOOI, CPOV, Schema.org ontologies
    • Publish to heritage discovery platforms
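The fuzzy-match step above can be approximated with the standard library. The plan names RapidFuzz; `difflib` is substituted here only to keep the sketch dependency-free, and its ratio is directly comparable to the 0.85 threshold:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(local_name: str, wikidata_label: str,
             threshold: float = 0.85) -> bool:
    """Accept a Wikidata candidate whose label clears the threshold."""
    return name_similarity(local_name, wikidata_label) >= threshold

# The national library under slightly different quotation marks:
print(is_match('Национална библиотека „Св. св. Кирил и Методий"',
               'Национална библиотека "Св. св. Кирил и Методий"'))  # → True
```

A production matcher should also constrain candidates by location (settlement or region), since many читалище libraries share near-identical patron-saint names.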

Low Priority

  1. Integration with global dataset
    • Merge with worldwide heritage custodian dataset
    • Generate geographic visualizations
    • Cross-link with other European libraries

Lessons Learned

1. Always Analyze Source Data Structure First

Mistake: Assumed the scraper was working correctly because 24 institutions extracted successfully.

Reality: 74% of the data had typos that weren't being matched.

Fix: Performed comprehensive field name analysis to identify ALL variants.

Takeaway: When extraction rate is low (<50%), investigate the source HTML structure for inconsistencies, not just the parsing logic.

2. HTML Data Quality Varies Widely

The Bulgarian National Library ISIL registry has excellent content but inconsistent formatting:

  • Good: All 94 institutions have complete data
  • Good: Standard HTML table structure
  • Bad: Field names have typos (missing spaces)
  • Bad: No data validation before publishing

Implication: Always build scrapers with flexible field matching to handle real-world data quality issues.

3. BeautifulSoup .get_text(strip=True) is Reliable

No need for complex HTML parsing or nested div traversal. Simple .get_text(strip=True) extracted all names correctly once the field mapping was fixed.

```python
field_value = cells[1].get_text(strip=True)  # Works perfectly for Bulgarian names
```

4. Placeholder Names Hide Data Quality Issues

The placeholder names ("Library BG-0130001") masked the real problem:

  • Made it seem like the scraper was "working"
  • Hid the fact that 74% of institutions were missing names
  • Created unusable dataset for end users

Better approach: Fail fast and report extraction failures instead of generating placeholders.
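A hypothetical fail-fast variant of the converter illustrates the point (names are illustrative):

```python
class NameExtractionError(ValueError):
    """Raised when a scraped record lacks a real institution name."""

def require_name(record: dict) -> str:
    """Return the Bulgarian name, refusing to fabricate a placeholder.

    Failing loudly here would have surfaced the 70 unmatched tables
    immediately instead of hiding them behind placeholder strings.
    """
    name = record.get('name_bg', '').strip()
    if not name:
        raise NameExtractionError(
            f"No name extracted for ISIL {record.get('isil', '<unknown>')}"
        )
    return name
```

If hard failure is too disruptive for a batch run, logging each miss and exiting non-zero at the end achieves the same visibility.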


Next Session Handoff

Start Here

  1. Verify name extraction:

     ```shell
     grep -c "Library BG-" data/instances/bulgaria_isil_libraries.yaml
     # Expected: 0 (zero placeholder names)
     ```
  2. Check data quality:

     ```python
     import yaml

     with open('data/instances/bulgaria_isil_libraries.yaml') as f:
         content = f.read()

     # The file has header content before the list; parse from the first entry
     institutions = yaml.safe_load(content[content.index('\n- id:'):])

     print(f"Total: {len(institutions)}")
     named = sum(1 for i in institutions
                 if not i.get('name', '').startswith('Library BG-'))
     print(f"With names: {named}")
     # Expected: 94 total, 94 with names
     ```

Next Steps (Priority Order)

Priority 1: Complete Geocoding ⚠️

```shell
# Use the Nominatim API for missing coordinates
python scripts/geocode_bulgarian_missing.py
# Target: 61.7% → 90%+ geocoding coverage
```

Priority 2: Enrich with Wikidata 🟢

```shell
# Query Wikidata for Bulgarian libraries
python scripts/enrich_bulgarian_wikidata.py
# Fuzzy match by name + location
# Add Q-numbers to LinkML records
```

Priority 3: Export RDF 🟢

```shell
# Generate RDF/Turtle for Linked Open Data
python scripts/export_bulgaria_rdf.py
# Publish to heritage discovery platforms
```

Files to Review

| File | Purpose | Status |
|---|---|---|
| data/instances/bulgaria_isil_libraries.yaml | Final LinkML output | Complete |
| data/isil/bulgarian_isil_registry.csv | Intermediate CSV | Complete |
| scripts/scrapers/bulgarian_isil_scraper.py | HTML scraper | Fixed |
| scripts/convert_bulgarian_isil_to_linkml.py | LinkML converter | Works |
| scripts/enrich_bulgarian_wikidata.py | Wikidata enrichment | ⚠️ Skeleton |

References

Related Documentation:

  • BULGARIAN_ISIL_LINKML_INTEGRATION_COMPLETE.md - Phase 1 initial report (before fix)
  • AGENTS.md - AI agent instructions for GLAM data extraction
  • docs/SCHEMA_MODULES.md - LinkML schema v0.2.1 architecture
  • docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Schema Files:

  • schemas/heritage_custodian.yaml - Main schema
  • schemas/core.yaml - HeritageCustodian, Location, Identifier classes
  • schemas/enums.yaml - InstitutionTypeEnum, DataTier
  • schemas/provenance.yaml - Provenance metadata


Success Metrics

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Name extraction | 100% | 100% | Complete |
| Schema compliance | 100% | 100% | Complete |
| ISIL codes | 100% | 100% | Complete |
| Geocoding | 90% | 61.7% | ⚠️ In progress |
| GHCID coverage | 90% | 61.7% | ⚠️ In progress |
| Wikidata enrichment | 50% | 0% | Next step |

Overall Progress: Phase 1 Complete (85%) | Phase 2 Pending (15%)


Conclusion

The Bulgarian ISIL registry name extraction issue is now fully resolved. All 94 institutions have real Bulgarian names extracted from the source HTML, making the dataset usable for:

  • Human browsing and discovery
  • Citation in academic papers
  • Integration with heritage platforms
  • Cross-referencing with Wikidata
  • RDF/Linked Open Data publishing

The root cause (HTML field name typos) was identified through systematic field analysis and fixed with minimal code changes. The solution is robust and handles all known field name variants.

Next priority: Complete geocoding to enable full GHCID generation for all 94 institutions.


Session Complete: 2025-11-18
Total Time: ~30 minutes
Institutions Fixed: 70 → 0 placeholders (100% success rate)
Schema Validation: PASS (no issues)
Data Quality: TIER_1_AUTHORITATIVE (unchanged)