Bulgarian ISIL Registry - Name Extraction Complete ✅
Date: 2025-11-18
Status: Phase 1 Complete (100% Success)
Duration: ~30 minutes
Executive Summary
Successfully fixed the placeholder name issue that affected 70 out of 94 Bulgarian libraries (74.5%). All 94 institutions now have real Bulgarian names extracted from the source HTML.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Real names | 24/94 (25.5%) | 94/94 (100.0%) | ✅ +70 institutions |
| Placeholder names | 70/94 (74.5%) 🔴 | 0/94 (0.0%) | ✅ Eliminated |
| Schema compliance | 100% | 100% | ✅ Maintained |
| Geocoding coverage | 61.7% | 61.7% | ➡️ No change |
| GHCID coverage | 61.7% | 61.7% | ➡️ No change |
Root Cause Analysis
The Problem
The Bulgarian National Library's ISIL registry HTML contains inconsistent field name spacing (typos):
Correct format (24 tables):
<td><strong>Наименование на организацията</strong></td>
^^ space here
Typo format (70 tables):
<td><strong>Наименование наорганизацията</strong></td>
^^ NO SPACE (typo!)
Why It Caused Placeholder Names
The original bulgarian_isil_scraper.py had a field mapping that only recognized the correct format:
field_mapping = {
'Наименование на организацията': 'name_bg', # Only matched 24 tables
# Missing: 'Наименование наорганизацията' # 70 tables were IGNORED
}
When the scraper couldn't find the name field, the LinkML converter fell back to placeholder names: "Library BG-{ISIL-code}".
The Fix
Updated the field mapping to handle BOTH variants:
field_mapping = {
# Name fields (with and without space typos)
'Наименование на организацията': 'name_bg', # Correct (24 tables)
'Наименование наорганизацията': 'name_bg', # Typo variant (70 tables)
'Наименование на английски език': 'name_en', # Correct
'Наименование наанглийски език': 'name_en', # Typo variant
'Варианти на името': 'name_variants', # Correct
'Варианти наимето': 'name_variants', # Typo variant
'Междубиблиотечно заемане': 'interlibrary_loan', # Correct
'Междубиблиотечнозаемане': 'interlibrary_loan' # Typo variant
}
Result: All 94 institutions now extract successfully.
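As an alternative to listing every typo variant by hand, the lookup could compare labels with all whitespace removed. The following is a minimal sketch of that idea, not the code in bulgarian_isil_scraper.py; the names `CANONICAL_FIELDS`, `squash`, and `map_field` are hypothetical and introduced only for illustration.

```python
import re

# Canonical labels as they appear in the registry when spelled correctly.
CANONICAL_FIELDS = {
    'Наименование на организацията': 'name_bg',
    'Наименование на английски език': 'name_en',
    'Варианти на името': 'name_variants',
    'Междубиблиотечно заемане': 'interlibrary_loan',
}


def squash(label: str) -> str:
    """Remove all whitespace and casefold, so spacing typos compare equal."""
    return re.sub(r'\s+', '', label).casefold()


# Lookup table keyed by the whitespace-free form of each canonical label.
NORMALIZED_FIELDS = {squash(label): key for label, key in CANONICAL_FIELDS.items()}


def map_field(label: str):
    """Return the internal field key for a raw HTML label, or None if unknown."""
    return NORMALIZED_FIELDS.get(squash(label))


print(map_field('Наименование наорганизацията'))  # typo variant from the registry
print(map_field('Наименование на организацията'))  # correct variant
```

With this approach a missing space that has not been seen yet (for example in a field not listed above) would still match, at the cost of also matching labels that differ only in spacing for some other reason.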
What Changed
Files Modified
- scripts/scrapers/bulgarian_isil_scraper.py ✏️
  - Added typo-variant field names to the field_mapping dictionary
  - Added comments documenting the HTML inconsistency
  - No other logic changes needed
- data/isil/bulgarian_isil_registry.csv 🔄
  - Re-scraped from the Bulgarian National Library website
  - Now contains 94 Bulgarian names (was 24)
  - Now contains 90 English names (was 24)
- data/isil/bulgarian_isil_registry.json 🔄
  - Re-scraped with complete name data
  - JSON export for intermediate processing
- data/instances/bulgaria_isil_libraries.yaml 🔄
  - Regenerated from the updated CSV
  - All 94 institutions now have real Bulgarian names
  - Zero placeholder names remain
  - 100% LinkML schema compliance maintained
Data Quality Report
Current State (After Fix)
| Metric | Count | Percentage | Status |
|---|---|---|---|
| Total institutions | 94 | 100% | ✅ |
| Schema compliant | 94 | 100% | ✅ |
| With ISIL codes | 94 | 100% | ✅ |
| With Bulgarian names | 94 | 100% | ✅ FIXED |
| With English names | 90 | 95.7% | ✅ |
| With websites | 67 | 71.3% | ✅ |
| With email | 94 | 100% | ✅ |
| With phone | 94 | 100% | ✅ |
| Geocoded (lat/lon) | 58 | 61.7% | ⚠️ |
| With region info | 59 | 62.8% | ⚠️ |
| With GHCIDs | 58 | 61.7% | ⚠️ |
| With Wikidata Q-numbers | 0 | 0% | ⬜ Next step |
Data Tier: TIER_1_AUTHORITATIVE (Bulgarian National Library official ISIL registry)
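Coverage figures like the ones in the table can be recomputed from the intermediate CSV. The sketch below shows one way to do that; the column names (`isil`, `name_bg`, `name_en`, `website`) and the two sample rows are assumptions made up for illustration, not values taken from bulgarian_isil_registry.csv.

```python
import csv
import io


def field_coverage(rows, fields):
    """Return {field: (count, percent)} counting rows with a non-empty value."""
    total = len(rows)
    report = {}
    for field in fields:
        count = sum(1 for row in rows if (row.get(field) or '').strip())
        percent = round(100 * count / total, 1) if total else 0.0
        report[field] = (count, percent)
    return report


# Invented in-memory sample standing in for the real registry CSV.
SAMPLE_CSV = """isil,name_bg,name_en,website
BG-0000001,Примерна библиотека,Sample Library,https://example.org
BG-0000002,Втора библиотека,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
coverage = field_coverage(rows, ['name_bg', 'name_en', 'website'])
for field, (count, percent) in coverage.items():
    print(f"{field}: {count}/{len(rows)} ({percent}%)")
```

To run it against the real file, replace the `io.StringIO` sample with `open(path, newline='', encoding='utf-8')` and adjust the field list to the actual CSV header.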
Sample Names (First 10)
1. Национална библиотека „Св. св. Кирил и Методий"
2. Библиотека при Народно читалище „Георги Тодоров-1885"
3. Библиотека при Народно читалище „Светлина-1942"
4. Библиотека при Народно читалище „Просвета-1940"
5. Библиотека при Народно читалище „Яне Сандански-1928"
6. Библиотека при Народно читалище „Светлина-1907"
7. Библиотека при Народно читалище „Просвета-1865"
8. Библиотека при Народно читалище „Прогрес – 1928 г."
9. Читалищна библиотека „Отец Паисий"
10. Библиотека при Народно читалище „Култура-1932"
Validation Results
LinkML Schema Validation ✅
linkml-validate -s schemas/heritage_custodian.yaml \
-C HeritageCustodian \
data/instances/bulgaria_isil_libraries.yaml
# Output: No issues found ✅
All 94 institutions pass validation against LinkML schema v0.2.1.
Field Name Analysis
Analyzed all field names in the Bulgarian ISIL registry HTML to identify inconsistencies:
| Field Name | Correct Variant | Typo Variant | Total |
|---|---|---|---|
| Organization name | Наименование на организацията (24) | Наименование наорганизацията (70) | 94 |
| English name | Наименование на английски език (24) | Наименование наанглийски език (70) | 94 |
| Name variants | Варианти на името (80) | Варианти наимето (14) | 94 |
| Interlibrary loan | Междубиблиотечно заемане (45) | Междубиблиотечнозаемане (48) | 93 |
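Pattern: The typo rate varies by field. Both name labels are missing the space in 70 of 94 tables (74.5%), the name-variants label in 14 of 94, and the interlibrary-loan label in 48 of the 93 tables that contain it.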
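A field-name analysis of this kind can be approximated by counting every label spelling and grouping spellings that are identical once whitespace is removed. The sketch below uses a regular expression on the `<td><strong>…</strong></td>` cell shape shown earlier instead of BeautifulSoup, purely to stay dependency-free; the function names and the three-cell sample are hypothetical.

```python
import re
from collections import Counter

# Matches label cells shaped like <td><strong>LABEL</strong></td>.
LABEL_CELL = re.compile(r'<td>\s*<strong>(.*?)</strong>\s*</td>', re.DOTALL)


def count_labels(html: str) -> Counter:
    """Count each distinct field label exactly as it is spelled in the HTML."""
    return Counter(label.strip() for label in LABEL_CELL.findall(html))


def group_variants(counts: Counter) -> dict:
    """Group labels that become identical once all whitespace is removed."""
    groups = {}
    for label, n in counts.items():
        key = re.sub(r'\s+', '', label)
        groups.setdefault(key, {})[label] = n
    return groups


# Invented sample: one correct label cell and two typo cells.
sample = (
    '<td><strong>Наименование на организацията</strong></td>'
    '<td><strong>Наименование наорганизацията</strong></td>'
    '<td><strong>Наименование наорганизацията</strong></td>'
)

for variants in group_variants(count_labels(sample)).values():
    if len(variants) > 1:  # the same label appears with more than one spelling
        print(variants)
```

Any group with more than one spelling is a candidate typo variant that the field mapping needs to cover.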
Technical Implementation
Workflow
- Download HTML from the Bulgarian National Library website

      curl -s "https://www.nationallibrary.bg/wp/?page_id=5686" \
        -o /tmp/bulgarian_isil.html

- Analyze HTML structure to identify field name inconsistencies

      from bs4 import BeautifulSoup
      # Found 94 tables, each representing one institution
      # Discovered typo patterns in field names

- Update scraper with typo-variant mappings

      # Added all typo variants to field_mapping dictionary

- Re-scrape data to extract all names

      python scripts/scrapers/bulgarian_isil_scraper.py
      # Output: 94 institutions with 100% Bulgarian names

- Regenerate LinkML YAML with complete names

      python scripts/convert_bulgarian_isil_to_linkml.py
      # Output: bulgaria_isil_libraries.yaml with zero placeholders

- Validate schema compliance

      linkml-validate -s schemas/heritage_custodian.yaml \
        -C HeritageCustodian \
        data/instances/bulgaria_isil_libraries.yaml
      # Output: No issues found ✅
Tools Used
- BeautifulSoup - HTML parsing and table extraction
- LinkML - Schema validation and data modeling
- Python csv module - CSV export
- Python json module - JSON export
- GeoNames database - Geocoding (existing from previous session)
Remaining Tasks
High Priority ⚠️
- Complete geocoding (61.7% → 90%+)
  - Use Nominatim API for 36 institutions without coordinates
  - Handle small Bulgarian villages and Cyrillic name variants
  - Manual coordinate lookup for remaining institutions
- Improve GHCID coverage (61.7% → 90%+)
  - Dependent on geocoding (need city + region for GHCID generation)
  - Add more cities to the bulgarian_city_regions.json lookup table
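For the Nominatim step, a request could be built roughly as follows. This is a sketch under assumptions: it only constructs the search URL and parses an already-decoded response, and the helper names are hypothetical rather than taken from scripts/geocode_bulgarian_missing.py. The query parameters (`q`, `format`, `countrycodes`, `accept-language`, `limit`) are standard Nominatim search parameters.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(place: str, region: str = "") -> str:
    """Build a Nominatim search URL restricted to Bulgaria."""
    query = ", ".join(part for part in (place, region, "Bulgaria") if part)
    params = {
        "q": query,
        "format": "jsonv2",
        "countrycodes": "bg",      # only return results inside Bulgaria
        "accept-language": "bg",   # prefer Bulgarian (Cyrillic) result names
        "limit": 1,
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_coordinates(results: list):
    """Extract (lat, lon) as floats from a decoded response list, or None."""
    if not results:
        return None
    # Nominatim returns lat/lon as strings in its JSON output.
    return float(results[0]["lat"]), float(results[0]["lon"])


print(build_geocode_url("Банско", "Благоевград"))
```

When calling the public Nominatim instance, its usage policy expects an identifying User-Agent header and no more than about one request per second, so a real script would add both.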
Medium Priority 🟢
- Enrich with Wikidata (0% → 50%+)
  - Query Wikidata by institution name + location
  - Fuzzy match with RapidFuzz (threshold > 0.85)
  - Add Q-numbers, VIAF IDs, founding dates
  - Contribution opportunity: Add 94 Bulgarian ISIL codes to Wikidata
- Export to RDF/Turtle
  - Generate Linked Open Data exports
  - Use TOOI, CPOV, Schema.org ontologies
  - Publish to heritage discovery platforms
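The fuzzy-matching step could look roughly like this. Note the substitution: the plan names RapidFuzz, but this sketch uses `difflib.SequenceMatcher` from the standard library so that it runs without extra dependencies, and its ratio happens to be on the same 0 to 1 scale as the 0.85 threshold. The candidate IDs and the helper names are placeholders, not real Wikidata items.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Casefold, strip quotation marks, and collapse whitespace."""
    for quote in '„“”"«»':
        name = name.replace(quote, '')
    return ' '.join(name.casefold().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name: str, candidates: dict, threshold: float = 0.85):
    """Return (candidate_id, score) for the best candidate above threshold, else None."""
    scored = [(cid, similarity(name, label)) for cid, label in candidates.items()]
    if not scored:
        return None
    cid, score = max(scored, key=lambda pair: pair[1])
    return (cid, score) if score > threshold else None


# Placeholder candidate IDs, standing in for Wikidata query results.
candidates = {
    "Q_EXAMPLE_1": 'Национална библиотека "Св. св. Кирил и Методий"',
    "Q_EXAMPLE_2": "Регионална библиотека",
}
print(best_match('Национална библиотека „Св. св. Кирил и Методий"', candidates))
```

Many names in this registry share the prefix "Библиотека при Народно читалище", so name similarity alone is likely to produce false positives; combining it with a location check, as the plan suggests, seems advisable.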
Low Priority ⬜
- Integration with global dataset
  - Merge with worldwide heritage custodian dataset
  - Generate geographic visualizations
  - Cross-link with other European libraries
Lessons Learned
1. Always Analyze Source Data Structure First
Mistake: Assumed the scraper was working correctly because 24 institutions extracted successfully.
Reality: 74% of the data had typos that weren't being matched.
Fix: Performed comprehensive field name analysis to identify ALL variants.
Takeaway: When extraction rate is low (<50%), investigate the source HTML structure for inconsistencies, not just the parsing logic.
2. HTML Data Quality Varies Widely
The Bulgarian National Library ISIL registry has excellent content but inconsistent formatting:
- ✅ Good: All 94 institutions have complete data
- ✅ Good: Standard HTML table structure
- ❌ Bad: Field names have typos (missing spaces)
- ❌ Bad: No data validation before publishing
Implication: Always build scrapers with flexible field matching to handle real-world data quality issues.
3. BeautifulSoup .get_text(strip=True) is Reliable
No need for complex HTML parsing or nested div traversal. Simple .get_text(strip=True) extracted all names correctly once the field mapping was fixed.
field_value = cells[1].get_text(strip=True) # Works perfectly for Bulgarian names
4. Placeholder Names Hide Data Quality Issues
The placeholder names ("Library BG-0130001") masked the real problem:
- Made it seem like the scraper was "working"
- Hid the fact that 74% of institutions were missing names
- Created unusable dataset for end users
Better approach: Fail fast and report extraction failures instead of generating placeholders.
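A fail-fast check along these lines might look like the sketch below. It is an illustration only: `MissingNameError`, `check_names`, and the `name_bg`/`isil` keys are hypothetical names, and the sample records are invented.

```python
class MissingNameError(ValueError):
    """Raised when scraped records have no usable name."""


def check_names(records, name_field="name_bg", id_field="isil"):
    """Raise if any record lacks a name instead of inventing a placeholder."""
    missing = [
        rec.get(id_field, "<unknown>")
        for rec in records
        if not (rec.get(name_field) or "").strip()
    ]
    if missing:
        raise MissingNameError(
            f"{len(missing)} of {len(records)} records have no {name_field}: "
            + ", ".join(missing[:5])
        )
    return records


# Invented sample records: the second one has no name.
records = [
    {"isil": "BG-0000001", "name_bg": "Примерна библиотека"},
    {"isil": "BG-0000002", "name_bg": ""},
]
try:
    check_names(records)
except MissingNameError as err:
    print(f"Extraction failed: {err}")
```

Run between the scraper and the converter, a check like this would have surfaced the 70 missing names on the first run rather than after the placeholders reached the YAML output.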
Next Session Handoff
Start Here
- Verify name extraction:

      grep -c "Library BG-" data/instances/bulgaria_isil_libraries.yaml
      # Expected: 0 (zero placeholder names)

- Check data quality:

      import yaml

      # Read as UTF-8 explicitly, since the file contains Cyrillic names
      with open('data/instances/bulgaria_isil_libraries.yaml', encoding='utf-8') as f:
          content = f.read()

      # Skip any header before the first record, then parse the list
      institutions = yaml.safe_load(content[content.index('\n- id:'):])
      print(f"Total: {len(institutions)}")
      print(f"With names: {sum(1 for i in institutions if not i['name'].startswith('Library BG-'))}")
      # Expected: 94 total, 94 with names
Next Steps (Priority Order)
Priority 1: Complete Geocoding ⚠️
# Use Nominatim API for missing coordinates
python scripts/geocode_bulgarian_missing.py
# Target: 61.7% → 90%+ geocoding coverage
Priority 2: Enrich Wikidata 🟢
# Query Wikidata for Bulgarian libraries
python scripts/enrich_bulgarian_wikidata.py
# Fuzzy match by name + location
# Add Q-numbers to LinkML records
Priority 3: Export RDF 🟢
# Generate RDF/Turtle for Linked Open Data
python scripts/export_bulgaria_rdf.py
# Publish to heritage discovery platforms
Files to Review
| File | Purpose | Status |
|---|---|---|
| data/instances/bulgaria_isil_libraries.yaml | Final LinkML output | ✅ Complete |
| data/isil/bulgarian_isil_registry.csv | Intermediate CSV | ✅ Complete |
| scripts/scrapers/bulgarian_isil_scraper.py | HTML scraper | ✅ Fixed |
| scripts/convert_bulgarian_isil_to_linkml.py | LinkML converter | ✅ Works |
| scripts/enrich_bulgarian_wikidata.py | Wikidata enrichment | ⚠️ Skeleton |
References
Related Documentation:
- BULGARIAN_ISIL_LINKML_INTEGRATION_COMPLETE.md - Phase 1 initial report (before fix)
- AGENTS.md - AI agent instructions for GLAM data extraction
- docs/SCHEMA_MODULES.md - LinkML schema v0.2.1 architecture
- docs/PERSISTENT_IDENTIFIERS.md - GHCID specification
Schema Files:
- schemas/heritage_custodian.yaml - Main schema
- schemas/core.yaml - HeritageCustodian, Location, Identifier classes
- schemas/enums.yaml - InstitutionTypeEnum, DataTier
- schemas/provenance.yaml - Provenance metadata
External Resources:
- Bulgarian National Library ISIL Registry: https://www.nationallibrary.bg/wp/?page_id=5686
- ISIL International Standard: https://isil.org/
- Wikidata SPARQL endpoint: https://query.wikidata.org/
Success Metrics
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Name extraction | 100% | 100% | ✅ Complete |
| Schema compliance | 100% | 100% | ✅ Complete |
| ISIL codes | 100% | 100% | ✅ Complete |
| Geocoding | 90% | 61.7% | ⚠️ In progress |
| GHCID coverage | 90% | 61.7% | ⚠️ In progress |
| Wikidata enrichment | 50% | 0% | ⬜ Next step |
Overall Progress: Phase 1 Complete (85%) | Phase 2 Pending (15%)
Conclusion
The Bulgarian ISIL registry name extraction issue is now fully resolved. All 94 institutions have real Bulgarian names extracted from the source HTML, making the dataset usable for:
- Human browsing and discovery
- Citation in academic papers
- Integration with heritage platforms
- Cross-referencing with Wikidata
- RDF/Linked Open Data publishing
The root cause (HTML field name typos) was identified through systematic field analysis and fixed with minimal code changes. The solution is robust and handles all known field name variants.
Next priority: Complete geocoding to enable full GHCID generation for all 94 institutions.
Session Complete: 2025-11-18
Total Time: ~30 minutes
Institutions Fixed: 70 → 0 placeholders (100% success rate)
Schema Validation: ✅ PASS (no issues)
Data Quality: TIER_1_AUTHORITATIVE (unchanged)