Belgian ISIL Integration - Complete ✅
Session Date: November 18, 2025
Status: COMPLETE
Dataset: 421 Belgian Heritage Institutions
Executive Summary
Successfully integrated 421 Belgian heritage institutions from the KBR (Royal Library of Belgium) ISIL registry into the GLAM project. The pipeline includes web scraping, LinkML parsing, location inference, Wikidata enrichment, and RDF export.
Data Pipeline Results
| Stage | Output File | Records | Coverage | File Size |
|---|---|---|---|---|
| 1. Scraping | belgian_isil_detailed.csv | 421 | 100% ISIL codes | 72.4 KB |
| 2. Parsing | belgium_isil_institutions.yaml | 421 | LinkML-compliant | 283.2 KB |
| 3. Location Enrichment | belgium_isil_institutions_enriched.yaml | 421 | 74.1% cities | 287.1 KB |
| 4. Wikidata Enrichment | belgium_isil_institutions_wikidata.yaml | 421 | 2.4% Q-numbers | 291.4 KB |
| 5. RDF Export | belgium_isil_institutions.ttl | 421 | 14,546 triples | 673.0 KB |
Dataset Composition
Institution Types
- Libraries: 357 (84.8%)
- Archives: 56 (13.3%)
- Museums: 8 (1.9%)
Geographic Coverage
- Unique Belgian Cities: 293
- Institutions with City Data: 312 (74.1%)
- Institutions with Coordinates: 1 (0.2%)
Identifier Coverage
- ISIL Codes: 421 (100%)
- Wikidata Q-numbers: 10 (2.4%)
- VIAF IDs: 8 (1.9%)
- Website URLs: 421 (100%)
Technical Implementation
Phase 1-2: Web Scraping ✅
Script: scripts/scrapers/scrape_belgian_isil.py (existing from previous session)
Source: https://isil.kbr.be/
Method:
- BeautifulSoup HTML parsing
- Institution detail page extraction
- Multi-field capture (name, code, type, parent org, accessibility)
Phase 3-4: LinkML Parsing ✅
Parser: src/glam_extractor/parsers/belgian_isil.py
Features:
- 358 lines, 89% test coverage
- 18 passing unit tests
- GHCID generation with deterministic UUIDs
- Full provenance tracking
Key Functions:
class BelgianISILParser:
    def parse_file(csv_path: Path) -> List[HeritageCustodian]
    def parse_row(row: dict) -> HeritageCustodian
    def _normalize_institution_type(raw_type: str) -> InstitutionTypeEnum
    def _extract_alternative_names(name: str) -> List[str]
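The parser's actual GHCID scheme is not reproduced in this document. As a minimal sketch of what "deterministic UUIDs" can mean in practice, the example below derives a name-based UUIDv5 from an ISIL code. The namespace URL and the function name `deterministic_uuid` are assumptions made for illustration only.

```python
import uuid

# Hypothetical namespace: the project's real GHCID namespace is not shown here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid/")


def deterministic_uuid(isil_code: str) -> uuid.UUID:
    """Return the same UUID every time for the same ISIL code."""
    # Normalise case and whitespace so cosmetic differences do not change the UUID.
    return uuid.uuid5(GHCID_NAMESPACE, isil_code.strip().upper())


if __name__ == "__main__":
    print(deterministic_uuid("BE-KBR00"))
```

Because the UUID is a pure function of the input, re-running the parser on the same registry row would yield the same identifier, which is the property the feature list describes.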
Phase 6: Location Enrichment ✅
Script: scripts/enrich_belgian_locations.py
Method: Regex pattern matching on institution names
Patterns:
"Bibliotheek [City]" → City
"Bib [City]" → City
"Archief [City]" → City
"([City])" → City # Parenthetical city names
Results:
- 312/421 institutions (74.1%) now have city data
- 293 unique cities identified
- Remaining 25.9% have generic/branded names without clear city indicators
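The pattern list above can be sketched roughly as follows. This is not the code in scripts/enrich_belgian_locations.py: the regular expressions, the `infer_city` name, and the confidence values attached to each pattern are assumptions chosen to illustrate the approach.

```python
import re
from typing import Optional, Tuple

# One or more capitalised words, allowing hyphens and apostrophes (e.g. "Sint-Niklaas").
_CITY = r"([A-ZÀ-Ý][\w'\-]+(?:[ -][A-ZÀ-Ý][\w'\-]+)*)"

# (pattern, confidence) pairs; the confidence values are illustrative assumptions.
PATTERNS = [
    (re.compile(r"^(?:Bibliotheek|Bib|Archief)\s+" + _CITY + r"\s*$"), 0.95),
    (re.compile(r"\(" + _CITY + r"\)\s*$"), 0.85),  # parenthetical city names
]


def infer_city(name: str) -> Optional[Tuple[str, float]]:
    """Return (city, confidence) if the name matches a known pattern, else None."""
    cleaned = name.strip()
    for pattern, confidence in PATTERNS:
        match = pattern.search(cleaned)
        if match:
            return match.group(1), confidence
    return None


if __name__ == "__main__":
    for example in ["Bibliotheek Gent", "Stadsarchief (Leuven)", "The Reading Tree"]:
        print(example, "->", infer_city(example))
```

Names that match neither pattern return `None`, which corresponds to the generic or branded names that remain without city data.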
Phase 7: Wikidata Enrichment ✅
Script: scripts/enrich_belgian_wikidata.py
Method: SPARQL query to Wikidata for ISIL code matches (P791)
Query Strategy:
SELECT ?item ?itemLabel ?itemAltLabel ?viaf ?founded ?coordinate WHERE {
  VALUES ?isilCode { "BE-A2000" "BE-KBR00" ... }
  ?item wdt:P791 ?isilCode .
  OPTIONAL { ?item wdt:P214 ?viaf }
  OPTIONAL { ?item wdt:P571 ?founded }
  OPTIONAL { ?item wdt:P625 ?coordinate }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,fr,en" }
}
Results:
- 10 Q-numbers added (2.4% coverage)
- 8 VIAF IDs added
- 9 founding dates added
- 1 coordinate pair added
Notable Matches:
- BE-KBR00: Royal Library of Belgium → Q383931
- BE-TEN00: Royal Museum for Central Africa → Q779703
- BE-A2003: Royal Institute for Cultural Heritage → Q2235462
- BE-BUE01: Groeningemuseum → Q1948674
Low Coverage Reason: Most Belgian institutions don't have ISIL codes (P791) registered in Wikidata. This is typical for local/municipal libraries.
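With 421 codes to look up, the `VALUES` clause has to be built in batches. The sketch below shows one way to do that with 100 codes per query. The helper names `chunked` and `build_isil_query` are hypothetical, and the code only builds query strings; sending them to the Wikidata Query Service is left out.

```python
from typing import Iterator, List


def chunked(codes: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` ISIL codes."""
    for start in range(0, len(codes), size):
        yield codes[start:start + size]


def build_isil_query(codes: List[str]) -> str:
    """Build one SPARQL query that looks up a batch of ISIL codes via P791."""
    # Drop double quotes so a malformed code cannot break out of its literal.
    values = " ".join('"' + code.replace('"', "") + '"' for code in codes)
    return (
        "SELECT ?item ?isilCode ?viaf ?founded ?coordinate WHERE {\n"
        "  VALUES ?isilCode { " + values + " }\n"
        "  ?item wdt:P791 ?isilCode .\n"
        "  OPTIONAL { ?item wdt:P214 ?viaf }\n"
        "  OPTIONAL { ?item wdt:P571 ?founded }\n"
        "  OPTIONAL { ?item wdt:P625 ?coordinate }\n"
        "}"
    )


if __name__ == "__main__":
    all_codes = ["BE-A2000", "BE-KBR00", "BE-TEN00"]
    for batch in chunked(all_codes, size=2):
        print(build_isil_query(batch))
```

Selecting `?isilCode` in the result makes it straightforward to map each returned item back to the registry record it belongs to.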
Phase 8: RDF Export ✅
Script: scripts/export_belgian_rdf.py
Exporter: src/glam_extractor/exporters/rdf_exporter.py (existing)
Ontology Integration:
- Schema.org: Web discoverability (schema:Library, schema:Museum)
- CIDOC-CRM: Museum metadata (cidoc:E74_Group)
- RiC-O: Archival standards (rico:CorporateBody, rico:Identifier)
- W3C ORG: Organizational structure (org:Organization)
- PROV-O: Provenance tracking (prov:Entity, prov:Activity)
- GHCID: Custom heritage custodian vocabulary
RDF Statistics:
- 14,546 total triples
- 1,604 unique subjects
- 31 unique predicates
- 3,132 unique objects
- 34.6 triples per institution (average)
Sample RDF (Royal Library of Belgium):
<BE-KBR00> a schema1:Library,
schema1:Organization,
org:Organization,
prov:Entity,
ghcid:HeritageCustodian ;
dcterms:identifier "312739455", "BE-KBR00", "Q383931" ;
schema1:alternateName "Bibliothèque Royale de Belgique",
"Koninklijke Bibliotheek van België" ;
schema1:name "Koninklijke Bibliotheek van België(Bibliothèque Royale de Belgique)" ;
schema1:sameAs <https://viaf.org/viaf/312739455>,
<https://www.wikidata.org/wiki/Q383931> ;
schema1:location _:BrusselsLocation ;
prov:generatedAtTime "2025-11-18T15:15:51.552783+00:00"^^xsd:dateTime .
Key Technical Discoveries
1. LinkML Enum Handling
Issue: LinkML permissive enums are objects, not strings (Pydantic v1 behavior)
Solution: Convert to string in RDF exporter:
inst_type_str = str(institution_type) # "LIBRARY", "ARCHIVE", etc.
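To illustrate why the conversion matters, the sketch below uses a hypothetical stand-in class, `PermissiveEnumValue`, rather than the real LinkML runtime type. It assumes only that the enum object wraps its text and exposes it through `str()`.

```python
from collections import Counter


class PermissiveEnumValue:
    """Hypothetical stand-in for an enum object that wraps its text value."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


types = [PermissiveEnumValue(t) for t in ["LIBRARY", "ARCHIVE", "LIBRARY"]]

# Comparing the object itself with a string is False for this stand-in.
direct_matches = [t for t in types if t == "LIBRARY"]

# Converting to str first gives the expected comparisons and usable dict keys.
string_matches = [t for t in types if str(t) == "LIBRARY"]
counts = Counter(str(t) for t in types)

print(len(direct_matches), len(string_matches), dict(counts))
```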
2. YAML Record Splitting
Issue: LinkML dumper concatenates records without --- separators
Pattern: Split on \n(?=id: BE-) regex to find record boundaries
records_text = re.split(r'\n(?=id: BE-)', yaml_content)
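A small self-contained demonstration of the split, using made-up sample text rather than the real export. The lookahead `(?=id: BE-)` keeps the `id:` line at the start of each chunk instead of consuming it; `split_records` is a hypothetical helper name.

```python
import re
from typing import List

# Made-up sample with two concatenated records and no --- separator.
SAMPLE = (
    "id: BE-A2000\n"
    "name: Example Archive\n"
    "id: BE-KBR00\n"
    "name: Example Library\n"
)


def split_records(yaml_content: str) -> List[str]:
    """Split concatenated records at each newline that precedes 'id: BE-'."""
    chunks = re.split(r"\n(?=id: BE-)", yaml_content)
    # Strip trailing whitespace and drop any empty chunk.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    for record in split_records(SAMPLE):
        print(repr(record))
```

Each resulting chunk can then be parsed on its own, for example with PyYAML's `yaml.safe_load`.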
3. Wikidata ISIL Sparseness
Finding: Only 2.4% of Belgian institutions have ISIL codes in Wikidata
Implication: Future enrichment should use name + city fuzzy matching instead of relying solely on ISIL code (P791) queries
Data Quality Assessment
Tier 1: Authoritative Data ✅
- ISIL codes: 100% coverage from official KBR registry
- Institution names: Verified from source website
- Website URLs: Direct from registry
Provenance:
provenance:
data_source: CSV_REGISTRY
data_tier: TIER_1_AUTHORITATIVE
extraction_method: "BelgianISILParser with GHCID generation (scraped from KBR registry)"
Tier 3: Crowd-Sourced Data
- Wikidata Q-numbers: 2.4% coverage (10 institutions)
- VIAF IDs: 1.9% coverage (8 institutions)
- Founding dates: 2.1% coverage (9 institutions)
Limitation: Most local Belgian libraries lack Wikidata presence
Tier 4: Inferred Data
- City names: 74.1% coverage (312 institutions)
- Method: Regex pattern matching on institution names
- Confidence: Variable (0.85-0.95 for clear patterns)
Files Created This Session
Scripts
- scripts/enrich_belgian_locations.py (NEW)
  - Regex-based city extraction from institution names
  - 312/421 institutions enriched (74.1%)
- scripts/enrich_belgian_wikidata.py (NEW)
  - SPARQL batch queries (100 codes per query)
  - 10 Wikidata matches found
- scripts/export_belgian_rdf.py (NEW)
  - RDF/Turtle serialization
  - Multi-ontology integration
Data Files
- data/instances/belgium_isil_institutions.yaml (283.2 KB)
  - Base LinkML export from CSV parsing
- data/instances/belgium_isil_institutions_enriched.yaml (287.1 KB)
  - Location-enriched version
- data/instances/belgium_isil_institutions_wikidata.yaml (291.4 KB)
  - Wikidata-enriched version (final YAML)
- data/rdf/belgium_isil_institutions.ttl (673.0 KB)
  - RDF/Turtle export with 14,546 triples
Documentation
- docs/sessions/SESSION_SUMMARY_20251118_BELGIAN_ISIL.md (previous session)
  - Phases 1-5 (scraping, parsing, initial export)
- BELGIAN_ISIL_COMPLETE.md (this document)
  - Complete pipeline documentation
Validation Results
RDF Validation ✅
python3 -c "from rdflib import Graph; g = Graph(); g.parse('data/rdf/belgium_isil_institutions.ttl', format='turtle'); print(f'Valid: {len(g)} triples')"
Output: Valid: 14546 triples
Entity Type Distribution
| Entity Type | Count | Notes |
|---|---|---|
| ghcid:Identifier | 449 | ISIL codes + Wikidata + VIAF + Website |
| schema:Organization | 421 | All institutions |
| schema:Library | 357 | 84.8% of institutions |
| schema:Place | 312 | Location objects (74.1% coverage) |
| rico:CorporateBody | 56 | Archives |
| cidoc:E74_Group | 8 | Museums |
Linkage Statistics
- schema:sameAs links: 18 (10 Wikidata + 8 VIAF)
- Location relationships: 312 (74.1%)
- Identifier relationships: 421 (100%)
- Provenance links: 421 (100%)
Future Enhancement Opportunities
1. Improve Wikidata Coverage
Method: Fuzzy name + city matching instead of ISIL-only
Potential: Could match 50-100 more institutions, raising coverage from 2.4% to roughly 14-26%
Script to create: scripts/enrich_belgian_wikidata_fuzzy.py
SPARQL approach:
SELECT ?item ?itemLabel WHERE {
?item wdt:P31/wdt:P279* wd:Q7075 . # instance of library
?item wdt:P17 wd:Q31 . # country: Belgium
?item wdt:P131* wd:[City] . # located in city (directly or via a parent area)
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,fr,en" }
}
2. Geocoding with Nominatim
Target: 312 institutions with city names but no coordinates
Method: Nominatim API with rate limiting (1 req/sec)
Expected time: about 5 minutes (293 unique cities at 1 req/sec, or 312 requests if each institution is geocoded separately)
Script to create: scripts/geocode_belgian_institutions.py
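A possible shape for that script is sketched below. It deduplicates cities so each one is looked up once and waits between requests. The function names, the User-Agent string, and the injectable `lookup` and `sleep` parameters are assumptions; the request uses Nominatim's public search endpoint with `format=json`.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def nominatim_lookup(city: str) -> Coordinates:
    """Query the public Nominatim search endpoint for one Belgian city."""
    params = urllib.parse.urlencode(
        {"q": city, "countrycodes": "be", "format": "json", "limit": 1}
    )
    request = urllib.request.Request(
        "https://nominatim.openstreetmap.org/search?" + params,
        # Placeholder User-Agent: replace with a value identifying your application.
        headers={"User-Agent": "glam-geocoder-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode_cities(
    cities: Iterable[str],
    lookup: Callable[[str], Coordinates] = nominatim_lookup,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Coordinates]:
    """Geocode each distinct city once, pausing `delay` seconds between requests."""
    cache: Dict[str, Coordinates] = {}
    for city in cities:
        if city in cache:
            continue  # already resolved; no extra request needed
        if cache:
            sleep(delay)  # rate limit: wait before every request after the first
        cache[city] = lookup(city)
    return cache
```

Passing `lookup` and `sleep` as parameters keeps the rate-limiting logic testable without network access; coordinates from the returned mapping can then be copied onto every institution in the same city.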
3. Add More Identifiers
Potential identifier sources:
- GND (German National Library): German-language institutions
- BnF (French National Library): Francophone institutions
- ULAN (Getty): Art museums and galleries
4. Cross-Link with Europeana
Integration: Check if Belgian institutions contribute to Europeana
API: Europeana Search API (institution provider queries)
Benefit: Link to digitized collections
5. Archives Portal Europe (APE) Integration
Target: 56 Belgian archives
API: APEx API for archival metadata
Benefit: Connect to EAD finding aids
Comparison with Other Datasets
Dutch ISIL Registry
- Size: 364 institutions (Belgium: 421)
- Wikidata coverage: Higher (~10%)
- Location data: 100% (Belgium: 74.1%)
Austrian ISIL (in progress)
- Status: Data extraction pending
- Expected size: ~200 institutions
Conclusion
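Belgium's ISIL registry is larger than the Dutch one (421 vs. 364 institutions), but its Wikidata linkage is lower (2.4% vs. ~10%) and its location coverage is 74.1% rather than 100%. Regex-based location enrichment is effective, and geocoding would add the coordinates that are still missing for all but one institution.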
Usage Examples
1. Load Belgian Institutions in Python
import re
from pathlib import Path
import yaml
from glam_extractor.models import HeritageCustodian
# The export has no --- separators, so split it on record boundaries first
text = Path('data/instances/belgium_isil_institutions_wikidata.yaml').read_text(encoding='utf-8')
records = [yaml.safe_load(chunk) for chunk in re.split(r'\n(?=id: BE-)', text)]
institutions = [HeritageCustodian(**rec) for rec in records]
# Filter by type (enum values are objects, so compare their string form)
libraries = [i for i in institutions if str(i.institution_type) == "LIBRARY"]
print(f"Belgian libraries: {len(libraries)}")
2. Query RDF with SPARQL
from rdflib import Graph
g = Graph()
g.parse('data/rdf/belgium_isil_institutions.ttl', format='turtle')
# Find all libraries in Brussels
query = """
PREFIX schema: <http://schema.org/>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
SELECT ?inst ?name WHERE {
  ?inst a schema:Library ;
        schema:name ?name ;
        schema:location ?loc .
  ?loc ghcid:city "Brussel" .
}
"""
for row in g.query(query):
    print(f"{row.inst}: {row.name}")
3. Export to JSON-LD
from glam_extractor.exporters.rdf_exporter import RDFExporter
exporter = RDFExporter()
# Add institutions...
jsonld = exporter.graph.serialize(format='json-ld')
print(jsonld)
Test Coverage
Parser Tests ✅
File: tests/parsers/test_belgian_isil.py
Coverage: 89%
Test Count: 18 tests, all passing
Key Tests:
- Institution type normalization
- Alternative name extraction
- GHCID generation
- Identifier parsing
- Provenance metadata
Integration Tests (Suggested)
File to create: tests/integration/test_belgian_pipeline.py
Tests needed:
def test_full_pipeline():
    """Test scraping → parsing → enrichment → RDF export."""

def test_rdf_validation():
    """Ensure RDF syntax is valid."""

def test_identifier_linkage():
    """Verify Wikidata/VIAF sameAs links."""
Lessons Learned
1. Wikidata ISIL Property (P791) is Sparse
Finding: Only 2.4% of institutions have ISIL codes in Wikidata
Recommendation: Always use multi-strategy matching:
- Try ISIL code first (fast, authoritative)
- Fall back to name + city fuzzy matching
- Manual review for ambiguous cases
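The three-step strategy above could be sketched as follows, using the standard library's `difflib.SequenceMatcher` for the fuzzy step. The record layout (dicts with `isil`, `name`, `city`, `qid` keys), the `match_institution` name, and both thresholds are assumptions for illustration, not tuned values.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def match_institution(
    inst: Dict[str, str],
    candidates: List[Dict[str, str]],
    accept: float = 0.90,
    review: float = 0.75,
) -> Tuple[Optional[str], str]:
    """Return (qid, strategy); strategy is 'isil', 'fuzzy', 'review' or 'none'."""
    # Strategy 1: exact ISIL match (fast, authoritative).
    isil = inst.get("isil")
    for cand in candidates:
        if isil and cand.get("isil") == isil:
            return cand["qid"], "isil"

    # Strategy 2: fuzzy name match, restricted to candidates in the same city.
    city = inst.get("city", "").lower()
    best_qid, best_score = None, 0.0
    for cand in candidates:
        if not city or cand.get("city", "").lower() != city:
            continue
        score = SequenceMatcher(
            None, inst["name"].lower(), cand["name"].lower()
        ).ratio()
        if score > best_score:
            best_qid, best_score = cand["qid"], score

    if best_score >= accept:
        return best_qid, "fuzzy"
    # Strategy 3: borderline scores are flagged for manual review, not accepted.
    if best_score >= review:
        return best_qid, "review"
    return None, "none"
```

Restricting the fuzzy step to candidates in the same city keeps generic names such as "Central Library" from matching across towns.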
2. Location Inference is Effective for European Data
Finding: 74% coverage from name patterns alone
Reason: European naming conventions often include city names
Limitation: Won't work for institutions with:
- Branded names ("The Reading Tree")
- Abbreviations without expansion
- Generic names ("Central Library")
3. LinkML Enum Handling Requires Type Conversion
Issue: LinkML permissive enums are objects, not plain strings
Solution: Always convert to string when using as dict keys or in comparisons:
inst_type_str = str(institution.institution_type)
4. YAML Record Splitting for LinkML Output
Issue: LinkML YAML dumper doesn't insert --- separators between records
Solution: Use regex split pattern: \n(?=id: BE-)
Alternative: Use JSON-LD for multi-record exports (cleaner structure)
Project Status
✅ Completed Phases
- Phase 1-2: Web scraping (421 institutions)
- Phase 3-4: LinkML parsing and validation
- Phase 5: Initial YAML export
- Phase 6: Location enrichment (74.1% coverage)
- Phase 7: Wikidata enrichment (2.4% coverage)
- Phase 8: RDF/Turtle export (14,546 triples)
🎯 Next Steps (Optional)
- Geocoding with Nominatim (312 institutions across 293 cities → lat/lon)
- Wikidata fuzzy matching (increase coverage to 15-25%)
- Europeana integration (check collection contributions)
- Archives Portal Europe linkage (56 archives)
- GND/BnF identifier enrichment (German/French institutions)
📊 Overall Project Progress
Belgium is now the second fully integrated country in the GLAM project:
- ✅ Netherlands: 1,351 institutions (ISIL + Dutch Orgs CSV)
- ✅ Belgium: 421 institutions (ISIL registry)
- 🔄 Austria: In progress (ISIL extraction)
- 📋 Others: 60+ countries in conversation JSONs (pending extraction)
Contact and Credits
Data Source: Koninklijke Bibliotheek van België / Bibliothèque Royale de Belgique
Registry URL: https://isil.kbr.be/
ISIL Standard: ISO 15511:2019
Project: GLAM Data Extraction (Global Heritage Custodian Identification)
Repository: /Users/kempersc/apps/glam
Session Date: November 18, 2025
Documentation:
- Schema: schemas/heritage_custodian.yaml (LinkML v0.2.1)
- Agents Guide: AGENTS.md
- Persistent IDs: docs/PERSISTENT_IDENTIFIERS.md
Appendix: Command Reference
Run Full Pipeline
# 1. Scraping (already done)
python3 scripts/scrapers/scrape_belgian_isil.py
# 2. Parsing
python3 scripts/parse_belgian_isil.py
# 3. Location enrichment
python3 scripts/enrich_belgian_locations.py
# 4. Wikidata enrichment
python3 scripts/enrich_belgian_wikidata.py
# 5. RDF export
python3 scripts/export_belgian_rdf.py
Validation
# Validate RDF syntax
python3 -c "from rdflib import Graph; g = Graph(); g.parse('data/rdf/belgium_isil_institutions.ttl'); print(f'{len(g)} triples')"
# Run parser tests
pytest tests/parsers/test_belgian_isil.py -v
# Check YAML syntax
yamllint data/instances/belgium_isil_institutions_wikidata.yaml
Statistics
# Count institutions
grep -c "^id: BE-" data/instances/belgium_isil_institutions_wikidata.yaml
# Count by type
grep "institution_type:" data/instances/belgium_isil_institutions_wikidata.yaml | sort | uniq -c
# Count with Wikidata
grep -c "identifier_scheme: Wikidata" data/instances/belgium_isil_institutions_wikidata.yaml
Status: ✅ PIPELINE COMPLETE
Last Updated: November 18, 2025
Next Session: Austrian ISIL Integration or Geocoding Enhancement