glam/BELGIAN_ISIL_COMPLETE.md
2025-11-19 23:25:22 +01:00

Belgian ISIL Integration - Complete

Session Date: November 18, 2025
Status: COMPLETE
Dataset: 421 Belgian Heritage Institutions


Executive Summary

Successfully integrated 421 Belgian heritage institutions from the KBR (Royal Library of Belgium) ISIL registry into the GLAM project. The pipeline includes web scraping, LinkML parsing, location inference, Wikidata enrichment, and RDF export.

Data Pipeline Results

| Stage | Output File | Records | Coverage | File Size |
|---|---|---|---|---|
| 1. Scraping | belgian_isil_detailed.csv | 421 | 100% ISIL codes | 72.4 KB |
| 2. Parsing | belgium_isil_institutions.yaml | 421 | LinkML-compliant | 283.2 KB |
| 3. Location Enrichment | belgium_isil_institutions_enriched.yaml | 421 | 74.1% cities | 287.1 KB |
| 4. Wikidata Enrichment | belgium_isil_institutions_wikidata.yaml | 421 | 2.4% Q-numbers | 291.4 KB |
| 5. RDF Export | belgium_isil_institutions.ttl | 421 | 14,546 triples | 673.0 KB |

Dataset Composition

Institution Types

  • Libraries: 357 (84.8%)
  • Archives: 56 (13.3%)
  • Museums: 8 (1.9%)

Geographic Coverage

  • Unique Belgian Cities: 293
  • Institutions with City Data: 312 (74.1%)
  • Institutions with Coordinates: 1 (0.2%)

Identifier Coverage

  • ISIL Codes: 421 (100%)
  • Wikidata Q-numbers: 10 (2.4%)
  • VIAF IDs: 8 (1.9%)
  • Website URLs: 421 (100%)

Technical Implementation

Phase 1-2: Web Scraping

Script: scripts/scrapers/scrape_belgian_isil.py (existing from previous session)

Source: https://isil.kbr.be/

Method:

  • BeautifulSoup HTML parsing
  • Institution detail page extraction
  • Multi-field capture (name, code, type, parent org, accessibility)

Phase 3-4: LinkML Parsing

Parser: src/glam_extractor/parsers/belgian_isil.py

Features:

  • 358 lines, 89% test coverage
  • 18 passing unit tests
  • GHCID generation with deterministic UUIDs
  • Full provenance tracking
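
Deterministic UUIDs are commonly derived as name-based (version 5) UUIDs, so that the same identifier string always maps to the same UUID. The following is a minimal sketch under that assumption, not the parser's actual code: the namespace URL, the function name `ghcid_to_uuid`, and the use of an ISIL code as the input string are all hypothetical.

```python
import uuid

# Hypothetical namespace; the project's real namespace is not given in this document.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid/")


def ghcid_to_uuid(ghcid: str) -> uuid.UUID:
    """Map an identifier string to a stable, name-based (version 5) UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


first = ghcid_to_uuid("BE-KBR00")   # ISIL code used as a stand-in input
second = ghcid_to_uuid("BE-KBR00")
print(first == second)              # True: same input, same UUID
```

Because the UUID depends only on the namespace and the input string, re-running the parser on the same registry data would reproduce the same identifiers.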

Key Functions:

# Method signatures only; bodies omitted
class BelgianISILParser:
    def parse_file(self, csv_path: Path) -> List[HeritageCustodian]: ...
    def parse_row(self, row: dict) -> HeritageCustodian: ...
    def _normalize_institution_type(self, raw_type: str) -> InstitutionTypeEnum: ...
    def _extract_alternative_names(self, name: str) -> List[str]: ...
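
To illustrate what alternative name extraction might involve, here is a hedged sketch that splits a registry name of the form "Primary(Alternative)", as seen in the Royal Library record, into its parts. It is a standalone function written for illustration and is an assumption about the approach, not the body of `_extract_alternative_names`.

```python
import re
from typing import List


def extract_alternative_names(name: str) -> List[str]:
    """Split 'Primary(Alternative)' style names into their individual parts."""
    match = re.fullmatch(r"\s*(.+?)\s*\((.+)\)\s*", name)
    if not match:
        return []  # no trailing parenthetical, so nothing to extract
    return [part.strip() for part in match.groups() if part.strip()]


names = extract_alternative_names(
    "Koninklijke Bibliotheek van België(Bibliothèque Royale de Belgique)"
)
print(names)
# ['Koninklijke Bibliotheek van België', 'Bibliothèque Royale de Belgique']
```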

Phase 6: Location Enrichment

Script: scripts/enrich_belgian_locations.py

Method: Regex pattern matching on institution names

Patterns:

"Bibliotheek [City]" → City
"Bib [City]" → City
"Archief [City]" → City
"([City])" → City  # Parenthetical city names

Results:

  • 312/421 institutions (74.1%) now have city data
  • 293 unique cities identified
  • Remaining 25.9% have generic/branded names without clear city indicators
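
The four patterns above can be sketched as regular expressions. This is an illustrative approximation, not the code in scripts/enrich_belgian_locations.py: the definition of a city as one or more capitalised words, and the function name `infer_city`, are assumptions.

```python
import re
from typing import Optional

# Assumption: a city is one or more capitalised words (hyphens and apostrophes allowed).
CITY = r"(?P<city>[A-Z][\w'-]*(?: [A-Z][\w'-]*)*)"

PATTERNS = [
    re.compile(r"\b(?:Bibliotheek|Bib|Archief)\s+" + CITY),  # "Bibliotheek Gent"
    re.compile(r"\(" + CITY + r"\)"),                        # "Stadsarchief (Mechelen)"
]


def infer_city(name: str) -> Optional[str]:
    """Return the first city-like token matched by any pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.search(name)
        if match:
            return match.group("city")
    return None


print(infer_city("Bibliotheek Gent"))         # Gent
print(infer_city("Stadsarchief (Mechelen)"))  # Mechelen
print(infer_city("De Leesboom"))              # None
```

Names that match none of the patterns return None, which corresponds to the branded or generic names left without city data.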

Phase 7: Wikidata Enrichment

Script: scripts/enrich_belgian_wikidata.py

Method: SPARQL query to Wikidata for ISIL code matches (P791)

Query Strategy:

SELECT ?item ?itemLabel ?itemAltLabel ?isilCode ?viaf ?founded ?coordinate WHERE {
  VALUES ?isilCode { "BE-A2000" "BE-KBR00" ... }
  ?item wdt:P791 ?isilCode .
  OPTIONAL { ?item wdt:P214 ?viaf }
  OPTIONAL { ?item wdt:P571 ?founded }
  OPTIONAL { ?item wdt:P625 ?coordinate }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,fr,en" }
}

Results:

  • 10 Q-numbers added (2.4% coverage)
  • 8 VIAF IDs added
  • 9 founding dates added
  • 1 coordinate pair added

Notable Matches:

  • BE-KBR00: Royal Library of Belgium → Q383931
  • BE-TEN00: Royal Museum for Central Africa → Q779703
  • BE-A2003: Royal Institute for Cultural Heritage → Q2235462
  • BE-BUE01: Groeningemuseum → Q1948674

Low Coverage Reason: Most Belgian institutions don't have ISIL codes (P791) registered in Wikidata. This is typical for local/municipal libraries.
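
The ISIL lookup can be split into batches so that each request carries a bounded VALUES clause. The sketch below only builds the query strings and sends nothing; the template, the function name `build_batched_queries`, and the placeholder codes are assumptions, with a batch size of 100 taken from the enrichment script's description.

```python
from typing import Iterator, List

# Simplified query template (label handling omitted for brevity).
QUERY_TEMPLATE = """SELECT ?item ?isilCode ?viaf ?founded ?coordinate WHERE {{
  VALUES ?isilCode {{ {values} }}
  ?item wdt:P791 ?isilCode .
  OPTIONAL {{ ?item wdt:P214 ?viaf }}
  OPTIONAL {{ ?item wdt:P571 ?founded }}
  OPTIONAL {{ ?item wdt:P625 ?coordinate }}
}}"""


def build_batched_queries(isil_codes: List[str], batch_size: int = 100) -> Iterator[str]:
    """Yield one SPARQL query per batch of ISIL codes."""
    for start in range(0, len(isil_codes), batch_size):
        batch = isil_codes[start:start + batch_size]
        values = " ".join(f'"{code}"' for code in batch)
        yield QUERY_TEMPLATE.format(values=values)


codes = [f"BE-TEST{i:03d}" for i in range(421)]  # placeholder codes, not real ISILs
queries = list(build_batched_queries(codes))
print(len(queries))  # 5: four batches of 100 codes and one of 21
```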

Phase 8: RDF Export

Script: scripts/export_belgian_rdf.py

Exporter: src/glam_extractor/exporters/rdf_exporter.py (existing)

Ontology Integration:

  • Schema.org: Web discoverability (schema:Library, schema:Museum)
  • CIDOC-CRM: Museum metadata (cidoc:E74_Group)
  • RiC-O: Archival standards (rico:CorporateBody, rico:Identifier)
  • W3C ORG: Organizational structure (org:Organization)
  • PROV-O: Provenance tracking (prov:Entity, prov:Activity)
  • GHCID: Custom heritage custodian vocabulary

RDF Statistics:

  • 14,546 total triples
  • 1,604 unique subjects
  • 31 unique predicates
  • 3,132 unique objects
  • 34.6 triples per institution (average)
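
As a rough illustration of how such statistics are defined, the sketch below computes them over plain (subject, predicate, object) tuples. It is a simplified stand-in for a real RDF library, and the sample triples are illustrative rather than taken from the export.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def graph_statistics(triples: Iterable[Triple], institution_count: int) -> Dict[str, float]:
    """Count distinct triples, subjects, predicates, and objects."""
    unique = set(triples)  # an RDF graph is a set, so duplicate triples collapse
    return {
        "triples": len(unique),
        "subjects": len({s for s, _, _ in unique}),
        "predicates": len({p for _, p, _ in unique}),
        "objects": len({o for _, _, o in unique}),
        "per_institution": round(len(unique) / institution_count, 1),
    }


sample = [  # illustrative triples only
    ("BE-KBR00", "rdf:type", "schema:Library"),
    ("BE-KBR00", "schema:name", "Koninklijke Bibliotheek van België"),
    ("BE-TEN00", "rdf:type", "schema:Museum"),
]
stats = graph_statistics(sample, institution_count=2)
print(stats["per_institution"])  # 1.5
print(round(14546 / 421, 1))     # 34.6, the reported average
```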

Sample RDF (Royal Library of Belgium):

<BE-KBR00> a schema1:Library,
        schema1:Organization,
        org:Organization,
        prov:Entity,
        ghcid:HeritageCustodian ;
    dcterms:identifier "312739455", "BE-KBR00", "Q383931" ;
    schema1:alternateName "Bibliothèque Royale de Belgique",
        "Koninklijke Bibliotheek van België" ;
    schema1:name "Koninklijke Bibliotheek van België(Bibliothèque Royale de Belgique)" ;
    schema1:sameAs <https://viaf.org/viaf/312739455>,
        <https://www.wikidata.org/wiki/Q383931> ;
    schema1:location _:BrusselsLocation ;
    prov:generatedAtTime "2025-11-18T15:15:51.552783+00:00"^^xsd:dateTime .

Key Technical Discoveries

1. LinkML Enum Handling

Issue: LinkML permissive enums are objects, not strings (Pydantic v1 behavior)

Solution: Convert to string in RDF exporter:

inst_type_str = str(institution_type)  # "LIBRARY", "ARCHIVE", etc.
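
The pitfall can be reproduced without LinkML. The class below is a hypothetical stand-in for an enum object that wraps a text code; it is not LinkML's actual implementation, and only shows why the string conversion matters for comparisons and dictionary keys.

```python
class EnumValueStandIn:
    """Hypothetical stand-in for an enum object that wraps a text code."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


institution_type = EnumValueStandIn("LIBRARY")

print(institution_type == "LIBRARY")       # False: an object is compared with a string
print(str(institution_type) == "LIBRARY")  # True: compare the string form instead

counts = {}
key = str(institution_type)                # use the string form as a dict key
counts[key] = counts.get(key, 0) + 1
print(counts)                              # {'LIBRARY': 1}
```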

2. YAML Record Splitting

Issue: LinkML dumper concatenates records without --- separators

Pattern: Split on \n(?=id: BE-) regex to find record boundaries

records_text = re.split(r'\n(?=id: BE-)', yaml_content)
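
A small self-contained demonstration of the split, using made-up record content (the field names and values are illustrative):

```python
import re

# Two concatenated records with no '---' separator between them (illustrative content).
yaml_content = (
    "id: BE-A2000\n"
    "name: Example Archive\n"
    "identifiers:\n"
    "- identifier_scheme: ISIL\n"
    "id: BE-KBR00\n"
    "name: Example Library\n"
)

# The lookahead keeps 'id: BE-' at the start of each record instead of consuming it.
records_text = [r for r in re.split(r'\n(?=id: BE-)', yaml_content) if r.strip()]
print(len(records_text))               # 2
print(records_text[1].splitlines()[0]) # id: BE-KBR00
```

Because the lookahead requires `id: BE-` immediately after a newline, an indented nested `id:` key would not trigger a split.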

3. Wikidata ISIL Sparseness

Finding: Only 10 of the 421 registry institutions (2.4%) could be matched in Wikidata by ISIL code (P791)

Implication: Future enrichment should use name + city fuzzy matching instead of relying solely on ISIL code (P791) queries


Data Quality Assessment

Tier 1: Authoritative Data

  • ISIL codes: 100% coverage from official KBR registry
  • Institution names: Verified from source website
  • Website URLs: Direct from registry

Provenance:

provenance:
  data_source: CSV_REGISTRY
  data_tier: TIER_1_AUTHORITATIVE
  extraction_method: "BelgianISILParser with GHCID generation (scraped from KBR registry)"

Tier 3: Crowd-Sourced Data

  • Wikidata Q-numbers: 2.4% coverage (10 institutions)
  • VIAF IDs: 1.9% coverage (8 institutions)
  • Founding dates: 2.1% coverage (9 institutions)

Limitation: Most local Belgian libraries lack Wikidata presence

Tier 4: Inferred Data

  • City names: 74.1% coverage (312 institutions)
  • Method: Regex pattern matching on institution names
  • Confidence: Variable (0.85-0.95 for clear patterns)

Files Created This Session

Scripts

  1. scripts/enrich_belgian_locations.py (NEW)

    • Regex-based city extraction from institution names
    • 312/421 institutions enriched (74.1%)
  2. scripts/enrich_belgian_wikidata.py (NEW)

    • SPARQL batch queries (100 codes per query)
    • 10 Wikidata matches found
  3. scripts/export_belgian_rdf.py (NEW)

    • RDF/Turtle serialization
    • Multi-ontology integration

Data Files

  1. data/instances/belgium_isil_institutions.yaml (283.2 KB)

    • Base LinkML export from CSV parsing
  2. data/instances/belgium_isil_institutions_enriched.yaml (287.1 KB)

    • Location-enriched version
  3. data/instances/belgium_isil_institutions_wikidata.yaml (291.4 KB)

    • Wikidata-enriched version (final YAML)
  4. data/rdf/belgium_isil_institutions.ttl (673.0 KB)

    • RDF/Turtle export with 14,546 triples

Documentation

  1. docs/sessions/SESSION_SUMMARY_20251118_BELGIAN_ISIL.md (previous session)

    • Phases 1-5 (scraping, parsing, initial export)
  2. BELGIAN_ISIL_COMPLETE.md (this document)

    • Complete pipeline documentation

Validation Results

RDF Validation

python3 -c "from rdflib import Graph; g = Graph(); g.parse('data/rdf/belgium_isil_institutions.ttl', format='turtle'); print(f'Valid: {len(g)} triples')"

Output: Valid: 14546 triples

Entity Type Distribution

| Entity Type | Count | Notes |
|---|---|---|
| ghcid:Identifier | 449 | ISIL codes + Wikidata + VIAF + Website |
| schema:Organization | 421 | All institutions |
| schema:Library | 357 | 84.8% of institutions |
| schema:Place | 312 | Location objects (74.1% coverage) |
| rico:CorporateBody | 56 | Archives |
| cidoc:E74_Group | 8 | Museums |

Linkage Statistics

  • schema:sameAs links: 18 (10 Wikidata + 8 VIAF)
  • Location relationships: 312 (74.1%)
  • Identifier relationships: 421 (100%)
  • Provenance links: 421 (100%)

Future Enhancement Opportunities

1. Improve Wikidata Coverage

Method: Fuzzy name + city matching instead of ISIL-only

Potential: Could match 50-100 more institutions (12-25% coverage)

Script to create: scripts/enrich_belgian_wikidata_fuzzy.py

SPARQL approach:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q7075 .  # instance of library (or a subclass)
  ?item wdt:P17 wd:Q31 .               # country: Belgium
  ?item wdt:P131+ wd:[City] .          # located in the city, directly or transitively; [City] is a placeholder
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,fr,en" }
}

2. Geocoding with Nominatim

Target: 312 institutions with city names but no coordinates

Method: Nominatim API with rate limiting (1 req/sec)

Expected time: ~5 minutes for the 293 unique cities at 1 req/sec (312 requests if city lookups are not deduplicated)

Script to create: scripts/geocode_belgian_institutions.py
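
A possible shape for the planned script is sketched below. It deduplicates cities and pauses between requests to respect the 1 req/sec limit. The function names, the User-Agent string, and the "City, Belgium" query format are assumptions; the demonstration at the end uses a stub with dummy coordinates so that no network call is made.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def fetch_nominatim(city: str) -> Coordinates:
    """Query Nominatim for one Belgian city and return (lat, lon) if found."""
    params = urllib.parse.urlencode({"q": f"{city}, Belgium", "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        headers={"User-Agent": "glam-geocoder-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode_cities(cities: Iterable[str],
                   fetch: Callable[[str], Coordinates] = fetch_nominatim,
                   delay: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, Coordinates]:
    """Geocode each unique city once, waiting `delay` seconds between requests."""
    coordinates: Dict[str, Coordinates] = {}
    for index, city in enumerate(sorted(set(cities))):
        if index:
            sleep(delay)  # rate limiting between consecutive requests
        coordinates[city] = fetch(city)
    return coordinates


# Offline demonstration: a stub replaces the real service and returns dummy values.
calls = []


def stub_fetch(city: str) -> Coordinates:
    calls.append(city)
    return (0.0, 0.0) if city == "Gent" else None


result = geocode_cities(["Gent", "Gent", "Leuven"], fetch=stub_fetch, sleep=lambda s: None)
print(result)  # {'Gent': (0.0, 0.0), 'Leuven': None}
```

Passing `fetch` and `sleep` as parameters keeps the rate-limiting logic testable without contacting the service.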

3. Add More Identifiers

Potential identifier sources:

  • GND (German National Library): German-language institutions
  • BnF (French National Library): Francophone institutions
  • ULAN (Getty): Art museums and galleries

4. Europeana Integration

Integration: Check whether Belgian institutions contribute to Europeana

API: Europeana Search API (institution provider queries)

Benefit: Link to digitized collections

5. Archives Portal Europe (APE) Integration

Target: 56 Belgian archives

API: APEx API for archival metadata

Benefit: Connect to EAD finding aids


Comparison with Other Datasets

Dutch ISIL Registry

  • Size: 364 institutions (Belgium: 421)
  • Wikidata coverage: Higher (~10%)
  • Location data: 100% (Belgium: 74.1%)

Austrian ISIL (in progress)

  • Status: Data extraction pending
  • Expected size: ~200 institutions

Conclusion

Belgium's ISIL registry is larger than the Dutch one (421 vs. 364 institutions) but has lower Wikidata linkage (2.4% vs. ~10%). Location enrichment is effective (74.1% city coverage) but could be improved with geocoding.


Usage Examples

1. Load Belgian Institutions in Python

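import re
from pathlib import Path

import yaml
from glam_extractor.models import HeritageCustodian

# The LinkML dumper writes records without '---' separators,
# so split on each top-level "id: BE-" line before parsing.
yaml_path = Path('data/instances/belgium_isil_institutions_wikidata.yaml')
records_text = re.split(r'\n(?=id: BE-)', yaml_path.read_text(encoding='utf-8'))
institutions = [HeritageCustodian(**yaml.safe_load(r)) for r in records_text if r.strip()]

# Filter by type (enum values are objects, so compare their string form)
libraries = [i for i in institutions if str(i.institution_type) == "LIBRARY"]
print(f"Belgian libraries: {len(libraries)}")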

2. Query RDF with SPARQL

from rdflib import Graph

g = Graph()
g.parse('data/rdf/belgium_isil_institutions.ttl', format='turtle')

# Find all libraries in Brussels
query = """
PREFIX schema: <http://schema.org/>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>

SELECT ?inst ?name WHERE {
  ?inst a schema:Library ;
        schema:name ?name ;
        ghcid:location ?loc .
  ?loc ghcid:city "Brussel" .
}
"""

for row in g.query(query):
    print(f"{row.inst}: {row.name}")

3. Export to JSON-LD

from glam_extractor.exporters.rdf_exporter import RDFExporter

exporter = RDFExporter()
# Add institutions...
jsonld = exporter.graph.serialize(format='json-ld')
print(jsonld)

Test Coverage

Parser Tests

File: tests/parsers/test_belgian_isil.py

Coverage: 89%

Test Count: 18 tests, all passing

Key Tests:

  • Institution type normalization
  • Alternative name extraction
  • GHCID generation
  • Identifier parsing
  • Provenance metadata

Integration Tests (Suggested)

File to create: tests/integration/test_belgian_pipeline.py

Tests needed:

def test_full_pipeline():
    """Test scraping → parsing → enrichment → RDF export."""
    
def test_rdf_validation():
    """Ensure RDF syntax is valid."""
    
def test_identifier_linkage():
    """Verify Wikidata/VIAF sameAs links."""

Lessons Learned

1. Wikidata ISIL Property (P791) is Sparse

Finding: Only 2.4% of the 421 institutions (10) could be matched in Wikidata by ISIL code

Recommendation: Always use multi-strategy matching:

  1. Try ISIL code first (fast, authoritative)
  2. Fall back to name + city fuzzy matching
  3. Manual review for ambiguous cases
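
The three steps above can be sketched as a single cascade. This is an illustrative outline, not existing project code: both lookup tables, the strategy labels, the 0.85 cutoff, and the placeholder Q-numbers are assumptions (only the BE-KBR00 mapping comes from the matches reported earlier).

```python
from difflib import get_close_matches
from typing import Dict, Optional, Tuple


def match_wikidata(isil: str, name_city: str,
                   isil_index: Dict[str, str],
                   label_index: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Return (qid, strategy); both indexes are hypothetical lookup tables."""
    # 1. Exact ISIL match: fast and authoritative.
    if isil in isil_index:
        return isil_index[isil], "isil"
    # 2. Fuzzy fallback on a combined "name city" string.
    close = get_close_matches(name_city, list(label_index), n=2, cutoff=0.85)
    if len(close) == 1:
        return label_index[close[0]], "fuzzy"
    # 3. More than one near match is ambiguous and goes to manual review.
    if close:
        return None, "manual_review"
    return None, "unmatched"


isil_index = {"BE-KBR00": "Q383931"}
label_index = {  # placeholder Q-numbers
    "Stadsbibliotheek Gent": "Q0000001",
    "Stadsbibliotheek Genk": "Q0000002",
    "Openbare Bibliotheek Brugge": "Q0000003",
}

print(match_wikidata("BE-KBR00", "Koninklijke Bibliotheek Brussel", isil_index, label_index))
# ('Q383931', 'isil')
print(match_wikidata("BE-XXX00", "Openbare Bibliotheek Bruge", isil_index, label_index))
# ('Q0000003', 'fuzzy')
print(match_wikidata("BE-XXX01", "Stadsbibliotheek Gent", isil_index, label_index))
# (None, 'manual_review')
```

In the last call both "Gent" and "Genk" score above the cutoff, so the record is flagged for review rather than linked automatically.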

2. Location Inference is Effective for European Data

Finding: 74% coverage from name patterns alone

Reason: European naming conventions often include city names

Limitation: Won't work for institutions with:

  • Branded names ("The Reading Tree")
  • Abbreviations without expansion
  • Generic names ("Central Library")

3. LinkML Enum Handling Requires Type Conversion

Issue: LinkML permissive enums are objects, not plain strings

Solution: Always convert to string when using as dict keys or in comparisons:

inst_type_str = str(institution.institution_type)

4. YAML Record Splitting for LinkML Output

Issue: LinkML YAML dumper doesn't insert --- separators between records

Solution: Use regex split pattern: \n(?=id: BE-)

Alternative: Use JSON-LD for multi-record exports (cleaner structure)


Project Status

Completed Phases

  • Phase 1-2: Web scraping (421 institutions)
  • Phase 3-4: LinkML parsing and validation
  • Phase 5: Initial YAML export
  • Phase 6: Location enrichment (74.1% coverage)
  • Phase 7: Wikidata enrichment (2.4% coverage)
  • Phase 8: RDF/Turtle export (14,546 triples)

🎯 Next Steps (Optional)

  • Geocoding with Nominatim (312 institutions in 293 cities → lat/lon)
  • Wikidata fuzzy matching (increase coverage to 15-25%)
  • Europeana integration (check collection contributions)
  • Archives Portal Europe linkage (56 archives)
  • GND/BnF identifier enrichment (German/French institutions)

📊 Overall Project Progress

Belgium is now the second fully integrated country in the GLAM project:

  1. Netherlands: 1,351 institutions (ISIL + Dutch Orgs CSV)
  2. Belgium: 421 institutions (ISIL registry)
  3. 🔄 Austria: In progress (ISIL extraction)
  4. 📋 Others: 60+ countries in conversation JSONs (pending extraction)

Contact and Credits

Data Source: Koninklijke Bibliotheek van België / Bibliothèque Royale de Belgique
Registry URL: https://isil.kbr.be/
ISIL Standard: ISO 15511:2019

Project: GLAM Data Extraction (Global Heritage Custodian Identification)
Repository: /Users/kempersc/apps/glam
Session Date: November 18, 2025

Documentation:

  • Schema: schemas/heritage_custodian.yaml (LinkML v0.2.1)
  • Agents Guide: AGENTS.md
  • Persistent IDs: docs/PERSISTENT_IDENTIFIERS.md

Appendix: Command Reference

Run Full Pipeline

# 1. Scraping (already done)
python3 scripts/scrapers/scrape_belgian_isil.py

# 2. Parsing
python3 scripts/parse_belgian_isil.py

# 3. Location enrichment
python3 scripts/enrich_belgian_locations.py

# 4. Wikidata enrichment
python3 scripts/enrich_belgian_wikidata.py

# 5. RDF export
python3 scripts/export_belgian_rdf.py

Validation

# Validate RDF syntax
python3 -c "from rdflib import Graph; g = Graph(); g.parse('data/rdf/belgium_isil_institutions.ttl'); print(f'{len(g)} triples')"

# Run parser tests
pytest tests/parsers/test_belgian_isil.py -v

# Check YAML syntax
yamllint data/instances/belgium_isil_institutions_wikidata.yaml

Statistics

# Count institutions
grep -c "^id: BE-" data/instances/belgium_isil_institutions_wikidata.yaml

# Count by type
grep "institution_type:" data/instances/belgium_isil_institutions_wikidata.yaml | sort | uniq -c

# Count with Wikidata
grep -c "identifier_scheme: Wikidata" data/instances/belgium_isil_institutions_wikidata.yaml

Status: PIPELINE COMPLETE
Last Updated: November 18, 2025
Next Session: Austrian ISIL Integration or Geocoding Enhancement