Canadian ISIL Extraction - Session Complete
Date: November 18, 2025
Status: ✅ COMPLETE
Duration: ~3 hours
Summary
Extracted 9,566 records from Library and Archives Canada's Canadian Library Symbols Registry and converted 9,237 of them (96.6%) to LinkML format. The remaining 329 records failed conversion and are covered under Known Issues.
Accomplishments
1. Data Extraction (COMPLETE ✅)
Source: Library and Archives Canada - Canadian Library Symbols Registry
URL: https://sigles-symbols.bac-lac.gc.ca/eng/Search
Scraped Data:
- 9,566 total library records (6,520 active + 3,046 closed)
- Extraction time: ~4 minutes
- Success rate: 100%
- Output files:
  - data/isil/canada/canadian_libraries_active.json (2.2 MB)
  - data/isil/canada/canadian_libraries_closed.json (1.1 MB)
  - data/isil/canada/canadian_libraries_all.json (3.3 MB)
2. LinkML Conversion (COMPLETE ✅)
Parser Created: src/glam_extractor/parsers/canadian_isil.py
Conversion Results:
- 9,237 institutions successfully converted (96.6% success rate)
- 329 records failed (3.4%) - due to city names with special characters
- Output files:
  - data/instances/canada/canadian_heritage_custodians.json (13 MB)
  - data/instances/canada/canadian_heritage_custodians_sample.yaml (116 KB)
Data Statistics
By Institution Type
| Type | Count | Percentage |
|---|---|---|
| LIBRARY | 4,490 | 48.6% |
| EDUCATION_PROVIDER | 2,011 | 21.8% |
| OFFICIAL_INSTITUTION | 1,200 | 13.0% |
| RESEARCH_CENTER | 1,096 | 11.9% |
| ARCHIVE | 235 | 2.5% |
| MUSEUM | 205 | 2.2% |
By Province (Top 5)
| Province | Count |
|---|---|
| Ontario | 3,335 |
| Quebec | 1,812 |
| Alberta | 1,259 |
| British Columbia | 923 |
| Saskatchewan | 564 |
Data Quality
- Data Tier: TIER_1_AUTHORITATIVE (government registry)
- Confidence Score: 0.98
- GHCID Format: CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]
- UUID Strategy: UUID v5 (SHA-1) + UUID v8 (SHA-256) + numeric identifier
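The three identifier forms in the UUID strategy can be sketched as follows. This is a minimal illustration, not the project's implementation: the namespace UUID, the choice of which SHA-256 bytes feed the v8 UUID, and the width of the numeric identifier are all assumptions, since this log does not state them. The function names are hypothetical. Because the namespace is assumed, the output will not match the ghcid_uuid values stored in the converted records.

```python
import hashlib
import uuid

# Assumption: the project's real namespace UUID is not given in this log,
# so the standard URL namespace stands in for it here.
GHCID_NAMESPACE = uuid.NAMESPACE_URL


def ghcid_uuid_v5(ghcid: str) -> uuid.UUID:
    """Deterministic UUID v5 (SHA-1 based) for a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_v8(ghcid: str) -> uuid.UUID:
    """UUID v8 built from the first 16 bytes of the SHA-256 digest (assumed layout)."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set the version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set the RFC 4122 variant bits (10xx)
    return uuid.UUID(bytes=bytes(raw))


def ghcid_numeric(ghcid: str) -> int:
    """Assumed numeric form: first 8 bytes of SHA-256 as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

All three functions are pure, so the same GHCID string always yields the same identifiers, which is the property the deterministic scheme relies on.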
Technical Implementation
Parser Features
The CanadianISILParser class implements:
- Institution Type Inference: Detects library, archive, museum, education, government, research types from name patterns
- City Name Normalization: Handles ALL CAPS city names, converts to title case
- Province Code Mapping: Maps full province names to 2-letter codes (AB, BC, ON, QC, etc.)
- GHCID Generation: Creates deterministic identifiers with Canadian format
- Status Mapping: Maps "active"/"closed" to ACTIVE/INACTIVE enum values
- ISIL Validation: Validates Canadian ISIL format (CA-XXXX)
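A few of the features above can be illustrated with small standalone helpers. This is a sketch under stated assumptions rather than the CanadianISILParser code itself: the function names, the keyword patterns used for type inference, the default of LIBRARY, and the ISIL regular expression (a CA- prefix followed by 1 to 11 alphanumeric characters) are illustrative guesses. Only the province abbreviations, the enum value names, and the active/closed mapping come from established facts or from this log.

```python
import re

# Standard two-letter abbreviations for the 13 provinces and territories.
PROVINCE_CODES = {
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB",
    "New Brunswick": "NB", "Newfoundland and Labrador": "NL",
    "Northwest Territories": "NT", "Nova Scotia": "NS", "Nunavut": "NU",
    "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT",
}

STATUS_MAP = {"active": "ACTIVE", "closed": "INACTIVE"}

# Assumption: hypothetical keyword patterns, checked in order; first match wins.
TYPE_PATTERNS = [
    ("ARCHIVE", r"\barchives?\b"),
    ("MUSEUM", r"\bmuseum\b|\bmusée\b"),
    ("EDUCATION_PROVIDER", r"\buniversity\b|\bcollege\b|\bschool\b"),
    ("RESEARCH_CENTER", r"\bresearch\b|\binstitute\b"),
    ("OFFICIAL_INSTITUTION", r"\bministry\b|\bdepartment\b"),
]

# Assumption: a simplified shape check, not the full ISO 15511 rule set.
ISIL_RE = re.compile(r"^CA-[A-Za-z0-9]{1,11}$")


def infer_type(name: str) -> str:
    """Guess an institution type from its name; default to LIBRARY."""
    for type_name, pattern in TYPE_PATTERNS:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return type_name
    return "LIBRARY"


def normalize_city(name: str) -> str:
    """Convert ALL CAPS city names to title case; leave mixed case untouched."""
    name = name.strip()
    return name.title() if name.isupper() else name


def map_status(raw: str) -> str:
    """Map the registry's status text to the exact uppercase enum value."""
    try:
        return STATUS_MAP[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown status: {raw!r}") from None


def is_valid_canadian_isil(code: str) -> bool:
    return ISIL_RE.match(code) is not None
```

Note that str.title() is a naive title-casing: it would turn "FORT MCMURRAY" into "Fort Mcmurray", so names with internal capitals may need a lookup table.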
Known Issues
329 records failed conversion due to invalid city LOCODE generation:
- Special characters: Cities with accents (Québec → "QUÉ"), apostrophes (L'Anse → "L'A")
- Abbreviations, short names, and digits: "St." (Saint), two-letter names such as "La", and leading numbers ("100 Mile House" → "100")
- Spaces: City names with spaces create invalid 2-letter codes
Solution: Implement better city code normalization:
- Remove accents (Québec → Quebec → QUE)
- Expand abbreviations (St. → Saint → SAI)
- Handle apostrophes (L'Anse → Lanse → LAN)
- Filter out numbers/special characters
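The normalization steps above could look roughly like this. It is a sketch, not the eventual fix: the function name city_code, the abbreviation table, and the choice to pad names shorter than three letters with "X" are assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: a hypothetical abbreviation table; extend it as the failed records require.
ABBREVIATIONS = {"st": "Saint", "ste": "Sainte", "mt": "Mount", "ft": "Fort"}


def city_code(city: str, length: int = 3) -> str:
    """Derive an ASCII, letters-only city code from a city name."""
    # 1. Strip accents: decompose, then drop combining marks (Québec -> Quebec).
    text = unicodedata.normalize("NFKD", city)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 2. Remove apostrophes so that L'Anse stays one word (Lanse).
    text = re.sub(r"['’]", "", text)
    # 3. Split on anything that is not a letter; this drops digits,
    #    periods, hyphens, and spaces in one pass.
    words = [w for w in re.split(r"[^A-Za-z]+", text) if w]
    # 4. Expand known abbreviations (St -> Saint).
    words = [ABBREVIATIONS.get(w.lower(), w) for w in words]
    letters = "".join(words).upper()
    if not letters:
        raise ValueError(f"no letters left in city name: {city!r}")
    # 5. Truncate, padding short names with "X" (an assumed convention).
    return letters[:length].ljust(length, "X")


for name in ["Québec", "St. John's", "L'Anse", "100 Mile House"]:
    print(name, "->", city_code(name))
```

Characters that have no Unicode decomposition, such as "œ", are treated as separators in step 3, so they may need an explicit replacement table.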
Files Created/Modified
New Files
- Scraper: scripts/scrapers/scrape_canadian_isil_fast.py - Fast list-page scraper
- Parser: src/glam_extractor/parsers/canadian_isil.py - LinkML converter
- Test: test_canadian_parser.py - Parser validation script
- Converter: convert_canadian_to_linkml.py - Bulk conversion script
- Data:
  - data/isil/canada/*.json - Raw scraped data
  - data/instances/canada/*.json|yaml - LinkML-formatted output
- Documentation: docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md - Session log
Modified Files
- Parser Init: src/glam_extractor/parsers/__init__.py - Added CanadianISILParser export
Next Steps (Future Work)
Task 1: Fix City Code Normalization (HIGH PRIORITY)
Improve the _create_city_locode() method to handle:
- Accented characters (é, è, ê, ô, etc.)
- Apostrophes and hyphens
- Abbreviations (St., Ste., Mt., etc.)
- Short names (< 3 characters)
- Special cases (numbers, symbols)
Impact: Will recover 329 failed records (3.4%)
Task 2: Enrich with Detail Pages (MEDIUM PRIORITY)
Extract additional fields from detail pages:
- Contact information: Address, phone, email, website
- Operational details: Hours, services, policies
- Administrative info: Director, founded date, notes
Tool: Modify scripts/scrapers/scrape_canadian_isil.py (slow but comprehensive)
Time estimate: ~2.5 hours for 9,566 records
Task 3: Geocoding (LOW PRIORITY)
Add latitude/longitude coordinates:
- Use Nominatim API with rate limiting (1 req/sec)
- Cache results to avoid repeated lookups
- Handle ambiguous city names
Script: scripts/enrich_geocoding.py (already exists)
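For orientation, the rate-limiting and caching approach could be sketched as below. This does not describe the contents of scripts/enrich_geocoding.py, which are not shown in this log. The class name CachedGeocoder, the injectable fetch function, and the User-Agent string are assumptions. The endpoint and the q, format, and limit parameters belong to Nominatim's public search API.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional, Tuple

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
# Assumption: placeholder value; Nominatim asks clients to identify themselves.
USER_AGENT = "glam-extractor-example/0.1"

Coords = Optional[Tuple[float, float]]


def fetch_nominatim(query: str) -> list:
    """Send one search request and return the decoded JSON result list."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}", headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


class CachedGeocoder:
    """Geocode city/province pairs with a cache and a minimum request interval."""

    def __init__(self, fetch: Callable[[str], list] = fetch_nominatim,
                 min_interval: float = 1.0) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._last_call: Optional[float] = None
        self._cache: Dict[Tuple[str, str], Coords] = {}

    def geocode(self, city: str, province: str, country: str = "Canada") -> Coords:
        key = (city.strip().lower(), province.strip().lower())
        if key in self._cache:  # cache hit: no network call, no delay
            return self._cache[key]
        if self._last_call is not None:  # enforce at most one request per interval
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        # Including the province narrows down ambiguous city names.
        results = self._fetch(f"{city}, {province}, {country}")
        self._last_call = time.monotonic()
        coords = (float(results[0]["lat"]), float(results[0]["lon"])) if results else None
        self._cache[key] = coords  # misses are cached too, to avoid retrying them
        return coords
```

The cache here lives in memory only; persisting it to disk between runs would be needed to avoid repeated lookups across sessions.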
Task 4: Integration with Global Dataset (LOW PRIORITY)
Merge Canadian data with main GLAM dataset:
- Cross-link with conversation-extracted Canadian institutions
- Deduplicate by ISIL code
- Resolve conflicts (use TIER_1 Canadian data as authoritative)
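One way to express the deduplication and conflict rule is sketched below, assuming records are plain dicts shaped like the example record in this log. The function names are hypothetical, and the sketch assumes that a lower tier number means a more authoritative source; only TIER_1_AUTHORITATIVE is named in this log. It replaces whole records rather than merging field by field, which is a simplification.

```python
import re
from typing import Dict, List, Optional


def isil_code(record: dict) -> Optional[str]:
    """Return the record's ISIL identifier value, if it has one."""
    for ident in record.get("identifiers", []):
        if ident.get("identifier_scheme") == "ISIL":
            return ident.get("identifier_value")
    return None


def tier_rank(record: dict) -> int:
    """Parse n from a data_tier such as TIER_n_...; unknown tiers rank last."""
    tier = record.get("provenance", {}).get("data_tier", "")
    match = re.match(r"TIER_(\d+)", tier)
    return int(match.group(1)) if match else 99


def merge_by_isil(existing: List[dict], canadian: List[dict]) -> List[dict]:
    """Deduplicate by ISIL code, keeping the most authoritative record."""
    merged: Dict[str, dict] = {}
    unkeyed: List[dict] = []
    for record in [*existing, *canadian]:
        code = isil_code(record)
        if code is None:
            unkeyed.append(record)  # nothing to deduplicate on; keep as-is
            continue
        current = merged.get(code)
        # Strictly-lower rank wins; on a tie the earlier record is kept.
        if current is None or tier_rank(record) < tier_rank(current):
            merged[code] = record
    return [*merged.values(), *unkeyed]
```

Cross-linking the conversation-extracted records that lack an ISIL code would need a separate matching step, for example on name and city, which this sketch does not attempt.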
Schema Compliance
All converted records conform to LinkML schema v0.2.0:
- Core Classes: HeritageCustodian, Location, Identifier, Provenance
- Enumerations: InstitutionTypeEnum, DataSourceEnum, DataTierEnum, OrganizationStatusEnum
- GHCID: Proper UUID v5, UUID v8, and numeric identifier generation
- Provenance: Complete extraction metadata with confidence scores
Example Record
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  institution_type: LIBRARY
  organization_status: ACTIVE
  ghcid_uuid: 2d0444bb-8934-571c-89d6-027bf0c87df4
  ghcid_current: CA-AB-AND-L-AML
  locations:
  - city: Andrew
    region: Alberta
    country: CA
  identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-AA
    identifier_url: https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000
  provenance:
    data_source: CSV_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: '2025-11-18T20:23:09.715778'
    confidence_score: 0.98
Lessons Learned
- City code generation is fragile: Need robust normalization for international characters
- Enum validation is strict: Must use exact uppercase values (ACTIVE vs active)
- LinkML models use different serialization: Use _as_json_obj(), not model_dump()
- Fast scraping is effective: List-page scraping is 100x faster than detail-page scraping
- Province diversity: Canadian data spans 13 provinces/territories with distinct patterns
References
- LAC Directory: https://sigles-symbols.bac-lac.gc.ca/eng/Search
- ISIL Standard: ISO 15511 (International Standard Identifier for Libraries and Related Organizations)
- LinkML Schema: schemas/heritage_custodian.yaml (v0.2.0)
- GHCID Specification: docs/GHCID_PID_SCHEME.md
Session Status: ✅ COMPLETE
Next Session: Fix city code normalization and recover 329 failed records