# Canadian ISIL Extraction - Session Complete

**Date**: November 18, 2025  
**Status**: ✅ COMPLETE  
**Duration**: ~3 hours

## Summary

Successfully extracted and converted **9,237 Canadian heritage institutions** from Library and Archives Canada's directory to LinkML format.

---

## Accomplishments

### 1. Data Extraction (COMPLETE ✅)

**Source**: Library and Archives Canada - Canadian Library Symbols Registry
**URL**: https://sigles-symbols.bac-lac.gc.ca/eng/Search

**Scraped Data**:
- **9,566 total library records** (6,520 active + 3,046 closed)
- **Extraction time**: ~4 minutes
- **Success rate**: 100%
- **Output files**:
  - `data/isil/canada/canadian_libraries_active.json` (2.2 MB)
  - `data/isil/canada/canadian_libraries_closed.json` (1.1 MB)
  - `data/isil/canada/canadian_libraries_all.json` (3.3 MB)

### 2. LinkML Conversion (COMPLETE ✅)

**Parser Created**: `src/glam_extractor/parsers/canadian_isil.py`

**Conversion Results**:
- **9,237 institutions successfully converted** (96.6% success rate)
- **329 records failed** (3.4%) - due to city names with special characters
- **Output files**:
  - `data/instances/canada/canadian_heritage_custodians.json` (13 MB)
  - `data/instances/canada/canadian_heritage_custodians_sample.yaml` (116 KB)

---

## Data Statistics

### By Institution Type

| Type | Count | Percentage |
|------|-------|------------|
| **LIBRARY** | 4,490 | 48.6% |
| **EDUCATION_PROVIDER** | 2,011 | 21.8% |
| **OFFICIAL_INSTITUTION** | 1,200 | 13.0% |
| **RESEARCH_CENTER** | 1,096 | 11.9% |
| **ARCHIVE** | 235 | 2.5% |
| **MUSEUM** | 205 | 2.2% |

### By Province (Top 5)

| Province | Count |
|----------|-------|
| Ontario | 3,335 |
| Quebec | 1,812 |
| Alberta | 1,259 |
| British Columbia | 923 |
| Saskatchewan | 564 |

### Data Quality

- **Data Tier**: TIER_1_AUTHORITATIVE (government registry)
- **Confidence Score**: 0.98
- **GHCID Format**: `CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]`
- **UUID Strategy**: UUID v5 (SHA-1) + UUID v8 (SHA-256) + numeric identifier

---

## Technical Implementation

### Parser Features

The `CanadianISILParser` class implements:

1. **Institution Type Inference**: Detects library, archive, museum, education, government, and research types from name patterns
2. **City Name Normalization**: Converts ALL CAPS city names to title case
3. **Province Code Mapping**: Maps full province names to 2-letter codes (AB, BC, ON, QC, etc.)
4. **GHCID Generation**: Creates deterministic identifiers in the Canadian format
5. **Status Mapping**: Maps "active"/"closed" to ACTIVE/INACTIVE enum values
6. **ISIL Validation**: Validates Canadian ISIL format (CA-XXXX)

### Known Issues

**329 records failed conversion** due to invalid city LOCODE generation:
- **Special characters**: Cities with accents (Québec → "QUÉ"), apostrophes (L'Anse → "L'A")
- **Abbreviations, short names, and numbers**: "St." (Saint), "La" (short names), numbers ("100 Mile House" → "100")
- **Spaces**: City names with spaces produce invalid 2-letter codes

**Solution**: Implement better city code normalization:
- Remove accents (Québec → Quebec → QUE)
- Expand abbreviations (St. → Saint → SAI)
- Handle apostrophes (L'Anse → Lanse → LAN)
- Filter out numbers/special characters

---

## Files Created/Modified

### New Files

1. **Scraper**: `scripts/scrapers/scrape_canadian_isil_fast.py` - Fast list-page scraper
2. **Parser**: `src/glam_extractor/parsers/canadian_isil.py` - LinkML converter
3. **Test**: `test_canadian_parser.py` - Parser validation script
4. **Converter**: `convert_canadian_to_linkml.py` - Bulk conversion script
5. **Data**:
   - `data/isil/canada/*.json` - Raw scraped data
   - `data/instances/canada/*.json|yaml` - LinkML-formatted output
6. **Documentation**: `docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md` - Session log

### Modified Files

1. **Parser Init**: `src/glam_extractor/parsers/__init__.py` - Added `CanadianISILParser` export

---

## Next Steps (Future Work)

### Task 1: Fix City Code Normalization (HIGH PRIORITY)

Improve the `_create_city_locode()` method to handle:
- Accented characters (é, è, ê, ô, etc.)
- Apostrophes and hyphens
- Abbreviations (St., Ste., Mt., etc.)
- Short names (< 3 characters)
- Special cases (numbers, symbols)

**Impact**: Will recover 329 failed records (3.4%)

### Task 2: Enrich with Detail Pages (MEDIUM PRIORITY)

Extract additional fields from detail pages:
- **Contact information**: Address, phone, email, website
- **Operational details**: Hours, services, policies
- **Administrative info**: Director, founded date, notes

**Tool**: Modify `scripts/scrapers/scrape_canadian_isil.py` (slow but comprehensive)
**Time estimate**: ~2.5 hours for 9,566 records

### Task 3: Geocoding (LOW PRIORITY)

Add latitude/longitude coordinates:
- Use Nominatim API with rate limiting (1 req/sec)
- Cache results to avoid repeated lookups
- Handle ambiguous city names

**Script**: `scripts/enrich_geocoding.py` (already exists)

### Task 4: Integration with Global Dataset (LOW PRIORITY)

Merge Canadian data with main GLAM dataset:
- Cross-link with conversation-extracted Canadian institutions
- Deduplicate by ISIL code
- Resolve conflicts (use TIER_1 Canadian data as authoritative)

---

## Schema Compliance

All converted records conform to LinkML schema v0.2.0:
- **Core Classes**: `HeritageCustodian`, `Location`, `Identifier`, `Provenance`
- **Enumerations**: `InstitutionTypeEnum`, `DataSourceEnum`, `DataTierEnum`, `OrganizationStatusEnum`
- **GHCID**: Proper UUID v5, UUID v8, and numeric identifier generation
- **Provenance**: Complete extraction metadata with confidence scores

### Example Record

```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  institution_type: LIBRARY
  organization_status: ACTIVE
  ghcid_uuid: 2d0444bb-8934-571c-89d6-027bf0c87df4
  ghcid_current: CA-AB-AND-L-AML
  locations:
    - city: Andrew
      region: Alberta
      country: CA
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
      identifier_url: https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000
  provenance:
    data_source: CSV_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: '2025-11-18T20:23:09.715778'
    confidence_score: 0.98
```

---

## Lessons Learned

1. **City code generation is fragile**: Need robust normalization for international characters
2. **Enum validation is strict**: Must use exact uppercase values (ACTIVE vs active)
3. **LinkML models use different serialization**: Use `_as_json_obj()` not `model_dump()`
4. **Fast scraping is effective**: List-page scraping is far faster than detail-page scraping (~4 minutes versus an estimated ~2.5 hours for 9,566 records, roughly 40x)
5. **Province diversity**: Canadian data spans 13 provinces/territories with distinct patterns

---

## References

- **LAC Directory**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
- **ISIL Standard**: ISO 15511 (International Standard Identifier for Libraries and Related Organizations)
- **LinkML Schema**: `schemas/heritage_custodian.yaml` (v0.2.0)
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`

---

**Session Status**: ✅ COMPLETE
**Next Session**: Fix city code normalization and recover 329 failed records
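
As a starting point for that next session, the normalization steps listed under Known Issues (strip accents, expand abbreviations, drop apostrophes, filter out digits and symbols) could be sketched roughly as follows. This is a minimal illustration, not the existing `_create_city_locode()` method: the function name `make_city_code`, the `ABBREVIATIONS` table, and the choice to raise `ValueError` for names shorter than three letters are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the real parser may need more entries.
ABBREVIATIONS = {"st": "saint", "ste": "sainte", "mt": "mount", "ft": "fort"}


def make_city_code(city: str, length: int = 3) -> str:
    """Return an uppercase ASCII city code, or raise ValueError if none can be built."""
    # Decompose accented characters (é -> e + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFKD", city)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Drop apostrophes (straight and curly) so "L'Anse" becomes "LAnse".
    ascii_name = re.sub(r"['\u2019]", "", ascii_name)

    # Split on anything that is not an ASCII letter; this discards digits,
    # dots, hyphens, spaces, and any character that did not decompose to ASCII.
    words = [w for w in re.split(r"[^A-Za-z]+", ascii_name) if w]

    # Expand known abbreviations word by word ("St" -> "saint").
    words = [ABBREVIATIONS.get(w.lower(), w) for w in words]

    letters = "".join(words).upper()
    if len(letters) < length:
        # Assumption: short names are rejected here; padding is another option.
        raise ValueError(f"cannot derive a {length}-letter code from {city!r}")
    return letters[:length]


print(make_city_code("Québec"))          # QUE
print(make_city_code("L'Anse"))          # LAN
print(make_city_code("St. John's"))      # SAI
print(make_city_code("100 Mile House"))  # MIL
```

Raising on short names such as "La" keeps the failure explicit so those records can be reviewed by hand; whether to pad, borrow letters from the province, or use a lookup table instead is a design decision left to the actual fix.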