glam/docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md
2025-11-19 23:25:22 +01:00

Canadian ISIL Extraction - Session Complete

Date: November 18, 2025
Status: COMPLETE
Duration: ~3 hours

Summary

Scraped 9,566 records from Library and Archives Canada's Canadian Library Symbols Registry and converted 9,237 of them (96.6%) to LinkML format.


Accomplishments

1. Data Extraction (COMPLETE)

Source: Library and Archives Canada - Canadian Library Symbols Registry
URL: https://sigles-symbols.bac-lac.gc.ca/eng/Search

Scraped Data:

  • 9,566 total library records (6,520 active + 3,046 closed)
  • Extraction time: ~4 minutes
  • Success rate: 100%
  • Output files:
    • data/isil/canada/canadian_libraries_active.json (2.2 MB)
    • data/isil/canada/canadian_libraries_closed.json (1.1 MB)
    • data/isil/canada/canadian_libraries_all.json (3.3 MB)

2. LinkML Conversion (COMPLETE)

Parser Created: src/glam_extractor/parsers/canadian_isil.py

Conversion Results:

  • 9,237 institutions successfully converted (96.6% success rate)
  • 329 records failed (3.4%) because their city names (accents, apostrophes, abbreviations, digits, spaces) produced invalid city codes
  • Output files:
    • data/instances/canada/canadian_heritage_custodians.json (13 MB)
    • data/instances/canada/canadian_heritage_custodians_sample.yaml (116 KB)

Data Statistics

By Institution Type

| Type | Count | Percentage |
|---|---|---|
| LIBRARY | 4,490 | 48.6% |
| EDUCATION_PROVIDER | 2,011 | 21.8% |
| OFFICIAL_INSTITUTION | 1,200 | 13.0% |
| RESEARCH_CENTER | 1,096 | 11.9% |
| ARCHIVE | 235 | 2.5% |
| MUSEUM | 205 | 2.2% |

By Province (Top 5)

| Province | Count |
|---|---|
| Ontario | 3,335 |
| Quebec | 1,812 |
| Alberta | 1,259 |
| British Columbia | 923 |
| Saskatchewan | 564 |

Data Quality

  • Data Tier: TIER_1_AUTHORITATIVE (government registry)
  • Confidence Score: 0.98
  • GHCID Format: CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]
  • UUID Strategy: UUID v5 (SHA-1) + UUID v8 (SHA-256) + numeric identifier
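
The identifier scheme above can be illustrated with a minimal sketch. It assumes the UUIDs are derived from the GHCID string itself; the namespace UUID, the choice of the first 16 and first 8 digest bytes, and all function names are hypothetical and are not taken from the project's code, so the values it produces will not match the stored identifiers.

```python
import hashlib
import uuid

# Hypothetical namespace: the project's real namespace UUID is not stated here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")


def build_ghcid(province: str, city_code: str, type_code: str, abbrev: str) -> str:
    """Assemble CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]."""
    return "-".join(["CA", province, city_code, type_code, abbrev])


def ghcid_uuid_v5(ghcid: str) -> uuid.UUID:
    """Deterministic, SHA-1 based UUID v5 of the GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_v8(ghcid: str) -> uuid.UUID:
    """UUID v8 built from the first 16 bytes of a SHA-256 digest (assumed layout)."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set the version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set the RFC variant bits
    return uuid.UUID(bytes=bytes(raw))


def ghcid_numeric(ghcid: str) -> int:
    """64-bit numeric identifier from the first 8 digest bytes (assumed layout)."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


ghcid = build_ghcid("AB", "AND", "L", "AML")
print(ghcid, ghcid_uuid_v5(ghcid), ghcid_uuid_v8(ghcid), ghcid_numeric(ghcid))
```

Because every value is a pure function of the GHCID string, re-running the conversion yields the same identifiers, which is the property "deterministic" refers to.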

Technical Implementation

Parser Features

The CanadianISILParser class implements:

  1. Institution Type Inference: Detects library, archive, museum, education, government, research types from name patterns
  2. City Name Normalization: Handles ALL CAPS city names, converts to title case
  3. Province Code Mapping: Maps full province names to 2-letter codes (AB, BC, ON, QC, etc.)
  4. GHCID Generation: Creates deterministic identifiers with Canadian format
  5. Status Mapping: Maps "active"/"closed" to ACTIVE/INACTIVE enum values
  6. ISIL Validation: Validates Canadian ISIL format (CA-XXXX)
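
Several of the features listed above can be sketched as small lookup tables and helpers. This is an illustration only: the regular expressions, the default of LIBRARY, the ISIL pattern, and every name below are assumptions, and the actual CanadianISILParser may be organised quite differently.

```python
import re

# Province and territory names mapped to their 2-letter postal codes.
PROVINCE_CODES = {
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB",
    "New Brunswick": "NB", "Newfoundland and Labrador": "NL",
    "Northwest Territories": "NT", "Nova Scotia": "NS", "Nunavut": "NU",
    "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT",
}

# Registry status strings mapped to the uppercase enum values.
STATUS_MAP = {"active": "ACTIVE", "closed": "INACTIVE"}

# Hypothetical name patterns, checked in order; the first match wins.
TYPE_PATTERNS = [
    ("ARCHIVE", re.compile(r"\barchives?\b", re.I)),
    ("MUSEUM", re.compile(r"\b(museum|musée|gallery)\b", re.I)),
    ("EDUCATION_PROVIDER", re.compile(r"\b(university|université|college|school)\b", re.I)),
    ("RESEARCH_CENTER", re.compile(r"\b(research|institute|laboratory)\b", re.I)),
    ("OFFICIAL_INSTITUTION", re.compile(r"\b(ministry|department|government|agency)\b", re.I)),
]

# Assumed shape of a Canadian ISIL: "CA-" followed by an alphanumeric symbol.
ISIL_PATTERN = re.compile(r"^CA-[A-Z0-9]+$")


def infer_type(name: str) -> str:
    """Guess the institution type from its name, defaulting to LIBRARY."""
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.search(name):
            return type_name
    return "LIBRARY"


def normalize_city(city: str) -> str:
    """Convert ALL CAPS city names to title case; leave mixed case untouched."""
    return city.title() if city.isupper() else city


def is_valid_isil(code: str) -> bool:
    return bool(ISIL_PATTERN.match(code))


print(infer_type("Andrew Municipal Library"), normalize_city("ANDREW"),
      PROVINCE_CODES["Alberta"], STATUS_MAP["active"], is_valid_isil("CA-AA"))
```

Ordering the patterns matters: a name such as "University Archives" matches both the archive and education patterns, and this sketch resolves it in favour of whichever appears first in the list.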

Known Issues

329 records failed conversion because the city code (LOCODE) generated from the city name was invalid:

  • Special characters: Cities with accents (Québec → "QUÉ") or apostrophes (L'Anse → "L'A")
  • Abbreviations and edge cases: "St." (Saint), very short names such as "La", and leading numbers ("100 Mile House" → "100")
  • Spaces: City names with spaces create invalid city codes

Solution: Implement better city code normalization:

  • Remove accents (Québec → Quebec → QUE)
  • Expand abbreviations (St. → Saint → SAI)
  • Handle apostrophes (L'Anse → Lanse → LAN)
  • Filter out numbers/special characters
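
The normalization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's _create_city_locode() method: the function name, the abbreviation table, and the padding of short names with "X" are all assumptions.

```python
import re
import unicodedata

# Hypothetical abbreviation table; extend as the data requires.
ABBREVIATIONS = {"st": "Saint", "ste": "Sainte", "mt": "Mount"}


def city_code(city: str, length: int = 3) -> str:
    """Derive an uppercase ASCII city code from a city name."""
    # 1. Remove accents: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", city)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 2. Drop apostrophes so "L'Anse" becomes "LAnse".
    ascii_name = re.sub(r"['’]", "", ascii_name)
    # 3. Keep only alphabetic words; digits and punctuation are discarded.
    words = re.findall(r"[A-Za-z]+", ascii_name)
    # 4. Expand whole-word abbreviations ("St" -> "Saint").
    words = [ABBREVIATIONS.get(word.lower(), word) for word in words]
    letters = "".join(words).upper()
    if not letters:
        raise ValueError(f"no usable letters in city name: {city!r}")
    # 5. Truncate, padding short names with "X" (padding is an assumption).
    return letters[:length].ljust(length, "X")


for name in ["Québec", "St. John's", "L'Anse", "100 Mile House"]:
    print(name, "->", city_code(name))
```

Abbreviations are matched on whole words only, so a name like "Stettler" is not mistaken for "St." followed by more text.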

Files Created/Modified

New Files

  1. Scraper: scripts/scrapers/scrape_canadian_isil_fast.py - Fast list-page scraper
  2. Parser: src/glam_extractor/parsers/canadian_isil.py - LinkML converter
  3. Test: test_canadian_parser.py - Parser validation script
  4. Converter: convert_canadian_to_linkml.py - Bulk conversion script
  5. Data:
    • data/isil/canada/*.json - Raw scraped data
    • data/instances/canada/*.json|yaml - LinkML-formatted output
  6. Documentation: docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md - Session log

Modified Files

  1. Parser Init: src/glam_extractor/parsers/__init__.py - Added CanadianISILParser export

Next Steps (Future Work)

Task 1: Fix City Code Normalization (HIGH PRIORITY)

Improve the _create_city_locode() method to handle:

  • Accented characters (é, è, ê, ô, etc.)
  • Apostrophes and hyphens
  • Abbreviations (St., Ste., Mt., etc.)
  • Short names (< 3 characters)
  • Special cases (numbers, symbols)

Impact: Should recover the 329 failed records (3.4% of the total)

Task 2: Enrich with Detail Pages (MEDIUM PRIORITY)

Extract additional fields from detail pages:

  • Contact information: Address, phone, email, website
  • Operational details: Hours, services, policies
  • Administrative info: Director, founded date, notes

Tool: Modify scripts/scrapers/scrape_canadian_isil.py (slow but comprehensive)
Time estimate: ~2.5 hours for 9,566 records
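
As a rough sanity check on that estimate, the arithmetic below derives the per-record time budget it implies and compares it with the ~4 minutes the list-page scrape took. The helper names are hypothetical; the inputs are the figures quoted in this document.

```python
def seconds_per_record(records: int, total_hours: float) -> float:
    """Average time budget per record implied by a total duration."""
    return total_hours * 3600 / records


def slowdown(slow_minutes: float, fast_minutes: float) -> float:
    """How many times longer the slow run takes than the fast one."""
    return slow_minutes / fast_minutes


per_record = seconds_per_record(9_566, 2.5)
ratio = slowdown(2.5 * 60, 4)
print(f"{per_record:.2f} s per record, ~{ratio:.1f}x slower than the list scrape")
# prints "0.94 s per record, ~37.5x slower than the list scrape"
```

A budget of just under one second per record is consistent with issuing roughly one detail-page request per second.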

Task 3: Geocoding (LOW PRIORITY)

Add latitude/longitude coordinates:

  • Use Nominatim API with rate limiting (1 req/sec)
  • Cache results to avoid repeated lookups
  • Handle ambiguous city names

Script: scripts/enrich_geocoding.py (already exists)
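
The three points above (rate limiting, caching, disambiguation) can be sketched as follows. This is not the contents of scripts/enrich_geocoding.py, which is not shown here: the class name, the User-Agent string, and the injectable lookup, clock, and sleep parameters are assumptions made so the logic can be exercised without network access.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def nominatim_lookup(query: str) -> Coordinates:
    """Query the public Nominatim search endpoint for a single best match."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        headers={"User-Agent": "glam-extractor-example/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


class CachedGeocoder:
    """Geocode cities with an in-memory cache and a minimum request interval."""

    def __init__(self, lookup: Callable[[str], Coordinates] = nominatim_lookup,
                 min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._lookup, self._min_interval = lookup, min_interval
        self._clock, self._sleep = clock, sleep
        self._cache: Dict[Tuple[str, str, str], Coordinates] = {}
        self._last_call: Optional[float] = None

    def geocode(self, city: str, region: str, country: str = "Canada") -> Coordinates:
        key = (city.strip().lower(), region.strip().lower(), country.strip().lower())
        if key in self._cache:  # cache hit: no request, no waiting
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)  # respect the 1 req/sec limit
        self._last_call = self._clock()
        # Including region and country narrows ambiguous city names.
        result = self._lookup(f"{city}, {region}, {country}")
        self._cache[key] = result
        return result
```

Calling `CachedGeocoder().geocode("Andrew", "Alberta")` would issue one live request; repeated calls for the same city and region are served from the cache. Persisting the cache to disk between runs is left out of this sketch.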

Task 4: Integration with Global Dataset (LOW PRIORITY)

Merge Canadian data with main GLAM dataset:

  • Cross-link with conversation-extracted Canadian institutions
  • Deduplicate by ISIL code
  • Resolve conflicts (use TIER_1 Canadian data as authoritative)
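
A minimal sketch of the deduplication and conflict rule, assuming records are plain dicts shaped like the example record in this document. The helper names are hypothetical, as is the assumption that every tier name starts with TIER_<n>, where a lower n means a more authoritative source.

```python
import re
from typing import Dict, Iterable, Optional


def isil_of(record: dict) -> Optional[str]:
    """Return the record's ISIL code in uppercase, or None if it has none."""
    for identifier in record.get("identifiers", []):
        if identifier.get("identifier_scheme") == "ISIL":
            return identifier.get("identifier_value", "").upper() or None
    return None


def tier_rank(record: dict) -> int:
    """Lower is more authoritative; unknown tiers sort last."""
    tier = record.get("provenance", {}).get("data_tier", "")
    match = re.match(r"TIER_(\d+)", tier)
    return int(match.group(1)) if match else 99


def merge_by_isil(records: Iterable[dict]) -> Dict[str, dict]:
    """Keep one record per ISIL code, preferring the most authoritative tier."""
    merged: Dict[str, dict] = {}
    for record in records:
        isil = isil_of(record)
        if isil is None:
            continue  # records without an ISIL need a different matching key
        current = merged.get(isil)
        if current is None or tier_rank(record) < tier_rank(current):
            merged[isil] = record
    return merged
```

This version keeps the winning record whole. A fuller merge might instead copy fields the authoritative record lacks from the lower-tier one, and would need a separate strategy for conversation-extracted institutions that carry no ISIL code.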

Schema Compliance

All converted records conform to LinkML schema v0.2.0:

  • Core Classes: HeritageCustodian, Location, Identifier, Provenance
  • Enumerations: InstitutionTypeEnum, DataSourceEnum, DataTierEnum, OrganizationStatusEnum
  • GHCID: Proper UUID v5, UUID v8, and numeric identifier generation
  • Provenance: Complete extraction metadata with confidence scores

Example Record

- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  institution_type: LIBRARY
  organization_status: ACTIVE
  ghcid_uuid: 2d0444bb-8934-571c-89d6-027bf0c87df4
  ghcid_current: CA-AB-AND-L-AML
  locations:
  - city: Andrew
    region: Alberta
    country: CA
  identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-AA
    identifier_url: https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000
  provenance:
    data_source: CSV_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: '2025-11-18T20:23:09.715778'
    confidence_score: 0.98

Lessons Learned

  1. City code generation is fragile: Need robust normalization for international characters
  2. Enum validation is strict: Must use exact uppercase values (ACTIVE vs active)
  3. LinkML models use different serialization: Use _as_json_obj() not model_dump()
  4. Fast scraping is effective: List-page scraping took ~4 minutes, versus an estimated ~2.5 hours for detail-page scraping (roughly 40x faster)
  5. Province diversity: Canadian data spans 13 provinces/territories with distinct patterns

References

  • LAC Directory: https://sigles-symbols.bac-lac.gc.ca/eng/Search
  • ISIL Standard: ISO 15511 (International Standard Identifier for Libraries and Related Organizations)
  • LinkML Schema: schemas/heritage_custodian.yaml (v0.2.0)
  • GHCID Specification: docs/GHCID_PID_SCHEME.md

Session Status: COMPLETE
Next Session: Fix city code normalization and recover 329 failed records