glam/docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md
2025-11-19 23:25:22 +01:00


Canadian ISIL Database Extraction - Session Summary

Date: November 18, 2025
Source: Library and Archives Canada - Canadian Library Directory
URL: https://sigles-symbols.bac-lac.gc.ca/eng/Search

Summary

Extracted every record from the Canadian Library Directory, which lists the heritage institutions that hold a Canadian Library Symbol. ISIL codes were generated from those symbols.

Statistics

  • Total records extracted: 9,566
    • Active libraries: 6,520
    • Closed/Superseded: 3,046

Extraction Method

Created a Python scraper using Playwright that:

  1. Navigates through paginated search results (100 records per page)
  2. Extracts basic library information from list pages
  3. Generates ISIL codes in Canadian format (CA-[SYMBOL])
  4. Saves data in three JSON files
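Steps 3 and 4 above can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the function names `make_isil` and `write_outputs`, the wrapper keys `total_records` and `records`, and the second sample row are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def make_isil(symbol: str) -> str:
    """Build a Canadian ISIL code from a library symbol (CA-[SYMBOL])."""
    return f"CA-{symbol.strip()}"


def write_outputs(records: list[dict], out_dir: Path) -> dict[str, int]:
    """Split records by status and write the three JSON files."""
    active = [r for r in records if r["status"] == "active"]
    closed = [r for r in records if r["status"] == "closed"]
    files = {
        "canadian_libraries_active.json": active,
        "canadian_libraries_closed.json": closed,
        "canadian_libraries_all.json": active + closed,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, subset in files.items():
        # Hypothetical file layout; the real files may use other keys.
        payload = {"total_records": len(subset), "records": subset}
        text = json.dumps(payload, ensure_ascii=False, indent=2)
        (out_dir / name).write_text(text, encoding="utf-8")
    return {name: len(subset) for name, subset in files.items()}


if __name__ == "__main__":
    rows = [
        {"library_symbol": "AA", "name": "Andrew Municipal Library", "status": "active"},
        # Made-up row, only here to exercise the "closed" branch.
        {"library_symbol": "ZZX", "name": "Example Closed Library", "status": "closed"},
    ]
    for row in rows:
        row["isil_code"] = make_isil(row["library_symbol"])
    with tempfile.TemporaryDirectory() as tmp:
        print(write_outputs(rows, Path(tmp)))
```

Writing the combined file from the same in-memory lists keeps the three files consistent with each other by construction.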

Output Files

All files saved to: /Users/kempersc/apps/glam/data/isil/canada/

  1. canadian_libraries_active.json (2.2 MB)

    • 6,520 active libraries
    • Current, operational institutions
  2. canadian_libraries_closed.json (1.1 MB)

    • 3,046 closed/superseded libraries
    • Historical records, merged institutions, closed facilities
  3. canadian_libraries_all.json (3.3 MB)

    • Combined dataset
    • All 9,566 records

Data Structure

Each record contains:

{
  "isil_code": "CA-AA",
  "library_symbol": "AA",
  "name": "Andrew Municipal Library",
  "city": "Andrew",
  "province": "Alberta",
  "country": "CA",
  "library_id": "3000",
  "detail_url": "https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000",
  "status": "active"
}

Fields Extracted

  • isil_code: Canadian ISIL code (format: CA-[symbol])
  • library_symbol: Official Canadian Library Symbol (unique identifier)
  • name: Institution name
  • city: City location
  • province: Province/territory
  • country: Country code (CA)
  • library_id: Internal database ID
  • detail_url: Link to full detail page
  • status: "active" or "closed"
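The field list above lends itself to a small consistency check. The sketch below is an assumption about how such a check could look: `LibraryRecord` and `check_record` are hypothetical names, and the rule that `detail_url` ends in `library_id` is inferred only from the single sample record shown earlier.

```python
from dataclasses import dataclass, fields

# Prefix taken from the sample record; assumed to hold for every record.
DETAIL_URL_PREFIX = "https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id="


@dataclass(frozen=True)
class LibraryRecord:
    isil_code: str
    library_symbol: str
    name: str
    city: str
    province: str
    country: str
    library_id: str
    detail_url: str
    status: str


def check_record(raw: dict) -> list[str]:
    """Return the problems found in one raw record (empty list if none)."""
    expected = {f.name for f in fields(LibraryRecord)}
    missing = sorted(expected - set(raw))
    if missing:
        return [f"missing field: {name}" for name in missing]
    rec = LibraryRecord(**{name: raw[name] for name in expected})
    problems = []
    if rec.isil_code != f"CA-{rec.library_symbol}":
        problems.append("isil_code does not match CA-[symbol]")
    if rec.country != "CA":
        problems.append("country is not CA")
    if rec.status not in {"active", "closed"}:
        problems.append("status is neither 'active' nor 'closed'")
    if rec.detail_url != DETAIL_URL_PREFIX + rec.library_id:
        problems.append("detail_url does not end in library_id")
    return problems


if __name__ == "__main__":
    sample = {
        "isil_code": "CA-AA",
        "library_symbol": "AA",
        "name": "Andrew Municipal Library",
        "city": "Andrew",
        "province": "Alberta",
        "country": "CA",
        "library_id": "3000",
        "detail_url": DETAIL_URL_PREFIX + "3000",
        "status": "active",
    }
    print(check_record(sample))
```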

Coverage by Province

The database covers all Canadian provinces and territories:

  • Alberta (AB)
  • British Columbia (BC)
  • Manitoba (MB)
  • New Brunswick (NB)
  • Newfoundland and Labrador (NL)
  • Northwest Territories (NT)
  • Nova Scotia (NS)
  • Nunavut (NU)
  • Ontario (ON)
  • Prince Edward Island (PE)
  • Quebec (QC)
  • Saskatchewan (SK)
  • Yukon (YT)

Institution Types Included

Based on observation of the data, the database includes:

  • Public libraries (municipal, regional)
  • School libraries (elementary, secondary)
  • Academic libraries (colleges, universities)
  • Special libraries (government, corporate, research)
  • Archives
  • Museum libraries
  • Religious institution libraries

Performance

  • Extraction time: ~4 minutes (total)
    • Active libraries (66 pages): ~2.5 minutes
    • Closed libraries (31 pages): ~1.5 minutes
  • Request delay: ~0.5 seconds between page requests (polite scraping)
  • Success rate: 100% (all 97 pages successfully scraped)
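The page counts above follow directly from the record totals and the page size of 100. A short worked check, assuming one 0.5-second delay per page (the constant names are illustrative):

```python
import math

PAGE_SIZE = 100        # records per results page
PAGE_DELAY_S = 0.5     # approximate delay between page requests


def pages_needed(records: int, page_size: int = PAGE_SIZE) -> int:
    """Number of result pages required to list `records` entries."""
    return math.ceil(records / page_size)


active_pages = pages_needed(6520)
closed_pages = pages_needed(3046)
total_pages = active_pages + closed_pages
# Time spent purely waiting, assuming one delay per page.
delay_seconds = total_pages * PAGE_DELAY_S

print(active_pages, closed_pages, total_pages, delay_seconds)
# -> 66 31 97 48.5
```

Under that assumption the deliberate delays account for under a minute of the ~4 minutes; the rest is page loading and parsing.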

Scripts Created

  1. scrape_canadian_isil_fast.py

    • Fast scraper that extracts list-level data only
    • Located at: /Users/kempersc/apps/glam/scripts/scrapers/scrape_canadian_isil_fast.py
    • Usage: python3 scrape_canadian_isil_fast.py (full dataset)
    • Usage: python3 scrape_canadian_isil_fast.py --test (first 2 pages only)
  2. scrape_canadian_isil.py (slower, detailed version)

    • Includes fetching individual detail pages for each library
    • Located at: /Users/kempersc/apps/glam/scripts/scrapers/scrape_canadian_isil.py
    • Not used for full dataset due to time constraints (~1.2 sec per detail page)
    • Could be used later to enrich data with additional fields

Additional Data Available (Not Yet Extracted)

The detail pages contain much more information that could be extracted in a future pass:

  • Full address (street, postal code)
  • Telephone number
  • Fax number
  • Email address(es)
  • Library type classification
  • OCLC symbol
  • Lending policies (monographs, serials)
  • Photocopy policies
  • ILL (Interlibrary Loan) policies
  • Request methods (email, fax, web form)
  • Library system membership
  • Website URL
  • Notes/comments

Next Steps (Optional)

  1. Detail enrichment: Run the detailed scraper to fetch all additional fields from detail pages

    • Estimated time: ~2.7 to 3.2 hours (9,566 records × 1 to 1.2 seconds per record)
    • Would add contact info, policies, and other metadata
  2. Convert to LinkML format: Transform JSON data into the project's LinkML schema

    • Map to HeritageCustodian class
    • Assign institution types (LIBRARY, ARCHIVE, MUSEUM, etc.)
    • Add provenance metadata
    • Generate GHCIDs
  3. Geocoding: Add latitude/longitude coordinates for each institution

    • Use Nominatim API or Google Maps API
    • Would enable geographic visualization
  4. Cross-reference: Link with other datasets

    • Wikidata IDs
    • Library websites
    • Other Canadian heritage databases
  5. Export: Convert to other formats

    • CSV for spreadsheet analysis
    • RDF/Turtle for semantic web
    • Parquet for data warehousing
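Of the steps above, the CSV export is the simplest to illustrate. The sketch below uses only the Python standard library; `records_to_csv` is a hypothetical helper name, and the column order simply mirrors the field list in this document.

```python
import csv
import io

FIELDS = [
    "isil_code", "library_symbol", "name", "city", "province",
    "country", "library_id", "detail_url", "status",
]


def records_to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text with a fixed column order."""
    buf = io.StringIO(newline="")
    # extrasaction="ignore" drops keys that are not in FIELDS;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [{
        "isil_code": "CA-AA",
        "library_symbol": "AA",
        "name": "Andrew Municipal Library",
        "city": "Andrew",
        "province": "Alberta",
        "country": "CA",
        "library_id": "3000",
        "detail_url": "https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000",
        "status": "active",
    }]
    print(records_to_csv(sample))
```

Fixing the column order in one constant keeps the export stable if later enrichment adds fields to some records but not others.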

Data Quality Notes

Strengths:

  • Complete dataset (all 9,566 records)
  • Authoritative source (Library and Archives Canada)
  • Well-structured data
  • Includes historical records (closed/superseded)
  • ISIL codes generated for all institutions (derived from the library symbol)
  • Covers all Canadian provinces/territories

Limitations:

  • ⚠️ Basic data only (name, city, province, symbol)
  • ⚠️ No contact information in current extract
  • ⚠️ No geographic coordinates
  • ⚠️ No library type classification
  • ⚠️ No cross-references to Wikidata or other databases
  • ⚠️ City names not fully standardized (some all-caps, some mixed case)
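The inconsistent city casing noted above could be handled with a small normalization pass. This is a rough sketch under stated assumptions: `normalize_city` is a hypothetical helper, it only touches names written entirely in capitals, and it does not know language-specific rules (for example, it would capitalize the "sur" in a French place name).

```python
import re


def normalize_city(city: str) -> str:
    """Title-case a city name only when it is written entirely in capitals."""
    cleaned = " ".join(city.split())  # collapse stray whitespace
    if not cleaned.isupper():
        return cleaned  # mixed-case names are left as they are
    # Capitalize each run of characters between spaces and hyphens,
    # so apostrophes and periods stay inside their word.
    return re.sub(r"[^\s-]+", lambda m: m.group(0).capitalize(), cleaned)


if __name__ == "__main__":
    for name in ["EDMONTON", "ST. JOHN'S", "TROIS-RIVIÈRES", "Andrew"]:
        print(name, "->", normalize_city(name))
```

Plain `str.title()` is avoided here because it would turn "ST. JOHN'S" into "St. John'S".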

Data Tier Classification

According to the project's data quality framework:

  • Data Source: CSV_REGISTRY (authoritative government source)
  • Data Tier: TIER_1_AUTHORITATIVE
  • Confidence Score: 0.95-1.0 (government registry data)

Canadian ISIL Format

Canadian ISIL codes follow the pattern: CA-[SYMBOL]

Examples:

  • CA-AA - Andrew Municipal Library
  • CA-OONL - National Library of Canada (now Library and Archives Canada)
  • CA-OTU - University of Toronto
  • CA-QMM - McGill University

The symbol portion is assigned by Library and Archives Canada and is unique within the Canadian system.
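A minimal sketch of checking the CA-[SYMBOL] pattern and recovering the symbol from a code. The allowed character set and the 11-character limit are assumptions based on a general reading of the ISIL standard and should be verified against the actual symbols in the dataset; `symbol_from_isil` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed shape: "CA-" followed by 1 to 11 letters, digits, "/", ":" or "-".
ISIL_CA = re.compile(r"^CA-([A-Za-z0-9/:\-]{1,11})$")


def symbol_from_isil(code: str) -> Optional[str]:
    """Return the library symbol of a Canadian ISIL code, or None if invalid."""
    match = ISIL_CA.match(code.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    for code in ["CA-AA", "CA-OONL", "CA-", "NL-123"]:
        print(code, "->", symbol_from_isil(code))
```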

Integration with Global GLAM Project

This Canadian dataset should be integrated into the global heritage custodian database alongside:

  • Dutch ISIL registry (364 institutions)
  • Dutch organizations CSV (1,351 institutions)
  • Conversation-extracted institutions from 139 JSON files

Total heritage custodians after integration: 11,281+ institutions worldwide

Commands Used

# Make scraper executable
chmod +x /Users/kempersc/apps/glam/scripts/scrapers/scrape_canadian_isil_fast.py

# Run full extraction
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil_fast.py

# Test mode (first 2 pages only)
python3 scripts/scrapers/scrape_canadian_isil_fast.py --test

# Check output
ls -lh data/isil/canada/*.json
jq '.record_count // .total_records' data/isil/canada/*.json

Repository Location

  • Data files: /Users/kempersc/apps/glam/data/isil/canada/
  • Scripts: /Users/kempersc/apps/glam/scripts/scrapers/
  • Project root: /Users/kempersc/apps/glam/

Status: COMPLETE
Quality: HIGH (authoritative source, complete dataset)
Extraction Date: 2025-11-18T20:23:09
Source database last updated: 2024-11-05