# Canadian ISIL Extraction - Session Complete

**Date**: November 18, 2025
**Status**: ✅ COMPLETE
**Duration**: ~3 hours

## Summary

Successfully extracted and converted **9,237 Canadian heritage institutions** from Library and Archives Canada's directory to LinkML format.

---

## Accomplishments

### 1. Data Extraction (COMPLETE ✅)

**Source**: Library and Archives Canada - Canadian Library Symbols Registry
**URL**: https://sigles-symbols.bac-lac.gc.ca/eng/Search

**Scraped Data**:
- **9,566 total library records** (6,520 active + 3,046 closed)
- **Extraction time**: ~4 minutes
- **Success rate**: 100%
- **Output files**:
  - `data/isil/canada/canadian_libraries_active.json` (2.2 MB)
  - `data/isil/canada/canadian_libraries_closed.json` (1.1 MB)
  - `data/isil/canada/canadian_libraries_all.json` (3.3 MB)

### 2. LinkML Conversion (COMPLETE ✅)

**Parser Created**: `src/glam_extractor/parsers/canadian_isil.py`

**Conversion Results**:
- **9,237 institutions successfully converted** (96.6% success rate)
- **329 records failed** (3.4%) - due to city names with special characters
- **Output files**:
  - `data/instances/canada/canadian_heritage_custodians.json` (13 MB)
  - `data/instances/canada/canadian_heritage_custodians_sample.yaml` (116 KB)

---

## Data Statistics

### By Institution Type

| Type | Count | Percentage |
|------|-------|------------|
| **LIBRARY** | 4,490 | 48.6% |
| **EDUCATION_PROVIDER** | 2,011 | 21.8% |
| **OFFICIAL_INSTITUTION** | 1,200 | 13.0% |
| **RESEARCH_CENTER** | 1,096 | 11.9% |
| **ARCHIVE** | 235 | 2.5% |
| **MUSEUM** | 205 | 2.2% |

### By Province (Top 5)

| Province | Count |
|----------|-------|
| Ontario | 3,335 |
| Quebec | 1,812 |
| Alberta | 1,259 |
| British Columbia | 923 |
| Saskatchewan | 564 |

### Data Quality

- **Data Tier**: TIER_1_AUTHORITATIVE (government registry)
- **Confidence Score**: 0.98
- **GHCID Format**: `CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]`
- **UUID Strategy**: UUID v5 (SHA-1) + UUID v8 (SHA-256) + numeric identifier

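
The identifier strategy listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the namespace UUID, the function name `ghcid_identifiers`, and the choice to take both the UUID v8 payload and the numeric identifier from the leading bytes of the SHA-256 digest are all hypothetical, so the output will not match the `ghcid_uuid` values in the converted data.

```python
import hashlib
import uuid

# Hypothetical namespace: the project's real namespace UUID is not stated in this log.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_identifiers(ghcid: str) -> dict:
    """Derive three deterministic identifiers from one GHCID string."""
    # UUID v5 is name-based and uses SHA-1 internally (standard library).
    uuid_v5 = uuid.uuid5(EXAMPLE_NAMESPACE, ghcid)

    # UUID v8 is a custom layout: here, the first 16 bytes of a SHA-256 digest
    # with the version and variant bits overwritten as RFC 9562 requires.
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    raw = bytearray(digest[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # version nibble = 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits = 10
    uuid_v8 = uuid.UUID(bytes=bytes(raw))

    # Assumed numeric form: the first 8 digest bytes as an unsigned 64-bit integer.
    numeric = int.from_bytes(digest[:8], "big")
    return {"uuid_v5": uuid_v5, "uuid_v8": uuid_v8, "numeric": numeric}


# The same GHCID always yields the same three identifiers.
ids = ghcid_identifiers("CA-AB-AND-L-AML")
print(ids["uuid_v5"], ids["uuid_v8"], ids["numeric"])
```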

---

## Technical Implementation

### Parser Features

The `CanadianISILParser` class implements:

1. **Institution Type Inference**: Detects library, archive, museum, education, government, research types from name patterns
2. **City Name Normalization**: Handles ALL CAPS city names, converts to title case
3. **Province Code Mapping**: Maps full province names to 2-letter codes (AB, BC, ON, QC, etc.)
4. **GHCID Generation**: Creates deterministic identifiers with Canadian format
5. **Status Mapping**: Maps "active"/"closed" to ACTIVE/INACTIVE enum values
6. **ISIL Validation**: Validates Canadian ISIL format (CA-XXXX)

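
A few of the features above (items 2, 3, 5, and 6) can be illustrated with a short sketch. The function names, the dictionary layout, and the ISIL pattern are assumptions made for illustration; the actual `CanadianISILParser` methods are not shown in this log and may differ.

```python
import re

# Standard two-letter abbreviations for the 13 provinces and territories.
PROVINCE_CODES = {
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB",
    "New Brunswick": "NB", "Newfoundland and Labrador": "NL",
    "Northwest Territories": "NT", "Nova Scotia": "NS", "Nunavut": "NU",
    "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT",
}

# Registry status strings mapped to the uppercase enum values.
STATUS_MAP = {"active": "ACTIVE", "closed": "INACTIVE"}

# Assumed simplification of the "CA-XXXX" format: prefix plus letters or digits.
ISIL_PATTERN = re.compile(r"CA-[A-Za-z0-9]+")


def normalize_city(raw: str) -> str:
    """Convert an ALL CAPS city name to title case.

    Note: str.title() also capitalizes after apostrophes and hyphens,
    so names such as "ST. JOHN'S" would need extra handling.
    """
    return raw.strip().title()


def province_code(name: str) -> str:
    """Look up the two-letter code; raises KeyError for unknown names."""
    return PROVINCE_CODES[name.strip()]


def map_status(raw: str) -> str:
    """Map a registry status to its enum value; raises KeyError if unmapped."""
    return STATUS_MAP[raw.strip().lower()]


def is_valid_isil(code: str) -> bool:
    """Check the assumed Canadian ISIL shape."""
    return ISIL_PATTERN.fullmatch(code) is not None
```

Keeping the mappings in plain dictionaries means an unmapped province or status fails loudly with a `KeyError` rather than passing through silently.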

### Known Issues

**329 records failed conversion** due to invalid city LOCODE generation:

- **Special characters**: Cities with accents (Québec → "QUÉ"), apostrophes (L'Anse → "L'A")
- **Abbreviations, short names, and numbers**: "St." (Saint), "La" (fewer than three letters), "100 Mile House" → "100"
- **Spaces**: City names with spaces create invalid city codes

**Solution**: Implement better city code normalization:
- Remove accents (Québec → Quebec → QUE)
- Expand abbreviations (St. → Saint → SAI)
- Handle apostrophes (L'Anse → Lanse → LAN)
- Filter out numbers/special characters

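
The normalization steps above could look roughly like the sketch below. It is not the current `_create_city_locode()` code. The function name `city_code`, the abbreviation table, and the padding of short names with `X` are assumptions chosen for illustration; the project may settle on different rules.

```python
import re
import unicodedata

# Hypothetical abbreviation table; extend as real data requires.
ABBREVIATIONS = {"st": "Saint", "ste": "Sainte", "mt": "Mount"}


def city_code(city: str) -> str:
    """Build a three-letter, ASCII-only city code from a city name."""
    # Remove accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", city)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Drop apostrophes so that "L'Anse" becomes "LAnse" rather than two words.
    ascii_name = ascii_name.replace("'", "").replace("\u2019", "")

    # Keep only runs of letters; digits, periods, and hyphens act as separators.
    words = re.findall(r"[A-Za-z]+", ascii_name)

    # Expand known abbreviations such as "St" to "Saint".
    words = [ABBREVIATIONS.get(word.lower(), word) for word in words]

    letters = "".join(words).upper()
    if not letters:
        raise ValueError(f"no letters left in city name: {city!r}")

    # Assumed rule for short names: pad to three characters with "X".
    return letters[:3].ljust(3, "X")


for name in ["Québec", "St. John's", "L'Anse", "100 Mile House", "La"]:
    print(name, city_code(name))
```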

---

## Files Created/Modified

### New Files

1. **Scraper**: `scripts/scrapers/scrape_canadian_isil_fast.py` - Fast list-page scraper
2. **Parser**: `src/glam_extractor/parsers/canadian_isil.py` - LinkML converter
3. **Test**: `test_canadian_parser.py` - Parser validation script
4. **Converter**: `convert_canadian_to_linkml.py` - Bulk conversion script
5. **Data**:
   - `data/isil/canada/*.json` - Raw scraped data
   - `data/instances/canada/*.json|yaml` - LinkML-formatted output
6. **Documentation**: `docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md` - Session log

### Modified Files

1. **Parser Init**: `src/glam_extractor/parsers/__init__.py` - Added CanadianISILParser export

---

## Next Steps (Future Work)

### Task 1: Fix City Code Normalization (HIGH PRIORITY)

Improve `_create_city_locode()` method to handle:
- Accented characters (é, è, ê, ô, etc.)
- Apostrophes and hyphens
- Abbreviations (St., Ste., Mt., etc.)
- Short names (< 3 characters)
- Special cases (numbers, symbols)

**Impact**: Will recover 329 failed records (3.4%)

### Task 2: Enrich with Detail Pages (MEDIUM PRIORITY)

Extract additional fields from detail pages:
- **Contact information**: Address, phone, email, website
- **Operational details**: Hours, services, policies
- **Administrative info**: Director, founded date, notes

**Tool**: Modify `scripts/scrapers/scrape_canadian_isil.py` (slow but comprehensive)
**Time estimate**: ~2.5 hours for 9,566 records

### Task 3: Geocoding (LOW PRIORITY)

Add latitude/longitude coordinates:
- Use Nominatim API with rate limiting (1 req/sec)
- Cache results to avoid repeated lookups
- Handle ambiguous city names

**Script**: `scripts/enrich_geocoding.py` (already exists)

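
As a rough illustration of the rate limiting and caching described above (not the contents of `scripts/enrich_geocoding.py`, which is not shown here), the sketch below wraps the public Nominatim search endpoint. The class name `CachedGeocoder`, the `fetch` parameter, and the User-Agent string are hypothetical; the query includes the province to reduce ambiguity between cities that share a name.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional, Tuple

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def fetch_nominatim(query: str) -> list:
    """Query the Nominatim search endpoint and return the parsed JSON list."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        # Placeholder value: Nominatim asks for an identifying User-Agent.
        headers={"User-Agent": "glam-extractor-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


class CachedGeocoder:
    """Geocode (city, province) pairs with an in-memory cache and a minimum delay."""

    def __init__(self, fetch: Callable[[str], list] = fetch_nominatim,
                 min_interval: float = 1.0) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._last_call: Optional[float] = None
        self._cache: Dict[Tuple[str, str], Optional[Tuple[float, float]]] = {}

    def geocode(self, city: str, province: str) -> Optional[Tuple[float, float]]:
        key = (city.casefold(), province.casefold())
        if key in self._cache:
            return self._cache[key]  # cached lookups never hit the network

        # Sleep just long enough to keep at most one request per interval.
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)

        results = self._fetch(f"{city}, {province}, Canada")
        self._last_call = time.monotonic()

        coords = None
        if results:
            coords = (float(results[0]["lat"]), float(results[0]["lon"]))
        self._cache[key] = coords  # misses are cached too
        return coords
```

Passing the fetch function in as a parameter keeps the rate-limit and cache logic testable without network access, for example by supplying a stub that returns canned results.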

### Task 4: Integration with Global Dataset (LOW PRIORITY)

Merge Canadian data with main GLAM dataset:
- Cross-link with conversation-extracted Canadian institutions
- Deduplicate by ISIL code
- Resolve conflicts (use TIER_1 Canadian data as authoritative)

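
The deduplication and conflict rule above can be sketched on plain dictionaries shaped like the converted records. The function names are hypothetical, and whole-record replacement (rather than field-by-field merging) is an assumption; the log only states that TIER_1 Canadian data should win.

```python
from typing import Dict, List, Optional

AUTHORITATIVE_TIER = "TIER_1_AUTHORITATIVE"


def isil_code(record: dict) -> Optional[str]:
    """Return the record's ISIL value, or None if it has no ISIL identifier."""
    for identifier in record.get("identifiers", []):
        if identifier.get("identifier_scheme") == "ISIL":
            return identifier.get("identifier_value")
    return None


def is_authoritative(record: dict) -> bool:
    return record.get("provenance", {}).get("data_tier") == AUTHORITATIVE_TIER


def merge_by_isil(existing: List[dict], canadian: List[dict]) -> List[dict]:
    """Deduplicate by ISIL code, letting authoritative records replace others."""
    merged: Dict[str, dict] = {}
    without_isil: List[dict] = []

    for record in [*existing, *canadian]:
        code = isil_code(record)
        if code is None:
            # Records without an ISIL code cannot be matched; keep them as-is.
            without_isil.append(record)
            continue
        current = merged.get(code)
        # Keep the first record seen unless a TIER_1 record can replace a weaker one.
        if current is None or (is_authoritative(record) and not is_authoritative(current)):
            merged[code] = record

    return list(merged.values()) + without_isil
```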

---

## Schema Compliance

All converted records conform to LinkML schema v0.2.0:

- **Core Classes**: `HeritageCustodian`, `Location`, `Identifier`, `Provenance`
- **Enumerations**: `InstitutionTypeEnum`, `DataSourceEnum`, `DataTierEnum`, `OrganizationStatusEnum`
- **GHCID**: Proper UUID v5, UUID v8, and numeric identifier generation
- **Provenance**: Complete extraction metadata with confidence scores

### Example Record

```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  institution_type: LIBRARY
  organization_status: ACTIVE
  ghcid_uuid: 2d0444bb-8934-571c-89d6-027bf0c87df4
  ghcid_current: CA-AB-AND-L-AML
  locations:
    - city: Andrew
      region: Alberta
      country: CA
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
      identifier_url: https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000
  provenance:
    data_source: CSV_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: '2025-11-18T20:23:09.715778'
    confidence_score: 0.98
```

---

## Lessons Learned

1. **City code generation is fragile**: Need robust normalization for international characters
2. **Enum validation is strict**: Must use exact uppercase values (ACTIVE vs active)
3. **LinkML models serialize differently from Pydantic models**: Use `_as_json_obj()`, not `model_dump()`
4. **Fast scraping is effective**: List-page scraping is far faster than detail-page scraping (~4 minutes versus an estimated ~2.5 hours for all 9,566 records)
5. **Province diversity**: Canadian data spans 13 provinces/territories with distinct patterns

---

## References

- **LAC Directory**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
- **ISIL Standard**: ISO 15511 (International Standard Identifier for Libraries and Related Organizations)
- **LinkML Schema**: `schemas/heritage_custodian.yaml` (v0.2.0)
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`

---

**Session Status**: ✅ COMPLETE
**Next Session**: Fix city code normalization and recover 329 failed records