glam/docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md
2025-11-19 23:25:22 +01:00

# Canadian ISIL Extraction - Session Complete
**Date**: November 18, 2025
**Status**: ✅ COMPLETE
**Duration**: ~3 hours
## Summary
Extracted **9,566 library records** from Library and Archives Canada's Canadian Library Symbols Registry and converted **9,237 of them** (96.6%) to LinkML format.
---
## Accomplishments
### 1. Data Extraction (COMPLETE ✅)
**Source**: Library and Archives Canada - Canadian Library Symbols Registry
**URL**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
**Scraped Data**:
- **9,566 total library records** (6,520 active + 3,046 closed)
- **Extraction time**: ~4 minutes
- **Success rate**: 100%
- **Output files**:
  - `data/isil/canada/canadian_libraries_active.json` (2.2 MB)
  - `data/isil/canada/canadian_libraries_closed.json` (1.1 MB)
  - `data/isil/canada/canadian_libraries_all.json` (3.3 MB)
### 2. LinkML Conversion (COMPLETE ✅)
**Parser Created**: `src/glam_extractor/parsers/canadian_isil.py`
**Conversion Results**:
- **9,237 institutions successfully converted** (96.6% success rate)
- **329 records failed** (3.4%) because city code generation broke on city names containing accents, apostrophes, abbreviations, or digits
- **Output files**:
  - `data/instances/canada/canadian_heritage_custodians.json` (13 MB)
  - `data/instances/canada/canadian_heritage_custodians_sample.yaml` (116 KB)
---
## Data Statistics
### By Institution Type
| Type | Count | Percentage |
|------|-------|------------|
| **LIBRARY** | 4,490 | 48.6% |
| **EDUCATION_PROVIDER** | 2,011 | 21.8% |
| **OFFICIAL_INSTITUTION** | 1,200 | 13.0% |
| **RESEARCH_CENTER** | 1,096 | 11.9% |
| **ARCHIVE** | 235 | 2.5% |
| **MUSEUM** | 205 | 2.2% |
### By Province (Top 5)
| Province | Count |
|----------|-------|
| Ontario | 3,335 |
| Quebec | 1,812 |
| Alberta | 1,259 |
| British Columbia | 923 |
| Saskatchewan | 564 |
### Data Quality
- **Data Tier**: TIER_1_AUTHORITATIVE (government registry)
- **Confidence Score**: 0.98
- **GHCID Format**: `CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV]`
- **UUID Strategy**: UUID v5 (SHA-1) + UUID v8 (SHA-256) + numeric identifier
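
A minimal sketch of how the three identifier forms listed above could be derived from one GHCID string. The namespace UUID, the function names, and the rule for the numeric identifier (first 8 bytes of the SHA-256 digest) are assumptions made for illustration; the authoritative rules are in `docs/GHCID_PID_SCHEME.md`, so this sketch is not expected to reproduce the `ghcid_uuid` values in the converted data.

```python
import hashlib
import uuid

# Hypothetical namespace: the project's real namespace UUID is not stated in this log.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")


def build_ghcid(province: str, city_code: str, type_code: str, abbrev: str) -> str:
    """Assemble a GHCID in the CA-[PROVINCE]-[CITY]-[TYPE]-[ABBREV] layout."""
    return "-".join(["CA", province, city_code, type_code, abbrev]).upper()


def ghcid_uuid_v5(ghcid: str) -> uuid.UUID:
    """Name-based UUID v5 (SHA-1) of the GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_v8(ghcid: str) -> uuid.UUID:
    """Custom UUID v8 built from the first 16 bytes of a SHA-256 digest."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set the version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set the RFC 4122 variant bits
    return uuid.UUID(bytes=bytes(raw))


def ghcid_numeric(ghcid: str) -> int:
    """Assumed numeric form: first 8 bytes of the SHA-256 digest as an unsigned int."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


ghcid = build_ghcid("AB", "AND", "L", "AML")  # "CA-AB-AND-L-AML"
print(ghcid, ghcid_uuid_v5(ghcid), ghcid_uuid_v8(ghcid), ghcid_numeric(ghcid))
```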
---
## Technical Implementation
### Parser Features
The `CanadianISILParser` class implements:
1. **Institution Type Inference**: Detects library, archive, museum, education, government, research types from name patterns
2. **City Name Normalization**: Handles ALL CAPS city names, converts to title case
3. **Province Code Mapping**: Maps full province names to 2-letter codes (AB, BC, ON, QC, etc.)
4. **GHCID Generation**: Creates deterministic identifiers with Canadian format
5. **Status Mapping**: Maps "active"/"closed" to ACTIVE/INACTIVE enum values
6. **ISIL Validation**: Validates Canadian ISIL format (CA-XXXX)
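
The sketch below illustrates features 1, 3, 5, and 6 in standalone form. It is not the code in `canadian_isil.py`: the keyword rules, their ordering, the partial province table, the function names, and the ISIL pattern (a `CA-` prefix followed by up to 11 permitted characters) are all assumptions chosen to show the general approach.

```python
import re

# Partial table for illustration; a full mapping would cover all 13 provinces/territories.
PROVINCE_CODES = {
    "Alberta": "AB",
    "British Columbia": "BC",
    "Ontario": "ON",
    "Quebec": "QC",
    "Saskatchewan": "SK",
}

STATUS_MAP = {"active": "ACTIVE", "closed": "INACTIVE"}

# Assumed keyword rules: the first matching pattern wins, LIBRARY is the fallback.
TYPE_RULES = [
    ("ARCHIVE", re.compile(r"\barchives?\b", re.IGNORECASE)),
    ("MUSEUM", re.compile(r"\b(museum|gallery)\b", re.IGNORECASE)),
    ("EDUCATION_PROVIDER", re.compile(r"\b(university|college|school)\b", re.IGNORECASE)),
    ("OFFICIAL_INSTITUTION", re.compile(r"\b(government|ministry|department)\b", re.IGNORECASE)),
    ("RESEARCH_CENTER", re.compile(r"\b(research|institute|laboratory)\b", re.IGNORECASE)),
]

# Assumed shape of a Canadian ISIL: "CA-" plus 1 to 11 letters, digits, ':', '/', or '-'.
ISIL_RE = re.compile(r"CA-[A-Za-z0-9:/-]{1,11}")


def infer_type(name: str) -> str:
    """Return the first institution type whose keyword pattern matches the name."""
    for type_name, pattern in TYPE_RULES:
        if pattern.search(name):
            return type_name
    return "LIBRARY"


def map_status(raw: str) -> str:
    """Map the registry's 'active'/'closed' strings to the uppercase enum values."""
    return STATUS_MAP[raw.strip().lower()]


def is_valid_isil(code: str) -> bool:
    """Check the whole string against the assumed Canadian ISIL pattern."""
    return ISIL_RE.fullmatch(code) is not None


print(infer_type("Andrew Municipal Library"), map_status("active"), is_valid_isil("CA-AA"))
```

Because the first matching rule wins, a name such as "University Archives" resolves to ARCHIVE in this sketch; a different rule order would change that outcome.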
### Known Issues
**329 records failed conversion** because the generated city code (LOCODE) was invalid:
- **Special characters**: Accents (Québec → "QUÉ") and apostrophes (L'Anse → "L'A") carry over into the code
- **Abbreviations and short names**: "St." (Saint) and two-letter names such as "La"
- **Digits**: "100 Mile House" → "100"
- **Spaces**: City names with spaces can yield invalid two-letter codes
**Proposed fix**: Normalize the city name before deriving the code:
- Remove accents (Québec → Quebec → QUE)
- Expand abbreviations (St. → Saint → SAI)
- Remove apostrophes (L'Anse → Lanse → LAN)
- Drop digits and any remaining special characters
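
The normalization steps above could be sketched as follows. This is a standalone illustration rather than the parser's `_create_city_locode()` method: the function name `create_city_code`, the abbreviation table, and the choice to pad short names with "X" are assumptions.

```python
import re
import unicodedata

# Illustrative abbreviation table (assumed); extend as real data requires.
ABBREVIATIONS = {"st": "Saint", "ste": "Sainte", "mt": "Mount", "ft": "Fort"}


def create_city_code(city: str, length: int = 3) -> str:
    """Derive an uppercase ASCII city code from a possibly messy city name."""
    # 1. Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", city)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # 2. Remove straight and curly apostrophes (L'Anse -> LAnse).
    ascii_name = ascii_name.replace("'", "").replace("\u2019", "")

    # 3. Expand abbreviations word by word (St. -> Saint).
    words = []
    for word in re.split(r"[\s-]+", ascii_name.strip()):
        key = word.rstrip(".").lower()
        words.append(ABBREVIATIONS.get(key, word))

    # 4. Keep letters only, which drops digits and leftover punctuation.
    letters = re.sub(r"[^A-Za-z]", "", "".join(words))
    if not letters:
        raise ValueError(f"no usable letters in city name: {city!r}")

    # 5. Truncate, uppercase, and pad short names with 'X' (padding is an assumption).
    return letters[:length].upper().ljust(length, "X")


for name in ["Québec", "L'Anse", "St. Albert", "100 Mile House", "La"]:
    print(name, "->", create_city_code(name))
```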
---
## Files Created/Modified
### New Files
1. **Scraper**: `scripts/scrapers/scrape_canadian_isil_fast.py` - Fast list-page scraper
2. **Parser**: `src/glam_extractor/parsers/canadian_isil.py` - LinkML converter
3. **Test**: `test_canadian_parser.py` - Parser validation script
4. **Converter**: `convert_canadian_to_linkml.py` - Bulk conversion script
5. **Data**:
   - `data/isil/canada/*.json` - Raw scraped data
   - `data/instances/canada/*.json|yaml` - LinkML-formatted output
6. **Documentation**: `docs/sessions/CANADIAN_ISIL_EXTRACTION_20251118.md` - Session log
### Modified Files
1. **Parser Init**: `src/glam_extractor/parsers/__init__.py` - Added CanadianISILParser export
---
## Next Steps (Future Work)
### Task 1: Fix City Code Normalization (HIGH PRIORITY)
Improve `_create_city_locode()` method to handle:
- Accented characters (é, è, ê, ô, etc.)
- Apostrophes and hyphens
- Abbreviations (St., Ste., Mt., etc.)
- Short names (< 3 characters)
- Special cases (numbers, symbols)
**Impact**: Should recover most or all of the 329 failed records (3.4%)
### Task 2: Enrich with Detail Pages (MEDIUM PRIORITY)
Extract additional fields from detail pages:
- **Contact information**: Address, phone, email, website
- **Operational details**: Hours, services, policies
- **Administrative info**: Director, founded date, notes
**Tool**: Modify `scripts/scrapers/scrape_canadian_isil.py` (slow but comprehensive)
**Time estimate**: ~2.5 hours for 9,566 records
### Task 3: Geocoding (LOW PRIORITY)
Add latitude/longitude coordinates:
- Use Nominatim API with rate limiting (1 req/sec)
- Cache results to avoid repeated lookups
- Handle ambiguous city names
**Script**: `scripts/enrich_geocoding.py` (already exists)
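
As an illustration of the rate limiting and caching described above, here is a minimal sketch. It does not reflect the contents of `scripts/enrich_geocoding.py`; the class name `CachedGeocoder`, the injectable `fetch` callable, and the placeholder User-Agent string are assumptions. Including the province in the query is one simple way to reduce ambiguity between cities that share a name.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional, Tuple

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
Coords = Optional[Tuple[float, float]]


def fetch_nominatim(query: str) -> list:
    """Query the public Nominatim search endpoint and return the decoded JSON list."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        headers={"User-Agent": "glam-geocoding-sketch"},  # placeholder identifier
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


class CachedGeocoder:
    """Geocode city/region pairs with an in-memory cache and a minimum request interval."""

    def __init__(self, fetch: Callable[[str], list] = fetch_nominatim,
                 min_interval: float = 1.0) -> None:
        self.fetch = fetch
        self.min_interval = min_interval  # seconds between outgoing requests
        self.cache: Dict[Tuple[str, str, str], Coords] = {}
        self._last_request: Optional[float] = None

    def geocode(self, city: str, region: str, country: str = "Canada") -> Coords:
        key = (city.lower(), region.lower(), country.lower())
        if key in self.cache:
            return self.cache[key]  # cache hit: no request, no waiting

        # Sleep only as long as needed to respect the minimum interval.
        if self._last_request is not None:
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)

        results = self.fetch(f"{city}, {region}, {country}")
        self._last_request = time.monotonic()

        coords = (float(results[0]["lat"]), float(results[0]["lon"])) if results else None
        self.cache[key] = coords  # misses are cached too, to avoid repeated lookups
        return coords
```

Passing a stub as `fetch` keeps the class testable offline; persisting `cache` to disk between runs would be a natural extension.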
### Task 4: Integration with Global Dataset (LOW PRIORITY)
Merge Canadian data with main GLAM dataset:
- Cross-link with conversation-extracted Canadian institutions
- Deduplicate by ISIL code
- Resolve conflicts (use TIER_1 Canadian data as authoritative)
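
A minimal sketch of the deduplication and conflict rule described above, using plain dictionaries shaped like the converted records. The helper names and the way tier priority is read (the number embedded in `data_tier`, lower wins) are assumptions; any tier value other than `TIER_1_AUTHORITATIVE`, such as the `TIER_4_EXAMPLE` used here, is a hypothetical placeholder.

```python
import re
from typing import Dict, Iterable, List, Optional


def isil_of(record: dict) -> Optional[str]:
    """Return the record's ISIL code, or None if it has no ISIL identifier."""
    for identifier in record.get("identifiers", []):
        if identifier.get("identifier_scheme") == "ISIL":
            return identifier.get("identifier_value")
    return None


def tier_rank(record: dict) -> int:
    """Read the number out of values like 'TIER_1_AUTHORITATIVE'; unknown tiers rank last."""
    tier = record.get("provenance", {}).get("data_tier", "")
    match = re.match(r"TIER_(\d+)", tier)
    return int(match.group(1)) if match else 99


def merge_by_isil(*datasets: Iterable[dict]) -> List[dict]:
    """Keep one record per ISIL code, preferring the lowest tier number."""
    merged: Dict[str, dict] = {}
    without_isil: List[dict] = []
    for dataset in datasets:
        for record in dataset:
            isil = isil_of(record)
            if isil is None:
                without_isil.append(record)  # cannot be deduplicated by ISIL
                continue
            current = merged.get(isil)
            if current is None or tier_rank(record) < tier_rank(current):
                merged[isil] = record
    return list(merged.values()) + without_isil


registry = [{
    "name": "Andrew Municipal Library",
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "CA-AA"}],
    "provenance": {"data_tier": "TIER_1_AUTHORITATIVE"},
}]
extracted = [{
    "name": "Andrew Library",
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "CA-AA"}],
    "provenance": {"data_tier": "TIER_4_EXAMPLE"},  # hypothetical lower-priority tier
}]
print([r["name"] for r in merge_by_isil(extracted, registry)])
```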
---
## Schema Compliance
All converted records conform to LinkML schema v0.2.0:
- **Core Classes**: `HeritageCustodian`, `Location`, `Identifier`, `Provenance`
- **Enumerations**: `InstitutionTypeEnum`, `DataSourceEnum`, `DataTierEnum`, `OrganizationStatusEnum`
- **GHCID**: Proper UUID v5, UUID v8, and numeric identifier generation
- **Provenance**: Complete extraction metadata with confidence scores
### Example Record
```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  institution_type: LIBRARY
  organization_status: ACTIVE
  ghcid_uuid: 2d0444bb-8934-571c-89d6-027bf0c87df4
  ghcid_current: CA-AB-AND-L-AML
  locations:
    - city: Andrew
      region: Alberta
      country: CA
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
      identifier_url: https://sigles-symbols.bac-lac.gc.ca/eng/Search/Details?Id=3000
  provenance:
    data_source: CSV_REGISTRY
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: '2025-11-18T20:23:09.715778'
    confidence_score: 0.98
```
---
## Lessons Learned
1. **City code generation is fragile**: Need robust normalization for international characters
2. **Enum validation is strict**: Must use exact uppercase values (ACTIVE vs active)
3. **LinkML models serialize differently from Pydantic models**: Use `_as_json_obj()`, not `model_dump()`
4. **Fast scraping is effective**: List-page scraping took ~4 minutes, versus an estimated ~2.5 hours for detail-page scraping (roughly 40x faster)
5. **Province diversity**: Canadian data spans 13 provinces/territories with distinct patterns
---
## References
- **LAC Directory**: https://sigles-symbols.bac-lac.gc.ca/eng/Search
- **ISIL Standard**: ISO 15511 (International Standard Identifier for Libraries and Related Organizations)
- **LinkML Schema**: `schemas/heritage_custodian.yaml` (v0.2.0)
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
---
**Session Status**: COMPLETE
**Next Session**: Fix city code normalization and recover 329 failed records