# Canadian ISIL Dataset - Enrichment Guide

**Current Status**: 9,566 records with basic metadata (100% complete)
**Next Steps**: Optional enrichment tasks to add contact info, geocoding, and integration

---

## Current Dataset Quality

### What We Have ✅

- **9,566 institutions** with 100% success rate
- **Basic Metadata**:
  - Institution name
  - ISIL code (CA-XXXX format)
  - City and province
  - Institution type (Library, Archive, Museum, etc.)
  - Organization status (Active/Inactive)
  - GHCID identifiers (UUID v5, UUID v8, numeric)
  - Detail page URLs
  - Data provenance (TIER_1_AUTHORITATIVE)

### What We're Missing 🔄

- **Contact Information**: Address, phone, email, website
- **Geographic Coordinates**: Latitude/longitude for mapping
- **Enriched Descriptions**: Operating hours, services, collection info
- **Cross-references**: Links to Wikidata, VIAF, other identifiers

---

## Enrichment Task 1: Contact Details from Detail Pages

### Objective

Extract additional metadata from LAC detail pages for all 9,566 institutions.

### What's Available

Based on the LAC website structure, detail pages contain:

- **Full address** (street, city, postal code)
- **Phone number**
- **Email address**
- **Website URL**
- **Operating hours** (for some institutions)
- **Service descriptions**
- **Director/Contact person**
- **Notes** (historical info, mergers, relocations)

### Implementation

**Tool Already Exists**: `scripts/scrapers/scrape_canadian_isil.py`
**Method**: `fetch_library_details()` - Extracts detail page data

**Usage**:

```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py --fetch-details
```

### Time Estimate

- **Rate**: ~1.2 seconds per detail page (Playwright navigation + parsing)
- **Total**: 9,566 records × 1.2 sec = **~3.2 hours**
- **Best time to run**: Overnight or during off-hours

### Expected Output Structure

```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
    - identifier_scheme: Website
      identifier_value: https://www.andrewlibrary.ca
  locations:
    - city: Andrew
      region: Alberta
      country: CA
      street_address: 4915 50th Street
      postal_code: T0B 0C0
  contact_info:
    phone: "+1-780-365-3131"
    email: "andrew.library@example.ca"
```

### Schema Mapping

The LinkML `Location` class already supports:

- `street_address` (string)
- `postal_code` (string)

For contact info, we may need to extend the schema or use a separate `ContactInfo` class.

### Command to Run

```bash
# Run detail scraper with rate limiting (1 req/sec to be polite)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py \
  --fetch-details \
  --rate-limit 1.0 \
  --output data/isil/canada/canadian_libraries_enriched.json
```
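### Merging Scraped Details Back Into the Dataset (Sketch)

Once the detail scrape finishes, the new fields have to be folded back into the main dataset. The sketch below is a minimal illustration, not the scraper's actual merge logic: it assumes the enriched file is a JSON list of flat dicts with an `isil` key plus `street_address`, `postal_code`, `phone`, `email`, and `website` fields, and the output filename is hypothetical. Adjust the paths and field names to whatever `--fetch-details` actually emits.

```python
#!/usr/bin/env python3
"""Fold scraped detail-page fields back into the main Canadian dataset.

Assumptions (adjust to the real scraper output):
- the enriched file is a JSON list of dicts with an "isil" key plus flat
  "street_address", "postal_code", "phone", "email", "website" fields
- the main dataset follows the record structure shown above
"""
import json

MAIN = "data/instances/canada/canadian_heritage_custodians.json"
DETAILS = "data/isil/canada/canadian_libraries_enriched.json"  # shape is an assumption
OUTPUT = "data/instances/canada/canadian_heritage_custodians_contacts.json"  # hypothetical name


def isil_of(record):
    """Return the record's ISIL code, if any."""
    for ident in record.get("identifiers", []):
        if ident.get("identifier_scheme") == "ISIL":
            return ident.get("identifier_value")
    return None


def merge_details(main_file=MAIN, details_file=DETAILS, output_file=OUTPUT):
    with open(main_file) as f:
        institutions = json.load(f)
    with open(details_file) as f:
        details_by_isil = {d["isil"]: d for d in json.load(f)}

    for inst in institutions:
        detail = details_by_isil.get(isil_of(inst))
        if not detail:
            continue
        # Address fields live on the Location class
        if inst.get("locations"):
            loc = inst["locations"][0]
            for field in ("street_address", "postal_code"):
                if detail.get(field):
                    loc[field] = detail[field]
        # Contact fields go into a contact_info block (pending the schema decision above)
        contact = {k: detail[k] for k in ("phone", "email", "website") if detail.get(k)}
        if contact:
            inst["contact_info"] = contact

    with open(output_file, "w") as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    merge_details()
```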
---

## Enrichment Task 2: Geocoding

### Objective

Add latitude/longitude coordinates to all 9,566 institutions for mapping and spatial analysis.

### Implementation Options

#### Option A: GeoNames Lookup (Fast, Offline)

**Tool**: `src/glam_extractor/geocoding/geonames_lookup.py`

**Advantages**:
- ✅ Fast (local SQLite database)
- ✅ No API rate limits
- ✅ Works offline
- ✅ High accuracy for cities (population > 1,000)

**Disadvantages**:
- ❌ May lack coordinates for very small towns
- ❌ Requires GeoNames database setup

**Setup**:

```bash
# Build GeoNames database (one-time, ~30 minutes)
cd /Users/kempersc/apps/glam
python3 scripts/build_geonames_db.py
```

**Usage**:

```python
from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup

geocoder = GeoNamesLookup()
city_info = geocoder.lookup_city("Toronto", "CA", admin1_code="ON")
print(f"Lat: {city_info.latitude}, Lon: {city_info.longitude}")
```

**Time Estimate**: ~5 minutes for 9,566 lookups (offline, instant)

#### Option B: Nominatim API (Accurate, Slow)

**Tool**: Nominatim (OpenStreetMap geocoding)

**Advantages**:
- ✅ Very accurate (street-level)
- ✅ Handles ambiguous addresses
- ✅ Free (with rate limits)

**Disadvantages**:
- ❌ Rate limit: 1 request/second
- ❌ Total time: 9,566 requests = **~2.7 hours**
- ❌ Requires internet connection

**Usage**:

```python
import requests
import time

def geocode_nominatim(city, province, country="CA"):
    """Geocode using Nominatim API with rate limiting."""
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "state": province,
        "country": country,
        "format": "json",
        "limit": 1
    }
    headers = {"User-Agent": "GLAM-Extractor/1.0"}

    response = requests.get(url, params=params, headers=headers)
    time.sleep(1)  # Rate limit: 1 req/sec

    if response.ok and response.json():
        result = response.json()[0]
        return float(result["lat"]), float(result["lon"])
    return None, None
```

#### Option C: Hybrid Approach (Best)

1. **Try GeoNames first** (fast, covers 95% of cases)
2. **Fall back to Nominatim** for misses (only ~500 lookups)
3. **Cache results** to avoid repeated API calls

**Time Estimate**: ~15 minutes total
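The full script in the next section performs only the GeoNames pass and reports misses, so here is a minimal sketch of what the fallback-with-cache logic could look like. It reuses `GeoNamesLookup` and assumes the `geocode_nominatim()` helper from Option B is importable or pasted alongside; the cache file path is an assumption.

```python
import json
from pathlib import Path

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup
# geocode_nominatim() is the helper defined under Option B above

CACHE_FILE = Path("data/geocoding_cache.json")  # hypothetical cache location


def hybrid_geocode(city, province, geocoder, cache):
    """Return (lat, lon) via GeoNames first, then Nominatim, caching results."""
    key = f"{city}|{province}"
    if key in cache:                                      # 3. cached -> no lookup
        return tuple(cache[key])

    city_info = geocoder.lookup_city(city, "CA", admin1_name=province)
    if city_info:                                         # 1. GeoNames hit (offline)
        coords = (city_info.latitude, city_info.longitude)
    else:                                                 # 2. Nominatim fallback (1 req/sec)
        coords = geocode_nominatim(city, province)

    if coords != (None, None):
        cache[key] = coords
    return coords


if __name__ == "__main__":
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    print(hybrid_geocode("Andrew", "Alberta", GeoNamesLookup(), cache))
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```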
### Geocoding Script

Create `scripts/geocode_canadian_institutions.py`:

```python
#!/usr/bin/env python3
"""
Geocode Canadian heritage institutions.

Uses hybrid approach:
1. GeoNames lookup (fast, offline) for cities
2. Nominatim fallback for misses
3. Cache results to avoid repeated lookups
"""

import json
import time
from pathlib import Path

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup


def geocode_canadian_institutions(input_file, output_file):
    """Add geocoding to Canadian institutions."""

    # Load institutions
    with open(input_file) as f:
        institutions = json.load(f)

    # Initialize geocoder
    geocoder = GeoNamesLookup()

    geocoded = 0
    misses = 0

    for inst in institutions:
        if not inst.get('locations'):
            continue

        location = inst['locations'][0]
        city = location.get('city')
        province = location.get('region')

        if not city or not province:
            continue

        # Try GeoNames lookup
        city_info = geocoder.lookup_city(city, "CA", admin1_name=province)

        if city_info:
            location['latitude'] = city_info.latitude
            location['longitude'] = city_info.longitude
            location['geonames_id'] = str(city_info.geonames_id)
            geocoded += 1
        else:
            misses += 1
            print(f"Miss: {city}, {province}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)

    print(f"\n✅ Geocoded: {geocoded} / {len(institutions)}")
    print(f"❌ Misses: {misses}")


if __name__ == "__main__":
    geocode_canadian_institutions(
        "data/instances/canada/canadian_heritage_custodians.json",
        "data/instances/canada/canadian_heritage_custodians_geocoded.json"
    )
```

**Run**:

```bash
cd /Users/kempersc/apps/glam
python3 scripts/geocode_canadian_institutions.py
```

---

## Enrichment Task 3: Integration with Global Dataset

### Objective

Merge Canadian data with the global GLAM dataset and resolve any duplicates.

### Steps

#### 1. Find Conversation-Extracted Canadian Institutions

Search existing conversation files for Canadian institutions:

```bash
cd /Users/kempersc/Documents/claude/glam
grep -l "Canada\|Canadian" *.json | head -10
```

#### 2. Cross-Reference by ISIL Code

Canadian ISIL codes follow the format `CA-XXXX`.

```python
def find_duplicates(canadian_tier1, conversation_tier4):
    """Find duplicate institutions by ISIL code."""

    # Build ISIL lookup for Canadian TIER_1 data
    tier1_by_isil = {}
    for inst in canadian_tier1:
        for identifier in inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                tier1_by_isil[identifier['identifier_value']] = inst

    # Check conversation data for matches
    duplicates = []
    for conv_inst in conversation_tier4:
        for identifier in conv_inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                isil = identifier['identifier_value']
                if isil in tier1_by_isil:
                    duplicates.append({
                        'isil': isil,
                        'tier1': tier1_by_isil[isil],
                        'tier4': conv_inst
                    })

    return duplicates
```

#### 3. Merge Strategy

When duplicates are found:

- **Keep TIER_1 data** (Canadian ISIL registry is authoritative)
- **Merge additional fields** from TIER_4 (descriptions, collection info)
- **Update provenance** to show data consolidation
- **Create GHCID history** if identifiers change
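A minimal sketch of that merge rule, applied to the output of `find_duplicates()` above, could look like the following. The field names `description`, `collection_info`, and `provenance_note` are illustrative assumptions, not confirmed schema slots; adapt them to the actual LinkML model.

```python
from copy import deepcopy


def merge_duplicate(tier1_record, tier4_record):
    """Merge a TIER_4 conversation record into its TIER_1 counterpart.

    TIER_1 values always win; TIER_4 only contributes fields that are missing.
    Field names below (description, collection_info, provenance_note) are
    illustrative assumptions, not the confirmed schema.
    """
    merged = deepcopy(tier1_record)

    # Merge additional fields from TIER_4 only where TIER_1 has nothing
    for field in ("description", "collection_info"):
        if not merged.get(field) and tier4_record.get(field):
            merged[field] = tier4_record[field]

    # Carry over identifiers TIER_1 does not already have
    existing = {(i["identifier_scheme"], i["identifier_value"])
                for i in merged.get("identifiers", [])}
    for ident in tier4_record.get("identifiers", []):
        key = (ident["identifier_scheme"], ident["identifier_value"])
        if key not in existing:
            merged.setdefault("identifiers", []).append(ident)

    # Note the consolidation (the exact provenance structure is an assumption)
    merged["provenance_note"] = f"Merged with TIER_4 record {tier4_record.get('id')}"

    return merged


def merge_all(duplicates):
    """Apply merge_duplicate() to every pair returned by find_duplicates()."""
    return [merge_duplicate(d["tier1"], d["tier4"]) for d in duplicates]
```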
#### 4. Export Unified Dataset

Create a combined dataset with:

- All TIER_1 Canadian institutions (9,566)
- Non-duplicate conversation institutions
- Merged metadata where applicable

---

## Enrichment Task 4: Wikidata Linking

### Objective

Link Canadian institutions to Wikidata entities for Linked Open Data integration.

### Implementation

**Query Wikidata SPARQL endpoint** for Canadian heritage institutions:

```sparql
SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .   # instance of museum (or subclass)
  ?item wdt:P17 wd:Q16 .                # country: Canada
  OPTIONAL { ?item wdt:P791 ?isil }     # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }     # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
```

**Match by**:

1. ISIL code (if available in Wikidata)
2. Fuzzy name matching (institution name similarity > 85%)
3. Geographic proximity (same city)
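Rules 1 and 2 can be prototyped with `requests` and the standard library. The sketch below is a minimal illustration, not a finished linking script: it assumes the public endpoint at `https://query.wikidata.org/sparql`, uses `difflib` for the similarity score (0.85 mirroring the 85% figure above), reuses the record field names from earlier in this guide, and leaves out geographic proximity (rule 3).

```python
import difflib
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same shape as the query above, trimmed to the fields used here
QUERY = """
SELECT ?item ?itemLabel ?isil WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .
  ?item wdt:P17 wd:Q16 .
  OPTIONAL { ?item wdt:P791 ?isil }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
"""


def fetch_wikidata_candidates(query=QUERY):
    """Run the SPARQL query and return a list of {qid, label, isil} dicts."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "GLAM-Extractor/1.0"},
        timeout=60,
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return [
        {
            "qid": row["item"]["value"].rsplit("/", 1)[-1],
            "label": row.get("itemLabel", {}).get("value", ""),
            "isil": row.get("isil", {}).get("value"),
        }
        for row in rows
    ]


def name_similarity(a, b):
    """Simple name similarity in [0, 1] via difflib."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_institution(inst, candidates, threshold=0.85):
    """Match rule 1 (exact ISIL), then rule 2 (fuzzy name > 85%)."""
    isil = next((i["identifier_value"] for i in inst.get("identifiers", [])
                 if i["identifier_scheme"] == "ISIL"), None)
    if isil:
        for cand in candidates:
            if cand["isil"] == isil:
                return cand["qid"]

    name = inst.get("name", "")
    best = max(candidates, key=lambda c: name_similarity(name, c["label"]), default=None)
    if best and name_similarity(name, best["label"]) >= threshold:
        return best["qid"]
    return None
```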
**Add Wikidata IDs** to `identifiers`:

```yaml
identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-OONL
  - identifier_scheme: Wikidata
    identifier_value: Q16959027  # National Library of Canada
    identifier_url: https://www.wikidata.org/wiki/Q16959027
```

---

## Priority Recommendations

Based on effort vs. value:

### High Priority (Do First)

1. **Geocoding with GeoNames** (15 minutes) - Enables mapping, high value
2. **Integration with global dataset** (30 minutes) - Consolidates data

### Medium Priority (Optional)

3. **Wikidata linking** (1 hour) - Adds LOD connectivity
4. **Contact details scraping** (3 hours) - Useful but time-intensive

### Low Priority (Future)

5. **Detailed descriptions** - Extract from detail pages
6. **Collection information** - May require separate API

---

## Scripts to Create

### 1. Geocoding Script

**File**: `scripts/geocode_canadian_institutions.py`
**Time**: 15 minutes to write, 15 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_geocoded.json`

### 2. Integration Script

**File**: `scripts/integrate_canadian_with_global.py`
**Time**: 30 minutes to write, 5 minutes to run
**Output**: `data/instances/global/unified_heritage_custodians.json`

### 3. Wikidata Linking Script

**File**: `scripts/enrich_canadian_with_wikidata.py`
**Time**: 1 hour to write, 30 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_wikidata.json`

---

## Testing Strategy

For each enrichment:

1. **Test with sample** (10-100 records first)
2. **Validate schema compliance** (LinkML validation)
3. **Check data quality** (manual review of samples)
4. **Run full batch** (all 9,566 records)
5. **Export and backup** (JSON + YAML formats)

---

## Resource Requirements

### Disk Space

- Current dataset: 14 MB
- With geocoding: ~16 MB (+2 MB for coordinates)
- With contact details: ~25 MB (+11 MB for addresses/phones/emails)
- With all enrichments: ~30 MB

### Time Investment

- **Geocoding**: 15 minutes (GeoNames) or 3 hours (Nominatim)
- **Contact scraping**: 3 hours (detail pages)
- **Wikidata linking**: 1.5 hours (SPARQL + fuzzy matching)
- **Integration**: 30 minutes (deduplication + merge)

**Total (all tasks)**: 5-8 hours depending on approach

---

## Decision Matrix

| Task | Value | Effort | Priority | When to Do |
|------|-------|--------|----------|------------|
| Geocoding | High | Low | 🔥 High | Now |
| Integration | High | Low | 🔥 High | Now |
| Wikidata | Medium | Medium | ⚠️ Medium | Next session |
| Contact Details | Medium | High | ⏸️ Low | If needed |

---

## Next Session Checklist

When continuing this work:

- [ ] Check if GeoNames database is built (`data/geonames/geonames.db`)
- [ ] Verify Playwright is installed for detail scraping
- [ ] Review Canadian dataset for any updates/changes
- [ ] Check if Wikidata has new Canadian institution entries
- [ ] Consider API rate limits before starting batch jobs

---

## Contact

For questions or issues with enrichment:

- Review `AGENTS.md` for extraction guidelines
- Check `docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md` for session history
- See LinkML schema at `schemas/heritage_custodian.yaml`

---

**Last Updated**: 2025-11-19
**Dataset Version**: 1.0 (9,566 records, 100% complete)
**Status**: Ready for enrichment