# Canadian ISIL Dataset - Enrichment Guide

**Current Status:** 9,566 records with basic metadata (100% complete)
**Next Steps:** Optional enrichment tasks to add contact info, geocoding, and integration
## Current Dataset Quality

### What We Have ✅

- 9,566 institutions with 100% success rate
- Basic metadata:
  - Institution name
  - ISIL code (CA-XXXX format)
  - City and province
  - Institution type (Library, Archive, Museum, etc.)
  - Organization status (Active/Inactive)
  - GHCID identifiers (UUID v5, UUID v8, numeric)
  - Detail page URLs
  - Data provenance (TIER_1_AUTHORITATIVE)
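Since GHCIDs include UUID v5 identifiers, it helps to recall that a UUID v5 is deterministic over a namespace plus a name, which is what makes it usable as a stable identifier. A minimal illustration only; the namespace and input string below are hypothetical, since the actual GHCID derivation is not documented here:

```python
import uuid

# Hypothetical inputs: the real GHCID namespace and name format may differ.
ns = uuid.NAMESPACE_URL
ghcid = uuid.uuid5(ns, "https://w3id.org/heritage/custodian/ca/aa")

# The same namespace + name always yields the same UUID.
assert ghcid == uuid.uuid5(ns, "https://w3id.org/heritage/custodian/ca/aa")
```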
### What We're Missing 🔄
- Contact Information: Address, phone, email, website
- Geographic Coordinates: Latitude/longitude for mapping
- Enriched Descriptions: Operating hours, services, collection info
- Cross-references: Links to Wikidata, VIAF, other identifiers
## Enrichment Task 1: Contact Details from Detail Pages

### Objective

Extract additional metadata from LAC detail pages for all 9,566 institutions.

### What's Available

Based on the LAC website structure, detail pages contain:
- Full address (street, city, postal code)
- Phone number
- Email address
- Website URL
- Operating hours (for some institutions)
- Service descriptions
- Director/Contact person
- Notes (historical info, mergers, relocations)
### Implementation

**Tool Already Exists:** `scripts/scrapers/scrape_canadian_isil.py`

**Method:** `fetch_library_details()` extracts detail page data.

**Usage:**

```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py --fetch-details
```
### Time Estimate
- Rate: ~1.2 seconds per detail page (Playwright navigation + parsing)
- Total: 9,566 records × 1.2 sec = ~3.2 hours
- Best time to run: Overnight or during off-hours
### Expected Output Structure

```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
    - identifier_scheme: Website
      identifier_value: https://www.andrewlibrary.ca
  locations:
    - city: Andrew
      region: Alberta
      country: CA
      street_address: 4915 50th Street
      postal_code: T0B 0C0
  contact_info:
    phone: "+1-780-365-3131"
    email: "andrew.library@example.ca"
```
### Schema Mapping

The LinkML `Location` class already supports `street_address` (string) and `postal_code` (string).

For contact info, we may need to extend the schema or use a separate `ContactInfo` class.
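If a separate class is chosen, a LinkML sketch might look like the following. This is a hypothetical outline only; class and slot names must be aligned with the real schema in `schemas/heritage_custodian.yaml`:

```yaml
# Hypothetical sketch -- align names with schemas/heritage_custodian.yaml
classes:
  ContactInfo:
    description: Contact details for a heritage institution
    attributes:
      phone:
        range: string
      email:
        range: string
      website:
        range: string
```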
### Command to Run

```bash
# Run detail scraper with rate limiting (1 req/sec to be polite)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py \
  --fetch-details \
  --rate-limit 1.0 \
  --output data/isil/canada/canadian_libraries_enriched.json
```
## Enrichment Task 2: Geocoding

### Objective

Add latitude/longitude coordinates to all 9,566 institutions for mapping and spatial analysis.

### Implementation Options

#### Option A: GeoNames Lookup (Fast, Offline)

**Tool:** `src/glam_extractor/geocoding/geonames_lookup.py`
Advantages:
- ✅ Fast (local SQLite database)
- ✅ No API rate limits
- ✅ Works offline
- ✅ High accuracy for cities (population > 1,000)
Disadvantages:
- ❌ May lack coordinates for very small towns
- ❌ Requires GeoNames database setup
**Setup:**

```bash
# Build GeoNames database (one-time, ~30 minutes)
cd /Users/kempersc/apps/glam
python3 scripts/build_geonames_db.py
```

**Usage:**

```python
from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup

geocoder = GeoNamesLookup()
city_info = geocoder.lookup_city("Toronto", "CA", admin1_code="ON")
print(f"Lat: {city_info.latitude}, Lon: {city_info.longitude}")
```
**Time Estimate:** ~5 minutes for 9,566 lookups (each lookup is a local, near-instant SQLite query)
#### Option B: Nominatim API (Accurate, Slow)

**Tool:** Nominatim (OpenStreetMap geocoding)
Advantages:
- ✅ Very accurate (street-level)
- ✅ Handles ambiguous addresses
- ✅ Free (with rate limits)
Disadvantages:
- ❌ Rate limit: 1 request/second
- ❌ Total time: 9,566 requests = ~2.7 hours
- ❌ Requires internet connection
**Usage:**

```python
import requests
import time

def geocode_nominatim(city, province, country="CA"):
    """Geocode using Nominatim API with rate limiting."""
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "state": province,
        "country": country,
        "format": "json",
        "limit": 1,
    }
    headers = {"User-Agent": "GLAM-Extractor/1.0"}
    response = requests.get(url, params=params, headers=headers)
    time.sleep(1)  # Rate limit: 1 req/sec
    if response.ok and response.json():
        result = response.json()[0]
        return float(result["lat"]), float(result["lon"])
    return None, None
```
#### Option C: Hybrid Approach (Best)

1. Try GeoNames first (fast, covers ~95% of cases)
2. Fall back to Nominatim for misses (only ~500 lookups)
3. Cache results to avoid repeated API calls

**Time Estimate:** ~15 minutes total
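The hybrid flow above can be sketched as a small cache-first helper. This is a sketch, not the project's actual API: the two lookup callables and their `(lat, lon)` / `(None, None)` return shape are assumptions, and the cache here is an in-memory dict (persisting it to disk is left out):

```python
def hybrid_geocode(city, province, geonames_lookup, nominatim_lookup, cache):
    """GeoNames first, Nominatim fallback; cache every result in memory."""
    key = f"{city}|{province}"
    if key not in cache:
        lat, lon = geonames_lookup(city, province)       # fast, offline path
        if lat is None:
            lat, lon = nominatim_lookup(city, province)  # slow, rate-limited path
        cache[key] = (lat, lon)
    return cache[key]
```

Because the cache is keyed on city + province, repeated institutions in the same town cost one lookup, which is where most of the speedup over raw Nominatim comes from.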
### Geocoding Script

Create `scripts/geocode_canadian_institutions.py`:
```python
#!/usr/bin/env python3
"""
Geocode Canadian heritage institutions.

Uses a hybrid approach:
1. GeoNames lookup (fast, offline) for cities
2. Nominatim fallback for misses
3. Cache results to avoid repeated lookups
"""
import json

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup


def geocode_canadian_institutions(input_file, output_file):
    """Add geocoding to Canadian institutions."""
    # Load institutions
    with open(input_file) as f:
        institutions = json.load(f)

    # Initialize geocoder
    geocoder = GeoNamesLookup()
    geocoded = 0
    misses = 0

    for inst in institutions:
        if not inst.get('locations'):
            continue
        location = inst['locations'][0]
        city = location.get('city')
        province = location.get('region')
        if not city or not province:
            continue

        # Try GeoNames lookup
        city_info = geocoder.lookup_city(city, "CA", admin1_name=province)
        if city_info:
            location['latitude'] = city_info.latitude
            location['longitude'] = city_info.longitude
            location['geonames_id'] = str(city_info.geonames_id)
            geocoded += 1
        else:
            misses += 1
            print(f"Miss: {city}, {province}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)

    print(f"\n✅ Geocoded: {geocoded} / {len(institutions)}")
    print(f"❌ Misses: {misses}")


if __name__ == "__main__":
    geocode_canadian_institutions(
        "data/instances/canada/canadian_heritage_custodians.json",
        "data/instances/canada/canadian_heritage_custodians_geocoded.json",
    )
```
**Run:**

```bash
cd /Users/kempersc/apps/glam
python3 scripts/geocode_canadian_institutions.py
```
## Enrichment Task 3: Integration with Global Dataset

### Objective

Merge Canadian data with the global GLAM dataset and resolve any duplicates.

### Steps

#### 1. Find Conversation-Extracted Canadian Institutions

Search existing conversation files for Canadian institutions:

```bash
cd /Users/kempersc/Documents/claude/glam
grep -l "Canada\|Canadian" *.json | head -10
```
#### 2. Cross-Reference by ISIL Code

Canadian ISIL codes follow the format `CA-XXXX`:

```python
def find_duplicates(canadian_tier1, conversation_tier4):
    """Find duplicate institutions by ISIL code."""
    # Build ISIL lookup for Canadian TIER_1 data
    tier1_by_isil = {}
    for inst in canadian_tier1:
        for identifier in inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                tier1_by_isil[identifier['identifier_value']] = inst

    # Check conversation data for matches
    duplicates = []
    for conv_inst in conversation_tier4:
        for identifier in conv_inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                isil = identifier['identifier_value']
                if isil in tier1_by_isil:
                    duplicates.append({
                        'isil': isil,
                        'tier1': tier1_by_isil[isil],
                        'tier4': conv_inst,
                    })
    return duplicates
```
#### 3. Merge Strategy

When duplicates are found:
- Keep TIER_1 data (Canadian ISIL registry is authoritative)
- Merge additional fields from TIER_4 (descriptions, collection info)
- Update provenance to show data consolidation
- Create GHCID history if identifiers change
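The merge strategy above can be sketched as a field-level merge where registry data always wins. This is a sketch under assumptions: the enrichment field names and the `merged_from_tier4` marker are hypothetical, and real provenance/GHCID-history updates would need the actual schema:

```python
def merge_duplicate(tier1, tier4):
    """Keep TIER_1 fields; copy in fields only present in TIER_4."""
    merged = dict(tier1)                # registry data wins on conflicts
    for field, value in tier4.items():
        if field not in merged:
            merged[field] = value       # e.g. descriptions, collection info
    merged["merged_from_tier4"] = True  # hypothetical consolidation marker
    return merged
```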
#### 4. Export Unified Dataset

Create a combined dataset with:
- All TIER_1 Canadian institutions (9,566)
- Non-duplicate conversation institutions
- Merged metadata where applicable
## Enrichment Task 4: Wikidata Linking

### Objective

Link Canadian institutions to Wikidata entities for Linked Open Data integration.

### Implementation

Query the Wikidata SPARQL endpoint for Canadian heritage institutions:

```sparql
SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
  ?item wdt:P17 wd:Q16 .               # country: Canada
  OPTIONAL { ?item wdt:P791 ?isil }    # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }    # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
```
Match by:
- ISIL code (if available in Wikidata)
- Fuzzy name matching (institution name similarity > 85%)
- Geographic proximity (same city)
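For the fuzzy name matching above, one straightforward way to get a 0-1 similarity score is Python's standard-library `difflib.SequenceMatcher`; the 85% threshold comes from the criteria listed, but the helper itself is a sketch, not the project's matcher:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Case-insensitive similarity ratio (0.0-1.0) between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_name_match(a, b, threshold=0.85):
    """True when two institution names are similar enough to treat as a match."""
    return name_similarity(a, b) >= threshold
```

In practice a name match alone is weak evidence; combining it with the same-city check above cuts down false positives between similarly named branch libraries.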
Add Wikidata IDs to identifiers:

```yaml
identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-OONL
  - identifier_scheme: Wikidata
    identifier_value: Q16959027  # National Library of Canada
    identifier_url: https://www.wikidata.org/wiki/Q16959027
```
## Priority Recommendations

Based on effort vs. value:

### High Priority (Do First)
- Geocoding with GeoNames (15 minutes) - Enables mapping, high value
- Integration with global dataset (30 minutes) - Consolidates data
### Medium Priority (Optional)
- Wikidata linking (1 hour) - Adds LOD connectivity
- Contact details scraping (3 hours) - Useful but time-intensive
### Low Priority (Future)
- Detailed descriptions - Extract from detail pages
- Collection information - May require separate API
## Scripts to Create

### 1. Geocoding Script

- **File:** `scripts/geocode_canadian_institutions.py`
- **Time:** 15 minutes to write, 15 minutes to run
- **Output:** `data/instances/canada/canadian_heritage_custodians_geocoded.json`

### 2. Integration Script

- **File:** `scripts/integrate_canadian_with_global.py`
- **Time:** 30 minutes to write, 5 minutes to run
- **Output:** `data/instances/global/unified_heritage_custodians.json`

### 3. Wikidata Linking Script

- **File:** `scripts/enrich_canadian_with_wikidata.py`
- **Time:** 1 hour to write, 30 minutes to run
- **Output:** `data/instances/canada/canadian_heritage_custodians_wikidata.json`
## Testing Strategy

For each enrichment:

1. Test with a sample (10-100 records first)
2. Validate schema compliance (LinkML validation)
3. Check data quality (manual review of samples)
4. Run the full batch (all 9,566 records)
5. Export and backup (JSON + YAML formats)
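For the sampling step, a reproducible sample makes dry runs comparable between sessions. A small sketch (the function name and fixed seed are illustrative, not from the codebase):

```python
import random

def sample_records(records, n=100, seed=42):
    """Draw a reproducible sample for a dry run before the full batch."""
    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    return rng.sample(records, min(n, len(records)))
```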
## Resource Requirements

### Disk Space
- Current dataset: 14 MB
- With geocoding: ~16 MB (+2 MB for coordinates)
- With contact details: ~25 MB (+11 MB for addresses/phones/emails)
- With all enrichments: ~30 MB
### Time Investment
- Geocoding: 15 minutes (GeoNames) or 3 hours (Nominatim)
- Contact scraping: 3 hours (detail pages)
- Wikidata linking: 1.5 hours (SPARQL + fuzzy matching)
- Integration: 30 minutes (deduplication + merge)
Total (all tasks): 5-8 hours depending on approach
## Decision Matrix
| Task | Value | Effort | Priority | When to Do |
|---|---|---|---|---|
| Geocoding | High | Low | 🔥 High | Now |
| Integration | High | Low | 🔥 High | Now |
| Wikidata | Medium | Medium | ⚠️ Medium | Next session |
| Contact Details | Medium | High | ⏸️ Low | If needed |
## Next Session Checklist

When continuing this work:

- Check if the GeoNames database is built (`data/geonames/geonames.db`)
- Verify Playwright is installed for detail scraping
- Review the Canadian dataset for any updates/changes
- Check if Wikidata has new Canadian institution entries
- Consider API rate limits before starting batch jobs
## Contact

For questions or issues with enrichment:

- Review `AGENTS.md` for extraction guidelines
- Check `docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md` for session history
- See the LinkML schema at `schemas/heritage_custodian.yaml`
**Last Updated:** 2025-11-19
**Dataset Version:** 1.0 (9,566 records, 100% complete)
**Status:** Ready for enrichment