Canadian ISIL Dataset - Enrichment Guide

Current Status: 9,566 records with basic metadata (100% complete)
Next Steps: Optional enrichment tasks to add contact info, geocoding, and integration


Current Dataset Quality

What We Have

  • 9,566 institutions extracted with a 100% success rate
  • Basic Metadata:
    • Institution name
    • ISIL code (CA-XXXX format)
    • City and province
    • Institution type (Library, Archive, Museum, etc.)
    • Organization status (Active/Inactive)
    • GHCID identifiers (UUID v5, UUID v8, numeric)
    • Detail page URLs
    • Data provenance (TIER_1_AUTHORITATIVE)

What We're Missing 🔄

  • Contact Information: Address, phone, email, website
  • Geographic Coordinates: Latitude/longitude for mapping
  • Enriched Descriptions: Operating hours, services, collection info
  • Cross-references: Links to Wikidata, VIAF, other identifiers

Enrichment Task 1: Contact Details from Detail Pages

Objective

Extract additional metadata from LAC detail pages for all 9,566 institutions.

What's Available

Based on the LAC website structure, detail pages contain:

  • Full address (street, city, postal code)
  • Phone number
  • Email address
  • Website URL
  • Operating hours (for some institutions)
  • Service descriptions
  • Director/Contact person
  • Notes (historical info, mergers, relocations)

Implementation

Tool Already Exists: scripts/scrapers/scrape_canadian_isil.py

Method: fetch_library_details() - Extracts detail page data

Usage:

cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py --fetch-details

Time Estimate

  • Rate: ~1.2 seconds per detail page (Playwright navigation + parsing)
  • Total: 9,566 records × 1.2 sec = ~3.2 hours
  • Best time to run: Overnight or during off-hours

Expected Output Structure

- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-AA
  - identifier_scheme: Website
    identifier_value: https://www.andrewlibrary.ca
  locations:
  - city: Andrew
    region: Alberta
    country: CA
    street_address: 4915 50th Street
    postal_code: T0B 0C0
  contact_info:
    phone: "+1-780-365-3131"
    email: "andrew.library@example.ca"

Schema Mapping

The LinkML Location class already supports:

  • street_address (string)
  • postal_code (string)

For contact info, we may need to extend the schema or use a separate ContactInfo class.
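
If the schema is extended, a minimal LinkML sketch could look like the following; the class and attribute names are illustrative and are not yet part of schemas/heritage_custodian.yaml:

```yaml
# Hypothetical addition to schemas/heritage_custodian.yaml -- names are illustrative
classes:
  ContactInfo:
    description: Contact details for a heritage custodian.
    attributes:
      phone:
        range: string
      email:
        range: string
      website:
        range: uri
```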

Command to Run

# Run detail scraper with rate limiting (1 req/sec to be polite)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py \
  --fetch-details \
  --rate-limit 1.0 \
  --output data/isil/canada/canadian_libraries_enriched.json

Enrichment Task 2: Geocoding

Objective

Add latitude/longitude coordinates to all 9,566 institutions for mapping and spatial analysis.

Implementation Options

Option A: GeoNames Lookup (Fast, Offline)

Tool: src/glam_extractor/geocoding/geonames_lookup.py

Advantages:

  • Fast (local SQLite database)
  • No API rate limits
  • Works offline
  • High accuracy for cities (population > 1,000)

Disadvantages:

  • May lack coordinates for very small towns
  • Requires GeoNames database setup

Setup:

# Build GeoNames database (one-time, ~30 minutes)
cd /Users/kempersc/apps/glam
python3 scripts/build_geonames_db.py

Usage:

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup

geocoder = GeoNamesLookup()
city_info = geocoder.lookup_city("Toronto", "CA", admin1_code="ON")
print(f"Lat: {city_info.latitude}, Lon: {city_info.longitude}")

Time Estimate: ~5 minutes for 9,566 lookups (local SQLite database, no network calls)

Option B: Nominatim API (Accurate, Slow)

Tool: Nominatim (OpenStreetMap geocoding)

Advantages:

  • Very accurate (street-level)
  • Handles ambiguous addresses
  • Free (with rate limits)

Disadvantages:

  • Rate limit: 1 request/second
  • Total time: 9,566 requests = ~2.7 hours
  • Requires internet connection

Usage:

import requests
import time

def geocode_nominatim(city, province, country="CA"):
    """Geocode using Nominatim API with rate limiting."""
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "state": province,
        "country": country,
        "format": "json",
        "limit": 1
    }
    headers = {"User-Agent": "GLAM-Extractor/1.0"}
    
    response = requests.get(url, params=params, headers=headers, timeout=30)
    time.sleep(1)  # Nominatim usage policy: max 1 request/second
    
    if response.ok and response.json():
        result = response.json()[0]
        return float(result["lat"]), float(result["lon"])
    return None, None
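
For example (the city and province values are illustrative):

```python
lat, lon = geocode_nominatim("Andrew", "Alberta")
if lat is not None:
    print(f"Andrew, AB -> {lat:.5f}, {lon:.5f}")
```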

Option C: Hybrid Approach (Best)

  1. Try GeoNames first (fast, covers 95% of cases)
  2. Fall back to Nominatim for misses (only ~500 lookups)
  3. Cache results to avoid repeated API calls (see the sketch below)

Time Estimate: ~15 minutes total
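
A minimal sketch of the fallback-plus-cache step, assuming the GeoNamesLookup API shown above and the geocode_nominatim() helper from Option B; the cache file location is illustrative:

```python
import json
from pathlib import Path

CACHE_FILE = Path("data/geocoding_cache.json")  # illustrative location

def load_cache():
    """Load previously geocoded results, keyed by 'city|province'."""
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def geocode_with_fallback(city, province, geocoder, cache):
    """Try GeoNames first, fall back to Nominatim, and cache every result."""
    key = f"{city}|{province}"
    if key in cache:
        return tuple(cache[key])

    city_info = geocoder.lookup_city(city, "CA", admin1_name=province)
    if city_info:
        coords = (city_info.latitude, city_info.longitude)
    else:
        coords = geocode_nominatim(city, province)  # rate-limited fallback

    cache[key] = coords
    CACHE_FILE.write_text(json.dumps(cache))
    return coords
```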

Geocoding Script

Create scripts/geocode_canadian_institutions.py:

#!/usr/bin/env python3
"""
Geocode Canadian heritage institutions.

Uses hybrid approach:
1. GeoNames lookup (fast, offline) for cities
2. Nominatim fallback for misses
3. Cache results to avoid repeated lookups
"""

import json
import time
from pathlib import Path
from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup

def geocode_canadian_institutions(input_file, output_file):
    """Add geocoding to Canadian institutions."""
    
    # Load institutions
    with open(input_file) as f:
        institutions = json.load(f)
    
    # Initialize geocoder
    geocoder = GeoNamesLookup()
    
    geocoded = 0
    misses = 0
    
    for inst in institutions:
        if not inst.get('locations'):
            continue
        
        location = inst['locations'][0]
        city = location.get('city')
        province = location.get('region')
        
        if not city or not province:
            continue
        
        # Try GeoNames lookup
        city_info = geocoder.lookup_city(city, "CA", admin1_name=province)
        
        if city_info:
            location['latitude'] = city_info.latitude
            location['longitude'] = city_info.longitude
            location['geonames_id'] = str(city_info.geonames_id)
            geocoded += 1
        else:
            misses += 1
            print(f"Miss: {city}, {province}")
    
    # Save results
    with open(output_file, 'w') as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)
    
    print(f"\n✅ Geocoded: {geocoded} / {len(institutions)}")
    print(f"❌ Misses: {misses}")

if __name__ == "__main__":
    geocode_canadian_institutions(
        "data/instances/canada/canadian_heritage_custodians.json",
        "data/instances/canada/canadian_heritage_custodians_geocoded.json"
    )

Run:

cd /Users/kempersc/apps/glam
python3 scripts/geocode_canadian_institutions.py

Enrichment Task 3: Integration with Global Dataset

Objective

Merge Canadian data with the global GLAM dataset and resolve any duplicates.

Steps

1. Find Conversation-Extracted Canadian Institutions

Search existing conversation files for Canadian institutions:

cd /Users/kempersc/Documents/claude/glam
grep -l "Canada\|Canadian" *.json | head -10

2. Cross-Reference by ISIL Code

Canadian ISIL codes follow the format CA-XXXX, so records can be matched on that identifier:

def find_duplicates(canadian_tier1, conversation_tier4):
    """Find duplicate institutions by ISIL code."""
    
    # Build ISIL lookup for Canadian TIER_1 data
    tier1_by_isil = {}
    for inst in canadian_tier1:
        for identifier in inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                tier1_by_isil[identifier['identifier_value']] = inst
    
    # Check conversation data for matches
    duplicates = []
    for conv_inst in conversation_tier4:
        for identifier in conv_inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                isil = identifier['identifier_value']
                if isil in tier1_by_isil:
                    duplicates.append({
                        'isil': isil,
                        'tier1': tier1_by_isil[isil],
                        'tier4': conv_inst
                    })
    
    return duplicates

3. Merge Strategy

When duplicates are found (a minimal merge sketch follows this list):

  • Keep TIER_1 data (Canadian ISIL registry is authoritative)
  • Merge additional fields from TIER_4 (descriptions, collection info)
  • Update provenance to show data consolidation
  • Create GHCID history if identifiers change
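
A minimal sketch of that policy; the provenance field name is an assumption about how consolidation might be recorded, not an existing schema slot:

```python
def merge_duplicate(tier1_inst, tier4_inst):
    """Keep TIER_1 values and copy over fields only TIER_4 provides."""
    merged = dict(tier1_inst)  # TIER_1 (ISIL registry) wins on any conflict

    # Copy fields present only in the conversation-extracted record,
    # e.g. descriptions or collection notes.
    for field, value in tier4_inst.items():
        if field not in merged or merged[field] in (None, "", []):
            merged[field] = value

    # Note the consolidation (field name is illustrative).
    merged.setdefault("provenance_notes", []).append(
        "Merged TIER_4 conversation record into TIER_1 ISIL registry record"
    )
    return merged
```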

4. Export Unified Dataset

Create a combined dataset (a minimal export sketch follows this list) with:

  • All TIER_1 Canadian institutions (9,566)
  • Non-duplicate conversation institutions
  • Merged metadata where applicable
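
A minimal export sketch, reusing find_duplicates() and merge_duplicate() from the steps above; the get_isil() helper and file handling are illustrative:

```python
import json

def get_isil(inst):
    """Return the record's ISIL code, if any."""
    for ident in inst.get('identifiers', []):
        if ident.get('identifier_scheme') == 'ISIL':
            return ident.get('identifier_value')
    return None

def export_unified(tier1, tier4, output_file):
    """Combine TIER_1 records (merged where duplicated) with non-duplicate TIER_4 records."""
    duplicates = find_duplicates(tier1, tier4)
    merged_by_isil = {d['isil']: merge_duplicate(d['tier1'], d['tier4']) for d in duplicates}

    unified = [merged_by_isil.get(get_isil(inst), inst) for inst in tier1]
    unified += [inst for inst in tier4 if get_isil(inst) not in merged_by_isil]

    with open(output_file, 'w') as f:
        json.dump(unified, f, indent=2, ensure_ascii=False)
```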

Enrichment Task 4: Wikidata Linking

Objective

Link Canadian institutions to Wikidata entities for Linked Open Data integration.

Implementation

Query Wikidata SPARQL endpoint for Canadian heritage institutions:

SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
  ?item wdt:P17 wd:Q16 .                # country: Canada
  OPTIONAL { ?item wdt:P791 ?isil }     # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }     # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
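
The query above targets museums (wd:Q33506); analogous queries with other classes (e.g. libraries and archives) would cover the rest of the dataset. A minimal way to run it against the public endpoint is sketched below, assuming the standard SPARQL JSON result format:

```python
import requests

query = """SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .
  ?item wdt:P17 wd:Q16 .
  OPTIONAL { ?item wdt:P791 ?isil }
  OPTIONAL { ?item wdt:P214 ?viaf }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "GLAM-Extractor/1.0"},
    timeout=60,
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    qid = binding["item"]["value"].rsplit("/", 1)[-1]  # QID from the entity URI
    label = binding.get("itemLabel", {}).get("value", "")
    isil = binding.get("isil", {}).get("value")
    print(qid, label, isil)
```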

Match by:

  1. ISIL code (if available in Wikidata)
  2. Fuzzy name matching (institution name similarity > 85%; see the sketch below)
  3. Geographic proximity (same city)
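
A minimal sketch of the name-similarity check in step 2, using Python's difflib; wikidata_rows is assumed to be a list of dicts with a "label" key (e.g. built from the SPARQL results above), and the 0.85 threshold mirrors the 85% figure:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two institution names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_by_name(institution, wikidata_rows, threshold=0.85):
    """Return the best-scoring Wikidata candidate above the threshold, or None."""
    best_row, best_score = None, 0.0
    for row in wikidata_rows:
        score = name_similarity(institution["name"], row["label"])
        if score > best_score:
            best_row, best_score = row, score
    return best_row if best_score >= threshold else None
```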

Add Wikidata IDs to identifiers:

identifiers:
- identifier_scheme: ISIL
  identifier_value: CA-OONL
- identifier_scheme: Wikidata
  identifier_value: Q16959027  # National Library of Canada
  identifier_url: https://www.wikidata.org/wiki/Q16959027

Priority Recommendations

Based on effort vs. value:

High Priority (Do First)

  1. Geocoding with GeoNames (15 minutes) - Enables mapping, high value
  2. Integration with global dataset (30 minutes) - Consolidates data

Medium Priority (Optional)

  1. Wikidata linking (1 hour) - Adds LOD connectivity
  2. Contact details scraping (3 hours) - Useful but time-intensive

Low Priority (Future)

  1. Detailed descriptions - Extract from detail pages
  2. Collection information - May require separate API

Scripts to Create

1. Geocoding Script

File: scripts/geocode_canadian_institutions.py
Time: 15 minutes to write, 15 minutes to run
Output: data/instances/canada/canadian_heritage_custodians_geocoded.json

2. Integration Script

File: scripts/integrate_canadian_with_global.py
Time: 30 minutes to write, 5 minutes to run
Output: data/instances/global/unified_heritage_custodians.json

3. Wikidata Linking Script

File: scripts/enrich_canadian_with_wikidata.py
Time: 1 hour to write, 30 minutes to run
Output: data/instances/canada/canadian_heritage_custodians_wikidata.json


Testing Strategy

For each enrichment:

  1. Test with sample (10-100 records first)
  2. Validate schema compliance (LinkML validation; see the example after this list)
  3. Check data quality (manual review of samples)
  4. Run full batch (all 9,566 records)
  5. Export and backup (JSON + YAML formats)
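
For the LinkML validation step, something like the following should work, assuming the linkml package is installed; the target class name is a guess and the exact flags may differ by linkml version:

```bash
# Validate a sample file against the project schema (class name is a guess)
linkml-validate \
  --schema schemas/heritage_custodian.yaml \
  --target-class HeritageCustodian \
  data/instances/canada/canadian_heritage_custodians_geocoded.json
```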

Resource Requirements

Disk Space

  • Current dataset: 14 MB
  • With geocoding: ~16 MB (+2 MB for coordinates)
  • With contact details: ~25 MB (+11 MB for addresses/phones/emails)
  • With all enrichments: ~30 MB

Time Investment

  • Geocoding: 15 minutes (GeoNames) or 3 hours (Nominatim)
  • Contact scraping: 3 hours (detail pages)
  • Wikidata linking: 1.5 hours (SPARQL + fuzzy matching)
  • Integration: 30 minutes (deduplication + merge)

Total (all tasks): 5-8 hours depending on approach


Decision Matrix

| Task | Value | Effort | Priority | When to Do |
|------|-------|--------|----------|------------|
| Geocoding | High | Low | 🔥 High | Now |
| Integration | High | Low | 🔥 High | Now |
| Wikidata | Medium | Medium | ⚠️ Medium | Next session |
| Contact Details | Medium | High | ⏸️ Low | If needed |

Next Session Checklist

When continuing this work:

  • Check if GeoNames database is built (data/geonames/geonames.db)
  • Verify Playwright is installed for detail scraping
  • Review Canadian dataset for any updates/changes
  • Check if Wikidata has new Canadian institution entries
  • Consider API rate limits before starting batch jobs

Contact

For questions or issues with enrichment:

  • Review AGENTS.md for extraction guidelines
  • Check docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md for session history
  • See LinkML schema at schemas/heritage_custodian.yaml

Last Updated: 2025-11-19
Dataset Version: 1.0 (9,566 records, 100% complete)
Status: Ready for enrichment