# Canadian ISIL Dataset - Enrichment Guide
**Current Status**: 9,566 records with basic metadata (100% complete)
**Next Steps**: Optional enrichment tasks to add contact info, geocoding, and integration
---
## Current Dataset Quality
### What We Have ✅
- **9,566 institutions** with 100% success rate
- **Basic Metadata**:
  - Institution name
  - ISIL code (CA-XXXX format)
  - City and province
  - Institution type (Library, Archive, Museum, etc.)
  - Organization status (Active/Inactive)
  - GHCID identifiers (UUID v5, UUID v8, numeric)
  - Detail page URLs
  - Data provenance (TIER_1_AUTHORITATIVE)
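The GHCID UUID v5 values listed above are deterministic, name-based identifiers, which is what makes them reproducible across extraction runs. A minimal sketch of how such an identifier could be derived from an ISIL code (the namespace seed below is an assumption; the project's actual GHCID namespace is not shown in this guide):

```python
import uuid

# Hypothetical namespace seed: this guide does not state the project's actual
# GHCID namespace UUID, so we derive one from the custodian URI prefix.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian")


def ghcid_v5(isil_code: str) -> str:
    """Derive a deterministic UUID v5 GHCID from an ISIL code."""
    return str(uuid.uuid5(GHCID_NAMESPACE, isil_code))
```

Because UUID v5 hashes the namespace plus the name, the same ISIL code always yields the same identifier, while distinct codes yield distinct ones.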
### What We're Missing 🔄
- **Contact Information**: Address, phone, email, website
- **Geographic Coordinates**: Latitude/longitude for mapping
- **Enriched Descriptions**: Operating hours, services, collection info
- **Cross-references**: Links to Wikidata, VIAF, other identifiers
---
## Enrichment Task 1: Contact Details from Detail Pages
### Objective
Extract additional metadata from LAC detail pages for all 9,566 institutions.
### What's Available
Based on the LAC website structure, detail pages contain:
- **Full address** (street, city, postal code)
- **Phone number**
- **Email address**
- **Website URL**
- **Operating hours** (for some institutions)
- **Service descriptions**
- **Director/Contact person**
- **Notes** (historical info, mergers, relocations)
### Implementation
**Tool Already Exists**: `scripts/scrapers/scrape_canadian_isil.py`
**Method**: `fetch_library_details()` - Extracts detail page data
**Usage**:
```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py --fetch-details
```
### Time Estimate
- **Rate**: ~1.2 seconds per detail page (Playwright navigation + parsing)
- **Total**: 9,566 records × 1.2 sec = **~3.2 hours**
- **Best time to run**: Overnight or during off-hours
### Expected Output Structure
```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
    - identifier_scheme: Website
      identifier_value: https://www.andrewlibrary.ca
  locations:
    - city: Andrew
      region: Alberta
      country: CA
      street_address: 4915 50th Street
      postal_code: T0B 0C0
  contact_info:
    phone: "+1-780-365-3131"
    email: "andrew.library@example.ca"
```
### Schema Mapping
The LinkML `Location` class already supports:
- `street_address` (string)
- `postal_code` (string)
For contact info, we may need to extend the schema or use a separate `ContactInfo` class.
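If we extend the schema, a `ContactInfo` class might look like the following LinkML sketch. This is illustrative only: the attribute names and ranges are assumptions, not the project's actual schema (see `schemas/heritage_custodian.yaml` for what exists today).

```yaml
classes:
  ContactInfo:
    description: Contact details for a heritage institution (illustrative sketch)
    attributes:
      phone:
        range: string
      email:
        range: string
      website:
        range: uri
```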
### Command to Run
```bash
# Run detail scraper with rate limiting (1 req/sec to be polite)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py \
    --fetch-details \
    --rate-limit 1.0 \
    --output data/isil/canada/canadian_libraries_enriched.json
```
---
## Enrichment Task 2: Geocoding
### Objective
Add latitude/longitude coordinates to all 9,566 institutions for mapping and spatial analysis.
### Implementation Options
#### Option A: GeoNames Lookup (Fast, Offline)
**Tool**: `src/glam_extractor/geocoding/geonames_lookup.py`
**Advantages**:
- ✅ Fast (local SQLite database)
- ✅ No API rate limits
- ✅ Works offline
- ✅ High accuracy for cities (population > 1,000)
**Disadvantages**:
- ❌ May lack coordinates for very small towns
- ❌ Requires GeoNames database setup
**Setup**:
```bash
# Build GeoNames database (one-time, ~30 minutes)
cd /Users/kempersc/apps/glam
python3 scripts/build_geonames_db.py
```
**Usage**:
```python
from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup
geocoder = GeoNamesLookup()
city_info = geocoder.lookup_city("Toronto", "CA", admin1_code="ON")
print(f"Lat: {city_info.latitude}, Lon: {city_info.longitude}")
```
**Time Estimate**: ~5 minutes for 9,566 lookups (offline; each lookup is a local SQLite query)
#### Option B: Nominatim API (Accurate, Slow)
**Tool**: Nominatim (OpenStreetMap geocoding)
**Advantages**:
- ✅ Very accurate (street-level)
- ✅ Handles ambiguous addresses
- ✅ Free (with rate limits)
**Disadvantages**:
- ❌ Rate limit: 1 request/second
- ❌ Total time: 9,566 requests = **~2.7 hours**
- ❌ Requires internet connection
**Usage**:
```python
import requests
import time


def geocode_nominatim(city, province, country="CA"):
    """Geocode using Nominatim API with rate limiting."""
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "state": province,
        "country": country,
        "format": "json",
        "limit": 1,
    }
    headers = {"User-Agent": "GLAM-Extractor/1.0"}
    response = requests.get(url, params=params, headers=headers)
    time.sleep(1)  # Rate limit: 1 req/sec
    if response.ok and response.json():
        result = response.json()[0]
        return float(result["lat"]), float(result["lon"])
    return None, None
```
#### Option C: Hybrid Approach (Best)
1. **Try GeoNames first** (fast, covers 95% of cases)
2. **Fall back to Nominatim** for misses (only ~500 lookups)
3. **Cache results** to avoid repeated API calls
**Time Estimate**: ~15 minutes total
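The three steps above can be sketched as a single lookup function. This is a sketch under stated assumptions: the `GeoNamesLookup.lookup_city` call mirrors the usage shown earlier in this guide, the Nominatim parameters mirror Option B, and the cache structure is illustrative.

```python
import time


def nominatim_fallback(city, province):
    """Street-level fallback via Nominatim, rate-limited to 1 req/sec."""
    import requests  # deferred so offline-only runs need not have it installed

    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"city": city, "state": province, "country": "CA",
                "format": "json", "limit": 1},
        headers={"User-Agent": "GLAM-Extractor/1.0"},
    )
    time.sleep(1)  # Nominatim usage policy: max 1 request per second
    hits = resp.json() if resp.ok else []
    if hits:
        return float(hits[0]["lat"]), float(hits[0]["lon"])
    return None, None


def geocode_hybrid(city, province, geonames, cache):
    """GeoNames first, Nominatim fallback, with an in-memory result cache."""
    key = f"{city}|{province}"
    if key in cache:
        return cache[key]  # step 3: never look up the same place twice
    info = geonames.lookup_city(city, "CA", admin1_name=province)  # step 1
    if info is not None:
        coords = (info.latitude, info.longitude)
    else:
        coords = nominatim_fallback(city, province)  # step 2
    cache[key] = coords
    return coords
```

Persisting `cache` to disk between runs (e.g. as JSON) keeps reruns near-instant.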
### Geocoding Script
Create `scripts/geocode_canadian_institutions.py`:
```python
#!/usr/bin/env python3
"""
Geocode Canadian heritage institutions.

Uses hybrid approach:
1. GeoNames lookup (fast, offline) for cities
2. Nominatim fallback for misses
3. Cache results to avoid repeated lookups
"""
import json

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup


def geocode_canadian_institutions(input_file, output_file):
    """Add geocoding to Canadian institutions."""
    # Load institutions
    with open(input_file) as f:
        institutions = json.load(f)

    # Initialize geocoder
    geocoder = GeoNamesLookup()
    geocoded = 0
    misses = 0

    for inst in institutions:
        if not inst.get('locations'):
            continue
        location = inst['locations'][0]
        city = location.get('city')
        province = location.get('region')
        if not city or not province:
            continue

        # Try GeoNames lookup
        city_info = geocoder.lookup_city(city, "CA", admin1_name=province)
        if city_info:
            location['latitude'] = city_info.latitude
            location['longitude'] = city_info.longitude
            location['geonames_id'] = str(city_info.geonames_id)
            geocoded += 1
        else:
            misses += 1
            print(f"Miss: {city}, {province}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)

    print(f"\n✅ Geocoded: {geocoded} / {len(institutions)}")
    print(f"❌ Misses: {misses}")


if __name__ == "__main__":
    geocode_canadian_institutions(
        "data/instances/canada/canadian_heritage_custodians.json",
        "data/instances/canada/canadian_heritage_custodians_geocoded.json",
    )
```
**Run**:
```bash
cd /Users/kempersc/apps/glam
python3 scripts/geocode_canadian_institutions.py
```
---
## Enrichment Task 3: Integration with Global Dataset
### Objective
Merge Canadian data with the global GLAM dataset and resolve any duplicates.
### Steps
#### 1. Find Conversation-Extracted Canadian Institutions
Search existing conversation files for Canadian institutions:
```bash
cd /Users/kempersc/Documents/claude/glam
grep -l "Canada\|Canadian" *.json | head -10
```
#### 2. Cross-Reference by ISIL Code
Canadian ISIL codes follow format: `CA-XXXX`
```python
def find_duplicates(canadian_tier1, conversation_tier4):
    """Find duplicate institutions by ISIL code."""
    # Build ISIL lookup for Canadian TIER_1 data
    tier1_by_isil = {}
    for inst in canadian_tier1:
        for identifier in inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                tier1_by_isil[identifier['identifier_value']] = inst

    # Check conversation data for matches
    duplicates = []
    for conv_inst in conversation_tier4:
        for identifier in conv_inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                isil = identifier['identifier_value']
                if isil in tier1_by_isil:
                    duplicates.append({
                        'isil': isil,
                        'tier1': tier1_by_isil[isil],
                        'tier4': conv_inst,
                    })
    return duplicates
```
#### 3. Merge Strategy
When duplicates are found:
- **Keep TIER_1 data** (Canadian ISIL registry is authoritative)
- **Merge additional fields** from TIER_4 (descriptions, collection info)
- **Update provenance** to show data consolidation
- **Create GHCID history** if identifiers change
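The merge rules above can be sketched as follows. The specific field names (`description`, `collection_info`, `provenance_notes`) are illustrative assumptions, not the project's confirmed schema; the invariant that matters is that TIER_1 values always win.

```python
def merge_records(tier1, tier4):
    """Merge supplementary TIER_4 fields into an authoritative TIER_1 record."""
    merged = dict(tier1)  # start from TIER_1: the ISIL registry always wins
    # Copy over only fields TIER_1 lacks (field names are illustrative)
    for field in ("description", "collection_info"):
        if field not in merged and field in tier4:
            merged[field] = tier4[field]
    # Note the consolidation in provenance (key name is an assumption)
    notes = list(merged.get("provenance_notes", []))
    notes.append("Merged supplementary fields from TIER_4 conversation extract")
    merged["provenance_notes"] = notes
    return merged
```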
#### 4. Export Unified Dataset
Create combined dataset with:
- All TIER_1 Canadian institutions (9,566)
- Non-duplicate conversation institutions
- Merged metadata where applicable
---
## Enrichment Task 4: Wikidata Linking
### Objective
Link Canadian institutions to Wikidata entities for Linked Open Data integration.
### Implementation
**Query Wikidata SPARQL endpoint** for Canadian heritage institutions:
```sparql
SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
  ?item wdt:P17 wd:Q16 .               # country: Canada
  OPTIONAL { ?item wdt:P791 ?isil }    # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }    # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
```
**Match by**:
1. ISIL code (if available in Wikidata)
2. Fuzzy name matching (institution name similarity > 85%)
3. Geographic proximity (same city)
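The fuzzy-name criterion can be sketched with the standard library's `difflib`; the 0.85 threshold mirrors the rule above, and the row keys (`label`, `city`, `qid`) are illustrative names for fields returned by the SPARQL query.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity ratio between two institution names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(inst_name, inst_city, wikidata_rows, threshold=0.85):
    """Return Wikidata rows in the same city whose label is similar enough."""
    matches = []
    for row in wikidata_rows:
        # Geographic-proximity filter: require the same city when known
        if row.get("city") and row["city"] != inst_city:
            continue
        if name_similarity(inst_name, row["label"]) >= threshold:
            matches.append(row)
    return matches
```

Libraries like `rapidfuzz` offer faster, more robust scoring if `difflib` proves too coarse.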
**Add Wikidata IDs** to `identifiers`:
```yaml
identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-OONL
  - identifier_scheme: Wikidata
    identifier_value: Q16959027  # National Library of Canada
    identifier_url: https://www.wikidata.org/wiki/Q16959027
```
---
## Priority Recommendations
Based on effort vs. value:
### High Priority (Do First)
1. **Geocoding with GeoNames** (15 minutes) - Enables mapping, high value
2. **Integration with global dataset** (30 minutes) - Consolidates data
### Medium Priority (Optional)
3. **Wikidata linking** (1 hour) - Adds LOD connectivity
4. **Contact details scraping** (3 hours) - Useful but time-intensive
### Low Priority (Future)
5. **Detailed descriptions** - Extract from detail pages
6. **Collection information** - May require separate API
---
## Scripts to Create
### 1. Geocoding Script
**File**: `scripts/geocode_canadian_institutions.py`
**Time**: 15 minutes to write, 15 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_geocoded.json`
### 2. Integration Script
**File**: `scripts/integrate_canadian_with_global.py`
**Time**: 30 minutes to write, 5 minutes to run
**Output**: `data/instances/global/unified_heritage_custodians.json`
### 3. Wikidata Linking Script
**File**: `scripts/enrich_canadian_with_wikidata.py`
**Time**: 1 hour to write, 30 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_wikidata.json`
---
## Testing Strategy
For each enrichment:
1. **Test with sample** (10-100 records first)
2. **Validate schema compliance** (LinkML validation)
3. **Check data quality** (manual review of samples)
4. **Run full batch** (all 9,566 records)
5. **Export and backup** (JSON + YAML formats)
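Step 1 (test with a sample) amounts to slicing the first N records into a scratch file before running any enrichment on all 9,566. A minimal helper, assuming the JSON files are top-level arrays of records as shown elsewhere in this guide:

```python
import json
from pathlib import Path


def write_sample(input_file, output_file, n=100):
    """Copy the first n records to a file for a dry-run enrichment pass."""
    records = json.loads(Path(input_file).read_text())
    sample = records[:n]
    Path(output_file).write_text(json.dumps(sample, indent=2, ensure_ascii=False))
    return len(sample)
```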
---
## Resource Requirements
### Disk Space
- Current dataset: 14 MB
- With geocoding: ~16 MB (+2 MB for coordinates)
- With contact details: ~25 MB (+11 MB for addresses/phones/emails)
- With all enrichments: ~30 MB
### Time Investment
- **Geocoding**: 15 minutes (GeoNames) or 3 hours (Nominatim)
- **Contact scraping**: 3 hours (detail pages)
- **Wikidata linking**: 1.5 hours (SPARQL + fuzzy matching)
- **Integration**: 30 minutes (deduplication + merge)
**Total (all tasks)**: 5-8 hours depending on approach
---
## Decision Matrix
| Task | Value | Effort | Priority | When to Do |
|------|-------|--------|----------|------------|
| Geocoding | High | Low | 🔥 High | Now |
| Integration | High | Low | 🔥 High | Now |
| Wikidata | Medium | Medium | ⚠️ Medium | Next session |
| Contact Details | Medium | High | ⏸️ Low | If needed |
---
## Next Session Checklist
When continuing this work:
- [ ] Check if GeoNames database is built (`data/geonames/geonames.db`)
- [ ] Verify Playwright is installed for detail scraping
- [ ] Review Canadian dataset for any updates/changes
- [ ] Check if Wikidata has new Canadian institution entries
- [ ] Consider API rate limits before starting batch jobs
---
## Contact
For questions or issues with enrichment:
- Review `AGENTS.md` for extraction guidelines
- Check `docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md` for session history
- See LinkML schema at `schemas/heritage_custodian.yaml`
---
**Last Updated**: 2025-11-19
**Dataset Version**: 1.0 (9,566 records, 100% complete)
**Status**: Ready for enrichment