# Canadian ISIL Dataset - Enrichment Guide

**Current Status**: 9,566 records with basic metadata (100% complete)

**Next Steps**: Optional enrichment tasks to add contact info, geocoding, and integration

---

## Current Dataset Quality

### What We Have ✅

- **9,566 institutions** with 100% success rate
- **Basic Metadata**:
  - Institution name
  - ISIL code (`CA-XXXX` format)
  - City and province
  - Institution type (Library, Archive, Museum, etc.)
  - Organization status (Active/Inactive)
  - GHCID identifiers (UUID v5, UUID v8, numeric)
  - Detail page URLs
  - Data provenance (`TIER_1_AUTHORITATIVE`)

### What We're Missing 🔄

- **Contact Information**: Address, phone, email, website
- **Geographic Coordinates**: Latitude/longitude for mapping
- **Enriched Descriptions**: Operating hours, services, collection info
- **Cross-references**: Links to Wikidata, VIAF, other identifiers

---

## Enrichment Task 1: Contact Details from Detail Pages

### Objective

Extract additional metadata from LAC detail pages for all 9,566 institutions.

### What's Available

Based on the LAC website structure, detail pages contain:

- **Full address** (street, city, postal code)
- **Phone number**
- **Email address**
- **Website URL**
- **Operating hours** (for some institutions)
- **Service descriptions**
- **Director/Contact person**
- **Notes** (historical info, mergers, relocations)
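
As a rough illustration of pulling these fields out of a fetched page (this is not the actual `fetch_library_details()` implementation — the markup and field labels below are assumptions, and the real LAC HTML will differ):

```python
import re

# Hypothetical detail-page HTML - illustrative only, not the real LAC markup
SAMPLE_HTML = """
<dl>
  <dt>Address</dt><dd>4915 50th Street, Andrew, AB T0B 0C0</dd>
  <dt>Telephone</dt><dd>780-365-3131</dd>
  <dt>Email</dt><dd>library@example.ca</dd>
</dl>
"""

def extract_contact_fields(html: str) -> dict:
    """Collect labeled <dt>/<dd> pairs from a detail page into a dict."""
    pairs = re.findall(r"<dt>(.*?)</dt><dd>(.*?)</dd>", html, re.DOTALL)
    return {label.strip().lower(): value.strip() for label, value in pairs}

fields = extract_contact_fields(SAMPLE_HTML)
print(fields["telephone"])  # -> 780-365-3131
```

A real scraper would do this per detail page URL and fold the results into each record's `locations` and `contact_info`.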

### Implementation

**Tool Already Exists**: `scripts/scrapers/scrape_canadian_isil.py`

**Method**: `fetch_library_details()` - Extracts detail page data

**Usage**:

```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py --fetch-details
```

### Time Estimate

- **Rate**: ~1.2 seconds per detail page (Playwright navigation + parsing)
- **Total**: 9,566 records × 1.2 sec = **~3.2 hours**
- **Best time to run**: Overnight or during off-hours

### Expected Output Structure

```yaml
- id: https://w3id.org/heritage/custodian/ca/aa
  name: Andrew Municipal Library
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: CA-AA
    - identifier_scheme: Website
      identifier_value: https://www.andrewlibrary.ca
  locations:
    - city: Andrew
      region: Alberta
      country: CA
      street_address: 4915 50th Street
      postal_code: T0B 0C0
  contact_info:
    phone: "+1-780-365-3131"
    email: "andrew.library@example.ca"
```

### Schema Mapping

The LinkML `Location` class already supports:

- `street_address` (string)
- `postal_code` (string)

For contact info, we may need to extend the schema or use a separate `ContactInfo` class.
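
If the schema is extended, a separate class might look like the following sketch (slot names here are assumptions for illustration, not the project's actual LinkML definitions):

```yaml
# Hypothetical LinkML class - slot names are illustrative assumptions
ContactInfo:
  description: Contact details for a heritage institution
  attributes:
    phone:
      range: string
    email:
      range: string
    website:
      range: uri
```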

### Command to Run

```bash
# Run detail scraper with rate limiting (1 req/sec to be polite)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_canadian_isil.py \
    --fetch-details \
    --rate-limit 1.0 \
    --output data/isil/canada/canadian_libraries_enriched.json
```

---

## Enrichment Task 2: Geocoding

### Objective

Add latitude/longitude coordinates to all 9,566 institutions for mapping and spatial analysis.

### Implementation Options

#### Option A: GeoNames Lookup (Fast, Offline)

**Tool**: `src/glam_extractor/geocoding/geonames_lookup.py`

**Advantages**:

- ✅ Fast (local SQLite database)
- ✅ No API rate limits
- ✅ Works offline
- ✅ High accuracy for cities (population > 1,000)

**Disadvantages**:

- ❌ May lack coordinates for very small towns
- ❌ Requires GeoNames database setup

**Setup**:

```bash
# Build GeoNames database (one-time, ~30 minutes)
cd /Users/kempersc/apps/glam
python3 scripts/build_geonames_db.py
```

**Usage**:

```python
from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup

geocoder = GeoNamesLookup()
city_info = geocoder.lookup_city("Toronto", "CA", admin1_code="ON")
print(f"Lat: {city_info.latitude}, Lon: {city_info.longitude}")
```

**Time Estimate**: ~5 minutes for 9,566 lookups (offline, instant)

#### Option B: Nominatim API (Accurate, Slow)

**Tool**: Nominatim (OpenStreetMap geocoding)

**Advantages**:

- ✅ Very accurate (street-level)
- ✅ Handles ambiguous addresses
- ✅ Free (with rate limits)

**Disadvantages**:

- ❌ Rate limit: 1 request/second
- ❌ Total time: 9,566 requests = **~2.7 hours**
- ❌ Requires internet connection

**Usage**:

```python
import requests
import time

def geocode_nominatim(city, province, country="CA"):
    """Geocode using Nominatim API with rate limiting."""
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "state": province,
        "country": country,
        "format": "json",
        "limit": 1,
    }
    headers = {"User-Agent": "GLAM-Extractor/1.0"}

    response = requests.get(url, params=params, headers=headers)
    time.sleep(1)  # Rate limit: 1 req/sec

    if response.ok and response.json():
        result = response.json()[0]
        return float(result["lat"]), float(result["lon"])
    return None, None
```

#### Option C: Hybrid Approach (Best)

1. **Try GeoNames first** (fast, covers ~95% of cases)
2. **Fall back to Nominatim** for misses (only ~500 lookups)
3. **Cache results** to avoid repeated API calls

**Time Estimate**: ~15 minutes total
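
The fallback-plus-cache logic above can be sketched as follows (the function names and stub backends are placeholders, not the repo's API — real code would wrap `GeoNamesLookup` and `geocode_nominatim`):

```python
from typing import Callable, Optional, Tuple

Coords = Optional[Tuple[float, float]]

def make_hybrid_geocoder(
    geonames: Callable[[str, str], Coords],
    nominatim: Callable[[str, str], Coords],
) -> Callable[[str, str], Coords]:
    """Try the fast offline lookup first, fall back to the API, cache everything."""
    cache: dict = {}

    def geocode(city: str, province: str) -> Coords:
        key = (city.lower(), province.lower())
        if key not in cache:
            cache[key] = geonames(city, province) or nominatim(city, province)
        return cache[key]

    return geocode

def geonames_stub(city, prov):
    # Pretend only Toronto is in the local database
    return (43.7, -79.4) if city == "Toronto" else None

def nominatim_stub(city, prov):
    # Pretend the API resolves everything to some coordinates
    return (54.5, -112.3)

geocode = make_hybrid_geocoder(geonames_stub, nominatim_stub)
print(geocode("Toronto", "Ontario"))  # offline hit, no API call
print(geocode("Andrew", "Alberta"))   # miss -> falls back to the API stub
```

Caching on the (city, province) pair matters because many of the 9,566 institutions share a city, so the number of distinct lookups is far smaller than the record count.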

### Geocoding Script

Create `scripts/geocode_canadian_institutions.py`:

```python
#!/usr/bin/env python3
"""
Geocode Canadian heritage institutions.

Uses hybrid approach:
1. GeoNames lookup (fast, offline) for cities
2. Nominatim fallback for misses
3. Cache results to avoid repeated lookups
"""

import json

from glam_extractor.geocoding.geonames_lookup import GeoNamesLookup


def geocode_canadian_institutions(input_file, output_file):
    """Add geocoding to Canadian institutions."""

    # Load institutions
    with open(input_file) as f:
        institutions = json.load(f)

    # Initialize geocoder
    geocoder = GeoNamesLookup()

    geocoded = 0
    misses = 0

    for inst in institutions:
        if not inst.get('locations'):
            continue

        location = inst['locations'][0]
        city = location.get('city')
        province = location.get('region')

        if not city or not province:
            continue

        # Try GeoNames lookup
        city_info = geocoder.lookup_city(city, "CA", admin1_name=province)

        if city_info:
            location['latitude'] = city_info.latitude
            location['longitude'] = city_info.longitude
            location['geonames_id'] = str(city_info.geonames_id)
            geocoded += 1
        else:
            misses += 1
            print(f"Miss: {city}, {province}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(institutions, f, indent=2, ensure_ascii=False)

    print(f"\n✅ Geocoded: {geocoded} / {len(institutions)}")
    print(f"❌ Misses: {misses}")


if __name__ == "__main__":
    geocode_canadian_institutions(
        "data/instances/canada/canadian_heritage_custodians.json",
        "data/instances/canada/canadian_heritage_custodians_geocoded.json",
    )
```

**Run**:

```bash
cd /Users/kempersc/apps/glam
python3 scripts/geocode_canadian_institutions.py
```

---

## Enrichment Task 3: Integration with Global Dataset

### Objective

Merge Canadian data with the global GLAM dataset and resolve any duplicates.

### Steps

#### 1. Find Conversation-Extracted Canadian Institutions

Search existing conversation files for Canadian institutions:

```bash
cd /Users/kempersc/Documents/claude/glam
grep -l "Canada\|Canadian" *.json | head -10
```

#### 2. Cross-Reference by ISIL Code

Canadian ISIL codes follow the format `CA-XXXX`:

```python
def find_duplicates(canadian_tier1, conversation_tier4):
    """Find duplicate institutions by ISIL code."""

    # Build ISIL lookup for Canadian TIER_1 data
    tier1_by_isil = {}
    for inst in canadian_tier1:
        for identifier in inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                tier1_by_isil[identifier['identifier_value']] = inst

    # Check conversation data for matches
    duplicates = []
    for conv_inst in conversation_tier4:
        for identifier in conv_inst.get('identifiers', []):
            if identifier['identifier_scheme'] == 'ISIL':
                isil = identifier['identifier_value']
                if isil in tier1_by_isil:
                    duplicates.append({
                        'isil': isil,
                        'tier1': tier1_by_isil[isil],
                        'tier4': conv_inst,
                    })

    return duplicates
```

#### 3. Merge Strategy

When duplicates are found:

- **Keep TIER_1 data** (the Canadian ISIL registry is authoritative)
- **Merge additional fields** from TIER_4 (descriptions, collection info)
- **Update provenance** to show data consolidation
- **Create GHCID history** if identifiers change
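
A minimal sketch of that merge policy (the field handling is illustrative; a real consolidation would also append to provenance history rather than just overwriting a flag):

```python
def merge_records(tier1: dict, tier4: dict) -> dict:
    """Merge a TIER_4 record into an authoritative TIER_1 record.

    TIER_1 values always win; TIER_4 only contributes fields
    that TIER_1 lacks (e.g. descriptions, collection info).
    """
    merged = dict(tier4)   # start with the enrichment data...
    merged.update(tier1)   # ...then let authoritative fields override
    merged["data_provenance"] = "TIER_1_AUTHORITATIVE"
    return merged

t1 = {"name": "Andrew Municipal Library",
      "data_provenance": "TIER_1_AUTHORITATIVE"}
t4 = {"name": "Andrew Library",
      "description": "Small rural public library"}

print(merge_records(t1, t4)["name"])         # TIER_1 name wins
print(merge_records(t1, t4)["description"])  # TIER_4 description is kept
```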

#### 4. Export Unified Dataset

Create a combined dataset with:

- All TIER_1 Canadian institutions (9,566)
- Non-duplicate conversation institutions
- Merged metadata where applicable

---

## Enrichment Task 4: Wikidata Linking

### Objective

Link Canadian institutions to Wikidata entities for Linked Open Data integration.

### Implementation

**Query the Wikidata SPARQL endpoint** for Canadian heritage institutions:

```sparql
SELECT ?item ?itemLabel ?isil ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
  ?item wdt:P17 wd:Q16 .               # country: Canada
  OPTIONAL { ?item wdt:P791 ?isil }    # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }    # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" }
}
```

**Match by**:

1. ISIL code (if available in Wikidata)
2. Fuzzy name matching (institution name similarity > 85%)
3. Geographic proximity (same city)
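
For step 2, the stdlib's `difflib` gives a workable similarity ratio without extra dependencies — a sketch of the 85% threshold check (real matching may want normalization beyond lowercasing, e.g. stripping punctuation and accents):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two institution names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_name_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Apply the 85% similarity cutoff from the matching rules above."""
    return name_similarity(a, b) >= threshold

print(is_name_match("Library and Archives Canada", "Library & Archives Canada"))
print(is_name_match("Royal Ontario Museum", "Toronto Public Library"))
```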

**Add Wikidata IDs** to `identifiers`:

```yaml
identifiers:
  - identifier_scheme: ISIL
    identifier_value: CA-OONL
  - identifier_scheme: Wikidata
    identifier_value: Q16959027  # National Library of Canada
    identifier_url: https://www.wikidata.org/wiki/Q16959027
```

---

## Priority Recommendations

Based on effort vs. value:

### High Priority (Do First)

1. **Geocoding with GeoNames** (15 minutes) - Enables mapping, high value
2. **Integration with global dataset** (30 minutes) - Consolidates data

### Medium Priority (Optional)

3. **Wikidata linking** (1 hour) - Adds LOD connectivity
4. **Contact details scraping** (3 hours) - Useful but time-intensive

### Low Priority (Future)

5. **Detailed descriptions** - Extract from detail pages
6. **Collection information** - May require a separate API

---

## Scripts to Create

### 1. Geocoding Script

**File**: `scripts/geocode_canadian_institutions.py`
**Time**: 15 minutes to write, 15 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_geocoded.json`

### 2. Integration Script

**File**: `scripts/integrate_canadian_with_global.py`
**Time**: 30 minutes to write, 5 minutes to run
**Output**: `data/instances/global/unified_heritage_custodians.json`

### 3. Wikidata Linking Script

**File**: `scripts/enrich_canadian_with_wikidata.py`
**Time**: 1 hour to write, 30 minutes to run
**Output**: `data/instances/canada/canadian_heritage_custodians_wikidata.json`

---

## Testing Strategy

For each enrichment:

1. **Test with a sample** (10-100 records first)
2. **Validate schema compliance** (LinkML validation)
3. **Check data quality** (manual review of samples)
4. **Run the full batch** (all 9,566 records)
5. **Export and back up** (JSON + YAML formats)
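
Step 1 can be as simple as slicing the input before committing to a full run — a sketch assuming the JSON record layout shown earlier (the validation check here is a placeholder for real LinkML validation):

```python
import json

def load_sample(path: str, n: int = 100) -> list:
    """Load only the first n records for a dry run."""
    with open(path) as f:
        records = json.load(f)
    return records[:n]

def check_required_fields(records: list) -> list:
    """Return records missing name or identifiers - a cheap pre-validation."""
    return [r for r in records if not r.get("name") or not r.get("identifiers")]

# Dry run against an in-memory sample instead of the real 9,566-record file
sample = [
    {"name": "Andrew Municipal Library",
     "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "CA-AA"}]},
    {"name": "", "identifiers": []},
]
print(len(check_required_fields(sample)))  # -> 1 record fails the check
```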
---

## Resource Requirements

### Disk Space

- Current dataset: 14 MB
- With geocoding: ~16 MB (+2 MB for coordinates)
- With contact details: ~25 MB (+11 MB for addresses/phones/emails)
- With all enrichments: ~30 MB

### Time Investment

- **Geocoding**: 15 minutes (GeoNames) or ~3 hours (Nominatim)
- **Contact scraping**: 3 hours (detail pages)
- **Wikidata linking**: 1.5 hours (SPARQL + fuzzy matching)
- **Integration**: 30 minutes (deduplication + merge)

**Total (all tasks)**: 5-8 hours depending on approach

---

## Decision Matrix

| Task | Value | Effort | Priority | When to Do |
|------|-------|--------|----------|------------|
| Geocoding | High | Low | 🔥 High | Now |
| Integration | High | Low | 🔥 High | Now |
| Wikidata | Medium | Medium | ⚠️ Medium | Next session |
| Contact Details | Medium | High | ⏸️ Low | If needed |

---

## Next Session Checklist

When continuing this work:

- [ ] Check if the GeoNames database is built (`data/geonames/geonames.db`)
- [ ] Verify Playwright is installed for detail scraping
- [ ] Review the Canadian dataset for any updates/changes
- [ ] Check if Wikidata has new Canadian institution entries
- [ ] Consider API rate limits before starting batch jobs

---

## Contact

For questions or issues with enrichment:

- Review `AGENTS.md` for extraction guidelines
- Check `docs/sessions/CANADIAN_ISIL_EXTRACTION_COMPLETE.md` for session history
- See the LinkML schema at `schemas/heritage_custodian.yaml`

---

**Last Updated**: 2025-11-19
**Dataset Version**: 1.0 (9,566 records, 100% complete)
**Status**: Ready for enrichment