Canadian Dataset Integration Report
Date: 2025-11-19 11:26:41
Operation: Merge Canadian ISIL Registry (TIER_1) with Global Dataset
Integration Summary
| Metric | Count |
|---|---|
| Total institutions after merge | 22,981 |
| Global institutions (before) | 13,415 |
| Canadian institutions (TIER_1) | 9,566 |
| Overlapping ISIL codes | 0 |
| Canadian institutions added | 9,566 |
| Global institutions replaced | 0 |
| Global institutions retained | 13,415 |
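The counts in this table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and the bookkeeping rule (each overlapping ISIL code replaces exactly one global record) is an assumption, not something taken from the integration script.

```python
# Counts copied from the Integration Summary table.
global_before = 13_415
canadian_tier1 = 9_566
overlapping = 0

# Assumed bookkeeping: each overlapping ISIL code replaces one global record,
# and every Canadian record ends up in the merged dataset.
global_replaced = overlapping
global_retained = global_before - global_replaced
canadian_added = canadian_tier1
total_after = global_retained + canadian_added

print(global_retained, total_after)  # 13415 22981
```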
Data Tier Breakdown
The integration follows the data tier hierarchy:
- TIER_1_AUTHORITATIVE (Canadian ISIL Registry)
  - 9,566 institutions
  - Official government registry
  - Takes precedence over conversation-extracted data
- TIER_4_INFERRED (Conversation extraction)
  - 13,415 institutions
  - NLP-extracted from heritage conversations
  - Retained where no TIER_1 data exists
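The precedence rule can be pictured as a rank lookup. This is a minimal sketch under stated assumptions: the `data_tier` field name, the `TIER_RANK` table, and the `preferred` helper are hypothetical, and only the two tiers named in this report are listed.

```python
# Hypothetical ranking: a lower number means a more authoritative tier.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_4_INFERRED": 4,
}


def preferred(record_a: dict, record_b: dict) -> dict:
    """Return whichever record comes from the more authoritative tier.

    Ties keep record_a. The "data_tier" key is an assumed field name.
    """
    if TIER_RANK[record_b["data_tier"]] < TIER_RANK[record_a["data_tier"]]:
        return record_b
    return record_a


# Made-up records for illustration only.
inferred = {"name": "Example Library", "data_tier": "TIER_4_INFERRED"}
official = {"name": "Example Library", "data_tier": "TIER_1_AUTHORITATIVE"}

winner = preferred(inferred, official)
print(winner["data_tier"])  # TIER_1_AUTHORITATIVE
```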
Geographic Coverage
Before Integration
- Countries: ~60 (from conversations)
- Canadian institutions: 0
After Integration
- Countries: ~61 (added Canada)
- Canadian institutions: 9,566
- Total institutions: 22,981
Deduplication Details
ISIL Code Matching
No overlapping ISIL codes were found, so no conflicts had to be resolved. The rule that applies to any overlap is:
- Strategy: TIER_1 (Canadian registry) replaces TIER_4 (conversations)
- Reason: Government registries are authoritative sources
- Result: 0 global records replaced with Canadian TIER_1 data
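The replacement strategy above could be sketched as follows. The function name, the `isil` field name, and the sample records are all hypothetical; the real integration script may handle records without ISIL codes differently.

```python
def merge_by_isil(global_records, canadian_records):
    """Merge two record lists, letting Canadian TIER_1 records win on ISIL clashes.

    Records are plain dicts; the "isil" key is an assumed field name.
    Global records without an ISIL code are always kept.
    Returns (merged_records, number_of_global_records_replaced).
    """
    canadian_isils = {r["isil"] for r in canadian_records}
    merged = []
    replaced = 0
    for record in global_records:
        isil = record.get("isil")
        if isil is not None and isil in canadian_isils:
            replaced += 1  # dropped in favour of the TIER_1 record
        else:
            merged.append(record)
    merged.extend(canadian_records)
    return merged, replaced


# Made-up sample data: one global record clashes with a Canadian one.
global_records = [
    {"name": "Inferred A", "isil": None},
    {"name": "Inferred B", "isil": "CA-EXAMPLE1"},
]
canadian_records = [
    {"name": "Official B", "isil": "CA-EXAMPLE1"},
    {"name": "Official C", "isil": "CA-EXAMPLE2"},
]
merged, replaced = merge_by_isil(global_records, canadian_records)
print(len(merged), replaced)  # 3 1
```

In the run reported here the overlap set was empty, so the equivalent of `replaced` was 0 and every global record was kept.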
New Additions
9,566 Canadian institutions were added to the global dataset:
- All have ISIL codes (CA-XXXX format)
- 94.3% have geocoded coordinates
- Complete metadata from Library and Archives Canada
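A simple format check for the `CA-` prefix might look like the sketch below. The pattern is an assumption made for illustration (prefix plus a non-empty run of letters and digits); the registry may permit other characters, so treat it as a sanity check, not a validator.

```python
import re

# Assumed pattern: "CA-" followed by one or more ASCII letters or digits.
CA_ISIL = re.compile(r"CA-[A-Za-z0-9]+")


def looks_like_canadian_isil(code: str) -> bool:
    """Return True when the whole string matches the assumed CA-XXXX shape."""
    return CA_ISIL.fullmatch(code) is not None


# Made-up codes for illustration only.
print(looks_like_canadian_isil("CA-ABCD"))  # True
print(looks_like_canadian_isil("NL-ABCD"))  # False
```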
Quality Assessment
Data Completeness
| Field | Canadian Coverage |
|---|---|
| Name | 100% |
| ISIL Code | 100% |
| City | 100% |
| Province | 100% |
| Institution Type | 100% |
| Geocoded (lat/lon) | 94.3% |
| GHCID | 100% |
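Coverage figures like these can be computed per field. The helper below is a sketch with assumed field names (`name`, `lat`) and made-up records; it treats `None` and the empty string as missing.

```python
def coverage(records, field):
    """Percentage of records whose `field` is present and not None or empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return round(100 * filled / len(records), 1)


# Made-up sample: three records, two of them geocoded.
sample = [
    {"name": "A", "lat": 45.4},
    {"name": "B", "lat": None},
    {"name": "C", "lat": 53.5},
]
print(coverage(sample, "name"))  # 100.0
print(coverage(sample, "lat"))   # 66.7
```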
Data Sources
Canadian ISIL Registry (TIER_1):
- Source: Library and Archives Canada
- URL: https://sigles-symbols.bac-lac.gc.ca
- Extraction date: 2025-11-18
- Records: 9,566 institutions
- Coverage: All 13 provinces/territories
Impact on Global Dataset
Size Increase
- Before: 13,415 institutions
- After: 22,981 institutions
- Growth: +9,566 institutions (+71.3%)
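The growth percentage follows directly from the before and after counts, as this short check shows (variable names are illustrative):

```python
before = 13_415
after = 22_981

added = after - before
growth_pct = round(100 * added / before, 1)

print(added, growth_pct)  # 9566 71.3
```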
TIER_1 Coverage
- Canada is now the largest single-country TIER_1 dataset
- 9,566 institutions with authoritative metadata
- Surpasses Argentina (2,156), the Netherlands (1,351), and Belgium (427)
Next Steps
Immediate (completed)
- Merge Canadian dataset with global dataset
- Deduplicate by ISIL code
- Export updated global dataset
Future Enhancements
- Improve geocoding to 98%+ (Nominatim fallback for 543 small communities)
- Add Wikidata linking for Canadian institutions
- Cross-reference with OpenStreetMap for address validation
- Create GeoJSON export for mapping
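As one possible shape for the GeoJSON export, the sketch below builds a FeatureCollection from geocoded records. The field names (`lat`, `lon`, `name`, `isil`) and the sample records are assumptions; note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection from records that have coordinates.

    "lat", "lon", "name" and "isil" are assumed field names. Records
    without coordinates are skipped.
    """
    features = []
    for r in records:
        if r.get("lat") is None or r.get("lon") is None:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
            "properties": {"name": r.get("name"), "isil": r.get("isil")},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample: one geocoded record, one without coordinates.
sample_records = [
    {"name": "Geocoded example", "isil": "CA-EXAMPLE1", "lat": 45.42, "lon": -75.70},
    {"name": "Ungeocoded example", "isil": "CA-EXAMPLE2", "lat": None, "lon": None},
]
collection = to_geojson(sample_records)
print(json.dumps(collection, indent=2))
```

Skipping ungeocoded records keeps the output valid for mapping tools; an alternative would be to emit features with a `null` geometry.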
Files Created
Input Files
- data/instances/canada/canadian_heritage_custodians_geocoded.json (15 MB)
  - 9,566 Canadian institutions
  - TIER_1_AUTHORITATIVE
  - 94.3% geocoded
- data/instances/all/globalglam-20251111.yaml (existing)
  - 13,415 global institutions
  - TIER_4_INFERRED (conversation extraction)
Output Files
- data/instances/all/globalglam-20251119-canada-integrated.yaml
  - 22,981 merged institutions
  - Sorted by country, then by name
  - Includes metadata header
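The country-then-name ordering can be expressed as a tuple sort key. This is a sketch only: the `country` and `name` field names are assumed, and the case-insensitive comparison of names is a choice made here, not something the report specifies.

```python
def sort_for_export(records):
    """Sort by country code, then case-insensitively by name (assumed fields)."""
    return sorted(records, key=lambda r: (r["country"], r["name"].casefold()))


# Made-up sample records.
unsorted_records = [
    {"country": "NL", "name": "b"},
    {"country": "CA", "name": "Zeta"},
    {"country": "CA", "name": "alpha"},
]
ordered = sort_for_export(unsorted_records)
print([(r["country"], r["name"]) for r in ordered])
# [('CA', 'alpha'), ('CA', 'Zeta'), ('NL', 'b')]
```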
Integration completed successfully ✅