# Canadian Dataset Integration Report

**Date**: 2025-11-19 11:26:41

**Operation**: Merge Canadian ISIL Registry (TIER_1) with Global Dataset

---

## Integration Summary

| Metric | Count |
|--------|-------|
| **Total institutions after merge** | 22,981 |
| **Global institutions (before)** | 13,415 |
| **Canadian institutions (TIER_1)** | 9,566 |
| **Overlapping ISIL codes** | 0 |
| **Canadian institutions added** | 9,566 |
| **Global institutions replaced** | 0 |
| **Global institutions retained** | 13,415 |

---

## Data Tier Breakdown

The integration follows the data tier hierarchy:

1. **TIER_1_AUTHORITATIVE** (Canadian ISIL Registry)
   - 9,566 institutions
   - Official government registry
   - Takes precedence over conversation-extracted data

2. **TIER_4_INFERRED** (Conversation extraction)
   - 13,415 institutions
   - NLP-extracted from heritage conversations
   - Retained where no TIER_1 data exists

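
The precedence rule above can be sketched as a small comparison helper. This is an illustration only: the `data_tier` field name, the `DataTier` enum, and the `preferred` function are assumptions made for the example and are not taken from the integration script. Only the two tiers named in this report are modelled.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Lower value means higher authority (assumed ordering for this sketch)."""
    TIER_1_AUTHORITATIVE = 1
    TIER_4_INFERRED = 4


def preferred(record_a: dict, record_b: dict) -> dict:
    """Return the record with the more authoritative tier; ties keep record_a."""
    tier_a = DataTier[record_a["data_tier"]]  # hypothetical field name
    tier_b = DataTier[record_b["data_tier"]]
    return record_a if tier_a <= tier_b else record_b


# Made-up sample records: a registry entry and a conversation-extracted entry.
registry = {"name": "Example Library", "data_tier": "TIER_1_AUTHORITATIVE"}
extracted = {"name": "Example Libary", "data_tier": "TIER_4_INFERRED"}
winner = preferred(extracted, registry)
```
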
---

## Geographic Coverage

### Before Integration
- **Countries**: ~60+ (from conversations)
- **Canadian institutions**: 0

### After Integration
- **Countries**: ~61 (added Canada)
- **Canadian institutions**: 9,566
- **Total institutions**: 22,981

---

## Deduplication Details

### ISIL Code Matching

No overlapping ISIL codes were found, so no records had to be replaced. The merge applied the following rule:

- **Strategy**: TIER_1 (Canadian registry) replaces TIER_4 (conversations)
- **Reason**: Government registries are authoritative sources
- **Result**: 0 global records replaced with Canadian TIER_1 data

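
A minimal sketch of ISIL-keyed deduplication, under stated assumptions: records are plain dictionaries, the ISIL code lives in a hypothetical `isil` field, codes are unique within each input, and global records without an ISIL code are simply carried over. The function name `merge_by_isil` and the sample codes are made up for illustration; the actual merge script may differ.

```python
def merge_by_isil(global_records, canadian_records):
    """Merge two record lists by ISIL code; Canadian (TIER_1) records win on overlap."""
    by_isil = {}
    without_isil = []
    for record in global_records:
        code = record.get("isil")  # hypothetical field name
        if code:
            by_isil[code] = record
        else:
            # Conversation-extracted records may lack an ISIL code; keep them as-is.
            without_isil.append(record)

    replaced = 0
    for record in canadian_records:
        if record["isil"] in by_isil:
            replaced += 1
        by_isil[record["isil"]] = record  # TIER_1 overwrites any TIER_4 entry

    merged = without_isil + list(by_isil.values())
    stats = {
        "overlapping": replaced,
        "added": len(canadian_records) - replaced,
        "replaced": replaced,
        "retained": len(global_records) - replaced,
        "total": len(merged),
    }
    return merged, stats


# Made-up sample data with one overlapping code.
sample_global = [
    {"name": "A", "isil": "NL-0001"},
    {"name": "B"},
    {"name": "C (extracted)", "isil": "CA-0001"},
]
sample_canadian = [
    {"name": "C", "isil": "CA-0001"},
    {"name": "D", "isil": "CA-0002"},
]
merged, stats = merge_by_isil(sample_global, sample_canadian)
```

With zero overlaps, as in this integration, `replaced` stays at 0 and the total is simply the sum of both inputs.
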
### New Additions

9,566 Canadian institutions were added to the global dataset:

- All have ISIL codes (CA-XXXX format)
- 94.3% have geocoded coordinates
- Complete metadata from Library and Archives Canada

---

## Quality Assessment

### Data Completeness

| Field | Canadian Coverage |
|-------|-------------------|
| Name | 100% |
| ISIL Code | 100% |
| City | 100% |
| Province | 100% |
| Institution Type | 100% |
| Geocoded (lat/lon) | 94.3% |
| GHCID | 100% |

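
Coverage figures like those in the table can be computed with a helper along these lines. This is a sketch that assumes each record is a dictionary and that a field counts as covered when it is present and neither `None` nor an empty string; the field names in the sample are hypothetical.

```python
def field_coverage(records, fields):
    """Return the percentage of records with a non-empty value for each field."""
    total = len(records)
    coverage = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = round(100 * filled / total, 1) if total else 0.0
    return coverage


# Made-up sample: every record has a name, two of three have a latitude.
sample = [
    {"name": "A", "latitude": 45.0},
    {"name": "B", "latitude": None},
    {"name": "C", "latitude": 49.3},
]
result = field_coverage(sample, ["name", "latitude"])
```
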
### Data Sources

**Canadian ISIL Registry** (TIER_1):
- Source: Library and Archives Canada
- URL: https://sigles-symbols.bac-lac.gc.ca
- Extraction date: 2025-11-18
- Records: 9,566 institutions
- Coverage: All 13 provinces/territories

---

## Impact on Global Dataset

### Size Increase
- Before: 13,415 institutions
- After: 22,981 institutions
- Growth: +9,566 institutions (+71.3%)

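
The growth figure follows directly from the counts reported above. As a quick arithmetic check (the variable names are chosen for this example only):

```python
before = 13_415  # global institutions before the merge
added = 9_566    # Canadian institutions added

after = before + added
growth_pct = 100 * added / before  # growth relative to the pre-merge size

print(f"After: {after:,} institutions, growth: +{growth_pct:.1f}%")
```
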
### TIER_1 Coverage
- Canada is now the **largest single-country TIER_1 dataset**
- 9,566 institutions with authoritative metadata
- Surpasses Netherlands (1,351), Belgium (427), Argentina (2,156)

---

## Next Steps

### Immediate
- [x] Merge Canadian dataset with global dataset
- [x] Deduplicate by ISIL code
- [x] Export updated global dataset

### Future Enhancements
- [ ] Improve geocoding to 98%+ (Nominatim fallback for 543 small communities)
- [ ] Add Wikidata linking for Canadian institutions
- [ ] Cross-reference with OpenStreetMap for address validation
- [ ] Create GeoJSON export for mapping

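
One possible shape for the planned GeoJSON export, offered as a sketch rather than the project's implementation. The `latitude`, `longitude`, `name`, and `isil` field names are assumptions, and records without coordinates are skipped. GeoJSON stores point coordinates in longitude, latitude order.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection from records that have coordinates."""
    features = []
    for record in records:
        lat = record.get("latitude")   # hypothetical field names
        lon = record.get("longitude")
        if lat is None or lon is None:
            continue  # not yet geocoded; leave out of the map export
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": record.get("name"), "isil": record.get("isil")},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample: one geocoded record and one without coordinates.
sample = [
    {"name": "A", "isil": "CA-0001", "latitude": 45.42, "longitude": -75.69},
    {"name": "B", "isil": "CA-0002"},
]
collection = to_geojson(sample)
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
```
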
---

## Files Created

### Input Files
- `data/instances/canada/canadian_heritage_custodians_geocoded.json` (15 MB)
  - 9,566 Canadian institutions
  - TIER_1_AUTHORITATIVE
  - 94.3% geocoded

- `data/instances/all/globalglam-20251111.yaml` (existing)
  - 13,415 global institutions
  - TIER_4_INFERRED (conversation extraction)

### Output Files
- `data/instances/all/globalglam-20251119-canada-integrated.yaml`
  - 22,981 merged institutions
  - Sorted by country, then by name
  - Includes metadata header

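
The "sorted by country, then by name" ordering of the output file can be expressed as a two-part sort key. This sketch assumes hypothetical `country` and `name` fields and compares names case-insensitively, which may differ from what the export script does.

```python
def country_then_name(record):
    """Sort key: country first, then case-insensitive name."""
    return (record.get("country", ""), record.get("name", "").casefold())


# Made-up sample records.
sample = [
    {"country": "NL", "name": "Rijksmuseum"},
    {"country": "CA", "name": "vancouver archives"},
    {"country": "CA", "name": "Banff Library"},
]
ordered = sorted(sample, key=country_then_name)
```
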
---

**Integration completed successfully** ✅