glam/CANADIAN_INTEGRATION_REPORT.md
2025-11-19 23:25:22 +01:00

# Canadian Dataset Integration Report
**Date**: 2025-11-19 11:26:41

**Operation**: Merge Canadian ISIL Registry (TIER_1) with Global Dataset

---
## Integration Summary
| Metric | Count |
|--------|-------|
| **Total institutions after merge** | 22,981 |
| **Global institutions (before)** | 13,415 |
| **Canadian institutions (TIER_1)** | 9,566 |
| **Overlapping ISIL codes** | 0 |
| **Canadian institutions added** | 9,566 |
| **Global institutions replaced** | 0 |
| **Global institutions retained** | 13,415 |
---
## Data Tier Breakdown
The integration follows the data tier hierarchy:
1. **TIER_1_AUTHORITATIVE** (Canadian ISIL Registry)
   - 9,566 institutions
   - Official government registry
   - Takes precedence over conversation-extracted data
2. **TIER_4_INFERRED** (Conversation extraction)
   - 13,415 institutions
   - NLP-extracted from heritage conversations
   - Retained where no TIER_1 data exists
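
The precedence rule above can be sketched as a rank lookup. This is a minimal illustration, not the project's actual code: the `data_tier` field name, the `TIER_RANK` table, and the `prefer` helper are assumptions; only the two tier labels come from this report.

```python
# Hypothetical rank table: a lower number means a more authoritative source.
# Only the two tiers named in this report are listed.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_4_INFERRED": 4,
}


def prefer(existing: dict, incoming: dict) -> dict:
    """Return whichever record has the more authoritative tier.

    Ties keep the existing record. "data_tier" is an assumed field name.
    """
    if TIER_RANK[incoming["data_tier"]] < TIER_RANK[existing["data_tier"]]:
        return incoming
    return existing


# Made-up records for illustration only.
inferred = {"name": "Example Library", "data_tier": "TIER_4_INFERRED"}
registry = {"name": "Example Library", "data_tier": "TIER_1_AUTHORITATIVE"}
winner = prefer(inferred, registry)  # the registry record wins
```

Keeping the ranks in one table means intermediate tiers could be added later without touching the comparison logic.
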
---
## Geographic Coverage
### Before Integration
- **Countries**: ~60+ (from conversations)
- **Canadian institutions**: 0
### After Integration
- **Countries**: ~61 (added Canada)
- **Canadian institutions**: 9,566
- **Total institutions**: 22,981
---
## Deduplication Details
### ISIL Code Matching
No overlapping ISIL codes were found, so no records required resolution. The rule that would apply to any overlap:
- **Strategy**: TIER_1 (Canadian registry) replaces TIER_4 (conversations)
- **Reason**: Government registries are authoritative sources
- **Result**: 0 global records replaced with Canadian TIER_1 data
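
The replacement strategy described above might look roughly like the following. This is a sketch under stated assumptions: records are plain dicts, `isil` is an assumed field name, `merge_by_isil` is a hypothetical helper, and the sample codes are made up.

```python
def merge_by_isil(global_records, canadian_records):
    """Merge two record lists, letting Canadian records win on ISIL overlap.

    Global records without an ISIL code can never overlap and are always kept.
    """
    canadian_codes = {r["isil"] for r in canadian_records}
    retained = []
    replaced = 0
    for record in global_records:
        if record.get("isil") in canadian_codes:
            replaced += 1  # dropped in favor of the TIER_1 record
        else:
            retained.append(record)
    merged = retained + list(canadian_records)
    stats = {
        "replaced": replaced,
        "retained": len(retained),
        "added": len(canadian_records),
        "total": len(merged),
    }
    return merged, stats


# Made-up sample data: one overlapping code, one global record without a code.
global_records = [
    {"name": "Museum A", "isil": None},
    {"name": "Library B", "isil": "CA-EXAMPLE1"},
]
canadian_records = [
    {"name": "Library B (registry)", "isil": "CA-EXAMPLE1"},
    {"name": "Archive C", "isil": "CA-EXAMPLE2"},
]
merged, stats = merge_by_isil(global_records, canadian_records)
```

In the run reported here no code overlapped, so the replaced count is 0 and the total is simply 13,415 + 9,566 = 22,981.
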
### New Additions
9,566 Canadian institutions were added to the global dataset:
- All have ISIL codes (CA-XXXX format)
- 94.3% have geocoded coordinates
- Complete metadata from Library and Archives Canada
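
A quick sanity check on the `CA-` prefix could be written as below. The pattern is deliberately loose and is an assumption for illustration, not the registry's own validation rule; `looks_like_canadian_isil` is a hypothetical helper and the sample codes are made up.

```python
import re

# Assumed, loose pattern: "CA-" followed by at least one letter, digit,
# colon, slash, or hyphen. Real registry rules may be stricter.
CA_ISIL = re.compile(r"CA-[A-Za-z0-9:/-]+")


def looks_like_canadian_isil(code: str) -> bool:
    """Return True if the whole string matches the loose CA- pattern."""
    return CA_ISIL.fullmatch(code) is not None


checks = {
    code: looks_like_canadian_isil(code)
    for code in ["CA-ABCD", "NL-ABCD", "CA-"]
}
```
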
---
## Quality Assessment
### Data Completeness
| Field | Canadian Coverage |
|-------|-------------------|
| Name | 100% |
| ISIL Code | 100% |
| City | 100% |
| Province | 100% |
| Institution Type | 100% |
| Geocoded (lat/lon) | 94.3% |
| GHCID | 100% |
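
Coverage figures like those in the table can be computed by counting non-empty values per field. The sketch below assumes records are dicts and treats `None` and the empty string as missing; `field_coverage` and the field names are hypothetical, and the sample data is made up.

```python
def field_coverage(records, fields):
    """Return the percentage of records with a non-empty value per field."""
    total = len(records)
    coverage = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = round(100 * filled / total, 1) if total else 0.0
    return coverage


# Four made-up records, one of them without coordinates.
sample = [
    {"name": "A", "city": "Ottawa", "latitude": 45.42},
    {"name": "B", "city": "Iqaluit", "latitude": None},
    {"name": "C", "city": "Halifax", "latitude": 44.65},
    {"name": "D", "city": "Regina", "latitude": 50.45},
]
coverage = field_coverage(sample, ["name", "city", "latitude"])
# coverage == {"name": 100.0, "city": 100.0, "latitude": 75.0}
```
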
### Data Sources
**Canadian ISIL Registry** (TIER_1):
- Source: Library and Archives Canada
- URL: https://sigles-symbols.bac-lac.gc.ca
- Extraction date: 2025-11-18
- Records: 9,566 institutions
- Coverage: All 13 provinces/territories
---
## Impact on Global Dataset
### Size Increase
- Before: 13,415 institutions
- After: 22,981 institutions
- Growth: +9,566 institutions (+71.3%)
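
The growth figure follows directly from the two counts; a short check using only numbers from this report:

```python
before = 13_415  # global institutions before the merge
added = 9_566    # Canadian institutions added

after = before + added                        # 22981
growth_pct = round(100 * added / before, 1)   # 71.3
```
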
### TIER_1 Coverage
- Canada is now the **largest single-country TIER_1 dataset**
- 9,566 institutions with authoritative metadata
- Surpasses Argentina (2,156), the Netherlands (1,351), and Belgium (427)
---
## Next Steps
### Immediate
- [x] Merge Canadian dataset with global dataset
- [x] Deduplicate by ISIL code
- [x] Export updated global dataset
### Future Enhancements
- [ ] Improve geocoding to 98%+ (Nominatim fallback for 543 small communities)
- [ ] Add Wikidata linking for Canadian institutions
- [ ] Cross-reference with OpenStreetMap for address validation
- [ ] Create GeoJSON export for mapping
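
For the GeoJSON export item above, one possible shape is sketched below. Nothing here reflects an existing implementation: `to_geojson` is a hypothetical helper, `latitude`, `longitude`, `name`, and `isil` are assumed field names, and the sample records are made up. Records without coordinates are skipped.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection from records that have coordinates."""
    features = []
    for r in records:
        lat, lon = r.get("latitude"), r.get("longitude")
        if lat is None or lon is None:
            continue  # not geocoded yet
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": r.get("name"), "isil": r.get("isil")},
        })
    return {"type": "FeatureCollection", "features": features}


sample = [
    {"name": "Library A", "isil": "CA-EXAMPLE1", "latitude": 45.4, "longitude": -75.7},
    {"name": "Archive B", "isil": "CA-EXAMPLE2", "latitude": None, "longitude": None},
]
collection = to_geojson(sample)
geojson_text = json.dumps(collection, indent=2)
```
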
---
## Files Created
### Input Files
- `data/instances/canada/canadian_heritage_custodians_geocoded.json` (15 MB)
  - 9,566 Canadian institutions
  - TIER_1_AUTHORITATIVE
  - 94.3% geocoded
- `data/instances/all/globalglam-20251111.yaml` (existing)
  - 13,415 global institutions
  - TIER_4_INFERRED (conversation extraction)
### Output Files
- `data/instances/all/globalglam-20251119-canada-integrated.yaml`
  - 22,981 merged institutions
  - Sorted by country, then by name
  - Includes metadata header
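
The "country, then name" ordering of the output file can be expressed as a two-part sort key. This is a sketch: `country` and `name` are assumed field names, `sort_institutions` is a hypothetical helper, and the case-insensitive name comparison is an assumption rather than something this report specifies.

```python
def sort_institutions(records):
    """Sort by country code first, then by name (case-insensitive)."""
    return sorted(records, key=lambda r: (r["country"], r["name"].casefold()))


# Made-up records for illustration only.
sample = [
    {"country": "NL", "name": "Rijksmuseum"},
    {"country": "CA", "name": "Yukon Archives"},
    {"country": "CA", "name": "banff Library"},
]
ordered = sort_institutions(sample)
ordered_names = [r["name"] for r in ordered]
```
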
---
**Integration completed successfully**