glam/CANADIAN_INTEGRATION_REPORT.md
2025-11-19 23:25:22 +01:00

Canadian Dataset Integration Report

Date: 2025-11-19 11:26:41
Operation: Merge Canadian ISIL Registry (TIER_1) with Global Dataset


Integration Summary

| Metric | Count |
| --- | --- |
| Total institutions after merge | 22,981 |
| Global institutions (before) | 13,415 |
| Canadian institutions (TIER_1) | 9,566 |
| Overlapping ISIL codes | 0 |
| Canadian institutions added | 9,566 |
| Global institutions replaced | 0 |
| Global institutions retained | 13,415 |
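
The figures in this summary can be cross-checked against each other. The sketch below is an illustrative consistency check, not part of the integration script; the variable names are assumptions, and the identity relies on the replacement rule described under Deduplication Details (an overlapping code replaces a global record instead of adding a new one).

```python
# Figures copied from the Integration Summary.
global_before = 13_415
canadian_tier1 = 9_566
overlapping_isil = 0

# Each overlapping ISIL code replaces one global record rather than adding one.
replaced = overlapping_isil
added = canadian_tier1 - overlapping_isil
retained = global_before - replaced

# Overlapping records are counted once in the merged total.
total_after = global_before + canadian_tier1 - overlapping_isil

print(total_after, added, replaced, retained)
```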

Data Tier Breakdown

The integration follows the data tier hierarchy:

  1. TIER_1_AUTHORITATIVE (Canadian ISIL Registry)

    • 9,566 institutions
    • Official government registry
    • Takes precedence over conversation-extracted data
  2. TIER_4_INFERRED (Conversation extraction)

    • 13,415 institutions
    • NLP-extracted from heritage conversations
    • Retained where no TIER_1 data exists
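
One way to express this precedence in code is a rank table, sketched below. The `data_tier` field name, the numeric ranks, and the `preferred` helper are assumptions made for illustration; the actual dataset schema is not shown in this report.

```python
# Lower rank means more authoritative. Only the two tiers named in this
# report are listed; the numeric ranks are an illustrative assumption.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_4_INFERRED": 4,
}


def preferred(existing, incoming):
    """Return the record from the more authoritative tier (existing wins ties)."""
    if TIER_RANK[incoming["data_tier"]] < TIER_RANK[existing["data_tier"]]:
        return incoming
    return existing


# Hypothetical records describing the same institution.
conversation = {"name": "Example Library", "data_tier": "TIER_4_INFERRED"}
registry = {"name": "Example Library", "data_tier": "TIER_1_AUTHORITATIVE"}

winner = preferred(conversation, registry)
print(winner["data_tier"])
```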

Geographic Coverage

Before Integration

  • Countries: 60+ (from conversations)
  • Canadian institutions: 0

After Integration

  • Countries: ~61 (added Canada)
  • Canadian institutions: 9,566
  • Total institutions: 22,981

Deduplication Details

ISIL Code Matching

No overlapping ISIL codes were found, so no records had to be resolved. Had any overlapped, the following rule would have applied:

  • Strategy: TIER_1 (Canadian registry) replaces TIER_4 (conversations)
  • Reason: Government registries are authoritative sources
  • Result: 0 global records replaced with Canadian TIER_1 data
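
The replacement rule can be sketched as a dictionary merge keyed on ISIL code. This is a minimal illustration, not the script that produced this report: the `isil` field name and the `merge_by_isil` helper are hypothetical, and the sketch assumes ISIL codes are unique within each input and that global records without a code are simply kept.

```python
def merge_by_isil(global_records, canadian_records):
    """Merge two record lists; Canadian (TIER_1) records win on ISIL collisions."""
    by_code = {}
    without_code = []  # global records with no ISIL code cannot collide

    for rec in global_records:
        code = rec.get("isil")
        if code:
            by_code[code] = rec
        else:
            without_code.append(rec)

    stats = {"replaced": 0, "added": 0}
    for rec in canadian_records:
        code = rec["isil"]
        if code in by_code:
            stats["replaced"] += 1  # TIER_1 replaces the existing record
        else:
            stats["added"] += 1
        by_code[code] = rec

    return list(by_code.values()) + without_code, stats


# Hypothetical sample data: one collision, one new code, one record without a code.
sample_global = [
    {"isil": "CA-TEST1", "name": "Old name"},
    {"name": "No code on record"},
]
sample_canadian = [
    {"isil": "CA-TEST1", "name": "Registry name"},
    {"isil": "CA-TEST2", "name": "Another institution"},
]
merged, stats = merge_by_isil(sample_global, sample_canadian)
print(stats)
```

With zero overlapping codes, as in this integration, `replaced` stays at 0 and every Canadian record is counted under `added`.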

New Additions

9,566 Canadian institutions added to global dataset:

  • All have ISIL codes (CA-XXXX format)
  • 94.3% have geocoded coordinates
  • Complete metadata from Library and Archives Canada

Quality Assessment

Data Completeness

| Field | Canadian Coverage |
| --- | --- |
| Name | 100% |
| ISIL Code | 100% |
| City | 100% |
| Province | 100% |
| Institution Type | 100% |
| Geocoded (lat/lon) | 94.3% |
| GHCID | 100% |
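
Coverage percentages like these can be computed as the share of records for which a field is present. The sketch below shows one way to do that; the `latitude` and `longitude` field names and both helper functions are assumptions, and the sample records are invented for the example.

```python
def coverage(records, is_present):
    """Percentage of records for which is_present(record) is true, to one decimal."""
    if not records:
        return 0.0
    hits = sum(1 for rec in records if is_present(rec))
    return round(100 * hits / len(records), 1)


def has_coordinates(rec):
    """A record counts as geocoded only if both coordinates are set."""
    return rec.get("latitude") is not None and rec.get("longitude") is not None


# Hypothetical sample: three of four records carry coordinates.
sample = [
    {"name": "A", "latitude": 45.4, "longitude": -75.7},
    {"name": "B", "latitude": 43.7, "longitude": -79.4},
    {"name": "C", "latitude": 49.3, "longitude": -123.1},
    {"name": "D", "latitude": None, "longitude": None},
]
geocoded_pct = coverage(sample, has_coordinates)
print(geocoded_pct)
```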

Data Sources

Canadian ISIL Registry (TIER_1):


Impact on Global Dataset

Size Increase

  • Before: 13,415 institutions
  • After: 22,981 institutions
  • Growth: +9,566 institutions (+71.3%)
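
The growth percentage is measured relative to the pre-merge size. A short check of the arithmetic, using only the figures above:

```python
before = 13_415
after = 22_981

growth = after - before
growth_pct = round(100 * growth / before, 1)  # relative to the pre-merge size

print(f"+{growth:,} institutions (+{growth_pct}%)")
```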

TIER_1 Coverage

  • Canada is now the largest single-country TIER_1 dataset
  • 9,566 institutions with authoritative metadata
  • Surpasses Argentina (2,156), Netherlands (1,351), and Belgium (427)

Next Steps

Immediate (completed in this operation)

  • Merged Canadian dataset with global dataset
  • Deduplicated by ISIL code
  • Exported updated global dataset

Future Enhancements

  • Improve geocoding to 98%+ (Nominatim fallback for 543 small communities)
  • Add Wikidata linking for Canadian institutions
  • Cross-reference with OpenStreetMap for address validation
  • Create GeoJSON export for mapping
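
As a starting point for the GeoJSON export, the sketch below converts geocoded records into a FeatureCollection. The record field names (`name`, `isil`, `latitude`, `longitude`) and the `to_geojson` helper are assumptions, since the dataset schema is not shown here. Note that GeoJSON stores positions as [longitude, latitude].

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection, skipping records without coordinates."""
    features = []
    for rec in records:
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if lat is None or lon is None:
            continue  # not geocoded yet
        features.append({
            "type": "Feature",
            # GeoJSON position order is longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": rec.get("name"), "isil": rec.get("isil")},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample: one geocoded record, one without coordinates.
sample = [
    {"name": "Example Archive", "isil": "CA-TEST1", "latitude": 45.4, "longitude": -75.7},
    {"name": "Ungeocoded Museum", "isil": "CA-TEST2"},
]
collection = to_geojson(sample)
print(json.dumps(collection, indent=2))
```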

Files Created

Input Files

  • data/instances/canada/canadian_heritage_custodians_geocoded.json (15 MB)

    • 9,566 Canadian institutions
    • TIER_1_AUTHORITATIVE
    • 94.3% geocoded
  • data/instances/all/globalglam-20251111.yaml (existing)

    • 13,415 global institutions
    • TIER_4_INFERRED (conversation extraction)

Output Files

  • data/instances/all/globalglam-20251119-canada-integrated.yaml
    • 22,981 merged institutions
    • Sorted by country, then by name
    • Includes metadata header
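
The sort order described for the output file can be reproduced with a two-part key. This is a sketch under assumed field names (`country`, `name`); the case-insensitive name comparison is also an assumption, as the report does not say how names are compared.

```python
def sort_key(rec):
    """Order by country first, then by name (compared case-insensitively)."""
    return (rec.get("country", ""), rec.get("name", "").casefold())


# Hypothetical sample records.
sample = [
    {"country": "CA", "name": "beta Library"},
    {"country": "AR", "name": "Museo Ejemplo"},
    {"country": "CA", "name": "Alpha Archive"},
]
ordered = sorted(sample, key=sort_key)
print([(rec["country"], rec["name"]) for rec in ordered])
```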

Integration completed successfully