feat: add PostGIS international boundary architecture

Add schema and tooling for storing administrative boundaries in PostGIS:
- 002_postgis_boundaries.sql: Complete PostGIS schema with:
  - boundary_countries (ISO 3166-1)
  - boundary_admin1 (states/provinces/regions)
  - boundary_admin2 (municipalities/districts)
  - boundary_historical (HALC pre-modern territories)
  - custodian_service_areas (computed werkgebied geometries)
  - geonames_settlements (reverse geocoding)
  - Spatial functions: find_admin_for_point, find_nearest_settlement
  - Views for API access

- load_boundaries_postgis.py: Python loader supporting:
  - GADM (Global Administrative Areas) - primary global source
  - CBS (Dutch municipality boundaries)
  - GeoNames settlements for reverse geocoding
  - Cached downloads and upsert logic

- POSTGIS_BOUNDARY_ARCHITECTURE.md: Design documentation

This replaces the static GeoJSON approach for international coverage.
kempersc 2025-12-07 14:34:39 +01:00
parent 0b06af0fb6
commit 83ab098cf7
3 changed files with 1363 additions and 0 deletions


@ -0,0 +1,189 @@
# PostGIS International Boundary Architecture
## Overview
This document describes the PostGIS-based architecture for storing and querying international administrative boundaries to compute heritage custodian service areas ("werkgebied").
## Problem Statement
The current implementation uses static GeoJSON files for Netherlands-only boundaries:
- `netherlands_provinces.geojson` - 12 provinces
- `netherlands_municipalities.geojson` - ~350 municipalities
- `netherlands_historical_1500.geojson` - HALC historical territories

This approach **does not scale** for international coverage (Japan, Czechia, Germany, Belgium, Brazil, etc.) because:
1. GeoJSON files are too large for client-side loading (Germany has 400+ districts)
2. No server-side spatial queries (point-in-polygon, intersection)
3. No temporal versioning for boundary changes
4. No consistent administrative hierarchy across countries
## Solution Architecture
### Database Schema
```
┌─────────────────────────────────────────────────────────────────┐
│                        PostGIS Database                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────┐     ┌─────────────────────┐            │
│  │ boundary_countries  │────▶│ boundary_admin1     │            │
│  │ (ISO 3166-1)        │     │ (States/Provinces)  │            │
│  │ - iso_a2: NL, DE... │     │ - iso_3166_2        │            │
│  │ - geom: POLYGON     │     │ - geom: POLYGON     │            │
│  └─────────────────────┘     └──────────┬──────────┘            │
│                                         │                       │
│                                         ▼                       │
│                              ┌─────────────────────┐            │
│                              │ boundary_admin2     │            │
│                              │ (Municipalities)    │            │
│                              │ - geonames_id       │            │
│                              │ - cbs_gemeente_code │            │
│                              │ - geom: POLYGON     │            │
│                              └──────────┬──────────┘            │
│                                         │                       │
│  ┌─────────────────────┐                │                       │
│  │ boundary_historical │                │                       │
│  │ (HALC, pre-modern)  │                │                       │
│  │ - halc_adm1_code    │                │                       │
│  │ - period_start/end  │                │                       │
│  └─────────────────────┘                │                       │
│                                         ▼                       │
│                           ┌───────────────────────────┐         │
│                           │ custodian_service_areas   │         │
│                           │ (Computed werkgebied)     │         │
│                           │ - ghcid: NL-NH-HAA-A-NHA  │         │
│                           │ - admin2_ids: [1,2,3...]  │         │
│                           │ - geom: MULTIPOLYGON      │         │
│                           └───────────────────────────┘         │
│                                                                 │
│  ┌─────────────────────┐                                        │
│  │ geonames_settlements│ (For reverse geocoding)                │
│  │ - geonames_id       │                                        │
│  │ - name, ascii_name  │                                        │
│  │ - geom: POINT       │                                        │
│  └─────────────────────┘                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Data Sources
| Source | Coverage | License | Admin Levels | Use Case |
|--------|----------|---------|--------------|----------|
| **GADM** | Global | GADM license (non-commercial use, no redistribution) | 0, 1, 2, 3+ | Primary global boundaries |
| **Natural Earth** | Global | Public Domain | 0, 1 | Simplified country shapes |
| **CBS** | Netherlands | CC-BY-4.0 | 2 (gemeente) | Official NL municipalities |
| **HALC** | Low Countries | Academic | Historical | Pre-1800 territories |
| **OSM** | Global | ODbL | Variable | Crowdsourced, current |
| **Eurostat** | Europe | Eurostat | NUTS/LAU | EU statistical regions |
### API Endpoints
The PostGIS database will be exposed through a REST API:
```
GET /api/boundaries/countries
GET /api/boundaries/countries/{iso_a2}
GET /api/boundaries/admin1/{iso_a2}
GET /api/boundaries/admin2/{iso_a2}/{admin1_code}
GET /api/boundaries/point?lon={lon}&lat={lat}
GET /api/boundaries/service-area/{ghcid}
GET /api/boundaries/service-area/{ghcid}/geojson
```
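As a rough illustration of how a client might address these routes, the sketch below builds two of the URLs with the Python standard library. The host `https://example.org` and both helper names are hypothetical; only the path templates come from the endpoint list.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://example.org"  # hypothetical host, not part of this design


def service_area_url(ghcid: str, geojson: bool = True) -> str:
    """Build the service-area URL for a custodian GHCID."""
    path = f"/api/boundaries/service-area/{quote(ghcid, safe='')}"
    return API_BASE + path + ("/geojson" if geojson else "")


def point_lookup_url(lon: float, lat: float) -> str:
    """Build the point lookup URL (longitude first, as in the route)."""
    query = urlencode({"lon": lon, "lat": lat})
    return f"{API_BASE}/api/boundaries/point?{query}"
```

Percent-encoding the GHCID keeps the path valid should an identifier ever contain characters outside the unreserved set.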
### Frontend Integration
The frontend will:
1. Fetch service area GeoJSON via API (not static files)
2. Use MapLibre vector tiles for admin boundaries (optional optimization)
3. Cache frequently-accessed boundaries in browser
4. Support historical boundary display with temporal filtering
```typescript
// Example: Fetch service area for a custodian
const response = await fetch(`/api/boundaries/service-area/${ghcid}/geojson`);
const geojson = await response.json();
map.getSource('werkgebied').setData(geojson);
```
### Temporal Versioning
Boundaries change over time (municipal mergers, etc.). The schema supports:
```sql
-- Current boundary (valid_to IS NULL)
SELECT * FROM boundary_admin2 WHERE cbs_gemeente_code = 'GM0363' AND valid_to IS NULL;
-- Historical boundary (before merger)
SELECT * FROM boundary_admin2 WHERE cbs_gemeente_code = 'GM0363' AND valid_to = '2001-01-01';
```
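To find the version in force on an arbitrary date, rather than matching an exact `valid_to`, the interval test can be written out explicitly. The Python sketch below mirrors the `valid_from`/`valid_to` columns. The `BoundaryVersion` class, the `version_on` helper, and the half-open interval convention (`valid_to` exclusive) are assumptions made for illustration, not part of the schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass
class BoundaryVersion:
    cbs_gemeente_code: str
    valid_from: Optional[date]  # None = open start
    valid_to: Optional[date]    # None = current version


def version_on(versions: Sequence[BoundaryVersion], day: date) -> Optional[BoundaryVersion]:
    """Return the version whose [valid_from, valid_to) interval contains `day`."""
    for v in versions:
        starts_ok = v.valid_from is None or v.valid_from <= day
        ends_ok = v.valid_to is None or day < v.valid_to
        if starts_ok and ends_ok:
            return v
    return None
```

Under the same assumption, the predicate translates to SQL as `(valid_from IS NULL OR valid_from <= :day) AND (valid_to IS NULL OR :day < valid_to)`.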
### Service Area Computation
Service areas are computed from admin units:
```sql
-- Compute service area for Noord-Hollands Archief (serves Haarlem + region)
SELECT ST_Union(geom) AS service_area_geom
FROM boundary_admin2
WHERE id IN (SELECT unnest(admin2_ids) FROM custodian_service_areas WHERE ghcid = 'NL-NH-HAA-A-NHA');
```
Or pre-computed and stored:
```sql
-- Pre-compute and cache
INSERT INTO custodian_service_areas (ghcid, service_area_name, service_area_type, geom, admin2_ids)
VALUES (
'NL-NH-HAA-A-NHA',
'Noord-Hollands Archief Werkgebied',
'REGIONAL', -- placeholder value; service_area_type is NOT NULL and takes a ServiceAreaTypeEnum value
compute_service_area_geometry(ARRAY[123, 124, 125, 126]), -- admin2 IDs
ARRAY[123, 124, 125, 126]
);
```
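From the loader side, the same pre-computation could be issued as a parameterized statement. The sketch below only builds the SQL text and parameter tuple, so it runs without a database. The helper name `build_service_area_insert` and the `'REGIONAL'` type value in the usage note are hypothetical. Wrapping the result in `ST_Multi` is an assumption that guards against `ST_Union` returning a single `POLYGON` for a `MULTIPOLYGON` column.

```python
from typing import Sequence, Tuple

INSERT_SERVICE_AREA_SQL = """
INSERT INTO custodian_service_areas
    (ghcid, service_area_name, service_area_type, geom, admin2_ids)
VALUES (%s, %s, %s,
        ST_Multi(compute_service_area_geometry(%s::integer[])),
        %s::integer[])
"""


def build_service_area_insert(ghcid: str, name: str, area_type: str,
                              admin2_ids: Sequence[int]) -> Tuple[str, tuple]:
    """Return (sql, params) for caching a service area built from admin2 units."""
    ids = sorted(set(admin2_ids))  # de-duplicate so the stored array is stable
    if not ids:
        raise ValueError("at least one admin2 ID is required")
    # The ID list is passed twice: once to compute the geometry, once to store it.
    return INSERT_SERVICE_AREA_SQL, (ghcid, name, area_type, ids, ids)
```

With a psycopg2 cursor this would be executed as `cur.execute(*build_service_area_insert('NL-NH-HAA-A-NHA', 'Noord-Hollands Archief Werkgebied', 'REGIONAL', [123, 124, 125, 126]))`; psycopg2 adapts Python lists to PostgreSQL arrays.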
## Implementation Plan
### Phase 1: Schema & Initial Data (Current Sprint)
- [x] Create PostGIS schema (`002_postgis_boundaries.sql`)
- [x] Create boundary loading script (`load_boundaries_postgis.py`)
- [ ] Load Netherlands boundaries (CBS + provinces)
- [ ] Load HALC historical boundaries
- [ ] Migrate existing GeoJSON data
### Phase 2: International Expansion
- [ ] Load GADM for priority countries: JP, CZ, DE, BE, CH, AT
- [ ] Load GeoNames settlements for reverse geocoding
- [ ] Create API endpoints for boundary queries
- [ ] Update frontend to use API instead of static files
### Phase 3: Service Area Management
- [ ] Compute service areas for existing custodians
- [ ] Create admin UI for service area editing
- [ ] Implement temporal boundary display
- [ ] Add vector tile generation (optional optimization)
## Files Created
| File | Description |
|------|-------------|
| `infrastructure/sql/002_postgis_boundaries.sql` | PostGIS schema for boundaries |
| `scripts/load_boundaries_postgis.py` | Python script to load boundary data |
| `docs/POSTGIS_BOUNDARY_ARCHITECTURE.md` | This document |
## Dependencies
- PostgreSQL 14+ with PostGIS 3.3+
- Python: `psycopg2`, `geopandas`, `shapely`
- GADM data (downloaded on demand)
- CBS GeoJSON (existing in `frontend/public/data/`)
## Migration from Static GeoJSON
The current static GeoJSON approach will be **deprecated** but **not immediately removed**:
1. PostGIS becomes the source of truth for boundaries
2. API serves boundary GeoJSON on demand
3. Static files remain as fallback for development
4. Frontend gradually migrates to API-based loading
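
The API-first, static-fallback behaviour during the transition can be sketched as a small wrapper. This is an illustration in Python to match the loader; the frontend itself would express the same idea in TypeScript. The helper name and the two injected callables are hypothetical.

```python
from typing import Any, Callable, Dict

GeoJSON = Dict[str, Any]


def load_with_fallback(fetch_api: Callable[[], GeoJSON],
                       read_static: Callable[[], GeoJSON]) -> GeoJSON:
    """Return API data when available, otherwise the bundled static GeoJSON."""
    try:
        return fetch_api()
    except OSError:
        # urllib.error.URLError, ConnectionError and timeouts all derive from OSError
        return read_static()
```

Injecting both loaders keeps the fallback policy in one place, so the static branch can be dropped once the files are removed.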


@ -0,0 +1,622 @@
-- PostGIS International Boundary Schema for Heritage Custodian Service Areas
-- Migration: 002_postgis_boundaries.sql
-- Created: 2025-12-07
--
-- Stores administrative boundaries for computing heritage custodian service areas
-- Supports global coverage with multiple data sources (GADM, Natural Earth, OSM, CBS)
-- Enables spatial queries: point-in-polygon, intersection, containment
-- ============================================================================
-- Enable PostGIS Extension
-- ============================================================================
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS postgis_topology;
-- ============================================================================
-- Data Source Registry
-- ============================================================================
-- Track data sources for provenance and updates
CREATE TABLE IF NOT EXISTS boundary_data_sources (
id SERIAL PRIMARY KEY,
source_code VARCHAR(50) NOT NULL UNIQUE, -- e.g., "GADM", "CBS", "OSM", "HALC"
source_name VARCHAR(255) NOT NULL, -- Full name
source_url TEXT, -- Official website
license_type VARCHAR(100), -- e.g., "CC-BY-4.0", "ODbL"
license_url TEXT,
description TEXT,
coverage_scope VARCHAR(50), -- "GLOBAL", "EUROPE", "NETHERLANDS", etc.
last_updated DATE,
attribution_required TEXT, -- Required attribution text
metadata JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Insert known data sources
INSERT INTO boundary_data_sources (source_code, source_name, source_url, license_type, coverage_scope, description) VALUES
('GADM', 'Global Administrative Areas Database', 'https://gadm.org', 'GADM License (non-commercial)', 'GLOBAL', 'Global administrative boundaries at multiple levels (country, region, district)'),
('NATURAL_EARTH', 'Natural Earth', 'https://www.naturalearthdata.com', 'Public Domain', 'GLOBAL', 'Free vector and raster map data at 1:10m, 1:50m, and 1:110m scales'),
('OSM', 'OpenStreetMap', 'https://www.openstreetmap.org', 'ODbL', 'GLOBAL', 'Crowdsourced global geographic data'),
('CBS', 'CBS Wijken en Buurten', 'https://www.cbs.nl/nl-nl/dossier/nederland-regionaal/geografische-data', 'CC-BY-4.0', 'NETHERLANDS', 'Official Dutch municipality boundaries from CBS'),
('HALC', 'Historical Atlas of the Low Countries', NULL, 'Academic', 'LOW_COUNTRIES', 'Historical administrative boundaries for Netherlands, Belgium, Luxembourg'),
('EUROSTAT', 'Eurostat NUTS/LAU', 'https://ec.europa.eu/eurostat/web/gisco', 'Eurostat', 'EUROPE', 'European statistical regions (NUTS) and local administrative units (LAU)')
ON CONFLICT (source_code) DO NOTHING;
-- ============================================================================
-- Countries Table
-- ============================================================================
-- ISO 3166-1 countries with boundaries
CREATE TABLE IF NOT EXISTS boundary_countries (
id SERIAL PRIMARY KEY,
iso_a2 CHAR(2) NOT NULL, -- ISO 3166-1 alpha-2 (e.g., "NL", "DE")
iso_a3 CHAR(3) NOT NULL, -- ISO 3166-1 alpha-3 (e.g., "NLD", "DEU")
iso_n3 CHAR(3), -- ISO 3166-1 numeric (e.g., "528")
country_name VARCHAR(255) NOT NULL, -- English name
country_name_local VARCHAR(255), -- Native language name
-- Geometry
geom GEOMETRY(MULTIPOLYGON, 4326), -- WGS84 boundary polygon
centroid GEOMETRY(POINT, 4326), -- Computed centroid
bbox BOX2D, -- Bounding box for quick filtering
-- Metadata
source_id INTEGER REFERENCES boundary_data_sources(id),
source_feature_id VARCHAR(100), -- ID in source dataset
area_km2 NUMERIC, -- Computed area
-- Versioning
valid_from DATE,
valid_to DATE, -- NULL = current
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
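UNIQUE(iso_a2, valid_to) -- Distinguishes closed versions; NULLs compare as distinct, so this alone does not limit current rows
);
-- Enforce at most one current (valid_to IS NULL) version per country
CREATE UNIQUE INDEX IF NOT EXISTS uq_boundary_countries_current ON boundary_countries(iso_a2) WHERE valid_to IS NULL;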
CREATE INDEX IF NOT EXISTS idx_boundary_countries_iso_a2 ON boundary_countries(iso_a2);
CREATE INDEX IF NOT EXISTS idx_boundary_countries_geom ON boundary_countries USING GIST(geom);
CREATE INDEX IF NOT EXISTS idx_boundary_countries_centroid ON boundary_countries USING GIST(centroid);
CREATE INDEX IF NOT EXISTS idx_boundary_countries_valid ON boundary_countries(valid_from, valid_to);
-- ============================================================================
-- Administrative Level 1 (States/Provinces/Regions)
-- ============================================================================
-- ISO 3166-2 regions
CREATE TABLE IF NOT EXISTS boundary_admin1 (
id SERIAL PRIMARY KEY,
country_id INTEGER REFERENCES boundary_countries(id),
-- Identifiers
iso_3166_2 VARCHAR(10), -- e.g., "NL-NH", "DE-BY", "JP-13"
admin1_code VARCHAR(20) NOT NULL, -- Internal code (varies by source)
geonames_id INTEGER, -- GeoNames ID for linking
wikidata_id VARCHAR(20), -- Wikidata Q-number
-- Names
admin1_name VARCHAR(255) NOT NULL, -- English/standard name
admin1_name_local VARCHAR(255), -- Native language name
admin1_type VARCHAR(100), -- e.g., "Province", "State", "Prefecture", "Land"
-- Geometry
geom GEOMETRY(MULTIPOLYGON, 4326),
centroid GEOMETRY(POINT, 4326),
bbox BOX2D,
-- Metadata
source_id INTEGER REFERENCES boundary_data_sources(id),
source_feature_id VARCHAR(100),
area_km2 NUMERIC,
-- Versioning
valid_from DATE,
valid_to DATE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(country_id, admin1_code, valid_to)
);
CREATE INDEX IF NOT EXISTS idx_boundary_admin1_country ON boundary_admin1(country_id);
CREATE INDEX IF NOT EXISTS idx_boundary_admin1_iso ON boundary_admin1(iso_3166_2);
CREATE INDEX IF NOT EXISTS idx_boundary_admin1_geonames ON boundary_admin1(geonames_id);
CREATE INDEX IF NOT EXISTS idx_boundary_admin1_geom ON boundary_admin1 USING GIST(geom);
CREATE INDEX IF NOT EXISTS idx_boundary_admin1_centroid ON boundary_admin1 USING GIST(centroid);
-- ============================================================================
-- Administrative Level 2 (Districts/Counties/Municipalities)
-- ============================================================================
CREATE TABLE IF NOT EXISTS boundary_admin2 (
id SERIAL PRIMARY KEY,
admin1_id INTEGER REFERENCES boundary_admin1(id),
country_id INTEGER REFERENCES boundary_countries(id),
-- Identifiers
admin2_code VARCHAR(50) NOT NULL, -- Internal code
nuts_code VARCHAR(10), -- NUTS code (EU)
lau_code VARCHAR(20), -- LAU code (EU)
geonames_id INTEGER,
wikidata_id VARCHAR(20),
-- CBS-specific (Netherlands)
cbs_gemeente_code VARCHAR(10), -- Dutch municipality code (GM0363)
-- Names
admin2_name VARCHAR(255) NOT NULL,
admin2_name_local VARCHAR(255),
admin2_type VARCHAR(100), -- e.g., "Municipality", "District", "County"
-- Geometry
geom GEOMETRY(MULTIPOLYGON, 4326),
centroid GEOMETRY(POINT, 4326),
bbox BOX2D,
-- Metadata
source_id INTEGER REFERENCES boundary_data_sources(id),
source_feature_id VARCHAR(100),
area_km2 NUMERIC,
population INTEGER, -- If available from source
-- Versioning
valid_from DATE,
valid_to DATE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(admin1_id, admin2_code, valid_to)
);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_admin1 ON boundary_admin2(admin1_id);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_country ON boundary_admin2(country_id);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_geonames ON boundary_admin2(geonames_id);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_cbs ON boundary_admin2(cbs_gemeente_code);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_geom ON boundary_admin2 USING GIST(geom);
CREATE INDEX IF NOT EXISTS idx_boundary_admin2_centroid ON boundary_admin2 USING GIST(centroid);
-- ============================================================================
-- Historical Boundaries
-- ============================================================================
-- Pre-modern administrative boundaries (e.g., HALC 1500 data)
CREATE TABLE IF NOT EXISTS boundary_historical (
id SERIAL PRIMARY KEY,
-- Identifiers
historical_id VARCHAR(100) NOT NULL, -- Source-specific ID
halc_adm1_code VARCHAR(20), -- HALC ADM1 territory code (e.g., "VI")
halc_adm2_name VARCHAR(255), -- HALC ADM2 district name
wikidata_id VARCHAR(20),
-- Location context
modern_country_iso_a2 CHAR(2), -- Modern country this is within
-- Names
territory_name VARCHAR(255) NOT NULL,
territory_name_historical VARCHAR(255), -- Name in historical period
territory_type VARCHAR(100), -- e.g., "County", "Duchy", "Lordship"
-- Temporal extent
period_start SMALLINT, -- Year (e.g., 1500)
period_end SMALLINT, -- Year (e.g., 1795)
period_description VARCHAR(255), -- e.g., "Ancien Regime", "Medieval"
-- Geometry
geom GEOMETRY(MULTIPOLYGON, 4326),
centroid GEOMETRY(POINT, 4326),
bbox BOX2D,
-- Metadata
source_id INTEGER REFERENCES boundary_data_sources(id),
source_feature_id VARCHAR(100),
area_km2 NUMERIC,
certainty VARCHAR(50), -- "HIGH", "MEDIUM", "LOW", "APPROXIMATE"
notes TEXT,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(historical_id, source_id)
);
CREATE INDEX IF NOT EXISTS idx_boundary_historical_halc ON boundary_historical(halc_adm1_code);
CREATE INDEX IF NOT EXISTS idx_boundary_historical_country ON boundary_historical(modern_country_iso_a2);
CREATE INDEX IF NOT EXISTS idx_boundary_historical_period ON boundary_historical(period_start, period_end);
CREATE INDEX IF NOT EXISTS idx_boundary_historical_geom ON boundary_historical USING GIST(geom);
-- ============================================================================
-- Custodian Service Areas (Computed/Custom)
-- ============================================================================
-- Stores pre-computed or custom service area geometries for heritage custodians
CREATE TABLE IF NOT EXISTS custodian_service_areas (
id SERIAL PRIMARY KEY,
-- Custodian link (GHCID)
ghcid VARCHAR(100) NOT NULL, -- e.g., "NL-NH-HAA-A-NHA"
ghcid_uuid UUID, -- UUID v5 of GHCID
-- Service area identity
service_area_name VARCHAR(255) NOT NULL,
service_area_type VARCHAR(50) NOT NULL, -- From ServiceAreaTypeEnum
-- Geometry (can be computed union or custom-drawn)
geom GEOMETRY(MULTIPOLYGON, 4326),
centroid GEOMETRY(POINT, 4326),
bbox BOX2D,
-- Composition (which admin units make up this service area)
admin2_ids INTEGER[], -- Array of boundary_admin2.id
admin1_ids INTEGER[], -- Array of boundary_admin1.id
historical_ids INTEGER[], -- Array of boundary_historical.id
-- Properties
is_historical BOOLEAN DEFAULT FALSE,
is_custom BOOLEAN DEFAULT FALSE, -- True if hand-drawn, not computed from admin units
-- Temporal extent
valid_from DATE,
valid_to DATE, -- NULL = current
-- Display properties
display_fill_color VARCHAR(20) DEFAULT '#3498db',
display_fill_opacity NUMERIC(3,2) DEFAULT 0.20,
display_line_color VARCHAR(20) DEFAULT '#2980b9',
display_line_width NUMERIC(3,1) DEFAULT 2.0,
-- Metadata
source_dataset VARCHAR(100), -- Data provenance
notes TEXT,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(ghcid, valid_to)
);
CREATE INDEX IF NOT EXISTS idx_custodian_sa_ghcid ON custodian_service_areas(ghcid);
CREATE INDEX IF NOT EXISTS idx_custodian_sa_uuid ON custodian_service_areas(ghcid_uuid);
CREATE INDEX IF NOT EXISTS idx_custodian_sa_geom ON custodian_service_areas USING GIST(geom);
CREATE INDEX IF NOT EXISTS idx_custodian_sa_type ON custodian_service_areas(service_area_type);
CREATE INDEX IF NOT EXISTS idx_custodian_sa_historical ON custodian_service_areas(is_historical);
-- ============================================================================
-- GeoNames Settlements (for reverse geocoding and linking)
-- ============================================================================
-- Subset of GeoNames populated places for settlement lookups
CREATE TABLE IF NOT EXISTS geonames_settlements (
geonames_id INTEGER PRIMARY KEY,
-- Names
name VARCHAR(200) NOT NULL,
ascii_name VARCHAR(200),
alternate_names TEXT, -- Comma-separated
-- Location
latitude NUMERIC(10, 6) NOT NULL,
longitude NUMERIC(10, 6) NOT NULL,
geom GEOMETRY(POINT, 4326), -- Computed from lat/lon
-- Classification
feature_class CHAR(1), -- 'P' = populated place
feature_code VARCHAR(10), -- 'PPL', 'PPLA', 'PPLC', etc.
-- Administrative hierarchy
country_code CHAR(2),
admin1_code VARCHAR(20), -- GeoNames admin1
admin2_code VARCHAR(80),
admin3_code VARCHAR(20),
admin4_code VARCHAR(20),
-- Metadata
population INTEGER,
elevation INTEGER,
timezone VARCHAR(40),
modification_date DATE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_geonames_country ON geonames_settlements(country_code);
CREATE INDEX IF NOT EXISTS idx_geonames_admin1 ON geonames_settlements(country_code, admin1_code);
CREATE INDEX IF NOT EXISTS idx_geonames_feature ON geonames_settlements(feature_class, feature_code);
CREATE INDEX IF NOT EXISTS idx_geonames_geom ON geonames_settlements USING GIST(geom);
CREATE INDEX IF NOT EXISTS idx_geonames_name ON geonames_settlements(name);
CREATE INDEX IF NOT EXISTS idx_geonames_ascii ON geonames_settlements(ascii_name);
CREATE INDEX IF NOT EXISTS idx_geonames_population ON geonames_settlements(population DESC);
-- ============================================================================
-- Spatial Functions
-- ============================================================================
-- Function: Find admin units containing a point
CREATE OR REPLACE FUNCTION find_admin_for_point(
p_lon NUMERIC,
p_lat NUMERIC,
p_country_code CHAR(2) DEFAULT NULL
)
RETURNS TABLE (
admin_level INTEGER,
admin_code VARCHAR,
admin_name VARCHAR,
iso_code VARCHAR,
geonames_id INTEGER
) AS $$
BEGIN
RETURN QUERY
-- Admin Level 1
SELECT
1::INTEGER AS admin_level,
a.admin1_code,
a.admin1_name,
a.iso_3166_2,
a.geonames_id
FROM boundary_admin1 a
JOIN boundary_countries c ON a.country_id = c.id
WHERE ST_Contains(a.geom, ST_SetSRID(ST_Point(p_lon, p_lat), 4326))
AND a.valid_to IS NULL
AND (p_country_code IS NULL OR c.iso_a2 = p_country_code)
UNION ALL
-- Admin Level 2
SELECT
2::INTEGER,
a.admin2_code,
a.admin2_name,
a.nuts_code,
a.geonames_id
FROM boundary_admin2 a
JOIN boundary_countries c ON a.country_id = c.id
WHERE ST_Contains(a.geom, ST_SetSRID(ST_Point(p_lon, p_lat), 4326))
AND a.valid_to IS NULL
AND (p_country_code IS NULL OR c.iso_a2 = p_country_code)
ORDER BY admin_level;
END;
$$ LANGUAGE plpgsql;
-- Function: Find nearest settlement
CREATE OR REPLACE FUNCTION find_nearest_settlement(
p_lon NUMERIC,
p_lat NUMERIC,
p_country_code CHAR(2) DEFAULT NULL,
p_max_distance_km NUMERIC DEFAULT 50
)
RETURNS TABLE (
geonames_id INTEGER,
name VARCHAR,
feature_code VARCHAR,
population INTEGER,
distance_km NUMERIC
) AS $$
DECLARE
v_point GEOMETRY;
BEGIN
v_point := ST_SetSRID(ST_Point(p_lon, p_lat), 4326);
RETURN QUERY
SELECT
g.geonames_id,
g.name,
g.feature_code,
g.population,
ROUND((ST_Distance(g.geom::geography, v_point::geography) / 1000)::NUMERIC, 2) AS distance_km
FROM geonames_settlements g
WHERE g.feature_class = 'P'
AND g.feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
AND (p_country_code IS NULL OR g.country_code = p_country_code)
AND ST_DWithin(g.geom::geography, v_point::geography, p_max_distance_km * 1000)
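-- Rank by geodesic distance so the order matches distance_km (<-> on geometry ranks by planar degrees)
ORDER BY ST_Distance(g.geom::geography, v_point::geography)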
LIMIT 5;
END;
$$ LANGUAGE plpgsql;
-- Function: Compute service area from admin units
CREATE OR REPLACE FUNCTION compute_service_area_geometry(
p_admin2_ids INTEGER[],
p_admin1_ids INTEGER[] DEFAULT NULL
)
RETURNS GEOMETRY AS $$
DECLARE
v_geom GEOMETRY;
v_admin1_geom GEOMETRY;
BEGIN
-- Union all admin2 geometries
IF p_admin2_ids IS NOT NULL AND array_length(p_admin2_ids, 1) > 0 THEN
SELECT ST_Union(geom) INTO v_geom
FROM boundary_admin2
WHERE id = ANY(p_admin2_ids)
AND valid_to IS NULL;
END IF;
-- Add admin1 geometries if specified
IF p_admin1_ids IS NOT NULL AND array_length(p_admin1_ids, 1) > 0 THEN
SELECT ST_Union(geom) INTO v_admin1_geom
FROM boundary_admin1
WHERE id = ANY(p_admin1_ids)
AND valid_to IS NULL;
-- Merge the two unions without losing one when the other is NULL
v_geom := CASE
WHEN v_geom IS NULL THEN v_admin1_geom
WHEN v_admin1_geom IS NULL THEN v_geom
ELSE ST_Union(v_geom, v_admin1_geom)
END;
END IF;
-- ST_Union can return a POLYGON; promote it to match the MULTIPOLYGON geom columns
RETURN ST_Multi(v_geom);
END;
$$ LANGUAGE plpgsql;
-- Function: Get service area GeoJSON for a custodian
CREATE OR REPLACE FUNCTION get_custodian_service_area_geojson(
p_ghcid VARCHAR,
p_include_historical BOOLEAN DEFAULT FALSE
)
RETURNS JSONB AS $$
DECLARE
v_result JSONB;
BEGIN
SELECT jsonb_build_object(
'type', 'FeatureCollection',
'features', COALESCE(jsonb_agg(
jsonb_build_object(
'type', 'Feature',
'geometry', ST_AsGeoJSON(sa.geom)::jsonb,
'properties', jsonb_build_object(
'ghcid', sa.ghcid,
'name', sa.service_area_name,
'type', sa.service_area_type,
'is_historical', sa.is_historical,
'valid_from', sa.valid_from,
'valid_to', sa.valid_to,
'fill_color', sa.display_fill_color,
'fill_opacity', sa.display_fill_opacity,
'line_color', sa.display_line_color,
'line_width', sa.display_line_width
)
)
), '[]'::jsonb)
)
INTO v_result
FROM custodian_service_areas sa
WHERE sa.ghcid = p_ghcid
AND (p_include_historical OR sa.is_historical = FALSE);
RETURN v_result;
END;
$$ LANGUAGE plpgsql;
-- ============================================================================
-- Views for API Access
-- ============================================================================
-- Current admin1 boundaries
CREATE OR REPLACE VIEW v_current_admin1 AS
SELECT
a.id,
c.iso_a2 AS country_code,
a.iso_3166_2,
a.admin1_code,
a.admin1_name,
a.admin1_type,
a.geonames_id,
a.wikidata_id,
ST_AsGeoJSON(a.geom)::jsonb AS geometry,
ST_AsGeoJSON(a.centroid)::jsonb AS centroid,
a.area_km2,
s.source_code AS data_source
FROM boundary_admin1 a
JOIN boundary_countries c ON a.country_id = c.id
LEFT JOIN boundary_data_sources s ON a.source_id = s.id
WHERE a.valid_to IS NULL AND c.valid_to IS NULL;
-- Current admin2 boundaries
CREATE OR REPLACE VIEW v_current_admin2 AS
SELECT
a2.id,
c.iso_a2 AS country_code,
a1.iso_3166_2 AS admin1_iso,
a2.admin2_code,
a2.admin2_name,
a2.admin2_type,
a2.cbs_gemeente_code,
a2.nuts_code,
a2.lau_code,
a2.geonames_id,
a2.wikidata_id,
ST_AsGeoJSON(a2.geom)::jsonb AS geometry,
ST_AsGeoJSON(a2.centroid)::jsonb AS centroid,
a2.area_km2,
a2.population,
s.source_code AS data_source
FROM boundary_admin2 a2
JOIN boundary_admin1 a1 ON a2.admin1_id = a1.id
JOIN boundary_countries c ON a2.country_id = c.id
LEFT JOIN boundary_data_sources s ON a2.source_id = s.id
WHERE a2.valid_to IS NULL AND a1.valid_to IS NULL AND c.valid_to IS NULL;
-- Custodian service areas
CREATE OR REPLACE VIEW v_custodian_service_areas AS
SELECT
sa.ghcid,
sa.ghcid_uuid,
sa.service_area_name,
sa.service_area_type,
sa.is_historical,
sa.is_custom,
sa.valid_from,
sa.valid_to,
ST_AsGeoJSON(sa.geom)::jsonb AS geometry,
ST_AsGeoJSON(sa.centroid)::jsonb AS centroid,
sa.display_fill_color,
sa.display_fill_opacity,
sa.display_line_color,
sa.display_line_width,
sa.source_dataset,
sa.notes
FROM custodian_service_areas sa
WHERE sa.valid_to IS NULL;
-- ============================================================================
-- Statistics
-- ============================================================================
CREATE OR REPLACE VIEW v_boundary_stats AS
SELECT
'countries' AS table_name,
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE valid_to IS NULL) AS current_rows
FROM boundary_countries
UNION ALL
SELECT 'admin1', COUNT(*), COUNT(*) FILTER (WHERE valid_to IS NULL) FROM boundary_admin1
UNION ALL
SELECT 'admin2', COUNT(*), COUNT(*) FILTER (WHERE valid_to IS NULL) FROM boundary_admin2
UNION ALL
SELECT 'historical', COUNT(*), COUNT(*) FROM boundary_historical
UNION ALL
SELECT 'service_areas', COUNT(*), COUNT(*) FILTER (WHERE valid_to IS NULL) FROM custodian_service_areas
UNION ALL
SELECT 'geonames_settlements', COUNT(*), COUNT(*) FROM geonames_settlements;
-- Country coverage summary
CREATE OR REPLACE VIEW v_boundary_coverage AS
SELECT
c.iso_a2,
c.country_name,
COUNT(DISTINCT a1.id) AS admin1_count,
COUNT(DISTINCT a2.id) AS admin2_count,
(SELECT COUNT(*) FROM geonames_settlements g WHERE g.country_code = c.iso_a2) AS settlement_count,
(SELECT COUNT(*) FROM custodian_service_areas sa WHERE sa.ghcid LIKE c.iso_a2 || '-%') AS service_area_count
FROM boundary_countries c
LEFT JOIN boundary_admin1 a1 ON a1.country_id = c.id AND a1.valid_to IS NULL
LEFT JOIN boundary_admin2 a2 ON a2.country_id = c.id AND a2.valid_to IS NULL
WHERE c.valid_to IS NULL
GROUP BY c.id, c.iso_a2, c.country_name
ORDER BY c.country_name;
-- ============================================================================
-- Permissions
-- ============================================================================
GRANT SELECT ON ALL TABLES IN SCHEMA public TO glam_api;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO glam_api;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO glam_api;
-- ============================================================================
-- Table Comments
-- ============================================================================
COMMENT ON TABLE boundary_data_sources IS 'Registry of data sources for boundary data (GADM, CBS, OSM, etc.)';
COMMENT ON TABLE boundary_countries IS 'Country boundaries with ISO 3166-1 codes';
COMMENT ON TABLE boundary_admin1 IS 'First-level administrative divisions (states, provinces, regions)';
COMMENT ON TABLE boundary_admin2 IS 'Second-level administrative divisions (municipalities, districts, counties)';
COMMENT ON TABLE boundary_historical IS 'Historical administrative boundaries (e.g., HALC 1500 data)';
COMMENT ON TABLE custodian_service_areas IS 'Pre-computed or custom service area geometries for heritage custodians';
COMMENT ON TABLE geonames_settlements IS 'GeoNames populated places for settlement lookups';
COMMENT ON FUNCTION find_admin_for_point IS 'Find administrative units containing a geographic point';
COMMENT ON FUNCTION find_nearest_settlement IS 'Find nearest GeoNames settlement to a point';
COMMENT ON FUNCTION compute_service_area_geometry IS 'Compute union geometry from admin unit IDs';
COMMENT ON FUNCTION get_custodian_service_area_geojson IS 'Get GeoJSON for a custodian service area';


@ -0,0 +1,552 @@
#!/usr/bin/env python3
"""
Load International Administrative Boundaries into PostGIS

Supports multiple data sources:
- GADM (Global Administrative Areas) - Primary global source
- Natural Earth - Simplified global boundaries
- CBS (Netherlands) - Official Dutch municipality boundaries
- OSM Boundaries - OpenStreetMap extracts

Usage:
    python load_boundaries_postgis.py --source gadm --country NL
    python load_boundaries_postgis.py --source cbs
    python load_boundaries_postgis.py --source natural_earth --level 1
    python load_boundaries_postgis.py --countries JP,CZ,DE,BE --source gadm
"""
import argparse
import json
import logging
import os
import sys
from datetime import date
from pathlib import Path
from typing import Optional, List, Dict, Any
from urllib.request import urlretrieve
import zipfile
import tempfile
# Third-party imports
try:
    import psycopg2
    from psycopg2.extras import execute_values
    import geopandas as gpd
    from shapely import wkt
    from shapely.geometry import shape, mapping
except ImportError as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install psycopg2-binary geopandas shapely")
    sys.exit(1)
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# ============================================================================
# Configuration
# ============================================================================
# Database connection (from environment or defaults)
DB_HOST = os.getenv('POSTGRES_HOST', 'localhost')
DB_PORT = os.getenv('POSTGRES_PORT', '5432')
DB_NAME = os.getenv('POSTGRES_DB', 'glam')
DB_USER = os.getenv('POSTGRES_USER', 'glam_api')
DB_PASSWORD = os.getenv('POSTGRES_PASSWORD', '')
# Data source URLs
GADM_BASE_URL = "https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_{iso3}_0.json"
GADM_ADM1_URL = "https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_{iso3}_1.json"
GADM_ADM2_URL = "https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_{iso3}_2.json"
NATURAL_EARTH_URL = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
# ISO 3166-1 alpha-2 to alpha-3 mapping for common countries
ISO_A2_TO_A3 = {
'NL': 'NLD', 'DE': 'DEU', 'BE': 'BEL', 'FR': 'FRA', 'GB': 'GBR',
'JP': 'JPN', 'CZ': 'CZE', 'CH': 'CHE', 'AT': 'AUT', 'PL': 'POL',
'IT': 'ITA', 'ES': 'ESP', 'PT': 'PRT', 'DK': 'DNK', 'SE': 'SWE',
'NO': 'NOR', 'FI': 'FIN', 'IE': 'IRL', 'LU': 'LUX', 'BR': 'BRA',
'MX': 'MEX', 'AR': 'ARG', 'CL': 'CHL', 'CO': 'COL', 'EG': 'EGY',
'PS': 'PSE', 'IL': 'ISR', 'DZ': 'DZA', 'AU': 'AUS', 'NZ': 'NZL',
'KR': 'KOR', 'TW': 'TWN', 'CN': 'CHN', 'IN': 'IND', 'RU': 'RUS',
'UA': 'UKR', 'TR': 'TUR', 'GR': 'GRC', 'BG': 'BGR', 'RO': 'ROU',
'HU': 'HUN', 'SK': 'SVK', 'SI': 'SVN', 'HR': 'HRV', 'RS': 'SRB',
'US': 'USA', 'CA': 'CAN', 'ZA': 'ZAF',
}
# ============================================================================
# Database Functions
# ============================================================================
def get_db_connection():
    """Get PostgreSQL connection with PostGIS."""
    return psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD
    )


def get_source_id(conn, source_code: str) -> int:
    """Look up a data source ID; raise ValueError if the source is not registered."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM boundary_data_sources WHERE source_code = %s",
            (source_code,)
        )
        row = cur.fetchone()
        if row:
            return row[0]
        # Source not found - this shouldn't happen if schema is properly initialized
        raise ValueError(f"Unknown data source: {source_code}")


def get_or_create_country(conn, iso_a2: str, iso_a3: str, name: str,
                          geom_wkt: str, source_id: int) -> int:
    """Get or create country record, return ID.

    The insert is not committed here; the caller is responsible for committing.
    """
    with conn.cursor() as cur:
        # Check if a current (valid_to IS NULL) record exists
        cur.execute(
            """SELECT id FROM boundary_countries
            WHERE iso_a2 = %s AND valid_to IS NULL""",
            (iso_a2,)
        )
        row = cur.fetchone()
        if row:
            return row[0]

        # Insert new
        cur.execute(
            """INSERT INTO boundary_countries
            (iso_a2, iso_a3, country_name, geom, centroid, source_id, valid_from)
            VALUES (%s, %s, %s, ST_GeomFromText(%s, 4326),
            ST_Centroid(ST_GeomFromText(%s, 4326)), %s, CURRENT_DATE)
            RETURNING id""",
            (iso_a2, iso_a3, name, geom_wkt, geom_wkt, source_id)
        )
        return cur.fetchone()[0]
# ============================================================================
# GADM Loader
# ============================================================================
def download_gadm(iso_a3: str, level: int, cache_dir: Path) -> Optional[Path]:
    """Download GADM GeoJSON for a country and admin level."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    if level == 0:
        url = GADM_BASE_URL.format(iso3=iso_a3)
    elif level == 1:
        url = GADM_ADM1_URL.format(iso3=iso_a3)
    elif level == 2:
        url = GADM_ADM2_URL.format(iso3=iso_a3)
    else:
        logger.error(f"Unsupported GADM level: {level}")
        return None

    filename = f"gadm41_{iso_a3}_{level}.json"
    filepath = cache_dir / filename

    if filepath.exists():
        logger.info(f"Using cached: {filepath}")
        return filepath

    logger.info(f"Downloading: {url}")
    try:
        urlretrieve(url, filepath)
        return filepath
    except Exception as e:
        logger.error(f"Failed to download {url}: {e}")
        # Remove any partial file so a later run does not treat it as cached
        if filepath.exists():
            filepath.unlink()
        return None


def load_gadm_country(conn, iso_a2: str, cache_dir: Path) -> Dict[str, int]:
    """Load GADM data for a country (levels 0, 1, 2)."""
    iso_a3 = ISO_A2_TO_A3.get(iso_a2)
    if not iso_a3:
        logger.error(f"Unknown country code: {iso_a2}")
        return {'admin0': 0, 'admin1': 0, 'admin2': 0}

    source_id = get_source_id(conn, 'GADM')
    counts = {'admin0': 0, 'admin1': 0, 'admin2': 0}

    # Level 0 - Country boundary
    level0_file = download_gadm(iso_a3, 0, cache_dir)
    if level0_file:
        gdf = gpd.read_file(level0_file)
        for _, row in gdf.iterrows():
            geom_wkt = row.geometry.wkt
            country_name = row.get('COUNTRY') or row.get('NAME_0') or iso_a2
            country_id = get_or_create_country(
                conn, iso_a2, iso_a3,
                str(country_name),
                geom_wkt, source_id
            )
            counts['admin0'] = 1
            logger.info(f"Loaded country: {iso_a2} (id={country_id})")
        # Commit now so the country row persists even if no admin1/admin2 file loads
        conn.commit()

    # Level 1 - Admin1 (states/provinces)
    level1_file = download_gadm(iso_a3, 1, cache_dir)
    if level1_file:
        gdf = gpd.read_file(level1_file)
        with conn.cursor() as cur:
            # Get country ID
            cur.execute(
                "SELECT id FROM boundary_countries WHERE iso_a2 = %s AND valid_to IS NULL",
                (iso_a2,)
            )
            country_row = cur.fetchone()
            if not country_row:
                logger.warning(f"Country {iso_a2} not found, skipping admin1")
            else:
                country_id = country_row[0]
                for _, row in gdf.iterrows():
                    admin1_code = row.get('GID_1', row.get('HASC_1', ''))
                    admin1_name = row.get('NAME_1', '')
                    admin1_type = row.get('TYPE_1', 'Region')
                    geom_wkt = row.geometry.wkt

                    # Derive a provisional ISO 3166-2-style code from the GADM GID
                    # (e.g. 'NLD.1_1' -> 'NL-1_1'). This is a placeholder and is not
                    # guaranteed to match the official ISO 3166-2 code.
                    iso_3166_2 = f"{iso_a2}-{admin1_code.split('.')[-1][:3].upper()}" if admin1_code else None

                    # NOTE: valid_to is NULL for current rows. With PostgreSQL's default
                    # NULL handling such rows never conflict, so this acts as an upsert
                    # only if the schema's unique constraint is NULLS NOT DISTINCT.
                    cur.execute(
                        """INSERT INTO boundary_admin1
                        (country_id, iso_3166_2, admin1_code, admin1_name, admin1_type,
                        geom, centroid, source_id, source_feature_id, valid_from)
                        VALUES (%s, %s, %s, %s, %s,
                        ST_GeomFromText(%s, 4326), ST_Centroid(ST_GeomFromText(%s, 4326)),
                        %s, %s, CURRENT_DATE)
                        ON CONFLICT (country_id, admin1_code, valid_to) DO UPDATE
                        SET admin1_name = EXCLUDED.admin1_name,
                        geom = EXCLUDED.geom,
                        centroid = EXCLUDED.centroid,
                        updated_at = NOW()""",
                        (country_id, iso_3166_2, admin1_code, admin1_name, admin1_type,
                         geom_wkt, geom_wkt, source_id, row.get('GID_1'))
                    )
                    counts['admin1'] += 1
                conn.commit()
        logger.info(f"Loaded {counts['admin1']} admin1 regions for {iso_a2}")

    # Level 2 - Admin2 (municipalities/districts)
    level2_file = download_gadm(iso_a3, 2, cache_dir)
    if level2_file:
        gdf = gpd.read_file(level2_file)
        with conn.cursor() as cur:
            # Get country ID
            cur.execute(
                "SELECT id FROM boundary_countries WHERE iso_a2 = %s AND valid_to IS NULL",
                (iso_a2,)
            )
            country_row = cur.fetchone()
            if not country_row:
                logger.warning(f"Country {iso_a2} not found, skipping admin2")
            else:
                country_id = country_row[0]
                for _, row in gdf.iterrows():
                    admin1_code = row.get('GID_1', '')
                    admin2_code = row.get('GID_2', row.get('HASC_2', ''))
                    admin2_name = row.get('NAME_2', '')
                    admin2_type = row.get('TYPE_2', 'Municipality')
                    geom_wkt = row.geometry.wkt

                    # Find the parent admin1 ID (stays NULL if admin1 was not loaded)
                    cur.execute(
                        """SELECT id FROM boundary_admin1
                        WHERE country_id = %s AND admin1_code = %s AND valid_to IS NULL""",
                        (country_id, admin1_code)
                    )
                    admin1_row = cur.fetchone()
                    admin1_id = admin1_row[0] if admin1_row else None

                    cur.execute(
                        """INSERT INTO boundary_admin2
                        (admin1_id, country_id, admin2_code, admin2_name, admin2_type,
                        geom, centroid, source_id, source_feature_id, valid_from)
                        VALUES (%s, %s, %s, %s, %s,
                        ST_GeomFromText(%s, 4326), ST_Centroid(ST_GeomFromText(%s, 4326)),
                        %s, %s, CURRENT_DATE)
                        ON CONFLICT (admin1_id, admin2_code, valid_to) DO UPDATE
                        SET admin2_name = EXCLUDED.admin2_name,
                        geom = EXCLUDED.geom,
                        centroid = EXCLUDED.centroid,
                        updated_at = NOW()""",
                        (admin1_id, country_id, admin2_code, admin2_name, admin2_type,
                         geom_wkt, geom_wkt, source_id, row.get('GID_2'))
                    )
                    counts['admin2'] += 1
                conn.commit()
        logger.info(f"Loaded {counts['admin2']} admin2 regions for {iso_a2}")

    return counts
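

# ----------------------------------------------------------------------------
# Sketch (not wired into the loader): the admin1 loop in load_gadm_country
# builds iso_3166_2 from the GADM GID, which yields values such as 'NL-1_1'.
# The hypothetical helper below prefers the HASC code when one is present,
# on the ASSUMPTION that HASC_1 values look like 'NL.DR'. HASC and
# ISO 3166-2 often agree but are not guaranteed to, so results should be
# checked against an authoritative list before being relied on.
# ----------------------------------------------------------------------------
from typing import Optional


def provisional_iso_3166_2(iso_a2: str, gid_1: str = '', hasc_1: str = '') -> Optional[str]:
    """Return a best-effort ISO 3166-2-style code, or None if nothing usable."""
    # Prefer HASC ('NL.DR' -> 'NL-DR') when it has the expected two-part shape
    if isinstance(hasc_1, str) and hasc_1.count('.') == 1:
        prefix, suffix = hasc_1.split('.')
        if prefix.upper() == iso_a2.upper() and suffix:
            return f"{iso_a2.upper()}-{suffix.upper()}"
    # Fall back to the loader's GID-based placeholder ('NLD.1_1' -> 'NL-1_1')
    if isinstance(gid_1, str) and gid_1:
        return f"{iso_a2.upper()}-{gid_1.split('.')[-1][:3].upper()}"
    return None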
# ============================================================================
# CBS Netherlands Loader
# ============================================================================
def load_cbs_netherlands(conn, geojson_path: Path) -> Dict[str, int]:
    """Load CBS municipality boundaries for Netherlands."""
    source_id = get_source_id(conn, 'CBS')
    counts = {'admin2': 0}

    if not geojson_path.exists():
        logger.error(f"CBS GeoJSON not found: {geojson_path}")
        return counts

    gdf = gpd.read_file(geojson_path)
    logger.info(f"Loaded {len(gdf)} features from CBS data")

    with conn.cursor() as cur:
        # Ensure Netherlands country exists
        cur.execute(
            "SELECT id FROM boundary_countries WHERE iso_a2 = 'NL' AND valid_to IS NULL"
        )
        country_row = cur.fetchone()
        if not country_row:
            logger.warning("Netherlands not found in countries table, creating...")
            # Created without a geometry; load GADM level 0 to add the boundary
            cur.execute(
                """INSERT INTO boundary_countries (iso_a2, iso_a3, country_name, valid_from, source_id)
                VALUES ('NL', 'NLD', 'Netherlands', CURRENT_DATE, %s)
                RETURNING id""",
                (source_id,)
            )
            country_id = cur.fetchone()[0]
        else:
            country_id = country_row[0]

        for _, row in gdf.iterrows():
            # CBS field names vary - try common patterns
            gemeente_code = row.get('gemeentecode', row.get('statcode', row.get('GM_CODE', '')))
            gemeente_name = row.get('gemeentenaam', row.get('statnaam', row.get('GM_NAAM', '')))
            if not gemeente_name:
                continue

            geom_wkt = row.geometry.wkt

            # Determine admin1 (province) from gemeente code or name
            # CBS codes: GM0363 = Amsterdam (NH), GM0518 = 's-Gravenhage (ZH)
            # For now, leave admin1_id as NULL - can be enriched later
            # NOTE: with admin1_id and valid_to both NULL, PostgreSQL's default NULL
            # handling means the ON CONFLICT clause never matches, so re-running this
            # loader inserts duplicates unless the constraint is NULLS NOT DISTINCT.
            cur.execute(
                """INSERT INTO boundary_admin2
                (admin1_id, country_id, admin2_code, cbs_gemeente_code, admin2_name,
                admin2_type, geom, centroid, source_id, valid_from)
                VALUES (NULL, %s, %s, %s, %s, 'Municipality',
                ST_GeomFromText(%s, 4326), ST_Centroid(ST_GeomFromText(%s, 4326)),
                %s, CURRENT_DATE)
                ON CONFLICT (admin1_id, admin2_code, valid_to) DO UPDATE
                SET admin2_name = EXCLUDED.admin2_name,
                cbs_gemeente_code = EXCLUDED.cbs_gemeente_code,
                geom = EXCLUDED.geom,
                centroid = EXCLUDED.centroid,
                updated_at = NOW()""",
                (country_id, gemeente_code, gemeente_code, gemeente_name,
                 geom_wkt, geom_wkt, source_id)
            )
            counts['admin2'] += 1

        conn.commit()

    logger.info(f"Loaded {counts['admin2']} CBS municipalities for Netherlands")
    return counts
# ============================================================================
# GeoNames Settlements Loader
# ============================================================================
def load_geonames_settlements(conn, tsv_path: Path, country_codes: Optional[List[str]] = None) -> int:
    """Load GeoNames settlements from allCountries.txt or country-specific file."""
    if not tsv_path.exists():
        logger.error(f"GeoNames file not found: {tsv_path}")
        return 0

    count = 0
    batch = []
    batch_size = 5000

    logger.info(f"Loading GeoNames settlements from {tsv_path}")
    with open(tsv_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip only the newline: strip() would also eat tabs that delimit
            # empty leading or trailing columns
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 19:
                continue

            geonames_id = int(parts[0])
            name = parts[1]
            ascii_name = parts[2]
            alternate_names = parts[3]
            latitude = float(parts[4])
            longitude = float(parts[5])
            feature_class = parts[6]
            feature_code = parts[7]
            country_code = parts[8]
            admin1_code = parts[10]
            admin2_code = parts[11]
            admin3_code = parts[12]
            admin4_code = parts[13]
            population = int(parts[14]) if parts[14] else 0
            elevation = int(parts[15]) if parts[15] else None
            timezone = parts[17]
            modification_date = parts[18]

            # Filter by country if specified
            if country_codes and country_code not in country_codes:
                continue

            # Only load populated places (P class)
            if feature_class != 'P':
                continue

            # Filter to main settlement types
            if feature_code not in ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG'):
                continue

            batch.append((
                geonames_id, name, ascii_name, alternate_names,
                latitude, longitude,
                feature_class, feature_code,
                country_code, admin1_code, admin2_code, admin3_code, admin4_code,
                population, elevation, timezone, modification_date
            ))

            if len(batch) >= batch_size:
                _insert_geonames_batch(conn, batch)
                count += len(batch)
                logger.info(f"Loaded {count} settlements...")
                batch = []

    # Final batch
    if batch:
        _insert_geonames_batch(conn, batch)
        count += len(batch)

    logger.info(f"Loaded {count} GeoNames settlements total")
    return count


def _insert_geonames_batch(conn, batch: List[tuple]):
    """Insert batch of GeoNames settlements and commit."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            """INSERT INTO geonames_settlements
            (geonames_id, name, ascii_name, alternate_names,
            latitude, longitude, geom,
            feature_class, feature_code,
            country_code, admin1_code, admin2_code, admin3_code, admin4_code,
            population, elevation, timezone, modification_date)
            VALUES %s
            ON CONFLICT (geonames_id) DO UPDATE
            SET name = EXCLUDED.name,
            population = EXCLUDED.population,
            modification_date = EXCLUDED.modification_date""",
            # EWKT points are written as (longitude latitude), i.e. (x y)
            [(
                r[0], r[1], r[2], r[3],
                r[4], r[5], f"SRID=4326;POINT({r[5]} {r[4]})",
                r[6], r[7],
                r[8], r[9], r[10], r[11], r[12],
                r[13], r[14], r[15], r[16]
            ) for r in batch],
            template="""(%s, %s, %s, %s, %s, %s, ST_GeomFromEWKT(%s), %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        )
    conn.commit()
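

# ----------------------------------------------------------------------------
# Sketch (not used by the loader): the column positions read in
# load_geonames_settlements, gathered into one hypothetical helper so the
# 19-column tab-separated layout is easier to see and to unit-test without a
# database. The function name and dict keys are illustrative assumptions;
# the field positions mirror the loader.
# ----------------------------------------------------------------------------
from typing import Any, Dict, Optional


def parse_geonames_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one GeoNames TSV line into a dict, or return None if it is too short."""
    parts = line.rstrip('\n').split('\t')
    if len(parts) < 19:
        return None
    return {
        'geonames_id': int(parts[0]),
        'name': parts[1],
        'latitude': float(parts[4]),
        'longitude': float(parts[5]),
        'feature_class': parts[6],
        'feature_code': parts[7],
        'country_code': parts[8],
        'population': int(parts[14]) if parts[14] else 0,
        'elevation': int(parts[15]) if parts[15] else None,
        'timezone': parts[17],
        'modification_date': parts[18],
    }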
# ============================================================================
# Main
# ============================================================================
def main():
    parser = argparse.ArgumentParser(description='Load administrative boundaries into PostGIS')
    parser.add_argument('--source', choices=['gadm', 'cbs', 'natural_earth', 'geonames'],
                        required=True, help='Data source to load')
    parser.add_argument('--country', type=str, help='Single country ISO-2 code (e.g., NL)')
    parser.add_argument('--countries', type=str, help='Comma-separated country codes (e.g., NL,DE,BE)')
    parser.add_argument('--file', type=str, help='Path to data file (for CBS, GeoNames)')
    parser.add_argument('--cache-dir', type=str, default='./data/boundary_cache',
                        help='Directory for caching downloaded files')
    parser.add_argument('--dry-run', action='store_true', help='Show what would be done')
    args = parser.parse_args()

    # Determine countries to load
    countries = []
    if args.country:
        countries = [args.country.upper()]
    elif args.countries:
        countries = [c.strip().upper() for c in args.countries.split(',')]

    cache_dir = Path(args.cache_dir)

    if args.dry_run:
        logger.info(f"DRY RUN - Source: {args.source}, Countries: {countries or 'all'}")
        return

    # Connect to database
    try:
        conn = get_db_connection()
        logger.info(f"Connected to PostgreSQL: {DB_HOST}:{DB_PORT}/{DB_NAME}")
    except Exception as e:
        logger.error(f"Failed to connect to database: {e}")
        sys.exit(1)

    try:
        if args.source == 'gadm':
            if not countries:
                logger.error("--country or --countries required for GADM source")
                sys.exit(1)
            total_counts = {'admin0': 0, 'admin1': 0, 'admin2': 0}
            for iso_a2 in countries:
                logger.info(f"Loading GADM data for {iso_a2}...")
                counts = load_gadm_country(conn, iso_a2, cache_dir)
                for k, v in counts.items():
                    total_counts[k] += v
            logger.info(f"GADM load complete: {total_counts}")

        elif args.source == 'cbs':
            if not args.file:
                # Default path
                default_path = Path('frontend/public/data/netherlands_municipalities.geojson')
                if default_path.exists():
                    args.file = str(default_path)
                else:
                    logger.error("--file required for CBS source (or place file at frontend/public/data/netherlands_municipalities.geojson)")
                    sys.exit(1)
            counts = load_cbs_netherlands(conn, Path(args.file))
            logger.info(f"CBS load complete: {counts}")

        elif args.source == 'geonames':
            if not args.file:
                logger.error("--file required for GeoNames source (path to allCountries.txt or country file)")
                sys.exit(1)
            count = load_geonames_settlements(conn, Path(args.file), countries or None)
            logger.info(f"GeoNames load complete: {count} settlements")

        elif args.source == 'natural_earth':
            logger.error("Natural Earth loader not yet implemented")
            sys.exit(1)
    finally:
        # Runs on sys.exit() as well, so the connection is always closed
        conn.close()


if __name__ == '__main__':
    main()