# GLAM Infrastructure as Code
This directory contains Terraform configuration for deploying the GLAM Heritage Custodian Ontology infrastructure to Hetzner Cloud.
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Hetzner Cloud                                               │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ CX21 Server (2 vCPU, 4GB RAM)                         │  │
│  │                                                       │  │
│  │  ┌──────────────────┐     ┌────────────────────────┐  │  │
│  │  │ Caddy            │     │ Oxigraph               │  │  │
│  │  │ (Reverse Proxy)  │────▶│ (SPARQL Triplestore)   │  │  │
│  │  │ :443, :80        │     │ :7878                  │  │  │
│  │  └────────┬─────────┘     └───────────┬────────────┘  │  │
│  │           │                           │               │  │
│  │           │               ┌───────────▼────────────┐  │  │
│  │           │               │ /mnt/data/oxigraph     │  │  │
│  │           │               │ (Persistent Storage)   │  │  │
│  │           │               └────────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼────────────────────────────────────────┐  │  │
│  │  │ Qdrant (Docker)                                 │  │  │
│  │  │ Vector Database for RAG                         │  │  │
│  │  │ :6333 (REST), :6334 (gRPC)                      │  │  │
│  │  └────────┬────────────────────────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼────────────────────────────────────────┐  │  │
│  │  │ Frontend (Static Files)                         │  │  │
│  │  │ /var/www/glam-frontend                          │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Hetzner Volume (50GB SSD)                             │  │
│  │ /mnt/data                                             │  │
│  │  - Oxigraph database files                            │  │
│  │  - Qdrant storage & snapshots                         │  │
│  │  - Ontology files (.ttl, .rdf, .owl)                  │  │
│  │  - LinkML schemas (.yaml)                             │  │
│  │  - UML diagrams (.mmd)                                │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
## Components
- **Oxigraph**: High-performance SPARQL triplestore for RDF data
- **Qdrant**: Vector database for semantic search and RAG (Retrieval-Augmented Generation)
- **Caddy**: Modern web server with automatic HTTPS
- **Hetzner Cloud Server**: CX21 (2 vCPU, 4GB RAM, 40GB SSD)
- **Hetzner Volume**: 50GB persistent storage for data
## Prerequisites
1. [Terraform](https://www.terraform.io/downloads) (>= 1.0)
2. [Hetzner Cloud Account](https://console.hetzner.cloud/)
3. Hetzner API Token
## Setup
### 1. Create Hetzner API Token
1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/)
2. Select your project (or create a new one)
3. Go to **Security** > **API Tokens**
4. Generate a new token with **Read & Write** permissions
5. Copy the token (it's only shown once!)
### 2. Configure Terraform
```bash
cd infrastructure/terraform
# Copy the example variables file
cp terraform.tfvars.example terraform.tfvars
# Edit with your values
nano terraform.tfvars
```
Set the following variables:
- `hcloud_token`: Your Hetzner API token
- `domain`: Your domain name (e.g., `sparql.glam-ontology.org`)
- `ssh_public_key`: Path to your SSH public key
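A filled-in `terraform.tfvars` might look like the sketch below. The values are placeholders, not real credentials; keep this file out of version control, since it contains your API token.

```hcl
# terraform.tfvars — example values only
hcloud_token   = "hcloud-api-token-goes-here"   # from Console > Security > API Tokens
domain         = "sparql.glam-ontology.org"
ssh_public_key = "~/.ssh/id_ed25519.pub"
```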
### 3. Initialize and Apply
```bash
# Initialize Terraform
terraform init
# Preview changes
terraform plan
# Apply configuration
terraform apply
```
### 4. Deploy Data
After the infrastructure is created, deploy your ontology data:
```bash
# SSH into the server
ssh root@$(terraform output -raw server_ip)
# Load ontology files into Oxigraph
cd /var/lib/glam/scripts
./load-ontologies.sh
```
## Configuration Files
| File | Description |
|------|-------------|
| `main.tf` | Main infrastructure resources |
| `variables.tf` | Input variables |
| `outputs.tf` | Output values |
| `cloud-init.yaml` | Server initialization script |
| `terraform.tfvars.example` | Example variable values |
## Costs
Estimated monthly costs (EUR):
| Resource | Cost |
|----------|------|
| CX21 Server | ~€5.95/month |
| 50GB Volume | ~€2.50/month |
| IPv4 Address | Included |
| **Total** | ~€8.45/month |
## Maintenance
### Backup Data
```bash
# Backup Oxigraph database
ssh root@$SERVER_IP "tar -czf /tmp/oxigraph-backup.tar.gz /mnt/data/oxigraph"
scp root@$SERVER_IP:/tmp/oxigraph-backup.tar.gz ./backups/
```
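The one-off backup above can be wrapped in a small rotation helper. The script below is a hypothetical sketch, not part of the repo; the function name, retention count, and demo paths are illustrative.

```shell
# Hypothetical backup helper: archives a data directory under a timestamped
# name and keeps only the 7 newest archives.
backup_oxigraph() {
  data_dir=$1
  backup_dir=$2
  mkdir -p "$backup_dir"
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$backup_dir/oxigraph-$stamp.tar.gz"
  # -C so the archive contains only the directory name, not the full path
  tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
  # prune: keep the 7 newest archives
  ls -1t "$backup_dir"/oxigraph-*.tar.gz 2>/dev/null | tail -n +8 | xargs -r rm -f
  echo "$archive"
}

# Demo against a throwaway directory; on the server you would pass /mnt/data/oxigraph
demo=$(mktemp -d)
mkdir -p "$demo/oxigraph"
echo "triples" > "$demo/oxigraph/data.db"
archive=$(backup_oxigraph "$demo/oxigraph" "$demo/backups")
echo "Backup written to $archive"
```

Run it from cron (or a systemd timer) on the server to get regular backups instead of ad-hoc ones.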
### Update Ontologies
```bash
# Copy new ontology files
scp -r schemas/20251121/rdf/*.ttl root@$SERVER_IP:/mnt/data/ontologies/
# Reload into Oxigraph
ssh root@$SERVER_IP "/var/lib/glam/scripts/load-ontologies.sh"
```
### View Logs
```bash
ssh root@$SERVER_IP "journalctl -u oxigraph -f"
ssh root@$SERVER_IP "journalctl -u caddy -f"
```
## CI/CD with GitHub Actions
The project includes automatic deployment via GitHub Actions. Changes to infrastructure, data, or frontend trigger automatic deployments.
### Automatic Triggers
Deployments are triggered on push to `main` branch when files change in:
- `infrastructure/**` → Infrastructure deployment (Terraform)
- `schemas/20251121/rdf/**` → Data sync
- `schemas/20251121/linkml/**` → Data sync
- `data/ontology/**` → Data sync
- `frontend/**` → Frontend build & deploy
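In a GitHub Actions workflow file, the path filters above correspond to an `on.push.paths` block roughly like this (an illustrative excerpt; the actual workflow under `.github/workflows/` may differ):

```yaml
# .github/workflows/deploy.yml (excerpt, illustrative)
on:
  push:
    branches: [main]
    paths:
      - "infrastructure/**"
      - "schemas/20251121/rdf/**"
      - "schemas/20251121/linkml/**"
      - "data/ontology/**"
      - "frontend/**"
  workflow_dispatch:   # enables manual runs from the Actions tab
```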
### Setup CI/CD
Run the setup script to generate SSH keys and get instructions:
```bash
./setup-cicd.sh
```
This will:
1. Generate an ED25519 SSH key pair for deployments
2. Create a `.env` file template (if missing)
3. Print instructions for GitHub repository setup
### Required GitHub Secrets
Go to **Repository → Settings → Secrets and variables → Actions → Secrets**:
| Secret | Description |
|--------|-------------|
| `HETZNER_HC_API_TOKEN` | Hetzner Cloud API token (same as in `.env`) |
| `DEPLOY_SSH_PRIVATE_KEY` | Content of `.ssh/glam_deploy_key` (generated by setup script) |
| `TF_API_TOKEN` | (Optional) Terraform Cloud token for remote state |
### Required GitHub Variables
Go to **Repository → Settings → Secrets and variables → Actions → Variables**:
| Variable | Example |
|----------|---------|
| `GLAM_DOMAIN` | `sparql.glam-ontology.org` |
| `ADMIN_EMAIL` | `admin@example.org` |
### Manual Deployment
You can manually trigger deployments from the GitHub Actions tab with options:
- **Deploy infrastructure changes** - Run Terraform apply
- **Deploy ontology/schema data** - Sync data files to server
- **Deploy frontend build** - Build and deploy frontend
- **Reload data into Oxigraph** - Reimport all RDF data
### Local Deployment
For local deployments without GitHub Actions:
```bash
# Full deployment
./deploy.sh --all
# Just infrastructure
./deploy.sh --infra
# Just data
./deploy.sh --data
# Just frontend
./deploy.sh --frontend
# Just Qdrant vector database
./deploy.sh --qdrant
# Reload Oxigraph
./deploy.sh --reload
```
## Qdrant Vector Database
Qdrant is a self-hosted vector database used for semantic search over heritage institution data. It enables RAG (Retrieval-Augmented Generation) in the DSPy SPARQL generation module.
### Configuration
- **REST API**: `http://localhost:6333` (internal)
- **gRPC API**: `localhost:6334` (internal; gRPC, so no `http://` scheme)
- **External Access**: `https://<domain>/qdrant/` (via Caddy reverse proxy)
- **Storage**: `/mnt/data/qdrant/storage/`
- **Snapshots**: `/mnt/data/qdrant/snapshots/`
- **Resource Limits**: 2GB RAM, 2 CPUs
### Deployment
```bash
# Deploy/restart Qdrant
./deploy.sh --qdrant
```
### Health Check
```bash
# Check Qdrant health
ssh root@$SERVER_IP "curl http://localhost:6333/health"
# List collections
ssh root@$SERVER_IP "curl http://localhost:6333/collections"
```
### DSPy Integration
The Qdrant retriever is integrated with DSPy for RAG-enhanced SPARQL query generation:
```python
from glam_extractor.api.qdrant_retriever import HeritageCustodianRetriever

# Create retriever
retriever = HeritageCustodianRetriever(
    host="localhost",
    port=6333,
)

# Search for relevant institutions
results = retriever("museums in Amsterdam", k=5)

# Add institution to index
retriever.add_institution(
    name="Rijksmuseum",
    description="National museum of arts and history",
    institution_type="MUSEUM",
    country="NL",
    city="Amsterdam",
)
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `QDRANT_HOST` | `localhost` | Qdrant server hostname |
| `QDRANT_PORT` | `6333` | Qdrant REST API port |
| `QDRANT_ENABLED` | `true` | Enable/disable Qdrant integration |
| `OPENAI_API_KEY` | - | Required for embedding generation |
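A minimal sketch of how these variables might be read on the Python side; the helper name is hypothetical, and the defaults match the table above:

```python
import os

def qdrant_settings(env=None):
    """Read Qdrant connection settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("QDRANT_HOST", "localhost"),
        "port": int(env.get("QDRANT_PORT", "6333")),
        "enabled": env.get("QDRANT_ENABLED", "true").lower() == "true",
    }

# With no overrides, the documented defaults apply
print(qdrant_settings({}))  # {'host': 'localhost', 'port': 6333, 'enabled': True}
```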
## Destroy Infrastructure
```bash
terraform destroy
```
**Warning**: This will delete all data. Make sure to back up first!