# GLAM Infrastructure as Code
This directory contains Terraform configuration for deploying the GLAM Heritage Custodian Ontology infrastructure to Hetzner Cloud.
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Hetzner Cloud                                               │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ CX21 Server (2 vCPU, 4GB RAM)                         │  │
│  │                                                       │  │
│  │  ┌──────────────────┐     ┌────────────────────────┐  │  │
│  │  │ Caddy            │     │ Oxigraph               │  │  │
│  │  │ (Reverse Proxy)  │────▶│ (SPARQL Triplestore)   │  │  │
│  │  │ :443, :80        │     │ :7878                  │  │  │
│  │  └────────┬─────────┘     └───────────┬────────────┘  │  │
│  │           │                           │               │  │
│  │           │               ┌───────────▼────────────┐  │  │
│  │           │               │ /mnt/data/oxigraph     │  │  │
│  │           │               │ (Persistent Storage)   │  │  │
│  │           │               └────────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼────────────────────────────────────────┐  │  │
│  │  │ Qdrant (Docker)                                 │  │  │
│  │  │ Vector Database for RAG                         │  │  │
│  │  │ :6333 (REST), :6334 (gRPC)                      │  │  │
│  │  └────────┬────────────────────────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼────────────────────────────────────────┐  │  │
│  │  │ Frontend (Static Files)                         │  │  │
│  │  │ /var/www/glam-frontend                          │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Hetzner Volume (50GB SSD)                             │  │
│  │ /mnt/data                                             │  │
│  │  - Oxigraph database files                            │  │
│  │  - Qdrant storage & snapshots                         │  │
│  │  - Ontology files (.ttl, .rdf, .owl)                  │  │
│  │  - LinkML schemas (.yaml)                             │  │
│  │  - UML diagrams (.mmd)                                │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
## Components
- **Oxigraph**: High-performance SPARQL triplestore for RDF data
- **Qdrant**: Vector database for semantic search and RAG (Retrieval-Augmented Generation)
- **Caddy**: Modern web server with automatic HTTPS
- **Hetzner Cloud Server**: CX21 (2 vCPU, 4GB RAM, 40GB SSD)
- **Hetzner Volume**: 50GB persistent storage for data
## Prerequisites
1. [Terraform](https://www.terraform.io/downloads) (>= 1.0)
2. [Hetzner Cloud Account](https://console.hetzner.cloud/)
3. Hetzner API Token
## Setup
### 1. Create Hetzner API Token
1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/)
2. Select your project (or create a new one)
3. Go to **Security** > **API Tokens**
4. Generate a new token with **Read & Write** permissions
5. Copy the token (it's only shown once!)
### 2. Configure Terraform
```bash
cd infrastructure/terraform
# Copy the example variables file
cp terraform.tfvars.example terraform.tfvars
# Edit with your values
nano terraform.tfvars
```
Set the following variables:
- `hcloud_token`: Your Hetzner API token
- `domain`: Your domain name (e.g., `sparql.glam-ontology.org`)
- `ssh_public_key`: Path to your SSH public key
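A filled-in `terraform.tfvars` might look like the sketch below. The values are placeholders, not real credentials; keep this file out of version control, since it contains your API token.

```hcl
# terraform.tfvars — example values only
hcloud_token   = "hcloud-api-token-goes-here"   # from Console > Security > API Tokens
domain         = "sparql.glam-ontology.org"
ssh_public_key = "~/.ssh/id_ed25519.pub"
```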
### 3. Initialize and Apply
```bash
# Initialize Terraform
terraform init
# Preview changes
terraform plan
# Apply configuration
terraform apply
```
### 4. Deploy Data
After the infrastructure is created, deploy your ontology data:
```bash
# SSH into the server
ssh root@$(terraform output -raw server_ip)
# Load ontology files into Oxigraph
cd /var/lib/glam/scripts
./load-ontologies.sh
```
## Configuration Files
| File | Description |
|------|-------------|
| `main.tf` | Main infrastructure resources |
| `variables.tf` | Input variables |
| `outputs.tf` | Output values |
| `cloud-init.yaml` | Server initialization script |
| `terraform.tfvars.example` | Example variable values |
## Costs
Estimated monthly costs (EUR):
| Resource | Cost |
|----------|------|
| CX21 Server | ~€5.95/month |
| 50GB Volume | ~€2.50/month |
| IPv4 Address | Included |
| **Total** | ~€8.45/month |
## Maintenance
### Backup Data
```bash
# Backup Oxigraph database
ssh root@$SERVER_IP "tar -czf /tmp/oxigraph-backup.tar.gz /mnt/data/oxigraph"
scp root@$SERVER_IP:/tmp/oxigraph-backup.tar.gz ./backups/
```
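The one-off backup above can be wrapped in a small rotation helper. The script below is a hypothetical sketch, not part of the repo; the function name, retention count, and demo paths are illustrative.

```shell
# Hypothetical backup helper: archives a data directory under a timestamped
# name and keeps only the 7 newest archives.
backup_oxigraph() {
  data_dir=$1
  backup_dir=$2
  mkdir -p "$backup_dir"
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$backup_dir/oxigraph-$stamp.tar.gz"
  # -C so the archive contains only the directory name, not the full path
  tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
  # prune: keep the 7 newest archives
  ls -1t "$backup_dir"/oxigraph-*.tar.gz 2>/dev/null | tail -n +8 | xargs -r rm -f
  echo "$archive"
}

# Demo against a throwaway directory; on the server you would pass /mnt/data/oxigraph
demo=$(mktemp -d)
mkdir -p "$demo/oxigraph"
echo "triples" > "$demo/oxigraph/data.db"
archive=$(backup_oxigraph "$demo/oxigraph" "$demo/backups")
echo "Backup written to $archive"
```

Run it from cron (or a systemd timer) on the server to get regular backups instead of ad-hoc ones.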
### Update Ontologies
```bash
# Copy new ontology files
scp -r schemas/20251121/rdf/*.ttl root@$SERVER_IP:/mnt/data/ontologies/
# Reload into Oxigraph
ssh root@$SERVER_IP "/var/lib/glam/scripts/load-ontologies.sh"
```
### View Logs
```bash
ssh root@$SERVER_IP "journalctl -u oxigraph -f"
ssh root@$SERVER_IP "journalctl -u caddy -f"
```
## CI/CD with GitHub Actions
The project includes automatic deployment via GitHub Actions. Changes to infrastructure, data, or frontend trigger automatic deployments.
### Automatic Triggers
Deployments are triggered on push to `main` branch when files change in:
- `infrastructure/**` → Infrastructure deployment (Terraform)
- `schemas/20251121/rdf/**` → Data sync
- `schemas/20251121/linkml/**` → Data sync
- `data/ontology/**` → Data sync
- `frontend/**` → Frontend build & deploy
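In a GitHub Actions workflow file, the path filters above correspond to an `on.push.paths` block roughly like this (an illustrative excerpt; the actual workflow under `.github/workflows/` may differ):

```yaml
# .github/workflows/deploy.yml (excerpt, illustrative)
on:
  push:
    branches: [main]
    paths:
      - "infrastructure/**"
      - "schemas/20251121/rdf/**"
      - "schemas/20251121/linkml/**"
      - "data/ontology/**"
      - "frontend/**"
  workflow_dispatch:   # enables manual runs from the Actions tab
```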
### Setup CI/CD
Run the setup script to generate SSH keys and get instructions:
```bash
./setup-cicd.sh
```
This will:
1. Generate an ED25519 SSH key pair for deployments
2. Create a `.env` file template (if missing)
3. Print instructions for GitHub repository setup
### Required GitHub Secrets
Go to **Repository → Settings → Secrets and variables → Actions → Secrets**:
| Secret | Description |
|--------|-------------|
| `HETZNER_HC_API_TOKEN` | Hetzner Cloud API token (same as in `.env`) |
| `DEPLOY_SSH_PRIVATE_KEY` | Content of `.ssh/glam_deploy_key` (generated by setup script) |
| `TF_API_TOKEN` | (Optional) Terraform Cloud token for remote state |
### Required GitHub Variables
Go to **Repository → Settings → Secrets and variables → Actions → Variables**:
| Variable | Example |
|----------|---------|
| `GLAM_DOMAIN` | `sparql.glam-ontology.org` |
| `ADMIN_EMAIL` | `admin@example.org` |
### Manual Deployment
You can manually trigger deployments from the GitHub Actions tab with options:
- **Deploy infrastructure changes** - Run Terraform apply
- **Deploy ontology/schema data** - Sync data files to server
- **Deploy frontend build** - Build and deploy frontend
- **Reload data into Oxigraph** - Reimport all RDF data
### Local Deployment
For local deployments without GitHub Actions:
```bash
# Full deployment
./deploy.sh --all
# Just infrastructure
./deploy.sh --infra
# Just data
./deploy.sh --data
# Just frontend
./deploy.sh --frontend
# Just Qdrant vector database
./deploy.sh --qdrant
# Reload Oxigraph
./deploy.sh --reload
```
## Qdrant Vector Database
Qdrant is a self-hosted vector database used for semantic search over heritage institution data. It enables RAG (Retrieval-Augmented Generation) in the DSPy SPARQL generation module.
### Configuration
- **REST API**: `http://localhost:6333` (internal)
- **gRPC API**: `localhost:6334` (internal; gRPC, so no `http://` scheme)
- **External Access**: `https://<domain>/qdrant/` (via Caddy reverse proxy)
- **Storage**: `/mnt/data/qdrant/storage/`
- **Snapshots**: `/mnt/data/qdrant/snapshots/`
- **Resource Limits**: 2GB RAM, 2 CPUs
### Deployment
```bash
# Deploy/restart Qdrant
./deploy.sh --qdrant
```
### Health Check
```bash
# Check Qdrant health
ssh root@$SERVER_IP "curl http://localhost:6333/health"
# List collections
ssh root@$SERVER_IP "curl http://localhost:6333/collections"
```
### DSPy Integration
The Qdrant retriever is integrated with DSPy for RAG-enhanced SPARQL query generation:
```python
from glam_extractor.api.qdrant_retriever import HeritageCustodianRetriever

# Create retriever
retriever = HeritageCustodianRetriever(
    host="localhost",
    port=6333,
)

# Search for relevant institutions
results = retriever("museums in Amsterdam", k=5)

# Add institution to index
retriever.add_institution(
    name="Rijksmuseum",
    description="National museum of arts and history",
    institution_type="MUSEUM",
    country="NL",
    city="Amsterdam",
)
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `QDRANT_HOST` | `localhost` | Qdrant server hostname |
| `QDRANT_PORT` | `6333` | Qdrant REST API port |
| `QDRANT_ENABLED` | `true` | Enable/disable Qdrant integration |
| `OPENAI_API_KEY` | - | Required for embedding generation |
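A minimal sketch of how these variables might be read on the Python side; the helper name is hypothetical, and the defaults match the table above:

```python
import os

def qdrant_settings(env=None):
    """Read Qdrant connection settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("QDRANT_HOST", "localhost"),
        "port": int(env.get("QDRANT_PORT", "6333")),
        "enabled": env.get("QDRANT_ENABLED", "true").lower() == "true",
    }

# With no overrides, the documented defaults apply
print(qdrant_settings({}))  # {'host': 'localhost', 'port': 6333, 'enabled': True}
```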
## Destroy Infrastructure
```bash
terraform destroy
```
**Warning**: This will delete all data. Make sure to back up first!