# GLAM Server Operations Guide
This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.
## Server Overview
| Property | Value |
|----------|-------|
| **Provider** | Hetzner Cloud |
| **Server Name** | `glam-sparql` |
| **IP Address** | `91.98.224.44` |
| **Instance Type** | cx32 (4 vCPU, 8GB RAM) |
| **Location** | nbg1-dc3 (Nuremberg, Germany) |
| **SSH User** | `root` |
---
## Disk Architecture
The server has two storage volumes with distinct purposes:
### Root Volume (`/dev/sda1`)
| Property | Value |
|----------|-------|
| **Mount Point** | `/` |
| **Size** | 75GB |
| **Purpose** | Operating system, applications, virtual environments |
**Contents:**
- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs
### Data Volume (`/dev/sdb`)
| Property | Value |
|----------|-------|
| **Mount Point** | `/mnt/data` |
| **Size** | 49GB |
| **Purpose** | Large datasets, Oxigraph triplestore, custodian YAML files |
**Contents:**
- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas
---
## Directory Layout
### Symbolic Links (Important!)
The following symlinks redirect data to the data volume:
```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```
**Rationale:** The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to `/mnt/data` with a symlink allows the API to find files at the expected path while storing them on the larger volume.
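A quick way to confirm the link is intact after a deployment is to compare its resolved path against the expected target. A minimal sketch (the `check_symlink` helper is illustrative, not part of the deploy scripts):

```shell
# Hypothetical helper: succeed only if a path resolves to the expected target.
check_symlink() {
  local link=$1 expected=$2
  [ "$(readlink -f "$link")" = "$(readlink -f "$expected")" ]
}

# On the server this would be run as:
#   check_symlink /var/lib/glam/api/data/custodian /mnt/data/custodian \
#     && echo "custodian symlink OK"
```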
### Full Directory Structure
```
/var/lib/glam/
├── api/                          # GLAM API (24GB total)
│   ├── venv/                     # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/                 # PyTorch CPU-only (~184MB)
│   │       ├── transformers/          # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                      # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian   # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/                  # LinkML schemas for API
│   ├── backend/                  # RAG backend code
│   └── requirements.txt
├── frontend/                     # Frontend build (~200MB)
├── ducklake/                     # DuckLake database (~3.2GB)
├── valkey/                       # Valkey cache config
└── scripts/                      # Deployment scripts

/mnt/data/
├── oxigraph/                     # SPARQL triplestore data
├── custodian/                    # Heritage custodian YAML files (17GB)
│   ├── *.yaml                    # Individual custodian files
│   ├── person/                   # Person entity profiles
│   └── web/                      # Archived web content
├── ontologies/                   # Base ontology files
└── rdf/                          # Generated RDF schemas
```
---
## Services
### Systemd Services
| Service | Port | Description | Status Command |
|---------|------|-------------|----------------|
| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |
### Docker Containers
| Container | Port | Description |
|-----------|------|-------------|
| `qdrant` | 6333-6334 (internal) | Vector database |
| `glam-valkey-api` | 8090 (internal) | Valkey cache API |
| `forgejo` | 3000 (internal), 2222 | Git server |
### Service Endpoints
| Domain | Backend |
|--------|---------|
| `bronhouder.nl` | Static frontend |
| `archief.support` | Archief Assistent app |
| `sparql.bronhouder.nl` | Oxigraph SPARQL proxy |
| `api.bronhouder.nl` | GLAM API (if configured) |
---
## PyTorch CPU-Only Configuration
**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.
### Why CPU-Only?
- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation
### Installation
If PyTorch needs to be reinstalled:
```bash
# SSH to server
ssh root@91.98.224.44
# Activate venv
cd /var/lib/glam/api
source venv/bin/activate
# Uninstall GPU version
pip uninstall torch -y
# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
### Verification
```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```
### Package Sizes (Reference)
| Package | GPU Version | CPU Version |
|---------|-------------|-------------|
| `torch` | ~1.7GB | ~184MB |
| `nvidia-*` | ~4.3GB | Not installed |
| `triton` | ~594MB | Not installed |
---
## Disk Space Management
### Monitoring
```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"
# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```
### Target Usage
| Volume | Target | Warning | Critical |
|--------|--------|---------|----------|
| Root (`/`) | <70% | >80% | >90% |
| Data (`/mnt/data`) | <80% | >85% | >95% |
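These thresholds can be checked mechanically. A minimal sketch, assuming GNU coreutils `df` (the `classify_usage` helper is illustrative, not an existing script on the server):

```shell
# Classify a usage percentage against warning/critical thresholds.
# Thresholds follow the table above: root volume uses 80/90, data volume 85/95.
classify_usage() {
  local pct=$1 warn=$2 crit=$3
  if [ "$pct" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example: check the root volume.
pct=$(df --output=pcent / | tail -n 1 | tr -d ' %')
echo "/ is at ${pct}%: $(classify_usage "$pct" 80 90)"
```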
### Cleanup Procedures
#### 1. System Logs
```bash
ssh root@91.98.224.44 "
journalctl --vacuum-size=50M
rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
truncate -s 0 /var/log/syslog /var/log/auth.log
"
```
#### 2. APT Cache
```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```
#### 3. Docker Cleanup
```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```
#### 4. Python Cache
```bash
ssh root@91.98.224.44 "
find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```
---
## Troubleshooting
### Disk Full Recovery
**Symptoms:**
- Services crash-looping
- "No space left on device" errors
- Deployment fails
**Recovery Steps:**
1. **Identify largest directories:**
```bash
ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
```
2. **Clean logs (immediate relief):**
```bash
ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
```
3. **Check for unnecessary CUDA packages:**
```bash
ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
# If >100MB, reinstall PyTorch CPU-only (see above)
```
4. **Move large data to data volume:**
```bash
# Example: Move custodian data
ssh root@91.98.224.44 "
mv /var/lib/glam/api/data/custodian /mnt/data/custodian
ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
"
```
5. **Restart services:**
```bash
ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
```
### GLAM API Crash Loop
**Symptoms:**
- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs
**Diagnosis:**
```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```
**Common Causes:**
1. **Port already in use:**
```bash
ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
```
2. **Missing CUDA libraries (after cleanup):**
```bash
# Look for: "libcublas.so not found"
# Fix: Reinstall PyTorch CPU-only (see above)
```
3. **Out of memory:**
```bash
ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
```
4. **Missing custodian data symlink:**
```bash
ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
# Should be symlink to /mnt/data/custodian
```
### Oxigraph Issues
**Check status:**
```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```
**Check triple count:**
```bash
curl -s "https://sparql.bronhouder.nl/query" \
-H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```
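The response follows the SPARQL 1.1 JSON results format, so the count can be pulled out with `jq` (already used elsewhere in this guide). A sketch against a sample payload with illustrative values; the binding key `count` must match the variable name in the query:

```shell
# Sample SPARQL JSON payload (illustrative values) and jq extraction.
result='{"head":{"vars":["count"]},"results":{"bindings":[{"count":{"type":"literal","value":"1234567"}}]}}'
echo "$result" | jq -r '.results.bindings[0].count.value'
# → 1234567
```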
**Restart:**
```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```
---
## Recovery Procedures
### Full Disk Recovery (Documented 2026-01-10)
On January 10, 2026, both disks reached 100% capacity. Recovery steps:
1. **Root cause:** NVIDIA CUDA packages (~6GB) + custodian data (~17GB) on root volume
2. **Resolution:**
- Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
- Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
- Moved custodian data: `/var/lib/glam/api/data/custodian` → `/mnt/data/custodian` (symlink)
3. **Result:**
- Root: 100% → 43% used (42GB free)
- Data: 100% → 69% used (15GB free)
### Service Recovery Order
If multiple services are down, restart in this order:
1. `oxigraph` - SPARQL store (no dependencies)
2. `qdrant` (docker) - Vector database
3. `glam-valkey-api` (docker) - Cache
4. `glam-api` - API (depends on Qdrant, Oxigraph)
5. `caddy` - Reverse proxy (depends on backends)
```bash
ssh root@91.98.224.44 "
systemctl restart oxigraph
docker restart qdrant glam-valkey-api
sleep 5
systemctl restart glam-api
sleep 10
systemctl restart caddy
"
```
---
## Monitoring Commands
### Quick Health Check
```bash
./infrastructure/deploy.sh --status
```
### Detailed Status
```bash
# All services
ssh root@91.98.224.44 "
echo '=== Disk ===' && df -h / /mnt/data
echo '=== Memory ===' && free -h
echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```
### Website Verification
```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```
---
## Maintenance Schedule
### Daily (Automated)
- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)
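Journald's rotation behaviour is driven by `/etc/systemd/journald.conf`; a fragment like the following caps on-disk journal size persistently, complementing the manual `--vacuum-size` cleanups above (the 200M value is an assumption, tune to taste):

```ini
# /etc/systemd/journald.conf — persistent cap on journal disk usage
[Journal]
SystemMaxUse=200M
```

After editing, apply with `systemctl restart systemd-journald`.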
### Weekly (Manual)
```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"
# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```
### Monthly (Manual)
```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"
# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"
# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```
---
## Related Documentation
- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines
---
**Last Updated:** 2026-01-10

**Maintainer:** GLAM Project Team