# GLAM Server Operations Guide
This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.
## Server Overview
| Property | Value |
|----------|-------|
| **Provider** | Hetzner Cloud |
| **Server Name** | `glam-sparql` |
| **IP Address** | `91.98.224.44` |
| **Instance Type** | cx32 (4 vCPU, 8GB RAM) |
| **Location** | nbg1-dc3 (Nuremberg, Germany) |
| **SSH User** | `root` |
---
## Disk Architecture
The server has two storage volumes with distinct purposes:
### Root Volume (`/dev/sda1`)
| Property | Value |
|----------|-------|
| **Mount Point** | `/` |
| **Size** | 75GB |
| **Purpose** | Operating system, applications, virtual environments |
**Contents:**
- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs
### Data Volume (`/dev/sdb`)
| Property | Value |
|----------|-------|
| **Mount Point** | `/mnt/data` |
| **Size** | 49GB |
| **Purpose** | Large datasets, Oxigraph triplestore, custodian YAML files |
**Contents:**
- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas
---
## Directory Layout
### Symbolic Links (Important!)
The following symlinks redirect data to the data volume:
```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```
**Rationale:** The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to `/mnt/data` with a symlink allows the API to find files at the expected path while storing them on the larger volume.
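A quick way to confirm the link is intact after a deployment is to compare its resolved path against the expected target. A minimal sketch (the `check_symlink` helper is illustrative, not part of the deploy scripts):

```shell
# Hypothetical helper: succeed only if a path resolves to the expected target.
check_symlink() {
  local link=$1 expected=$2
  [ "$(readlink -f "$link")" = "$(readlink -f "$expected")" ]
}

# On the server this would be run as:
#   check_symlink /var/lib/glam/api/data/custodian /mnt/data/custodian \
#     && echo "custodian symlink OK"
```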
### Full Directory Structure
```
/var/lib/glam/
├── api/                          # GLAM API (24GB total)
│   ├── venv/                     # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/                 # PyTorch CPU-only (~184MB)
│   │       ├── transformers/          # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                      # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian   # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/                  # LinkML schemas for API
│   ├── backend/                  # RAG backend code
│   └── requirements.txt
├── frontend/                     # Frontend build (~200MB)
├── ducklake/                     # DuckLake database (~3.2GB)
├── valkey/                       # Valkey cache config
└── scripts/                      # Deployment scripts

/mnt/data/
├── oxigraph/                     # SPARQL triplestore data
├── custodian/                    # Heritage custodian YAML files (17GB)
│   ├── *.yaml                    # Individual custodian files
│   ├── person/                   # Person entity profiles
│   └── web/                      # Archived web content
├── ontologies/                   # Base ontology files
└── rdf/                          # Generated RDF schemas
```
---
## Services
### Systemd Services
| Service | Port | Description | Status Command |
|---------|------|-------------|----------------|
| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |
### Docker Containers
| Container | Port | Description |
|-----------|------|-------------|
| `qdrant` | 6333-6334 (internal) | Vector database |
| `glam-valkey-api` | 8090 (internal) | Valkey cache API |
| `forgejo` | 3000 (internal), 2222 | Git server |
### Service Endpoints
| Domain | Backend |
|--------|---------|
| `bronhouder.nl` | Static frontend |
| `archief.support` | Archief Assistent app |
| `sparql.bronhouder.nl` | Oxigraph SPARQL proxy |
| `api.bronhouder.nl` | GLAM API (if configured) |
---
## PyTorch CPU-Only Configuration
**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.
### Why CPU-Only?
- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation
### Installation
If PyTorch needs to be reinstalled:
```bash
# SSH to server
ssh root@91.98.224.44
# Activate venv
cd /var/lib/glam/api
source venv/bin/activate
# Uninstall GPU version
pip uninstall torch -y
# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
### Verification
```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```
### Package Sizes (Reference)
| Package | GPU Version | CPU Version |
|---------|-------------|-------------|
| `torch` | ~1.7GB | ~184MB |
| `nvidia-*` | ~4.3GB | Not installed |
| `triton` | ~594MB | Not installed |
---
## Disk Space Management
### Monitoring
```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"
# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```
### Target Usage
| Volume | Target | Warning | Critical |
|--------|--------|---------|----------|
| Root (`/`) | <70% | >80% | >90% |
| Data (`/mnt/data`) | <80% | >85% | >95% |
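These thresholds can be checked mechanically. A minimal sketch, assuming GNU coreutils `df` (the `classify_usage` helper is illustrative, not an existing script on the server):

```shell
# Classify a usage percentage against warning/critical thresholds.
# Thresholds follow the table above: root volume uses 80/90, data volume 85/95.
classify_usage() {
  local pct=$1 warn=$2 crit=$3
  if [ "$pct" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example: check the root volume.
pct=$(df --output=pcent / | tail -n 1 | tr -d ' %')
echo "/ is at ${pct}%: $(classify_usage "$pct" 80 90)"
```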
### Cleanup Procedures
#### 1. System Logs
```bash
ssh root@91.98.224.44 "
journalctl --vacuum-size=50M
rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
truncate -s 0 /var/log/syslog /var/log/auth.log
"
```
#### 2. APT Cache
```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```
#### 3. Docker Cleanup
```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```
#### 4. Python Cache
```bash
ssh root@91.98.224.44 "
find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```
---
## Troubleshooting
### Disk Full Recovery
**Symptoms:**
- Services crash-looping
- "No space left on device" errors
- Deployment fails
**Recovery Steps:**
1. **Identify largest directories:**
```bash
ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
```
2. **Clean logs (immediate relief):**
```bash
ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
```
3. **Check for unnecessary CUDA packages:**
```bash
ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
# If >100MB, reinstall PyTorch CPU-only (see above)
```
4. **Move large data to data volume:**
```bash
# Example: Move custodian data
ssh root@91.98.224.44 "
mv /var/lib/glam/api/data/custodian /mnt/data/custodian
ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
"
```
5. **Restart services:**
```bash
ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
```
### GLAM API Crash Loop
**Symptoms:**
- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs
**Diagnosis:**
```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```
**Common Causes:**
1. **Port already in use:**
```bash
ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
```
2. **Missing CUDA libraries (after cleanup):**
```bash
# Look for: "libcublas.so not found"
# Fix: Reinstall PyTorch CPU-only (see above)
```
3. **Out of memory:**
```bash
ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
```
4. **Missing custodian data symlink:**
```bash
ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
# Should be symlink to /mnt/data/custodian
```
### Oxigraph Issues
**Check status:**
```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```
**Check triple count:**
```bash
curl -s "https://sparql.bronhouder.nl/query" \
-H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```
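The response follows the SPARQL 1.1 JSON results format, so the count can be pulled out with `jq` (already used elsewhere in this guide). A sketch against a sample payload with illustrative values; the binding key `count` must match the variable name in the query:

```shell
# Sample SPARQL JSON payload (illustrative values) and jq extraction.
result='{"head":{"vars":["count"]},"results":{"bindings":[{"count":{"type":"literal","value":"1234567"}}]}}'
echo "$result" | jq -r '.results.bindings[0].count.value'
# → 1234567
```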
**Restart:**
```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```
---
## Recovery Procedures
### Full Disk Recovery (Documented 2026-01-10)
On January 10, 2026, both disks reached 100% capacity. Recovery steps:
1. **Root cause:** NVIDIA CUDA packages (~6GB) + custodian data (~17GB) on root volume
2. **Resolution:**
- Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
- Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
- Moved custodian data: `/var/lib/glam/api/data/custodian` → `/mnt/data/custodian` (symlink)
3. **Result:**
- Root: 100% → 43% used (42GB free)
- Data: 100% → 69% used (15GB free)
### Service Recovery Order
If multiple services are down, restart in this order:
1. `oxigraph` - SPARQL store (no dependencies)
2. `qdrant` (docker) - Vector database
3. `glam-valkey-api` (docker) - Cache
4. `glam-api` - API (depends on Qdrant, Oxigraph)
5. `caddy` - Reverse proxy (depends on backends)
```bash
ssh root@91.98.224.44 "
systemctl restart oxigraph
docker restart qdrant glam-valkey-api
sleep 5
systemctl restart glam-api
sleep 10
systemctl restart caddy
"
```
---
## Monitoring Commands
### Quick Health Check
```bash
./infrastructure/deploy.sh --status
```
### Detailed Status
```bash
# All services
ssh root@91.98.224.44 "
echo '=== Disk ===' && df -h / /mnt/data
echo '=== Memory ===' && free -h
echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```
### Website Verification
```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```
---
## Maintenance Schedule
### Daily (Automated)
- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)
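Journald's rotation behaviour is driven by `/etc/systemd/journald.conf`; a fragment like the following caps on-disk journal size persistently, complementing the manual `--vacuum-size` cleanups above (the 200M value is an assumption, tune to taste):

```ini
# /etc/systemd/journald.conf — persistent cap on journal disk usage
[Journal]
SystemMaxUse=200M
```

After editing, apply with `systemctl restart systemd-journald`.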
### Weekly (Manual)
```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"
# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```
### Monthly (Manual)
```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"
# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"
# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```
---
## Related Documentation
- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines
---
**Last Updated:** 2026-01-10

**Maintainer:** GLAM Project Team