# GLAM Server Operations Guide

This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.

## Server Overview

| Property | Value |
|---|---|
| Provider | Hetzner Cloud |
| Server Name | glam-sparql |
| IP Address | 91.98.224.44 |
| Instance Type | cx32 (4 vCPU, 8GB RAM) |
| Location | nbg1-dc3 (Nuremberg, Germany) |
| SSH User | root |

## Disk Architecture

The server has two storage volumes with distinct purposes:

### Root Volume (/dev/sda1)

| Property | Value |
|---|---|
| Mount Point | / |
| Size | 75GB |
| Purpose | Operating system, applications, virtual environments |

Contents:

- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs

### Data Volume (/dev/sdb)

| Property | Value |
|---|---|
| Mount Point | /mnt/data |
| Size | 49GB |
| Purpose | Large datasets, Oxigraph triplestore, custodian YAML files |

Contents:

- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas

## Directory Layout

The following symlink redirects data to the data volume:

```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```

Rationale: The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to /mnt/data and symlinking it back lets the API find files at the expected path while storing them on the larger volume.
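The symlink can be verified with a short check; a minimal sketch, where `LINK` and `TARGET` are taken from the layout above:

```shell
# Verify the custodian data symlink (run on the server).
LINK=/var/lib/glam/api/data/custodian
TARGET=/mnt/data/custodian
if [ -L "$LINK" ] && [ "$(readlink -f "$LINK")" = "$TARGET" ]; then
  STATUS="symlink OK: $LINK -> $TARGET"
else
  STATUS="symlink BROKEN or missing: $LINK"
fi
echo "$STATUS"
```

If the check reports a broken link, recreate it with the `ln -s` command shown under Disk Full Recovery below.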

### Full Directory Structure

```
/var/lib/glam/
├── api/                          # GLAM API (24GB total)
│   ├── venv/                     # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/            # PyTorch CPU-only (~184MB)
│   │       ├── transformers/     # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                      # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian  # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/                  # LinkML schemas for API
│   ├── backend/                  # RAG backend code
│   └── requirements.txt
├── frontend/                     # Frontend build (~200MB)
├── ducklake/                     # DuckLake database (~3.2GB)
├── valkey/                       # Valkey cache config
└── scripts/                      # Deployment scripts
```

```
/mnt/data/
├── oxigraph/                     # SPARQL triplestore data
├── custodian/                    # Heritage custodian YAML files (17GB)
│   ├── *.yaml                    # Individual custodian files
│   ├── person/                   # Person entity profiles
│   └── web/                      # Archived web content
├── ontologies/                   # Base ontology files
└── rdf/                          # Generated RDF schemas
```

## Services

### Systemd Services

| Service | Port | Description | Status Command |
|---|---|---|---|
| oxigraph | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| glam-api | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| caddy | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |

### Docker Containers

| Container | Port | Description |
|---|---|---|
| qdrant | 6333-6334 (internal) | Vector database |
| glam-valkey-api | 8090 (internal) | Valkey cache API |
| forgejo | 3000 (internal), 2222 | Git server |

### Service Endpoints

| Domain | Backend |
|---|---|
| bronhouder.nl | Static frontend |
| archief.support | Archief Assistent app |
| sparql.bronhouder.nl | Oxigraph SPARQL proxy |
| api.bronhouder.nl | GLAM API (if configured) |

## PyTorch CPU-Only Configuration

**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.

### Why CPU-Only?

- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation

### Installation

If PyTorch needs to be reinstalled:

```bash
# SSH to server
ssh root@91.98.224.44

# Activate venv
cd /var/lib/glam/api
source venv/bin/activate

# Uninstall GPU version
pip uninstall torch -y

# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### Verification

```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```

### Package Sizes (Reference)

| Package | GPU Version | CPU Version |
|---|---|---|
| torch | ~1.7GB | ~184MB |
| nvidia-* | ~4.3GB | Not installed |
| triton | ~594MB | Not installed |
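The table above can be sanity-checked directly on disk after a reinstall. A possible helper (the `audit_cuda_leftovers` name is illustrative; the site-packages path matches the directory structure documented earlier):

```shell
# Report whether GPU-only packages remain in a venv's site-packages.
audit_cuda_leftovers() {
  sp="$1"
  for pkg in nvidia triton; do
    if [ -e "$sp/$pkg" ]; then
      # du -sh prints "SIZE<TAB>PATH"; keep only the size column
      echo "$pkg: still present ($(du -sh "$sp/$pkg" | cut -f1))"
    else
      echo "$pkg: not installed"
    fi
  done
}

# On the server:
audit_cuda_leftovers /var/lib/glam/api/venv/lib/python3.12/site-packages
```

Any "still present" line after a CPU-only install means the reinstall steps above should be repeated.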

## Disk Space Management

### Monitoring

```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"

# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```

### Target Usage

| Volume | Target | Warning | Critical |
|---|---|---|---|
| Root (/) | <70% | >80% | >90% |
| Data (/mnt/data) | <80% | >85% | >95% |
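These thresholds can be checked mechanically; a minimal sketch (the `check_disk` helper is illustrative, and `df --output=pcent` assumes GNU coreutils, which Ubuntu ships):

```shell
# Warn when a mounted volume exceeds its warning/critical threshold.
check_disk() {
  mount="$1"; warn="$2"; crit="$3"
  used=$(df --output=pcent "$mount" 2>/dev/null | tail -1 | tr -dc '0-9')
  if [ -z "$used" ]; then
    echo "SKIP: $mount not mounted"
  elif [ "$used" -ge "$crit" ]; then
    echo "CRITICAL: $mount at ${used}% (>= ${crit}%)"
  elif [ "$used" -ge "$warn" ]; then
    echo "WARNING: $mount at ${used}% (>= ${warn}%)"
  else
    echo "OK: $mount at ${used}%"
  fi
}

# Thresholds from the table above
check_disk /         80 90
check_disk /mnt/data 85 95
```

A WARNING result calls for the cleanup procedures below; a CRITICAL result for the Disk Full Recovery steps.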

### Cleanup Procedures

#### 1. System Logs

```bash
ssh root@91.98.224.44 "
  journalctl --vacuum-size=50M
  rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
  truncate -s 0 /var/log/syslog /var/log/auth.log
"
```

#### 2. APT Cache

```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```

#### 3. Docker Cleanup

Note: `--volumes` also deletes unused named volumes. Make sure the qdrant and glam-valkey-api containers are running before pruning, or their data volumes may be removed.

```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```

#### 4. Python Cache

```bash
ssh root@91.98.224.44 "
  find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
  find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```

## Troubleshooting

### Disk Full Recovery

Symptoms:

- Services crash-looping
- "No space left on device" errors
- Deployment fails

Recovery steps:

1. Identify the largest directories:

   ```bash
   ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
   ```

2. Clean logs (immediate relief):

   ```bash
   ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
   ```

3. Check for unnecessary CUDA packages:

   ```bash
   ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
   # If >100MB, reinstall PyTorch CPU-only (see above)
   ```

4. Move large data to the data volume:

   ```bash
   # Example: Move custodian data
   ssh root@91.98.224.44 "
     mv /var/lib/glam/api/data/custodian /mnt/data/custodian
     ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
   "
   ```

5. Restart services:

   ```bash
   ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
   ```

### GLAM API Crash Loop

Symptoms:

- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs

Diagnosis:

```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```

Common causes:

1. Port already in use:

   ```bash
   ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
   ```

2. Missing CUDA libraries (after cleanup):

   ```
   # Look for: "libcublas.so not found"
   # Fix: Reinstall PyTorch CPU-only (see above)
   ```

3. Out of memory:

   ```bash
   ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
   ```

4. Missing custodian data symlink:

   ```bash
   ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
   # Should be a symlink to /mnt/data/custodian
   ```
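The four checks above can be bundled into one diagnostic pass. A sketch (the `diagnose_glam_api` helper is hypothetical; each step degrades gracefully when a tool is unavailable):

```shell
# One-shot diagnosis for a crash-looping glam-api (run on the server).
diagnose_glam_api() {
  echo "--- recent glam-api logs ---"
  journalctl -u glam-api -n 20 --no-pager 2>/dev/null || echo "journalctl unavailable"
  echo "--- port 8000 listener ---"
  ss -ltn 2>/dev/null | grep ':8000 ' || echo "nothing listening on 8000"
  echo "--- memory ---"
  free -h 2>/dev/null || echo "free unavailable"
  echo "--- custodian symlink ---"
  ls -la /var/lib/glam/api/data/custodian 2>/dev/null || echo "symlink missing"
}

diagnose_glam_api
```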
    

### Oxigraph Issues

Check status:

```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```

Check triple count:

```bash
curl -s "https://sparql.bronhouder.nl/query" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```

Restart:

```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```

## Recovery Procedures

### Full Disk Recovery (Documented 2026-01-10)

On January 10, 2026, both disks reached 100% capacity. Recovery steps:

1. Root cause: NVIDIA CUDA packages (~6GB) and custodian data (~17GB) on the root volume.

2. Resolution:

   - Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
   - Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
   - Moved custodian data: `/var/lib/glam/api/data/custodian → /mnt/data/custodian` (symlink)

3. Result:

   - Root: 100% → 43% used (42GB free)
   - Data: 100% → 69% used (15GB free)

### Service Recovery Order

If multiple services are down, restart in this order:

1. oxigraph - SPARQL store (no dependencies)
2. qdrant (Docker) - Vector database
3. glam-valkey-api (Docker) - Cache
4. glam-api - API (depends on Qdrant and Oxigraph)
5. caddy - Reverse proxy (depends on backends)

```bash
ssh root@91.98.224.44 "
  systemctl restart oxigraph
  docker restart qdrant glam-valkey-api
  sleep 5
  systemctl restart glam-api
  sleep 10
  systemctl restart caddy
"
```

## Monitoring Commands

### Quick Health Check

```bash
./infrastructure/deploy.sh --status
```

### Detailed Status

```bash
# All services
ssh root@91.98.224.44 "
  echo '=== Disk ===' && df -h / /mnt/data
  echo '=== Memory ===' && free -h
  echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
  echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
  echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```

### Website Verification

```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```

## Maintenance Schedule

### Daily (Automated)

- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)
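Journald's footprint can also be capped explicitly so log growth cannot refill the root volume. A possible drop-in (assumption: no size cap is currently configured on the server), placed at `/etc/systemd/journald.conf.d/size.conf` and applied with `systemctl restart systemd-journald`:

```ini
# Cap the persistent journal so it cannot exhaust the root volume
[Journal]
SystemMaxUse=100M
SystemKeepFree=1G
```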

### Weekly (Manual)

```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"

# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```

### Monthly (Manual)

```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"

# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"

# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```

## See Also

- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines

**Last Updated:** 2026-01-10
**Maintainer:** GLAM Project Team