# GLAM Server Operations Guide

This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.

## Server Overview

| Property | Value |
|---|---|
| Provider | Hetzner Cloud |
| Server Name | glam-sparql |
| IP Address | 91.98.224.44 |
| Instance Type | cx32 (4 vCPU, 8GB RAM) |
| Location | nbg1-dc3 (Nuremberg, Germany) |
| SSH User | root |

## Disk Architecture

The server has two storage volumes with distinct purposes:

### Root Volume (/dev/sda1)

| Property | Value |
|---|---|
| Mount Point | / |
| Size | 75GB |
| Purpose | Operating system, applications, virtual environments |

Contents:

- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs

### Data Volume (/dev/sdb)

| Property | Value |
|---|---|
| Mount Point | /mnt/data |
| Size | 49GB |
| Purpose | Large datasets, Oxigraph triplestore, custodian YAML files |

Contents:

- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas

## Directory Layout

The following symlink redirects data to the data volume:

```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```

Rationale: The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to /mnt/data and symlinking it back lets the API find files at the expected path while storing them on the larger volume.
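The symlink can be verified with a short check; a minimal sketch, where `LINK` and `TARGET` are taken from the layout above:

```shell
# Verify the custodian data symlink (run on the server).
LINK=/var/lib/glam/api/data/custodian
TARGET=/mnt/data/custodian
if [ -L "$LINK" ] && [ "$(readlink -f "$LINK")" = "$TARGET" ]; then
  STATUS="symlink OK: $LINK -> $TARGET"
else
  STATUS="symlink BROKEN or missing: $LINK"
fi
echo "$STATUS"
```

If the check reports a broken link, recreate it with the `ln -s` command shown under Disk Full Recovery below.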

### Full Directory Structure

```
/var/lib/glam/
├── api/                          # GLAM API (24GB total)
│   ├── venv/                     # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/            # PyTorch CPU-only (~184MB)
│   │       ├── transformers/     # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                      # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian  # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/                  # LinkML schemas for API
│   ├── backend/                  # RAG backend code
│   └── requirements.txt
├── frontend/                     # Frontend build (~200MB)
├── ducklake/                     # DuckLake database (~3.2GB)
├── valkey/                       # Valkey cache config
└── scripts/                      # Deployment scripts
```

```
/mnt/data/
├── oxigraph/                     # SPARQL triplestore data
├── custodian/                    # Heritage custodian YAML files (17GB)
│   ├── *.yaml                    # Individual custodian files
│   ├── person/                   # Person entity profiles
│   └── web/                      # Archived web content
├── ontologies/                   # Base ontology files
└── rdf/                          # Generated RDF schemas
```

## Services

### Systemd Services

| Service | Port | Description | Status Command |
|---|---|---|---|
| oxigraph | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| glam-api | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| caddy | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |

### Docker Containers

| Container | Port | Description |
|---|---|---|
| qdrant | 6333-6334 (internal) | Vector database |
| glam-valkey-api | 8090 (internal) | Valkey cache API |
| forgejo | 3000 (internal), 2222 | Git server |

### Service Endpoints

| Domain | Backend |
|---|---|
| bronhouder.nl | Static frontend |
| archief.support | Archief Assistent app |
| sparql.bronhouder.nl | Oxigraph SPARQL proxy |
| api.bronhouder.nl | GLAM API (if configured) |

## PyTorch CPU-Only Configuration

**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.

### Why CPU-Only?

- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation

### Installation

If PyTorch needs to be reinstalled:

```bash
# SSH to server
ssh root@91.98.224.44

# Activate venv
cd /var/lib/glam/api
source venv/bin/activate

# Uninstall GPU version
pip uninstall torch -y

# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### Verification

```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```

### Package Sizes (Reference)

| Package | GPU Version | CPU Version |
|---|---|---|
| torch | ~1.7GB | ~184MB |
| nvidia-* | ~4.3GB | Not installed |
| triton | ~594MB | Not installed |
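The table above can be sanity-checked directly on disk after a reinstall. A possible helper (the `audit_cuda_leftovers` name is illustrative; the site-packages path matches the directory structure documented earlier):

```shell
# Report whether GPU-only packages remain in a venv's site-packages.
audit_cuda_leftovers() {
  sp="$1"
  for pkg in nvidia triton; do
    if [ -e "$sp/$pkg" ]; then
      # du -sh prints "SIZE<TAB>PATH"; keep only the size column
      echo "$pkg: still present ($(du -sh "$sp/$pkg" | cut -f1))"
    else
      echo "$pkg: not installed"
    fi
  done
}

# On the server:
audit_cuda_leftovers /var/lib/glam/api/venv/lib/python3.12/site-packages
```

Any "still present" line after a CPU-only install means the reinstall steps above should be repeated.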

## Disk Space Management

### Monitoring

```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"

# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```

### Target Usage

| Volume | Target | Warning | Critical |
|---|---|---|---|
| Root (/) | <70% | >80% | >90% |
| Data (/mnt/data) | <80% | >85% | >95% |
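These thresholds can be checked mechanically; a minimal sketch (the `check_disk` helper is illustrative, and `df --output=pcent` assumes GNU coreutils, which Ubuntu ships):

```shell
# Warn when a mounted volume exceeds its warning/critical threshold.
check_disk() {
  mount="$1"; warn="$2"; crit="$3"
  used=$(df --output=pcent "$mount" 2>/dev/null | tail -1 | tr -dc '0-9')
  if [ -z "$used" ]; then
    echo "SKIP: $mount not mounted"
  elif [ "$used" -ge "$crit" ]; then
    echo "CRITICAL: $mount at ${used}% (>= ${crit}%)"
  elif [ "$used" -ge "$warn" ]; then
    echo "WARNING: $mount at ${used}% (>= ${warn}%)"
  else
    echo "OK: $mount at ${used}%"
  fi
}

# Thresholds from the table above
check_disk /         80 90
check_disk /mnt/data 85 95
```

A WARNING result calls for the cleanup procedures below; a CRITICAL result for the Disk Full Recovery steps.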

### Cleanup Procedures

#### 1. System Logs

```bash
ssh root@91.98.224.44 "
  journalctl --vacuum-size=50M
  rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
  truncate -s 0 /var/log/syslog /var/log/auth.log
"
```

#### 2. APT Cache

```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```

#### 3. Docker Cleanup

Note: `--volumes` also deletes unused named volumes. Make sure the qdrant and glam-valkey-api containers are running before pruning, or their data volumes may be removed.

```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```

#### 4. Python Cache

```bash
ssh root@91.98.224.44 "
  find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
  find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```

## Troubleshooting

### Disk Full Recovery

Symptoms:

- Services crash-looping
- "No space left on device" errors
- Deployment fails

Recovery steps:

1. Identify the largest directories:

   ```bash
   ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
   ```

2. Clean logs (immediate relief):

   ```bash
   ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
   ```

3. Check for unnecessary CUDA packages:

   ```bash
   ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
   # If >100MB, reinstall PyTorch CPU-only (see above)
   ```

4. Move large data to the data volume:

   ```bash
   # Example: Move custodian data
   ssh root@91.98.224.44 "
     mv /var/lib/glam/api/data/custodian /mnt/data/custodian
     ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
   "
   ```

5. Restart services:

   ```bash
   ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
   ```

### GLAM API Crash Loop

Symptoms:

- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs

Diagnosis:

```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```

Common causes:

1. Port already in use:

   ```bash
   ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
   ```

2. Missing CUDA libraries (after cleanup):

   ```
   # Look for: "libcublas.so not found"
   # Fix: Reinstall PyTorch CPU-only (see above)
   ```

3. Out of memory:

   ```bash
   ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
   ```

4. Missing custodian data symlink:

   ```bash
   ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
   # Should be a symlink to /mnt/data/custodian
   ```
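The four checks above can be bundled into one diagnostic pass. A sketch (the `diagnose_glam_api` helper is hypothetical; each step degrades gracefully when a tool is unavailable):

```shell
# One-shot diagnosis for a crash-looping glam-api (run on the server).
diagnose_glam_api() {
  echo "--- recent glam-api logs ---"
  journalctl -u glam-api -n 20 --no-pager 2>/dev/null || echo "journalctl unavailable"
  echo "--- port 8000 listener ---"
  ss -ltn 2>/dev/null | grep ':8000 ' || echo "nothing listening on 8000"
  echo "--- memory ---"
  free -h 2>/dev/null || echo "free unavailable"
  echo "--- custodian symlink ---"
  ls -la /var/lib/glam/api/data/custodian 2>/dev/null || echo "symlink missing"
}

diagnose_glam_api
```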
    

### Oxigraph Issues

Check status:

```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```

Check triple count:

```bash
curl -s "https://sparql.bronhouder.nl/query" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```

Restart:

```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```

## Recovery Procedures

### Full Disk Recovery (Documented 2026-01-10)

On January 10, 2026, both disks reached 100% capacity. Recovery steps:

1. Root cause: NVIDIA CUDA packages (~6GB) and custodian data (~17GB) on the root volume.

2. Resolution:

   - Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
   - Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
   - Moved custodian data: `/var/lib/glam/api/data/custodian → /mnt/data/custodian` (symlink)

3. Result:

   - Root: 100% → 43% used (42GB free)
   - Data: 100% → 69% used (15GB free)

### Service Recovery Order

If multiple services are down, restart in this order:

1. oxigraph - SPARQL store (no dependencies)
2. qdrant (Docker) - Vector database
3. glam-valkey-api (Docker) - Cache
4. glam-api - API (depends on Qdrant and Oxigraph)
5. caddy - Reverse proxy (depends on backends)

```bash
ssh root@91.98.224.44 "
  systemctl restart oxigraph
  docker restart qdrant glam-valkey-api
  sleep 5
  systemctl restart glam-api
  sleep 10
  systemctl restart caddy
"
```

## Monitoring Commands

### Quick Health Check

```bash
./infrastructure/deploy.sh --status
```

### Detailed Status

```bash
# All services
ssh root@91.98.224.44 "
  echo '=== Disk ===' && df -h / /mnt/data
  echo '=== Memory ===' && free -h
  echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
  echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
  echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```

### Website Verification

```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```

## Maintenance Schedule

### Daily (Automated)

- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)
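Journald's footprint can also be capped explicitly so log growth cannot refill the root volume. A possible drop-in (assumption: no size cap is currently configured on the server), placed at `/etc/systemd/journald.conf.d/size.conf` and applied with `systemctl restart systemd-journald`:

```ini
# Cap the persistent journal so it cannot exhaust the root volume
[Journal]
SystemMaxUse=100M
SystemKeepFree=1G
```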

### Weekly (Manual)

```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"

# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```

### Monthly (Manual)

```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"

# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"

# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```

## See Also

- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines

**Last Updated:** 2026-01-10
**Maintainer:** GLAM Project Team