GLAM Server Operations Guide
This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.
Server Overview
| Property | Value |
|---|---|
| Provider | Hetzner Cloud |
| Server Name | glam-sparql |
| IP Address | 91.98.224.44 |
| Instance Type | cx32 (4 vCPU, 8GB RAM) |
| Location | nbg1-dc3 (Nuremberg, Germany) |
| SSH User | root |
Disk Architecture
The server has two storage volumes with distinct purposes:
Root Volume (/dev/sda1)
| Property | Value |
|---|---|
| Mount Point | / |
| Size | 75GB |
| Purpose | Operating system, applications, virtual environments |
Contents:
- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs
Data Volume (/dev/sdb)
| Property | Value |
|---|---|
| Mount Point | /mnt/data |
| Size | 49GB |
| Purpose | Large datasets, Oxigraph triplestore, custodian YAML files |
Contents:
- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas
Directory Layout
Symbolic Links (Important!)
The following symlinks redirect data to the data volume:
/var/lib/glam/api/data/custodian → /mnt/data/custodian
Rationale: The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to /mnt/data with a symlink allows the API to find files at the expected path while storing them on the larger volume.
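The move-and-link step can be wrapped in a small idempotent helper so repeated runs are safe (a sketch; `ensure_symlink` is a hypothetical name, and the paths are the ones documented above):

```shell
# ensure_symlink SRC DST: if SRC is a real directory, move it to DST and
# replace it with a symlink; if the link is already correct, do nothing.
ensure_symlink() {
  src="$1"; dst="$2"
  if [ -L "$src" ] && [ "$(readlink -f "$src")" = "$(readlink -f "$dst")" ]; then
    echo "symlink OK"
  elif [ -d "$src" ] && [ ! -L "$src" ]; then
    # Real directory on the root volume: move it, then link back.
    mv "$src" "$dst"
    ln -s "$dst" "$src"
    echo "moved and linked"
  else
    ln -sfn "$dst" "$src"
    echo "linked"
  fi
}

# On the server:
# ensure_symlink /var/lib/glam/api/data/custodian /mnt/data/custodian
```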
Full Directory Structure
/var/lib/glam/
├── api/ # GLAM API (24GB total)
│ ├── venv/ # Python virtual environment (~2GB after optimization)
│ │ └── lib/python3.12/site-packages/
│ │ ├── torch/ # PyTorch CPU-only (~184MB)
│ │ ├── transformers/ # HuggingFace (~115MB)
│ │ ├── sentence_transformers/
│ │ └── ...
│ ├── src/ # API source code
│ ├── data/
│ │ ├── custodian → /mnt/data/custodian # SYMLINK
│ │ ├── validation/
│ │ └── sparql_templates.yaml
│ ├── schemas/ # LinkML schemas for API
│ ├── backend/ # RAG backend code
│ └── requirements.txt
├── frontend/ # Frontend build (~200MB)
├── ducklake/ # DuckLake database (~3.2GB)
├── valkey/ # Valkey cache config
└── scripts/ # Deployment scripts
/mnt/data/
├── oxigraph/ # SPARQL triplestore data
├── custodian/ # Heritage custodian YAML files (17GB)
│ ├── *.yaml # Individual custodian files
│ ├── person/ # Person entity profiles
│ └── web/ # Archived web content
├── ontologies/ # Base ontology files
└── rdf/ # Generated RDF schemas
Services
Systemd Services
| Service | Port | Description | Status Command |
|---|---|---|---|
| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |
Docker Containers
| Container | Port | Description |
|---|---|---|
| `qdrant` | 6333-6334 (internal) | Vector database |
| `glam-valkey-api` | 8090 (internal) | Valkey cache API |
| `forgejo` | 3000 (internal), 2222 | Git server |
Service Endpoints
| Domain | Backend |
|---|---|
| bronhouder.nl | Static frontend |
| archief.support | Archief Assistent app |
| sparql.bronhouder.nl | Oxigraph SPARQL proxy |
| api.bronhouder.nl | GLAM API (if configured) |
PyTorch CPU-Only Configuration
Critical: The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.
Why CPU-Only?
- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation
Installation
If PyTorch needs to be reinstalled:
# SSH to server
ssh root@91.98.224.44
# Activate venv
cd /var/lib/glam/api
source venv/bin/activate
# Uninstall GPU version
pip uninstall torch -y
# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
Verification
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
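To keep a future `pip install -r requirements.txt` from silently pulling the CUDA build back in, the CPU wheel index can be pinned inside the requirements file itself (a sketch; whether this matches the layout of the project's actual requirements.txt is an assumption):

```shell
# Sketch: a requirements file that forces the CPU-only torch wheel.
# Written to /tmp for illustration; the real file lives at
# /var/lib/glam/api/requirements.txt.
cat > /tmp/requirements-cpu.txt <<'EOF'
--index-url https://download.pytorch.org/whl/cpu
--extra-index-url https://pypi.org/simple
torch
EOF
# pip install -r /tmp/requirements-cpu.txt
```

The `--extra-index-url` line keeps every other dependency resolving from PyPI as usual; only `torch` (and its wheels) come from the CPU index.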
Package Sizes (Reference)
| Package | GPU Version | CPU Version |
|---|---|---|
| `torch` | ~1.7GB | ~184MB |
| `nvidia-*` | ~4.3GB | Not installed |
| `triton` | ~594MB | Not installed |
Disk Space Management
Monitoring
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"
# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
Target Usage
| Volume | Target | Warning | Critical |
|---|---|---|---|
| Root (/) | <70% | >80% | >90% |
| Data (/mnt/data) | <80% | >85% | >95% |
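The thresholds above can be checked with a one-off helper suitable for a cron job (a sketch; `check_disk` is a hypothetical name, and it relies on GNU coreutils' `df --output`):

```shell
# check_disk MOUNT LIMIT: print WARN when the volume's use% is at or above LIMIT.
check_disk() {
  mount="$1"; limit="$2"
  # Extract the bare percentage from `df --output=pcent`.
  used=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $mount at ${used}% (limit ${limit}%)"
  else
    echo "OK: $mount at ${used}%"
  fi
}

check_disk / 80            # root volume warning threshold
# check_disk /mnt/data 85  # data volume warning threshold (on the server)
```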
Cleanup Procedures
1. System Logs
ssh root@91.98.224.44 "
journalctl --vacuum-size=50M
rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
truncate -s 0 /var/log/syslog /var/log/auth.log
"
2. APT Cache
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
3. Docker Cleanup
ssh root@91.98.224.44 "docker system prune -af --volumes"
4. Python Cache
ssh root@91.98.224.44 "
find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
Troubleshooting
Disk Full Recovery
Symptoms:
- Services crash-looping
- "No space left on device" errors
- Deployment fails
Recovery Steps:
1. Identify largest directories:

   ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"

2. Clean logs (immediate relief):

   ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"

3. Check for unnecessary CUDA packages:

   ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
   # If >100MB, reinstall PyTorch CPU-only (see above)

4. Move large data to the data volume:

   # Example: Move custodian data
   ssh root@91.98.224.44 "
     mv /var/lib/glam/api/data/custodian /mnt/data/custodian
     ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
   "

5. Restart services:

   ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
GLAM API Crash Loop
Symptoms:
- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs
Diagnosis:
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
Common Causes:
1. Port already in use:

   ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"

2. Missing CUDA libraries (after cleanup):

   # Look for: "libcublas.so not found"
   # Fix: Reinstall PyTorch CPU-only (see above)

3. Out of memory:

   ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"

4. Missing custodian data symlink:

   ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
   # Should be a symlink to /mnt/data/custodian
Oxigraph Issues
Check status:
ssh root@91.98.224.44 "systemctl status oxigraph"
Check triple count:
curl -s "https://sparql.bronhouder.nl/query" \
-H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
Restart:
ssh root@91.98.224.44 "systemctl restart oxigraph"
Recovery Procedures
Full Disk Recovery (Documented 2026-01-10)
On January 10, 2026, both disks reached 100% capacity. Recovery steps:
1. Root cause: NVIDIA CUDA packages (~6GB) and custodian data (~17GB) on the root volume

2. Resolution:
   - Removed NVIDIA packages: rm -rf .../site-packages/nvidia .../triton
   - Reinstalled PyTorch CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
   - Moved custodian data: /var/lib/glam/api/data/custodian → /mnt/data/custodian (symlink)

3. Result:
   - Root: 100% → 43% used (42GB free)
   - Data: 100% → 69% used (15GB free)
Service Recovery Order
If multiple services are down, restart in this order:
1. `oxigraph` - SPARQL store (no dependencies)
2. `qdrant` (docker) - Vector database
3. `glam-valkey-api` (docker) - Cache
4. `glam-api` - API (depends on Qdrant, Oxigraph)
5. `caddy` - Reverse proxy (depends on backends)
ssh root@91.98.224.44 "
systemctl restart oxigraph
docker restart qdrant glam-valkey-api
sleep 5
systemctl restart glam-api
sleep 10
systemctl restart caddy
"
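The fixed `sleep` calls above are a guess at startup time; polling each service until it actually answers is more robust (a sketch; `wait_for` is a hypothetical helper, and the health URLs are assumptions that must match the real endpoints):

```shell
# wait_for URL [TRIES]: poll URL once per second until it answers with a
# success status, or give up after TRIES attempts (default 30).
wait_for() {
  url="$1"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf -o /dev/null "$url" && return 0
    sleep 1
    i=$((i + 1))
  done
  echo "timeout waiting for $url" >&2
  return 1
}

# Example (run on the server, URLs to be verified):
# systemctl restart oxigraph && wait_for http://localhost:7878/
# systemctl restart glam-api && wait_for http://localhost:8000/
```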
Monitoring Commands
Quick Health Check
./infrastructure/deploy.sh --status
Detailed Status
# All services
ssh root@91.98.224.44 "
echo '=== Disk ===' && df -h / /mnt/data
echo '=== Memory ===' && free -h
echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
Website Verification
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
--data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
Maintenance Schedule
Daily (Automated)
- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)
Weekly (Manual)
# Check disk usage
ssh root@91.98.224.44 "df -h"
# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
Monthly (Manual)
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"
# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"
# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
Related Documentation
- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines
Last Updated: 2026-01-10 Maintainer: GLAM Project Team