# GLAM Server Operations Guide

This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.

## Server Overview

| Property | Value |
|----------|-------|
| **Provider** | Hetzner Cloud |
| **Server Name** | `glam-sparql` |
| **IP Address** | `91.98.224.44` |
| **Instance Type** | cx32 (4 vCPU, 8GB RAM) |
| **Location** | nbg1-dc3 (Nuremberg, Germany) |
| **SSH User** | `root` |
---

## Disk Architecture

The server has two storage volumes with distinct purposes:

### Root Volume (`/dev/sda1`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/` |
| **Size** | 75GB |
| **Purpose** | Operating system, applications, virtual environments |

**Contents:**

- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs

### Data Volume (`/dev/sdb`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/mnt/data` |
| **Size** | 49GB |
| **Purpose** | Large datasets, Oxigraph triplestore, custodian YAML files |

**Contents:**

- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas
---

## Directory Layout

### Symbolic Links (Important!)

The following symlink redirects data to the data volume:

```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```

**Rationale:** The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to `/mnt/data` and symlinking it back lets the API find files at the expected path while storing them on the larger volume.
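If the symlink is missing after a rebuild or restore, a small helper can check and recreate it. This is a sketch (the function name is mine, not one of the repo's scripts); the paths are arguments so it can be dry-run anywhere:

```bash
# ensure_symlink TARGET LINK
# Verifies that LINK is a symlink to TARGET; creates it if absent.
# Refuses to touch LINK if a real file or directory is in the way.
ensure_symlink() {
  target="$1"
  link="$2"
  if [ -L "$link" ]; then
    actual="$(readlink "$link")"
    if [ "$actual" = "$target" ]; then
      echo "ok: $link -> $target"
    else
      echo "error: $link points to $actual, expected $target" >&2
      return 1
    fi
  elif [ -e "$link" ]; then
    echo "error: $link exists but is not a symlink" >&2
    return 1
  else
    ln -s "$target" "$link"
    echo "created: $link -> $target"
  fi
}

# On the server this would be:
#   ensure_symlink /mnt/data/custodian /var/lib/glam/api/data/custodian
```

The refusal to overwrite an existing directory matters here: if a deployment recreated `data/custodian` as a real directory, blindly replacing it would discard data.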
### Full Directory Structure

```
/var/lib/glam/
├── api/                       # GLAM API (24GB total)
│   ├── venv/                  # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/         # PyTorch CPU-only (~184MB)
│   │       ├── transformers/  # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                   # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian   # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/               # LinkML schemas for API
│   ├── backend/               # RAG backend code
│   └── requirements.txt
├── frontend/                  # Frontend build (~200MB)
├── ducklake/                  # DuckLake database (~3.2GB)
├── valkey/                    # Valkey cache config
└── scripts/                   # Deployment scripts

/mnt/data/
├── oxigraph/                  # SPARQL triplestore data
├── custodian/                 # Heritage custodian YAML files (17GB)
│   ├── *.yaml                 # Individual custodian files
│   ├── person/                # Person entity profiles
│   └── web/                   # Archived web content
├── ontologies/                # Base ontology files
└── rdf/                       # Generated RDF schemas
```
---

## Services

### Systemd Services

| Service | Port | Description | Status Command |
|---------|------|-------------|----------------|
| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |

### Docker Containers

| Container | Port | Description |
|-----------|------|-------------|
| `qdrant` | 6333-6334 (internal) | Vector database |
| `glam-valkey-api` | 8090 (internal) | Valkey cache API |
| `forgejo` | 3000 (internal), 2222 | Git server |

### Service Endpoints

| Domain | Backend |
|--------|---------|
| `bronhouder.nl` | Static frontend |
| `archief.support` | Archief Assistent app |
| `sparql.bronhouder.nl` | Oxigraph SPARQL proxy |
| `api.bronhouder.nl` | GLAM API (if configured) |
---

## PyTorch CPU-Only Configuration

**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode.

### Why CPU-Only?

- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation

### Installation

If PyTorch needs to be reinstalled:

```bash
# SSH to server
ssh root@91.98.224.44

# Activate venv
cd /var/lib/glam/api
source venv/bin/activate

# Uninstall GPU version
pip uninstall torch -y

# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
### Verification

```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"
# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```

### Package Sizes (Reference)

| Package | GPU Version | CPU Version |
|---------|-------------|-------------|
| `torch` | ~1.7GB | ~184MB |
| `nvidia-*` | ~4.3GB | Not installed |
| `triton` | ~594MB | Not installed |
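To confirm a cleanup held, it helps to scan a venv's `site-packages` for GPU-only package directories. A local sketch (the function name is mine; the path is an argument so it can be pointed at any venv):

```bash
# gpu_leftovers SITE_PACKAGES
# Lists GPU-only package directories (nvidia*, triton) that should not
# exist in a CPU-only install. Empty output means the venv is clean.
gpu_leftovers() {
  find "$1" -maxdepth 1 -type d \( -name 'nvidia*' -o -name 'triton' \) 2>/dev/null
}

# On the server:
#   gpu_leftovers /var/lib/glam/api/venv/lib/python3.12/site-packages
```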
---

## Disk Space Management

### Monitoring

```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"

# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```

### Target Usage

| Volume | Target | Warning | Critical |
|--------|--------|---------|----------|
| Root (`/`) | <70% | >80% | >90% |
| Data (`/mnt/data`) | <80% | >85% | >95% |
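These thresholds can be checked mechanically. A sketch of a check (not an existing script in this repo; `usage_pct` assumes GNU `df`):

```bash
# usage_pct MOUNT: print the used percentage of a mounted filesystem
usage_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -d ' %'
}

# classify_usage PCT WARN CRIT: map a percentage onto the table above
classify_usage() {
  pct="$1"; warn="$2"; crit="$3"
  if [ "$pct" -gt "$crit" ]; then
    echo CRITICAL
  elif [ "$pct" -gt "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Root volume, with the thresholds from the table:
#   classify_usage "$(usage_pct /)" 80 90
# Data volume:
#   classify_usage "$(usage_pct /mnt/data)" 85 95
```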
### Cleanup Procedures

#### 1. System Logs

```bash
ssh root@91.98.224.44 "
journalctl --vacuum-size=50M
rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
truncate -s 0 /var/log/syslog /var/log/auth.log
"
```

#### 2. APT Cache

```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```

#### 3. Docker Cleanup

```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```

**Warning:** `-a` removes stopped containers, and `--volumes` then removes any volumes they leave unreferenced. Make sure stateful containers (e.g. `qdrant`) are running before pruning.

#### 4. Python Cache

```bash
ssh root@91.98.224.44 "
find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```
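The Python cache cleanup can be wrapped as a reusable function with the root path as a parameter (a convenience sketch, not one of the repo's scripts):

```bash
# clean_pycache ROOT: remove __pycache__ directories and stray .pyc files
# under ROOT. Errors from find descending into directories it just removed
# are suppressed, so the function always returns success.
clean_pycache() {
  root="$1"
  find "$root" -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
  find "$root" -type f -name '*.pyc' -delete 2>/dev/null
  return 0
}

# On the server:
#   clean_pycache /var/lib/glam/api/venv
```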
---

## Troubleshooting

### Disk Full Recovery

**Symptoms:**

- Services crash-looping
- "No space left on device" errors
- Deployment fails

**Recovery Steps:**

1. **Identify largest directories:**
   ```bash
   ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
   ```

2. **Clean logs (immediate relief):**
   ```bash
   ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
   ```

3. **Check for unnecessary CUDA packages:**
   ```bash
   ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
   # If >100MB, reinstall PyTorch CPU-only (see above)
   ```

4. **Move large data to the data volume:**
   ```bash
   # Example: Move custodian data
   ssh root@91.98.224.44 "
   mv /var/lib/glam/api/data/custodian /mnt/data/custodian
   ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
   "
   ```

5. **Restart services:**
   ```bash
   ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
   ```
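The move-and-symlink pattern in step 4 can be generalized. A hedged sketch (the function name and safety checks are mine, not an existing script) that refuses to clobber existing data or double-apply the move:

```bash
# move_to_data SRC DEST: move SRC to DEST and leave a symlink at SRC.
# Refuses to run if SRC is already a symlink (move already done) or if
# DEST already exists (would overwrite data).
move_to_data() {
  src="$1"; dest="$2"
  if [ -L "$src" ]; then
    echo "skip: $src is already a symlink" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "error: $dest already exists" >&2
    return 1
  fi
  mv "$src" "$dest" && ln -s "$dest" "$src"
}

# What step 4 above does, expressed with the helper:
#   move_to_data /var/lib/glam/api/data/custodian /mnt/data/custodian
```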
### GLAM API Crash Loop

**Symptoms:**

- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs

**Diagnosis:**

```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```

**Common Causes:**

1. **Port already in use:**
   ```bash
   ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
   ```

2. **Missing CUDA libraries (after cleanup):**
   ```bash
   # Look for: "libcublas.so not found"
   # Fix: Reinstall PyTorch CPU-only (see above)
   ```

3. **Out of memory:**
   ```bash
   ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
   ```

4. **Missing custodian data symlink:**
   ```bash
   ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
   # Should be a symlink to /mnt/data/custodian
   ```
### Oxigraph Issues

**Check status:**

```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```

**Check triple count:**

```bash
curl -s "https://sparql.bronhouder.nl/query" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```

**Restart:**

```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```

---
## Recovery Procedures

### Full Disk Recovery (Documented 2026-01-10)

On January 10, 2026, both disks reached 100% capacity. Recovery steps:

1. **Root cause:** NVIDIA CUDA packages (~6GB) + custodian data (~17GB) on the root volume

2. **Resolution:**
   - Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
   - Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
   - Moved custodian data: `/var/lib/glam/api/data/custodian` → `/mnt/data/custodian` (symlink)

3. **Result:**
   - Root: 100% → 43% used (42GB free)
   - Data: 100% → 69% used (15GB free)
### Service Recovery Order

If multiple services are down, restart in this order:

1. `oxigraph` - SPARQL store (no dependencies)
2. `qdrant` (docker) - Vector database
3. `glam-valkey-api` (docker) - Cache
4. `glam-api` - API (depends on Qdrant, Oxigraph)
5. `caddy` - Reverse proxy (depends on backends)

```bash
ssh root@91.98.224.44 "
systemctl restart oxigraph
docker restart qdrant glam-valkey-api
sleep 5
systemctl restart glam-api
sleep 10
systemctl restart caddy
"
```
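The fixed `sleep` calls above are a guess at startup time. A retry loop is more robust; this generic helper (hypothetical, not in the repo) retries any check command until it succeeds:

```bash
# wait_until TRIES CMD...
# Run CMD once per second until it succeeds or TRIES attempts are used up.
# Returns 0 on success, 1 if CMD never succeeded.
wait_until() {
  tries="$1"; shift
  while [ "$tries" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep 1
  done
  return 1
}

# Example: wait for glam-api to be active before restarting caddy:
#   wait_until 30 systemctl is-active --quiet glam-api
```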
---

## Monitoring Commands

### Quick Health Check

```bash
./infrastructure/deploy.sh --status
```

### Detailed Status

```bash
# All services
ssh root@91.98.224.44 "
echo '=== Disk ===' && df -h / /mnt/data
echo '=== Memory ===' && free -h
echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```

### Website Verification

```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```

---
## Maintenance Schedule

### Daily (Automated)

- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)

### Weekly (Manual)

```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"

# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```

### Monthly (Manual)

```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"

# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"

# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```

---
## Related Documentation

- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines

---

**Last Updated:** 2026-01-10
**Maintainer:** GLAM Project Team