diff --git a/docs/SERVER_OPERATIONS.md b/docs/SERVER_OPERATIONS.md
new file mode 100644
index 0000000000..ccf3a79192
--- /dev/null
+++ b/docs/SERVER_OPERATIONS.md
@@ -0,0 +1,443 @@

# GLAM Server Operations Guide

This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.

## Server Overview

| Property | Value |
|----------|-------|
| **Provider** | Hetzner Cloud |
| **Server Name** | `glam-sparql` |
| **IP Address** | `91.98.224.44` |
| **Instance Type** | cx32 (4 vCPU, 8GB RAM) |
| **Location** | nbg1-dc3 (Nuremberg, Germany) |
| **SSH User** | `root` |

---

## Disk Architecture

The server has two storage volumes with distinct purposes:

### Root Volume (`/dev/sda1`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/` |
| **Size** | 75GB |
| **Purpose** | Operating system, applications, virtual environments |

**Contents:**
- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs

### Data Volume (`/dev/sdb`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/mnt/data` |
| **Size** | 49GB |
| **Purpose** | Large datasets, Oxigraph triplestore, custodian YAML files |

**Contents:**
- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas

---

## Directory Layout

### Symbolic Links (Important!)

The following symlinks redirect data to the data volume:

```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```

**Rationale:** The custodian data (~17GB, 27,459 YAML files) is too large for the root volume.
Moving it to `/mnt/data` with a symlink allows the API to find files at the expected path while storing them on the larger volume. + +### Full Directory Structure + +``` +/var/lib/glam/ +├── api/ # GLAM API (24GB total) +│ ├── venv/ # Python virtual environment (~2GB after optimization) +│ │ └── lib/python3.12/site-packages/ +│ │ ├── torch/ # PyTorch CPU-only (~184MB) +│ │ ├── transformers/ # HuggingFace (~115MB) +│ │ ├── sentence_transformers/ +│ │ └── ... +│ ├── src/ # API source code +│ ├── data/ +│ │ ├── custodian → /mnt/data/custodian # SYMLINK +│ │ ├── validation/ +│ │ └── sparql_templates.yaml +│ ├── schemas/ # LinkML schemas for API +│ ├── backend/ # RAG backend code +│ └── requirements.txt +├── frontend/ # Frontend build (~200MB) +├── ducklake/ # DuckLake database (~3.2GB) +├── valkey/ # Valkey cache config +└── scripts/ # Deployment scripts + +/mnt/data/ +├── oxigraph/ # SPARQL triplestore data +├── custodian/ # Heritage custodian YAML files (17GB) +│ ├── *.yaml # Individual custodian files +│ ├── person/ # Person entity profiles +│ └── web/ # Archived web content +├── ontologies/ # Base ontology files +└── rdf/ # Generated RDF schemas +``` + +--- + +## Services + +### Systemd Services + +| Service | Port | Description | Status Command | +|---------|------|-------------|----------------| +| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` | +| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` | +| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` | + +### Docker Containers + +| Container | Port | Description | +|-----------|------|-------------| +| `qdrant` | 6333-6334 (internal) | Vector database | +| `glam-valkey-api` | 8090 (internal) | Valkey cache API | +| `forgejo` | 3000 (internal), 2222 | Git server | + +### Service Endpoints + +| Domain | Backend | +|--------|---------| +| `bronhouder.nl` | Static frontend | +| `archief.support` | Archief Assistent app | +| 
`sparql.bronhouder.nl` | Oxigraph SPARQL proxy | +| `api.bronhouder.nl` | GLAM API (if configured) | + +--- + +## PyTorch CPU-Only Configuration + +**Critical:** The server does NOT have a GPU. PyTorch must be installed in CPU-only mode. + +### Why CPU-Only? + +- The cx32 instance has no GPU +- CUDA libraries add ~6GB of unnecessary packages +- CPU inference is sufficient for embedding generation + +### Installation + +If PyTorch needs to be reinstalled: + +```bash +# SSH to server +ssh root@91.98.224.44 + +# Activate venv +cd /var/lib/glam/api +source venv/bin/activate + +# Uninstall GPU version +pip uninstall torch -y + +# Install CPU-only version +pip install torch --index-url https://download.pytorch.org/whl/cpu +``` + +### Verification + +```bash +ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'" +# Should output: PyTorch 2.9.1+cpu, CUDA available: False +``` + +### Package Sizes (Reference) + +| Package | GPU Version | CPU Version | +|---------|-------------|-------------| +| `torch` | ~1.7GB | ~184MB | +| `nvidia-*` | ~4.3GB | Not installed | +| `triton` | ~594MB | Not installed | + +--- + +## Disk Space Management + +### Monitoring + +```bash +# Quick check +ssh root@91.98.224.44 "df -h / /mnt/data" + +# Detailed breakdown +ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh" +``` + +### Target Usage + +| Volume | Target | Warning | Critical | +|--------|--------|---------|----------| +| Root (`/`) | <70% | >80% | >90% | +| Data (`/mnt/data`) | <80% | >85% | >95% | + +### Cleanup Procedures + +#### 1. System Logs + +```bash +ssh root@91.98.224.44 " + journalctl --vacuum-size=50M + rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old + truncate -s 0 /var/log/syslog /var/log/auth.log +" +``` + +#### 2. APT Cache + +```bash +ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y" +``` + +#### 3. 
Docker Cleanup + +```bash +ssh root@91.98.224.44 "docker system prune -af --volumes" +``` + +#### 4. Python Cache + +```bash +ssh root@91.98.224.44 " + find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null + find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null +" +``` + +--- + +## Troubleshooting + +### Disk Full Recovery + +**Symptoms:** +- Services crash-looping +- "No space left on device" errors +- Deployment fails + +**Recovery Steps:** + +1. **Identify largest directories:** + ```bash + ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20" + ``` + +2. **Clean logs (immediate relief):** + ```bash + ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog" + ``` + +3. **Check for unnecessary CUDA packages:** + ```bash + ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia" + # If >100MB, reinstall PyTorch CPU-only (see above) + ``` + +4. **Move large data to data volume:** + ```bash + # Example: Move custodian data + ssh root@91.98.224.44 " + mv /var/lib/glam/api/data/custodian /mnt/data/custodian + ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian + " + ``` + +5. **Restart services:** + ```bash + ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy" + ``` + +### GLAM API Crash Loop + +**Symptoms:** +- `systemctl status glam-api` shows "activating (auto-restart)" +- High restart counter in logs + +**Diagnosis:** +```bash +ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager" +``` + +**Common Causes:** + +1. **Port already in use:** + ```bash + ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api" + ``` + +2. **Missing CUDA libraries (after cleanup):** + ```bash + # Look for: "libcublas.so not found" + # Fix: Reinstall PyTorch CPU-only (see above) + ``` + +3. **Out of memory:** + ```bash + ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5" + ``` + +4. 
**Missing custodian data symlink:** + ```bash + ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian" + # Should be symlink to /mnt/data/custodian + ``` + +### Oxigraph Issues + +**Check status:** +```bash +ssh root@91.98.224.44 "systemctl status oxigraph" +``` + +**Check triple count:** +```bash +curl -s "https://sparql.bronhouder.nl/query" \ + -H "Accept: application/sparql-results+json" \ + --data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }" +``` + +**Restart:** +```bash +ssh root@91.98.224.44 "systemctl restart oxigraph" +``` + +--- + +## Recovery Procedures + +### Full Disk Recovery (Documented 2026-01-10) + +On January 10, 2026, both disks reached 100% capacity. Recovery steps: + +1. **Root cause:** NVIDIA CUDA packages (~6GB) + custodian data (~17GB) on root volume + +2. **Resolution:** + - Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton` + - Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu` + - Moved custodian data: `/var/lib/glam/api/data/custodian` → `/mnt/data/custodian` (symlink) + +3. **Result:** + - Root: 100% → 43% used (42GB free) + - Data: 100% → 69% used (15GB free) + +### Service Recovery Order + +If multiple services are down, restart in this order: + +1. `oxigraph` - SPARQL store (no dependencies) +2. `qdrant` (docker) - Vector database +3. `glam-valkey-api` (docker) - Cache +4. `glam-api` - API (depends on Qdrant, Oxigraph) +5. 
`caddy` - Reverse proxy (depends on backends)

```bash
ssh root@91.98.224.44 "
  systemctl restart oxigraph
  docker restart qdrant glam-valkey-api
  sleep 5
  systemctl restart glam-api
  sleep 10
  systemctl restart caddy
"
```

---

## Monitoring Commands

### Quick Health Check

```bash
./infrastructure/deploy.sh --status
```

### Detailed Status

```bash
# All services
ssh root@91.98.224.44 "
  echo '=== Disk ===' && df -h / /mnt/data
  echo '=== Memory ===' && free -h
  echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
  echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
  echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```

### Website Verification

```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```

---

## Maintenance Schedule

### Daily (Automated)

- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)

### Weekly (Manual)

```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"

# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```

### Monthly (Manual)

```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"

# Docker image updates. Note: `docker restart` reuses the old image, so after
# pulling, recreate the container (docker rm + docker run with its original
# options) for the new image to take effect.
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest"

# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```

---

## Related Documentation

- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines

---

**Last Updated:** 2026-01-10
**Maintainer:** GLAM Project Team
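
---

## Appendix: Disk Threshold Check (Sketch)

The weekly disk check from the Maintenance Schedule can be wrapped in a small helper that applies the warning/critical thresholds from the Target Usage table in the Disk Space Management section. This is an illustrative sketch, not part of the deployed tooling; the `check_usage` function name and threshold wiring are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: classify a mount's disk usage against warning/critical thresholds.
# check_usage is a hypothetical helper, not part of the deployed scripts.

# check_usage <mount> <warn%> <crit%> — prints OK/WARNING/CRITICAL for a mount
check_usage() {
  local mount="$1" warn="$2" crit="$3"
  # df -P prints the use percentage in column 5 (e.g. "43%"); strip the "%"
  local pct
  pct=$(df -P "$mount" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$pct" -ge "$crit" ]; then
    echo "CRITICAL: $mount at ${pct}% (>= ${crit}%)"
  elif [ "$pct" -ge "$warn" ]; then
    echo "WARNING: $mount at ${pct}% (>= ${warn}%)"
  else
    echo "OK: $mount at ${pct}%"
  fi
}

# Root volume thresholds from the Target Usage table; on the server one
# would also run: check_usage /mnt/data 85 95
check_usage / 80 90
```

Run weekly (or from cron) over SSH, this gives a one-line status per volume instead of raw `df -h` output to eyeball.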