# GLAM Server Operations Guide

This document covers server architecture, disk management, troubleshooting, and recovery procedures for the GLAM production server.

## Server Overview

| Property | Value |
|----------|-------|
| **Provider** | Hetzner Cloud |
| **Server Name** | `glam-sparql` |
| **IP Address** | `91.98.224.44` |
| **Instance Type** | cx32 (4 vCPU, 8GB RAM) |
| **Location** | nbg1-dc3 (Nuremberg, Germany) |
| **SSH User** | `root` |

---

## Disk Architecture

The server has two storage volumes with distinct purposes:

### Root Volume (`/dev/sda1`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/` |
| **Size** | 75GB |
| **Purpose** | Operating system, applications, virtual environments |

**Contents:**

- `/var/lib/glam/api/` - GLAM API application and Python venv
- `/var/lib/glam/frontend/` - Frontend static files
- `/var/lib/glam/ducklake/` - DuckLake database files
- `/var/www/` - Web roots for Caddy
- `/usr/` - System binaries
- `/var/log/` - System logs

### Data Volume (`/dev/sdb`)

| Property | Value |
|----------|-------|
| **Mount Point** | `/mnt/data` |
| **Size** | 49GB |
| **Purpose** | Large datasets, Oxigraph triplestore, custodian YAML files |

**Contents:**

- `/mnt/data/oxigraph/` - SPARQL triplestore data
- `/mnt/data/custodian/` - Heritage custodian YAML files (27,459 files, ~17GB)
- `/mnt/data/ontologies/` - Base ontology files
- `/mnt/data/rdf/` - Generated RDF schemas

---

## Directory Layout

### Symbolic Links (Important!)

The following symlink redirects data to the data volume:

```
/var/lib/glam/api/data/custodian → /mnt/data/custodian
```

**Rationale:** The custodian data (~17GB, 27,459 YAML files) is too large for the root volume. Moving it to `/mnt/data` with a symlink allows the API to find files at the expected path while storing them on the larger volume.
### Full Directory Structure

```
/var/lib/glam/
├── api/                        # GLAM API (24GB total)
│   ├── venv/                   # Python virtual environment (~2GB after optimization)
│   │   └── lib/python3.12/site-packages/
│   │       ├── torch/                  # PyTorch CPU-only (~184MB)
│   │       ├── transformers/           # HuggingFace (~115MB)
│   │       ├── sentence_transformers/
│   │       └── ...
│   ├── src/                    # API source code
│   ├── data/
│   │   ├── custodian → /mnt/data/custodian   # SYMLINK
│   │   ├── validation/
│   │   └── sparql_templates.yaml
│   ├── schemas/                # LinkML schemas for API
│   ├── backend/                # RAG backend code
│   └── requirements.txt
├── frontend/                   # Frontend build (~200MB)
├── ducklake/                   # DuckLake database (~3.2GB)
├── valkey/                     # Valkey cache config
└── scripts/                    # Deployment scripts

/mnt/data/
├── oxigraph/                   # SPARQL triplestore data
├── custodian/                  # Heritage custodian YAML files (17GB)
│   ├── *.yaml                  # Individual custodian files
│   ├── person/                 # Person entity profiles
│   └── web/                    # Archived web content
├── ontologies/                 # Base ontology files
└── rdf/                        # Generated RDF schemas
```

---

## Services

### Systemd Services

| Service | Port | Description | Status Command |
|---------|------|-------------|----------------|
| `oxigraph` | 7878 (internal) | SPARQL triplestore | `systemctl status oxigraph` |
| `glam-api` | 8000 (internal) | FastAPI application | `systemctl status glam-api` |
| `caddy` | 80, 443 | Reverse proxy with TLS | `systemctl status caddy` |

### Docker Containers

| Container | Port | Description |
|-----------|------|-------------|
| `qdrant` | 6333-6334 (internal) | Vector database |
| `glam-valkey-api` | 8090 (internal) | Valkey cache API |
| `forgejo` | 3000 (internal), 2222 | Git server |

### Service Endpoints

| Domain | Backend |
|--------|---------|
| `bronhouder.nl` | Static frontend |
| `archief.support` | Archief Assistent app |
| `sparql.bronhouder.nl` | Oxigraph SPARQL proxy |
| `api.bronhouder.nl` | GLAM API (if configured) |

---

## PyTorch CPU-Only Configuration

**Critical:** The server does NOT have a GPU.
PyTorch must be installed in CPU-only mode.

### Why CPU-Only?

- The cx32 instance has no GPU
- CUDA libraries add ~6GB of unnecessary packages
- CPU inference is sufficient for embedding generation

### Installation

If PyTorch needs to be reinstalled:

```bash
# SSH to server
ssh root@91.98.224.44

# Activate venv
cd /var/lib/glam/api
source venv/bin/activate

# Uninstall GPU version
pip uninstall torch -y

# Install CPU-only version
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### Verification

```bash
ssh root@91.98.224.44 "source /var/lib/glam/api/venv/bin/activate && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'"

# Should output: PyTorch 2.9.1+cpu, CUDA available: False
```

### Package Sizes (Reference)

| Package | GPU Version | CPU Version |
|---------|-------------|-------------|
| `torch` | ~1.7GB | ~184MB |
| `nvidia-*` | ~4.3GB | Not installed |
| `triton` | ~594MB | Not installed |

---

## Disk Space Management

### Monitoring

```bash
# Quick check
ssh root@91.98.224.44 "df -h / /mnt/data"

# Detailed breakdown
ssh root@91.98.224.44 "du -sh /var/lib/glam/* /mnt/data/* 2>/dev/null | sort -rh"
```

### Target Usage

| Volume | Target | Warning | Critical |
|--------|--------|---------|----------|
| Root (`/`) | <70% | >80% | >90% |
| Data (`/mnt/data`) | <80% | >85% | >95% |

### Cleanup Procedures

#### 1. System Logs

```bash
ssh root@91.98.224.44 "
journalctl --vacuum-size=50M
rm -f /var/log/*.gz /var/log/*.1 /var/log/*.old
truncate -s 0 /var/log/syslog /var/log/auth.log
"
```

#### 2. APT Cache

```bash
ssh root@91.98.224.44 "apt-get clean && apt-get autoremove -y"
```

#### 3. Docker Cleanup

```bash
ssh root@91.98.224.44 "docker system prune -af --volumes"
```

#### 4. Python Cache

```bash
ssh root@91.98.224.44 "
find /var/lib/glam/api/venv -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null
find /var/lib/glam/api/venv -type f -name '*.pyc' -delete 2>/dev/null
"
```

---

## Troubleshooting

### Disk Full Recovery

**Symptoms:**

- Services crash-looping
- "No space left on device" errors
- Deployment fails

**Recovery Steps:**

1. **Identify largest directories:**

   ```bash
   ssh root@91.98.224.44 "du -sh /* 2>/dev/null | sort -rh | head -20"
   ```

2. **Clean logs (immediate relief):**

   ```bash
   ssh root@91.98.224.44 "journalctl --vacuum-size=50M && truncate -s 0 /var/log/syslog"
   ```

3. **Check for unnecessary CUDA packages:**

   ```bash
   ssh root@91.98.224.44 "du -sh /var/lib/glam/api/venv/lib/python*/site-packages/nvidia"
   # If >100MB, reinstall PyTorch CPU-only (see above)
   ```

4. **Move large data to the data volume:**

   ```bash
   # Example: Move custodian data
   ssh root@91.98.224.44 "
   mv /var/lib/glam/api/data/custodian /mnt/data/custodian
   ln -s /mnt/data/custodian /var/lib/glam/api/data/custodian
   "
   ```

5. **Restart services:**

   ```bash
   ssh root@91.98.224.44 "systemctl restart glam-api oxigraph caddy"
   ```

### GLAM API Crash Loop

**Symptoms:**

- `systemctl status glam-api` shows "activating (auto-restart)"
- High restart counter in logs

**Diagnosis:**

```bash
ssh root@91.98.224.44 "journalctl -u glam-api -n 100 --no-pager"
```

**Common Causes:**

1. **Port already in use:**

   ```bash
   ssh root@91.98.224.44 "fuser -k 8000/tcp; systemctl restart glam-api"
   ```

2. **Missing CUDA libraries (after cleanup):**

   ```bash
   # Look for: "libcublas.so not found"
   # Fix: Reinstall PyTorch CPU-only (see above)
   ```

3. **Out of memory:**

   ```bash
   ssh root@91.98.224.44 "free -h && dmesg | grep -i 'out of memory' | tail -5"
   ```

4. **Missing custodian data symlink:**

   ```bash
   ssh root@91.98.224.44 "ls -la /var/lib/glam/api/data/custodian"
   # Should be a symlink to /mnt/data/custodian
   ```

### Oxigraph Issues

**Check status:**

```bash
ssh root@91.98.224.44 "systemctl status oxigraph"
```

**Check triple count:**

```bash
curl -s "https://sparql.bronhouder.nl/query" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }"
```

**Restart:**

```bash
ssh root@91.98.224.44 "systemctl restart oxigraph"
```

---

## Recovery Procedures

### Full Disk Recovery (Documented 2026-01-10)

On January 10, 2026, both disks reached 100% capacity. Recovery steps:

1. **Root cause:** NVIDIA CUDA packages (~6GB) + custodian data (~17GB) on the root volume
2. **Resolution:**
   - Removed NVIDIA packages: `rm -rf .../site-packages/nvidia .../triton`
   - Reinstalled PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
   - Moved custodian data: `/var/lib/glam/api/data/custodian` → `/mnt/data/custodian` (symlink)
3. **Result:**
   - Root: 100% → 43% used (42GB free)
   - Data: 100% → 69% used (15GB free)

### Service Recovery Order

If multiple services are down, restart in this order:

1. `oxigraph` - SPARQL store (no dependencies)
2. `qdrant` (docker) - Vector database
3. `glam-valkey-api` (docker) - Cache
4. `glam-api` - API (depends on Qdrant, Oxigraph)
5. `caddy` - Reverse proxy (depends on backends)

```bash
ssh root@91.98.224.44 "
systemctl restart oxigraph
docker restart qdrant glam-valkey-api
sleep 5
systemctl restart glam-api
sleep 10
systemctl restart caddy
"
```

---

## Monitoring Commands

### Quick Health Check

```bash
./infrastructure/deploy.sh --status
```

### Detailed Status

```bash
# All services
ssh root@91.98.224.44 "
echo '=== Disk ===' && df -h / /mnt/data
echo '=== Memory ===' && free -h
echo '=== Services ===' && systemctl status oxigraph glam-api caddy --no-pager | grep -E '(●|Active:)'
echo '=== Docker ===' && docker ps --format 'table {{.Names}}\t{{.Status}}'
echo '=== Triples ===' && curl -s 'http://localhost:7878/query' -H 'Accept: application/sparql-results+json' --data-urlencode 'query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }' | jq -r '.results.bindings[0].c.value'
"
```

### Website Verification

```bash
# Check all endpoints
curl -s "https://bronhouder.nl" -o /dev/null -w "bronhouder.nl: %{http_code}\n"
curl -s "https://archief.support" -o /dev/null -w "archief.support: %{http_code}\n"
curl -s "https://sparql.bronhouder.nl/query" -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) as ?c) WHERE { ?s ?p ?o }" | jq -r '"Oxigraph: \(.results.bindings[0].c.value) triples"'
```

---

## Maintenance Schedule

### Daily (Automated)

- Log rotation via systemd journald
- Certificate renewal via Caddy (automatic)

### Weekly (Manual)

```bash
# Check disk usage
ssh root@91.98.224.44 "df -h"

# Check for failed services
ssh root@91.98.224.44 "systemctl --failed"
```

### Monthly (Manual)

```bash
# System updates
ssh root@91.98.224.44 "apt update && apt upgrade -y"

# Docker image updates
ssh root@91.98.224.44 "docker pull qdrant/qdrant:latest && docker restart qdrant"

# Clean old logs
ssh root@91.98.224.44 "journalctl --vacuum-time=30d"
```

---

## Related Documentation

- `docs/DEPLOYMENT_GUIDE.md` - Deployment procedures
- `.opencode/DEPLOYMENT_RULES.md` - AI agent deployment rules
- `infrastructure/deploy.sh` - Deployment script source
- `AGENTS.md` - Rule 7 covers deployment guidelines

---

**Last Updated:** 2026-01-10
**Maintainer:** GLAM Project Team