GLAM Infrastructure as Code

This directory contains Terraform configuration for deploying the GLAM Heritage Custodian Ontology infrastructure to Hetzner Cloud.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Hetzner Cloud                        │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │            CX21 Server (2 vCPU, 4GB RAM)              │  │
│  │                                                       │  │
│  │  ┌──────────────────┐      ┌───────────────────────┐  │  │
│  │  │      Caddy       │      │       Oxigraph        │  │  │
│  │  │ (Reverse Proxy)  │─────▶│ (SPARQL Triplestore)  │  │  │
│  │  │   :443, :80      │      │        :7878          │  │  │
│  │  └────────┬─────────┘      └───────────┬───────────┘  │  │
│  │           │                            │              │  │
│  │           │                ┌───────────▼───────────┐  │  │
│  │           │                │  /mnt/data/oxigraph   │  │  │
│  │           │                │ (Persistent Storage)  │  │  │
│  │           │                └───────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼──────────────────────────────────────┐    │  │
│  │  │                Qdrant (Docker)                │    │  │
│  │  │            Vector Database for RAG            │    │  │
│  │  │          :6333 (REST), :6334 (gRPC)           │    │  │
│  │  └────────┬──────────────────────────────────────┘    │  │
│  │           │                                           │  │
│  │  ┌────────▼──────────────────────────────────────┐    │  │
│  │  │            Frontend (Static Files)            │    │  │
│  │  │            /var/www/glam-frontend             │    │  │
│  │  └───────────────────────────────────────────────┘    │  │
│  │                                                       │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Hetzner Volume (50GB SSD)               │  │
│  │               /mnt/data                               │  │
│  │  - Oxigraph database files                            │  │
│  │  - Qdrant storage & snapshots                         │  │
│  │  - Ontology files (.ttl, .rdf, .owl)                  │  │
│  │  - LinkML schemas (.yaml)                             │  │
│  │  - UML diagrams (.mmd)                                │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Components

  • Oxigraph: High-performance SPARQL triplestore for RDF data
  • Qdrant: Vector database for semantic search and RAG (Retrieval-Augmented Generation)
  • Caddy: Modern web server with automatic HTTPS
  • Hetzner Cloud Server: CX21 (2 vCPU, 4GB RAM, 40GB SSD)
  • Hetzner Volume: 50GB persistent storage for data

Prerequisites

  1. Terraform (>= 1.0)
  2. Hetzner Cloud Account
  3. Hetzner API Token

Setup

1. Create Hetzner API Token

  1. Log in to Hetzner Cloud Console
  2. Select your project (or create a new one)
  3. Go to Security > API Tokens
  4. Generate a new token with Read & Write permissions
  5. Copy the token (it's only shown once!)

2. Configure Terraform

cd infrastructure/terraform

# Copy the example variables file
cp terraform.tfvars.example terraform.tfvars

# Edit with your values
nano terraform.tfvars

Set the following variables:

  • hcloud_token: Your Hetzner API token
  • domain: Your domain name (e.g., sparql.glam-ontology.org)
  • ssh_public_key: Path to your SSH public key
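
As a minimal sketch, terraform.tfvars might look like the following; all three values are placeholders that must be replaced with your own:

```hcl
# Illustrative values only -- substitute your own token, domain, and key path.
hcloud_token   = "your-hetzner-api-token"
domain         = "sparql.glam-ontology.org"
ssh_public_key = "~/.ssh/id_ed25519.pub"
```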

3. Initialize and Apply

# Initialize Terraform
terraform init

# Preview changes
terraform plan

# Apply configuration
terraform apply

4. Deploy Data

After the infrastructure is created, deploy your ontology data:

# SSH into the server
ssh root@$(terraform output -raw server_ip)

# Load ontology files into Oxigraph
cd /var/lib/glam/scripts
./load-ontologies.sh

Configuration Files

File                      Description
main.tf                   Main infrastructure resources
variables.tf              Input variables
outputs.tf                Output values
cloud-init.yaml           Server initialization script
terraform.tfvars.example  Example variable values

Costs

Estimated monthly costs (EUR):

Resource      Cost
CX21 Server   ~€5.95/month
50GB Volume   ~€2.50/month
IPv4 Address  Included
Total         ~€8.45/month
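
As a quick sanity check, the line items do sum to the stated total:

```python
# Line items from the cost table above (EUR/month); IPv4 is included at no charge.
costs = {"CX21 Server": 5.95, "50GB Volume": 2.50, "IPv4 Address": 0.00}
total = round(sum(costs.values()), 2)  # 8.45
```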

Maintenance

Backup Data

# Backup Oxigraph database
ssh root@$SERVER_IP "tar -czf /tmp/oxigraph-backup.tar.gz /mnt/data/oxigraph"
scp root@$SERVER_IP:/tmp/oxigraph-backup.tar.gz ./backups/

Update Ontologies

# Copy new ontology files
scp -r schemas/20251121/rdf/*.ttl root@$SERVER_IP:/mnt/data/ontologies/

# Reload into Oxigraph
ssh root@$SERVER_IP "/var/lib/glam/scripts/load-ontologies.sh"

View Logs

ssh root@$SERVER_IP "journalctl -u oxigraph -f"
ssh root@$SERVER_IP "journalctl -u caddy -f"

CI/CD with GitHub Actions

The project includes automatic deployment via GitHub Actions. Changes to infrastructure, data, or frontend trigger automatic deployments.

Automatic Triggers

Deployments are triggered on pushes to the main branch when files change in:

  • infrastructure/** → Infrastructure deployment (Terraform)
  • schemas/20251121/rdf/** → Data sync
  • schemas/20251121/linkml/** → Data sync
  • data/ontology/** → Data sync
  • frontend/** → Frontend build & deploy
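
The routing above can be sketched as a small path-matching function; this is an illustration of the trigger logic, not the actual workflow configuration:

```python
from fnmatch import fnmatch

# Path-pattern -> deployment target, mirroring the trigger list above.
# fnmatch's "*" matches across "/" separators, unlike shell globbing.
TRIGGERS = [
    ("infrastructure/*", "infrastructure"),
    ("schemas/20251121/rdf/*", "data"),
    ("schemas/20251121/linkml/*", "data"),
    ("data/ontology/*", "data"),
    ("frontend/*", "frontend"),
]

def targets(changed_paths):
    """Return the set of deployments a push should trigger."""
    return {
        target
        for path in changed_paths
        for pattern, target in TRIGGERS
        if fnmatch(path, pattern)
    }
```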

Setup CI/CD

Run the setup script to generate SSH keys and get instructions:

./setup-cicd.sh

This will:

  1. Generate an ED25519 SSH key pair for deployments
  2. Create a .env file template (if missing)
  3. Print instructions for GitHub repository setup

Required GitHub Secrets

Go to Repository → Settings → Secrets and variables → Actions → Secrets:

Secret                  Description
HETZNER_HC_API_TOKEN    Hetzner Cloud API token (same as in .env)
DEPLOY_SSH_PRIVATE_KEY  Content of .ssh/glam_deploy_key (generated by setup script)
TF_API_TOKEN            (Optional) Terraform Cloud token for remote state

Required GitHub Variables

Go to Repository → Settings → Secrets and variables → Actions → Variables:

Variable     Example
GLAM_DOMAIN  sparql.glam-ontology.org
ADMIN_EMAIL  admin@example.org

Manual Deployment

You can manually trigger deployments from the GitHub Actions tab with options:

  • Deploy infrastructure changes - Run Terraform apply
  • Deploy ontology/schema data - Sync data files to server
  • Deploy frontend build - Build and deploy frontend
  • Reload data into Oxigraph - Reimport all RDF data

Local Deployment

For local deployments without GitHub Actions:

# Full deployment
./deploy.sh --all

# Just infrastructure
./deploy.sh --infra

# Just data
./deploy.sh --data

# Just frontend
./deploy.sh --frontend

# Just Qdrant vector database
./deploy.sh --qdrant

# Reload Oxigraph
./deploy.sh --reload

Qdrant Vector Database

Qdrant is a self-hosted vector database used for semantic search over heritage institution data. It enables RAG (Retrieval-Augmented Generation) in the DSPy SPARQL generation module.

Configuration

  • REST API: http://localhost:6333 (internal)
  • gRPC API: http://localhost:6334 (internal)
  • External Access: https://<domain>/qdrant/ (via Caddy reverse proxy)
  • Storage: /mnt/data/qdrant/storage/
  • Snapshots: /mnt/data/qdrant/snapshots/
  • Resource Limits: 2GB RAM, 2 CPUs

Deployment

# Deploy/restart Qdrant
./deploy.sh --qdrant

Health Check

# Check Qdrant health
ssh root@$SERVER_IP "curl http://localhost:6333/health"

# List collections
ssh root@$SERVER_IP "curl http://localhost:6333/collections"

DSPy Integration

The Qdrant retriever is integrated with DSPy for RAG-enhanced SPARQL query generation:

from glam_extractor.api.qdrant_retriever import HeritageCustodianRetriever

# Create retriever
retriever = HeritageCustodianRetriever(
    host="localhost",
    port=6333,
)

# Search for relevant institutions
results = retriever("museums in Amsterdam", k=5)

# Add institution to index
retriever.add_institution(
    name="Rijksmuseum",
    description="National museum of arts and history",
    institution_type="MUSEUM",
    country="NL",
    city="Amsterdam",
)

Environment Variables

Variable        Default    Description
QDRANT_HOST     localhost  Qdrant server hostname
QDRANT_PORT     6333       Qdrant REST API port
QDRANT_ENABLED  true       Enable/disable Qdrant integration
OPENAI_API_KEY  -          Required for embedding generation
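
These settings can be resolved in one place; a minimal sketch, where the helper name is illustrative rather than part of the glam_extractor API:

```python
import os

def qdrant_settings(env=None):
    """Resolve Qdrant-related settings from environment variables,
    falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    return {
        "host": env.get("QDRANT_HOST", "localhost"),
        "port": int(env.get("QDRANT_PORT", "6333")),
        "enabled": env.get("QDRANT_ENABLED", "true").lower() == "true",
        # No default; needed only when embeddings are generated.
        "openai_api_key": env.get("OPENAI_API_KEY"),
    }
```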

Destroy Infrastructure

terraform destroy

Warning: This will delete all data. Make sure to back up first!