GLAM Infrastructure as Code

This directory contains Terraform configuration for deploying the GLAM Heritage Custodian Ontology infrastructure to Hetzner Cloud.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Hetzner Cloud                        │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │            CX21 Server (2 vCPU, 4GB RAM)              │  │
│  │                                                       │  │
│  │  ┌──────────────────┐      ┌───────────────────────┐  │  │
│  │  │      Caddy       │      │       Oxigraph        │  │  │
│  │  │ (Reverse Proxy)  │─────▶│ (SPARQL Triplestore)  │  │  │
│  │  │   :443, :80      │      │        :7878          │  │  │
│  │  └────────┬─────────┘      └───────────┬───────────┘  │  │
│  │           │                            │              │  │
│  │           │                ┌───────────▼───────────┐  │  │
│  │           │                │  /mnt/data/oxigraph   │  │  │
│  │           │                │ (Persistent Storage)  │  │  │
│  │           │                └───────────────────────┘  │  │
│  │           │                                           │  │
│  │  ┌────────▼──────────────────────────────────────┐    │  │
│  │  │                Qdrant (Docker)                │    │  │
│  │  │            Vector Database for RAG            │    │  │
│  │  │          :6333 (REST), :6334 (gRPC)           │    │  │
│  │  └────────┬──────────────────────────────────────┘    │  │
│  │           │                                           │  │
│  │  ┌────────▼──────────────────────────────────────┐    │  │
│  │  │            Frontend (Static Files)            │    │  │
│  │  │            /var/www/glam-frontend             │    │  │
│  │  └───────────────────────────────────────────────┘    │  │
│  │                                                       │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Hetzner Volume (50GB SSD)               │  │
│  │               /mnt/data                               │  │
│  │  - Oxigraph database files                            │  │
│  │  - Qdrant storage & snapshots                         │  │
│  │  - Ontology files (.ttl, .rdf, .owl)                  │  │
│  │  - LinkML schemas (.yaml)                             │  │
│  │  - UML diagrams (.mmd)                                │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Components

  • Oxigraph: High-performance SPARQL triplestore for RDF data
  • Qdrant: Vector database for semantic search and RAG (Retrieval-Augmented Generation)
  • Caddy: Modern web server with automatic HTTPS
  • Hetzner Cloud Server: CX21 (2 vCPU, 4GB RAM, 40GB SSD)
  • Hetzner Volume: 50GB persistent storage for data

Prerequisites

  1. Terraform (>= 1.0)
  2. Hetzner Cloud Account
  3. Hetzner API Token

Setup

1. Create Hetzner API Token

  1. Log in to Hetzner Cloud Console
  2. Select your project (or create a new one)
  3. Go to Security > API Tokens
  4. Generate a new token with Read & Write permissions
  5. Copy the token (it's only shown once!)

2. Configure Terraform

cd infrastructure/terraform

# Copy the example variables file
cp terraform.tfvars.example terraform.tfvars

# Edit with your values
nano terraform.tfvars

Set the following variables:

  • hcloud_token: Your Hetzner API token
  • domain: Your domain name (e.g., sparql.glam-ontology.org)
  • ssh_public_key: Path to your SSH public key
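
As a minimal sketch, terraform.tfvars might look like the following; all three values are placeholders that must be replaced with your own:

```hcl
# Illustrative values only -- substitute your own token, domain, and key path.
hcloud_token   = "your-hetzner-api-token"
domain         = "sparql.glam-ontology.org"
ssh_public_key = "~/.ssh/id_ed25519.pub"
```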

3. Initialize and Apply

# Initialize Terraform
terraform init

# Preview changes
terraform plan

# Apply configuration
terraform apply

4. Deploy Data

After the infrastructure is created, deploy your ontology data:

# SSH into the server
ssh root@$(terraform output -raw server_ip)

# Load ontology files into Oxigraph
cd /var/lib/glam/scripts
./load-ontologies.sh

Configuration Files

File                      Description
main.tf                   Main infrastructure resources
variables.tf              Input variables
outputs.tf                Output values
cloud-init.yaml           Server initialization script
terraform.tfvars.example  Example variable values

Costs

Estimated monthly costs (EUR):

Resource      Cost
CX21 Server   ~€5.95/month
50GB Volume   ~€2.50/month
IPv4 Address  Included
Total         ~€8.45/month
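
As a quick sanity check, the line items do sum to the stated total:

```python
# Line items from the cost table above (EUR/month); IPv4 is included at no charge.
costs = {"CX21 Server": 5.95, "50GB Volume": 2.50, "IPv4 Address": 0.00}
total = round(sum(costs.values()), 2)  # 8.45
```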

Maintenance

Backup Data

# Backup Oxigraph database
ssh root@$SERVER_IP "tar -czf /tmp/oxigraph-backup.tar.gz /mnt/data/oxigraph"
scp root@$SERVER_IP:/tmp/oxigraph-backup.tar.gz ./backups/

Update Ontologies

# Copy new ontology files
scp -r schemas/20251121/rdf/*.ttl root@$SERVER_IP:/mnt/data/ontologies/

# Reload into Oxigraph
ssh root@$SERVER_IP "/var/lib/glam/scripts/load-ontologies.sh"

View Logs

ssh root@$SERVER_IP "journalctl -u oxigraph -f"
ssh root@$SERVER_IP "journalctl -u caddy -f"

CI/CD with GitHub Actions

The project includes automatic deployment via GitHub Actions. Changes to infrastructure, data, or frontend trigger automatic deployments.

Automatic Triggers

Deployments are triggered on pushes to the main branch when files change in:

  • infrastructure/** → Infrastructure deployment (Terraform)
  • schemas/20251121/rdf/** → Data sync
  • schemas/20251121/linkml/** → Data sync
  • data/ontology/** → Data sync
  • frontend/** → Frontend build & deploy
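
The routing above can be sketched as a small path-matching function; this is an illustration of the trigger logic, not the actual workflow configuration:

```python
from fnmatch import fnmatch

# Path-pattern -> deployment target, mirroring the trigger list above.
# fnmatch's "*" matches across "/" separators, unlike shell globbing.
TRIGGERS = [
    ("infrastructure/*", "infrastructure"),
    ("schemas/20251121/rdf/*", "data"),
    ("schemas/20251121/linkml/*", "data"),
    ("data/ontology/*", "data"),
    ("frontend/*", "frontend"),
]

def targets(changed_paths):
    """Return the set of deployments a push should trigger."""
    return {
        target
        for path in changed_paths
        for pattern, target in TRIGGERS
        if fnmatch(path, pattern)
    }
```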

Setup CI/CD

Run the setup script to generate SSH keys and get instructions:

./setup-cicd.sh

This will:

  1. Generate an ED25519 SSH key pair for deployments
  2. Create a .env file template (if missing)
  3. Print instructions for GitHub repository setup

Required GitHub Secrets

Go to Repository → Settings → Secrets and variables → Actions → Secrets:

Secret                  Description
HETZNER_HC_API_TOKEN    Hetzner Cloud API token (same as in .env)
DEPLOY_SSH_PRIVATE_KEY  Content of .ssh/glam_deploy_key (generated by setup script)
TF_API_TOKEN            (Optional) Terraform Cloud token for remote state

Required GitHub Variables

Go to Repository → Settings → Secrets and variables → Actions → Variables:

Variable     Example
GLAM_DOMAIN  sparql.glam-ontology.org
ADMIN_EMAIL  admin@example.org

Manual Deployment

You can manually trigger deployments from the GitHub Actions tab with options:

  • Deploy infrastructure changes - Run Terraform apply
  • Deploy ontology/schema data - Sync data files to server
  • Deploy frontend build - Build and deploy frontend
  • Reload data into Oxigraph - Reimport all RDF data

Local Deployment

For local deployments without GitHub Actions:

# Full deployment
./deploy.sh --all

# Just infrastructure
./deploy.sh --infra

# Just data
./deploy.sh --data

# Just frontend
./deploy.sh --frontend

# Just Qdrant vector database
./deploy.sh --qdrant

# Reload Oxigraph
./deploy.sh --reload

Qdrant Vector Database

Qdrant is a self-hosted vector database used for semantic search over heritage institution data. It enables RAG (Retrieval-Augmented Generation) in the DSPy SPARQL generation module.

Configuration

  • REST API: http://localhost:6333 (internal)
  • gRPC API: http://localhost:6334 (internal)
  • External Access: https://<domain>/qdrant/ (via Caddy reverse proxy)
  • Storage: /mnt/data/qdrant/storage/
  • Snapshots: /mnt/data/qdrant/snapshots/
  • Resource Limits: 2GB RAM, 2 CPUs

Deployment

# Deploy/restart Qdrant
./deploy.sh --qdrant

Health Check

# Check Qdrant health
ssh root@$SERVER_IP "curl http://localhost:6333/health"

# List collections
ssh root@$SERVER_IP "curl http://localhost:6333/collections"

DSPy Integration

The Qdrant retriever is integrated with DSPy for RAG-enhanced SPARQL query generation:

from glam_extractor.api.qdrant_retriever import HeritageCustodianRetriever

# Create retriever
retriever = HeritageCustodianRetriever(
    host="localhost",
    port=6333,
)

# Search for relevant institutions
results = retriever("museums in Amsterdam", k=5)

# Add institution to index
retriever.add_institution(
    name="Rijksmuseum",
    description="National museum of arts and history",
    institution_type="MUSEUM",
    country="NL",
    city="Amsterdam",
)

Environment Variables

Variable        Default    Description
QDRANT_HOST     localhost  Qdrant server hostname
QDRANT_PORT     6333       Qdrant REST API port
QDRANT_ENABLED  true       Enable/disable Qdrant integration
OPENAI_API_KEY  -          Required for embedding generation
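
These settings can be resolved in one place; a minimal sketch, where the helper name is illustrative rather than part of the glam_extractor API:

```python
import os

def qdrant_settings(env=None):
    """Resolve Qdrant-related settings from environment variables,
    falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    return {
        "host": env.get("QDRANT_HOST", "localhost"),
        "port": int(env.get("QDRANT_PORT", "6333")),
        "enabled": env.get("QDRANT_ENABLED", "true").lower() == "true",
        # No default; needed only when embeddings are generated.
        "openai_api_key": env.get("OPENAI_API_KEY"),
    }
```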

Destroy Infrastructure

terraform destroy

Warning: This will delete all data. Make sure to back up first!