
# Triplestore Setup: Oxigraph for RDF Visualizer
**Date**: 2025-11-22

**Status**: Planning / Not Yet Implemented

**Priority**: Phase 3 - Task 6 & 7 (Query Builder + SPARQL Execution)

---
## Overview
The GLAM project will use **Oxigraph** as the RDF triplestore for the frontend visualization application. Oxigraph was explicitly chosen in the project planning documents (September 2025) as part of the knowledge graph infrastructure.

---
## Why Oxigraph?
### Selected Benefits
From the project documentation (ontology conversation `2025-09-09`):
> **Phase 4 - Knowledge Graph Infrastructure (120 hours):**
> - TypeDB hypergraph database
> - Oxigraph RDF triple store
### Advantages for This Project
1. **Lightweight & Embeddable**
   - Can run in the browser alongside the frontend (via WASM) or as a separate server
   - Minimal setup compared to Virtuoso, Blazegraph, or GraphDB
   - Well suited to prototype/demonstration use cases
2. **SPARQL 1.1 Compliant**
   - Full SPARQL query support (SELECT, CONSTRUCT, ASK, DESCRIBE)
   - Standards-compliant implementation
   - Reads and writes the common RDF serializations (Turtle, N-Triples, N-Quads, TriG, RDF/XML); JSON-LD support depends on the Oxigraph version
3. **Active Development**
   - Modern Rust implementation (high performance, memory safety)
   - GitHub: https://github.com/oxigraph/oxigraph
   - Actively maintained (2018-present)
4. **Multiple Deployment Options**
   - **Server mode**: HTTP API for remote queries
   - **Library mode**: Embedded in Python/Rust applications
   - **WASM mode**: In-browser triplestore (experimental)
5. **Cultural Heritage Sector Adoption**
   - Used in European cultural heritage projects
   - Recommended for prototype/exploratory projects (530-hour estimate research)
   - Proven for organizational data modeling
---
## Current Status
### Backend (Python)
The Python backend includes RDF processing dependencies:
```toml
# pyproject.toml - RDF and semantic web dependencies
rdflib = "^7.0.0" # RDF parsing/serialization
SPARQLWrapper = "^2.0.0" # SPARQL query execution
```
**Status**: ✅ Dependencies installed

**Usage**: Backend can parse RDF files and generate SPARQL queries

### Frontend (TypeScript/React)

**Status**: ❌ Not yet integrated

**Todo**: Add Oxigraph client or SPARQL HTTP client

---
## Architecture Options
### Option 1: Oxigraph Server (Recommended)
**Setup**:
```bash
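# Install Oxigraph server (recent releases publish it as the `oxigraph-cli` crate,
# which installs an `oxigraph` binary; older releases used `oxigraph_server`)
cargo install oxigraph-cli
# Or use Docker
docker pull oxigraph/oxigraph
# Start server with persistent storage (check `oxigraph --help` for your version)
oxigraph serve --location ./data/oxigraph --bind 127.0.0.1:7878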
```
**Frontend Integration**:
```typescript
// src/lib/sparql/oxigraph-client.ts
export async function querySparql(query: string): Promise<any> {
  const response = await fetch('http://localhost:7878/query', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/sparql-query',
      'Accept': 'application/sparql-results+json',
    },
    body: query,
  });
  return response.json();
}
```
**Pros**:
- ✅ Separate process (doesn't block frontend)
- ✅ Standard HTTP API (easy to integrate)
- ✅ Supports large datasets
- ✅ Can be shared across multiple clients
**Cons**:
- ⚠️ Requires running separate server process
- ⚠️ Adds deployment complexity
---
### Option 2: Oxigraph WASM (Experimental)
**Setup**:
```bash
# Install Oxigraph WASM package
npm install oxigraph
```
**Frontend Integration**:
```typescript
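// src/lib/sparql/oxigraph-wasm.ts
import { Store } from 'oxigraph';

export class OxigraphStore {
  private store: Store;

  constructor() {
    this.store = new Store();
  }

  // `format` is a MIME type such as 'text/turtle'.
  // NOTE: the positional arguments below follow the older Oxigraph JS API;
  // newer releases take an options object, so check the installed version.
  async loadRDF(rdfData: string, format: string) {
    await this.store.load(rdfData, format, null, null);
  }

  async query(sparql: string) {
    return this.store.query(sparql);
  }
}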
```
**Pros**:
- ✅ No server required (runs in browser)
- ✅ Zero configuration
- ✅ Perfect for demos and prototypes
- ✅ Offline-capable
**Cons**:
- ⚠️ Experimental (WASM support still evolving)
- ⚠️ Limited to browser memory (dataset size constraints)
- ⚠️ May have performance limitations
---
### Option 3: Backend Proxy (Hybrid)
**Setup**:
- Backend runs Oxigraph server
- Frontend queries backend API
- Backend proxies to Oxigraph
**Frontend Integration**:
```typescript
// src/lib/sparql/backend-proxy.ts
export async function querySparql(query: string): Promise<any> {
  const response = await fetch('/api/sparql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }),
  });
  return response.json();
}
```
**Pros**:
- ✅ Backend controls triplestore lifecycle
- ✅ Can add authentication/authorization
- ✅ Can preprocess/validate queries
- ✅ Hides triplestore implementation details
**Cons**:
- ⚠️ Requires backend development
- ⚠️ More complex architecture
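
As a rough illustration of the "preprocess/validate queries" point, a proxy could check the query form before forwarding anything to the triplestore. The sketch below is an assumption, not existing project code: `detectQueryForm` and `QueryForm` are hypothetical names, and the regular expressions only skip leading comments and `PREFIX`/`BASE` declarations rather than parsing SPARQL properly. It is written in TypeScript to match the frontend examples; a Python backend would need an equivalent check.

```typescript
// Hypothetical helper: classify a SPARQL string by its query form so a proxy
// can reject anything that is not a read-only query. Not a full SPARQL parser.
type QueryForm = 'SELECT' | 'ASK' | 'CONSTRUCT' | 'DESCRIBE';

function detectQueryForm(query: string): QueryForm | null {
  // Matches one leading comment line, PREFIX declaration, or BASE declaration.
  const prologue = /^\s*(?:#[^\n]*(?:\n|$)|PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s+<[^>]*>)/i;
  let rest = query;
  let match = prologue.exec(rest);
  // Strip prologue items one at a time until the query form keyword is first.
  while (match !== null && match[0].length > 0) {
    rest = rest.slice(match[0].length);
    match = prologue.exec(rest);
  }
  const form = /^\s*(SELECT|ASK|CONSTRUCT|DESCRIBE)\b/i.exec(rest);
  return form ? (form[1].toUpperCase() as QueryForm) : null;
}

// Example: only forward queries whose form is recognised.
const candidate = 'PREFIX schema: <http://schema.org/>\nSELECT ?s WHERE { ?s a schema:Museum }';
const candidateForm = detectQueryForm(candidate);
console.log(candidateForm === null ? 'rejected' : `forwarding ${candidateForm} query`);
```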
---
## Recommendation
**For Phase 3 (Task 6 & 7)**: Use **Option 1 (Oxigraph Server)**
### Rationale
1. **Proven Approach**: The server exposes the standard SPARQL 1.1 Protocol over HTTP, which matches the project architecture research
2. **Scalable**: Can handle Denmark dataset (43,429 triples) and beyond
3. **Simple Development**: Frontend can focus on SPARQL query building, not triplestore management
4. **Future-Proof**: Easy to swap for other triplestores (Virtuoso, Blazegraph) later if needed
5. **Docker-Ready**: Can be containerized for production deployment
---
## Implementation Plan
### Phase 3 - Task 6: Query Builder
**Goal**: Visual SPARQL query builder UI
**Deliverables**:
1. Query builder component with visual interface
2. Subject-Predicate-Object pattern builder
3. Filter conditions UI
4. SPARQL syntax preview (live syntax highlighting)
5. Query validation (syntax checking)
6. Query templates library (pre-built queries)
**Oxigraph Integration**:
- NOT REQUIRED yet
- Query builder generates SPARQL strings only
- Can test queries against static RDF files using `rdflib` (Python backend)
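
To make the "generates SPARQL strings only" idea concrete, here is a minimal sketch of how the pattern builder might turn structured Subject-Predicate-Object input into a query string. All names (`TriplePattern`, `SelectQuerySpec`, `buildSelectQuery`) are hypothetical and chosen for illustration; the real component may model queries differently, and this sketch does no escaping or validation of the terms it is given.

```typescript
// Hypothetical types for illustration; none of these names exist in the codebase yet.
interface TriplePattern {
  subject: string;   // e.g. '?museum' or '<http://example.org/inst1>'
  predicate: string; // e.g. 'a' or 'schema:name'
  object: string;    // e.g. 'schema:Museum', '?name', or '"literal"'
}

interface SelectQuerySpec {
  prefixes: Record<string, string>; // prefix label -> namespace IRI
  variables: string[];              // variable names without the leading '?'
  patterns: TriplePattern[];
  filters?: string[];               // raw FILTER expressions
  limit?: number;
}

// Assemble a SPARQL SELECT string from the structured spec.
function buildSelectQuery(spec: SelectQuerySpec): string {
  const prefixLines = Object.keys(spec.prefixes).map(
    (label) => `PREFIX ${label}: <${spec.prefixes[label]}>`,
  );
  // An empty variable list projects everything.
  const projection =
    spec.variables.length > 0 ? spec.variables.map((v) => `?${v}`).join(' ') : '*';
  const patternLines = spec.patterns.map(
    (p) => `  ${p.subject} ${p.predicate} ${p.object} .`,
  );
  const filterLines = (spec.filters || []).map((f) => `  FILTER(${f})`);
  const lines = prefixLines.concat(
    [`SELECT ${projection} WHERE {`],
    patternLines,
    filterLines,
    ['}'],
  );
  if (spec.limit !== undefined) {
    lines.push(`LIMIT ${spec.limit}`);
  }
  return lines.join('\n');
}

// Usage: two patterns selecting museums and their names.
const museumQuery = buildSelectQuery({
  prefixes: { schema: 'http://schema.org/' },
  variables: ['museum', 'name'],
  patterns: [
    { subject: '?museum', predicate: 'a', object: 'schema:Museum' },
    { subject: '?museum', predicate: 'schema:name', object: '?name' },
  ],
  limit: 100,
});
console.log(museumQuery);
```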
**Estimated Time**: 4-5 hours

---
### Phase 3 - Task 7: SPARQL Query Execution
**Goal**: Execute SPARQL queries against loaded RDF data
**Deliverables**:
1. Oxigraph server setup and configuration
2. SPARQL HTTP client (`src/lib/sparql/oxigraph-client.ts`)
3. Query execution hook (`src/hooks/useSparqlQuery.ts`)
4. Results viewer component (table, JSON, graph views)
5. Export results (CSV, JSON, RDF)
6. Query performance metrics (execution time, result count)
**Oxigraph Integration Steps**:
#### Step 1: Install Oxigraph Server
```bash
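# Via Cargo (recent releases publish the server as `oxigraph-cli`, which installs
# an `oxigraph` binary; older releases used the `oxigraph_server` crate)
cargo install oxigraph-cli
# OR via Docker
docker pull oxigraph/oxigraph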
```
#### Step 2: Start Oxigraph with Test Data
```bash
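# Start the server with persistent storage
# (recent releases use `oxigraph serve`; check `--help` for your version)
oxigraph serve \
  --location ./data/oxigraph \
  --bind 127.0.0.1:7878
# In another terminal, load the Denmark dataset (43,429 triples).
# `?default` targets the default graph, which the sample queries read from.
curl -X POST \
  -H 'Content-Type: text/turtle' \
  --data-binary @data/rdf/denmark_complete.ttl \
  'http://localhost:7878/store?default'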
```
#### Step 3: Create SPARQL Client Module
```typescript
// src/lib/sparql/oxigraph-client.ts
export interface SparqlResult {
  head: { vars: string[] };
  results: {
    bindings: Array<Record<string, { type: string; value: string }>>;
  };
}

export async function executeSparql(query: string): Promise<SparqlResult> {
  const response = await fetch('http://localhost:7878/query', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/sparql-query',
      'Accept': 'application/sparql-results+json',
    },
    body: query,
  });
  if (!response.ok) {
    throw new Error(`SPARQL query failed: ${response.statusText}`);
  }
  return response.json();
}
```
#### Step 4: Create Query Execution Hook
```typescript
// src/hooks/useSparqlQuery.ts
import { useState, useCallback } from 'react';
import { executeSparql, type SparqlResult } from '../lib/sparql/oxigraph-client';

export function useSparqlQuery() {
  const [results, setResults] = useState<SparqlResult | null>(null);
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const [executionTime, setExecutionTime] = useState<number>(0);

  const executeQuery = useCallback(async (query: string) => {
    setIsLoading(true);
    setError(null);
    const startTime = performance.now();
    try {
      const result = await executeSparql(query);
      setResults(result);
      setExecutionTime(performance.now() - startTime);
    } catch (err) {
      setError(err instanceof Error ? err.message : 'Query execution failed');
      setResults(null);
    } finally {
      setIsLoading(false);
    }
  }, []);

  return { results, isLoading, error, executionTime, executeQuery };
}
```
#### Step 5: Create Results Viewer Component
```typescript
// src/components/query/ResultsViewer.tsx
import type { SparqlResult } from '../../lib/sparql/oxigraph-client';

interface ResultsViewerProps {
  results: SparqlResult;
  executionTime: number;
}

export function ResultsViewer({ results, executionTime }: ResultsViewerProps) {
  const { head, results: data } = results;
  const bindings = data.bindings;
  return (
    <div className="results-viewer">
      <div className="results-header">
        <span>{bindings.length} results</span>
        <span>Execution time: {executionTime.toFixed(2)}ms</span>
      </div>
      <table className="results-table">
        <thead>
          <tr>
            {head.vars.map(variable => (
              <th key={variable}>{variable}</th>
            ))}
          </tr>
        </thead>
        <tbody>
          {bindings.map((binding, index) => (
            <tr key={index}>
              {head.vars.map(variable => (
                <td key={variable}>
                  {binding[variable]?.value || '-'}
                </td>
              ))}
            </tr>
          ))}
        </tbody>
      </table>
    </div>
  );
}
```
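
The "Export results (CSV, JSON, RDF)" deliverable has no example above, so here is a minimal sketch of the CSV case. The `sparqlResultToCsv` and `escapeCsvField` names are hypothetical, and the `SparqlResult` shape is repeated from Step 3 only so the block is self-contained. Quoting follows the usual CSV convention (fields containing commas, quotes, or line breaks are wrapped in double quotes, with embedded quotes doubled); unbound variables become empty fields.

```typescript
// Same shape as the SparqlResult interface in Step 3, repeated for self-containment.
interface SparqlResult {
  head: { vars: string[] };
  results: {
    bindings: Array<Record<string, { type: string; value: string }>>;
  };
}

// Quote a field only when it contains a comma, a double quote, or a line break.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Hypothetical export helper: one header row, then one row per binding.
function sparqlResultToCsv(result: SparqlResult): string {
  const vars = result.head.vars;
  const header = vars.map(escapeCsvField).join(',');
  const rows = result.results.bindings.map((binding) =>
    vars
      .map((v) => {
        const term = binding[v];
        // Unbound variables are simply absent from the binding object.
        return escapeCsvField(term ? term.value : '');
      })
      .join(','),
  );
  return [header].concat(rows).join('\r\n');
}

// Usage with a small hand-written result set (example data, not real output).
const sampleCsv = sparqlResultToCsv({
  head: { vars: ['museum', 'name'] },
  results: {
    bindings: [
      {
        museum: { type: 'uri', value: 'http://example.org/m1' },
        name: { type: 'literal', value: 'Museum "A", Copenhagen' },
      },
      { museum: { type: 'uri', value: 'http://example.org/m2' } },
    ],
  },
});
console.log(sampleCsv);
```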
**Estimated Time**: 6-8 hours

---
## Testing Strategy
### Unit Tests
```typescript
// tests/unit/sparql-client.test.ts
import { describe, it, expect, vi } from 'vitest';
import { executeSparql } from '../../src/lib/sparql/oxigraph-client';

describe('executeSparql', () => {
  it('should execute SELECT query and return results', async () => {
    global.fetch = vi.fn().mockResolvedValue({
      ok: true,
      json: async () => ({
        head: { vars: ['subject'] },
        results: {
          bindings: [
            { subject: { type: 'uri', value: 'http://example.org/inst1' } },
          ],
        },
      }),
    });
    const query = 'SELECT ?subject WHERE { ?subject a schema:Museum }';
    const results = await executeSparql(query);
    expect(results.results.bindings).toHaveLength(1);
  });

  it('should throw error on failed query', async () => {
    global.fetch = vi.fn().mockResolvedValue({
      ok: false,
      statusText: 'Bad Request',
    });
    const query = 'INVALID SPARQL';
    await expect(executeSparql(query)).rejects.toThrow('SPARQL query failed');
  });
});
```
### Integration Tests
```typescript
// tests/integration/oxigraph.test.ts
import { describe, it, expect, beforeAll } from 'vitest';
import { executeSparql } from '../../src/lib/sparql/oxigraph-client';

describe('Oxigraph Integration', () => {
  beforeAll(async () => {
    // Assumes Oxigraph server is running on localhost:7878
    // with test data loaded
  });

  it('should query loaded RDF data', async () => {
    const query = `
      PREFIX schema: <http://schema.org/>
      SELECT ?museum WHERE {
        ?museum a schema:Museum .
      } LIMIT 10
    `;
    const results = await executeSparql(query);
    expect(results.results.bindings.length).toBeGreaterThan(0);
  });
});
```
---
## Sample SPARQL Queries
### Query 1: Find All Museums
```sparql
PREFIX schema: <http://schema.org/>
SELECT ?museum ?name WHERE {
  ?museum a schema:Museum .
  ?museum schema:name ?name .
}
ORDER BY ?name
LIMIT 100
```
### Query 2: Count Institutions by Type
```sparql
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>
SELECT ?type (COUNT(?institution) AS ?count) WHERE {
  ?institution a ?type .
  FILTER(?type IN (schema:Museum, schema:Library, schema:ArchiveOrganization))
}
GROUP BY ?type
ORDER BY DESC(?count)
```
### Query 3: Find Institutions in a City
```sparql
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name ?address WHERE {
  ?institution schema:name ?name .
  ?institution schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
ORDER BY ?name
```
### Query 4: Find Institutions with Wikidata Links
```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name ?wikidataURI WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
}
LIMIT 100
```
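
Aggregate queries such as Query 2 return typed literals in the SPARQL JSON results format: a `COUNT` comes back as a string value with an `xsd:integer` datatype. The sketch below shows one way the frontend might flatten bindings into plain rows and convert numeric literals along the way. `SparqlTerm`, `termToValue`, and `bindingsToRows` are hypothetical names, and the `datatype` field is an addition to the simpler binding shape used elsewhere in this document.

```typescript
// Hypothetical term shape: the SPARQL JSON results format adds `datatype`
// to typed literals, which the simpler interface in Step 3 omits.
interface SparqlTerm {
  type: string;
  value: string;
  datatype?: string;
}

type ResultRow = Record<string, string | number | null>;

const XSD_NS = 'http://www.w3.org/2001/XMLSchema#';
const NUMERIC_DATATYPES = [
  `${XSD_NS}integer`,
  `${XSD_NS}decimal`,
  `${XSD_NS}double`,
  `${XSD_NS}float`,
];

// Convert one term: numbers for numeric literals, null for unbound variables.
function termToValue(term: SparqlTerm | undefined): string | number | null {
  if (term === undefined) {
    return null;
  }
  if (term.datatype !== undefined && NUMERIC_DATATYPES.indexOf(term.datatype) !== -1) {
    return Number(term.value);
  }
  return term.value;
}

// Flatten bindings into plain objects keyed by variable name.
function bindingsToRows(
  vars: string[],
  bindings: Array<Record<string, SparqlTerm>>,
): ResultRow[] {
  return bindings.map((binding) => {
    const row: ResultRow = {};
    vars.forEach((variable) => {
      row[variable] = termToValue(binding[variable]);
    });
    return row;
  });
}

// Usage with a hand-written binding (example data, not a real query result).
const countRows = bindingsToRows(
  ['type', 'count', 'label'],
  [
    {
      type: { type: 'uri', value: 'http://schema.org/Museum' },
      count: { type: 'literal', value: '42', datatype: `${XSD_NS}integer` },
    },
  ],
);
console.log(countRows);
```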
---
## Configuration
### Oxigraph Server Config
```bash
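# Start Oxigraph server with configuration.
# Comments cannot follow a line-continuation backslash, so flags are described here
# (names are for recent releases; `oxigraph serve --help` lists what your version accepts):
#   --location  data directory
#   --bind      bind address
#   --cors      enable CORS for frontend development
# `serve` allows data loading; `serve-read-only` is the read-only variant.
oxigraph serve \
  --location ./data/oxigraph \
  --bind 0.0.0.0:7878 \
  --cors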
```
### Environment Variables
```bash
# .env.local (frontend)
VITE_SPARQL_ENDPOINT=http://localhost:7878
VITE_SPARQL_QUERY_TIMEOUT=30000 # 30 seconds
```
### TypeScript Config
```typescript
// src/config/sparql.ts
export const SPARQL_CONFIG = {
  endpoint: import.meta.env.VITE_SPARQL_ENDPOINT || 'http://localhost:7878',
  timeout: Number(import.meta.env.VITE_SPARQL_QUERY_TIMEOUT) || 30000,
  corsEnabled: true,
};
```
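
The `timeout` setting only has an effect if the client enforces it. One possible way to do that, sketched here under assumptions, is to abort the `fetch` call with an `AbortController` once the timeout elapses. `executeSparqlWithTimeout` is a hypothetical function, the two constants stand in for `SPARQL_CONFIG` so the block runs on its own, and the injectable `fetchImpl` parameter exists only to make the sketch easy to exercise without a server.

```typescript
// Stand-ins for SPARQL_CONFIG.endpoint and SPARQL_CONFIG.timeout (assumed values).
const SPARQL_ENDPOINT = 'http://localhost:7878';
const DEFAULT_TIMEOUT_MS = 30000;

// Hypothetical client variant that aborts the request after `timeoutMs`.
async function executeSparqlWithTimeout(
  query: string,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
  fetchImpl: typeof fetch = fetch,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(`${SPARQL_ENDPOINT}/query`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/sparql-query',
        'Accept': 'application/sparql-results+json',
      },
      body: query,
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`SPARQL query failed: ${response.status} ${response.statusText}`);
    }
    return await response.json();
  } catch (err) {
    // Report a timeout distinctly from network or HTTP errors.
    if (controller.signal.aborted) {
      throw new Error(`SPARQL query timed out after ${timeoutMs}ms`);
    }
    throw err;
  } finally {
    // Always clear the timer so it cannot fire after the request settles.
    clearTimeout(timer);
  }
}
```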
---
## Performance Considerations
### Dataset Size
| Dataset | Triples | Expected Query Time (estimate) | Notes |
|---------|---------|--------------------------------|-------|
| Denmark (Current) | 43,429 | <100ms | Fast for prototyping |
| Netherlands (Estimated) | ~500,000 | <500ms | Should be manageable |
| Global (Future) | 5-10M | 1-5s | May need optimization |
### Optimization Strategies
1. **Indexing**: Oxigraph automatically indexes triples (no manual configuration)
2. **Query Limits**: Always use `LIMIT` clauses during development
3. **Caching**: Cache frequent query results in frontend (localStorage)
4. **Pagination**: Implement OFFSET/LIMIT for large result sets
5. **Prefixes**: Use PREFIX declarations to reduce query size
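
As an illustration of the pagination strategy, the sketch below rewrites a query string so that each page is fetched with its own `LIMIT`/`OFFSET`. `paginateQuery` is a hypothetical helper; it only strips `LIMIT`/`OFFSET` clauses at the very end of the query and does not parse SPARQL. Stable paging also assumes the query has an `ORDER BY`, since SPARQL does not otherwise guarantee result order.

```typescript
// Hypothetical helper: replace any trailing LIMIT/OFFSET with page-specific values.
// `page` is zero-based.
function paginateQuery(query: string, page: number, pageSize: number): string {
  if (Math.floor(page) !== page || page < 0) {
    throw new RangeError('page must be a non-negative integer');
  }
  if (Math.floor(pageSize) !== pageSize || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  // Remove LIMIT/OFFSET clauses that appear at the end of the query only,
  // so clauses inside subqueries are left untouched.
  const base = query.trim().replace(/(?:\s+(?:LIMIT|OFFSET)\s+\d+)+$/i, '');
  return `${base}\nLIMIT ${pageSize}\nOFFSET ${page * pageSize}`;
}

// Usage: third page (index 2) of 25 results each.
const thirdPage = paginateQuery(
  'SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s LIMIT 100',
  2,
  25,
);
console.log(thirdPage);
```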
---
## Deployment
### Development
```bash
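# Start Oxigraph locally
# (recent releases use `oxigraph serve`; older ones shipped an `oxigraph_server` binary)
oxigraph serve --location ./data/oxigraph --bind 127.0.0.1:7878
# Load test data into the default graph
curl -X POST \
  -H 'Content-Type: text/turtle' \
  --data-binary @data/rdf/denmark_complete.ttl \
  'http://localhost:7878/store?default'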
```
### Docker (Production)
```dockerfile
# Dockerfile.oxigraph
FROM oxigraph/oxigraph:latest
# Copy RDF data to container
COPY data/rdf /data/rdf
# Expose SPARQL endpoint
EXPOSE 7878
# Start server with persistent storage
# (arguments go to the image's entrypoint; recent releases need the `serve` subcommand)
CMD ["serve", "--location", "/data/oxigraph", "--bind", "0.0.0.0:7878"]
```
```yaml
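# docker-compose.yml
version: '3.8'
services:
  oxigraph:
    image: oxigraph/oxigraph:latest
    ports:
      - "7878:7878"
    volumes:
      - ./data/oxigraph:/data/oxigraph
      - ./data/rdf:/data/rdf:ro
    # Recent Oxigraph images expect the `serve` subcommand; `--cors` takes no value
    command: serve --location /data/oxigraph --bind 0.0.0.0:7878 --cors
  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
    environment:
      # Queries are sent from the browser, which cannot resolve the `oxigraph`
      # service name, so the endpoint points at the published host port
      - VITE_SPARQL_ENDPOINT=http://localhost:7878
    depends_on:
      - oxigraph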
```
---
## Alternatives Considered
### Virtuoso
- **Pros**: Very mature, enterprise-grade, excellent performance
- **Cons**: Complex setup, heavyweight (2GB+ memory), overkill for prototype
### Blazegraph
- **Pros**: Full SPARQL 1.1, good documentation
- **Cons**: Java dependency, discontinued (last release 2019)
### Apache Jena Fuseki
- **Pros**: Mature, full-featured, active development
- **Cons**: Java dependency, more complex than Oxigraph
### GraphDB
- **Pros**: Commercial support, advanced reasoning, SHACL validation
- **Cons**: Proprietary (free edition has limits), complex setup
**Decision**: Oxigraph wins for simplicity, its modern stack (Rust), and cultural heritage sector adoption.

---
## Next Steps
### Immediate (Task 6 - Query Builder)
1. Document Oxigraph decision (this file)
2. Implement visual query builder UI (4-5 hours)
3. Create SPARQL syntax validator
4. Build query templates library
### Next Session (Task 7 - SPARQL Execution)
1. Install Oxigraph server (local + Docker)
2. Load Denmark dataset (43,429 triples)
3. Create SPARQL client module
4. Implement query execution hook
5. Build results viewer component
6. Add export functionality (CSV, JSON, RDF)
7. Write integration tests
---
## References
### Oxigraph Documentation
- **GitHub**: https://github.com/oxigraph/oxigraph
- **Architecture**: https://github.com/oxigraph/oxigraph/wiki/Architecture
- **HTTP API**: https://github.com/oxigraph/oxigraph/wiki/HTTP-API
### Project Documentation
- **Planning**: `ontology/2025-09-09T08-31-07-*-Linked_Data_Cultural_Heritage_Project.json`
- **RDF Datasets**: `data/rdf/README.md`
- **Schema**: `schemas/20251121/rdf/` (8 RDF formats)
### SPARQL Resources
- **W3C SPARQL 1.1**: https://www.w3.org/TR/sparql11-query/
- **SPARQL by Example**: https://www.w3.org/2009/Talks/0615-qbe/
- **RDF/SPARQL Tutorial**: https://www.linkeddatatools.com/querying-semantic-data
---
**Status**: Planning Complete

**Next**: Implement Query Builder (Phase 3 - Task 6)

**Estimated Timeline**: 10-13 hours total (Query Builder 4-5h + SPARQL Execution 6-8h)

**Last Updated**: 2025-11-22

**Author**: OpenCode AI Agent