
Heritage Institution RDF Exports

This directory contains Linked Open Data exports of heritage institution datasets in W3C-compliant RDF formats.

Available Datasets

Denmark 🇩🇰 - COMPLETE (November 2025)

Dataset: denmark_complete.*
Status: Production-ready
Last Updated: 2025-11-19

| Format    | File                    | File Size | Use Case                            |
|-----------|-------------------------|-----------|-------------------------------------|
| Turtle    | denmark_complete.ttl    | 2.27 MB   | Human-readable, SPARQL queries      |
| RDF/XML   | denmark_complete.rdf    | 3.96 MB   | Machine processing, legacy systems  |
| JSON-LD   | denmark_complete.jsonld | 5.16 MB   | Web APIs, JavaScript applications   |
| N-Triples | denmark_complete.nt     | 6.24 MB   | Line-oriented processing, MapReduce |

Statistics

  • Institutions: 2,348 (555 libraries, 594 archives, 1,199 branches)
  • RDF Triples: 43,429
  • Ontologies Used: 9 (CPOV, Schema.org, RICO, ORG, PROV-O, SKOS, Dublin Core, OWL, Heritage)
  • Wikidata Links: 769 institutions (32.8%)
  • ISIL Codes: 555 institutions (23.6%)
  • GHCID Identifiers: 998 institutions (42.5%)

Coverage by Institution Type

| Type             | Count | ISIL           | GHCID          | Wikidata |
|------------------|-------|----------------|----------------|----------|
| Main Libraries   | 555   | 100%           | 78%            | High     |
| Archives         | 594   | 0% (by design) | 95%            | Moderate |
| Library Branches | 1,199 | Inherited      | 0% (by design) | Low      |

Ontology Alignment

All RDF exports follow these international standards:

Core Ontologies

  1. CPOV (Core Public Organisation Vocabulary)

    • Namespace: http://data.europa.eu/m8g/
    • Usage: Public organisation typing (cpov:PublicOrganisation)
  2. Schema.org

    • Namespace: http://schema.org/
    • Usage: Names, addresses, descriptions, types
    • Types: schema:Library, schema:ArchiveOrganization, schema:Museum
    • Spec: https://schema.org/
  3. SKOS (Simple Knowledge Organization System)

    • Namespace: http://www.w3.org/2004/02/skos/core#

Specialized Ontologies

  1. RICO (Records in Contexts Ontology)

  2. ORG (W3C Organization Ontology)

    • Namespace: http://www.w3.org/ns/org#
    • Usage: Hierarchies (org:subOrganizationOf)
  3. PROV-O (Provenance Ontology)

    • Namespace: http://www.w3.org/ns/prov#
    • Usage: Generation activities (prov:wasGeneratedBy)

Linking Ontologies

  1. OWL (Web Ontology Language)

    • Namespace: http://www.w3.org/2002/07/owl#
    • Usage: owl:sameAs links to Wikidata
  2. Dublin Core Terms

    • Namespace: http://purl.org/dc/terms/
    • Usage: Identifiers (dcterms:identifier), sources (dcterms:source)
  3. Heritage (Project-Specific)

    • Namespace: https://w3id.org/heritage/custodian/
    • Usage: GHCID identifiers, UUID properties
    • Spec: See /docs/PERSISTENT_IDENTIFIERS.md

SPARQL Query Examples

Query 1: Find all libraries in a specific city

PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
Query 2: Find institutions linked to Wikidata

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}

Query 3: Find library hierarchies (parent-child branches)

PREFIX org: <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>

SELECT ?parent ?parentName ?child ?childName WHERE {
  ?child org:subOrganizationOf ?parent .
  ?parent schema:name ?parentName .
  ?child schema:name ?childName .
}
LIMIT 100

Query 4: Count institutions by type

PREFIX schema: <http://schema.org/>

SELECT ?type (COUNT(?inst) AS ?count) WHERE {
  ?inst a ?type .
  FILTER(?type IN (schema:Library, schema:ArchiveOrganization, schema:Museum))
}
GROUP BY ?type

Query 5: Find archives with specific ISIL codes

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?archive ?name ?isil WHERE {
  ?archive a schema:ArchiveOrganization .
  ?archive schema:name ?name .
  ?archive dcterms:identifier ?isil .
  FILTER(STRSTARTS(?isil, "DK-"))
}
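The DK- prefix filtered on above follows the ISIL structure from ISO 15511: a short alphanumeric prefix, a hyphen, then a local identifier, at most 16 characters overall. A loose shape check can be sketched in Python; the regex is an approximation for illustration, not the normative grammar, and the example code is illustrative rather than taken from the dataset:

```python
import re

# Approximate ISIL shape (ISO 15511): 1-4 character prefix, hyphen,
# then a local identifier. Illustrative, not normative.
ISIL_RE = re.compile(r"^[A-Za-z0-9]{1,4}-[A-Za-z0-9:/\-]{1,11}$")

def is_danish_isil(code: str) -> bool:
    """True for codes shaped like Danish ISILs (DK- prefix)."""
    return code.startswith("DK-") and bool(ISIL_RE.match(code))

print(is_danish_isil("DK-710100"))  # True
print(is_danish_isil("SE-X"))       # False (wrong country prefix)
```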

Query 6: Get provenance for all institutions

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?source WHERE {
  ?institution schema:name ?name .
  ?institution prov:wasGeneratedBy ?activity .
  ?activity dcterms:source ?source .
}
LIMIT 100

Usage Examples

Loading RDF with Python (rdflib)

from rdflib import Graph

# Load Turtle format
g = Graph()
g.parse("denmark_complete.ttl", format="turtle")

print(f"Loaded {len(g)} triples")

# Query with SPARQL
qres = g.query("""
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""")

for row in qres:
    print(row.name)

Loading RDF with Apache Jena (Java)

import org.apache.jena.rdf.model.*;
import org.apache.jena.query.*;

// Load RDF/XML format
Model model = ModelFactory.createDefaultModel();
model.read("denmark_complete.rdf");

// Query with SPARQL
String queryString = """
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""";

Query query = QueryFactory.create(queryString);
// try-with-resources ensures the execution is closed after use
try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results, query);
}

Loading JSON-LD with JavaScript

const jsonld = require('jsonld');
const fs = require('fs');

// Load JSON-LD
const doc = JSON.parse(fs.readFileSync('denmark_complete.jsonld', 'utf8'));

// Expand to N-Quads
jsonld.toRDF(doc, {format: 'application/n-quads'}).then(nquads => {
  // trim() drops the trailing newline so the count is not off by one
  console.log(`Loaded ${nquads.trim().split('\n').length} statements`);
});

Setting Up a SPARQL Endpoint

Option 1: Apache Jena Fuseki (Open Source)

# Download Jena Fuseki
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.10.0.tar.gz
tar xzf apache-jena-fuseki-4.10.0.tar.gz
cd apache-jena-fuseki-4.10.0

# Start server
./fuseki-server --update --mem /denmark

# Load data
curl -X POST http://localhost:3030/denmark/data \
  --data-binary @denmark_complete.ttl \
  -H "Content-Type: text/turtle"

# Query endpoint
curl -X POST http://localhost:3030/denmark/query \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
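When asked for `Accept: application/sparql-results+json`, Fuseki returns results in the W3C SPARQL 1.1 JSON results format. A minimal parser for that format, as a sketch (the function name is ours; the JSON field names come from the W3C result format):

```python
import json

def bindings_to_dicts(results_json: str) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: binding[var]["value"] for var in binding}
        for binding in data["results"]["bindings"]
    ]

sample = ('{"head":{"vars":["name"]},"results":{"bindings":'
          '[{"name":{"type":"literal","value":"Royal Library"}}]}}')
print(bindings_to_dicts(sample))  # [{'name': 'Royal Library'}]
```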

Option 2: GraphDB (Free Edition)

  1. Download GraphDB Free from https://www.ontotext.com/products/graphdb/download/
  2. Install and start GraphDB
  3. Create new repository "denmark"
  4. Import denmark_complete.ttl via web UI
  5. Query via SPARQL interface at http://localhost:7200/sparql

W3ID Persistent Identifiers

All institutions have persistent URIs following the pattern:

https://w3id.org/heritage/custodian/dk/{isil-or-id}

Examples:

  • Royal Library: https://w3id.org/heritage/custodian/dk/190101
  • Copenhagen Libraries: https://w3id.org/heritage/custodian/dk/710100
  • Danish National Archives: https://w3id.org/heritage/custodian/dk/archive/rigsarkivet
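The pattern can be checked mechanically. The regex below is inferred from the examples above, not an official grammar, so treat it as an assumption:

```python
import re

# Regex inferred from the example URIs above; illustrative, not normative.
W3ID_DK = re.compile(
    r"^https://w3id\.org/heritage/custodian/dk/(archive/)?[A-Za-z0-9\-]+$"
)

def is_heritage_uri(uri: str) -> bool:
    """True if uri matches the documented dk custodian URI pattern."""
    return bool(W3ID_DK.match(uri))

print(is_heritage_uri("https://w3id.org/heritage/custodian/dk/710100"))              # True
print(is_heritage_uri("https://w3id.org/heritage/custodian/dk/archive/rigsarkivet"))  # True
```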

Content Negotiation (when w3id.org registration complete):

# Get HTML representation
curl https://w3id.org/heritage/custodian/dk/710100

# Get Turtle RDF
curl -H "Accept: text/turtle" https://w3id.org/heritage/custodian/dk/710100

# Get JSON-LD
curl -H "Accept: application/ld+json" https://w3id.org/heritage/custodian/dk/710100
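The same negotiation works from Python with only the standard library. The sketch below builds the request without sending it, since `urlopen` will only succeed once the w3id.org registration is live:

```python
import urllib.request

def rdf_request(uri: str, accept: str = "text/turtle") -> urllib.request.Request:
    """Build a GET request asking the server for a specific RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": accept})

req = rdf_request("https://w3id.org/heritage/custodian/dk/710100",
                  accept="application/ld+json")
# urllib.request.urlopen(req) would perform the actual fetch.
```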

Data Quality & Provenance

All RDF exports include complete provenance metadata using PROV-O:

<https://w3id.org/heritage/custodian/dk/710100> 
    prov:wasGeneratedBy [
        a prov:Activity ;
        dcterms:source "ISIL_REGISTRY" ;
        prov:startedAtTime "2025-11-19T10:00:00Z"^^xsd:dateTime ;
        prov:endedAtTime "2025-11-19T10:30:00Z"^^xsd:dateTime
    ] .

Data Tier Classification (see AGENTS.md):

  • TIER_1_AUTHORITATIVE: Official registries (ISIL, national library databases)
  • TIER_2_VERIFIED: Verified web scraping (Arkiv.dk)
  • TIER_3_CROWD_SOURCED: Wikidata, OpenStreetMap
  • TIER_4_INFERRED: NLP-extracted from conversations

Denmark Dataset:

  • Main libraries (555): TIER_1 (ISIL registry)
  • Archives (594): TIER_2 (Arkiv.dk verified scraping)
  • Wikidata links (769): TIER_3 (crowd-sourced)
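In code, the tier assignment reduces to a lookup keyed on the provenance source recorded in dcterms:source. The mapping below restates the lists above; only "ISIL_REGISTRY" is attested in the PROV-O snippet earlier in this README, so the other key strings are illustrative placeholders:

```python
# Tier lookup keyed on dcterms:source values. "ISIL_REGISTRY" appears in the
# PROV-O example above; the other keys are illustrative placeholders.
SOURCE_TIER = {
    "ISIL_REGISTRY": "TIER_1_AUTHORITATIVE",
    "ARKIV_DK_SCRAPE": "TIER_2_VERIFIED",
    "WIKIDATA": "TIER_3_CROWD_SOURCED",
}

def tier_for(source: str) -> str:
    """Unknown sources fall through to the lowest-confidence tier."""
    return SOURCE_TIER.get(source, "TIER_4_INFERRED")

print(tier_for("ISIL_REGISTRY"))  # TIER_1_AUTHORITATIVE
```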

Validation

All RDF files have been validated using:

Syntax Validation

# Turtle syntax check
rapper -i turtle -o ntriples denmark_complete.ttl > /dev/null

# RDF/XML syntax check
rapper -i rdfxml -o ntriples denmark_complete.rdf > /dev/null

# JSON-LD context validation
jsonld validate denmark_complete.jsonld

Semantic Validation

  • All URIs resolve to w3id.org namespace (when registration complete)
  • owl:sameAs links point to valid Wikidata entities
  • Hierarchical relationships use standard ORG vocabulary
  • ISIL codes link to isil.org registry
  • GHCID identifiers follow project specification
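The owl:sameAs check, for example, reduces to matching each link target against the Wikidata entity URI shape. A self-contained sketch (the helper name is ours; with rdflib, the targets could be collected via `g.objects(predicate=OWL.sameAs)`):

```python
import re

# Well-formed Wikidata entity URIs: http://www.wikidata.org/entity/Q<digits>
WIKIDATA_ENTITY = re.compile(r"^http://www\.wikidata\.org/entity/Q\d+$")

def invalid_sameas_targets(uris):
    """Return owl:sameAs targets that are not well-formed Wikidata entity URIs."""
    return [u for u in uris if not WIKIDATA_ENTITY.match(u)]

print(invalid_sameas_targets([
    "http://www.wikidata.org/entity/Q42",
    "https://example.org/not-wikidata",
]))  # ['https://example.org/not-wikidata']
```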

Citation

If you use this dataset in research, please cite:

@dataset{danish_glam_rdf_2025,
  author = {GLAM Extractor Project},
  title = {Danish Heritage Institutions Linked Open Data},
  year = {2025},
  month = {November},
  version = {1.0},
  url = {https://github.com/yourusername/glam-extractor},
  note = {2,348 institutions (555 libraries, 594 archives, 1,199 branches), 43,429 RDF triples}
}

Related Documentation

  • Project README: /README.md
  • LinkML Schema: /schemas/heritage_custodian.yaml
  • Persistent Identifiers: /docs/PERSISTENT_IDENTIFIERS.md
  • Ontology Extensions: /docs/ONTOLOGY_EXTENSIONS.md
  • Denmark Session Summary: /SESSION_SUMMARY_20251119_RDF_WIKIDATA_COMPLETE.md

Contributing

To add new country datasets or improve existing RDF exports:

  1. Follow ontology alignment guidelines in /docs/ONTOLOGY_EXTENSIONS.md
  2. Use RDF exporter template: /scripts/export_denmark_rdf.py
  3. Validate with SPARQL queries before publishing
  4. Update this README with new dataset statistics

License

This data is published under CC0 1.0 Universal (Public Domain). You may use, modify, and distribute it freely without restrictions.

Individual institution data may be subject to different licenses from the source registries; consult the relevant registry's terms before redistribution.

Last Updated: 2025-11-19
Maintainer: GLAM Extractor Project
Contact: GitHub Issues