
Heritage Institution RDF Exports

This directory contains Linked Open Data exports of heritage institution datasets in W3C-compliant RDF formats.

Available Datasets

Denmark 🇩🇰 - COMPLETE (November 2025)

Dataset: denmark_complete.*
Status: Production-ready
Last Updated: 2025-11-19

| Format    | File                    | File Size | Use Case                            |
|-----------|-------------------------|-----------|-------------------------------------|
| Turtle    | denmark_complete.ttl    | 2.27 MB   | Human-readable, SPARQL queries      |
| RDF/XML   | denmark_complete.rdf    | 3.96 MB   | Machine processing, legacy systems  |
| JSON-LD   | denmark_complete.jsonld | 5.16 MB   | Web APIs, JavaScript applications   |
| N-Triples | denmark_complete.nt     | 6.24 MB   | Line-oriented processing, MapReduce |

Statistics

  • Institutions: 2,348 (555 libraries, 594 archives, 1,199 branches)
  • RDF Triples: 43,429
  • Ontologies Used: 9 (CPOV, Schema.org, RICO, ORG, PROV-O, SKOS, Dublin Core, OWL, Heritage)
  • Wikidata Links: 769 institutions (32.8%)
  • ISIL Codes: 555 institutions (23.6%)
  • GHCID Identifiers: 998 institutions (42.5%)

Coverage by Institution Type

| Type             | Count | ISIL           | GHCID          | Wikidata |
|------------------|-------|----------------|----------------|----------|
| Main Libraries   | 555   | 100%           | 78%            | High     |
| Archives         | 594   | 0% (by design) | 95%            | Moderate |
| Library Branches | 1,199 | Inherited      | 0% (by design) | Low      |

Ontology Alignment

All RDF exports follow these international standards:

Core Ontologies

  1. CPOV (Core Public Organisation Vocabulary)

    • Namespace: http://data.europa.eu/m8g/
    • Usage: Public organisation typing (cpov:PublicOrganisation)
  2. Schema.org

    • Namespace: http://schema.org/
    • Usage: Names, addresses, descriptions, types
    • Types: schema:Library, schema:ArchiveOrganization, schema:Museum
    • Spec: https://schema.org/
  3. SKOS (Simple Knowledge Organization System)

    • Namespace: http://www.w3.org/2004/02/skos/core#

Specialized Ontologies

  1. RICO (Records in Contexts Ontology)

  2. ORG (W3C Organization Ontology)

    • Namespace: http://www.w3.org/ns/org#
    • Usage: Hierarchies (org:subOrganizationOf)
  3. PROV-O (Provenance Ontology)

    • Namespace: http://www.w3.org/ns/prov#
    • Usage: Generation activities (prov:wasGeneratedBy)

Linking Ontologies

  1. OWL (Web Ontology Language)

    • Namespace: http://www.w3.org/2002/07/owl#
    • Usage: owl:sameAs links to Wikidata
  2. Dublin Core Terms

    • Namespace: http://purl.org/dc/terms/
    • Usage: Identifiers (dcterms:identifier), sources (dcterms:source)
  3. Heritage (Project-Specific)

    • Namespace: https://w3id.org/heritage/custodian/
    • Usage: GHCID identifiers, UUID properties
    • Spec: See /docs/PERSISTENT_IDENTIFIERS.md

SPARQL Query Examples

Query 1: Find all libraries in a specific city

PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
Query 2: Find institutions linked to Wikidata

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}

Query 3: Find library hierarchies (parent-child branches)

PREFIX org: <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>

SELECT ?parent ?parentName ?child ?childName WHERE {
  ?child org:subOrganizationOf ?parent .
  ?parent schema:name ?parentName .
  ?child schema:name ?childName .
}
LIMIT 100

Query 4: Count institutions by type

PREFIX schema: <http://schema.org/>

SELECT ?type (COUNT(?inst) AS ?count) WHERE {
  ?inst a ?type .
  FILTER(?type IN (schema:Library, schema:ArchiveOrganization, schema:Museum))
}
GROUP BY ?type

Query 5: Find archives with specific ISIL codes

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?archive ?name ?isil WHERE {
  ?archive a schema:ArchiveOrganization .
  ?archive schema:name ?name .
  ?archive dcterms:identifier ?isil .
  FILTER(STRSTARTS(?isil, "DK-"))
}
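The DK- prefix filtered on above follows the ISIL structure from ISO 15511: a short alphanumeric prefix, a hyphen, then a local identifier, at most 16 characters overall. A loose shape check can be sketched in Python; the regex is an approximation for illustration, not the normative grammar, and the example code is illustrative rather than taken from the dataset:

```python
import re

# Approximate ISIL shape (ISO 15511): 1-4 character prefix, hyphen,
# then a local identifier. Illustrative, not normative.
ISIL_RE = re.compile(r"^[A-Za-z0-9]{1,4}-[A-Za-z0-9:/\-]{1,11}$")

def is_danish_isil(code: str) -> bool:
    """True for codes shaped like Danish ISILs (DK- prefix)."""
    return code.startswith("DK-") and bool(ISIL_RE.match(code))

print(is_danish_isil("DK-710100"))  # True
print(is_danish_isil("SE-X"))       # False (wrong country prefix)
```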

Query 6: Get provenance for all institutions

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?source WHERE {
  ?institution schema:name ?name .
  ?institution prov:wasGeneratedBy ?activity .
  ?activity dcterms:source ?source .
}
LIMIT 100

Usage Examples

Loading RDF with Python (rdflib)

from rdflib import Graph

# Load Turtle format
g = Graph()
g.parse("denmark_complete.ttl", format="turtle")

print(f"Loaded {len(g)} triples")

# Query with SPARQL
qres = g.query("""
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""")

for row in qres:
    print(row.name)

Loading RDF with Apache Jena (Java)

import org.apache.jena.rdf.model.*;
import org.apache.jena.query.*;

// Load RDF/XML format
Model model = ModelFactory.createDefaultModel();
model.read("denmark_complete.rdf");

// Query with SPARQL
String queryString = """
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""";

Query query = QueryFactory.create(queryString);
// try-with-resources ensures the execution is closed after use
try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results, query);
}

Loading JSON-LD with JavaScript

const jsonld = require('jsonld');
const fs = require('fs');

// Load JSON-LD
const doc = JSON.parse(fs.readFileSync('denmark_complete.jsonld', 'utf8'));

// Expand to N-Quads
jsonld.toRDF(doc, {format: 'application/n-quads'}).then(nquads => {
  // trim() drops the trailing newline so the count is not off by one
  console.log(`Loaded ${nquads.trim().split('\n').length} statements`);
});

Setting Up a SPARQL Endpoint

Option 1: Apache Jena Fuseki (Open Source)

# Download Jena Fuseki
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.10.0.tar.gz
tar xzf apache-jena-fuseki-4.10.0.tar.gz
cd apache-jena-fuseki-4.10.0

# Start server
./fuseki-server --update --mem /denmark

# Load data
curl -X POST http://localhost:3030/denmark/data \
  --data-binary @denmark_complete.ttl \
  -H "Content-Type: text/turtle"

# Query endpoint
curl -X POST http://localhost:3030/denmark/query \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
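When asked for `Accept: application/sparql-results+json`, Fuseki returns results in the W3C SPARQL 1.1 JSON results format. A minimal parser for that format, as a sketch (the function name is ours; the JSON field names come from the W3C result format):

```python
import json

def bindings_to_dicts(results_json: str) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: binding[var]["value"] for var in binding}
        for binding in data["results"]["bindings"]
    ]

sample = ('{"head":{"vars":["name"]},"results":{"bindings":'
          '[{"name":{"type":"literal","value":"Royal Library"}}]}}')
print(bindings_to_dicts(sample))  # [{'name': 'Royal Library'}]
```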

Option 2: GraphDB (Free Edition)

  1. Download GraphDB Free from https://www.ontotext.com/products/graphdb/download/
  2. Install and start GraphDB
  3. Create new repository "denmark"
  4. Import denmark_complete.ttl via web UI
  5. Query via SPARQL interface at http://localhost:7200/sparql

W3ID Persistent Identifiers

All institutions have persistent URIs following the pattern:

https://w3id.org/heritage/custodian/dk/{isil-or-id}

Examples:

  • Royal Library: https://w3id.org/heritage/custodian/dk/190101
  • Copenhagen Libraries: https://w3id.org/heritage/custodian/dk/710100
  • Danish National Archives: https://w3id.org/heritage/custodian/dk/archive/rigsarkivet
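The pattern can be checked mechanically. The regex below is inferred from the examples above, not an official grammar, so treat it as an assumption:

```python
import re

# Regex inferred from the example URIs above; illustrative, not normative.
W3ID_DK = re.compile(
    r"^https://w3id\.org/heritage/custodian/dk/(archive/)?[A-Za-z0-9\-]+$"
)

def is_heritage_uri(uri: str) -> bool:
    """True if uri matches the documented dk custodian URI pattern."""
    return bool(W3ID_DK.match(uri))

print(is_heritage_uri("https://w3id.org/heritage/custodian/dk/710100"))              # True
print(is_heritage_uri("https://w3id.org/heritage/custodian/dk/archive/rigsarkivet"))  # True
```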

Content Negotiation (when w3id.org registration complete):

# Get HTML representation
curl https://w3id.org/heritage/custodian/dk/710100

# Get Turtle RDF
curl -H "Accept: text/turtle" https://w3id.org/heritage/custodian/dk/710100

# Get JSON-LD
curl -H "Accept: application/ld+json" https://w3id.org/heritage/custodian/dk/710100
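The same negotiation works from Python with only the standard library. The sketch below builds the request without sending it, since `urlopen` will only succeed once the w3id.org registration is live:

```python
import urllib.request

def rdf_request(uri: str, accept: str = "text/turtle") -> urllib.request.Request:
    """Build a GET request asking the server for a specific RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": accept})

req = rdf_request("https://w3id.org/heritage/custodian/dk/710100",
                  accept="application/ld+json")
# urllib.request.urlopen(req) would perform the actual fetch.
```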

Data Quality & Provenance

All RDF exports include complete provenance metadata using PROV-O:

<https://w3id.org/heritage/custodian/dk/710100> 
    prov:wasGeneratedBy [
        a prov:Activity ;
        dcterms:source "ISIL_REGISTRY" ;
        prov:startedAtTime "2025-11-19T10:00:00Z"^^xsd:dateTime ;
        prov:endedAtTime "2025-11-19T10:30:00Z"^^xsd:dateTime
    ] .

Data Tier Classification (see AGENTS.md):

  • TIER_1_AUTHORITATIVE: Official registries (ISIL, national library databases)
  • TIER_2_VERIFIED: Verified web scraping (Arkiv.dk)
  • TIER_3_CROWD_SOURCED: Wikidata, OpenStreetMap
  • TIER_4_INFERRED: NLP-extracted from conversations

Denmark Dataset:

  • Main libraries (555): TIER_1 (ISIL registry)
  • Archives (594): TIER_2 (Arkiv.dk verified scraping)
  • Wikidata links (769): TIER_3 (crowd-sourced)
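In code, the tier assignment reduces to a lookup keyed on the provenance source recorded in dcterms:source. The mapping below restates the lists above; only "ISIL_REGISTRY" is attested in the PROV-O snippet earlier in this README, so the other key strings are illustrative placeholders:

```python
# Tier lookup keyed on dcterms:source values. "ISIL_REGISTRY" appears in the
# PROV-O example above; the other keys are illustrative placeholders.
SOURCE_TIER = {
    "ISIL_REGISTRY": "TIER_1_AUTHORITATIVE",
    "ARKIV_DK_SCRAPE": "TIER_2_VERIFIED",
    "WIKIDATA": "TIER_3_CROWD_SOURCED",
}

def tier_for(source: str) -> str:
    """Unknown sources fall through to the lowest-confidence tier."""
    return SOURCE_TIER.get(source, "TIER_4_INFERRED")

print(tier_for("ISIL_REGISTRY"))  # TIER_1_AUTHORITATIVE
```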

Validation

All RDF files have been validated using:

Syntax Validation

# Turtle syntax check
rapper -i turtle -o ntriples denmark_complete.ttl > /dev/null

# RDF/XML syntax check
rapper -i rdfxml -o ntriples denmark_complete.rdf > /dev/null

# JSON-LD context validation
jsonld validate denmark_complete.jsonld

Semantic Validation

  • All URIs resolve to w3id.org namespace (when registration complete)
  • owl:sameAs links point to valid Wikidata entities
  • Hierarchical relationships use standard ORG vocabulary
  • ISIL codes link to isil.org registry
  • GHCID identifiers follow project specification
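The owl:sameAs check, for example, reduces to matching each link target against the Wikidata entity URI shape. A self-contained sketch (the helper name is ours; with rdflib, the targets could be collected via `g.objects(predicate=OWL.sameAs)`):

```python
import re

# Well-formed Wikidata entity URIs: http://www.wikidata.org/entity/Q<digits>
WIKIDATA_ENTITY = re.compile(r"^http://www\.wikidata\.org/entity/Q\d+$")

def invalid_sameas_targets(uris):
    """Return owl:sameAs targets that are not well-formed Wikidata entity URIs."""
    return [u for u in uris if not WIKIDATA_ENTITY.match(u)]

print(invalid_sameas_targets([
    "http://www.wikidata.org/entity/Q42",
    "https://example.org/not-wikidata",
]))  # ['https://example.org/not-wikidata']
```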

Citation

If you use this dataset in research, please cite:

@dataset{danish_glam_rdf_2025,
  author = {GLAM Extractor Project},
  title = {Danish Heritage Institutions Linked Open Data},
  year = {2025},
  month = {November},
  version = {1.0},
  url = {https://github.com/yourusername/glam-extractor},
  note = {2,348 institutions (555 libraries, 594 archives, 1,199 branches), 43,429 RDF triples}
}

Related Documentation

  • Project README: /README.md
  • LinkML Schema: /schemas/heritage_custodian.yaml
  • Persistent Identifiers: /docs/PERSISTENT_IDENTIFIERS.md
  • Ontology Extensions: /docs/ONTOLOGY_EXTENSIONS.md
  • Denmark Session Summary: /SESSION_SUMMARY_20251119_RDF_WIKIDATA_COMPLETE.md

Contributing

To add new country datasets or improve existing RDF exports:

  1. Follow ontology alignment guidelines in /docs/ONTOLOGY_EXTENSIONS.md
  2. Use RDF exporter template: /scripts/export_denmark_rdf.py
  3. Validate with SPARQL queries before publishing
  4. Update this README with new dataset statistics

License

This data is published under CC0 1.0 Universal (Public Domain). You may use, modify, and distribute it freely without restrictions.

Individual institution data may be subject to different licenses from the source registries; consult the relevant registry's terms before redistribution.

Last Updated: 2025-11-19
Maintainer: GLAM Extractor Project
Contact: GitHub Issues