# Heritage Institution RDF Exports

This directory contains **Linked Open Data** exports of heritage institution datasets in W3C-compliant RDF formats.

## Available Datasets

### Denmark 🇩🇰 - COMPLETE (November 2025)

**Dataset**: `denmark_complete.*`
**Status**: ✅ Production-ready
**Last Updated**: 2025-11-19

| Format | File | Size | Use Case |
|--------|------|------|----------|
| **Turtle** | `denmark_complete.ttl` | 2.27 MB | Human-readable, SPARQL queries |
| **RDF/XML** | `denmark_complete.rdf` | 3.96 MB | Machine processing, legacy systems |
| **JSON-LD** | `denmark_complete.jsonld` | 5.16 MB | Web APIs, JavaScript applications |
| **N-Triples** | `denmark_complete.nt` | 6.24 MB | Line-oriented processing, MapReduce |

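The "line-oriented processing" use case for N-Triples can be sketched with the Python standard library alone: each non-blank, non-comment line holds exactly one triple, so a simple predicate tally needs no RDF parser. This is a minimal sketch, not part of the project tooling. The function name `predicate_counts` and the sample triples (including the library names) are made up for illustration.

```python
from collections import Counter


def predicate_counts(lines):
    """Count how often each predicate occurs in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subjects and predicates never contain whitespace in N-Triples,
        # so splitting twice isolates the predicate; the object stays in the rest.
        _subject, predicate, _rest = line.split(None, 2)
        counts[predicate] += 1
    return counts


# Hypothetical sample data; in practice pass an open file instead, e.g.
# open("denmark_complete.nt", encoding="utf-8").
sample = [
    '<https://w3id.org/heritage/custodian/dk/710100> <http://schema.org/name> "Example Library" .',
    '<https://w3id.org/heritage/custodian/dk/710100> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Library> .',
    '<https://w3id.org/heritage/custodian/dk/190101> <http://schema.org/name> "Another Library" .',
]
counts = predicate_counts(sample)
print(counts.most_common())
```

Because the function consumes lines lazily, the same code works on the full 6.24 MB file without loading it into memory.
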
#### Statistics

- **Institutions**: 2,348 (555 libraries, 594 archives, 1,199 branches)
- **RDF Triples**: 43,429
- **Ontologies Used**: 9 (CPOV, Schema.org, RICO, ORG, PROV-O, SKOS, Dublin Core, OWL, Heritage)
- **Wikidata Links**: 769 institutions (32.8%)
- **ISIL Codes**: 555 institutions (23.6%)
- **GHCID Identifiers**: 998 institutions (42.5%)

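The percentages above are each identifier count divided by the total number of institutions. A short sketch of that arithmetic, using only the figures quoted in this section (the helper name `coverage` is illustrative):

```python
# Institution counts quoted above: libraries + archives + branches.
TOTAL = 555 + 594 + 1199


def coverage(count, total=TOTAL):
    """Share of institutions carrying an identifier, as a percentage with one decimal."""
    return round(100 * count / total, 1)


identifier_counts = {"Wikidata": 769, "ISIL": 555, "GHCID": 998}
for name, count in identifier_counts.items():
    print(f"{name}: {count}/{TOTAL} = {coverage(count)}%")
```
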
#### Coverage by Institution Type

| Type | Count | ISIL | GHCID | Wikidata |
|------|-------|------|-------|----------|
| **Main Libraries** | 555 | 100% | 78% | High |
| **Archives** | 594 | 0% (by design) | 95% | Moderate |
| **Library Branches** | 1,199 | Inherited | 0% (by design) | Low |

---

## Ontology Alignment

All RDF exports follow these international standards:

### Core Ontologies

1. **CPOV** (Core Public Organisation Vocabulary)
   - Namespace: `http://data.europa.eu/m8g/`
   - Usage: Public sector organization type
   - Spec: https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/core-public-organisation-vocabulary

2. **Schema.org**
   - Namespace: `http://schema.org/`
   - Usage: Names, addresses, descriptions, types
   - Types: `schema:Library`, `schema:ArchiveOrganization`, `schema:Museum`
   - Spec: https://schema.org/

3. **SKOS** (Simple Knowledge Organization System)
   - Namespace: `http://www.w3.org/2004/02/skos/core#`
   - Usage: Preferred/alternative labels
   - Spec: https://www.w3.org/TR/skos-reference/

### Specialized Ontologies

4. **RICO** (Records in Contexts Ontology)
   - Namespace: `https://www.ica.org/standards/RiC/ontology#`
   - Usage: Archival description (for archives)
   - Spec: https://www.ica.org/standards/RiC/ontology

5. **ORG** (W3C Organization Ontology)
   - Namespace: `http://www.w3.org/ns/org#`
   - Usage: Hierarchical relationships (library branches → main libraries)
   - Spec: https://www.w3.org/TR/vocab-org/

6. **PROV-O** (Provenance Ontology)
   - Namespace: `http://www.w3.org/ns/prov#`
   - Usage: Data provenance tracking
   - Spec: https://www.w3.org/TR/prov-o/

### Linking Ontologies

7. **OWL** (Web Ontology Language)
   - Namespace: `http://www.w3.org/2002/07/owl#`
   - Usage: Semantic equivalence (`owl:sameAs` for Wikidata links)
   - Spec: https://www.w3.org/TR/owl2-primer/

8. **Dublin Core Terms**
   - Namespace: `http://purl.org/dc/terms/`
   - Usage: Identifiers, descriptions, metadata
   - Spec: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

9. **Heritage (Project-Specific)**
   - Namespace: `https://w3id.org/heritage/custodian/`
   - Usage: GHCID identifiers, UUID properties
   - Spec: See `/docs/PERSISTENT_IDENTIFIERS.md`

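When scripting against the exports, it can help to keep the nine namespaces in one place. The sketch below collects them in a dictionary and uses it to expand prefixed names and to generate `PREFIX` headers. The namespace IRIs are the ones listed above; the prefix labels `skos`, `rico`, and `heritage`, and the helper names `expand` and `sparql_prefixes`, are assumptions made for this example.

```python
# Namespace IRIs as listed in this section. The labels "skos", "rico" and
# "heritage" are assumed; the exports may bind different prefixes.
NAMESPACES = {
    "cpov": "http://data.europa.eu/m8g/",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rico": "https://www.ica.org/standards/RiC/ontology#",
    "org": "http://www.w3.org/ns/org#",
    "prov": "http://www.w3.org/ns/prov#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "dcterms": "http://purl.org/dc/terms/",
    "heritage": "https://w3id.org/heritage/custodian/",
}


def expand(curie):
    """Expand a prefixed name such as 'schema:Library' into a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return NAMESPACES[prefix] + local


def sparql_prefixes(*prefixes):
    """Build the PREFIX header lines for the given prefix labels."""
    return "\n".join(f"PREFIX {p}: <{NAMESPACES[p]}>" for p in prefixes)


print(expand("org:subOrganizationOf"))
print(sparql_prefixes("schema", "owl"))
```
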
---

## SPARQL Query Examples

### Query 1: Find all libraries in a specific city

```sparql
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
```

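To run Query 1 for arbitrary cities from a script, the city name has to be embedded as a correctly escaped SPARQL string literal. The following is a minimal sketch of one way to do that; `sparql_string` and `libraries_in_city_query` are hypothetical helper names, and the escaping covers only backslashes, double quotes, and line breaks.

```python
def sparql_string(value):
    """Quote a Python string as a SPARQL double-quoted literal."""
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes stay intact
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def libraries_in_city_query(city):
    """Return Query 1 with the city literal filled in."""
    return f"""PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {{
  ?library a cpov:PublicOrganisation, schema:Library ;
           schema:name ?name ;
           schema:address ?addrNode .
  ?addrNode schema:addressLocality {sparql_string(city)} ;
            schema:streetAddress ?address .
}}"""


print(libraries_in_city_query("København K"))
```

The locality must match the stored literal exactly, so the value passed in has to use the same spelling as the dataset (for example the postal-district form `København K` used in Query 1).
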
### Query 2: Find all institutions with Wikidata links

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```

### Query 3: Find library hierarchies (parent-child branches)

```sparql
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>

SELECT ?parent ?parentName ?child ?childName WHERE {
  ?child org:subOrganizationOf ?parent .
  ?parent schema:name ?parentName .
  ?child schema:name ?childName .
}
LIMIT 100
```

### Query 4: Count institutions by type

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?type (COUNT(?inst) AS ?count) WHERE {
  ?inst a ?type .
  FILTER(?type IN (schema:Library, schema:ArchiveOrganization, schema:Museum))
}
GROUP BY ?type
```

### Query 5: Find archives with specific ISIL codes

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

# Note: archives in the Denmark dataset carry no ISIL codes (see the coverage table),
# so replace schema:ArchiveOrganization with schema:Library to get results there.
SELECT ?archive ?name ?isil WHERE {
  ?archive a schema:ArchiveOrganization .
  ?archive schema:name ?name .
  ?archive dcterms:identifier ?isil .
  FILTER(STRSTARTS(STR(?isil), "DK-"))
}
```

### Query 6: Get provenance for all institutions

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?source WHERE {
  ?institution schema:name ?name .
  ?institution prov:wasGeneratedBy ?activity .
  ?activity dcterms:source ?source .
}
LIMIT 100
```

---

## Usage Examples

### Loading RDF with Python (rdflib)

```python
from rdflib import Graph

# Load Turtle format
g = Graph()
g.parse("denmark_complete.ttl", format="turtle")

print(f"Loaded {len(g)} triples")

# Query with SPARQL
qres = g.query("""
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""")

# Each result row exposes the selected variables as attributes
for row in qres:
    print(row.name)
```

### Loading RDF with Apache Jena (Java)

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.query.*;

// Load RDF/XML format
Model model = ModelFactory.createDefaultModel();
model.read("denmark_complete.rdf");

// Query with SPARQL (text blocks require Java 15+)
String queryString = """
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
      ?inst a schema:Library .
      ?inst schema:name ?name .
    }
    LIMIT 10
    """;

Query query = QueryFactory.create(queryString);
// try-with-resources closes the QueryExecution once the results are printed
try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results, query);
}
```

### Loading JSON-LD with JavaScript

```javascript
const jsonld = require('jsonld');
const fs = require('fs');

// Load JSON-LD
const doc = JSON.parse(fs.readFileSync('denmark_complete.jsonld', 'utf8'));

// Convert to N-Quads
jsonld.toRDF(doc, {format: 'application/n-quads'}).then(nquads => {
  // One statement per non-empty line (the output ends with a trailing newline)
  const count = nquads.split('\n').filter(line => line.trim() !== '').length;
  console.log(`Loaded ${count} triples`);
});
```

---

## Setting Up a SPARQL Endpoint

### Option 1: Apache Jena Fuseki (Open Source)

```bash
# Download Jena Fuseki
# (older releases move from dlcdn.apache.org to https://archive.apache.org/dist/jena/binaries/)
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.10.0.tar.gz
tar xzf apache-jena-fuseki-4.10.0.tar.gz
cd apache-jena-fuseki-4.10.0

# Start server (runs in the foreground; use a second terminal for the commands below)
./fuseki-server --update --mem /denmark

# Load data (run from the directory containing denmark_complete.ttl)
curl -X POST http://localhost:3030/denmark/data \
  --data-binary @denmark_complete.ttl \
  -H "Content-Type: text/turtle"

# Query endpoint
curl -X POST http://localhost:3030/denmark/query \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
```

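The same endpoint can be queried from Python without extra dependencies. The sketch below assumes the Fuseki instance from Option 1 is listening on `http://localhost:3030/denmark/query` and sends the query as a form-encoded POST, asking for the standard SPARQL JSON results format. The helper names `build_request`, `bindings_to_rows`, and `run_select` are illustrative, and the sample result document is made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the in-memory dataset started in Option 1.
ENDPOINT = "http://localhost:3030/denmark/query"


def build_request(query, endpoint=ENDPOINT):
    """Prepare a SPARQL protocol POST request (nothing is sent yet)."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,  # supplying a body makes urllib issue a POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_select(query, endpoint=ENDPOINT):
    """Send the query and return its rows; requires a running endpoint."""
    with urlopen(build_request(query, endpoint), timeout=30) as response:
        return bindings_to_rows(json.load(response))


# Made-up response used to show the row format without a running server.
sample = {
    "head": {"vars": ["name"]},
    "results": {"bindings": [{"name": {"type": "literal", "value": "Example Library"}}]},
}
print(bindings_to_rows(sample))
```

With the server running, `run_select("SELECT * WHERE { ?s ?p ?o } LIMIT 10")` returns the same rows as the `curl` query above.
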
### Option 2: GraphDB (Free Edition)

1. Download GraphDB Free from https://www.ontotext.com/products/graphdb/download/
2. Install and start GraphDB
3. Create new repository "denmark"
4. Import `denmark_complete.ttl` via web UI
5. Query via SPARQL interface at http://localhost:7200/sparql

---

## W3ID Persistent Identifiers

All institutions have persistent URIs following the pattern:

```
https://w3id.org/heritage/custodian/dk/{isil-or-id}
```

**Examples**:
- Royal Library: `https://w3id.org/heritage/custodian/dk/190101`
- Copenhagen Libraries: `https://w3id.org/heritage/custodian/dk/710100`
- Danish National Archives: `https://w3id.org/heritage/custodian/dk/archive/rigsarkivet`

**Content Negotiation** (available once w3id.org registration is complete):
```bash
# Get HTML representation
curl https://w3id.org/heritage/custodian/dk/710100

# Get Turtle RDF
curl -H "Accept: text/turtle" https://w3id.org/heritage/custodian/dk/710100

# Get JSON-LD
curl -H "Accept: application/ld+json" https://w3id.org/heritage/custodian/dk/710100
```

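The URI pattern and the `Accept` header negotiation can also be sketched in Python with the standard library. The helper names below (`custodian_uri`, `negotiation_request`, `fetch`) are hypothetical, and `fetch` will only return data once the w3id.org registration is in place, so the example stops at building the request.

```python
from urllib.request import Request, urlopen

# Base of the Danish custodian URIs, taken from the pattern in this section.
BASE = "https://w3id.org/heritage/custodian/dk/"


def custodian_uri(identifier):
    """Build a persistent URI from an ISIL-derived number or other local id."""
    return BASE + identifier


def negotiation_request(identifier, media_type="text/turtle"):
    """Prepare a GET request asking for a specific RDF serialization."""
    return Request(custodian_uri(identifier), headers={"Accept": media_type})


def fetch(identifier, media_type="text/turtle", timeout=10):
    """Dereference the URI; redirects issued by w3id.org are followed automatically."""
    with urlopen(negotiation_request(identifier, media_type), timeout=timeout) as response:
        return response.read().decode("utf-8")


req = negotiation_request("710100", "application/ld+json")
print(req.full_url, req.get_header("Accept"))
```
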
---

## Data Quality & Provenance

All RDF exports include **complete provenance metadata** using PROV-O:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/dk/710100>
  prov:wasGeneratedBy [
    a prov:Activity ;
    dcterms:source "ISIL_REGISTRY" ;
    prov:startedAtTime "2025-11-19T10:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2025-11-19T10:30:00Z"^^xsd:dateTime
  ] .
```

**Data Tier Classification** (see `AGENTS.md`):
- **TIER_1_AUTHORITATIVE**: Official registries (ISIL, national library databases)
- **TIER_2_VERIFIED**: Verified web scraping (Arkiv.dk)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap
- **TIER_4_INFERRED**: NLP-extracted from conversations

**Denmark Dataset**:
- Main libraries (555): TIER_1 (ISIL registry)
- Archives (594): TIER_2 (Arkiv.dk verified scraping)
- Wikidata links (769): TIER_3 (crowd-sourced)

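One way to work with the tiers in code is an ordered enumeration, so that a lower number means a more authoritative source. This is a sketch under assumptions: the tier names come from the list above and `ISIL_REGISTRY` from the provenance example, but the source keys `ARKIV_DK` and `WIKIDATA` and the helper `most_authoritative` are hypothetical and may not match the values used in the exports.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Data tiers from the classification above; lower is more authoritative."""
    TIER_1_AUTHORITATIVE = 1
    TIER_2_VERIFIED = 2
    TIER_3_CROWD_SOURCED = 3
    TIER_4_INFERRED = 4


# Source labels mapped to tiers. Only "ISIL_REGISTRY" appears in the
# provenance example; the other two keys are assumed for illustration.
SOURCE_TIERS = {
    "ISIL_REGISTRY": DataTier.TIER_1_AUTHORITATIVE,
    "ARKIV_DK": DataTier.TIER_2_VERIFIED,
    "WIKIDATA": DataTier.TIER_3_CROWD_SOURCED,
}


def most_authoritative(sources):
    """Pick the source with the best (lowest) tier from a list of labels."""
    return min(sources, key=lambda source: SOURCE_TIERS[source])


print(most_authoritative(["WIKIDATA", "ISIL_REGISTRY"]))
```

A comparison like this is useful when two sources disagree about a field and the higher-tier value should win.
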
---

## Validation

All RDF files have been validated using:

### Syntax Validation

```bash
# Turtle syntax check
rapper -i turtle -o ntriples denmark_complete.ttl > /dev/null

# RDF/XML syntax check
rapper -i rdfxml -o ntriples denmark_complete.rdf > /dev/null

# JSON-LD context validation
jsonld validate denmark_complete.jsonld
```

### Semantic Validation

- ✅ All institution URIs use the w3id.org namespace (they will resolve once registration is complete)
- ✅ `owl:sameAs` links point to valid Wikidata entities
- ✅ Hierarchical relationships use the standard ORG vocabulary
- ✅ ISIL codes link to isil.org registry
- ✅ GHCID identifiers follow the project specification

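Some of these checks can be approximated with simple pattern tests before running a full validation. The sketch below is not the project's validator: it only checks the *shape* of a custodian URI, a Wikidata link, and a Danish ISIL code, and cannot tell whether a Wikidata entity actually exists. The function name `check_record`, the regular expressions, and the example identifiers are assumptions made for illustration.

```python
import re

# Shape checks only; all three patterns are approximations.
WIKIDATA_ENTITY = re.compile(r"http://www\.wikidata\.org/entity/Q[1-9][0-9]*")
CUSTODIAN_URI = re.compile(r'https://w3id\.org/heritage/custodian/dk/[^\s<>"]+')
DANISH_ISIL = re.compile(r"DK-[A-Za-z0-9:/-]+")


def check_record(uri, same_as=None, isil=None):
    """Return a list of problems; an empty list means all shape checks passed."""
    problems = []
    if not CUSTODIAN_URI.fullmatch(uri):
        problems.append(f"URI outside the w3id.org custodian namespace: {uri}")
    if same_as is not None and not WIKIDATA_ENTITY.fullmatch(same_as):
        problems.append(f"owl:sameAs is not a Wikidata entity URI: {same_as}")
    if isil is not None and not DANISH_ISIL.fullmatch(isil):
        problems.append(f"ISIL code does not look Danish: {isil}")
    return problems


# Hypothetical record; the Wikidata id and ISIL value are placeholders.
print(check_record(
    "https://w3id.org/heritage/custodian/dk/710100",
    same_as="http://www.wikidata.org/entity/Q42",
    isil="DK-710100",
))
```
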
---

## Citation

If you use this dataset in research, please cite:

```bibtex
@dataset{danish_glam_rdf_2025,
  author = {GLAM Extractor Project},
  title = {Danish Heritage Institutions Linked Open Data},
  year = {2025},
  month = {November},
  version = {1.0},
  url = {https://github.com/yourusername/glam-extractor},
  note = {2,348 institutions (555 libraries, 594 archives, 1,199 branches), 43,429 RDF triples}
}
```

---

## Related Documentation

- **Project README**: `/README.md`
- **LinkML Schema**: `/schemas/heritage_custodian.yaml`
- **Persistent Identifiers**: `/docs/PERSISTENT_IDENTIFIERS.md`
- **Ontology Extensions**: `/docs/ONTOLOGY_EXTENSIONS.md`
- **Denmark Session Summary**: `/SESSION_SUMMARY_20251119_RDF_WIKIDATA_COMPLETE.md`

---

## Contributing

To add new country datasets or improve existing RDF exports:

1. Follow ontology alignment guidelines in `/docs/ONTOLOGY_EXTENSIONS.md`
2. Use RDF exporter template: `/scripts/export_denmark_rdf.py`
3. Validate with SPARQL queries before publishing
4. Update this README with new dataset statistics

---

## License

This data is published under **CC0 1.0 Universal (Public Domain)**. You may use, modify, and distribute it freely without restrictions.

Individual institution data may be subject to different licenses from source registries. Consult:
- Danish ISIL Registry: https://slks.dk/isil
- Arkiv.dk: https://arkiv.dk
- Wikidata: CC0 (https://www.wikidata.org/wiki/Wikidata:Data_access#Licensing)

---

**Last Updated**: 2025-11-19
**Maintainer**: GLAM Extractor Project
**Contact**: [GitHub Issues](https://github.com/yourusername/glam-extractor/issues)