# Heritage Institution RDF Exports

This directory contains **Linked Open Data** exports of heritage institution datasets in W3C-compliant RDF formats.

## Available Datasets

### Denmark 🇩🇰 - COMPLETE (November 2025)

**Dataset**: `denmark_complete.*`
**Status**: ✅ Production-ready
**Last Updated**: 2025-11-19

| Format | File | Size | Use Case |
|--------|------|------|----------|
| **Turtle** | `denmark_complete.ttl` | 2.27 MB | Human-readable, SPARQL queries |
| **RDF/XML** | `denmark_complete.rdf` | 3.96 MB | Machine processing, legacy systems |
| **JSON-LD** | `denmark_complete.jsonld` | 5.16 MB | Web APIs, JavaScript applications |
| **N-Triples** | `denmark_complete.nt` | 6.24 MB | Line-oriented processing, MapReduce |

#### Statistics

- **Institutions**: 2,348 (555 libraries, 594 archives, 1,199 branches)
- **RDF Triples**: 43,429
- **Ontologies Used**: 9 (CPOV, Schema.org, RICO, ORG, PROV-O, SKOS, Dublin Core, OWL, Heritage)
- **Wikidata Links**: 769 institutions (32.8%)
- **ISIL Codes**: 555 institutions (23.6%)
- **GHCID Identifiers**: 998 institutions (42.5%)

#### Coverage by Institution Type

| Type | Count | ISIL | GHCID | Wikidata |
|------|-------|------|-------|----------|
| **Main Libraries** | 555 | 100% | 78% | High |
| **Archives** | 594 | 0% (by design) | 95% | Moderate |
| **Library Branches** | 1,199 | Inherited | 0% (by design) | Low |

---

## Ontology Alignment

All RDF exports follow these international standards:

### Core Ontologies

1. **CPOV** (Core Public Organisation Vocabulary)
   - Namespace: `http://data.europa.eu/m8g/`
   - Usage: Public sector organization typing
   - Spec: https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/core-public-organisation-vocabulary

2. **Schema.org**
   - Namespace: `http://schema.org/`
   - Usage: Names, addresses, descriptions, types
   - Types: `schema:Library`, `schema:ArchiveOrganization`, `schema:Museum`
   - Spec: https://schema.org/
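   As a hedged illustration of how these first two vocabularies combine, a single institution record might look like this in Turtle (the URI is one of the W3ID examples later in this README; the label is illustrative):

   ```turtle
   @prefix schema: <http://schema.org/> .
   @prefix cpov:   <http://data.europa.eu/m8g/> .

   <https://w3id.org/heritage/custodian/dk/710100>
       a cpov:PublicOrganisation, schema:Library ;
       schema:name "Copenhagen Libraries"@en .
   ```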
3. **SKOS** (Simple Knowledge Organization System)
   - Namespace: `http://www.w3.org/2004/02/skos/core#`
   - Usage: Preferred/alternative labels
   - Spec: https://www.w3.org/TR/skos-reference/

### Specialized Ontologies

4. **RICO** (Records in Contexts Ontology)
   - Namespace: `https://www.ica.org/standards/RiC/ontology#`
   - Usage: Archival description (for archives)
   - Spec: https://www.ica.org/standards/RiC/ontology

5. **ORG** (W3C Organization Ontology)
   - Namespace: `http://www.w3.org/ns/org#`
   - Usage: Hierarchical relationships (library branches → main libraries)
   - Spec: https://www.w3.org/TR/vocab-org/

6. **PROV-O** (Provenance Ontology)
   - Namespace: `http://www.w3.org/ns/prov#`
   - Usage: Data provenance tracking
   - Spec: https://www.w3.org/TR/prov-o/

### Linking Ontologies

7. **OWL** (Web Ontology Language)
   - Namespace: `http://www.w3.org/2002/07/owl#`
   - Usage: Semantic equivalence (`owl:sameAs` for Wikidata links)
   - Spec: https://www.w3.org/TR/owl2-primer/

8. **Dublin Core Terms**
   - Namespace: `http://purl.org/dc/terms/`
   - Usage: Identifiers, descriptions, metadata
   - Spec: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

9. **Heritage (Project-Specific)**
   - Namespace: `https://w3id.org/heritage/custodian/`
   - Usage: GHCID identifiers, UUID properties
   - Spec: See `/docs/PERSISTENT_IDENTIFIERS.md`

---

## SPARQL Query Examples

### Query 1: Find all libraries in a specific city

```sparql
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
```

### Query 2: Find all institutions with Wikidata links

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
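  # The FILTER below keeps only owl:sameAs targets in the Wikidata
  # entity namespace; BIND then extracts the bare Q-identifier.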
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```

### Query 3: Find library hierarchies (parent-child branches)

```sparql
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>

SELECT ?parent ?parentName ?child ?childName WHERE {
  ?child org:subOrganizationOf ?parent .
  ?parent schema:name ?parentName .
  ?child schema:name ?childName .
}
LIMIT 100
```

### Query 4: Count institutions by type

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?type (COUNT(?inst) AS ?count) WHERE {
  ?inst a ?type .
  FILTER(?type IN (schema:Library, schema:ArchiveOrganization, schema:Museum))
}
GROUP BY ?type
```

### Query 5: Find institutions with Danish ISIL codes

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?inst ?name ?isil WHERE {
  ?inst schema:name ?name .
  ?inst dcterms:identifier ?isil .
  FILTER(STRSTARTS(?isil, "DK-"))
}
```

Note that, per the coverage table above, archives deliberately carry no ISIL codes, so this query matches main libraries (and branches via inherited codes).

### Query 6: Get provenance for all institutions

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?source WHERE {
  ?institution schema:name ?name .
  ?institution prov:wasGeneratedBy ?activity .
  ?activity dcterms:source ?source .
}
LIMIT 100
```

---

## Usage Examples

### Loading RDF with Python (rdflib)

```python
from rdflib import Graph

# Load Turtle format
g = Graph()
g.parse("denmark_complete.ttl", format="turtle")
print(f"Loaded {len(g)} triples")

# Query with SPARQL
qres = g.query("""
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
""")
for row in qres:
    print(row.name)
```

### Loading RDF with Apache Jena (Java)

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.query.*;

// Load RDF/XML format
Model model = ModelFactory.createDefaultModel();
model.read("denmark_complete.rdf");

// Query with SPARQL
String queryString = """
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
        ?inst a schema:Library .
        ?inst schema:name ?name .
    }
    LIMIT 10
    """;

Query query = QueryFactory.create(queryString);
QueryExecution qexec = QueryExecutionFactory.create(query, model);
ResultSet results = qexec.execSelect();
ResultSetFormatter.out(System.out, results, query);
```

### Loading JSON-LD with JavaScript

```javascript
const jsonld = require('jsonld');
const fs = require('fs');

// Load JSON-LD
const doc = JSON.parse(fs.readFileSync('denmark_complete.jsonld', 'utf8'));

// Serialize to N-Quads (one triple per line)
jsonld.toRDF(doc, {format: 'application/n-quads'}).then(nquads => {
  console.log(`Loaded ${nquads.trim().split('\n').length} triples`);
});
```

---

## Setting Up a SPARQL Endpoint

### Option 1: Apache Jena Fuseki (Open Source)

```bash
# Download Jena Fuseki
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.10.0.tar.gz
tar xzf apache-jena-fuseki-4.10.0.tar.gz
cd apache-jena-fuseki-4.10.0

# Start server with an in-memory, updatable dataset
./fuseki-server --update --mem /denmark

# Load data
curl -X POST http://localhost:3030/denmark/data \
  --data-binary @denmark_complete.ttl \
  -H "Content-Type: text/turtle"

# Query endpoint
curl -X POST http://localhost:3030/denmark/query \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
```

### Option 2: GraphDB (Free Edition)

1. Download GraphDB Free from https://www.ontotext.com/products/graphdb/download/
2. Install and start GraphDB
3. Create a new repository named "denmark"
4. Import `denmark_complete.ttl` via the web UI
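   As a hedged alternative to the web UI in step 4, GraphDB also exposes the standard RDF4J REST API, so the file can be loaded from the command line (repository name "denmark" from step 3; port 7200 is the GraphDB default):

   ```bash
   # Load the Turtle file into the "denmark" repository
   curl -X POST http://localhost:7200/repositories/denmark/statements \
     --data-binary @denmark_complete.ttl \
     -H "Content-Type: text/turtle"
   ```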
5. Query via the SPARQL interface at http://localhost:7200/sparql

---

## W3ID Persistent Identifiers

All institutions have persistent URIs following the pattern:

```
https://w3id.org/heritage/custodian/dk/{isil-or-id}
```

**Examples**:

- Royal Library: `https://w3id.org/heritage/custodian/dk/190101`
- Copenhagen Libraries: `https://w3id.org/heritage/custodian/dk/710100`
- Danish National Archives: `https://w3id.org/heritage/custodian/dk/archive/rigsarkivet`

**Content Negotiation** (when w3id.org registration complete):

```bash
# Get HTML representation
curl https://w3id.org/heritage/custodian/dk/710100

# Get Turtle RDF
curl -H "Accept: text/turtle" https://w3id.org/heritage/custodian/dk/710100

# Get JSON-LD
curl -H "Accept: application/ld+json" https://w3id.org/heritage/custodian/dk/710100
```

---

## Data Quality & Provenance

All RDF exports include **complete provenance metadata** using PROV-O:

```turtle
prov:wasGeneratedBy [
    a prov:Activity ;
    dcterms:source "ISIL_REGISTRY" ;
    prov:startedAtTime "2025-11-19T10:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2025-11-19T10:30:00Z"^^xsd:dateTime
] .
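# (PROV-O's prov:wasAssociatedWith could additionally link this activity
#  to the software agent that performed the extraction.)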
```

**Data Tier Classification** (see `AGENTS.md`):

- **TIER_1_AUTHORITATIVE**: Official registries (ISIL, national library databases)
- **TIER_2_VERIFIED**: Verified web scraping (Arkiv.dk)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap
- **TIER_4_INFERRED**: NLP-extracted from conversations

**Denmark Dataset**:

- Main libraries (555): TIER_1 (ISIL registry)
- Archives (594): TIER_2 (Arkiv.dk verified scraping)
- Wikidata links (769): TIER_3 (crowd-sourced)

---

## Validation

All RDF files have been validated using:

### Syntax Validation

```bash
# Turtle syntax check
rapper -i turtle -o ntriples denmark_complete.ttl > /dev/null

# RDF/XML syntax check
rapper -i rdfxml -o ntriples denmark_complete.rdf > /dev/null

# JSON-LD context validation
jsonld validate denmark_complete.jsonld
```

### Semantic Validation

- ✅ All URIs resolve to w3id.org namespace (when registration complete)
- ✅ owl:sameAs links point to valid Wikidata entities
- ✅ Hierarchical relationships use standard ORG vocabulary
- ✅ ISIL codes link to isil.org registry
- ✅ GHCID identifiers follow project specification

---

## Citation

If you use this dataset in research, please cite:

```bibtex
@dataset{danish_glam_rdf_2025,
  author  = {GLAM Extractor Project},
  title   = {Danish Heritage Institutions Linked Open Data},
  year    = {2025},
  month   = {November},
  version = {1.0},
  url     = {https://github.com/yourusername/glam-extractor},
  note    = {2,348 institutions (555 libraries, 594 archives, 1,199 branches), 43,429 RDF triples}
}
```

---

## Related Documentation

- **Project README**: `/README.md`
- **LinkML Schema**: `/schemas/heritage_custodian.yaml`
- **Persistent Identifiers**: `/docs/PERSISTENT_IDENTIFIERS.md`
- **Ontology Extensions**: `/docs/ONTOLOGY_EXTENSIONS.md`
- **Denmark Session Summary**: `/SESSION_SUMMARY_20251119_RDF_WIKIDATA_COMPLETE.md`

---

## Contributing

To add new country datasets or improve existing RDF exports:

1. Follow ontology alignment guidelines in `/docs/ONTOLOGY_EXTENSIONS.md`
2. Use the RDF exporter template: `/scripts/export_denmark_rdf.py`
3. Validate with SPARQL queries before publishing
4. Update this README with new dataset statistics

---

## License

This data is published under **CC0 1.0 Universal (Public Domain)**. You may use, modify, and distribute it freely without restriction.

Individual institution data may be subject to different licenses from the source registries. Consult:

- Danish ISIL Registry: https://slks.dk/isil
- Arkiv.dk: https://arkiv.dk
- Wikidata: CC0 (https://www.wikidata.org/wiki/Wikidata:Data_access#Licensing)

---

**Last Updated**: 2025-11-19
**Maintainer**: GLAM Extractor Project
**Contact**: [GitHub Issues](https://github.com/yourusername/glam-extractor/issues)