# Heritage Custodian Ontology Integration Design
**Version**: 0.2.0
**Date**: 2025-11-05
**Status**: DRAFT - Awaiting subagent analysis
## Executive Summary
This document outlines the design for integrating multiple established ontologies into the Heritage Custodian LinkML schema:
- **TOOI** (Dutch government organizational ontology)
- **CPOV** (Core Public Organization Vocabulary - EU)
- **Schema.org** (web semantics)
- **EDM** (Europeana Data Model)
- **PROV-O** (W3C Provenance Ontology)
## Key Ontology Patterns Identified
### 1. TOOI Temporal Model
**Source**: `data/ontology/tooiont.ttl`
TOOI tracks organizational changes over time by pairing PROV-O timestamps with calendar-date properties and by modelling each change as an explicit event:
#### Core Classes
```turtle
tooi:Overheidsorganisatie
  - rdfs:subClassOf org:FormalOrganization
  - rdfs:subClassOf prov:Agent
  - rdfs:subClassOf prov:Entity
  - rdfs:subClassOf prov:Organization
```
**Key Properties**:
- `prov:generatedAtTime` - when organization was founded
- `prov:invalidatedAtTime` - when organization ceased to exist
- `tooi:begindatum` - first calendar day organization existed (date component of generatedAtTime)
- `tooi:einddatum` - last calendar day organization existed (date component of invalidatedAtTime)
- `tooi:afkorting` - abbreviation
- `tooi:alternatieveNaam` - alternative names (multivalued)
- `dcterms:isPartOf` - parent organization (recursive)
- `tooi:organisatiecode` - unique organization code
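
The relationship between the timestamp properties and the date properties can be illustrated with a short Python sketch. The helper name `date_component` and the sample values are hypothetical, and the sketch assumes the calendar day is read in the timestamp's own UTC offset; the exact derivation rule TOOI applies is not stated here.

```python
from datetime import date, datetime


def date_component(timestamp: str) -> date:
    """Return the calendar date of an ISO 8601 timestamp (hypothetical helper)."""
    # Replace a trailing "Z" so the string also parses on Python versions before 3.11.
    normalized = timestamp.replace("Z", "+00:00")
    return datetime.fromisoformat(normalized).date()


# Illustrative values only, not taken from TOOI data.
begindatum = date_component("2020-01-01T00:00:00Z")
einddatum = date_component("2023-06-30T23:59:59+02:00")
```

If the date properties are instead defined in a fixed local time zone, the timestamp would need converting to that zone before the date is taken.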
#### Change Event Model
```turtle
tooi:Wijzigingsgebeurtenis (ChangeEvent)
  - rdfs:subClassOf prov:Activity
  - Properties:
    - tooi:tijdstipWijziging (dateTime when change formally occurred)
    - tooi:tijdstipRegistratie (dateTime when change was registered)
    - tooi:redenWijziging (reason for change)
    - tooi:heeftJuridischeGrondslag (legal basis for change)

tooi:ExistentieleWijziging (ExistentialChange)
  - rdfs:subClassOf tooi:Wijzigingsgebeurtenis
  - Subtypes:
    - tooi:Afsplitsing (Split/spinoff)
    - tooi:Fusie (Merger)
    - tooi:Opheffing (Dissolution)
```
**Design Implication**: Our current `ghcid_history` uses a simple list of history entries. We should consider adding a `ChangeEvent` class that follows TOOI's pattern of linking changes to PROV-O activities.
### 2. CPOV Public Organization Model
**Source**: `data/ontology/core-public-organisation-ap.ttl`
CPOV models public sector organizations with the aim of interoperability across the EU:
#### Core Classes
```turtle
cpov:PublicOrganisation
  - rdfs:subClassOf org:Organization
  - Represents government/public heritage organizations

cpov:ContactPoint
  - Properties:
    - cpov:email
    - cpov:telephone
    - cpov:contactPage (foaf:Document)
```
**Design Implication**: Our `ContactInfo` class aligns well with `cpov:ContactPoint`. We should add a `class_uri: cpov:ContactPoint` mapping to it.
### 3. PROV-O Provenance Model
Both TOOI and CPOV heavily use PROV-O for temporal and provenance tracking:
- `prov:Entity` - things with provenance
- `prov:Activity` - activities that affect entities
- `prov:Agent` - agents responsible for activities
- `prov:generatedAtTime` - when entity was created
- `prov:invalidatedAtTime` - when entity ceased to be valid
- `prov:wasGeneratedBy` - links entity to creating activity
- `prov:wasInvalidatedBy` - links entity to ending activity
**Design Implication**: We should make our `HeritageCustodian` a subclass of `prov:Entity` and use PROV-O properties for temporal tracking.
## Proposed Schema Extensions
### New Classes to Add
#### 1. ChangeEvent
Models organizational changes over time (inspired by TOOI):
```yaml
ChangeEvent:
  description: >-
    An event that changed the state of a heritage custodian organization
    (e.g., founding, closure, relocation, name change, merger, split).
    Based on tooi:Wijzigingsgebeurtenis pattern.
  class_uri: prov:Activity
  mixins:
    - tooi:Wijzigingsgebeurtenis
  slots:
    - change_type            # founding, closure, relocation, rename, merger, split
    - effective_date         # when change formally occurred (tooi:tijdstipWijziging)
    - registration_date      # when change was recorded (tooi:tijdstipRegistratie)
    - reason                 # reason for change (tooi:redenWijziging)
    - legal_basis            # legal document/regulation (tooi:heeftJuridischeGrondslag)
    - affected_organization  # link to HeritageCustodian
    - resulting_ghcid        # new GHCID after this change
    - previous_ghcid         # GHCID before this change
```
#### 2. OrganizationalUnit
For departments/branches of larger institutions:
```yaml
OrganizationalUnit:
  description: >-
    A unit, department, or branch within a larger heritage custodian organization.
  class_uri: org:OrganizationalUnit
  is_a: HeritageCustodian
  slots:
    - unit_type    # department, branch, division, section
    - parent_unit  # recursive
```
### Properties to Enhance
#### Temporal Properties
Add PROV-O temporal tracking to `HeritageCustodian`:
```yaml
# In HeritageCustodian class
slots:
  - prov_generated_at    # maps to prov:generatedAtTime
  - prov_invalidated_at  # maps to prov:invalidatedAtTime
  - change_history       # list of ChangeEvent instances

# Slot definitions
prov_generated_at:
  description: Timestamp when organization was formally founded
  range: datetime
  slot_uri: prov:generatedAtTime

prov_invalidated_at:
  description: Timestamp when organization ceased to exist
  range: datetime
  slot_uri: prov:invalidatedAtTime

change_history:
  description: Historical record of changes to this organization
  range: ChangeEvent
  multivalued: true
  inlined: true
  inlined_as_list: true
```
#### Name Properties (TOOI-inspired)
```yaml
official_name:
  description: Official legal name of the organization
  range: string
  slot_uri: tooi:officieleNaamInclSoort

sorting_name:
  description: Name formatted for alphabetical sorting
  range: string
  slot_uri: tooi:officieleNaamSorteer

abbreviation:
  description: Official abbreviation or acronym
  range: string
  slot_uri: tooi:afkorting
```
### Ontology Mappings to Update
#### HeritageCustodian
```yaml
HeritageCustodian:
  class_uri: org:Organization
  mixins:
    - prov:Entity                # Add PROV-O provenance tracking
    - tooi:Overheidsorganisatie  # For Dutch institutions
    - cpov:PublicOrganisation    # For government institutions
    - schema:Organization        # For Schema.org compatibility
```
#### ContactInfo
```yaml
ContactInfo:
  class_uri: cpov:ContactPoint
  exact_mappings:
    - schema:ContactPoint
```
### Enumerations to Add
#### ChangeTypeEnum
```yaml
ChangeTypeEnum:
  description: Types of organizational changes
  permissible_values:
    FOUNDING:
      description: Organization was founded
      meaning: tooi:Oprichting
    CLOSURE:
      description: Organization ceased operations
      meaning: tooi:Opheffing
    MERGER:
      description: Organization merged with another
      meaning: tooi:Fusie
    SPLIT:
      description: Organization split into multiple entities
      meaning: tooi:Afsplitsing
    RELOCATION:
      description: Organization moved to new location
    NAME_CHANGE:
      description: Organization changed its name
    TYPE_CHANGE:
      description: Institution type changed
    STATUS_CHANGE:
      description: Operational status changed
```
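
Only four of the eight permissible values carry a TOOI `meaning`, so consuming code has to cope with values that have no ontology term. The following is a hand-written Python sketch of that situation, not the output of a LinkML generator; the names `ChangeType`, `TOOI_MEANING` and `tooi_meaning` are hypothetical.

```python
from enum import Enum
from typing import Optional


class ChangeType(str, Enum):
    """Hand-written mirror of the permissible values of ChangeTypeEnum."""

    FOUNDING = "FOUNDING"
    CLOSURE = "CLOSURE"
    MERGER = "MERGER"
    SPLIT = "SPLIT"
    RELOCATION = "RELOCATION"
    NAME_CHANGE = "NAME_CHANGE"
    TYPE_CHANGE = "TYPE_CHANGE"
    STATUS_CHANGE = "STATUS_CHANGE"


# Only the values that declare a `meaning` in ChangeTypeEnum appear here.
TOOI_MEANING = {
    ChangeType.FOUNDING: "tooi:Oprichting",
    ChangeType.CLOSURE: "tooi:Opheffing",
    ChangeType.MERGER: "tooi:Fusie",
    ChangeType.SPLIT: "tooi:Afsplitsing",
}


def tooi_meaning(change_type: ChangeType) -> Optional[str]:
    """Return the TOOI CURIE for a change type, or None if it has no mapping."""
    return TOOI_MEANING.get(change_type)
```

Keeping the mapping in a separate dictionary avoids giving several enum members the same value, which Python's `Enum` would treat as aliases of one member.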
## Integration with GHCID System
The GHCID system already tracks identifier changes via `ghcid_history`. We should:
1. **Keep `ghcid_history`** as-is (simple, functional)
2. **Add `change_history`** for richer semantic change tracking
3. **Link the two**: Each `GHCIDHistoryEntry` should reference a `ChangeEvent` if applicable
### Example Mapping
```yaml
# Simple GHCID history (current system)
ghcid_history:
  - ghcid: "NL-NH-AMS-M-RM"
    ghcid_numeric: 12345678901234567890
    valid_from: "2020-01-01T00:00:00Z"
    valid_to: null
    reason: "Initial identifier"
    institution_name: "Rijksmuseum"
    location_city: "Amsterdam"
    location_country: "NL"

# Rich semantic change history (new system)
change_history:
  - change_type: FOUNDING
    effective_date: "1800-11-19T00:00:00Z"
    registration_date: "2020-01-01T00:00:00Z"
    reason: "Founded as national art museum"
    resulting_ghcid: "NL-NH-AMS-M-RM"
    affected_organization: "https://example.org/custodian/12345"
```
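
One possible way to link the two histories is to look up the `ChangeEvent` whose `resulting_ghcid` equals the `ghcid` of a history entry. The sketch below works on plain dictionaries shaped like the example above; the function name `find_change_event` and the matching rule itself are assumptions, since this design does not yet fix how the cross-reference is stored.

```python
from typing import Optional


def find_change_event(history_entry: dict, change_history: list) -> Optional[dict]:
    """Return the first change event whose resulting_ghcid matches the entry's ghcid."""
    for event in change_history:
        if event.get("resulting_ghcid") == history_entry["ghcid"]:
            return event
    # "If applicable": an entry may have no corresponding change event.
    return None


ghcid_history = [{"ghcid": "NL-NH-AMS-M-RM", "reason": "Initial identifier"}]
change_history = [
    {"change_type": "FOUNDING", "resulting_ghcid": "NL-NH-AMS-M-RM"},
    {"change_type": "NAME_CHANGE", "resulting_ghcid": None},
]

linked = find_change_event(ghcid_history[0], change_history)
```

The sketch returns only the first match; if several events can yield the same GHCID, an explicit reference stored on the history entry would be less ambiguous.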
## EDM Aggregator/Provider Pattern
*(Awaiting subagent analysis of EDM conversations)*
Expected patterns:
- `edm:ProvidedCHO` (Cultural Heritage Object)
- `edm:WebResource` (digital representation)
- `edm:Agent` (provider organization)
- `ore:Aggregation` (metadata aggregation)
## Namespace Strategy
### Recommended Approach
1. **Create our own namespace**: `https://w3id.org/heritage/custodian/`
2. **Reuse existing properties** via `slot_uri` mappings
3. **Define custom properties** only when no suitable property exists
### Prefix Registry
```yaml
prefixes:
  heritage: https://w3id.org/heritage/custodian/
  tooi: https://identifier.overheid.nl/tooi/def/ont/
  cpov: http://data.europa.eu/m8g/
  org: http://www.w3.org/ns/org#
  prov: http://www.w3.org/ns/prov#
  schema: http://schema.org/
  rico: https://www.ica.org/standards/RiC/ontology#
  edm: http://www.europeana.eu/schemas/edm/
  ore: http://www.openarchives.org/ore/terms/
```
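
To make concrete how a `slot_uri` such as `prov:generatedAtTime` resolves against this registry, here is a minimal Python sketch of CURIE expansion. The `PREFIXES` dictionary copies a subset of the registry above; the function name `expand_curie` is hypothetical, and real tooling would normally delegate this to its own prefix handling.

```python
# Subset of the prefix registry above, copied by hand for illustration.
PREFIXES = {
    "heritage": "https://w3id.org/heritage/custodian/",
    "tooi": "https://identifier.overheid.nl/tooi/def/ont/",
    "cpov": "http://data.europa.eu/m8g/",
    "org": "http://www.w3.org/ns/org#",
    "prov": "http://www.w3.org/ns/prov#",
    "schema": "http://schema.org/",
}


def expand_curie(curie: str) -> str:
    """Expand a prefix:local_name CURIE into a full URI (hypothetical helper)."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"Unknown or missing prefix in CURIE: {curie!r}")
    return PREFIXES[prefix] + local_name


generated_at_uri = expand_curie("prov:generatedAtTime")
abbreviation_uri = expand_curie("tooi:afkorting")
```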
## Validation Strategy
### SHACL Constraints
TOOI uses SHACL extensively for validation. We should:
1. **Generate SHACL from LinkML** using `gen-shacl`
2. **Define custom constraints** for:
- GHCID format validation
- Date consistency (founded_date < closed_date)
- Identifier uniqueness
- Geographic coordinate validation
### Example SHACL (conceptual)
```turtle
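@prefix heritage: <https://w3id.org/heritage/custodian/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

heritage:HeritageCustodianShape
    a sh:NodeShape ;
    sh:targetClass heritage:HeritageCustodian ;
    sh:property [
        sh:path heritage:ghcid_current ;
        sh:pattern "^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$" ;
    ] ;
    sh:property [
        sh:path prov:generatedAtTime ;
        sh:maxCount 1 ;
        sh:datatype xsd:dateTime ;
    ] .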
```
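
Two of the custom constraints, GHCID format and date consistency, can also be checked outside SHACL. The Python sketch below reuses the pattern from the shape above; the function names `is_valid_ghcid` and `dates_consistent` are hypothetical, and treating a missing closing date as consistent is an assumption.

```python
import re
from datetime import datetime
from typing import Optional

# Same pattern as the sh:pattern in the conceptual shape.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
)


def is_valid_ghcid(ghcid: str) -> bool:
    """Check a GHCID string against the format pattern."""
    return GHCID_PATTERN.fullmatch(ghcid) is not None


def dates_consistent(founded: datetime, closed: Optional[datetime]) -> bool:
    """True if there is no closing date or founding is strictly before closing."""
    # Assumption: an organization that is still open (closed is None) is consistent.
    return closed is None or founded < closed
```

For example, `is_valid_ghcid("NL-NH-AMS-M-RM")` accepts the identifier used in the GHCID example, while a lowercase variant is rejected. Both arguments to `dates_consistent` should be either timezone-aware or naive, since Python refuses to compare the two kinds.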
## Implementation Roadmap
### Phase 1: Core Extensions (Current Priority)
- [ ] Add `ChangeEvent` class
- [ ] Add PROV-O temporal properties to `HeritageCustodian`
- [ ] Add `ChangeTypeEnum`
- [ ] Update `class_uri` and `slot_uri` mappings
- [ ] Create example instances
### Phase 2: TOOI Integration
- [ ] Add `OrganizationalUnit` class
- [ ] Add Dutch-specific TOOI properties
- [ ] Implement TOOI name variants (official, preferred, sorting)
- [ ] Add legal basis tracking
### Phase 3: EDM Integration
- [ ] Add aggregator/provider relationship model
- [ ] Add collection digitization tracking
- [ ] Add EDM-specific metadata
### Phase 4: Validation
- [ ] Generate SHACL constraints from LinkML
- [ ] Implement custom validators
- [ ] Create validation test suite
## Open Questions
1. **Class hierarchy**: Should `HeritageCustodian` use `is_a` or `mixins` for multiple ontology mappings?
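- **Recommendation**: Use `mixins` to avoid diamond inheritance issues. Note that LinkML `is_a` and `mixins` both take the names of classes defined in the schema, so external terms such as `prov:Entity` would need either local mixin classes that carry the matching `class_uri`, or mapping slots such as `close_mappings`.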
2. **Temporal model**: Dual tracking (`founded_date` vs `prov:generatedAtTime`)?
- **Recommendation**: Keep both - `founded_date` for simple queries, PROV-O for semantic interoperability
3. **Change events**: Link to `ghcid_history` or keep separate?
- **Recommendation**: Keep separate but allow optional cross-references
4. **Dutch-specific fields**: In base class or subclass?
- **Current approach**: `DutchHeritageCustodian` subclass
## References
- TOOI Ontology: `data/ontology/tooiont.ttl`
- CPOV Ontology: `data/ontology/core-public-organisation-ap.ttl`
- W3C PROV-O: https://www.w3.org/TR/prov-o/
- W3C Org Ontology: https://www.w3.org/TR/vocab-org/
- LinkML Docs: https://linkml.io/linkml/
---
**Next Steps**:
1. Wait for subagent analysis of ontology conversations
2. Refine design based on subagent findings
3. Implement `heritage_custodian_extended.yaml`
4. Create example instances
5. Validate with LinkML tools