# Ontology Extensions and Schema Evolution
This document tracks extensions to the Heritage Custodian LinkML schema based on real-world data extraction findings. All extensions are mapped to base ontologies (CIDOC-CRM, Schema.org, RiC-O, etc.) to maintain semantic interoperability.
## Version History
| Version | Date | Description |
|---------|------|-------------|
| 0.2.1 | 2025-11-09 | Added LEARNING_MANAGEMENT to DigitalPlatformTypeEnum (Libyan extraction) |
| 0.2.0 | 2025-11-05 | Modular schema reorganization |
---
## Extensions Log
### 2025-11-09: LEARNING_MANAGEMENT Platform Type
**Schema File**: `schemas/enums.yaml`
**Enum**: `DigitalPlatformTypeEnum`
**Added Value**: `LEARNING_MANAGEMENT`
#### Gap Identified
During extraction of Libyan heritage institutions, three universities (Misurata University, Benghazi University, and the University of Tripoli) were found to use learning management systems (Google Classroom, Moodle) for heritage education and digital resource delivery. The existing `DigitalPlatformTypeEnum` had no suitable category for LMS platforms.
**Source Data**:
- `data/instances/libya_universities_batch1.json` (lines 78, 190, 286)
  - Misurata University: Google Classroom Integration
  - Benghazi University: Moodle platform for heritage courses
  - University of Tripoli: Moodle integration
**Original Schema Coverage**:
- COLLECTION_MANAGEMENT ❌ (too specific - for museum/archive systems)
- DIGITAL_REPOSITORY ❌ (for digital preservation, not learning)
- DISCOVERY_PORTAL ❌ (for search/discovery, not education)
- WEBSITE ❌ (too generic)
- GENERIC ❌ (too generic, loses semantic meaning)
#### Proposal
Add `LEARNING_MANAGEMENT` to `DigitalPlatformTypeEnum`:
```yaml
LEARNING_MANAGEMENT:
  description: Learning management systems for heritage education (Moodle, Google Classroom, Blackboard, Canvas)
  meaning: schema:LearningResource
```
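As a rough illustration of what the new value enables, the sketch below checks extracted platform records against the enum. It mirrors only the six values named in this document (the real enum in `schemas/enums.yaml` may contain more), and the record shape and the `invalid_platform_types` helper are assumptions made for the example.

```python
from enum import Enum


class DigitalPlatformTypeEnum(str, Enum):
    """Values named in this document; the real enum may hold more."""
    COLLECTION_MANAGEMENT = "COLLECTION_MANAGEMENT"
    DIGITAL_REPOSITORY = "DIGITAL_REPOSITORY"
    DISCOVERY_PORTAL = "DISCOVERY_PORTAL"
    WEBSITE = "WEBSITE"
    GENERIC = "GENERIC"
    LEARNING_MANAGEMENT = "LEARNING_MANAGEMENT"  # added in 0.2.1


def invalid_platform_types(platforms):
    """Return the platform_type strings that are not permissible enum values."""
    allowed = {member.value for member in DigitalPlatformTypeEnum}
    return [p["platform_type"] for p in platforms if p["platform_type"] not in allowed]


# Hypothetical records shaped like the Libyan extraction output.
platforms = [
    {"platform_name": "Google Classroom Integration", "platform_type": "LEARNING_MANAGEMENT"},
    {"platform_name": "Moodle", "platform_type": "LEARNING_MANAGEMENT"},
    {"platform_name": "Unknown system", "platform_type": "LMS"},  # not an enum value
]
print(invalid_platform_types(platforms))
```

Only the ad hoc `LMS` value is reported; the two `LEARNING_MANAGEMENT` records pass.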
#### Ontology Mapping
**Base Ontology**: Schema.org
**Class**: `schema:LearningResource`
**Reference**: https://schema.org/LearningResource
**RDF Serialization**:
```turtle
@prefix schema: <http://schema.org/> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/ly/misurata-lms>
    a heritage:DigitalPlatform, schema:LearningResource ;
    heritage:platform_name "Google Classroom Integration" ;
    heritage:platform_type "LEARNING_MANAGEMENT" ;
    schema:isPartOf <https://w3id.org/heritage/custodian/ly/misurata-university> .
```
#### Use Cases
1. **Heritage Education Tracking**: Document how institutions deliver heritage education digitally
2. **Platform Integration Mapping**: Identify which LMS platforms are used in the heritage sector
3. **E-Learning Resource Discovery**: Enable discovery of heritage learning platforms
4. **Digital Pedagogy Research**: Support research on digital heritage education methods
#### Implementation
**Status**: ✅ Implemented (2025-11-09)
**Affected Files**:
- `schemas/enums.yaml` (lines 191-212, added LEARNING_MANAGEMENT at line 208)
**Validation**:
- Libyan extraction data now validates correctly
- 3 institutions using LEARNING_MANAGEMENT platform type
**Backward Compatibility**:
- New enum value is additive (non-breaking change)
- Existing data unaffected
- Future extractions can use new value
#### Related Work
**Similar Patterns in Other Domains**:
- Schema.org `schema:Course` - For structured course information
- LTI (Learning Tools Interoperability) - Standard for LMS integration
- LRMI (Learning Resource Metadata Initiative) - Metadata for learning resources
**Future Extensions**:
- Consider adding `course_url` slot to DigitalPlatform for linking to specific courses
- May need `MetadataStandardEnum` value for LRMI if heritage institutions adopt it
---
## Integrating TOOI and CPOV Ontologies
The GLAM project builds on two foundational ontologies for organizational data modeling. **AI agents should always consult these ontologies** when designing extraction pipelines or extending the schema.
### TOOI - Dutch Government Organizational Ontology
**File**: `/data/ontology/tooiont.ttl`
**Namespace**: `https://identifier.overheid.nl/tooi/def/ont/`
**Purpose**: Model Dutch government organizations, their lifecycle events, and temporal changes
**Key Classes**:
- `tooi:Overheidsorganisatie` - Government organization (base for `DutchHeritageCustodian`)
- `tooi:Wijzigingsgebeurtenis` - Change event (merger, split, closure)
- `tooi:organisatieIdentificatie` - Organizational identifiers
**Key Properties**:
- `tooi:officieleNaamInclSoort` - Official name including organizational type
- `tooi:begindatum` - Start date (founding, change effective date)
- `tooi:einddatum` - End date (closure, change expiry)
- `tooi:resultaat` - Resulting organization from change event
- `tooi:voorafgaandeOrganisatie` - Predecessor organization
**PROV-O Integration**:
TOOI uses PROV-O (W3C Provenance Ontology) for temporal tracking:
- Change events as `prov:Activity`
- Organizations linked via `prov:wasInfluencedBy` and `prov:generated`
- Temporal bounds via `prov:atTime`
**Heritage Custodian Mapping**:
```yaml
# LinkML schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # Maps to TOOI base class
  slots:
    - isil_code       # Maps to tooi:organisatieIdentificatie
    - change_history  # Maps to tooi:Wijzigingsgebeurtenis
```
**RDF Serialization Example**:
```turtle
@prefix tooi: <https://identifier.overheid.nl/tooi/def/ont/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/nl/noord-hollands-archief>
    a tooi:Overheidsorganisatie, heritage:HeritageCustodian ;
    tooi:officieleNaamInclSoort "Noord-Hollands Archief" ;
    tooi:begindatum "2001-01-01"^^xsd:date ;
    heritage:institution_type "ARCHIVE" ;
    heritage:isil_code "NL-HlmNHA" .

# Change event: Merger of two archives
<https://w3id.org/heritage/custodian/event/nha-merger-2001>
    a tooi:Wijzigingsgebeurtenis, prov:Activity ;
    prov:atTime "2001-01-01T00:00:00Z"^^xsd:dateTime ;
    tooi:resultaat <https://w3id.org/heritage/custodian/nl/noord-hollands-archief> ;
    tooi:voorafgaandeOrganisatie
        <https://w3id.org/heritage/custodian/nl/gemeentearchief-haarlem>,
        <https://w3id.org/heritage/custodian/nl/rijksarchief-noord-holland> ;
    heritage:change_type "MERGER" ;
    heritage:event_description "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland" .
```
**When to Use TOOI**:
- ✅ Extracting Dutch heritage institutions (government archives, state museums)
- ✅ Modeling mergers, splits, reorganizations of Dutch organizations
- ✅ Tracking historical changes to organizational structure
- ✅ Integrating with Dutch national registries (ISIL, KvK)
- ❌ Non-Dutch institutions (use CPOV instead)
- ❌ Private collections without government affiliation
---
### CPOV - EU Core Public Organisation Vocabulary
**Files**:
- `/data/ontology/core-public-organisation-ap.ttl` (RDF schema)
- `/data/ontology/core-public-organisation-ap.jsonld` (JSON-LD context)
**Namespace**: `http://data.europa.eu/m8g/`
**Purpose**: EU-wide vocabulary for public sector organizations (governments, NGOs, cultural institutions)
**Key Classes**:
- `cpov:PublicOrganisation` - Any public-sector organization (base for global heritage custodians)
- `cv:ChangeEvent` - Organizational change (founding, closure, name change)
- `cv:ContactPoint` - Contact information for public services
- `locn:Address` - Physical location details
**Key Properties**:
- `dct:identifier` - Formal identifier (ISIL, national registry ID)
- `skos:prefLabel` - Preferred name
- `skos:altLabel` - Alternative names
- `dct:temporal` - Temporal coverage (founding to closure)
- `cv:contactPoint` - Contact details
- `locn:address` - Physical address
**W3C Org Ontology Integration**:
CPOV builds on W3C Organization Ontology:
- `org:Organization` - Base organizational structure
- `org:hasUnit` - Hierarchical relationships (parent-child)
- `org:linkedTo` - Partnerships, networks
- `org:changedBy` - Change events affecting organization
**Heritage Custodian Mapping**:
```yaml
# LinkML schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # Maps to CPOV for EU-wide interoperability
  slots:
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier
    locations:
      slot_uri: locn:address
    change_history:
      slot_uri: org:changedBy  # Property; its values are cv:ChangeEvent instances
```
**RDF Serialization Example**:
```turtle
@prefix cpov: <http://data.europa.eu/m8g/> .
@prefix cv: <http://data.europa.eu/m8g/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/br/biblioteca-nacional>
    a cpov:PublicOrganisation ;
    skos:prefLabel "Biblioteca Nacional do Brasil"@pt ;
    skos:altLabel "National Library of Brazil"@en, "BNB"@pt ;
    dct:identifier [
        a adms:Identifier ;  # Dublin Core Terms defines no Identifier class; ADMS does
        skos:notation "BR-RjBN" ;
        dct:creator "International Standard Identifier for Libraries and Related Organisations"
    ] ;
    locn:address [
        a locn:Address ;
        locn:thoroughfare "Avenida Rio Branco, 219" ;
        locn:postCode "20040-008" ;
        locn:adminUnitL2 "Rio de Janeiro" ;
        locn:adminUnitL1 "BR"
    ] ;
    dct:temporal [
        schema:startDate "1810-01-01"^^xsd:date
    ] .

# Change event: Founding
<https://w3id.org/heritage/custodian/event/bnb-founding>
    a cv:ChangeEvent ;
    dct:date "1810-01-01"^^xsd:date ;
    dct:type "FOUNDING" ;
    dct:description "Founded by Prince Regent João (later King João VI of Portugal) as Royal Library"@en ;
    cv:changedOrganisation <https://w3id.org/heritage/custodian/br/biblioteca-nacional> .
```
**When to Use CPOV**:
- ✅ Extracting non-Dutch European heritage institutions (France, Germany, Belgium, etc.)
- ✅ Modeling public-sector cultural organizations (national museums, state archives)
- ✅ Linked Open Data alignment with heritage aggregators (Europeana in the EU; DPLA as its US counterpart)
- ✅ Cross-border organizational relationships (EU heritage networks)
- ⚠️ Global institutions outside EU (use CPOV patterns but add regional ontologies)
- ❌ Purely private collections (consider Schema.org `schema:Organization` instead)
---
### Ontology Decision Tree for AI Agents
When designing extraction pipelines, choose the appropriate ontology:
```
Is the institution Dutch?
├─ YES → Use TOOI (tooi:Overheidsorganisatie)
│        Map to schemas/dutch.yaml
│        Extract ISIL codes, KvK numbers
└─ NO → Is the institution in the EU?
        ├─ YES → Use CPOV (cpov:PublicOrganisation)
        │        Map to schemas/core.yaml
        │        Extract EU-standard identifiers
        └─ NO → Use CPOV patterns + regional ontologies
                Example: Brazilian institutions → CPOV + national heritage codes
                Fallback to Schema.org for private/informal collections
```
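The decision tree can be sketched as a small function. This is a minimal illustration, not project code: the `choose_ontology` name, the `is_public` flag, and the returned labels are assumptions, and checking for private collections first is one reading of the fallback rule above.

```python
# ISO 3166-1 alpha-2 codes of the 27 EU member states (as of 2025).
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR", "HU", "IE",
    "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK", "SI", "ES", "SE",
})


def choose_ontology(country_code: str, is_public: bool = True) -> str:
    """Walk the decision tree and return a label for the suggested ontology."""
    if not is_public:
        # Private or informal collections fall back to Schema.org.
        return "Schema.org"
    code = country_code.upper()
    if code == "NL":
        return "TOOI"  # tooi:Overheidsorganisatie, schemas/dutch.yaml
    if code in EU_MEMBER_STATES:
        return "CPOV"  # cpov:PublicOrganisation, schemas/core.yaml
    return "CPOV+regional"  # CPOV patterns plus regional ontologies


for code in ("NL", "FR", "BR"):
    print(code, "->", choose_ontology(code))
```

Returning a plain label keeps the sketch independent of any schema loader; a real pipeline would presumably map the label to the corresponding schema module.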
**Combining Ontologies**:
A single institution can be typed with multiple ontology classes at once:
```turtle
<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,  # Dutch government organization
      cpov:PublicOrganisation,    # EU public sector
      schema:Museum,              # Schema.org for web discoverability
      crm:E74_Group ;             # CIDOC-CRM for cultural heritage domain
    ...
```
---
### Practical Extraction Workflow
**Step 1: Read Ontology Files**
Before designing extraction logic, review:
```bash
# Dutch institutions
grep -A 10 "tooi:Overheidsorganisatie" /data/ontology/tooiont.ttl
# EU/global institutions
grep -A 10 "cpov:PublicOrganisation" /data/ontology/core-public-organisation-ap.ttl
# JSON-LD context for CPOV
cat /data/ontology/core-public-organisation-ap.jsonld
```
**Step 2: Map Conversation Data to Ontology Classes**
Identify which ontology properties correspond to extracted data:
| Extracted Data | TOOI Property | CPOV Property | Schema.org |
|----------------|---------------|---------------|------------|
| Institution name | `tooi:officieleNaamInclSoort` | `skos:prefLabel` | `schema:name` |
| Founding date | `tooi:begindatum` | `dct:temporal` with `schema:startDate` | `schema:foundingDate` |
| ISIL code | `tooi:organisatieIdentificatie` | `dct:identifier` | `schema:identifier` |
| Address | (use `locn:Address`) | `locn:address` | `schema:address` |
| Merger event | `tooi:Wijzigingsgebeurtenis` | `cv:ChangeEvent` | `schema:Event` |
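The first three rows of the table can be expressed as a lookup that renames extracted fields to the chosen ontology's terms. This is a sketch under assumptions: the field keys (`name`, `founding_date`, `isil_code`) and the `to_properties` helper are hypothetical, and the address and merger rows are left out because they map to classes rather than simple properties.

```python
# Field -> property per ontology, copied from the mapping table.
PROPERTY_MAP = {
    "name": {
        "tooi": "tooi:officieleNaamInclSoort",
        "cpov": "skos:prefLabel",
        "schema": "schema:name",
    },
    "founding_date": {
        "tooi": "tooi:begindatum",
        "cpov": "schema:startDate",  # nested under dct:temporal in the CPOV example
        "schema": "schema:foundingDate",
    },
    "isil_code": {
        "tooi": "tooi:organisatieIdentificatie",
        "cpov": "dct:identifier",
        "schema": "schema:identifier",
    },
}


def to_properties(record: dict, ontology: str) -> dict:
    """Rename extracted field keys to ontology terms; unmapped fields are skipped."""
    return {
        PROPERTY_MAP[field][ontology]: value
        for field, value in record.items()
        if field in PROPERTY_MAP
    }


record = {
    "name": "Noord-Hollands Archief",
    "founding_date": "2001-01-01",
    "isil_code": "NL-HlmNHA",
    "notes": "field with no mapping",
}
print(to_properties(record, "tooi"))
```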
**Step 3: Generate RDF-Compatible LinkML**
LinkML YAML automatically maps to RDF when `class_uri` and `slot_uri` are defined:
```yaml
# Extraction output (LinkML YAML)
- id: https://w3id.org/heritage/custodian/nl/amsterdam-museum
  name: Amsterdam Museum
  institution_type: MUSEUM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdAM
  locations:
    - city: Amsterdam
      country: NL
  change_history:
    - event_id: https://w3id.org/heritage/custodian/event/am-renaming-2011
      change_type: NAME_CHANGE
      event_date: "2011-01-01"
      event_description: "Renamed from Amsterdams Historisch Museum to Amsterdam Museum"
```
**Step 4: Export to RDF**
LinkML automatically serializes to RDF/Turtle with ontology mappings:
```bash
# Use linkml-convert (when implemented)
linkml-convert -s schemas/heritage_custodian.yaml \
  -t ttl \
  data/instances/netherlands_batch1.yaml \
  > output/netherlands_batch1.ttl
```
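To make the role of `slot_uri` concrete, the following hand-rolled sketch turns a record into Turtle using only two of the slot mappings shown earlier. It is an illustration of the idea, not how LinkML's own RDF generation works internally: `record_to_turtle` and `literal` are hypothetical helpers, prefix declarations are omitted, and only plain string literals are handled.

```python
import json

# Subset of the slot_uri mappings from the core schema example.
SLOT_URIS = {"name": "skos:prefLabel", "alternative_names": "skos:altLabel"}


def literal(value: str) -> str:
    """Quote a string; JSON escapes such as \\" and \\n are also valid in Turtle."""
    return json.dumps(value, ensure_ascii=False)


def record_to_turtle(record: dict, class_uri: str = "cpov:PublicOrganisation") -> str:
    """Serialize the mapped slots of one record as a single Turtle statement."""
    statements = [f"a {class_uri}"]
    for slot, predicate in SLOT_URIS.items():
        values = record.get(slot, [])
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-element list
        if values:
            objects = ", ".join(literal(v) for v in values)
            statements.append(f"{predicate} {objects}")
    body = " ;\n    ".join(statements)
    return f"<{record['id']}>\n    {body} ."


record = {
    "id": "https://w3id.org/heritage/custodian/nl/amsterdam-museum",
    "name": "Amsterdam Museum",
    "alternative_names": ["Amsterdams Historisch Museum"],
}
print(record_to_turtle(record))
```

Slots without a `slot_uri` entry are silently dropped here, which is why every slot that should survive the export needs a mapping.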
---
## Extension Guidelines for AI Agents
When extracting data reveals a gap in the schema, follow this process:
### 1. Document the Gap
- **What data was found?** (exact field values, institution names)
- **Why doesn't existing schema fit?** (explain semantic mismatch)
- **How many instances?** (frequency of occurrence)
- **Geographic/domain scope?** (is this regional or global?)
### 2. Research Base Ontologies
Check existing ontologies for appropriate mappings (in priority order):
1. **TOOI** (`/data/ontology/tooiont.ttl`) - Dutch government organizations (if applicable)
2. **CPOV** (`/data/ontology/core-public-organisation-ap.ttl`) - EU public sector organizations
3. **Schema.org** (`/data/ontology/schemaorg.owl`) - Web semantics, broad coverage
4. **CIDOC-CRM** (`/data/ontology/CIDOC_CRM_v7.1.3.rdf`) - Cultural heritage domain
5. **RiC-O** (Records in Contexts) - Archival description
6. **BIBFRAME** - Bibliographic resources
7. **Dublin Core** (`dcterms:`) - Metadata elements
**Prefer existing ontology classes over inventing new ones.**
**Search Strategy**:
```bash
# Search for relevant classes in ontologies
rg "Organisatie|Organization|Museum|Archive" /data/ontology/*.ttl
rg "ChangeEvent|Wijziging|Merger" /data/ontology/*.ttl
```
### 3. Propose Extension
Create a proposal including:
- **Enum/slot name**: Follow LinkML naming conventions (snake_case for slots, UPPER_CASE for enums)
- **Description**: Clear, concise explanation of the concept
- **Meaning**: Link to base ontology class (`meaning: schema:ClassName`)
- **Use cases**: At least two real-world use cases (three or more preferred)
- **RDF example**: Show how it serializes to RDF
### 4. Validate with Real Data
- Test the extension against the data that revealed the gap
- Check if it applies to other extracted datasets
- Ensure backward compatibility (prefer additive changes)
### 5. Update Documentation
- Add entry to this file (ONTOLOGY_EXTENSIONS.md)
- Update schema version number if needed
- Note affected files and line numbers
- Document validation results
---
## Schema Evolution Principles
### 1. Ontology Reuse Over Invention
**Always prefer**:
- Existing ontology classes (Schema.org, CIDOC-CRM, RiC-O)
- Widely adopted standards (Dublin Core, BIBFRAME)
- Industry conventions (ISIL codes, Wikidata identifiers)
**Avoid**:
- Inventing new properties when existing ones exist
- Creating parallel taxonomies to established standards
- Over-specialization (prefer general + description field)
### 2. Additive Changes > Breaking Changes
**Safe changes** (additive):
- ✅ Add new enum values
- ✅ Add optional slots
- ✅ Add new classes
- ✅ Expand multivalued slots
**Breaking changes** (avoid):
- ❌ Remove enum values
- ❌ Change slot ranges
- ❌ Make optional slots required
- ❌ Rename classes/slots
**If breaking change is necessary**:
- Document migration path in `/docs/MIGRATION.md`
- Provide conversion script in `/scripts/migrations/`
- Bump the minor version number (0.2.x → 0.3.0), which marks breaking changes while the schema is pre-1.0
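The additive-versus-breaking distinction can be checked mechanically for enums. The sketch below is an assumption-laden illustration: both function names are hypothetical, and the numbering rule (additive changes raise the last component, breaking changes move 0.2.x to 0.3.0) is inferred from the version history in this document rather than taken from project tooling.

```python
def classify_enum_change(old_values: set, new_values: set) -> str:
    """Compare two sets of permissible values for the same enum."""
    if old_values - new_values:
        return "breaking"  # at least one value was removed or renamed
    if new_values - old_values:
        return "additive"  # values were only added
    return "unchanged"


def bump_version(version: str, change: str) -> str:
    """Return the next version string for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major}.{minor + 1}.0"
    if change == "additive":
        return f"{major}.{minor}.{patch + 1}"
    return version


old = {"WEBSITE", "GENERIC"}
new = {"WEBSITE", "GENERIC", "LEARNING_MANAGEMENT"}
change = classify_enum_change(old, new)
print(change, bump_version("0.2.0", change))
```

Applied to the `LEARNING_MANAGEMENT` addition, this reproduces the 0.2.0 to 0.2.1 step recorded in the version history.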
### 3. Evidence-Based Extensions
**Require**:
- At least two real-world instances found in extraction (three or more preferred)
- Clear semantic gap (no existing enum/slot fits)
- Use case justification (why is this distinction important?)
**Don't extend for**:
- Single outlier instances (use free-text description instead)
- Regional idiosyncrasies (consider a country-specific extension module, as with the Dutch one)
- Speculative future needs (extend when needed, not preemptively)
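These evidence requirements could be captured in a small record that reports what is still missing before an extension is proposed. The `GapReport` class, its field names, and the threshold constant are hypothetical; the example values come from the `LEARNING_MANAGEMENT` entry above.

```python
from dataclasses import dataclass, field

MIN_INSTANCES = 2  # the evidence rule asks for at least two real-world instances


@dataclass
class GapReport:
    proposed_name: str
    instances: list                # institutions where the gap was observed
    existing_values_rejected: dict  # existing value -> why it does not fit
    use_cases: list = field(default_factory=list)

    def blockers(self) -> list:
        """List the evidence requirements this report does not yet meet."""
        problems = []
        if len(self.instances) < MIN_INSTANCES:
            problems.append("too few instances; use a free-text description instead")
        if not self.existing_values_rejected:
            problems.append("no semantic gap documented")
        if not self.use_cases:
            problems.append("no use case justification")
        return problems


report = GapReport(
    proposed_name="LEARNING_MANAGEMENT",
    instances=["Misurata University", "Benghazi University", "University of Tripoli"],
    existing_values_rejected={
        "WEBSITE": "too generic",
        "GENERIC": "too generic, loses semantic meaning",
    },
    use_cases=["Heritage Education Tracking", "Platform Integration Mapping"],
)
print(report.blockers())
```

An empty list means the report satisfies all three requirements; a single outlier instance would surface the first blocker.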
### 4. Semantic Clarity
**Good enum/slot names**:
- `LEARNING_MANAGEMENT` - Clear, unambiguous, scoped to heritage education
- `collection_type` - Flexible, allows domain-specific values
- `platform_url` - Self-explanatory, no ambiguity
**Poor enum/slot names**:
- `SYSTEM` - Too generic, unclear semantics
- `other_stuff` - Vague, unmaintainable
- `lms` - Abbreviation, unclear to non-experts
### 5. Balance Granularity and Usability
**Too coarse**:
```yaml
# BAD: Loses semantic precision
platform_type: GENERIC
notes: "This is a learning management system"
```
**Too fine-grained**:
```yaml
# BAD: Unmaintainable, one enum value per product
- platform_type: MOODLE_LMS
- platform_type: GOOGLE_CLASSROOM_LMS
- platform_type: BLACKBOARD_LMS
- platform_type: CANVAS_LMS
```
**Just right**:
```yaml
# GOOD: Semantic category + specific name
platform_type: LEARNING_MANAGEMENT
platform_name: "Moodle"
```
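One way to apply the "just right" pattern during extraction is to keep the product name as data and derive the category from it. This is a minimal sketch: `classify_platform` is a hypothetical helper, the product list is limited to the four LMS products named in this document, and falling back to `GENERIC` for unrecognized names is an assumption.

```python
# LMS products named in the LEARNING_MANAGEMENT description.
KNOWN_LMS_PRODUCTS = ("Moodle", "Google Classroom", "Blackboard", "Canvas")


def classify_platform(raw_name: str) -> dict:
    """Derive platform_type from a free-text platform label."""
    lowered = raw_name.lower()
    for product in KNOWN_LMS_PRODUCTS:
        # Naive substring match; good enough for a sketch, not for production.
        if product.lower() in lowered:
            return {"platform_type": "LEARNING_MANAGEMENT", "platform_name": product}
    return {"platform_type": "GENERIC", "platform_name": raw_name}


print(classify_platform("Moodle platform for heritage courses"))
print(classify_platform("Google Classroom Integration"))
```

Adding support for another LMS product then means extending a data list, not the enum.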
---
## Future Extension Candidates
These **potential** extensions have been identified but not implemented. Each is either awaiting more evidence or, after review, judged unnecessary.
### CollectionTypeEnum
**Status**: ⏳ Under review
**Current Implementation**: Free text (`collection_type: string`)
**Found in Libyan Data**:
- "archaeological", "bibliographic", "archival" (standard)
- "historical", "architectural", "mixed", "digital objects" (non-standard)
**Proposal**: Create optional controlled vocabulary while keeping free text fallback
**Questions**:
- Is there an existing standard (AAT, LCSH subject headings)?
- Would enum improve data quality or restrict flexibility?
- Do different countries use different typologies?
**Decision**: Defer until we have 50+ institutions to analyze usage patterns.
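Once more data is available, the usage analysis could start from a simple frequency count of the free-text values. The sketch below assumes each institution record carries a single `collection_type` string; the `summarize_collection_types` helper and the sample records are hypothetical, while the standard terms and the 50-institution threshold come from this section.

```python
from collections import Counter

STANDARD_TYPES = {"archaeological", "bibliographic", "archival"}
MIN_INSTITUTIONS = 50  # threshold from the decision above


def summarize_collection_types(institutions: list) -> dict:
    """Count normalized collection_type values and flag non-standard ones."""
    counts = Counter(
        inst["collection_type"].strip().lower()
        for inst in institutions
        if inst.get("collection_type")
    )
    non_standard = {value: n for value, n in counts.items() if value not in STANDARD_TYPES}
    return {
        "enough_data": len(institutions) >= MIN_INSTITUTIONS,
        "counts": dict(counts),
        "non_standard": non_standard,
    }


sample = [
    {"collection_type": "archaeological"},
    {"collection_type": "Historical"},
    {"collection_type": "mixed"},
    {"collection_type": "archival "},
    {"collection_type": None},
]
print(summarize_collection_types(sample))
```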
---
### UNESCO Heritage Status
**Status**: ✅ Adequate (no extension needed)
**Current Implementation**: Use `Identifier` class with `identifier_scheme: UNESCO_WHC`
**Found in Libyan Data**:
- 5 UNESCO World Heritage Sites with WHC identifiers
- Status changes tracked via `ChangeEvent` (inscription, delisting)
**Conclusion**: Current schema handles this well. No extension needed.
---
### War/Conflict Heritage Markers
**Status**: ⏳ Monitoring
**Found in Libyan Data**:
- Misrata War Museum (2011 Libyan Civil War)
- Tobruk WWII Commonwealth War Cemetery
**Current Handling**: Use `description` field + `subjects` in `Collection` class
**Question**: Should we add `conflict_period` or `war_era` enum for specialized search?
**Decision**: Monitor usage across more conflict-affected countries (Syria, Yemen, Bosnia). Defer extension for now.
---
## References
- **Base Ontologies**: `/data/ontology/` directory
  - `CIDOC_CRM_v7.1.3.rdf` - Cultural heritage modeling
  - `schemaorg.owl` - Schema.org vocabulary
- **LinkML Documentation**: https://linkml.io/linkml/
- **Schema Design Patterns**: `/docs/plan/global_glam/05-design-patterns.md`
- **Data Standardization**: `/docs/plan/global_glam/04-data-standardization.md`
---
**Maintained by**: GLAM Data Extraction Project
**Last Updated**: 2025-11-09
**Schema Version**: 0.2.1 (development)