glam/docs/ONTOLOGY_EXTENSIONS.md

# Ontology Extensions and Schema Evolution

This document tracks extensions to the Heritage Custodian LinkML schema based on real-world data extraction findings. All extensions are mapped to base ontologies (CIDOC-CRM, Schema.org, RiC-O, etc.) to maintain semantic interoperability.

## Version History

| Version | Date | Description |
|---------|------|-------------|
| 0.2.1 | 2025-11-09 | Added LEARNING_MANAGEMENT to DigitalPlatformTypeEnum (Libyan extraction) |
| 0.2.0 | 2025-11-05 | Modular schema reorganization |

---

## Extensions Log

### 2025-11-09: LEARNING_MANAGEMENT Platform Type

**Schema File**: `schemas/enums.yaml`
**Enum**: `DigitalPlatformTypeEnum`
**Added Value**: `LEARNING_MANAGEMENT`

#### Gap Identified

During extraction of Libyan heritage institutions, three universities (Misurata University, Benghazi University, and the University of Tripoli) were found to use learning management systems (Google Classroom, Moodle) for heritage education and digital resource delivery. The existing `DigitalPlatformTypeEnum` had no appropriate category for LMS platforms.

**Source Data**:
- `data/instances/libya_universities_batch1.json` (lines 78, 190, 286)
- Misurata University: Google Classroom Integration
- Benghazi University: Moodle platform for heritage courses
- University of Tripoli: Moodle integration

**Original Schema Coverage**:
- COLLECTION_MANAGEMENT ❌ (too specific - for museum/archive systems)
- DIGITAL_REPOSITORY ❌ (for digital preservation, not learning)
- DISCOVERY_PORTAL ❌ (for search/discovery, not education)
- WEBSITE ❌ (too generic)
- GENERIC ❌ (too generic, loses semantic meaning)

#### Proposal

Add `LEARNING_MANAGEMENT` to `DigitalPlatformTypeEnum`:

```yaml
LEARNING_MANAGEMENT:
  description: Learning management systems for heritage education (Moodle, Google Classroom, Blackboard, Canvas)
  meaning: schema:LearningResource
```

#### Ontology Mapping

**Base Ontology**: Schema.org
**Class**: `schema:LearningResource`
**Reference**: https://schema.org/LearningResource

**RDF Serialization**:
```turtle
@prefix schema: <http://schema.org/> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/ly/misurata-lms> a heritage:DigitalPlatform, schema:LearningResource ;
    heritage:platform_name "Google Classroom Integration" ;
    heritage:platform_type "LEARNING_MANAGEMENT" ;
    schema:isPartOf <https://w3id.org/heritage/custodian/ly/misurata-university> .
```

#### Use Cases

1. **Heritage Education Tracking**: Document how institutions deliver heritage education digitally
2. **Platform Integration Mapping**: Identify which LMS platforms are used in the heritage sector
3. **E-Learning Resource Discovery**: Enable discovery of heritage learning platforms
4. **Digital Pedagogy Research**: Support research on digital heritage education methods

#### Implementation

**Status**: ✅ Implemented (2025-11-09)

**Affected Files**:
- `schemas/enums.yaml` (lines 191-212, added LEARNING_MANAGEMENT at line 208)

**Validation**:
- Libyan extraction data now validates correctly
- 3 institutions using LEARNING_MANAGEMENT platform type

**Backward Compatibility**:
- New enum value is additive (non-breaking change)
- Existing data unaffected
- Future extractions can use new value

#### Related Work

**Similar Patterns in Other Domains**:
- Schema.org `schema:Course` - For structured course information
- LTI (Learning Tools Interoperability) - Standard for LMS integration
- LRMI (Learning Resource Metadata Initiative) - Metadata for learning resources

**Future Extensions**:
- Consider adding `course_url` slot to DigitalPlatform for linking to specific courses
- May need `MetadataStandardEnum` value for LRMI if heritage institutions adopt it

---

## Integrating TOOI and CPOV Ontologies

The GLAM project builds on two foundational ontologies for organizational data modeling. **AI agents should always consult these ontologies** when designing extraction pipelines or extending the schema.

### TOOI - Dutch Government Organizational Ontology

**File**: `/data/ontology/tooiont.ttl`
**Namespace**: `https://identifier.overheid.nl/tooi/def/ont/`
**Purpose**: Model Dutch government organizations, their lifecycle events, and temporal changes

**Key Classes**:
- `tooi:Overheidsorganisatie` - Government organization (base for `DutchHeritageCustodian`)
- `tooi:Wijzigingsgebeurtenis` - Change event (merger, split, closure)
- `tooi:organisatieIdentificatie` - Organizational identifiers

**Key Properties**:
- `tooi:officieleNaamInclSoort` - Official name including organizational type
- `tooi:begindatum` - Start date (founding, change effective date)
- `tooi:einddatum` - End date (closure, change expiry)
- `tooi:resultaat` - Resulting organization from change event
- `tooi:voorafgaandeOrganisatie` - Predecessor organization

**PROV-O Integration**:
TOOI uses PROV-O (W3C Provenance Ontology) for temporal tracking:
- Change events as `prov:Activity`
- Organizations linked via `prov:wasInfluencedBy` and `prov:generated`
- Temporal bounds via `prov:atTime`

**Heritage Custodian Mapping**:
```yaml
# LinkML schema/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # Maps to TOOI base class

  slots:
    - isil_code  # Maps to tooi:organisatieIdentificatie
    - change_history  # Maps to tooi:Wijzigingsgebeurtenis
```

**RDF Serialization Example**:
```turtle
@prefix tooi: <https://identifier.overheid.nl/tooi/def/ont/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/nl/noord-hollands-archief>
    a tooi:Overheidsorganisatie, heritage:HeritageCustodian ;
    tooi:officieleNaamInclSoort "Noord-Hollands Archief" ;
    tooi:begindatum "2001-01-01"^^xsd:date ;
    heritage:institution_type "ARCHIVE" ;
    heritage:isil_code "NL-HlmNHA" .

# Change event: Merger of two archives
<https://w3id.org/heritage/custodian/event/nha-merger-2001>
    a tooi:Wijzigingsgebeurtenis, prov:Activity ;
    prov:atTime "2001-01-01T00:00:00Z"^^xsd:dateTime ;
    tooi:resultaat <https://w3id.org/heritage/custodian/nl/noord-hollands-archief> ;
    tooi:voorafgaandeOrganisatie
        <https://w3id.org/heritage/custodian/nl/gemeentearchief-haarlem>,
        <https://w3id.org/heritage/custodian/nl/rijksarchief-noord-holland> ;
    heritage:change_type "MERGER" ;
    heritage:event_description "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland" .
```
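
The merger event above links one resulting organization to its predecessors. A minimal Python sketch of how a pipeline might walk such events to collect an institution's full lineage follows. The `ChangeEvent` dataclass and `all_predecessors` function are hypothetical helpers written for this illustration, not part of TOOI or the project codebase.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    """Hypothetical in-memory mirror of a tooi:Wijzigingsgebeurtenis record."""
    event_id: str
    change_type: str
    date: str                      # ISO 8601 date string
    result: str                    # IRI of the resulting organization
    predecessors: list[str] = field(default_factory=list)


def all_predecessors(org: str, events: list[ChangeEvent]) -> set[str]:
    """Collect every direct and indirect predecessor of `org`."""
    found: set[str] = set()
    stack = [org]
    while stack:
        current = stack.pop()
        for event in events:
            if event.result != current:
                continue
            for pred in event.predecessors:
                if pred not in found:      # guards against cycles in bad data
                    found.add(pred)
                    stack.append(pred)
    return found


BASE = "https://w3id.org/heritage/custodian/nl/"
events = [
    ChangeEvent(
        event_id="https://w3id.org/heritage/custodian/event/nha-merger-2001",
        change_type="MERGER",
        date="2001-01-01",
        result=BASE + "noord-hollands-archief",
        predecessors=[
            BASE + "gemeentearchief-haarlem",
            BASE + "rijksarchief-noord-holland",
        ],
    ),
]

for iri in sorted(all_predecessors(BASE + "noord-hollands-archief", events)):
    print(iri)
```

The traversal follows `result` back to `predecessors` repeatedly, so chains of successive mergers resolve to the complete set of earlier organizations.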

**When to Use TOOI**:
- ✅ Extracting Dutch heritage institutions (government archives, state museums)
- ✅ Modeling mergers, splits, reorganizations of Dutch organizations
- ✅ Tracking historical changes to organizational structure
- ✅ Integrating with Dutch national registries (ISIL, KvK)
- ❌ Non-Dutch institutions (use CPOV instead)
- ❌ Private collections without government affiliation

---

### CPOV - EU Core Public Organisation Vocabulary

**Files**:
- `/data/ontology/core-public-organisation-ap.ttl` (RDF schema)
- `/data/ontology/core-public-organisation-ap.jsonld` (JSON-LD context)

**Namespace**: `http://data.europa.eu/m8g/`
**Purpose**: EU-wide vocabulary for public sector organizations (governments, NGOs, cultural institutions)

**Key Classes**:
- `cpov:PublicOrganisation` - Any public-sector organization (base for global heritage custodians)
- `cv:ChangeEvent` - Organizational change (founding, closure, name change)
- `cv:ContactPoint` - Contact information for public services
- `locn:Address` - Physical location details

**Key Properties**:
- `dct:identifier` - Formal identifier (ISIL, national registry ID)
- `skos:prefLabel` - Preferred name
- `skos:altLabel` - Alternative names
- `dct:temporal` - Temporal coverage (founding to closure)
- `cv:contactPoint` - Contact details
- `locn:address` - Physical address

**W3C Org Ontology Integration**:
CPOV builds on W3C Organization Ontology:
- `org:Organization` - Base organizational structure
- `org:hasUnit` - Hierarchical relationships (parent-child)
- `org:linkedTo` - Partnerships, networks
- `org:changedBy` - Change events affecting organization

**Heritage Custodian Mapping**:
```yaml
# LinkML schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # Maps to CPOV for EU-wide interoperability

  attributes:  # inline slot definitions (a bare `slots:` key takes a list of slot names)
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier
    locations:
      slot_uri: locn:address
    change_history:
      slot_uri: org:changedBy  # property linking to change events; cv:ChangeEvent is the class of the values
```

**RDF Serialization Example**:
```turtle
@prefix cpov: <http://data.europa.eu/m8g/> .
@prefix cv: <http://data.europa.eu/m8g/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/br/biblioteca-nacional>
    a cpov:PublicOrganisation ;
    skos:prefLabel "Biblioteca Nacional do Brasil"@pt ;
    skos:altLabel "National Library of Brazil"@en, "BNB"@pt ;
    dct:identifier [
        a dct:Identifier ;
        skos:notation "BR-RjBN" ;
        dct:creator "International Standard Identifier for Libraries and Related Organisations"
    ] ;
    locn:address [
        a locn:Address ;
        locn:thoroughfare "Avenida Rio Branco, 219" ;
        locn:postCode "20040-008" ;
        locn:adminUnitL2 "Rio de Janeiro" ;
        locn:adminUnitL1 "BR"
    ] ;
    dct:temporal [
        schema:startDate "1810-01-01"^^xsd:date
    ] .

# Change event: Founding
<https://w3id.org/heritage/custodian/event/bnb-founding>
    a cv:ChangeEvent ;
    dct:date "1810-01-01"^^xsd:date ;
    dct:type "FOUNDING" ;
    dct:description "Founded by Prince Regent João (later King João VI of Portugal) as the Royal Library"@en ;
    cv:changedOrganisation <https://w3id.org/heritage/custodian/br/biblioteca-nacional> .
```

**When to Use CPOV**:
- ✅ Extracting non-Dutch European heritage institutions (France, Germany, Belgium, etc.)
- ✅ Modeling public-sector cultural organizations (national museums, state archives)
- ✅ EU Linked Open Data alignment (e.g., Europeana)
- ✅ Cross-border organizational relationships (EU heritage networks)
- ⚠️ Global institutions outside EU (use CPOV patterns but add regional ontologies)
- ❌ Purely private collections (consider Schema.org `schema:Organization` instead)

---

### Ontology Decision Tree for AI Agents

When designing extraction pipelines, choose the appropriate ontology:

```
Is the institution Dutch?
├─ YES → Use TOOI (tooi:Overheidsorganisatie)
│         Map to schemas/dutch.yaml
│         Extract ISIL codes, KvK numbers
│
└─ NO → Is the institution in the EU?
         ├─ YES → Use CPOV (cpov:PublicOrganisation)
         │         Map to schemas/core.yaml
         │         Extract EU-standard identifiers
         │
         └─ NO → Use CPOV patterns + regional ontologies
                  Example: Brazilian institutions → CPOV + national heritage codes
                  Fallback to Schema.org for private/informal collections
```
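
The decision tree can be sketched as a small Python function. This is an illustration under stated assumptions: `choose_ontology`, `OntologyChoice`, and the `is_public` flag are hypothetical names, the private-collection check is applied first (following the "When to Use" lists above), and the EU membership set reflects the 27 member states at the time of writing.

```python
from typing import NamedTuple, Optional

# ISO 3166-1 alpha-2 codes of the 27 EU member states (assumed current).
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


class OntologyChoice(NamedTuple):
    class_uri: str
    schema_file: Optional[str]
    needs_regional_ontology: bool


def choose_ontology(country_code: str, is_public: bool = True) -> OntologyChoice:
    """Pick a base ontology class for an institution (hypothetical helper)."""
    code = country_code.upper()
    if not is_public:
        # Private or informal collections fall back to Schema.org.
        return OntologyChoice("schema:Organization", None, False)
    if code == "NL":
        return OntologyChoice("tooi:Overheidsorganisatie", "schemas/dutch.yaml", False)
    if code in EU_MEMBER_STATES:
        return OntologyChoice("cpov:PublicOrganisation", "schemas/core.yaml", False)
    # Outside the EU: reuse CPOV patterns and flag that a regional ontology is needed.
    return OntologyChoice("cpov:PublicOrganisation", "schemas/core.yaml", True)


print(choose_ontology("BR"))
```

Returning a flag for regional ontologies, rather than a concrete ontology name, keeps the sketch honest about the fact that the regional choice depends on the country.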

**Combining Ontologies**:
Institutions can implement MULTIPLE ontology classes:

```turtle
<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,    # Dutch government organization
      cpov:PublicOrganisation,          # EU public sector
      schema:Museum,                    # Schema.org for web discoverability
      crm:E74_Group ;                   # CIDOC-CRM for cultural heritage domain
    ...
```

---

### Practical Extraction Workflow

**Step 1: Read Ontology Files**

Before designing extraction logic, review:
```bash
# Dutch institutions
grep -A 10 "tooi:Overheidsorganisatie" /data/ontology/tooiont.ttl

# EU/global institutions
grep -A 10 "cpov:PublicOrganisation" /data/ontology/core-public-organisation-ap.ttl

# JSON-LD context for CPOV
cat /data/ontology/core-public-organisation-ap.jsonld
```

**Step 2: Map Conversation Data to Ontology Classes**

Identify which ontology properties correspond to extracted data:

| Extracted Data | TOOI Property | CPOV Property | Schema.org |
|----------------|---------------|---------------|------------|
| Institution name | `tooi:officieleNaamInclSoort` | `skos:prefLabel` | `schema:name` |
| Founding date | `tooi:begindatum` | `schema:startDate` | `schema:foundingDate` |
| ISIL code | `tooi:organisatieIdentificatie` | `dct:identifier` | `schema:identifier` |
| Address | (use `locn:Address`) | `locn:address` | `schema:address` |
| Merger event | `tooi:Wijzigingsgebeurtenis` | `cv:ChangeEvent` | `schema:Event` |
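
The table above can be held as a lookup structure so that extraction code resolves properties by field and ontology. The sketch below is illustrative only: the field keys (`name`, `founding_date`, and so on) and the `property_for` helper are assumptions made for this example, while the property values are copied from the table.

```python
# Field keys are hypothetical; values mirror the mapping table.
PROPERTY_MAP = {
    "name": {
        "tooi": "tooi:officieleNaamInclSoort",
        "cpov": "skos:prefLabel",
        "schema": "schema:name",
    },
    "founding_date": {
        "tooi": "tooi:begindatum",
        "cpov": "schema:startDate",
        "schema": "schema:foundingDate",
    },
    "isil_code": {
        "tooi": "tooi:organisatieIdentificatie",
        "cpov": "dct:identifier",
        "schema": "schema:identifier",
    },
    "address": {
        "tooi": "locn:Address",
        "cpov": "locn:address",
        "schema": "schema:address",
    },
    "merger_event": {
        "tooi": "tooi:Wijzigingsgebeurtenis",
        "cpov": "cv:ChangeEvent",
        "schema": "schema:Event",
    },
}


def property_for(field_name: str, ontology: str) -> str:
    """Return the mapped term, failing loudly on unknown fields or ontologies."""
    try:
        return PROPERTY_MAP[field_name][ontology]
    except KeyError:
        raise KeyError(f"No mapping for field {field_name!r} in ontology {ontology!r}") from None


print(property_for("founding_date", "tooi"))
```

Failing loudly on an unmapped field is a deliberate choice here: a missing mapping is often the first sign of a schema gap worth documenting.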

**Step 3: Generate RDF-Compatible LinkML**

LinkML YAML automatically maps to RDF when `class_uri` and `slot_uri` are defined:

```yaml
# Extraction output (LinkML YAML)
- id: https://w3id.org/heritage/custodian/nl/amsterdam-museum
  name: Amsterdam Museum
  institution_type: MUSEUM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdAM
  locations:
    - city: Amsterdam
      country: NL
  change_history:
    - event_id: https://w3id.org/heritage/custodian/event/am-renaming-2011
      change_type: NAME_CHANGE
      event_date: "2011-01-01"
      event_description: "Renamed from Amsterdams Historisch Museum to Amsterdam Museum"
```

**Step 4: Export to RDF**

LinkML automatically serializes to RDF/Turtle with ontology mappings:

```bash
# Use linkml-convert (when implemented)
linkml-convert -s schemas/heritage_custodian.yaml \
               -t ttl \
               data/instances/netherlands_batch1.yaml \
               > output/netherlands_batch1.ttl
```
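
To make the effect of `slot_uri` mappings concrete, here is a hand-rolled Python sketch that turns two slots of an extraction record into N-Triples lines. It is not how `linkml-convert` works internally and is no substitute for it; `SLOT_URIS`, `escape_literal`, and `record_to_ntriples` are hypothetical names, and only the `name` and `identifier_value` slots are covered.

```python
# Hypothetical slot-to-predicate table, following the CPOV mapping shown earlier.
SLOT_URIS = {
    "name": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "identifier_value": "http://purl.org/dc/terms/identifier",
}


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples string literals require."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record: dict) -> list[str]:
    """Emit one triple for the name and one per identifier."""
    subject = "<" + record["id"] + ">"
    values = [("name", record["name"])]
    values += [
        ("identifier_value", ident["identifier_value"])
        for ident in record.get("identifiers", [])
    ]
    return [
        '{} <{}> "{}" .'.format(subject, SLOT_URIS[slot], escape_literal(value))
        for slot, value in values
    ]


record = {
    "id": "https://w3id.org/heritage/custodian/nl/amsterdam-museum",
    "name": "Amsterdam Museum",
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "NL-AsdAM"}],
}
lines = record_to_ntriples(record)
print("\n".join(lines))
```

In practice a LinkML or RDF library should produce the serialization; the sketch only shows that each mapped slot becomes one predicate on the institution's IRI.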

---

## Extension Guidelines for AI Agents

When extracting data reveals a gap in the schema, follow this process:

### 1. Document the Gap

- **What data was found?** (exact field values, institution names)
- **Why doesn't existing schema fit?** (explain semantic mismatch)
- **How many instances?** (frequency of occurrence)
- **Geographic/domain scope?** (is this regional or global?)

### 2. Research Base Ontologies

Check existing ontologies for appropriate mappings (in priority order):

1. **TOOI** (`/data/ontology/tooiont.ttl`) - Dutch government organizations (if applicable)
2. **CPOV** (`/data/ontology/core-public-organisation-ap.ttl`) - EU public sector organizations
3. **Schema.org** (`/data/ontology/schemaorg.owl`) - Web semantics, broad coverage
4. **CIDOC-CRM** (`/data/ontology/CIDOC_CRM_v7.1.3.rdf`) - Cultural heritage domain
5. **RiC-O** (Records in Contexts) - Archival description
6. **BIBFRAME** - Bibliographic resources
7. **Dublin Core** (`dcterms:`) - Metadata elements

**Prefer existing ontology classes over inventing new ones.**

**Search Strategy**:
```bash
# Search for relevant classes in ontologies
rg "Organisatie|Organization|Museum|Archive" /data/ontology/*.ttl
rg "ChangeEvent|Wijziging|Merger" /data/ontology/*.ttl
```

### 3. Propose Extension

Create a proposal including:
- **Enum/slot name**: Follow LinkML naming conventions (snake_case for slots, CamelCase for enum names, UPPER_SNAKE_CASE for enum values)
- **Description**: Clear, concise explanation of the concept
- **Meaning**: Link to base ontology class (`meaning: schema:ClassName`)
- **Use cases**: At least two real-world use cases
- **RDF example**: Show how it serializes to RDF

### 4. Validate with Real Data

- Test the extension against the data that revealed the gap
- Check if it applies to other extracted datasets
- Ensure backward compatibility (prefer additive changes)

### 5. Update Documentation

- Add entry to this file (ONTOLOGY_EXTENSIONS.md)
- Update schema version number if needed
- Note affected files and line numbers
- Document validation results

---

## Schema Evolution Principles

### 1. Ontology Reuse Over Invention

**Always prefer**:
- Existing ontology classes (Schema.org, CIDOC-CRM, RiC-O)
- Widely adopted standards (Dublin Core, BIBFRAME)
- Industry conventions (ISIL codes, Wikidata identifiers)

**Avoid**:
- Inventing new properties when existing ones exist
- Creating parallel taxonomies to established standards
- Over-specialization (prefer general + description field)

### 2. Additive Changes > Breaking Changes

**Safe changes** (additive):
- ✅ Add new enum values
- ✅ Add optional slots
- ✅ Add new classes
- ✅ Expand multivalued slots

**Breaking changes** (avoid):
- ❌ Remove enum values
- ❌ Change slot ranges
- ❌ Make optional slots required
- ❌ Rename classes/slots

**If breaking change is necessary**:
- Document migration path in `/docs/MIGRATION.md`
- Provide conversion script in `/scripts/migrations/`
- Bump the minor version number (0.2.x → 0.3.0), which signals a breaking change while the schema is pre-1.0
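
A minimal Python sketch of the additive-versus-breaking distinction for enum values follows. `classify_enum_change` is a hypothetical helper, and the value set below contains only the enum values named in this document; the real `DigitalPlatformTypeEnum` may contain more.

```python
def classify_enum_change(old_values: set[str], new_values: set[str]) -> str:
    """Classify an enum revision as 'breaking', 'additive', or 'unchanged'."""
    if old_values - new_values:
        # Any removed value (including one half of a rename) invalidates old data.
        return "breaking"
    if new_values - old_values:
        return "additive"
    return "unchanged"


# Only the values named in this document; the real enum may be larger.
v0_2_0 = {
    "COLLECTION_MANAGEMENT",
    "DIGITAL_REPOSITORY",
    "DISCOVERY_PORTAL",
    "WEBSITE",
    "GENERIC",
}
v0_2_1 = v0_2_0 | {"LEARNING_MANAGEMENT"}

print(classify_enum_change(v0_2_0, v0_2_1))
```

Because a rename appears as one removal plus one addition, the removal check runs first and the change is reported as breaking.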

### 3. Evidence-Based Extensions

**Require**:
- At least two real-world instances found during extraction
- Clear semantic gap (no existing enum/slot fits)
- Use case justification (why is this distinction important?)

**Don't extend for**:
- Single outlier instances (use free-text description instead)
- Regional idiosyncrasies (consider a region-specific extension module instead, as with the Dutch module)
- Speculative future needs (extend when needed, not preemptively)
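
These criteria could be checked mechanically before a proposal is written up. The Python sketch below is one possible shape, assuming a threshold of two instances (the lower bound of the rule above); `ExtensionProposal` and `rejection_reasons` are hypothetical names, not existing project code.

```python
from dataclasses import dataclass, field

MIN_INSTANCES = 2  # assumption: lower bound of the evidence rule above


@dataclass
class ExtensionProposal:
    """Hypothetical record of the evidence behind a proposed extension."""
    name: str
    instance_count: int
    existing_value_fits: bool
    use_cases: list[str] = field(default_factory=list)


def rejection_reasons(proposal: ExtensionProposal) -> list[str]:
    """Return the unmet criteria; an empty list means the proposal may proceed."""
    reasons = []
    if proposal.instance_count < MIN_INSTANCES:
        reasons.append("too few real-world instances")
    if proposal.existing_value_fits:
        reasons.append("an existing enum/slot already fits")
    if not proposal.use_cases:
        reasons.append("no use case justification")
    return reasons


lms = ExtensionProposal(
    name="LEARNING_MANAGEMENT",
    instance_count=3,
    existing_value_fits=False,
    use_cases=["Heritage Education Tracking", "Platform Integration Mapping"],
)
print(rejection_reasons(lms))
```

Returning the list of unmet criteria, rather than a bare boolean, gives the proposer something concrete to record in the extension log.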

### 4. Semantic Clarity

**Good enum/slot names**:
- `LEARNING_MANAGEMENT` - Clear, unambiguous, scoped to heritage education
- `collection_type` - Flexible, allows domain-specific values
- `platform_url` - Self-explanatory, no ambiguity

**Poor enum/slot names**:
- `SYSTEM` - Too generic, unclear semantics
- `other_stuff` - Vague, unmaintainable
- `lms` - Abbreviation, unclear to non-experts

### 5. Balance Granularity and Usability

**Too coarse**:
```yaml
# BAD: Loses semantic precision
platform_type: GENERIC
notes: "This is a learning management system"
```

**Too fine-grained**:
```yaml
# BAD: Unmaintainable, one enum value per product (each line is a separate record)
platform_type: MOODLE_LMS
platform_type: GOOGLE_CLASSROOM_LMS
platform_type: BLACKBOARD_LMS
platform_type: CANVAS_LMS
```

**Just right**:
```yaml
# GOOD: Semantic category + specific name
platform_type: LEARNING_MANAGEMENT
platform_name: "Moodle"
```
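
The "just right" pattern can be applied at extraction time by mapping product names to the semantic category while keeping the original name. The Python sketch below assumes a simple substring match; `LMS_PRODUCTS` and `classify_platform` are hypothetical, and a short keyword such as "canvas" could over-match in real data.

```python
from typing import Optional

# Products named in the LEARNING_MANAGEMENT enum description.
LMS_PRODUCTS = ("moodle", "google classroom", "blackboard", "canvas")


def classify_platform(platform_name: str) -> dict[str, Optional[str]]:
    """Return the semantic category plus the untouched product name."""
    lowered = platform_name.lower()
    for product in LMS_PRODUCTS:
        if product in lowered:
            return {
                "platform_type": "LEARNING_MANAGEMENT",
                "platform_name": platform_name,
            }
    # No match: leave the type unset so the record is flagged for manual review.
    return {"platform_type": None, "platform_name": platform_name}


print(classify_platform("Moodle platform for heritage courses"))
```

Leaving the type unset on a miss, instead of defaulting to `GENERIC`, avoids silently losing the semantic information the enum is meant to capture.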

---

## Future Extension Candidates

These are **potential** extensions identified but not yet implemented (waiting for more evidence):

### CollectionTypeEnum

**Status**: ⏳ Under review
**Current Implementation**: Free text (`collection_type: string`)
**Found in Libyan Data**:
- "archaeological", "bibliographic", "archival" (standard)
- "historical", "architectural", "mixed", "digital objects" (non-standard)

**Proposal**: Create optional controlled vocabulary while keeping free text fallback

**Questions**:
- Is there an existing standard (AAT, LCSH subject headings)?
- Would enum improve data quality or restrict flexibility?
- Do different countries use different typologies?

**Decision**: Defer until we have 50+ institutions to analyze usage patterns.
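
Once enough institutions are available, usage patterns could be tallied with a few lines of Python. This sketch assumes each institution record carries a `collections` list whose items have a free-text `collection_type`; those field names, the helper functions, and the sample records are illustrative, not real extraction data.

```python
from collections import Counter

MIN_INSTITUTIONS = 50  # threshold from the decision above


def collection_type_usage(institutions: list[dict]) -> Counter:
    """Count normalized free-text collection_type values across institutions."""
    counts: Counter = Counter()
    for institution in institutions:
        for collection in institution.get("collections", []):
            value = collection.get("collection_type")
            if value:
                counts[value.strip().lower()] += 1
    return counts


def ready_for_review(institutions: list[dict]) -> bool:
    """True once the sample is large enough to revisit the enum decision."""
    return len(institutions) >= MIN_INSTITUTIONS


# Made-up sample records for illustration only.
sample = [
    {"collections": [{"collection_type": "Archaeological"}, {"collection_type": "archival"}]},
    {"collections": [{"collection_type": "archaeological "}]},
    {"collections": [{"collection_type": "mixed"}]},
    {"collections": []},
]
usage = collection_type_usage(sample)
print(usage.most_common())
print(ready_for_review(sample))
```

Normalizing case and whitespace before counting helps separate genuine typology differences from spelling noise.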

---

### UNESCO Heritage Status

**Status**: ✅ Adequate (no extension needed)
**Current Implementation**: Use `Identifier` class with `identifier_scheme: UNESCO_WHC`

**Found in Libyan Data**:
- 5 UNESCO World Heritage Sites with WHC identifiers
- Status changes tracked via `ChangeEvent` (inscription, delisting)

**Conclusion**: Current schema handles this well. No extension needed.

---

### War/Conflict Heritage Markers

**Status**: ⏳ Monitoring
**Found in Libyan Data**:
- Misrata War Museum (2011 Libyan Civil War)
- Tobruk WWII Commonwealth War Cemetery

**Current Handling**: Use `description` field + `subjects` in `Collection` class

**Question**: Should we add `conflict_period` or `war_era` enum for specialized search?

**Decision**: Monitor usage across more conflict-affected countries (Syria, Yemen, Bosnia). Defer extension for now.

---

## References

- **Base Ontologies**: `/data/ontology/` directory
  - `CIDOC_CRM_v7.1.3.rdf` - Cultural heritage modeling
  - `schemaorg.owl` - Schema.org vocabulary
- **LinkML Documentation**: https://linkml.io/linkml/
- **Schema Design Patterns**: `/docs/plan/global_glam/05-design-patterns.md`
- **Data Standardization**: `/docs/plan/global_glam/04-data-standardization.md`

---

**Maintained by**: GLAM Data Extraction Project
**Last Updated**: 2025-11-09
**Schema Version**: 0.2.1 (development)