glam/docs/ONTOLOGY_EXTENSIONS.md

Ontology Extensions and Schema Evolution

This document tracks extensions to the Heritage Custodian LinkML schema based on real-world data extraction findings. All extensions are mapped to base ontologies (CIDOC-CRM, Schema.org, RiC-O, etc.) to maintain semantic interoperability.

Version History

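| Version | Date       | Description                                                              |
|---------|------------|--------------------------------------------------------------------------|
| 0.2.1   | 2025-11-09 | Added LEARNING_MANAGEMENT to DigitalPlatformTypeEnum (Libyan extraction) |
| 0.2.0   | 2025-11-05 | Modular schema reorganization                                            |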

Extensions Log

2025-11-09: LEARNING_MANAGEMENT Platform Type

Schema File: schemas/enums.yaml
Enum: DigitalPlatformTypeEnum
Added Value: LEARNING_MANAGEMENT

Gap Identified

During extraction of Libyan heritage institutions, 3 universities (Misurata, Benghazi, University of Tripoli) were found using learning management systems (Google Classroom, Moodle) for heritage education and digital resource delivery. The existing DigitalPlatformTypeEnum did not have an appropriate category for LMS platforms.

Source Data:

  • data/instances/libya_universities_batch1.json (lines 78, 190, 286)
  • Misurata University: Google Classroom Integration
  • Benghazi University: Moodle platform for heritage courses
  • University of Tripoli: Moodle integration

Original Schema Coverage:

  • COLLECTION_MANAGEMENT (too specific - for museum/archive systems)
  • DIGITAL_REPOSITORY (for digital preservation, not learning)
  • DISCOVERY_PORTAL (for search/discovery, not education)
  • WEBSITE (too generic)
  • GENERIC (too generic, loses semantic meaning)

Proposal

Add LEARNING_MANAGEMENT to DigitalPlatformTypeEnum:

LEARNING_MANAGEMENT:
  description: Learning management systems for heritage education (Moodle, Google Classroom, Blackboard, Canvas)
  meaning: schema:LearningResource

Ontology Mapping

Base Ontology: Schema.org
Class: schema:LearningResource
Reference: https://schema.org/LearningResource

RDF Serialization:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/ly/misurata-lms> a heritage:DigitalPlatform ;
    heritage:platform_name "Google Classroom Integration" ;
    heritage:platform_type "LEARNING_MANAGEMENT" ;
    rdf:type schema:LearningResource ;
    schema:isPartOf <https://w3id.org/heritage/custodian/ly/misurata-university> .

Use Cases

  1. Heritage Education Tracking: Document how institutions deliver heritage education digitally
  2. Platform Integration Mapping: Identify which LMS platforms are used in the heritage sector
  3. E-Learning Resource Discovery: Enable discovery of heritage learning platforms
  4. Digital Pedagogy Research: Support research on digital heritage education methods

Implementation

Status: Implemented (2025-11-09)

Affected Files:

  • schemas/enums.yaml (lines 191-212, added LEARNING_MANAGEMENT at line 208)

Validation:

  • Libyan extraction data now validates correctly
  • 3 institutions using LEARNING_MANAGEMENT platform type

Backward Compatibility:

  • New enum value is additive (non-breaking change)
  • Existing data unaffected
  • Future extractions can use new value
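
The validation and backward-compatibility claims above can be illustrated with a minimal Python sketch. It assumes the enum's earlier permissible values are exactly the five listed under "Original Schema Coverage" (the real `schemas/enums.yaml` may hold more), and the three records are hypothetical stand-ins shaped after the Libyan platforms, not the actual instance data.

```python
from __future__ import annotations

# Assumption: values before 0.2.1, taken from "Original Schema Coverage" above.
OLD_VALUES = {
    "COLLECTION_MANAGEMENT", "DIGITAL_REPOSITORY",
    "DISCOVERY_PORTAL", "WEBSITE", "GENERIC",
}
NEW_VALUES = OLD_VALUES | {"LEARNING_MANAGEMENT"}


def is_additive(old: set[str], new: set[str]) -> bool:
    """An enum change is additive when no previously valid value disappears."""
    return old <= new


def invalid_records(records: list[dict], allowed: set[str]) -> list[dict]:
    """Return the records whose platform_type is not a permissible value."""
    return [r for r in records if r["platform_type"] not in allowed]


# Hypothetical records modelled on the three platforms described above.
libyan_platforms = [
    {"platform_name": "Google Classroom Integration", "platform_type": "LEARNING_MANAGEMENT"},
    {"platform_name": "Moodle platform for heritage courses", "platform_type": "LEARNING_MANAGEMENT"},
    {"platform_name": "Moodle integration", "platform_type": "LEARNING_MANAGEMENT"},
]
```

Checked against `OLD_VALUES`, all three records are rejected; checked against `NEW_VALUES`, none are, and every value that was valid before remains valid.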

Similar Patterns in Other Domains:

  • Schema.org schema:Course - For structured course information
  • LTI (Learning Tools Interoperability) - Standard for LMS integration
  • LRMI (Learning Resource Metadata Initiative) - Metadata for learning resources

Future Extensions:

  • Consider adding course_url slot to DigitalPlatform for linking to specific courses
  • May need MetadataStandardEnum value for LRMI if heritage institutions adopt it

Integrating TOOI and CPOV Ontologies

The GLAM project builds on two foundational ontologies for organizational data modeling. AI agents should always consult these ontologies when designing extraction pipelines or extending the schema.

TOOI - Dutch Government Organizational Ontology

File: /data/ontology/tooiont.ttl
Namespace: https://identifier.overheid.nl/tooi/def/ont/
Purpose: Model Dutch government organizations, their lifecycle events, and temporal changes

Key Classes:

  • tooi:Overheidsorganisatie - Government organization (base for DutchHeritageCustodian)
  • tooi:Wijzigingsgebeurtenis - Change event (merger, split, closure)
  • tooi:organisatieIdentificatie - Organizational identifiers

Key Properties:

  • tooi:officieleNaamInclSoort - Official name including organizational type
  • tooi:begindatum - Start date (founding, change effective date)
  • tooi:einddatum - End date (closure, change expiry)
  • tooi:resultaat - Resulting organization from change event
  • tooi:voorafgaandeOrganisatie - Predecessor organization

PROV-O Integration: TOOI uses PROV-O (W3C Provenance Ontology) for temporal tracking:

  • Change events as prov:Activity
  • Organizations linked via prov:wasInfluencedBy and prov:generated
  • Temporal bounds via prov:atTime

Heritage Custodian Mapping:

# LinkML schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # Maps to TOOI base class
  
  slots:
    - isil_code  # Maps to tooi:organisatieIdentificatie
    - change_history  # Maps to tooi:Wijzigingsgebeurtenis

RDF Serialization Example:

@prefix tooi: <https://identifier.overheid.nl/tooi/def/ont/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/nl/noord-hollands-archief>
    a tooi:Overheidsorganisatie, heritage:HeritageCustodian ;
    tooi:officieleNaamInclSoort "Noord-Hollands Archief" ;
    tooi:begindatum "2001-01-01"^^xsd:date ;
    heritage:institution_type "ARCHIVE" ;
    heritage:isil_code "NL-HlmNHA" .

# Change event: Merger of two archives
<https://w3id.org/heritage/custodian/event/nha-merger-2001>
    a tooi:Wijzigingsgebeurtenis, prov:Activity ;
    prov:atTime "2001-01-01T00:00:00Z"^^xsd:dateTime ;
    tooi:resultaat <https://w3id.org/heritage/custodian/nl/noord-hollands-archief> ;
    tooi:voorafgaandeOrganisatie 
        <https://w3id.org/heritage/custodian/nl/gemeentearchief-haarlem>,
        <https://w3id.org/heritage/custodian/nl/rijksarchief-noord-holland> ;
    heritage:change_type "MERGER" ;
    heritage:event_description "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland" .
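
To show how change events such as this merger can be traversed once extracted, here is a minimal Python sketch. The `ChangeEvent` dataclass and the shortened identifiers are illustrative assumptions, not the project's actual data model; `result` and `predecessors` mirror tooi:resultaat and tooi:voorafgaandeOrganisatie.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    """Hypothetical in-memory form of a change event."""
    change_type: str
    date: str                      # ISO 8601 date string
    result: str                    # organization produced by the event
    predecessors: list[str] = field(default_factory=list)


def predecessors_of(org_id: str, events: list[ChangeEvent]) -> list[str]:
    """Collect direct and indirect predecessors by walking events backwards."""
    found: list[str] = []
    stack = [org_id]
    while stack:
        current = stack.pop()
        for event in events:
            if event.result != current:
                continue
            for pred in event.predecessors:
                if pred not in found:      # also guards against cycles
                    found.append(pred)
                    stack.append(pred)
    return found


events = [
    ChangeEvent(
        change_type="MERGER",
        date="2001-01-01",
        result="nl/noord-hollands-archief",
        predecessors=["nl/gemeentearchief-haarlem", "nl/rijksarchief-noord-holland"],
    )
]
```

Calling `predecessors_of("nl/noord-hollands-archief", events)` returns both merged archives, while an organization with no recorded change event yields an empty list.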

When to Use TOOI:

  • Extracting Dutch heritage institutions (government archives, state museums)
  • Modeling mergers, splits, reorganizations of Dutch organizations
  • Tracking historical changes to organizational structure
  • Integrating with Dutch national registries (ISIL, KvK)

When Not to Use TOOI:

  • Non-Dutch institutions (use CPOV instead)
  • Private collections without government affiliation

CPOV - EU Core Public Organisation Vocabulary

Files:

  • /data/ontology/core-public-organisation-ap.ttl (RDF schema)
  • /data/ontology/core-public-organisation-ap.jsonld (JSON-LD context)

Namespace: http://data.europa.eu/m8g/
Purpose: EU-wide vocabulary for public sector organizations (governments, NGOs, cultural institutions)

Key Classes:

  • cpov:PublicOrganisation - Any public-sector organization (base for global heritage custodians)
  • cv:ChangeEvent - Organizational change (founding, closure, name change)
  • cv:ContactPoint - Contact information for public services
  • locn:Address - Physical location details

Key Properties:

  • dct:identifier - Formal identifier (ISIL, national registry ID)
  • skos:prefLabel - Preferred name
  • skos:altLabel - Alternative names
  • dct:temporal - Temporal coverage (founding to closure)
  • cv:contactPoint - Contact details
  • locn:address - Physical address

W3C Org Ontology Integration: CPOV builds on W3C Organization Ontology:

  • org:Organization - Base organizational structure
  • org:hasUnit - Hierarchical relationships (parent-child)
  • org:linkedTo - Partnerships, networks
  • org:changedBy - Change events affecting organization

Heritage Custodian Mapping:

# LinkML schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # Maps to CPOV for EU-wide interoperability
  
  slots:
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier
    locations:
      slot_uri: locn:address
    change_history:
      slot_uri: cv:ChangeEvent

RDF Serialization Example:

@prefix cpov: <http://data.europa.eu/m8g/> .
@prefix cv: <http://data.europa.eu/m8g/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/br/biblioteca-nacional>
    a cpov:PublicOrganisation ;
    skos:prefLabel "Biblioteca Nacional do Brasil"@pt ;
    skos:altLabel "National Library of Brazil"@en, "BNB"@pt ;
    dct:identifier [
        a dct:Identifier ;
        skos:notation "BR-RjBN" ;
        dct:creator "International Standard Identifier for Libraries and Related Organisations"
    ] ;
    locn:address [
        a locn:Address ;
        locn:thoroughfare "Avenida Rio Branco, 219" ;
        locn:postCode "20040-008" ;
        locn:adminUnitL2 "Rio de Janeiro" ;
        locn:adminUnitL1 "BR"
    ] ;
    dct:temporal [
        schema:startDate "1810-01-01"^^xsd:date
    ] .

# Change event: Founding
<https://w3id.org/heritage/custodian/event/bnb-founding>
    a cv:ChangeEvent ;
    dct:date "1810-01-01"^^xsd:date ;
    dct:type "FOUNDING" ;
    dct:description "Founded by Prince Regent João (later King João VI) of Portugal as the Royal Library"@en ;
    cv:changedOrganisation <https://w3id.org/heritage/custodian/br/biblioteca-nacional> .

When to Use CPOV:

  • Extracting non-Dutch European heritage institutions (France, Germany, Belgium, etc.)
  • Modeling public-sector cultural organizations (national museums, state archives)
  • Linked Open Data alignment with aggregators such as Europeana (EU) and DPLA (US)
  • Cross-border organizational relationships (EU heritage networks)

Use with Care:

  • Global institutions outside the EU (use CPOV patterns but add regional ontologies)

When Not to Use CPOV:

  • Purely private collections (consider Schema.org schema:Organization instead)

Ontology Decision Tree for AI Agents

When designing extraction pipelines, choose the appropriate ontology:

Is the institution Dutch?
├─ YES → Use TOOI (tooi:Overheidsorganisatie)
│         Map to schemas/dutch.yaml
│         Extract ISIL codes, KvK numbers
│
└─ NO → Is the institution in the EU?
         ├─ YES → Use CPOV (cpov:PublicOrganisation)
         │         Map to schemas/core.yaml
         │         Extract EU-standard identifiers
         │
         └─ NO → Use CPOV patterns + regional ontologies
                  Example: Brazilian institutions → CPOV + national heritage codes
                  Fallback to Schema.org for private/informal collections
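
The tree above can be expressed as a small Python function. This is a sketch under assumptions: the function name, the return labels, and the `is_private_or_informal` flag are hypothetical, and EU membership is hard-coded as the 27 member states by ISO 3166-1 alpha-2 code. It follows the tree literally, so the Schema.org fallback applies only in the non-EU branch.

```python
# Assumption: the 27 EU member states, ISO 3166-1 alpha-2 codes.
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


def choose_ontology(country_code: str, is_private_or_informal: bool = False) -> str:
    """Return a label for the ontology strategy suggested by the decision tree."""
    code = country_code.upper()
    if code == "NL":
        return "TOOI"                 # map to schemas/dutch.yaml
    if code in EU_MEMBER_STATES:
        return "CPOV"                 # map to schemas/core.yaml
    if is_private_or_informal:
        return "Schema.org"           # fallback for private/informal collections
    return "CPOV+regional"            # CPOV patterns plus regional ontologies
```

For example, `choose_ontology("NL")` yields `"TOOI"` and `choose_ontology("BR")` yields `"CPOV+regional"`. A pipeline that also wants to route private Dutch or EU collections to Schema.org would need an additional check before the country tests.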

Combining Ontologies: Institutions can implement MULTIPLE ontology classes:

<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,    # Dutch government organization
      cpov:PublicOrganisation,          # EU public sector
      schema:Museum,                    # Schema.org for web discoverability
      crm:E74_Group ;                   # CIDOC-CRM for cultural heritage domain
    ...

Practical Extraction Workflow

Step 1: Read Ontology Files

Before designing extraction logic, review:

# Dutch institutions
grep -A 10 "tooi:Overheidsorganisatie" /data/ontology/tooiont.ttl

# EU/global institutions
grep -A 10 "cpov:PublicOrganisation" /data/ontology/core-public-organisation-ap.ttl

# JSON-LD context for CPOV
cat /data/ontology/core-public-organisation-ap.jsonld

Step 2: Map Conversation Data to Ontology Classes

Identify which ontology properties correspond to extracted data:

| Extracted Data   | TOOI Property                 | CPOV Property   | Schema.org          |
|------------------|-------------------------------|-----------------|---------------------|
| Institution name | tooi:officieleNaamInclSoort   | skos:prefLabel  | schema:name         |
| Founding date    | tooi:begindatum               | schema:startDate | schema:foundingDate |
| ISIL code        | tooi:organisatieIdentificatie | dct:identifier  | schema:identifier   |
| Address          | (use locn:Address)            | locn:address    | schema:address      |
| Merger event     | tooi:Wijzigingsgebeurtenis    | cv:ChangeEvent  | schema:Event        |
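
In an extraction pipeline, this mapping can be kept as a lookup table. The Python sketch below is one possible encoding; the dictionary keys (`institution_name`, `tooi`, and so on) and the function name are hypothetical, while the property values are copied from the mapping above.

```python
# Keys are hypothetical field and ontology labels; values come from the mapping above.
PROPERTY_MAP = {
    "institution_name": {
        "tooi": "tooi:officieleNaamInclSoort",
        "cpov": "skos:prefLabel",
        "schema": "schema:name",
    },
    "founding_date": {
        "tooi": "tooi:begindatum",
        "cpov": "schema:startDate",
        "schema": "schema:foundingDate",
    },
    "isil_code": {
        "tooi": "tooi:organisatieIdentificatie",
        "cpov": "dct:identifier",
        "schema": "schema:identifier",
    },
    "address": {
        "tooi": "locn:Address",   # the mapping says to use locn:Address for TOOI data
        "cpov": "locn:address",
        "schema": "schema:address",
    },
    "merger_event": {
        "tooi": "tooi:Wijzigingsgebeurtenis",
        "cpov": "cv:ChangeEvent",
        "schema": "schema:Event",
    },
}


def property_for(field: str, ontology: str) -> str:
    """Look up the ontology term for an extracted field; raises KeyError if unmapped."""
    return PROPERTY_MAP[field][ontology]
```

Raising `KeyError` on an unmapped field is a deliberate choice here: an unmapped field is a candidate schema gap and should surface rather than be silently dropped.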

Step 3: Generate RDF-Compatible LinkML

LinkML YAML automatically maps to RDF when class_uri and slot_uri are defined:

# Extraction output (LinkML YAML)
- id: https://w3id.org/heritage/custodian/nl/amsterdam-museum
  name: Amsterdam Museum
  institution_type: MUSEUM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdAM
  locations:
    - city: Amsterdam
      country: NL
  change_history:
    - event_id: https://w3id.org/heritage/custodian/event/am-renaming-2011
      change_type: NAME_CHANGE
      event_date: "2011-01-01"
      event_description: "Renamed from Amsterdams Historisch Museum to Amsterdam Museum"

Step 4: Export to RDF

LinkML automatically serializes to RDF/Turtle with ontology mappings:

# Use linkml-convert (when implemented)
linkml-convert -s schemas/heritage_custodian.yaml \
               -t ttl \
               data/instances/netherlands_batch1.yaml \
               > output/netherlands_batch1.ttl
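
To make the role of class_uri and slot_uri concrete, the following Python sketch hand-writes Turtle for two slots using the mappings from schemas/core.yaml shown earlier (name → skos:prefLabel, alternative_names → skos:altLabel). It is an illustration only, not a substitute for linkml-convert; the function name and the choice to cover just these two slots are assumptions.

```python
import json

PREFIXES = (
    "@prefix cpov: <http://data.europa.eu/m8g/> .\n"
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
)

# Slot-to-predicate mapping, mirroring the slot_uri entries in schemas/core.yaml.
SLOT_URIS = {"name": "skos:prefLabel", "alternative_names": "skos:altLabel"}


def to_turtle(record: dict) -> str:
    """Serialize one record to a Turtle block (prefix declarations not included)."""
    predicates = ["a cpov:PublicOrganisation"]        # from class_uri
    for slot, predicate in SLOT_URIS.items():
        values = record.get(slot, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            # json.dumps produces a double-quoted string whose escapes
            # (\" \\ \n \uXXXX) are also valid in Turtle string literals.
            literal = json.dumps(value, ensure_ascii=False)
            predicates.append(f"{predicate} {literal}")
    body = " ;\n".join("    " + p for p in predicates)
    return f"<{record['id']}>\n{body} .\n"


record = {
    "id": "https://w3id.org/heritage/custodian/nl/amsterdam-museum",
    "name": "Amsterdam Museum",
}
```

Concatenating `PREFIXES` with `to_turtle(record)` gives a standalone Turtle document in which the museum is typed as cpov:PublicOrganisation and labelled with skos:prefLabel.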

Extension Guidelines for AI Agents

When extracting data reveals a gap in the schema, follow this process:

1. Document the Gap

  • What data was found? (exact field values, institution names)
  • Why doesn't existing schema fit? (explain semantic mismatch)
  • How many instances? (frequency of occurrence)
  • Geographic/domain scope? (is this regional or global?)

2. Research Base Ontologies

Check existing ontologies for appropriate mappings (in priority order):

  1. TOOI (/data/ontology/tooiont.ttl) - Dutch government organizations (if applicable)
  2. CPOV (/data/ontology/core-public-organisation-ap.ttl) - EU public sector organizations
  3. Schema.org (/data/ontology/schemaorg.owl) - Web semantics, broad coverage
  4. CIDOC-CRM (/data/ontology/CIDOC_CRM_v7.1.3.rdf) - Cultural heritage domain
  5. RiC-O (Records in Contexts) - Archival description
  6. BIBFRAME - Bibliographic resources
  7. Dublin Core (dcterms:) - Metadata elements

Prefer existing ontology classes over inventing new ones.

Search Strategy:

# Search for relevant classes in ontologies
rg "Organisatie|Organization|Museum|Archive" /data/ontology/*.ttl
rg "ChangeEvent|Wijziging|Merger" /data/ontology/*.ttl

3. Propose Extension

Create a proposal including:

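  • Enum/slot name: Follow the naming conventions used in this schema (snake_case for slots such as platform_url, CamelCase for enum names such as DigitalPlatformTypeEnum, UPPER_CASE for enum values such as LEARNING_MANAGEMENT)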
  • Description: Clear, concise explanation of the concept
  • Meaning: Link to base ontology class (meaning: schema:ClassName)
  • Use cases: Minimum 2-3 real-world use cases
  • RDF example: Show how it serializes to RDF
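
A proposal can be checked mechanically before review. The Python sketch below assumes a proposal is a plain dictionary with the hypothetical keys `name`, `description`, `meaning`, `use_cases`, and `rdf_example`, and it treats "minimum 2-3" use cases as at least two.

```python
import re

SLOT_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")    # snake_case
ENUM_VALUE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")   # UPPER_CASE
REQUIRED_FIELDS = ("name", "description", "meaning", "use_cases", "rdf_example")


def proposal_problems(proposal: dict, kind: str) -> list:
    """List problems in a proposal; kind is 'enum' (enum value) or 'slot'."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not proposal.get(f)]
    pattern = ENUM_VALUE if kind == "enum" else SLOT_NAME
    name = proposal.get("name")
    if name and not pattern.match(name):
        problems.append("name violates naming convention")
    use_cases = proposal.get("use_cases")
    if use_cases and len(use_cases) < 2:
        problems.append("fewer than 2 use cases")
    return problems


good = {
    "name": "LEARNING_MANAGEMENT",
    "description": "Learning management systems for heritage education",
    "meaning": "schema:LearningResource",
    "use_cases": ["Heritage Education Tracking", "Platform Integration Mapping"],
    "rdf_example": "<...> heritage:platform_type \"LEARNING_MANAGEMENT\" .",
}
```

An empty result means the proposal is structurally complete; it says nothing about whether the chosen ontology class is semantically appropriate, which still needs human review.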

4. Validate with Real Data

  • Test the extension against the data that revealed the gap
  • Check if it applies to other extracted datasets
  • Ensure backward compatibility (prefer additive changes)

5. Update Documentation

  • Add entry to this file (ONTOLOGY_EXTENSIONS.md)
  • Update schema version number if needed
  • Note affected files and line numbers
  • Document validation results

Schema Evolution Principles

1. Ontology Reuse Over Invention

Always prefer:

  • Existing ontology classes (Schema.org, CIDOC-CRM, RiC-O)
  • Widely adopted standards (Dublin Core, BIBFRAME)
  • Industry conventions (ISIL codes, Wikidata identifiers)

Avoid:

  • Inventing new properties when existing ones exist
  • Creating parallel taxonomies to established standards
  • Over-specialization (prefer general + description field)

2. Additive Changes > Breaking Changes

Safe changes (additive):

  • Add new enum values
  • Add optional slots
  • Add new classes
  • Expand multivalued slots

Breaking changes (avoid):

  • Remove enum values
  • Change slot ranges
  • Make optional slots required
  • Rename classes/slots

If breaking change is necessary:

  • Document migration path in /docs/MIGRATION.md
  • Provide conversion script in /scripts/migrations/
  • Bump the minor version number (0.2.x → 0.3.0); while the schema is pre-1.0, a minor bump signals a breaking change
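
The distinction between safe and breaking slot changes, and the resulting version bump, can be sketched in Python. The slot dictionaries with `range` and `required` keys are a simplified, hypothetical stand-in for LinkML slot definitions. The bump rule follows this document's own history: an additive change moved 0.2.0 to 0.2.1, and a breaking change moves 0.2.x to 0.3.0.

```python
from __future__ import annotations


def is_breaking(old_slot: dict | None, new_slot: dict | None) -> bool:
    """Classify a single slot change; None means the slot is absent."""
    if old_slot is None:
        # A new slot is safe only if it is optional.
        return bool(new_slot and new_slot.get("required", False))
    if new_slot is None:
        return True                                   # removed or renamed
    if old_slot.get("range") != new_slot.get("range"):
        return True                                   # range changed
    # Optional slot made required.
    return not old_slot.get("required", False) and new_slot.get("required", False)


def next_version(version: str, breaking: bool) -> str:
    """Bump the patch number for additive changes, the minor number for breaking ones."""
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For instance, adding an optional `course_url` slot is not breaking and takes 0.2.1 to 0.2.2, whereas changing a slot's range takes it to 0.3.0.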

3. Evidence-Based Extensions

Require:

  • Minimum 2-3 real-world instances found in extraction
  • Clear semantic gap (no existing enum/slot fits)
  • Use case justification (why is this distinction important?)

Don't extend for:

  • Single outlier instances (use free-text description instead)
  • Regional idiosyncrasies (consider a region-specific extension module instead, as was done for Dutch institutions)
  • Speculative future needs (extend when needed, not preemptively)
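
These criteria amount to a short decision rule, sketched below in Python. The function name, the return labels, and the default threshold of two instances (the lower bound of "2-3") are assumptions; the use-case justification remains a human judgment and is not modelled.

```python
def extension_decision(instance_count: int,
                       existing_value_fits: bool,
                       min_instances: int = 2) -> str:
    """Suggest how to handle a candidate schema gap."""
    if existing_value_fits:
        return "use existing value"            # no semantic gap
    if instance_count < min_instances:
        return "use free-text description"     # single outlier
    return "propose extension"
```

The LEARNING_MANAGEMENT case (three instances, no fitting value) falls into the last branch.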

4. Semantic Clarity

Good enum/slot names:

  • LEARNING_MANAGEMENT - Clear, unambiguous, scoped to heritage education
  • collection_type - Flexible, allows domain-specific values
  • platform_url - Self-explanatory, no ambiguity

Poor enum/slot names:

  • SYSTEM - Too generic, unclear semantics
  • other_stuff - Vague, unmaintainable
  • lms - Abbreviation, unclear to non-experts

5. Balance Granularity and Usability

Too coarse:

# BAD: Loses semantic precision
platform_type: GENERIC
notes: "This is a learning management system"

Too fine-grained:

# BAD: Unmaintainable, too many enums
platform_type: MOODLE_LMS
platform_type: GOOGLE_CLASSROOM_LMS
platform_type: BLACKBOARD_LMS
platform_type: CANVAS_LMS

Just right:

# GOOD: Semantic category + specific name
platform_type: LEARNING_MANAGEMENT
platform_name: "Moodle"
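
During extraction, the "just right" shape can be produced by classifying on product name while preserving the name itself. The Python sketch below is illustrative: the product list comes from the LEARNING_MANAGEMENT description earlier in this document, the function name is hypothetical, and plain substring matching is a simplification that could misfire on unrelated names containing a word like "canvas".

```python
# Products named in the LEARNING_MANAGEMENT enum description.
LMS_PRODUCTS = ("moodle", "google classroom", "blackboard", "canvas")


def classify_platform(raw_name: str) -> dict:
    """Return a semantic category plus the specific platform name."""
    lowered = raw_name.lower()
    if any(product in lowered for product in LMS_PRODUCTS):
        platform_type = "LEARNING_MANAGEMENT"
    else:
        platform_type = "GENERIC"   # placeholder; other categories are not handled here
    return {"platform_type": platform_type, "platform_name": raw_name}
```

Keeping the raw name in `platform_name` means no information is lost even when the category assignment is later revised.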

Future Extension Candidates

These are potential extensions identified but not yet implemented (waiting for more evidence):

CollectionTypeEnum

Status: Under review
Current Implementation: Free text (collection_type: string)
Found in Libyan Data:

  • "archaeological", "bibliographic", "archival" (standard)
  • "historical", "architectural", "mixed", "digital objects" (non-standard)

Proposal: Create optional controlled vocabulary while keeping free text fallback

Questions:

  • Is there an existing standard (AAT, LCSH subject headings)?
  • Would enum improve data quality or restrict flexibility?
  • Do different countries use different typologies?

Decision: Defer until we have 50+ institutions to analyze usage patterns.
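
The usage analysis that this decision waits on could look like the Python sketch below. It assumes each institution record carries a single free-text `collection_type` string, as in the current implementation; the function names and sample records are hypothetical.

```python
from collections import Counter

# Values described above as standard.
STANDARD_TYPES = {"archaeological", "bibliographic", "archival"}


def collection_type_usage(institutions: list) -> Counter:
    """Count normalized collection_type values across institution records."""
    return Counter(
        inst["collection_type"].strip().lower()
        for inst in institutions
        if inst.get("collection_type")
    )


def non_standard_types(usage: Counter) -> set:
    """Values in use that fall outside the standard set."""
    return set(usage) - STANDARD_TYPES


def ready_for_review(institutions: list, min_institutions: int = 50) -> bool:
    """True once enough institutions are available to analyze usage patterns."""
    return len(institutions) >= min_institutions


sample = [
    {"collection_type": "Archival"},
    {"collection_type": "archival "},
    {"collection_type": "digital objects"},
]
```

With the three-record sample, "archival" is counted twice after normalization, "digital objects" is flagged as non-standard, and the 50-institution threshold is not yet met.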


UNESCO Heritage Status

Status: Adequate (no extension needed)
Current Implementation: Use Identifier class with identifier_scheme: UNESCO_WHC

Found in Libyan Data:

  • 5 UNESCO World Heritage Sites with WHC identifiers
  • Status changes tracked via ChangeEvent (inscription, delisting)

Conclusion: Current schema handles this well. No extension needed.


War/Conflict Heritage Markers

Status: Monitoring
Found in Libyan Data:

  • Misrata War Museum (2011 Libyan Civil War)
  • Tobruk WWII Commonwealth War Cemetery

Current Handling: Use description field + subjects in Collection class

Question: Should we add conflict_period or war_era enum for specialized search?

Decision: Monitor usage across more conflict-affected countries (Syria, Yemen, Bosnia). Defer extension for now.


References

  • Base Ontologies: /data/ontology/ directory
    • CIDOC_CRM_v7.1.3.rdf - Cultural heritage modeling
    • schemaorg.owl - Schema.org vocabulary
  • LinkML Documentation: https://linkml.io/linkml/
  • Schema Design Patterns: /docs/plan/global_glam/05-design-patterns.md
  • Data Standardization: /docs/plan/global_glam/04-data-standardization.md

Maintained by: GLAM Data Extraction Project
Last Updated: 2025-11-09
Schema Version: 0.2.1 (development)