kempersc fa5680f0dd Add initial versions of custodian hub UML diagrams in Mermaid and PlantUML formats

- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.

2025-11-22 14:33:51 +01:00


Ontology Extensions and Schema Evolution

This document tracks extensions to the Heritage Custodian LinkML schema based on real-world data extraction findings. All extensions are mapped to base ontologies (CIDOC-CRM, Schema.org, RiC-O, etc.) to maintain semantic interoperability.

Version History

| Version | Date | Description |
|---------|------|-------------|
| 0.2.1 | 2025-11-09 | Added LEARNING_MANAGEMENT to DigitalPlatformTypeEnum (Libyan extraction) |
| 0.2.0 | 2025-11-05 | Modular schema reorganization |

Extensions Log

2025-11-09: LEARNING_MANAGEMENT Platform Type

Schema File: schemas/enums.yaml
Enum: DigitalPlatformTypeEnum
Added Value: LEARNING_MANAGEMENT

Gap Identified

During extraction of Libyan heritage institutions, 3 universities (Misurata, Benghazi, University of Tripoli) were found using learning management systems (Google Classroom, Moodle) for heritage education and digital resource delivery. The existing DigitalPlatformTypeEnum did not have an appropriate category for LMS platforms.

Source Data:

data/instances/libya_universities_batch1.json (lines 78, 190, 286)
Misurata University: Google Classroom Integration
Benghazi University: Moodle platform for heritage courses
University of Tripoli: Moodle integration

Original Schema Coverage:

COLLECTION_MANAGEMENT ❌ (too specific - for museum/archive systems)
DIGITAL_REPOSITORY ❌ (for digital preservation, not learning)
DISCOVERY_PORTAL ❌ (for search/discovery, not education)
WEBSITE ❌ (too generic)
GENERIC ❌ (too generic, loses semantic meaning)

Proposal

Add LEARNING_MANAGEMENT to DigitalPlatformTypeEnum:

LEARNING_MANAGEMENT:
  description: Learning management systems for heritage education (Moodle, Google Classroom, Blackboard, Canvas)
  meaning: schema:LearningResource

Ontology Mapping

Base Ontology: Schema.org
Class: schema:LearningResource
Reference: https://schema.org/LearningResource

RDF Serialization:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/ly/misurata-lms> a heritage:DigitalPlatform ;
    heritage:platform_name "Google Classroom Integration" ;
    heritage:platform_type "LEARNING_MANAGEMENT" ;
    rdf:type schema:LearningResource ;
    schema:isPartOf <https://w3id.org/heritage/custodian/ly/misurata-university> .

Use Cases

Heritage Education Tracking: Document how institutions deliver heritage education digitally
Platform Integration Mapping: Identify which LMS platforms are used in the heritage sector
E-Learning Resource Discovery: Enable discovery of heritage learning platforms
Digital Pedagogy Research: Support research on digital heritage education methods

Implementation

Status: ✅ Implemented (2025-11-09)

Affected Files:

schemas/enums.yaml (lines 191-212, added LEARNING_MANAGEMENT at line 208)

Validation:

Libyan extraction data now validates correctly
3 institutions using LEARNING_MANAGEMENT platform type

Backward Compatibility:

New enum value is additive (non-breaking change)
Existing data unaffected
Future extractions can use new value
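
The additive nature of this change can be illustrated with a small check. This is a minimal sketch, not the project's validator: `ORIGINAL_VALUES` lists only the five values named in this entry (the real enum in schemas/enums.yaml may contain more), and `validate_platform_type` is a hypothetical helper introduced for illustration.

```python
# Values of DigitalPlatformTypeEnum named in this entry.
# Assumption: the real enum in schemas/enums.yaml may hold further values.
ORIGINAL_VALUES = frozenset({
    "COLLECTION_MANAGEMENT",
    "DIGITAL_REPOSITORY",
    "DISCOVERY_PORTAL",
    "WEBSITE",
    "GENERIC",
})

# Version 0.2.1 only adds a value; nothing is removed or renamed.
EXTENDED_VALUES = ORIGINAL_VALUES | {"LEARNING_MANAGEMENT"}


def validate_platform_type(value: str, allowed: frozenset) -> bool:
    """Return True when `value` is a permissible platform type (hypothetical helper)."""
    return value in allowed


# Every value that was valid before the extension is still valid after it.
assert all(validate_platform_type(v, EXTENDED_VALUES) for v in ORIGINAL_VALUES)
```

Because the extended set is a strict superset of the original, data written against version 0.2.0 keeps validating unchanged.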

Similar Patterns in Other Domains:

Schema.org schema:Course - For structured course information
LTI (Learning Tools Interoperability) - Standard for LMS integration
LRMI (Learning Resource Metadata Initiative) - Metadata for learning resources

Future Extensions:

Consider adding course_url slot to DigitalPlatform for linking to specific courses
May need MetadataStandardEnum value for LRMI if heritage institutions adopt it

Integrating TOOI and CPOV Ontologies

The GLAM project builds on two foundational ontologies for organizational data modeling. AI agents should always consult these ontologies when designing extraction pipelines or extending the schema.

TOOI - Dutch Government Organizational Ontology

File: /data/ontology/tooiont.ttl
Namespace: https://identifier.overheid.nl/tooi/def/ont/
Purpose: Model Dutch government organizations, their lifecycle events, and temporal changes

Key Classes:

tooi:Overheidsorganisatie - Government organization (base for DutchHeritageCustodian)
tooi:Wijzigingsgebeurtenis - Change event (merger, split, closure)
tooi:organisatieIdentificatie - Organizational identifiers

Key Properties:

tooi:officieleNaamInclSoort - Official name including organizational type
tooi:begindatum - Start date (founding, change effective date)
tooi:einddatum - End date (closure, change expiry)
tooi:resultaat - Resulting organization from change event
tooi:voorafgaandeOrganisatie - Predecessor organization

PROV-O Integration: TOOI uses PROV-O (W3C Provenance Ontology) for temporal tracking:

Change events as prov:Activity
Organizations linked via prov:wasInfluencedBy and prov:generated
Temporal bounds via prov:atTime

Heritage Custodian Mapping:

# LinkML schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # Maps to TOOI base class
  
  slots:
    - isil_code  # Maps to tooi:organisatieIdentificatie
    - change_history  # Maps to tooi:Wijzigingsgebeurtenis

RDF Serialization Example:

@prefix tooi: <https://identifier.overheid.nl/tooi/def/ont/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix heritage: <https://w3id.org/heritage/custodian/> .

<https://w3id.org/heritage/custodian/nl/noord-hollands-archief>
    a tooi:Overheidsorganisatie, heritage:HeritageCustodian ;
    tooi:officieleNaamInclSoort "Noord-Hollands Archief" ;
    tooi:begindatum "2001-01-01"^^xsd:date ;
    heritage:institution_type "ARCHIVE" ;
    heritage:isil_code "NL-HlmNHA" .

# Change event: Merger of two archives
<https://w3id.org/heritage/custodian/event/nha-merger-2001>
    a tooi:Wijzigingsgebeurtenis, prov:Activity ;
    prov:atTime "2001-01-01T00:00:00Z"^^xsd:dateTime ;
    tooi:resultaat <https://w3id.org/heritage/custodian/nl/noord-hollands-archief> ;
    tooi:voorafgaandeOrganisatie 
        <https://w3id.org/heritage/custodian/nl/gemeentearchief-haarlem>,
        <https://w3id.org/heritage/custodian/nl/rijksarchief-noord-holland> ;
    heritage:change_type "MERGER" ;
    heritage:event_description "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland" .

When to Use TOOI:

✅ Extracting Dutch heritage institutions (government archives, state museums)
✅ Modeling mergers, splits, reorganizations of Dutch organizations
✅ Tracking historical changes to organizational structure
✅ Integrating with Dutch national registries (ISIL, KvK)
❌ Non-Dutch institutions (use CPOV instead)
❌ Private collections without government affiliation

CPOV - EU Core Public Organisation Vocabulary

Files:

/data/ontology/core-public-organisation-ap.ttl (RDF schema)
/data/ontology/core-public-organisation-ap.jsonld (JSON-LD context)

Namespace: http://data.europa.eu/m8g/
Purpose: EU-wide vocabulary for public sector organizations (governments, NGOs, cultural institutions)

Key Classes:

cpov:PublicOrganisation - Any public-sector organization (base for global heritage custodians)
cv:ChangeEvent - Organizational change (founding, closure, name change)
cv:ContactPoint - Contact information for public services
locn:Address - Physical location details

Key Properties:

dct:identifier - Formal identifier (ISIL, national registry ID)
skos:prefLabel - Preferred name
skos:altLabel - Alternative names
dct:temporal - Temporal coverage (founding to closure)
cv:contactPoint - Contact details
locn:address - Physical address

W3C Org Ontology Integration: CPOV builds on W3C Organization Ontology:

org:Organization - Base organizational structure
org:hasUnit - Hierarchical relationships (parent-child)
org:linkedTo - Partnerships, networks
org:changedBy - Change events affecting organization

Heritage Custodian Mapping:

# LinkML schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # Maps to CPOV for EU-wide interoperability

  attributes:  # inline slot definitions (a class-level `slots:` key only accepts a list of slot names)
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier
    locations:
      slot_uri: locn:address
    change_history:
      slot_uri: org:changedBy  # property linking the organization to its cv:ChangeEvent instances

RDF Serialization Example:

@prefix cpov: <http://data.europa.eu/m8g/> .
@prefix cv: <http://data.europa.eu/m8g/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/heritage/custodian/br/biblioteca-nacional>
    a cpov:PublicOrganisation ;
    skos:prefLabel "Biblioteca Nacional do Brasil"@pt ;
    skos:altLabel "National Library of Brazil"@en, "BNB"@pt ;
    dct:identifier [
        a dct:Identifier ;
        skos:notation "BR-RjBN" ;
        dct:creator "International Standard Identifier for Libraries and Related Organisations"
    ] ;
    locn:address [
        a locn:Address ;
        locn:thoroughfare "Avenida Rio Branco, 219" ;
        locn:postCode "20040-008" ;
        locn:adminUnitL2 "Rio de Janeiro" ;
        locn:adminUnitL1 "BR"
    ] ;
    dct:temporal [
        schema:startDate "1810-01-01"^^xsd:date
    ] .

# Change event: Founding
<https://w3id.org/heritage/custodian/event/bnb-founding>
    a cv:ChangeEvent ;
    dct:date "1810-01-01"^^xsd:date ;
    dct:type "FOUNDING" ;
    dct:description "Founded by King João VI of Portugal as Royal Library"@en ;
    cv:changedOrganisation <https://w3id.org/heritage/custodian/br/biblioteca-nacional> .

When to Use CPOV:

✅ Extracting non-Dutch European heritage institutions (France, Germany, Belgium, etc.)
✅ Modeling public-sector cultural organizations (national museums, state archives)
✅ EU Linked Open Data alignment (e.g. Europeana; note that DPLA is a US aggregator, not an EU one)
✅ Cross-border organizational relationships (EU heritage networks)
⚠️ Global institutions outside EU (use CPOV patterns but add regional ontologies)
❌ Purely private collections (consider Schema.org schema:Organization instead)

Ontology Decision Tree for AI Agents

When designing extraction pipelines, choose the appropriate ontology:

Is the institution Dutch?
├─ YES → Use TOOI (tooi:Overheidsorganisatie)
│         Map to schemas/dutch.yaml
│         Extract ISIL codes, KvK numbers
│
└─ NO → Is the institution in the EU?
         ├─ YES → Use CPOV (cpov:PublicOrganisation)
         │         Map to schemas/core.yaml
         │         Extract EU-standard identifiers
         │
         └─ NO → Use CPOV patterns + regional ontologies
                  Example: Brazilian institutions → CPOV + national heritage codes
                  Fallback to Schema.org for private/informal collections
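
The decision tree above can be sketched as a small function. This is an illustration under stated assumptions rather than project code: `choose_ontology`, `OntologyChoice`, and the `private_or_informal` flag are hypothetical names, and the EU membership set reflects the 27 member states by ISO 3166-1 alpha-2 code.

```python
from typing import NamedTuple

# ISO 3166-1 alpha-2 codes of the 27 EU member states.
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


class OntologyChoice(NamedTuple):
    ontology: str
    class_uri: str
    needs_regional_ontology: bool


def choose_ontology(country_code: str, private_or_informal: bool = False) -> OntologyChoice:
    """Follow the decision tree: Dutch -> TOOI, EU -> CPOV, otherwise CPOV patterns."""
    code = country_code.strip().upper()
    if code == "NL":
        return OntologyChoice("TOOI", "tooi:Overheidsorganisatie", False)
    if code in EU_MEMBER_STATES:
        return OntologyChoice("CPOV", "cpov:PublicOrganisation", False)
    if private_or_informal:
        # Fallback named in the last branch of the tree.
        return OntologyChoice("Schema.org", "schema:Organization", False)
    return OntologyChoice("CPOV", "cpov:PublicOrganisation", True)


print(choose_ontology("BR"))
```

The Dutch check runs before the EU check because the Netherlands is itself an EU member state; reversing the order would route Dutch institutions to CPOV.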

Combining Ontologies: Institutions can implement MULTIPLE ontology classes:

<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,    # Dutch government organization
      cpov:PublicOrganisation,          # EU public sector
      schema:Museum,                    # Schema.org for web discoverability
      crm:E74_Group ;                   # CIDOC-CRM for cultural heritage domain
    ...

Practical Extraction Workflow

Step 1: Read Ontology Files

Before designing extraction logic, review:

# Dutch institutions
grep -A 10 "tooi:Overheidsorganisatie" /data/ontology/tooiont.ttl

# EU/global institutions
grep -A 10 "cpov:PublicOrganisation" /data/ontology/core-public-organisation-ap.ttl

# JSON-LD context for CPOV
cat /data/ontology/core-public-organisation-ap.jsonld

Step 2: Map Conversation Data to Ontology Classes

Identify which ontology properties correspond to extracted data:

| Extracted Data | TOOI Property | CPOV Property | Schema.org |
|----------------|---------------|---------------|------------|
| Institution name | `tooi:officieleNaamInclSoort` | `skos:prefLabel` | `schema:name` |
| Founding date | `tooi:begindatum` | `schema:startDate` | `schema:foundingDate` |
| ISIL code | `tooi:organisatieIdentificatie` | `dct:identifier` | `schema:identifier` |
| Address | (use `locn:Address`) | `locn:address` | `schema:address` |
| Merger event | `tooi:Wijzigingsgebeurtenis` | `cv:ChangeEvent` | `schema:Event` |
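
In an extraction pipeline, the mapping table above could be held as a lookup structure. The sketch below is one possible shape, not an existing project module: the field keys (`institution_name`, `founding_date`, and so on) and the `property_for` helper are hypothetical names chosen for illustration.

```python
# Mapping table from Step 2, keyed by a hypothetical field name and then by ontology.
PROPERTY_MAP = {
    "institution_name": {
        "TOOI": "tooi:officieleNaamInclSoort",
        "CPOV": "skos:prefLabel",
        "Schema.org": "schema:name",
    },
    "founding_date": {
        "TOOI": "tooi:begindatum",
        "CPOV": "schema:startDate",
        "Schema.org": "schema:foundingDate",
    },
    "isil_code": {
        "TOOI": "tooi:organisatieIdentificatie",
        "CPOV": "dct:identifier",
        "Schema.org": "schema:identifier",
    },
    "address": {
        # The table gives no TOOI-native property and points to locn:Address instead.
        "TOOI": "locn:Address",
        "CPOV": "locn:address",
        "Schema.org": "schema:address",
    },
    "merger_event": {
        "TOOI": "tooi:Wijzigingsgebeurtenis",
        "CPOV": "cv:ChangeEvent",
        "Schema.org": "schema:Event",
    },
}


def property_for(field: str, ontology: str) -> str:
    """Look up the ontology term for an extracted field; raise KeyError if unmapped."""
    try:
        return PROPERTY_MAP[field][ontology]
    except KeyError:
        raise KeyError(f"no mapping for field {field!r} in ontology {ontology!r}") from None


print(property_for("founding_date", "TOOI"))
```

Raising on an unmapped field, rather than returning a default, makes schema gaps visible during extraction instead of silently dropping data.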

Step 3: Generate RDF-Compatible LinkML

LinkML YAML automatically maps to RDF when class_uri and slot_uri are defined:

# Extraction output (LinkML YAML)
- id: https://w3id.org/heritage/custodian/nl/amsterdam-museum
  name: Amsterdam Museum
  institution_type: MUSEUM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdAM
  locations:
    - city: Amsterdam
      country: NL
  change_history:
    - event_id: https://w3id.org/heritage/custodian/event/am-renaming-2011
      change_type: NAME_CHANGE
      event_date: "2011-01-01"
      event_description: "Renamed from Amsterdams Historisch Museum to Amsterdam Museum"
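
To make the role of class_uri and slot_uri concrete, the following sketch turns a flat record into Turtle by hand. It is an illustration of the idea only and not how LinkML's own converters are implemented: `SLOT_URIS`, `CLASS_URI`, and `record_to_turtle` are hypothetical names, only string-valued slots are handled, and prefix declarations are omitted.

```python
# Hypothetical slot-to-property table mirroring slot_uri values used for HeritageCustodian.
SLOT_URIS = {
    "name": "skos:prefLabel",
    "alternative_names": "skos:altLabel",
}
CLASS_URI = "cpov:PublicOrganisation"


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_turtle(record: dict) -> str:
    """Render one record as a Turtle subject block (prefix declarations not included)."""
    pairs = [f"a {CLASS_URI}"]
    for slot, prop in SLOT_URIS.items():
        value = record.get(slot)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        objects = ", ".join(f'"{escape_literal(v)}"' for v in values)
        pairs.append(f"{prop} {objects}")
    body = " ;\n".join(f"    {pair}" for pair in pairs)
    return f"<{record['id']}>\n{body} ."


example = {
    "id": "https://w3id.org/heritage/custodian/nl/amsterdam-museum",
    "name": "Amsterdam Museum",
    "institution_type": "MUSEUM",  # ignored here: no slot_uri in the table above
}
print(record_to_turtle(example))
```

Slots without an entry in the table are skipped, which mirrors the point of this step: a slot only becomes an RDF property once a slot_uri is declared for it.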

Step 4: Export to RDF

LinkML serializes instance data to RDF/Turtle using the class_uri and slot_uri mappings defined in the schema:

# Use linkml-convert (when implemented)
linkml-convert -s schemas/heritage_custodian.yaml \
               -t ttl \
               data/instances/netherlands_batch1.yaml \
               > output/netherlands_batch1.ttl

Extension Guidelines for AI Agents

When extracting data reveals a gap in the schema, follow this process:

1. Document the Gap

What data was found? (exact field values, institution names)
Why doesn't the existing schema fit? (explain semantic mismatch)
How many instances? (frequency of occurrence)
Geographic/domain scope? (is this regional or global?)

2. Research Base Ontologies

Check existing ontologies for appropriate mappings (in priority order):

TOOI (/data/ontology/tooiont.ttl) - Dutch government organizations (if applicable)
CPOV (/data/ontology/core-public-organisation-ap.ttl) - EU public sector organizations
Schema.org (/data/ontology/schemaorg.owl) - Web semantics, broad coverage
CIDOC-CRM (/data/ontology/CIDOC_CRM_v7.1.3.rdf) - Cultural heritage domain
RiC-O (Records in Contexts) - Archival description
BIBFRAME - Bibliographic resources
Dublin Core (dcterms:) - Metadata elements

Prefer existing ontology classes over inventing new ones.

Search Strategy:

# Search for relevant classes in ontologies
rg "Organisatie|Organization|Museum|Archive" /data/ontology/*.ttl
rg "ChangeEvent|Wijziging|Merger" /data/ontology/*.ttl

3. Propose Extension

Create a proposal including:

Enum/slot name: Follow LinkML naming conventions (snake_case for slots, CamelCase for enum names such as DigitalPlatformTypeEnum, UPPER_SNAKE_CASE for enum values)
Description: Clear, concise explanation of the concept
Meaning: Link to base ontology class (meaning: schema:ClassName)
Use cases: Minimum 2-3 real-world use cases
RDF example: Show how it serializes to RDF
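
A proposal's enum entry can be drafted mechanically once name, description, and meaning are known. The helper below is a hedged sketch: `render_enum_value` and the naming pattern are assumptions made for illustration, and the output only covers the `description` and `meaning` keys shown in the LEARNING_MANAGEMENT entry.

```python
import json
import re

# Assumed naming rule for enum values: upper-case words joined by underscores.
ENUM_VALUE_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$")


def render_enum_value(name: str, description: str, meaning: str) -> str:
    """Return a YAML fragment for one permissible value (hypothetical helper)."""
    if not ENUM_VALUE_PATTERN.match(name):
        raise ValueError(f"enum value {name!r} is not UPPER_SNAKE_CASE")
    if ":" not in meaning:
        raise ValueError("meaning must be a CURIE such as schema:LearningResource")
    # json.dumps yields a double-quoted string, which is also valid YAML.
    return f"{name}:\n  description: {json.dumps(description)}\n  meaning: {meaning}\n"


print(render_enum_value(
    "LEARNING_MANAGEMENT",
    "Learning management systems for heritage education",
    "schema:LearningResource",
))
```

Rejecting names such as `lms` up front enforces the naming convention before a proposal reaches review.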

4. Validate with Real Data

Test the extension against the data that revealed the gap
Check if it applies to other extracted datasets
Ensure backward compatibility (prefer additive changes)

5. Update Documentation

Add entry to this file (ONTOLOGY_EXTENSIONS.md)
Update schema version number if needed
Note affected files and line numbers
Document validation results

Schema Evolution Principles

1. Ontology Reuse Over Invention

Always prefer:

Existing ontology classes (Schema.org, CIDOC-CRM, RiC-O)
Widely adopted standards (Dublin Core, BIBFRAME)
Industry conventions (ISIL codes, Wikidata identifiers)

Avoid:

Inventing new properties when existing ones exist
Creating parallel taxonomies to established standards
Over-specialization (prefer general + description field)

2. Additive Changes > Breaking Changes

Safe changes (additive):

✅ Add new enum values
✅ Add optional slots
✅ Add new classes
✅ Expand multivalued slots

Breaking changes (avoid):

❌ Remove enum values
❌ Change slot ranges
❌ Make optional slots required
❌ Rename classes/slots
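
For enum values specifically, the distinction between safe and breaking changes can be checked by comparing the value sets of two schema versions. This is a minimal sketch with a hypothetical helper name, `classify_enum_change`; it does not cover slot ranges or required flags.

```python
def classify_enum_change(old_values, new_values) -> str:
    """Classify a change to an enum's permissible values.

    Returns "breaking" if any old value disappeared (a rename shows up as a
    removal plus an addition), "additive" if values were only added, and
    "unchanged" otherwise.
    """
    old, new = set(old_values), set(new_values)
    if old - new:
        return "breaking"
    if new - old:
        return "additive"
    return "unchanged"


print(classify_enum_change({"WEBSITE", "GENERIC"}, {"WEBSITE", "GENERIC", "LEARNING_MANAGEMENT"}))
```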

If a breaking change is necessary:

Document migration path in /docs/MIGRATION.md
Provide conversion script in /scripts/migrations/
Bump the minor version number (0.2.x → 0.3.0); while the schema is pre-1.0, the minor component signals breaking changes
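
The versioning pattern used in this document can be expressed as a short function. This is a sketch under one assumption inferred from the version history: additive changes increment the last component (0.2.0 → 0.2.1 for LEARNING_MANAGEMENT), while breaking changes move 0.2.x to 0.3.0. The name `bump_version` is hypothetical.

```python
def bump_version(version: str, breaking: bool) -> str:
    """Return the next schema version for a three-part version string.

    Breaking change: increment the second component and reset the third.
    Additive change: increment the third component (assumed from the version history).
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(bump_version("0.2.1", breaking=True))
```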

3. Evidence-Based Extensions

Require:

Minimum 2-3 real-world instances found in extraction
Clear semantic gap (no existing enum/slot fits)
Use case justification (why is this distinction important?)

Don't extend for:

Single outlier instances (use free-text description instead)
Regional idiosyncrasies (consider Dutch-specific extension module)
Speculative future needs (extend when needed, not preemptively)
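
The evidence requirement can be applied programmatically to extraction findings. The sketch below assumes findings arrive as (candidate value, institution) pairs and uses the lower bound of the "2-3 instances" rule as its default; `candidates_with_evidence` is a hypothetical helper, and `PODCAST_HOSTING` is an invented outlier used only to show the filter.

```python
from collections import defaultdict


def candidates_with_evidence(observations, min_instances: int = 2) -> dict:
    """Group observations by candidate value and keep those seen at enough institutions.

    `observations` is an iterable of (candidate_value, institution_name) pairs.
    Institutions are de-duplicated, so repeated sightings at one place count once.
    """
    seen = defaultdict(set)
    for candidate, institution in observations:
        seen[candidate].add(institution)
    return {
        candidate: sorted(institutions)
        for candidate, institutions in seen.items()
        if len(institutions) >= min_instances
    }


observations = [
    ("LEARNING_MANAGEMENT", "Misurata University"),
    ("LEARNING_MANAGEMENT", "Benghazi University"),
    ("LEARNING_MANAGEMENT", "University of Tripoli"),
    ("PODCAST_HOSTING", "Example Museum"),  # invented single outlier
]
print(candidates_with_evidence(observations))
```

Single outliers drop out of the result, which matches the guidance to record them in a free-text description instead.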

4. Semantic Clarity

Good enum/slot names:

LEARNING_MANAGEMENT - Clear, unambiguous, scoped to heritage education
collection_type - Flexible, allows domain-specific values
platform_url - Self-explanatory, no ambiguity

Poor enum/slot names:

SYSTEM - Too generic, unclear semantics
other_stuff - Vague, unmaintainable
lms - Abbreviation, unclear to non-experts

5. Balance Granularity and Usability

Too coarse:

# BAD: Loses semantic precision
platform_type: GENERIC
notes: "This is a learning management system"

Too fine-grained:

# BAD: Unmaintainable, too many enums
platform_type: MOODLE_LMS
platform_type: GOOGLE_CLASSROOM_LMS
platform_type: BLACKBOARD_LMS
platform_type: CANVAS_LMS

Just right:

# GOOD: Semantic category + specific name
platform_type: LEARNING_MANAGEMENT
platform_name: "Moodle"

Future Extension Candidates

These are potential extensions identified but not yet implemented (waiting for more evidence):

CollectionTypeEnum

Status: ⏳ Under review
Current Implementation: Free text (collection_type: string)
Found in Libyan Data:

"archaeological", "bibliographic", "archival" (standard)
"historical", "architectural", "mixed", "digital objects" (non-standard)

Proposal: Create optional controlled vocabulary while keeping free text fallback

Questions:

Is there an existing standard (AAT, LCSH subject headings)?
Would enum improve data quality or restrict flexibility?
Do different countries use different typologies?

Decision: Defer until we have 50+ institutions to analyze usage patterns.
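
Until that threshold is reached, usage can be tallied so the eventual review has data to work from. This is a hedged sketch: the record shape (`institution`, `collection_type` keys) and the function name `summarize_collection_types` are assumptions, and the "standard" set contains only the three values labelled standard above.

```python
from collections import Counter

# Values labelled "standard" in the Libyan data above.
STANDARD_TYPES = frozenset({"archaeological", "bibliographic", "archival"})


def summarize_collection_types(records, min_institutions: int = 50) -> dict:
    """Tally free-text collection_type values and report whether review can start.

    `records` is an iterable of dicts with hypothetical keys "institution" and
    "collection_type". Values are compared case-insensitively after trimming.
    """
    institutions = set()
    counts = Counter()
    for record in records:
        institutions.add(record["institution"])
        value = (record.get("collection_type") or "").strip().lower()
        if value:
            counts[value] += 1
    non_standard = {v: n for v, n in counts.items() if v not in STANDARD_TYPES}
    return {
        "institution_count": len(institutions),
        "counts": dict(counts),
        "non_standard": non_standard,
        "ready_for_review": len(institutions) >= min_institutions,
    }


sample = [
    {"institution": "Institution A", "collection_type": " Archival"},
    {"institution": "Institution A", "collection_type": "mixed"},
    {"institution": "Institution B", "collection_type": "archival"},
]
print(summarize_collection_types(sample))
```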

UNESCO Heritage Status

Status: ✅ Adequate (no extension needed)
Current Implementation: Use Identifier class with identifier_scheme: UNESCO_WHC

Found in Libyan Data:

5 UNESCO World Heritage Sites with WHC identifiers
Status changes tracked via ChangeEvent (inscription, delisting)

Conclusion: Current schema handles this well. No extension needed.

War/Conflict Heritage Markers

Status: ⏳ Monitoring
Found in Libyan Data:

Misrata War Museum (2011 Libyan Civil War)
Tobruk WWII Commonwealth War Cemetery

Current Handling: Use description field + subjects in Collection class

Question: Should we add conflict_period or war_era enum for specialized search?

Decision: Monitor usage across more conflict-affected countries (Syria, Yemen, Bosnia). Defer extension for now.

References

Base Ontologies: /data/ontology/ directory
- CIDOC_CRM_v7.1.3.rdf - Cultural heritage modeling
- schemaorg.owl - Schema.org vocabulary
LinkML Documentation: https://linkml.io/linkml/
Schema Design Patterns: /docs/plan/global_glam/05-design-patterns.md
Data Standardization: /docs/plan/global_glam/04-data-standardization.md

Maintained by: GLAM Data Extraction Project
Last Updated: 2025-11-09
Schema Version: 0.2.1 (development)
