Schema Modularization Architecture

Version: 0.2.1
Date: 2025-11-09
Status: Complete

Overview

The Heritage Custodian schema has been modularized from a single 1,102-line YAML file into 7 focused, reusable modules. This architecture enables:

  • Independent usage: Use only the modules you need
  • Clear separation of concerns: Core entities vs. provenance vs. collections vs. bibliographic
  • Easier maintenance: Changes isolated to specific modules
  • Better comprehension: Each module is self-contained and focused
  • Future extensibility: Easy to add country-specific or domain-specific modules
  • Standards integration: SPAR Ontologies (FaBiO, CiTO, PRO, PSO, DoCO) for scholarly publications

Module Structure

schemas/
├── heritage_custodian.yaml    # Main schema (73 lines, imports all modules)
├── enums.yaml                 # Enumerations (institution types, data tiers, etc.)
├── core.yaml                  # Core organizational classes
├── provenance.yaml            # Provenance tracking, GHCID history, change events
├── collections.yaml           # Collections, digital platforms, partnerships
├── dutch.yaml                 # Dutch-specific extensions
└── bibliographic.yaml         # NEW: SPAR Ontologies for scholarly publications

Module Details

1. enums.yaml (8,671 bytes)

Purpose: Define all enumeration types used across the schema.

Contents:

  • InstitutionTypeEnum - 13 heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
  • OrganizationStatusEnum - Operational status (ACTIVE, INACTIVE, MERGED, etc.)
  • DataSourceEnum - Data provenance sources (ISIL_REGISTRY, WIKIDATA, CONVERSATION_NLP, etc.)
  • DataTierEnum - Data quality tiers (TIER_1_AUTHORITATIVE → TIER_4_INFERRED)
  • MetadataStandardEnum - Standards used (DUBLIN_CORE, MARC21, EAD, RIC_O, etc.)
  • DigitalPlatformTypeEnum - Platform types (COLLECTION_MANAGEMENT, DIGITAL_REPOSITORY, etc.)
  • ChangeTypeEnum - Organizational changes (FOUNDING, CLOSURE, MERGER, RELOCATION, etc.)

Usage:

imports:
  - enums

classes:
  MyClass:
    slots:
      - institution_type  # Uses InstitutionTypeEnum

2. core.yaml (13,810 bytes)

Purpose: Core organizational classes and identification.

Contents:

  • HeritageCustodian - Main heritage institution class
    • GHCID tracking (numeric, current, original, history)
    • Organizational metadata (name, type, status, description)
    • Temporal fields (founded_date, closed_date, prov_generated_at, prov_invalidated_at)
    • Hierarchies (parent_organization, sub_organizations)
    • TOOI naming (official_name, sorting_name, abbreviation)
  • Location - Physical/virtual locations
    • Address fields (street, city, postal_code, region, country)
    • Geocoding (latitude, longitude, geonames_id)
    • Primary location flag
  • ContactInfo - Contact information (email, phone, fax)
  • Identifier - External identifiers (ISIL, VIAF, Wikidata, KvK)
  • OrganizationalUnit - Departments and divisions within institutions

Ontology Mappings:

  • HeritageCustodian → org:Organization + prov:Entity
  • Location → schema:Place
  • ContactInfo → cpov:ContactPoint + schema:ContactPoint
  • Identifier → dcterms:identifier
  • OrganizationalUnit → org:OrganizationalUnit

Usage:

imports:
  - core

# Now you can use HeritageCustodian, Location, etc.

3. provenance.yaml (6,715 bytes)

Purpose: Data quality tracking, GHCID history, and organizational change events.

Contents:

  • Provenance - Data provenance metadata
    • Source and tier (data_source, data_tier)
    • Extraction metadata (extraction_date, extraction_method, confidence_score)
    • Verification (verified_date, verified_by)
    • Source references (conversation_id, source_url)
  • GHCIDHistoryEntry - Historical GHCID tracking
    • GHCID values (ghcid, ghcid_numeric)
    • Validity period (valid_from, valid_to)
    • Context (institution_name, location_city, location_country, reason)
  • ChangeEvent - Organizational lifecycle events
    • Event metadata (event_id, change_type, event_date, event_description)
    • Affected entities (affected_organization, resulting_organization, related_organizations)
    • Documentation (source_documentation)
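
The validity period on GHCIDHistoryEntry is what makes point-in-time lookups possible. The sketch below is only an illustration, not the project's implementation: it assumes valid_from and valid_to are dates, that an empty valid_to marks the current entry, and that each interval is half-open. The helper ghcid_at and the sample GHCID values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class GHCIDHistoryEntry:
    """Subset of the GHCIDHistoryEntry fields listed above."""
    ghcid: str
    valid_from: date
    valid_to: Optional[date] = None  # assumption: None marks the current entry
    reason: str = ""


def ghcid_at(history: list[GHCIDHistoryEntry], when: date) -> Optional[str]:
    """Return the GHCID valid on `when`, treating [valid_from, valid_to) as half-open."""
    for entry in history:
        if entry.valid_from <= when and (entry.valid_to is None or when < entry.valid_to):
            return entry.ghcid
    return None


# Hypothetical identifiers, for illustration only.
history = [
    GHCIDHistoryEntry("EXAMPLE-OLD", date(2000, 1, 1), date(2015, 6, 1), "Initial assignment"),
    GHCIDHistoryEntry("EXAMPLE-NEW", date(2015, 6, 1), None, "Relocation"),
]

print(ghcid_at(history, date(2010, 3, 1)))  # EXAMPLE-OLD
print(ghcid_at(history, date(2020, 1, 1)))  # EXAMPLE-NEW
```

With half-open intervals, the day a GHCID changes belongs to the new entry, so two consecutive entries never both match the same date.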

Ontology Mappings:

  • ChangeEvent → prov:Activity + tooi:Wijzigingsgebeurtenis

Usage:

imports:
  - provenance

classes:
  MyClass:
    slots:
      - provenance  # Uses Provenance class
      - change_history  # List of ChangeEvent

4. collections.yaml (4,834 bytes)

Purpose: Collections, digital platforms, and partnerships.

Contents:

  • Collection - Heritage collections
    • Metadata (collection_id, collection_name, collection_description, collection_type)
    • Content (item_count, subjects, time_period_start, time_period_end)
    • Access (access_rights, catalog_url)
    • NEW: Custodian link (custodian slot → implements rico:hasOrHadHolder)
  • DigitalPlatform - Digital systems and platforms
    • Platform info (platform_name, platform_type, platform_url)
    • Technical (vendor, implemented_standards)
  • Partnership - Partnerships and network memberships
    • Partner info (partner_name, partner_id, partnership_type)
    • Temporal (start_date, end_date)

Ontology Mappings:

  • Collection → schema:Collection + rico:RecordSet
  • DigitalPlatform → schema:SoftwareApplication
  • Partnership → org:Membership

Key Feature: The Collection.custodian slot implements RiC-O's rico:hasOrHadHolder pattern, creating bidirectional relationships between collections and their custodial organizations.
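
The custodian slot records the holder on the collection side; the reverse direction (custodian to collections) can be derived from it. The following is a minimal sketch under stated assumptions: collections are plain dictionaries, the custodian value is an identifier string, and the function name and sample identifiers are hypothetical.

```python
from collections import defaultdict


def collections_by_custodian(collections: list[dict]) -> dict[str, list[str]]:
    """Invert Collection.custodian into a custodian -> [collection_id] index."""
    index: dict[str, list[str]] = defaultdict(list)
    for coll in collections:
        custodian = coll.get("custodian")
        if custodian is not None:  # collections without a recorded holder are skipped
            index[custodian].append(coll["collection_id"])
    return dict(index)


# Hypothetical records using slot names from collections.yaml.
collections = [
    {"collection_id": "coll-1", "collection_name": "Prints", "custodian": "custodian-a"},
    {"collection_id": "coll-2", "collection_name": "Maps", "custodian": "custodian-a"},
    {"collection_id": "coll-3", "collection_name": "Letters", "custodian": "custodian-b"},
]

print(collections_by_custodian(collections))
# {'custodian-a': ['coll-1', 'coll-2'], 'custodian-b': ['coll-3']}
```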

Usage:

imports:
  - collections

classes:
  MyClass:
    slots:
      - collections  # List of Collection objects
      - digital_platforms
      - partnerships

5. dutch.yaml (4,931 bytes)

Purpose: Dutch-specific extensions for heritage institutions.

Contents:

  • DutchHeritageCustodian - Extends HeritageCustodian
    • Dutch identifiers (kvk_number, gemeente_code)
    • Regional info (provincie)
    • Networks (samenwerkingsverband)
    • Aggregation platforms (in_museum_register, in_rijkscollectie, in_collectie_nederland, in_archieven_nl)

Ontology Mappings:

  • DutchHeritageCustodian → HeritageCustodian + tooi:Overheidsorganisatie

Usage:

imports:
  - dutch

# Now you can use DutchHeritageCustodian

6. bibliographic.yaml (948 lines)

Purpose: SPAR Ontologies integration for scholarly publications about heritage institutions.

Contents:

  • 9 Classes:

    • ScholarlyWork - FRBR Work level (abstract intellectual creation)
    • Publication - FRBR Expression level (concrete realization with authors, dates)
    • Journal - Academic journals with ISSN, impact factors, CiteScore
    • Conference - Conference proceedings and papers
    • Person - Authors, editors, reviewers with ORCID support
    • Organization - Publishers, affiliations with ROR IDs
    • Citation - Citation relationships between publications (CiTO)
    • DocumentSection - Document structure (abstract, methods, results, discussion)
    • PublishingRole - Author/editor/reviewer roles (PRO)
  • 6 Enumerations:

    • PublicationTypeEnum - 13 types (journal article, conference paper, book, thesis, preprint, etc.)
    • CitationTypeEnum - 14 CiTO types (cites, citesAsAuthority, supports, refutes, critiques, etc.)
    • OpenAccessStatusEnum - 6 types (fully open, green OA, hybrid, closed, embargoed, bronze)
    • PublishingStatusEnum - 8 PSO states (submitted, under-review, accepted, published, retracted, etc.)
    • DocumentSectionTypeEnum - 9 DoCO types (abstract, introduction, methods, results, discussion, etc.)
    • PublishingRoleTypeEnum - 9 PRO roles (author, editor, reviewer, translator, corresponding-author, etc.)

SPAR Ontologies Integrated:

  • FaBiO (FRBR-aligned Bibliographic Ontology) - Core bibliographic classes
  • CiTO (Citation Typing Ontology) - Citation relationships
  • BiRO (Bibliographic Reference Ontology) - Citation references
  • DoCO (Document Components Ontology) - Document sections
  • PRO (Publishing Roles Ontology) - Author/editor/reviewer roles
  • PSO (Publishing Status Ontology) - Publishing workflow
  • DataCite - DOI, ORCID identifiers

Validation Patterns:

doi: "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$"     # DOI format
issn: "^\\d{4}-\\d{3}[\\dX]$"                   # ISSN format
isbn: "^(97[89])?[0-9]{9}[0-9X]$"               # ISBN-10/13
orcid: "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$"    # ORCID format
ror_id: "^https://ror\\.org/0[a-z0-9]{6}[0-9]{2}$"  # ROR ID
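
To see what these patterns accept and reject, they can be exercised with Python's standard re module. This is a sketch rather than the schema's own validation path: the patterns are copied from the list above with YAML's doubled backslashes undone, and the is_valid helper is a hypothetical name.

```python
import re

# Patterns copied from the list above, with YAML's doubled backslashes undone.
PATTERNS = {
    "doi": r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$",
    "issn": r"^\d{4}-\d{3}[\dX]$",
    "isbn": r"^(97[89])?[0-9]{9}[0-9X]$",
    "orcid": r"^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$",
    "ror_id": r"^https://ror\.org/0[a-z0-9]{6}[0-9]{2}$",
}


def is_valid(kind: str, value: str) -> bool:
    """Return True if `value` matches the pattern registered for `kind`."""
    return re.match(PATTERNS[kind], value) is not None


print(is_valid("doi", "10.1007/s00799-023-00123-4"))  # True
print(is_valid("issn", "1432-5012"))                  # True
print(is_valid("orcid", "0000-0002-1234-5678"))       # True
print(is_valid("isbn", "978-90-12345-67-8"))          # False: hyphens are not accepted
print(is_valid("doi", "https://doi.org/10.1007/x"))   # False: resolver prefix is not part of the DOI
```

These patterns check only the shape of an identifier. They do not verify the check digits that ISSN, ISBN, and ORCID carry, so a value can match and still be invalid.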

Ontology Mappings:

  • ScholarlyWork → fabio:Work + frbr:Work
  • Publication → fabio:Expression + frbr:Expression
  • Journal → fabio:Journal + bibo:Journal
  • Conference → fabio:AcademicProceedings + bibo:Conference
  • Person → foaf:Person
  • Organization → schema:Organization
  • Citation → cito:Citation + biro:BibliographicReference
  • DocumentSection → doco:Section
  • PublishingRole → pro:PublishingRole

Integration with HeritageCustodian:

# core.yaml now includes:
HeritageCustodian:
  slots:
    - publications  # Links to Publication class
      # Tracks scholarly works about/by this institution

Usage:

imports:
  - bibliographic

# Example: Publication with citations
- publication_id: https://doi.org/10.1007/s00799-023-00123-4
  title: "Linked Open Data for Museum Collections: The Rijksmuseum Case"
  publication_type: JOURNAL_ARTICLE
  authors:
    - person_name: "John Smith"
      orcid: "0000-0002-1234-5678"
      affiliation:
        organization_name: "University of Amsterdam"
  published_in:
    journal_title: "International Journal on Digital Libraries"
    issn: "1432-5012"
    impact_factor: 2.5
  doi: "10.1007/s00799-023-00123-4"
  open_access_status: FULLY_OPEN_ACCESS
  publishing_status: PUBLISHED
  citations:
    - citing_work: https://doi.org/10.1007/s00799-023-00123-4
      cited_work: https://doi.org/10.1145/2970000.2970111
      citation_type: CITES_AS_AUTHORITY
      citation_context: "Following the approach of Smith et al. [42]..."

FRBR Work/Expression Pattern:

# Abstract work
- work_id: https://w3id.org/heritage/work/glam-linked-data
  work_title: "Linked Open Data for Heritage Institutions"
  discipline:
    - Computer Science
    - Library and Information Science
  has_expression:
    # English journal article
    - publication_id: https://doi.org/10.1007/s00799-023-00123-4
      title: "Linked Open Data for Museum Collections"
      language: en
      publication_year: 2023
    
    # Dutch book chapter
    - publication_id: https://isbn.org/9789012345678
      title: "Gekoppelde Open Data voor Musea"
      language: nl
      publication_year: 2024
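
The same Work/Expression split can be mirrored in Python to show how one abstract work groups its realizations. The dataclasses below are hand-written stand-ins for illustration, not the generated Pydantic models, and the expressions_in method is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """FRBR Expression: one concrete realization of a work."""
    publication_id: str
    title: str
    language: str
    publication_year: int


@dataclass
class ScholarlyWork:
    """FRBR Work: the abstract creation shared by all its expressions."""
    work_id: str
    work_title: str
    has_expression: list[Publication] = field(default_factory=list)

    def expressions_in(self, language: str) -> list[Publication]:
        """Return the expressions published in the given language code."""
        return [p for p in self.has_expression if p.language == language]


work = ScholarlyWork(
    work_id="https://w3id.org/heritage/work/glam-linked-data",
    work_title="Linked Open Data for Heritage Institutions",
    has_expression=[
        Publication("https://doi.org/10.1007/s00799-023-00123-4",
                    "Linked Open Data for Museum Collections", "en", 2023),
        Publication("https://isbn.org/9789012345678",
                    "Gekoppelde Open Data voor Musea", "nl", 2024),
    ],
)

print([p.title for p in work.expressions_in("nl")])
# ['Gekoppelde Open Data voor Musea']
```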

Citation Graph Example (CiTO):

# Paper A cites Paper B as authoritative source
- citation_id: https://w3id.org/heritage/citation/001
  citing_work: https://doi.org/10.1007/paper-a
  cited_work: https://doi.org/10.1145/paper-b
  citation_type: CITES_AS_AUTHORITY
  citation_intent: "Foundational methodology for LOD extraction"
  citation_context: "Following the approach of Jones et al. [15]..."
  page_number: "p. 7"

# Paper C refutes Paper D
- citation_id: https://w3id.org/heritage/citation/002
  citing_work: https://doi.org/10.1016/paper-c
  cited_work: https://doi.org/10.1108/paper-d
  citation_type: REFUTES
  citation_intent: "Challenge assumptions about metadata quality"
  citation_context: "Contrary to Smith's findings [23], we observe..."
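
Because every citation record names a citing work, a cited work, and a type, a typed citation graph falls out of a simple grouping step. The sketch below assumes the records are available as plain dictionaries; build_graph is a hypothetical helper, not part of the schema tooling.

```python
from collections import Counter, defaultdict

# The two citation records from the example above, reduced to three slots.
citations = [
    {"citing_work": "https://doi.org/10.1007/paper-a",
     "cited_work": "https://doi.org/10.1145/paper-b",
     "citation_type": "CITES_AS_AUTHORITY"},
    {"citing_work": "https://doi.org/10.1016/paper-c",
     "cited_work": "https://doi.org/10.1108/paper-d",
     "citation_type": "REFUTES"},
]


def build_graph(records: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Map each citing work to its outgoing (citation_type, cited_work) edges."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for rec in records:
        graph[rec["citing_work"]].append((rec["citation_type"], rec["cited_work"]))
    return dict(graph)


graph = build_graph(citations)
type_counts = Counter(rec["citation_type"] for rec in citations)

print(graph["https://doi.org/10.1007/paper-a"])
# [('CITES_AS_AUTHORITY', 'https://doi.org/10.1145/paper-b')]
print(type_counts["REFUTES"])  # 1
```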

Document Structure (DoCO):

- publication_id: https://doi.org/10.1007/example
  title: "Semantic Technologies for Heritage Institutions"
  document_sections:
    - section_id: https://doi.org/10.1007/example#abstract
      section_type: ABSTRACT
      section_content: "This paper presents..."
      section_order: 1
    
    - section_id: https://doi.org/10.1007/example#introduction
      section_type: INTRODUCTION
      section_content: "Heritage institutions face challenges..."
      section_order: 2
    
    - section_id: https://doi.org/10.1007/example#methods
      section_type: METHODS
      section_content: "We developed a LinkML schema..."
      section_order: 3
    
    - section_id: https://doi.org/10.1007/example#results
      section_type: RESULTS
      section_content: "Our approach successfully extracted..."
      section_order: 4

Publishing Workflow (PSO):

# Track publication status over time
- publication_id: https://doi.org/10.1007/example
  title: "Heritage Institution Metadata Analysis"
  publishing_status: PUBLISHED  # Current status
  
  # Status history (not in schema, but tracked externally):
  # 2024-01-15: SUBMITTED
  # 2024-02-01: UNDER_REVIEW
  # 2024-04-10: ACCEPTED
  # 2024-05-20: IN_PRESS
  # 2024-06-01: PUBLISHED
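
Since the status history lives outside the schema, one way to keep it queryable is a date-sorted list of events. This is a sketch of that idea only: the dates and statuses are the ones in the comments above, while the HISTORY structure and the status_on helper are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Status history from the comments above, kept outside the schema.
HISTORY = [
    (date(2024, 1, 15), "SUBMITTED"),
    (date(2024, 2, 1), "UNDER_REVIEW"),
    (date(2024, 4, 10), "ACCEPTED"),
    (date(2024, 5, 20), "IN_PRESS"),
    (date(2024, 6, 1), "PUBLISHED"),
]


def status_on(history: list[tuple[date, str]], when: date) -> Optional[str]:
    """Return the status in effect on `when`, or None before the first event.

    Assumes `history` is sorted by date in ascending order.
    """
    dates = [d for d, _ in history]
    i = bisect_right(dates, when)  # number of events on or before `when`
    return history[i - 1][1] if i else None


print(status_on(HISTORY, date(2024, 3, 1)))    # UNDER_REVIEW
print(status_on(HISTORY, date(2024, 6, 1)))    # PUBLISHED
print(status_on(HISTORY, date(2023, 12, 31)))  # None
```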

Main Schema (heritage_custodian.yaml)

The main schema file now serves as an aggregator, importing the six other modules:

id: https://w3id.org/heritage/custodian
name: heritage-custodian
title: Heritage Custodian Schema
description: >-
  Comprehensive LinkML schema for modeling heritage institutions...  

version: 0.2.1

# Prefixes (all ontologies used across modules)
prefixes:
  heritage: https://w3id.org/heritage/custodian/
  org: http://www.w3.org/ns/org#
  prov: http://www.w3.org/ns/prov#
  rico: https://www.ica.org/standards/RiC/ontology#
  fabio: http://purl.org/spar/fabio/
  cito: http://purl.org/spar/cito/
  datacite: http://purl.org/spar/datacite/
  # ... (30+ prefixes)

# Import all modules
imports:
  - linkml:types
  - enums
  - core
  - provenance
  - collections
  - dutch
  - bibliographic  # NEW: SPAR Ontologies for scholarly publications

Using Modules

Full Schema (All Modules)

from linkml_runtime import SchemaView

sv = SchemaView('schemas/heritage_custodian.yaml')
# Access all 21 classes, 190 slots, 13 enums
# Includes bibliographic module for scholarly publications

Individual Modules

# Just enumerations
sv = SchemaView('schemas/enums.yaml')

# Just core classes
sv = SchemaView('schemas/core.yaml')

# Just provenance tracking
sv = SchemaView('schemas/provenance.yaml')

# Just bibliographic (SPAR Ontologies)
sv = SchemaView('schemas/bibliographic.yaml')

Selective Imports

Create a custom schema importing only what you need:

id: https://example.org/my-schema
imports:
  - linkml:types
  - enums
  - core
  # Omit provenance, collections, dutch, bibliographic if not needed

classes:
  MyCustomClass:
    is_a: HeritageCustodian
    # Use core classes without Dutch-specific fields

Domain-Specific Module Example (Bibliographic)

For scholarly publication tracking:

id: https://example.org/publication-tracker
imports:
  - linkml:types
  - bibliographic  # Just SPAR Ontologies classes

# Track publications about heritage institutions
# without importing full organizational metadata
classes:
  SimplifiedPublication:
    is_a: Publication
    slots:
      - title
      - doi
      - authors
      - citations

Benefits

1. Independent Usage

Use only the modules relevant to your use case. For example:

  • Global dataset without Dutch institutions → omit dutch.yaml
  • Simple institution list without provenance → omit provenance.yaml
  • Just enumerations for validation → use enums.yaml alone

2. Clearer Organization

Each module has a single, focused purpose:

  • Enums: Type definitions
  • Core: Organizational structure
  • Provenance: Data quality and history
  • Collections: Holdings and platforms
  • Dutch: Country-specific extensions
  • Bibliographic: Scholarly publications (SPAR Ontologies)

3. Easier Maintenance

Changes are isolated to relevant modules:

  • Add new institution type → edit enums.yaml
  • Add new core field → edit core.yaml
  • Add new country module → create brazil.yaml (doesn't affect existing modules)

4. Better IDE Support

Smaller files load faster and are easier to navigate in IDEs.

5. Future Extensibility

Easy to add country-specific and domain-specific modules following established patterns:

Country-Specific Modules (following dutch.yaml pattern):

schemas/
├── heritage_custodian.yaml
├── enums.yaml
├── core.yaml
├── provenance.yaml
├── collections.yaml
├── dutch.yaml
├── brazil.yaml      # New: Brazilian-specific extensions (CNPJ, Ibram registry)
├── vietnam.yaml     # New: Vietnamese-specific extensions
└── japan.yaml       # New: Japanese-specific extensions (ISIL-JP, NACSIS)

Domain-Specific Modules (following bibliographic.yaml pattern):

schemas/
├── heritage_custodian.yaml
├── bibliographic.yaml  # SPAR Ontologies for scholarly publications
├── museum_objects.yaml # Future: CIDOC-CRM for museum object cataloging
├── archival_desc.yaml  # Future: RiC-O for archival description
└── conservation.yaml   # Future: Conservation event tracking

Validation

LinkML Validation

All modules validate successfully with LinkML runtime:

$ python3 -c "from linkml_runtime import SchemaView; sv = SchemaView('schemas/heritage_custodian.yaml'); print(f'✓ {len(list(sv.all_classes()))} classes, {len(list(sv.all_slots()))} slots, {len(list(sv.all_enums()))} enums')"
✓ 21 classes, 190 slots, 13 enums

Backward Compatibility

All existing parsers and tests pass without modification:

$ pytest tests/parsers/ -q
53 passed in 0.40s  ✓

Python Models

Generated Pydantic models work correctly:

from glam_extractor.models import DutchHeritageCustodian, Provenance, Publication
# All models functional, no breaking changes

Migration Guide

For Schema Developers

Before (monolithic):

  • Edit 1,102-line heritage_custodian.yaml
  • Hard to find relevant sections
  • Risk of breaking unrelated code

After (modular):

  • Edit focused module (e.g., collections.yaml for collection fields)
  • Clear module scope
  • Changes isolated

For Schema Users

No changes required! The main schema still works exactly as before:

from linkml_runtime import SchemaView
sv = SchemaView('schemas/heritage_custodian.yaml')
# All classes, slots, enums available as before

For Extension Developers

New pattern: Create country/domain-specific modules:

# schemas/my_extension.yaml
id: https://example.org/my-extension
imports:
  - core  # Import base classes

classes:
  MyCustomCustodian:
    is_a: HeritageCustodian
    slots:
      - my_custom_field

Metrics

| Metric | Before (v0.1) | After (v0.2.1) | Change |
|--------|---------------|----------------|--------|
| Main schema size | 1,102 lines | 73 lines | -93% |
| Total schema size | 1,102 lines | 2,436 lines* | +121% (comprehensive) |
| Modules | 1 | 7 | Focused separation |
| Classes | 12 | 21 | +9 bibliographic classes |
| Slots | 104 | 190 | +86 bibliographic slots |
| Enums | 7 | 13 | +6 bibliographic enums |
| Test coverage | 53 tests | 53 tests ✓ | 100% pass rate |
| Ontologies integrated | 5 | 12 | +7 SPAR ontologies |

*Includes extensive documentation, examples, and SPAR Ontologies integration

Module Breakdown

| Module | Lines | Classes | Key Features |
|--------|-------|---------|--------------|
| enums.yaml | 257 | 0 | 13 enumerations (institution types, data tiers, citation types, etc.) |
| core.yaml | 610 | 5 | HeritageCustodian, Location, ContactInfo, Identifier, OrganizationalUnit |
| provenance.yaml | 237 | 3 | Provenance, GHCIDHistoryEntry, ChangeEvent |
| collections.yaml | 194 | 3 | Collection, DigitalPlatform, Partnership |
| dutch.yaml | 117 | 1 | DutchHeritageCustodian (extends HeritageCustodian) |
| bibliographic.yaml | 948 | 9 | ScholarlyWork, Publication, Journal, Conference, Person, Organization, Citation, DocumentSection, PublishingRole |
| heritage_custodian.yaml | 73 | 0 | Main aggregator (imports all modules) |
| TOTAL | 2,436 | 21 | Comprehensive heritage institution modeling |
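
The totals and percentage changes reported in this section can be re-derived from the per-module figures. The short check below is a sketch for readers who want to verify the arithmetic; the numbers are copied from the tables in this section, and pct_change is a hypothetical helper.

```python
# Line and class counts per module, copied from the Module Breakdown table.
MODULES = {
    "enums.yaml": (257, 0),
    "core.yaml": (610, 5),
    "provenance.yaml": (237, 3),
    "collections.yaml": (194, 3),
    "dutch.yaml": (117, 1),
    "bibliographic.yaml": (948, 9),
    "heritage_custodian.yaml": (73, 0),
}

BEFORE_LINES = 1102  # size of the monolithic v0.1 schema


def pct_change(before: int, after: int) -> int:
    """Percentage change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


total_lines = sum(lines for lines, _ in MODULES.values())
total_classes = sum(classes for _, classes in MODULES.values())
main_lines = MODULES["heritage_custodian.yaml"][0]

print(total_lines, total_classes)            # 2436 21
print(pct_change(BEFORE_LINES, main_lines))  # -93
print(pct_change(BEFORE_LINES, total_lines)) # 121
```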

References

  • Schema Files: /schemas/
  • Documentation: /docs/
  • Tests: /tests/parsers/
  • Models: /src/glam_extractor/models.py

Last Updated: 2025-11-09
Schema Version: 0.2.1