# Schema Modularization Architecture
**Version**: 0.2.1
**Date**: 2025-11-09
**Status**: ✅ Complete
## Overview
The Heritage Custodian schema has been modularized from a single 1,102-line YAML file into 7 focused, reusable modules. This architecture enables:
- **Independent usage**: Use only the modules you need
- **Clear separation of concerns**: Core entities vs. provenance vs. collections vs. bibliographic
- **Easier maintenance**: Changes isolated to specific modules
- **Better comprehension**: Each module is self-contained and focused
- **Future extensibility**: Easy to add country-specific or domain-specific modules
- **Standards integration**: SPAR Ontologies (FaBiO, CiTO, PRO, PSO, DoCO) for scholarly publications
## Module Structure
```
schemas/
├── heritage_custodian.yaml # Main schema (73 lines, imports all modules)
├── enums.yaml # Enumerations (institution types, data tiers, etc.)
├── core.yaml # Core organizational classes
├── provenance.yaml # Provenance tracking, GHCID history, change events
├── collections.yaml # Collections, digital platforms, partnerships
├── dutch.yaml # Dutch-specific extensions
└── bibliographic.yaml # NEW: SPAR Ontologies for scholarly publications
```
## Module Details
### 1. `enums.yaml` (8,671 bytes)
**Purpose**: Define all enumeration types used across the schema.
**Contents**:
- `InstitutionTypeEnum` - 13 heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- `OrganizationStatusEnum` - Operational status (ACTIVE, INACTIVE, MERGED, etc.)
- `DataSourceEnum` - Data provenance sources (ISIL_REGISTRY, WIKIDATA, CONVERSATION_NLP, etc.)
- `DataTierEnum` - Data quality tiers (TIER_1_AUTHORITATIVE → TIER_4_INFERRED)
- `MetadataStandardEnum` - Standards used (DUBLIN_CORE, MARC21, EAD, RIC_O, etc.)
- `DigitalPlatformTypeEnum` - Platform types (COLLECTION_MANAGEMENT, DIGITAL_REPOSITORY, etc.)
- `ChangeTypeEnum` - Organizational changes (FOUNDING, CLOSURE, MERGER, RELOCATION, etc.)
**Usage**:
```yaml
imports:
  - enums
classes:
  MyClass:
    slots:
      - institution_type # Uses InstitutionTypeEnum
```
### 2. `core.yaml` (13,810 bytes)
**Purpose**: Core organizational classes and identification.
**Contents**:
- `HeritageCustodian` - Main heritage institution class
  - GHCID tracking (numeric, current, original, history)
  - Organizational metadata (name, type, status, description)
  - Temporal fields (founded_date, closed_date, prov_generated_at, prov_invalidated_at)
  - Hierarchies (parent_organization, sub_organizations)
  - TOOI naming (official_name, sorting_name, abbreviation)
- `Location` - Physical/virtual locations
  - Address fields (street, city, postal_code, region, country)
  - Geocoding (latitude, longitude, geonames_id)
  - Primary location flag
- `ContactInfo` - Contact information (email, phone, fax)
- `Identifier` - External identifiers (ISIL, VIAF, Wikidata, KvK)
- `OrganizationalUnit` - Departments and divisions within institutions
**Ontology Mappings**:
- `HeritageCustodian` → `org:Organization` + `prov:Entity`
- `Location` → `schema:Place`
- `ContactInfo` → `cpov:ContactPoint` + `schema:ContactPoint`
- `Identifier` → `dcterms:identifier`
- `OrganizationalUnit` → `org:OrganizationalUnit`
**Usage**:
```yaml
imports:
  - core
# Now you can use HeritageCustodian, Location, etc.
```
### 3. `provenance.yaml` (6,715 bytes)
**Purpose**: Data quality tracking, GHCID history, and organizational change events.
**Contents**:
- `Provenance` - Data provenance metadata
  - Source and tier (data_source, data_tier)
  - Extraction metadata (extraction_date, extraction_method, confidence_score)
  - Verification (verified_date, verified_by)
  - Source references (conversation_id, source_url)
- `GHCIDHistoryEntry` - Historical GHCID tracking
  - GHCID values (ghcid, ghcid_numeric)
  - Validity period (valid_from, valid_to)
  - Context (institution_name, location_city, location_country, reason)
- `ChangeEvent` - Organizational lifecycle events
  - Event metadata (event_id, change_type, event_date, event_description)
  - Affected entities (affected_organization, resulting_organization, related_organizations)
  - Documentation (source_documentation)
**Ontology Mappings**:
- `ChangeEvent` → `prov:Activity` + `tooi:Wijzigingsgebeurtenis`
**Usage**:
```yaml
imports:
  - provenance
classes:
  MyClass:
    slots:
      - provenance # Uses Provenance class
      - change_history # List of ChangeEvent
```
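The validity period on `GHCIDHistoryEntry` makes it possible to ask which GHCID applied on a given date. The following is a minimal sketch of that lookup, not the generated model: the dataclass is a hypothetical stand-in that borrows only the `ghcid`, `valid_from`, `valid_to`, and `reason` field names from the list above, the identifiers are placeholders, and treating `valid_to` as an exclusive bound is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class GHCIDHistoryEntry:
    """Hypothetical stand-in for the schema class (illustration only)."""
    ghcid: str
    valid_from: date
    valid_to: Optional[date]  # None means the entry is still current
    reason: str = ""


def ghcid_on(history: list, day: date) -> Optional[str]:
    """Return the GHCID valid on `day`; valid_to is assumed to be exclusive."""
    for entry in history:
        if entry.valid_from <= day and (entry.valid_to is None or day < entry.valid_to):
            return entry.ghcid
    return None


# Placeholder identifiers, not real GHCIDs.
history = [
    GHCIDHistoryEntry("GHCID-OLD", date(2000, 1, 1), date(2015, 6, 1), "initial assignment"),
    GHCIDHistoryEntry("GHCID-NEW", date(2015, 6, 1), None, "relocation"),
]

print(ghcid_on(history, date(2010, 3, 1)))  # GHCID-OLD
print(ghcid_on(history, date(2020, 1, 1)))  # GHCID-NEW
```

With an exclusive upper bound, the day on which one entry ends and the next begins resolves to exactly one GHCID.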
### 4. `collections.yaml` (4,834 bytes)
**Purpose**: Collections, digital platforms, and partnerships.
**Contents**:
- `Collection` - Heritage collections
  - Metadata (collection_id, collection_name, collection_description, collection_type)
  - Content (item_count, subjects, time_period_start, time_period_end)
  - Access (access_rights, catalog_url)
  - **NEW**: Custodian link (custodian slot → implements `rico:hasOrHadHolder`)
- `DigitalPlatform` - Digital systems and platforms
  - Platform info (platform_name, platform_type, platform_url)
  - Technical (vendor, implemented_standards)
- `Partnership` - Partnerships and network memberships
  - Partner info (partner_name, partner_id, partnership_type)
  - Temporal (start_date, end_date)
**Ontology Mappings**:
- `Collection` → `schema:Collection` + `rico:RecordSet`
- `DigitalPlatform` → `schema:SoftwareApplication`
- `Partnership` → `org:Membership`
**Key Feature**:
The `Collection.custodian` slot implements RiC-O's `rico:hasOrHadHolder` pattern, creating bidirectional relationships between collections and their custodial organizations.
**Usage**:
```yaml
imports:
  - collections
classes:
  MyClass:
    slots:
      - collections # List of Collection objects
      - digital_platforms
      - partnerships
```
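Because each `Collection` names its holder through the `custodian` slot, the reverse direction (custodian to collections) can be derived from the data. The sketch below shows one way to build that reverse index from plain dictionaries. The record values and the helper name `collections_by_custodian` are hypothetical; only the `collection_id` and `custodian` keys come from the module description.

```python
from collections import defaultdict

# Hypothetical records: each collection points at its custodian.
collections = [
    {"collection_id": "col-1", "custodian": "org-a"},
    {"collection_id": "col-2", "custodian": "org-a"},
    {"collection_id": "col-3", "custodian": "org-b"},
]


def collections_by_custodian(records):
    """Group collection ids under the custodian that holds them."""
    index = defaultdict(list)
    for record in records:
        index[record["custodian"]].append(record["collection_id"])
    return dict(index)


holdings = collections_by_custodian(collections)
print(holdings)  # {'org-a': ['col-1', 'col-2'], 'org-b': ['col-3']}
```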
### 5. `dutch.yaml` (4,931 bytes)
**Purpose**: Dutch-specific extensions for heritage institutions.
**Contents**:
- `DutchHeritageCustodian` - Extends HeritageCustodian
  - Dutch identifiers (kvk_number, gemeente_code)
  - Regional info (provincie)
  - Networks (samenwerkingsverband)
  - Aggregation platforms (in_museum_register, in_rijkscollectie, in_collectie_nederland, in_archieven_nl)
**Ontology Mappings**:
- `DutchHeritageCustodian` → `HeritageCustodian` + `tooi:Overheidsorganisatie`
**Usage**:
```yaml
imports:
  - dutch
# Now you can use DutchHeritageCustodian
```
### 6. `bibliographic.yaml` (948 lines)
**Purpose**: SPAR Ontologies integration for scholarly publications about heritage institutions.
**Contents**:
- **9 Classes**:
  - `ScholarlyWork` - FRBR Work level (abstract intellectual creation)
  - `Publication` - FRBR Expression level (concrete realization with authors, dates)
  - `Journal` - Academic journals with ISSN, impact factors, CiteScore
  - `Conference` - Conference proceedings and papers
  - `Person` - Authors, editors, reviewers with ORCID support
  - `Organization` - Publishers, affiliations with ROR IDs
  - `Citation` - Citation relationships between publications (CiTO)
  - `DocumentSection` - Document structure (abstract, methods, results, discussion)
  - `PublishingRole` - Author/editor/reviewer roles (PRO)
- **6 Enumerations**:
  - `PublicationTypeEnum` - 13 types (journal article, conference paper, book, thesis, preprint, etc.)
  - `CitationTypeEnum` - 14 CiTO types (cites, citesAsAuthority, supports, refutes, critiques, etc.)
  - `OpenAccessStatusEnum` - 6 types (fully open, green OA, hybrid, closed, embargoed, bronze)
  - `PublishingStatusEnum` - 8 PSO states (submitted, under-review, accepted, published, retracted, etc.)
  - `DocumentSectionTypeEnum` - 9 DoCO types (abstract, introduction, methods, results, discussion, etc.)
  - `PublishingRoleTypeEnum` - 9 PRO roles (author, editor, reviewer, translator, corresponding-author, etc.)
**SPAR Ontologies Integrated**:
- **FaBiO** (FRBR-aligned Bibliographic Ontology) - Core bibliographic classes
- **CiTO** (Citation Typing Ontology) - Citation relationships
- **BiRO** (Bibliographic Reference Ontology) - Citation references
- **DoCO** (Document Components Ontology) - Document sections
- **PRO** (Publishing Roles Ontology) - Author/editor/reviewer roles
- **PSO** (Publishing Status Ontology) - Publishing workflow
- **DataCite** - DOI, ORCID identifiers
**Validation Patterns**:
```yaml
doi: "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$" # DOI format
issn: "^\\d{4}-\\d{3}[\\dX]$" # ISSN format
isbn: "^(97[89])?[0-9]{9}[0-9X]$" # ISBN-10/13
orcid: "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$" # ORCID format
ror_id: "^https://ror\\.org/0[a-z0-9]{6}[0-9]{2}$" # ROR ID
```
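To see what these patterns accept, they can be exercised with Python's `re` module. This is a rough sketch rather than part of the schema tooling: the expressions are copied from the YAML above with the YAML double-backslash escaping removed, and the `PATTERNS` dictionary, the `is_valid` helper, and the ROR value are hypothetical.

```python
import re

# Same expressions as the YAML patterns, written as Python raw strings.
PATTERNS = {
    "doi": re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$"),
    "issn": re.compile(r"^\d{4}-\d{3}[\dX]$"),
    "isbn": re.compile(r"^(97[89])?[0-9]{9}[0-9X]$"),
    "orcid": re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$"),
    "ror_id": re.compile(r"^https://ror\.org/0[a-z0-9]{6}[0-9]{2}$"),
}


def is_valid(kind: str, value: str) -> bool:
    """Check only the shape of an identifier; check digits are not verified."""
    return PATTERNS[kind].match(value) is not None


print(is_valid("doi", "10.1007/s00799-023-00123-4"))  # True
print(is_valid("issn", "1432-5012"))                   # True
print(is_valid("isbn", "9789012345678"))               # True
print(is_valid("isbn", "978-90-123-4567-8"))           # False: hyphens are not allowed
print(is_valid("ror_id", "https://ror.org/0abc12345")) # True (made-up ROR value)
```

Two consequences follow from the patterns as written: the DOI pattern expects the bare `10.…` form rather than a `https://doi.org/` URL, and the ISBN pattern expects digits without hyphens.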
**Ontology Mappings**:
- `ScholarlyWork` → `fabio:Work` + `frbr:Work`
- `Publication` → `fabio:Expression` + `frbr:Expression`
- `Journal` → `fabio:Journal` + `bibo:Journal`
- `Conference` → `fabio:AcademicProceedings` + `bibo:Conference`
- `Person` → `foaf:Person`
- `Organization` → `schema:Organization`
- `Citation` → `cito:Citation` + `biro:BibliographicReference`
- `DocumentSection` → `doco:Section`
- `PublishingRole` → `pro:PublishingRole`
**Integration with HeritageCustodian**:
```yaml
# core.yaml now includes:
HeritageCustodian:
  slots:
    - publications # Links to Publication class
    # Tracks scholarly works about/by this institution
```
**Usage**:
```yaml
# Schema file: import the module
imports:
  - bibliographic
---
# Example instance data (separate document): Publication with citations
- publication_id: https://doi.org/10.1007/s00799-023-00123-4
  title: "Linked Open Data for Museum Collections: The Rijksmuseum Case"
  publication_type: JOURNAL_ARTICLE
  authors:
    - person_name: "John Smith"
      orcid: "0000-0002-1234-5678"
      affiliation:
        organization_name: "University of Amsterdam"
  published_in:
    journal_title: "International Journal on Digital Libraries"
    issn: "1432-5012"
    impact_factor: 2.5
  doi: "10.1007/s00799-023-00123-4"
  open_access_status: FULLY_OPEN_ACCESS
  publishing_status: PUBLISHED
  citations:
    - citing_work: https://doi.org/10.1007/s00799-023-00123-4
      cited_work: https://doi.org/10.1145/2970000.2970111
      citation_type: CITES_AS_AUTHORITY
      citation_context: "Following the approach of Smith et al. [42]..."
```
**FRBR Work/Expression Pattern**:
```yaml
# Abstract work
- work_id: https://w3id.org/heritage/work/glam-linked-data
  work_title: "Linked Open Data for Heritage Institutions"
  discipline:
    - Computer Science
    - Library and Information Science
  has_expression:
    # English journal article
    - publication_id: https://doi.org/10.1007/s00799-023-00123-4
      title: "Linked Open Data for Museum Collections"
      language: en
      publication_year: 2023
    # Dutch book chapter
    - publication_id: https://isbn.org/9789012345678
      title: "Gekoppelde Open Data voor Musea"
      language: nl
      publication_year: 2024
```
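One practical use of the Work/Expression split is picking the realization of a work that suits a reader, for example by language. The sketch below assumes the work has already been loaded into a plain dictionary shaped like the YAML above; the helper name `expression_in` is hypothetical.

```python
# A work with two expressions, mirroring the YAML example as a plain dict.
work = {
    "work_id": "https://w3id.org/heritage/work/glam-linked-data",
    "has_expression": [
        {"publication_id": "https://doi.org/10.1007/s00799-023-00123-4",
         "language": "en", "publication_year": 2023},
        {"publication_id": "https://isbn.org/9789012345678",
         "language": "nl", "publication_year": 2024},
    ],
}


def expression_in(work, language):
    """Return the first expression of `work` in `language`, or None."""
    return next(
        (expr for expr in work["has_expression"] if expr["language"] == language),
        None,
    )


dutch = expression_in(work, "nl")
print(dutch["publication_id"])  # https://isbn.org/9789012345678
```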
**Citation Graph Example (CiTO)**:
```yaml
# Paper A cites Paper B as authoritative source
- citation_id: https://w3id.org/heritage/citation/001
  citing_work: https://doi.org/10.1007/paper-a
  cited_work: https://doi.org/10.1145/paper-b
  citation_type: CITES_AS_AUTHORITY
  citation_intent: "Foundational methodology for LOD extraction"
  citation_context: "Following the approach of Jones et al. [15]..."
  page_number: "p. 7"

# Paper C refutes Paper D
- citation_id: https://w3id.org/heritage/citation/002
  citing_work: https://doi.org/10.1016/paper-c
  cited_work: https://doi.org/10.1108/paper-d
  citation_type: REFUTES
  citation_intent: "Challenge assumptions about metadata quality"
  citation_context: "Contrary to Smith's findings [23], we observe..."
```
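Typed citation records like these form a directed graph whose edges carry a `citation_type`. As an illustration, assuming the records have been loaded into plain dictionaries, the following sketch builds an adjacency map and filters by type. The helper names `build_graph` and `cited_with_type` are hypothetical.

```python
from collections import defaultdict

# The two citations from the YAML example, reduced to the fields used here.
citations = [
    {"citing_work": "https://doi.org/10.1007/paper-a",
     "cited_work": "https://doi.org/10.1145/paper-b",
     "citation_type": "CITES_AS_AUTHORITY"},
    {"citing_work": "https://doi.org/10.1016/paper-c",
     "cited_work": "https://doi.org/10.1108/paper-d",
     "citation_type": "REFUTES"},
]


def build_graph(records):
    """Map each citing work to its (cited_work, citation_type) edges."""
    graph = defaultdict(list)
    for rec in records:
        graph[rec["citing_work"]].append((rec["cited_work"], rec["citation_type"]))
    return dict(graph)


def cited_with_type(records, citation_type):
    """List the works that are cited with the given citation type."""
    return [rec["cited_work"] for rec in records if rec["citation_type"] == citation_type]


graph = build_graph(citations)
print(cited_with_type(citations, "REFUTES"))  # ['https://doi.org/10.1108/paper-d']
```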
**Document Structure (DoCO)**:
```yaml
- publication_id: https://doi.org/10.1007/example
  title: "Semantic Technologies for Heritage Institutions"
  document_sections:
    - section_id: https://doi.org/10.1007/example#abstract
      section_type: ABSTRACT
      section_content: "This paper presents..."
      section_order: 1
    - section_id: https://doi.org/10.1007/example#introduction
      section_type: INTRODUCTION
      section_content: "Heritage institutions face challenges..."
      section_order: 2
    - section_id: https://doi.org/10.1007/example#methods
      section_type: METHODS
      section_content: "We developed a LinkML schema..."
      section_order: 3
    - section_id: https://doi.org/10.1007/example#results
      section_type: RESULTS
      section_content: "Our approach successfully extracted..."
      section_order: 4
```
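Since `section_order` carries the reading order, consumers can rebuild a document outline regardless of how the sections are stored. A small sketch, assuming sections arrive as dictionaries in arbitrary order; the helper name `ordered_section_types` and the duplicate check are illustrative choices, not schema rules.

```python
def ordered_section_types(sections):
    """Return section types sorted by section_order; reject repeated orders."""
    orders = [section["section_order"] for section in sections]
    if len(orders) != len(set(orders)):
        raise ValueError("duplicate section_order values")
    ranked = sorted(sections, key=lambda section: section["section_order"])
    return [section["section_type"] for section in ranked]


# Deliberately shuffled input.
sections = [
    {"section_type": "RESULTS", "section_order": 4},
    {"section_type": "ABSTRACT", "section_order": 1},
    {"section_type": "METHODS", "section_order": 3},
    {"section_type": "INTRODUCTION", "section_order": 2},
]

print(ordered_section_types(sections))
# ['ABSTRACT', 'INTRODUCTION', 'METHODS', 'RESULTS']
```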
**Publishing Workflow (PSO)**:
```yaml
# Track publication status over time
- publication_id: https://doi.org/10.1007/example
  title: "Heritage Institution Metadata Analysis"
  publishing_status: PUBLISHED # Current status
  # Status history (not in schema, but tracked externally):
  # 2024-01-15: SUBMITTED
  # 2024-02-01: UNDER_REVIEW
  # 2024-04-10: ACCEPTED
  # 2024-05-20: IN_PRESS
  # 2024-06-01: PUBLISHED
```
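Because the status history lives outside the schema, it has to be kept in whatever external store the project uses. The sketch below shows one possible shape for that data and a lookup of the status on a given day. The list-of-tuples layout and the `status_on` helper are assumptions; the dates and states are the ones from the comment block above.

```python
from datetime import date

# Externally tracked history as (date, status) pairs; this layout is an assumption.
status_history = [
    (date(2024, 1, 15), "SUBMITTED"),
    (date(2024, 2, 1), "UNDER_REVIEW"),
    (date(2024, 4, 10), "ACCEPTED"),
    (date(2024, 5, 20), "IN_PRESS"),
    (date(2024, 6, 1), "PUBLISHED"),
]


def status_on(history, day):
    """Return the latest status dated on or before `day`, or None."""
    current = None
    for changed, status in sorted(history):
        if changed <= day:
            current = status
    return current


print(status_on(status_history, date(2024, 3, 1)))  # UNDER_REVIEW
print(status_on(status_history, date(2024, 6, 1)))  # PUBLISHED
```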
## Main Schema (`heritage_custodian.yaml`)
The main schema file now serves as an aggregator. It is the seventh file in the set and imports the other six modules:
```yaml
id: https://w3id.org/heritage/custodian
name: heritage-custodian
title: Heritage Custodian Schema
description: >-
  Comprehensive LinkML schema for modeling heritage institutions...
version: 0.2.1

# Prefixes (all ontologies used across modules)
prefixes:
  heritage: https://w3id.org/heritage/custodian/
  org: http://www.w3.org/ns/org#
  prov: http://www.w3.org/ns/prov#
  rico: https://www.ica.org/standards/RiC/ontology#
  fabio: http://purl.org/spar/fabio/
  cito: http://purl.org/spar/cito/
  datacite: http://purl.org/spar/datacite/
  # ... (30+ prefixes)

# Import all modules
imports:
  - linkml:types
  - enums
  - core
  - provenance
  - collections
  - dutch
  - bibliographic # NEW: SPAR Ontologies for scholarly publications
```
## Using Modules
### Full Schema (All Modules)
```python
from linkml_runtime import SchemaView
sv = SchemaView('schemas/heritage_custodian.yaml')
# Access all 21 classes, 190 slots, 13 enums
# Includes bibliographic module for scholarly publications
```
### Individual Modules
```python
from linkml_runtime import SchemaView

# Just enumerations
sv = SchemaView('schemas/enums.yaml')
# Just core classes
sv = SchemaView('schemas/core.yaml')
# Just provenance tracking
sv = SchemaView('schemas/provenance.yaml')
# Just bibliographic (SPAR Ontologies)
sv = SchemaView('schemas/bibliographic.yaml')
```
### Selective Imports
Create a custom schema importing only what you need:
```yaml
id: https://example.org/my-schema
imports:
  - linkml:types
  - enums
  - core
  # Omit provenance, collections, dutch, bibliographic if not needed
classes:
  MyCustomClass:
    is_a: HeritageCustodian
    # Use core classes without Dutch-specific fields
```
### Domain-Specific Module Example (Bibliographic)
For scholarly publication tracking:
```yaml
id: https://example.org/publication-tracker
imports:
  - linkml:types
  - bibliographic # Just SPAR Ontologies classes
# Track publications about heritage institutions
# without importing full organizational metadata
classes:
  SimplifiedPublication:
    is_a: Publication
    slots:
      - title
      - doi
      - authors
      - citations
```
## Benefits
### 1. **Independent Usage**
Use only the modules relevant to your use case. For example:
- Global dataset without Dutch institutions → omit `dutch.yaml`
- Simple institution list without provenance → omit `provenance.yaml`
- Just enumerations for validation → use `enums.yaml` alone
### 2. **Clearer Organization**
Each module has a single, focused purpose:
- **Enums**: Type definitions
- **Core**: Organizational structure
- **Provenance**: Data quality and history
- **Collections**: Holdings and platforms
- **Dutch**: Country-specific extensions
- **Bibliographic**: Scholarly publications and citations
### 3. **Easier Maintenance**
Changes are isolated to relevant modules:
- Add new institution type → edit `enums.yaml`
- Add new core field → edit `core.yaml`
- Add new country module → create `brazil.yaml` (doesn't affect existing modules)
### 4. **Better IDE Support**
Smaller files load faster and are easier to navigate in IDEs.
### 5. **Future Extensibility**
Easy to add country-specific and domain-specific modules following established patterns:
**Country-Specific Modules** (following `dutch.yaml` pattern):
```
schemas/
├── heritage_custodian.yaml
├── enums.yaml
├── core.yaml
├── provenance.yaml
├── collections.yaml
├── dutch.yaml
├── brazil.yaml # New: Brazilian-specific extensions (CNPJ, Ibram registry)
├── vietnam.yaml # New: Vietnamese-specific extensions
└── japan.yaml # New: Japanese-specific extensions (ISIL-JP, NACSIS)
```
**Domain-Specific Modules** (following `bibliographic.yaml` pattern):
```
schemas/
├── heritage_custodian.yaml
├── bibliographic.yaml # SPAR Ontologies for scholarly publications
├── museum_objects.yaml # Future: CIDOC-CRM for museum object cataloging
├── archival_desc.yaml # Future: RiC-O for archival description
└── conservation.yaml # Future: Conservation event tracking
```
## Validation
### LinkML Validation
All modules validate successfully with LinkML runtime:
```bash
$ python3 -c "from linkml_runtime import SchemaView; sv = SchemaView('schemas/heritage_custodian.yaml'); print(f'✓ {len(list(sv.all_classes()))} classes, {len(list(sv.all_slots()))} slots, {len(list(sv.all_enums()))} enums')"
✓ 21 classes, 190 slots, 13 enums
```
### Backward Compatibility
All existing parsers and tests pass without modification:
```bash
$ pytest tests/parsers/ -q
53 passed in 0.40s ✓
```
### Python Models
Generated Pydantic models work correctly:
```python
from glam_extractor.models import DutchHeritageCustodian, Provenance, Publication
# All models functional, no breaking changes
```
## Migration Guide
### For Schema Developers
**Before (monolithic)**:
- Edit 1,102-line `heritage_custodian.yaml`
- Hard to find relevant sections
- Risk of breaking unrelated code
**After (modular)**:
- Edit focused module (e.g., `collections.yaml` for collection fields)
- Clear module scope
- Changes isolated
### For Schema Users
**No changes required!** The main schema still works exactly as before:
```python
from linkml_runtime import SchemaView
sv = SchemaView('schemas/heritage_custodian.yaml')
# All classes, slots, enums available as before
```
### For Extension Developers
**New pattern**: Create country/domain-specific modules:
```yaml
# schemas/my_extension.yaml
id: https://example.org/my-extension
imports:
  - core # Import base classes
classes:
  MyCustomCustodian:
    is_a: HeritageCustodian
    slots:
      - my_custom_field
```
## Metrics
| Metric | Before (v0.1) | After (v0.2.1) | Change |
|--------|---------------|----------------|--------|
| Main schema size | 1,102 lines | 73 lines | **-93%** |
| Total schema size | 1,102 lines | 2,436 lines* | +121% (comprehensive) |
| Modules | 1 | 7 | **Focused separation** |
| Classes | 12 | 21 | **+9 bibliographic classes** |
| Slots | 104 | 190 | **+86 bibliographic slots** |
| Enums | 7 | 13 | **+6 bibliographic enums** |
| Test coverage | 53 tests | 53 tests | ✓ 100% pass rate |
| Ontologies integrated | 5 | **12** | **+7 SPAR ontologies** |
*Includes extensive documentation, examples, and SPAR Ontologies integration
### Module Breakdown
| Module | Lines | Classes | Key Features |
|--------|-------|---------|--------------|
| `enums.yaml` | 257 | 0 | 13 enumerations (institution types, data tiers, citation types, etc.) |
| `core.yaml` | 610 | 5 | HeritageCustodian, Location, ContactInfo, Identifier, OrganizationalUnit |
| `provenance.yaml` | 237 | 3 | Provenance, GHCIDHistoryEntry, ChangeEvent |
| `collections.yaml` | 194 | 3 | Collection, DigitalPlatform, Partnership |
| `dutch.yaml` | 117 | 1 | DutchHeritageCustodian (extends HeritageCustodian) |
| `bibliographic.yaml` | 948 | 9 | ScholarlyWork, Publication, Journal, Conference, Person, Organization, Citation, DocumentSection, PublishingRole |
| `heritage_custodian.yaml` | 73 | 0 | Main aggregator (imports all modules) |
| **TOTAL** | **2,436** | **21** | Comprehensive heritage institution modeling |
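
The totals and percentage changes in the two tables can be re-derived from the per-module figures. The short check below recomputes them in Python. The numbers are copied from the tables above, while the variable names are arbitrary.

```python
# Per-module (lines, classes) as listed in the Module Breakdown table.
modules = {
    "enums.yaml": (257, 0),
    "core.yaml": (610, 5),
    "provenance.yaml": (237, 3),
    "collections.yaml": (194, 3),
    "dutch.yaml": (117, 1),
    "bibliographic.yaml": (948, 9),
    "heritage_custodian.yaml": (73, 0),
}
LINES_BEFORE = 1102  # size of the monolithic v0.1 schema

total_lines = sum(lines for lines, _ in modules.values())
total_classes = sum(classes for _, classes in modules.values())
main_lines = modules["heritage_custodian.yaml"][0]

# Percentage change relative to the monolithic file.
main_change = (main_lines - LINES_BEFORE) / LINES_BEFORE * 100
total_change = (total_lines - LINES_BEFORE) / LINES_BEFORE * 100

print(total_lines, total_classes)              # 2436 21
print(round(main_change), round(total_change)) # -93 121
```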
## References
- **Schema Files**: `/schemas/`
- **Documentation**: `/docs/`
- **Tests**: `/tests/parsers/`
- **Models**: `/src/glam_extractor/models.py`
---
**Last Updated**: 2025-11-09
**Schema Version**: 0.2.1