Schema Modularization Architecture
Version: 0.2.1
Date: 2025-11-09
Status: ✅ Complete
Overview
The Heritage Custodian schema has been modularized from a single 1,102-line YAML file into 7 focused, reusable modules. This architecture enables:
- Independent usage: Use only the modules you need
- Clear separation of concerns: Core entities vs. provenance vs. collections vs. bibliographic
- Easier maintenance: Changes isolated to specific modules
- Better comprehension: Each module is self-contained and focused
- Future extensibility: Easy to add country-specific or domain-specific modules
- Standards integration: SPAR Ontologies (FaBiO, CiTO, PRO, PSO, DoCO) for scholarly publications
Module Structure
schemas/
├── heritage_custodian.yaml # Main schema (73 lines, imports all modules)
├── enums.yaml # Enumerations (institution types, data tiers, etc.)
├── core.yaml # Core organizational classes
├── provenance.yaml # Provenance tracking, GHCID history, change events
├── collections.yaml # Collections, digital platforms, partnerships
├── dutch.yaml # Dutch-specific extensions
└── bibliographic.yaml # NEW: SPAR Ontologies for scholarly publications
Module Details
1. enums.yaml (8,671 bytes)
Purpose: Define all enumeration types used across the schema.
Contents:
- InstitutionTypeEnum - 13 heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- OrganizationStatusEnum - Operational status (ACTIVE, INACTIVE, MERGED, etc.)
- DataSourceEnum - Data provenance sources (ISIL_REGISTRY, WIKIDATA, CONVERSATION_NLP, etc.)
- DataTierEnum - Data quality tiers (TIER_1_AUTHORITATIVE → TIER_4_INFERRED)
- MetadataStandardEnum - Standards used (DUBLIN_CORE, MARC21, EAD, RIC_O, etc.)
- DigitalPlatformTypeEnum - Platform types (COLLECTION_MANAGEMENT, DIGITAL_REPOSITORY, etc.)
- ChangeTypeEnum - Organizational changes (FOUNDING, CLOSURE, MERGER, RELOCATION, etc.)
Usage:
imports:
  - enums
classes:
  MyClass:
    slots:
      - institution_type  # Uses InstitutionTypeEnum
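As a rough illustration of how a consumer might check a value against one of these enumerations, the sketch below mirrors four of the InstitutionTypeEnum values named above in a standard-library Python Enum. The class is a hand-written stand-in rather than the generated model, it covers only 4 of the 13 values, and the helper name `is_valid_institution_type` is hypothetical.

```python
from enum import Enum


class InstitutionType(Enum):
    """Hand-written stand-in for InstitutionTypeEnum (only 4 of its 13 values)."""

    GALLERY = "GALLERY"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    MUSEUM = "MUSEUM"


def is_valid_institution_type(value: str) -> bool:
    """Return True if value names one of the members mirrored above."""
    return value in InstitutionType.__members__


print(is_valid_institution_type("MUSEUM"))
print(is_valid_institution_type("CASTLE"))
```

In practice the generated models or a LinkML validator would perform this check; the sketch only shows the shape of the constraint.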
2. core.yaml (13,810 bytes)
Purpose: Core organizational classes and identification.
Contents:
- HeritageCustodian - Main heritage institution class
  - GHCID tracking (numeric, current, original, history)
  - Organizational metadata (name, type, status, description)
  - Temporal fields (founded_date, closed_date, prov_generated_at, prov_invalidated_at)
  - Hierarchies (parent_organization, sub_organizations)
  - TOOI naming (official_name, sorting_name, abbreviation)
- Location - Physical/virtual locations
  - Address fields (street, city, postal_code, region, country)
  - Geocoding (latitude, longitude, geonames_id)
  - Primary location flag
- ContactInfo - Contact information (email, phone, fax)
- Identifier - External identifiers (ISIL, VIAF, Wikidata, KvK)
- OrganizationalUnit - Departments and divisions within institutions
Ontology Mappings:
- HeritageCustodian → org:Organization + prov:Entity
- Location → schema:Place
- ContactInfo → cpov:ContactPoint + schema:ContactPoint
- Identifier → dcterms:identifier
- OrganizationalUnit → org:OrganizationalUnit
Usage:
imports:
  - core
# Now you can use HeritageCustodian, Location, etc.
3. provenance.yaml (6,715 bytes)
Purpose: Data quality tracking, GHCID history, and organizational change events.
Contents:
- Provenance - Data provenance metadata
  - Source and tier (data_source, data_tier)
  - Extraction metadata (extraction_date, extraction_method, confidence_score)
  - Verification (verified_date, verified_by)
  - Source references (conversation_id, source_url)
- GHCIDHistoryEntry - Historical GHCID tracking
  - GHCID values (ghcid, ghcid_numeric)
  - Validity period (valid_from, valid_to)
  - Context (institution_name, location_city, location_country, reason)
- ChangeEvent - Organizational lifecycle events
  - Event metadata (event_id, change_type, event_date, event_description)
  - Affected entities (affected_organization, resulting_organization, related_organizations)
  - Documentation (source_documentation)
Ontology Mappings:
- ChangeEvent → prov:Activity + tooiont:Wijzigingsgebeurtenis
Usage:
imports:
  - provenance
classes:
  MyClass:
    slots:
      - provenance      # Uses Provenance class
      - change_history  # List of ChangeEvent
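The validity period on GHCIDHistoryEntry (valid_from, valid_to) suggests a simple point-in-time lookup. The sketch below is one possible reading, not the project's implementation: it assumes a missing valid_to means "still current" and treats each period as half-open (valid_from inclusive, valid_to exclusive). The GHCID strings and the helper `ghcid_on` are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class GHCIDHistoryEntry:
    """Stand-in with a subset of the slots listed for GHCIDHistoryEntry."""

    ghcid: str
    valid_from: date
    valid_to: Optional[date] = None  # assumption: None means still valid
    reason: str = ""


def ghcid_on(history: List[GHCIDHistoryEntry], day: date) -> Optional[str]:
    """Return the GHCID whose validity period contains `day`, if any."""
    for entry in history:
        starts_before = entry.valid_from <= day
        ends_after = entry.valid_to is None or day < entry.valid_to
        if starts_before and ends_after:
            return entry.ghcid
    return None


# Placeholder identifiers, not real GHCIDs.
history = [
    GHCIDHistoryEntry("EXAMPLE-OLD", date(2000, 1, 1), date(2015, 6, 1), "relocation"),
    GHCIDHistoryEntry("EXAMPLE-NEW", date(2015, 6, 1)),
]

print(ghcid_on(history, date(2010, 3, 1)))
print(ghcid_on(history, date(2020, 3, 1)))
```

The half-open interval keeps adjacent entries from overlapping on the changeover date; whether the real data follows that convention would need to be confirmed against the schema.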
4. collections.yaml (4,834 bytes)
Purpose: Collections, digital platforms, and partnerships.
Contents:
- Collection - Heritage collections
  - Metadata (collection_id, collection_name, collection_description, collection_type)
  - Content (item_count, subjects, time_period_start, time_period_end)
  - Access (access_rights, catalog_url)
  - NEW: Custodian link (custodian slot → implements rico:hasOrHadHolder)
- DigitalPlatform - Digital systems and platforms
  - Platform info (platform_name, platform_type, platform_url)
  - Technical (vendor, implemented_standards)
- Partnership - Partnerships and network memberships
  - Partner info (partner_name, partner_id, partnership_type)
- Temporal (start_date, end_date)
Ontology Mappings:
- Collection → schema:Collection + rico:RecordSet
- DigitalPlatform → schema:SoftwareApplication
- Partnership → org:Membership
Key Feature:
The Collection.custodian slot implements RiC-O's rico:hasOrHadHolder pattern, creating bidirectional relationships between collections and their custodial organizations.
Usage:
imports:
  - collections
classes:
  MyClass:
    slots:
      - collections  # List of Collection objects
      - digital_platforms
      - partnerships
5. dutch.yaml (4,931 bytes)
Purpose: Dutch-specific extensions for heritage institutions.
Contents:
- DutchHeritageCustodian - Extends HeritageCustodian
- Dutch identifiers (kvk_number, gemeente_code)
- Regional info (provincie)
- Networks (samenwerkingsverband)
- Aggregation platforms (in_museum_register, in_rijkscollectie, in_collectie_nederland, in_archieven_nl)
Ontology Mappings:
- DutchHeritageCustodian → HeritageCustodian + tooiont:Overheidsorganisatie
Usage:
imports:
  - dutch
# Now you can use DutchHeritageCustodian
6. bibliographic.yaml (948 lines)
Purpose: SPAR Ontologies integration for scholarly publications about heritage institutions.
Contents:
- 9 Classes:
  - ScholarlyWork - FRBR Work level (abstract intellectual creation)
  - Publication - FRBR Expression level (concrete realization with authors, dates)
  - Journal - Academic journals with ISSN, impact factors, CiteScore
  - Conference - Conference proceedings and papers
  - Person - Authors, editors, reviewers with ORCID support
  - Organization - Publishers, affiliations with ROR IDs
  - Citation - Citation relationships between publications (CiTO)
  - DocumentSection - Document structure (abstract, methods, results, discussion)
  - PublishingRole - Author/editor/reviewer roles (PRO)
- 6 Enumerations:
  - PublicationTypeEnum - 13 types (journal article, conference paper, book, thesis, preprint, etc.)
  - CitationTypeEnum - 14 CiTO types (cites, citesAsAuthority, supports, refutes, critiques, etc.)
  - OpenAccessStatusEnum - 6 types (fully open, green OA, hybrid, closed, embargoed, bronze)
  - PublishingStatusEnum - 8 PSO states (submitted, under-review, accepted, published, retracted, etc.)
  - DocumentSectionTypeEnum - 9 DoCO types (abstract, introduction, methods, results, discussion, etc.)
  - PublishingRoleTypeEnum - 9 PRO roles (author, editor, reviewer, translator, corresponding-author, etc.)
SPAR Ontologies Integrated:
- FaBiO (FRBR-aligned Bibliographic Ontology) - Core bibliographic classes
- CiTO (Citation Typing Ontology) - Citation relationships
- BiRO (Bibliographic Reference Ontology) - Citation references
- DoCO (Document Components Ontology) - Document sections
- PRO (Publishing Roles Ontology) - Author/editor/reviewer roles
- PSO (Publishing Status Ontology) - Publishing workflow
- DataCite - DOI, ORCID identifiers
Validation Patterns:
doi: "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$" # DOI format
issn: "^\\d{4}-\\d{3}[\\dX]$" # ISSN format
isbn: "^(97[89])?[0-9]{9}[0-9X]$" # ISBN-10/13
orcid: "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$" # ORCID format
ror_id: "^https://ror\\.org/0[a-z0-9]{6}[0-9]{2}$" # ROR ID
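To see what these patterns accept, the following sketch applies them with Python's `re` module. The doubled backslashes in the YAML strings become single backslashes in Python raw strings. The helper `matches` is a hypothetical name, and the ROR value is only an illustrative string that fits the pattern.

```python
import re

# Same patterns as above, written as Python raw strings.
PATTERNS = {
    "doi": r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$",
    "issn": r"^\d{4}-\d{3}[\dX]$",
    "isbn": r"^(97[89])?[0-9]{9}[0-9X]$",
    "orcid": r"^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$",
    "ror_id": r"^https://ror\.org/0[a-z0-9]{6}[0-9]{2}$",
}


def matches(field: str, value: str) -> bool:
    """Return True if value satisfies the pattern registered for field."""
    return re.fullmatch(PATTERNS[field], value) is not None


samples = [
    ("doi", "10.1007/s00799-023-00123-4"),
    ("issn", "1432-5012"),
    ("isbn", "9789012345678"),
    ("isbn", "978-90-123-4567-8"),  # hyphenated form is rejected
    ("orcid", "0000-0002-1234-5678"),
    ("ror_id", "https://ror.org/04dkp9463"),
]
for field, value in samples:
    print(field, value, matches(field, value))
```

Note that the ISBN pattern only accepts unhyphenated values, so hyphenated ISBNs would need to be normalized before validation.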
Ontology Mappings:
- ScholarlyWork → fabio:Work + frbr:Work
- Publication → fabio:Expression + frbr:Expression
- Journal → fabio:Journal + bibo:Journal
- Conference → fabio:AcademicProceedings + bibo:Conference
- Person → foaf:Person
- Organization → schema:Organization
- Citation → cito:Citation + biro:BibliographicReference
- DocumentSection → doco:Section
- PublishingRole → pro:PublishingRole
Integration with HeritageCustodian:
# core.yaml now includes:
HeritageCustodian:
  slots:
    - publications  # Links to Publication class
                    # Tracks scholarly works about/by this institution
Usage:
imports:
  - bibliographic
# Example: Publication with citations
- publication_id: https://doi.org/10.1007/s00799-023-00123-4
  title: "Linked Open Data for Museum Collections: The Rijksmuseum Case"
  publication_type: JOURNAL_ARTICLE
  authors:
    - person_name: "John Smith"
      orcid: "0000-0002-1234-5678"
      affiliation:
        organization_name: "University of Amsterdam"
  published_in:
    journal_title: "International Journal on Digital Libraries"
    issn: "1432-5012"
    impact_factor: 2.5
  doi: "10.1007/s00799-023-00123-4"
  open_access_status: FULLY_OPEN_ACCESS
  publishing_status: PUBLISHED
  citations:
    - citing_work: https://doi.org/10.1007/s00799-023-00123-4
      cited_work: https://doi.org/10.1145/2970000.2970111
      citation_type: CITES_AS_AUTHORITY
      citation_context: "Following the approach of Smith et al. [42]..."
FRBR Work/Expression Pattern:
# Abstract work
- work_id: https://w3id.org/heritage/work/glam-linked-data
  work_title: "Linked Open Data for Heritage Institutions"
  discipline:
    - Computer Science
    - Library and Information Science
  has_expression:
    # English journal article
    - publication_id: https://doi.org/10.1007/s00799-023-00123-4
      title: "Linked Open Data for Museum Collections"
      language: en
      publication_year: 2023
    # Dutch book chapter
    - publication_id: https://isbn.org/9789012345678
      title: "Gekoppelde Open Data voor Musea"
      language: nl
      publication_year: 2024
Citation Graph Example (CiTO):
# Paper A cites Paper B as authoritative source
- citation_id: https://w3id.org/heritage/citation/001
  citing_work: https://doi.org/10.1007/paper-a
  cited_work: https://doi.org/10.1145/paper-b
  citation_type: CITES_AS_AUTHORITY
  citation_intent: "Foundational methodology for LOD extraction"
  citation_context: "Following the approach of Jones et al. [15]..."
  page_number: "p. 7"
# Paper C refutes Paper D
- citation_id: https://w3id.org/heritage/citation/002
  citing_work: https://doi.org/10.1016/paper-c
  cited_work: https://doi.org/10.1108/paper-d
  citation_type: REFUTES
  citation_intent: "Challenge assumptions about metadata quality"
  citation_context: "Contrary to Smith's findings [23], we observe..."
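Records like these can be turned into a typed citation graph with a few lines of Python. The sketch below assumes the records have already been loaded into plain dictionaries with the slot names shown above. The function names `build_graph` and `cited_with_type` are hypothetical and not part of the schema tooling.

```python
from collections import defaultdict

# The two citation records from the example, reduced to the slots used here.
citations = [
    {
        "citing_work": "https://doi.org/10.1007/paper-a",
        "cited_work": "https://doi.org/10.1145/paper-b",
        "citation_type": "CITES_AS_AUTHORITY",
    },
    {
        "citing_work": "https://doi.org/10.1016/paper-c",
        "cited_work": "https://doi.org/10.1108/paper-d",
        "citation_type": "REFUTES",
    },
]


def build_graph(records):
    """Map each citing work to a list of (cited work, citation type) edges."""
    graph = defaultdict(list)
    for rec in records:
        graph[rec["citing_work"]].append((rec["cited_work"], rec["citation_type"]))
    return dict(graph)


def cited_with_type(records, citation_type):
    """Return the works that are cited with the given CiTO-style type."""
    return [r["cited_work"] for r in records if r["citation_type"] == citation_type]


graph = build_graph(citations)
print(graph["https://doi.org/10.1007/paper-a"])
print(cited_with_type(citations, "REFUTES"))
```

Keeping the citation type on each edge is what distinguishes this from an untyped citation count: supporting and refuting citations can be queried separately.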
Document Structure (DoCO):
- publication_id: https://doi.org/10.1007/example
  title: "Semantic Technologies for Heritage Institutions"
  document_sections:
    - section_id: https://doi.org/10.1007/example#abstract
      section_type: ABSTRACT
      section_content: "This paper presents..."
      section_order: 1
    - section_id: https://doi.org/10.1007/example#introduction
      section_type: INTRODUCTION
      section_content: "Heritage institutions face challenges..."
      section_order: 2
    - section_id: https://doi.org/10.1007/example#methods
      section_type: METHODS
      section_content: "We developed a LinkML schema..."
      section_order: 3
    - section_id: https://doi.org/10.1007/example#results
      section_type: RESULTS
      section_content: "Our approach successfully extracted..."
      section_order: 4
Publishing Workflow (PSO):
# Track publication status over time
- publication_id: https://doi.org/10.1007/example
  title: "Heritage Institution Metadata Analysis"
  publishing_status: PUBLISHED  # Current status
  # Status history (not in schema, but tracked externally):
  # 2024-01-15: SUBMITTED
  # 2024-02-01: UNDER_REVIEW
  # 2024-04-10: ACCEPTED
  # 2024-05-20: IN_PRESS
  # 2024-06-01: PUBLISHED
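Because the status history is not part of the schema, it has to live somewhere else. One minimal way to track it externally, assuming a plain list of (date, status) pairs, is sketched below. The `status_on` helper is hypothetical; the dates and states are the ones from the comment above.

```python
from datetime import date

# Status history from the example, kept outside the schema.
status_history = [
    (date(2024, 1, 15), "SUBMITTED"),
    (date(2024, 2, 1), "UNDER_REVIEW"),
    (date(2024, 4, 10), "ACCEPTED"),
    (date(2024, 5, 20), "IN_PRESS"),
    (date(2024, 6, 1), "PUBLISHED"),
]


def status_on(history, day):
    """Return the most recent status on or before `day`, or None if none applies."""
    current = None
    for changed_on, status in sorted(history):
        if changed_on <= day:
            current = status
    return current


print(status_on(status_history, date(2024, 3, 1)))
print(status_on(status_history, date(2024, 6, 1)))
```

The last entry of the history should agree with the publishing_status slot on the publication record, which holds only the current state.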
Main Schema (heritage_custodian.yaml)
The main schema file now serves as an aggregator: it is the seventh module and imports the other six:
id: https://w3id.org/heritage/custodian
name: heritage-custodian
title: Heritage Custodian Schema
description: >-
  Comprehensive LinkML schema for modeling heritage institutions...
version: 0.2.1
# Prefixes (all ontologies used across modules)
prefixes:
  heritage: https://w3id.org/heritage/custodian/
  org: http://www.w3.org/ns/org#
  prov: http://www.w3.org/ns/prov#
  rico: https://www.ica.org/standards/RiC/ontology#
  fabio: http://purl.org/spar/fabio/
  cito: http://purl.org/spar/cito/
  datacite: http://purl.org/spar/datacite/
  # ... (30+ prefixes)
# Import all modules
imports:
  - linkml:types
  - enums
  - core
  - provenance
  - collections
  - dutch
  - bibliographic  # NEW: SPAR Ontologies for scholarly publications
Using Modules
Full Schema (All Modules)
from linkml_runtime import SchemaView
sv = SchemaView('schemas/heritage_custodian.yaml')
# Access all 21 classes, 190 slots, 13 enums
# Includes bibliographic module for scholarly publications
Individual Modules
# Just enumerations
sv = SchemaView('schemas/enums.yaml')
# Just core classes
sv = SchemaView('schemas/core.yaml')
# Just provenance tracking
sv = SchemaView('schemas/provenance.yaml')
# Just bibliographic (SPAR Ontologies)
sv = SchemaView('schemas/bibliographic.yaml')
Selective Imports
Create a custom schema importing only what you need:
id: https://example.org/my-schema
imports:
  - linkml:types
  - enums
  - core
  # Omit provenance, collections, dutch, bibliographic if not needed
classes:
  MyCustomClass:
    is_a: HeritageCustodian
    # Use core classes without Dutch-specific fields
Domain-Specific Module Example (Bibliographic)
For scholarly publication tracking:
id: https://example.org/publication-tracker
imports:
  - linkml:types
  - bibliographic  # Just SPAR Ontologies classes
# Track publications about heritage institutions
# without importing full organizational metadata
classes:
  SimplifiedPublication:
    is_a: Publication
    slots:
      - title
      - doi
      - authors
      - citations
Benefits
1. Independent Usage
Use only the modules relevant to your use case. For example:
- Global dataset without Dutch institutions → omit dutch.yaml
- Simple institution list without provenance → omit provenance.yaml
- Just enumerations for validation → use enums.yaml alone
2. Clearer Organization
Each module has a single, focused purpose:
- Enums: Type definitions
- Core: Organizational structure
- Provenance: Data quality and history
- Collections: Holdings and platforms
- Dutch: Country-specific extensions
- Bibliographic: Scholarly publications (SPAR Ontologies)
3. Easier Maintenance
Changes are isolated to relevant modules:
- Add new institution type → edit enums.yaml
- Add new core field → edit core.yaml
- Add new country module → create brazil.yaml (doesn't affect existing modules)
4. Better IDE Support
Smaller files load faster and are easier to navigate in IDEs.
5. Future Extensibility
Easy to add country-specific and domain-specific modules following established patterns:
Country-Specific Modules (following dutch.yaml pattern):
schemas/
├── heritage_custodian.yaml
├── enums.yaml
├── core.yaml
├── provenance.yaml
├── collections.yaml
├── dutch.yaml
├── brazil.yaml # New: Brazilian-specific extensions (CNPJ, Ibram registry)
├── vietnam.yaml # New: Vietnamese-specific extensions
└── japan.yaml # New: Japanese-specific extensions (ISIL-JP, NACSIS)
Domain-Specific Modules (following bibliographic.yaml pattern):
schemas/
├── heritage_custodian.yaml
├── bibliographic.yaml # SPAR Ontologies for scholarly publications
├── museum_objects.yaml # Future: CIDOC-CRM for museum object cataloging
├── archival_desc.yaml # Future: RiC-O for archival description
└── conservation.yaml # Future: Conservation event tracking
Validation
LinkML Validation
All modules validate successfully with LinkML runtime:
$ python3 -c "from linkml_runtime import SchemaView; sv = SchemaView('schemas/heritage_custodian.yaml'); print(f'✓ {len(list(sv.all_classes()))} classes, {len(list(sv.all_slots()))} slots, {len(list(sv.all_enums()))} enums')"
✓ 21 classes, 190 slots, 13 enums
Backward Compatibility
All existing parsers and tests pass without modification:
$ pytest tests/parsers/ -q
53 passed in 0.40s ✓
Python Models
Generated Pydantic models work correctly:
from glam_extractor.models import DutchHeritageCustodian, Provenance, Publication
# All models functional, no breaking changes
Migration Guide
For Schema Developers
Before (monolithic):
- Edit 1,102-line heritage_custodian.yaml
- Hard to find relevant sections
- Risk of breaking unrelated code
After (modular):
- Edit focused module (e.g., collections.yaml for collection fields)
- Clear module scope
- Changes isolated
For Schema Users
No changes required! The main schema still works exactly as before:
from linkml_runtime import SchemaView
sv = SchemaView('schemas/heritage_custodian.yaml')
# All classes, slots, enums available as before
For Extension Developers
New pattern: Create country/domain-specific modules:
# schemas/my_extension.yaml
id: https://example.org/my-extension
imports:
  - core  # Import base classes
classes:
  MyCustomCustodian:
    is_a: HeritageCustodian
    slots:
      - my_custom_field
Metrics
| Metric | Before (v0.1) | After (v0.2.1) | Change |
|---|---|---|---|
| Main schema size | 1,102 lines | 73 lines | -93% |
| Total schema size | 1,102 lines | 2,436 lines* | +121% (comprehensive) |
| Modules | 1 | 7 | Focused separation |
| Classes | 12 | 21 | +9 bibliographic classes |
| Slots | 104 | 190 | +86 bibliographic slots |
| Enums | 7 | 13 | +6 bibliographic enums |
| Test coverage | 53 tests | 53 tests | ✓ 100% pass rate |
| Ontologies integrated | 5 | 12 | +7 SPAR ontologies |
*Includes extensive documentation, examples, and SPAR Ontologies integration
Module Breakdown
| Module | Lines | Classes | Key Features |
|---|---|---|---|
| enums.yaml | 257 | 0 | 13 enumerations (institution types, data tiers, citation types, etc.) |
| core.yaml | 610 | 5 | HeritageCustodian, Location, ContactInfo, Identifier, OrganizationalUnit |
| provenance.yaml | 237 | 3 | Provenance, GHCIDHistoryEntry, ChangeEvent |
| collections.yaml | 194 | 3 | Collection, DigitalPlatform, Partnership |
| dutch.yaml | 117 | 1 | DutchHeritageCustodian (extends HeritageCustodian) |
| bibliographic.yaml | 948 | 9 | ScholarlyWork, Publication, Journal, Conference, Person, Organization, Citation, DocumentSection, PublishingRole |
| heritage_custodian.yaml | 73 | 0 | Main aggregator (imports all modules) |
| TOTAL | 2,436 | 21 | Comprehensive heritage institution modeling |
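The totals and percentage changes reported in the two tables can be cross-checked with a short calculation. The sketch below simply re-adds the per-module figures from the breakdown and recomputes the changes against the 1,102-line monolithic file. The `percent_change` helper is a hypothetical name, and rounding to whole percent is an assumption about how the table values were derived.

```python
# Per-module figures copied from the Module Breakdown table.
module_lines = {
    "enums.yaml": 257,
    "core.yaml": 610,
    "provenance.yaml": 237,
    "collections.yaml": 194,
    "dutch.yaml": 117,
    "bibliographic.yaml": 948,
    "heritage_custodian.yaml": 73,
}
module_classes = {
    "enums.yaml": 0,
    "core.yaml": 5,
    "provenance.yaml": 3,
    "collections.yaml": 3,
    "dutch.yaml": 1,
    "bibliographic.yaml": 9,
    "heritage_custodian.yaml": 0,
}


def percent_change(before: int, after: int) -> int:
    """Relative change from before to after, rounded to whole percent."""
    return round((after - before) / before * 100)


total_lines = sum(module_lines.values())
total_classes = sum(module_classes.values())

print(total_lines, total_classes)  # 2436 21
print(percent_change(1102, module_lines["heritage_custodian.yaml"]))  # -93
print(percent_change(1102, total_lines))  # 121
```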
References
- Schema Files: /schemas/
- Documentation: /docs/
- Tests: /tests/parsers/
- Models: /src/glam_extractor/models.py
Last Updated: 2025-11-09
Schema Version: 0.2.1