# Schema Modularization Architecture

**Version**: 0.2.1
**Date**: 2025-11-09
**Status**: ✅ Complete

## Overview

The Heritage Custodian schema has been modularized from a single 1,102-line YAML file into 7 focused, reusable modules. This architecture enables:

- **Independent usage**: Use only the modules you need
- **Clear separation of concerns**: Core entities vs. provenance vs. collections vs. bibliographic
- **Easier maintenance**: Changes isolated to specific modules
- **Better comprehension**: Each module is self-contained and focused
- **Future extensibility**: Easy to add country-specific or domain-specific modules
- **Standards integration**: SPAR Ontologies (FaBiO, CiTO, PRO, PSO, DoCO) for scholarly publications

## Module Structure

```
schemas/
├── heritage_custodian.yaml    # Main schema (73 lines, imports all modules)
├── enums.yaml                 # Enumerations (institution types, data tiers, etc.)
├── core.yaml                  # Core organizational classes
├── provenance.yaml            # Provenance tracking, GHCID history, change events
├── collections.yaml           # Collections, digital platforms, partnerships
├── dutch.yaml                 # Dutch-specific extensions
└── bibliographic.yaml         # NEW: SPAR Ontologies for scholarly publications
```

## Module Details

### 1. `enums.yaml` (8,671 bytes)

**Purpose**: Define all enumeration types used across the schema.

**Contents**:
- `InstitutionTypeEnum` - 13 heritage institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- `OrganizationStatusEnum` - Operational status (ACTIVE, INACTIVE, MERGED, etc.)
- `DataSourceEnum` - Data provenance sources (ISIL_REGISTRY, WIKIDATA, CONVERSATION_NLP, etc.)
- `DataTierEnum` - Data quality tiers (TIER_1_AUTHORITATIVE → TIER_4_INFERRED)
- `MetadataStandardEnum` - Standards used (DUBLIN_CORE, MARC21, EAD, RIC_O, etc.)
- `DigitalPlatformTypeEnum` - Platform types (COLLECTION_MANAGEMENT, DIGITAL_REPOSITORY, etc.)
- `ChangeTypeEnum` - Organizational changes (FOUNDING, CLOSURE, MERGER, RELOCATION, etc.)

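
A consumer that only needs to check incoming values against one of these enumerations can mirror its permissible values in code. The following is a minimal Python sketch, not part of the schema: `InstitutionType` and `parse_institution_type` are hypothetical names, and only the four values named above are included, whereas `InstitutionTypeEnum` itself defines 13.

```python
from enum import Enum
from typing import Optional


class InstitutionType(Enum):
    """Partial, hand-written mirror of InstitutionTypeEnum (assumption: 4 of 13 values)."""

    GALLERY = "GALLERY"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    MUSEUM = "MUSEUM"


def parse_institution_type(raw: str) -> Optional[InstitutionType]:
    """Return the matching member, or None if the value is not permissible."""
    key = raw.strip().upper()  # tolerate case and surrounding whitespace
    return InstitutionType.__members__.get(key)


if __name__ == "__main__":
    for candidate in ("museum", "  Archive ", "castle"):
        print(candidate, "->", parse_institution_type(candidate))
```

In a real pipeline the permissible values would more likely be read from `enums.yaml` than duplicated by hand; the hard-coded class here only keeps the sketch self-contained.
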
**Usage**:
```yaml
imports:
  - enums

classes:
  MyClass:
    slots:
      - institution_type  # Uses InstitutionTypeEnum
```

### 2. `core.yaml` (13,810 bytes)

**Purpose**: Core organizational classes and identification.

**Contents**:
- `HeritageCustodian` - Main heritage institution class
  - GHCID tracking (numeric, current, original, history)
  - Organizational metadata (name, type, status, description)
  - Temporal fields (founded_date, closed_date, prov_generated_at, prov_invalidated_at)
  - Hierarchies (parent_organization, sub_organizations)
  - TOOI naming (official_name, sorting_name, abbreviation)
- `Location` - Physical/virtual locations
  - Address fields (street, city, postal_code, region, country)
  - Geocoding (latitude, longitude, geonames_id)
  - Primary location flag
- `ContactInfo` - Contact information (email, phone, fax)
- `Identifier` - External identifiers (ISIL, VIAF, Wikidata, KvK)
- `OrganizationalUnit` - Departments and divisions within institutions

**Ontology Mappings**:
- `HeritageCustodian` → `org:Organization` + `prov:Entity`
- `Location` → `schema:Place`
- `ContactInfo` → `cpov:ContactPoint` + `schema:ContactPoint`
- `Identifier` → `dcterms:identifier`
- `OrganizationalUnit` → `org:OrganizationalUnit`

**Usage**:
```yaml
imports:
  - core

# Now you can use HeritageCustodian, Location, etc.
```

### 3. `provenance.yaml` (6,715 bytes)

**Purpose**: Data quality tracking, GHCID history, and organizational change events.

**Contents**:
- `Provenance` - Data provenance metadata
  - Source and tier (data_source, data_tier)
  - Extraction metadata (extraction_date, extraction_method, confidence_score)
  - Verification (verified_date, verified_by)
  - Source references (conversation_id, source_url)
- `GHCIDHistoryEntry` - Historical GHCID tracking
  - GHCID values (ghcid, ghcid_numeric)
  - Validity period (valid_from, valid_to)
  - Context (institution_name, location_city, location_country, reason)
- `ChangeEvent` - Organizational lifecycle events
  - Event metadata (event_id, change_type, event_date, event_description)
  - Affected entities (affected_organization, resulting_organization, related_organizations)
  - Documentation (source_documentation)

**Ontology Mappings**:
- `ChangeEvent` → `prov:Activity` + `tooi:Wijzigingsgebeurtenis`

**Usage**:
```yaml
imports:
  - provenance

classes:
  MyClass:
    slots:
      - provenance      # Uses Provenance class
      - change_history  # List of ChangeEvent
```

### 4. `collections.yaml` (4,834 bytes)

**Purpose**: Collections, digital platforms, and partnerships.

**Contents**:
- `Collection` - Heritage collections
  - Metadata (collection_id, collection_name, collection_description, collection_type)
  - Content (item_count, subjects, time_period_start, time_period_end)
  - Access (access_rights, catalog_url)
  - **NEW**: Custodian link (custodian slot → implements `rico:hasOrHadHolder`)
- `DigitalPlatform` - Digital systems and platforms
  - Platform info (platform_name, platform_type, platform_url)
  - Technical (vendor, implemented_standards)
- `Partnership` - Partnerships and network memberships
  - Partner info (partner_name, partner_id, partnership_type)
  - Temporal (start_date, end_date)

**Ontology Mappings**:
- `Collection` → `schema:Collection` + `rico:RecordSet`
- `DigitalPlatform` → `schema:SoftwareApplication`
- `Partnership` → `org:Membership`

**Key Feature**: The `Collection.custodian` slot implements RiC-O's `rico:hasOrHadHolder` pattern, creating bidirectional relationships between collections and their custodial organizations.

**Usage**:
```yaml
imports:
  - collections

classes:
  MyClass:
    slots:
      - collections  # List of Collection objects
      - digital_platforms
      - partnerships
```

### 5. `dutch.yaml` (4,931 bytes)

**Purpose**: Dutch-specific extensions for heritage institutions.

**Contents**:
- `DutchHeritageCustodian` - Extends HeritageCustodian
  - Dutch identifiers (kvk_number, gemeente_code)
  - Regional info (provincie)
  - Networks (samenwerkingsverband)
  - Aggregation platforms (in_museum_register, in_rijkscollectie, in_collectie_nederland, in_archieven_nl)

**Ontology Mappings**:
- `DutchHeritageCustodian` → `HeritageCustodian` + `tooi:Overheidsorganisatie`

**Usage**:
```yaml
imports:
  - dutch

# Now you can use DutchHeritageCustodian
```

### 6. `bibliographic.yaml` (948 lines)

**Purpose**: SPAR Ontologies integration for scholarly publications about heritage institutions.

**Contents**:
- **9 Classes**:
  - `ScholarlyWork` - FRBR Work level (abstract intellectual creation)
  - `Publication` - FRBR Expression level (concrete realization with authors, dates)
  - `Journal` - Academic journals with ISSN, impact factors, CiteScore
  - `Conference` - Conference proceedings and papers
  - `Person` - Authors, editors, reviewers with ORCID support
  - `Organization` - Publishers, affiliations with ROR IDs
  - `Citation` - Citation relationships between publications (CiTO)
  - `DocumentSection` - Document structure (abstract, methods, results, discussion)
  - `PublishingRole` - Author/editor/reviewer roles (PRO)
- **6 Enumerations**:
  - `PublicationTypeEnum` - 13 types (journal article, conference paper, book, thesis, preprint, etc.)
  - `CitationTypeEnum` - 14 CiTO types (cites, citesAsAuthority, supports, refutes, critiques, etc.)
  - `OpenAccessStatusEnum` - 6 types (fully open, green OA, hybrid, closed, embargoed, bronze)
  - `PublishingStatusEnum` - 8 PSO states (submitted, under-review, accepted, published, retracted, etc.)
  - `DocumentSectionTypeEnum` - 9 DoCO types (abstract, introduction, methods, results, discussion, etc.)
  - `PublishingRoleTypeEnum` - 9 PRO roles (author, editor, reviewer, translator, corresponding-author, etc.)

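
The identifier slots on these classes (DOI, ISSN, ISBN, ORCID, ROR) are constrained by the regular expressions listed under **Validation Patterns** in this section. As a rough illustration of what those patterns accept, here is a minimal Python sketch using only the standard library. The regular expressions are copied from the module with the YAML escaping removed; the `PATTERNS` dictionary and the `is_valid` helper are hypothetical names introduced for this example, and the ROR value is synthetic.

```python
import re

# Patterns as documented for bibliographic.yaml (YAML double-backslashes unescaped).
PATTERNS = {
    "doi": re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$"),
    "issn": re.compile(r"^\d{4}-\d{3}[\dX]$"),
    "isbn": re.compile(r"^(97[89])?[0-9]{9}[0-9X]$"),
    "orcid": re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$"),
    "ror_id": re.compile(r"^https://ror\.org/0[a-z0-9]{6}[0-9]{2}$"),
}


def is_valid(kind: str, value: str) -> bool:
    """Check a value against the pattern for the given identifier kind."""
    return PATTERNS[kind].match(value) is not None


if __name__ == "__main__":
    samples = [
        ("doi", "10.1007/s00799-023-00123-4"),
        ("issn", "1432-5012"),
        ("orcid", "0000-0002-1234-5678"),
        ("isbn", "9789012345678"),
        ("isbn", "978-90-123-4567-8"),       # hyphenated form
        ("ror_id", "https://ror.org/0abcdef12"),  # synthetic value
    ]
    for kind, value in samples:
        print(f"{kind:7} {value:35} {is_valid(kind, value)}")
```

Note that the ISBN pattern as written has no provision for hyphens, so hyphenated ISBNs need to be normalized before validation. The patterns check format only; none of them verifies check digits.
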
**SPAR Ontologies Integrated**:
- **FaBiO** (FRBR-aligned Bibliographic Ontology) - Core bibliographic classes
- **CiTO** (Citation Typing Ontology) - Citation relationships
- **BiRO** (Bibliographic Reference Ontology) - Citation references
- **DoCO** (Document Components Ontology) - Document sections
- **PRO** (Publishing Roles Ontology) - Author/editor/reviewer roles
- **PSO** (Publishing Status Ontology) - Publishing workflow
- **DataCite** - DOI, ORCID identifiers

**Validation Patterns**:
```yaml
doi: "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$"         # DOI format
issn: "^\\d{4}-\\d{3}[\\dX]$"                        # ISSN format
isbn: "^(97[89])?[0-9]{9}[0-9X]$"                    # ISBN-10/13
orcid: "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$"         # ORCID format
ror_id: "^https://ror\\.org/0[a-z0-9]{6}[0-9]{2}$"   # ROR ID
```

**Ontology Mappings**:
- `ScholarlyWork` → `fabio:Work` + `frbr:Work`
- `Publication` → `fabio:Expression` + `frbr:Expression`
- `Journal` → `fabio:Journal` + `bibo:Journal`
- `Conference` → `fabio:AcademicProceedings` + `bibo:Conference`
- `Person` → `foaf:Person`
- `Organization` → `schema:Organization`
- `Citation` → `cito:Citation` + `biro:BibliographicReference`
- `DocumentSection` → `doco:Section`
- `PublishingRole` → `pro:PublishingRole`

**Integration with HeritageCustodian**:
```yaml
# core.yaml now includes:
HeritageCustodian:
  slots:
    - publications  # Links to Publication class
    # Tracks scholarly works about/by this institution
```

**Usage**:
```yaml
imports:
  - bibliographic

# Example: Publication with citations
- publication_id: https://doi.org/10.1007/s00799-023-00123-4
  title: "Linked Open Data for Museum Collections: The Rijksmuseum Case"
  publication_type: JOURNAL_ARTICLE
  authors:
    - person_name: "John Smith"
      orcid: "0000-0002-1234-5678"
      affiliation:
        organization_name: "University of Amsterdam"
  published_in:
    journal_title: "International Journal on Digital Libraries"
    issn: "1432-5012"
    impact_factor: 2.5
  doi: "10.1007/s00799-023-00123-4"
  open_access_status: FULLY_OPEN_ACCESS
  publishing_status: PUBLISHED
  citations:
    - citing_work: https://doi.org/10.1007/s00799-023-00123-4
      cited_work: https://doi.org/10.1145/2970000.2970111
      citation_type: CITES_AS_AUTHORITY
      citation_context: "Following the approach of Smith et al. [42]..."
```

**FRBR Work/Expression Pattern**:
```yaml
# Abstract work
- work_id: https://w3id.org/heritage/work/glam-linked-data
  work_title: "Linked Open Data for Heritage Institutions"
  discipline:
    - Computer Science
    - Library and Information Science
  has_expression:
    # English journal article
    - publication_id: https://doi.org/10.1007/s00799-023-00123-4
      title: "Linked Open Data for Museum Collections"
      language: en
      publication_year: 2023
    # Dutch book chapter
    - publication_id: https://isbn.org/9789012345678
      title: "Gekoppelde Open Data voor Musea"
      language: nl
      publication_year: 2024
```

**Citation Graph Example (CiTO)**:
```yaml
# Paper A cites Paper B as authoritative source
- citation_id: https://w3id.org/heritage/citation/001
  citing_work: https://doi.org/10.1007/paper-a
  cited_work: https://doi.org/10.1145/paper-b
  citation_type: CITES_AS_AUTHORITY
  citation_intent: "Foundational methodology for LOD extraction"
  citation_context: "Following the approach of Jones et al. [15]..."
  page_number: "p. 7"

# Paper C refutes Paper D
- citation_id: https://w3id.org/heritage/citation/002
  citing_work: https://doi.org/10.1016/paper-c
  cited_work: https://doi.org/10.1108/paper-d
  citation_type: REFUTES
  citation_intent: "Challenge assumptions about metadata quality"
  citation_context: "Contrary to Smith's findings [23], we observe..."
```

**Document Structure (DoCO)**:
```yaml
- publication_id: https://doi.org/10.1007/example
  title: "Semantic Technologies for Heritage Institutions"
  document_sections:
    - section_id: https://doi.org/10.1007/example#abstract
      section_type: ABSTRACT
      section_content: "This paper presents..."
      section_order: 1
    - section_id: https://doi.org/10.1007/example#introduction
      section_type: INTRODUCTION
      section_content: "Heritage institutions face challenges..."
      section_order: 2
    - section_id: https://doi.org/10.1007/example#methods
      section_type: METHODS
      section_content: "We developed a LinkML schema..."
      section_order: 3
    - section_id: https://doi.org/10.1007/example#results
      section_type: RESULTS
      section_content: "Our approach successfully extracted..."
      section_order: 4
```

**Publishing Workflow (PSO)**:
```yaml
# Track publication status over time
- publication_id: https://doi.org/10.1007/example
  title: "Heritage Institution Metadata Analysis"
  publishing_status: PUBLISHED  # Current status
  # Status history (not in schema, but tracked externally):
  # 2024-01-15: SUBMITTED
  # 2024-02-01: UNDER_REVIEW
  # 2024-04-10: ACCEPTED
  # 2024-05-20: IN_PRESS
  # 2024-06-01: PUBLISHED
```

## Main Schema (`heritage_custodian.yaml`)

The main schema file now serves as an aggregator, importing the six other modules:

```yaml
id: https://w3id.org/heritage/custodian
name: heritage-custodian
title: Heritage Custodian Schema
description: >-
  Comprehensive LinkML schema for modeling heritage institutions...
version: 0.2.1

# Prefixes (all ontologies used across modules)
prefixes:
  heritage: https://w3id.org/heritage/custodian/
  org: http://www.w3.org/ns/org#
  prov: http://www.w3.org/ns/prov#
  rico: https://www.ica.org/standards/RiC/ontology#
  fabio: http://purl.org/spar/fabio/
  cito: http://purl.org/spar/cito/
  datacite: http://purl.org/spar/datacite/
  # ... (30+ prefixes)

# Import all modules
imports:
  - linkml:types
  - enums
  - core
  - provenance
  - collections
  - dutch
  - bibliographic  # NEW: SPAR Ontologies for scholarly publications
```

## Using Modules

### Full Schema (All Modules)

```python
from linkml_runtime import SchemaView

sv = SchemaView('schemas/heritage_custodian.yaml')
# Access all 21 classes, 190 slots, 13 enums
# Includes bibliographic module for scholarly publications
```

### Individual Modules

```python
# Just enumerations
sv = SchemaView('schemas/enums.yaml')

# Just core classes
sv = SchemaView('schemas/core.yaml')

# Just provenance tracking
sv = SchemaView('schemas/provenance.yaml')

# Just bibliographic (SPAR Ontologies)
sv = SchemaView('schemas/bibliographic.yaml')
```

### Selective Imports

Create a custom schema importing only what you need:

```yaml
id: https://example.org/my-schema
imports:
  - linkml:types
  - enums
  - core
  # Omit provenance, collections, dutch, bibliographic if not needed

classes:
  MyCustomClass:
    is_a: HeritageCustodian
    # Use core classes without Dutch-specific fields
```

### Domain-Specific Module Example (Bibliographic)

For scholarly publication tracking:

```yaml
id: https://example.org/publication-tracker
imports:
  - linkml:types
  - bibliographic  # Just SPAR Ontologies classes

# Track publications about heritage institutions
# without importing full organizational metadata
classes:
  SimplifiedPublication:
    is_a: Publication
    slots:
      - title
      - doi
      - authors
      - citations
```

## Benefits

### 1. **Independent Usage**

Use only the modules relevant to your use case. For example:
- Global dataset without Dutch institutions → omit `dutch.yaml`
- Simple institution list without provenance → omit `provenance.yaml`
- Just enumerations for validation → use `enums.yaml` alone

### 2. **Clearer Organization**

Each module has a single, focused purpose:
- **Enums**: Type definitions
- **Core**: Organizational structure
- **Provenance**: Data quality and history
- **Collections**: Holdings and platforms
- **Dutch**: Country-specific extensions
- **Bibliographic**: Scholarly publications (SPAR Ontologies)

### 3. **Easier Maintenance**

Changes are isolated to relevant modules:
- Add new institution type → edit `enums.yaml`
- Add new core field → edit `core.yaml`
- Add new country module → create `brazil.yaml` (doesn't affect existing modules)

### 4. **Better IDE Support**

Smaller files load faster and are easier to navigate in IDEs.

### 5. **Future Extensibility**

Easy to add country-specific and domain-specific modules following established patterns:

**Country-Specific Modules** (following `dutch.yaml` pattern):
```
schemas/
├── heritage_custodian.yaml
├── enums.yaml
├── core.yaml
├── provenance.yaml
├── collections.yaml
├── dutch.yaml
├── brazil.yaml          # New: Brazilian-specific extensions (CNPJ, Ibram registry)
├── vietnam.yaml         # New: Vietnamese-specific extensions
└── japan.yaml           # New: Japanese-specific extensions (ISIL-JP, NACSIS)
```

**Domain-Specific Modules** (following `bibliographic.yaml` pattern):
```
schemas/
├── heritage_custodian.yaml
├── bibliographic.yaml   # SPAR Ontologies for scholarly publications
├── museum_objects.yaml  # Future: CIDOC-CRM for museum object cataloging
├── archival_desc.yaml   # Future: RiC-O for archival description
└── conservation.yaml    # Future: Conservation event tracking
```

## Validation

### LinkML Validation

All modules validate successfully with the LinkML runtime:

```bash
$ python3 -c "from linkml_runtime import SchemaView; sv = SchemaView('schemas/heritage_custodian.yaml'); print(f'✓ {len(list(sv.all_classes()))} classes, {len(list(sv.all_slots()))} slots, {len(list(sv.all_enums()))} enums')"
✓ 21 classes, 190 slots, 13 enums
```

### Backward Compatibility

All existing parsers and tests pass without modification:

```bash
$ pytest tests/parsers/ -q
53 passed in 0.40s ✓
```

### Python Models

Generated Pydantic models work correctly:

```python
from glam_extractor.models import DutchHeritageCustodian, Provenance, Publication
# All models functional, no breaking changes
```

## Migration Guide

### For Schema Developers

**Before (monolithic)**:
- Edit 1,102-line `heritage_custodian.yaml`
- Hard to find relevant sections
- Risk of breaking unrelated code

**After (modular)**:
- Edit focused module (e.g., `collections.yaml` for collection fields)
- Clear module scope
- Changes isolated

### For Schema Users

**No changes required!** The main schema still works exactly as before:

```python
from linkml_runtime import SchemaView

sv = SchemaView('schemas/heritage_custodian.yaml')
# All classes, slots, enums available as before
```

### For Extension Developers

**New pattern**: Create country/domain-specific modules:

```yaml
# schemas/my_extension.yaml
id: https://example.org/my-extension
imports:
  - core  # Import base classes

classes:
  MyCustomCustodian:
    is_a: HeritageCustodian
    slots:
      - my_custom_field
```

## Metrics

| Metric | Before (v0.1) | After (v0.2.1) | Change |
|--------|---------------|----------------|--------|
| Main schema size | 1,102 lines | 73 lines | **-93%** |
| Total schema size | 1,102 lines | 2,436 lines* | +121% (comprehensive) |
| Modules | 1 | 7 | **Focused separation** |
| Classes | 12 | 21 | **+9 bibliographic classes** |
| Slots | 104 | 190 | **+86 bibliographic slots** |
| Enums | 7 | 13 | **+6 bibliographic enums** |
| Test coverage | 53 tests | 53 tests | ✓ 100% pass rate |
| Ontologies integrated | 5 | **12** | **+7 SPAR ontologies** |

*Includes extensive documentation, examples, and SPAR Ontologies integration

### Module Breakdown

| Module | Lines | Classes | Key Features |
|--------|-------|---------|--------------|
| `enums.yaml` | 257 | 0 | 13 enumerations (institution types, data tiers, citation types, etc.) |
| `core.yaml` | 610 | 5 | HeritageCustodian, Location, ContactInfo, Identifier, OrganizationalUnit |
| `provenance.yaml` | 237 | 3 | Provenance, GHCIDHistoryEntry, ChangeEvent |
| `collections.yaml` | 194 | 3 | Collection, DigitalPlatform, Partnership |
| `dutch.yaml` | 117 | 1 | DutchHeritageCustodian (extends HeritageCustodian) |
| `bibliographic.yaml` | 948 | 9 | ScholarlyWork, Publication, Journal, Conference, Person, Organization, Citation, DocumentSection, PublishingRole |
| `heritage_custodian.yaml` | 73 | 0 | Main aggregator (imports all modules) |
| **TOTAL** | **2,436** | **21** | Comprehensive heritage institution modeling |

## References

- **Schema Files**: `/schemas/`
- **Documentation**: `/docs/`
- **Tests**: `/tests/parsers/`
- **Models**: `/src/glam_extractor/models.py`

---

**Last Updated**: 2025-11-09
**Schema Version**: 0.2.1