# =============================================================================
# GLAM-NER ENTITY ANNOTATION CONVENTION
# Module: Document Structure and Namespace Paths
# =============================================================================
#
# VERSION: 1.7.0
# STATUS: STABLE
# LAST_UPDATED: 2025-12-02
#
# DESCRIPTION:
#   Comprehensive document structure annotation module defining the DOC hypernym
#   and all document region types. Critical for understanding entity context
#   within structured documents.
#
# CRITICAL: Namespaces and document paths are ESSENTIAL for clustering entities
# and distinguishing context-specific entity relationships.
#
# DEPENDENCIES:
#   - core/namespaces.yaml (namespace prefixes)
#   - core/convention.yaml (base provenance model)
#
# CONTENTS:
#   - Purpose and design principles
#   - Layout Semantic Ontology (DOC hypernym with ~25 region types)
#   - Semantic role enumeration
#   - Nested provenance model (layout claims + entity claims)
#   - Structural contexts for clustering
#   - Format-specific path conventions (PAGE-XML, HTML, JSON, plain text)
#   - Clustering strategies
#   - Provenance requirements
#
# =============================================================================

module:
  id: document_structure
  name: "Document Structure and Namespace Paths"
  version: "1.7.0"
  status: stable
  category: advanced
  imports:
    - ../core/namespaces.yaml
    - ../core/convention.yaml

# =============================================================================
# PURPOSE AND DESIGN PRINCIPLES
# =============================================================================

purpose: |
  Entity annotations occur within STRUCTURED DOCUMENTS - not flat text streams.
  The LOCATION of an entity within document structure determines:

  1. SEMANTIC SCOPE: An entity in a header governs entities in subsequent paragraphs
  2. CO-REFERENCE CLUSTERS: Entities in the same structural unit likely co-refer
  3. RELATIONSHIP CONTEXT: Header-paragraph relations differ from paragraph-paragraph
  4. PROVENANCE PRECISION: XPath/JSONPath enables exact location for verification

  Without namespace paths, entity extraction loses critical context that
  distinguishes "the director" in a museum description from "the director"
  in a film credits section of the same document.

design_principles:
  - principle: "STRUCTURE IS MEANING"
    description: |
      Document structure (headers, sections, paragraphs, lists) carries semantic
      information. A person mentioned in a "Board of Directors" section has a
      different relationship to the organization than a person mentioned in
      a "Historical Overview" section.

  - principle: "PATHS ENABLE CLUSTERING"
    description: |
      Entities sharing a common path prefix belong to the same structural context.
      Clustering by path enables:
      - Scoped entity resolution (disambiguate within section before document)
      - Contextual relationship inference (section membership implies relationship)
      - Provenance aggregation (all claims from same region share reliability)

  - principle: "NAMESPACES PREVENT COLLISION"
    description: |
      The same entity mention in different structural contexts may require
      different annotations or link to different knowledge base entities.
      Namespaces ensure annotations are addressable without ambiguity.

  - principle: "LAYOUT INFORMS SEMANTICS"
    description: |
      Visual layout (sidebars, captions, footnotes, marginalia) carries meaning
      distinct from main body text. PAGE-XML text regions, HTML semantic elements,
      and JSON structural keys all encode layout that affects interpretation.

# =============================================================================
# LAYOUT SEMANTIC ONTOLOGY
# =============================================================================
# Comprehensive hypernym classes and properties for document structure annotation.
# These classes are format-agnostic and apply to PAGE-XML, HTML, JSON, MD, EPUB, PDF.
#
# PRIMARY AUTHORITIES:
#   - W3C Web Annotation Data Model (oa:) - annotation targeting and selectors
#   - CIDOC-CRM (crm:) - information objects and carriers
#   - RiC-O (rico:) - record parts and structure
#   - PREMIS (premis:) - digital object hierarchy
#   - Dublin Core (dcterms:) - part/whole relationships
#   - NIF 2.0 (nif:) - text annotation interchange
# =============================================================================

layout_semantic_ontology:
  purpose: |
    This ontology defines semantic classes for document layout elements that
    generative AI models can use to annotate/describe structural context.

    CRITICAL: Layout claims are DISTINCT from entity claims. A complete annotation
    requires TWO NESTED LAYERS of provenance:

    1. LAYOUT CLAIM: "This text region is a sidebar"
       - Has its own provenance (XPath, timestamp, confidence, agent)
       - May be uncertain (is this a sidebar or marginalia?)

    2. ENTITY CLAIM: "This sidebar contains person name 'Rembrandt'"
       - Has its own provenance (span offsets, NER model, confidence)
       - REFERENCES the layout claim as context

    This separation enables:
    - Independent validation of layout vs. entity extraction
    - Different confidence levels for structure vs. content
    - Reasoning about how layout affects entity interpretation

  # ---------------------------------------------------------------------------
  # HYPERNYM: DOCUMENT_REGION (DOC)
  # ---------------------------------------------------------------------------
  # The top-level class for any identifiable region within a document.
  # Analogous to CIDOC-CRM E73 Information Object but focused on structure.
  # ---------------------------------------------------------------------------

  DOCUMENT_REGION:
    code: "DOC"
    definition: |
      A discrete, identifiable region within a document that contains content
      and has structural relationships to other regions. Document regions are
      the fundamental unit of layout annotation.

      Every document region has:
      - BOUNDARIES: Start/end positions (character offsets, coordinates, paths)
      - CONTAINMENT: Parent regions that contain it, child regions it contains
      - SEQUENCE: Position relative to sibling regions
      - SEMANTIC ROLE: Function within the document (header, body, supplement)

    ontology_mappings:
      primary_class: "crm:E73_Information_Object"
      primary_class_definition: |
        CIDOC-CRM E73 Information Object: "This class comprises identifiable
        immaterial items, such as poems, jokes, data sets, images, texts,
        multimedia objects, procedural prescriptions, computer program code,
        algorithm or mathematical formulae, that have an objectively
        recognizable structure."

      alternative_classes:
        - class: "rico:RecordPart"
          note: "RiC-O class for component parts of archival records"
        - class: "premis:Bitstream"
          note: "PREMIS class for meaningful data segment within file"
        - class: "oa:TextualBody"
          note: "Web Annotation class for textual content"

      linkml_mapping:
        class_uri: "glam:DocumentRegion"
        exact_mappings:
          - "crm:E73_Information_Object"
        close_mappings:
          - "rico:RecordPart"
          - "premis:Bitstream"
        related_mappings:
          - "oa:TextualBody"
          - "nif:Context"

    properties:
      - property: "hasParentRegion"
        uri: "glam:hasParentRegion"
        range: "DocumentRegion"
        owl_mapping: "dcterms:isPartOf"
        description: "The containing region (section contains paragraph)"

      - property: "hasChildRegion"
        uri: "glam:hasChildRegion"
        range: "DocumentRegion"
        owl_mapping: "dcterms:hasPart"
        description: "Contained regions (paragraph contains sentences)"

      - property: "hasNextSibling"
        uri: "glam:hasNextSibling"
        range: "DocumentRegion"
        owl_mapping: "rico:isOrWasAdjacentTo"
        description: "Next region in document order"

      - property: "hasPreviousSibling"
        uri: "glam:hasPreviousSibling"
        range: "DocumentRegion"
        description: "Previous region in document order"

      - property: "hasSemanticRole"
        uri: "glam:hasSemanticRole"
        range: "LayoutSemanticRole"
        description: "The semantic function of this region"

      - property: "regionPath"
        uri: "glam:regionPath"
        range: "xsd:string"
        owl_mapping: "oa:hasSelector"
        description: "XPath, JSONPath, or other path expression"

      - property: "regionStart"
        uri: "glam:regionStart"
        range: "xsd:integer"
        owl_mapping: "nif:beginIndex"
        description: "Character offset of region start"

      - property: "regionEnd"
        uri: "glam:regionEnd"
        range: "xsd:integer"
        owl_mapping: "nif:endIndex"
        description: "Character offset of region end"

    # =========================================================================
    # SUBCATEGORIES - PRIMARY CONTENT REGIONS
    # =========================================================================

    subcategories:

      # -----------------------------------------------------------------------
      # PRIMARY CONTENT REGIONS
      # -----------------------------------------------------------------------

      DOC.HDR:
        name: "HEADER"
        definition: |
          Heading or title region that introduces and governs subsequent content.
          Headers establish scope for entity interpretation and relationship
          inference.

          Includes: h1-h6, chapter titles, section headings, running headers,
          PAGE-XML TextRegion[@type='header'], JSON object keys.
        ontology_mappings:
          primary_class: "schema:headline"
          alternative_classes:
            - "bf:Title"
            - "dcterms:title"
          nif_mapping: "nif:Title"

        header_levels:
          - level: 1
            scope: "document"
            html: "h1"
            pagexml: "TextRegion[@type='heading'][@level='1']"
            semantic: "Document title or primary topic"
          - level: 2
            scope: "chapter"
            html: "h2"
            pagexml: "TextRegion[@type='heading'][@level='2']"
            semantic: "Major section or chapter heading"
          - level: 3
            scope: "section"
            html: "h3"
            semantic: "Subsection heading"
          - level: 4
            scope: "subsection"
            html: "h4"
            semantic: "Sub-subsection heading"
          - level: 5
            scope: "paragraph_group"
            html: "h5"
            semantic: "Minor heading or label"
          - level: 6
            scope: "inline"
            html: "h6"
            semantic: "Inline heading or run-in head"

        governing_properties:
          - property: "governs"
            uri: "glam:governs"
            range: "DocumentRegion"
            description: "Regions semantically governed by this header"
          - property: "governsUntil"
            uri: "glam:governsUntil"
            range: "DocumentRegion"
            description: "Region where governance ends (next same-level header)"

      DOC.PAR:
        name: "PARAGRAPH"
        definition: |
          Block of continuous prose text forming a logical unit of discourse.
          The primary content-bearing unit in most documents.

          Includes: HTML <p>, PAGE-XML TextRegion[@type='paragraph'],
          text separated by blank lines, JSON string values.

        ontology_mappings:
          primary_class: "crm:E33_Linguistic_Object"
          alternative_classes:
            - "schema:Text"
            - "nif:Paragraph"

        paragraph_properties:
          - property: "paragraphIndex"
            uri: "glam:paragraphIndex"
            range: "xsd:integer"
            description: "Zero-based index within containing section"
          - property: "sentenceCount"
            uri: "glam:sentenceCount"
            range: "xsd:integer"
            description: "Number of sentences in paragraph"

      DOC.SEN:
        name: "SENTENCE"
        definition: |
          A grammatical sentence within a paragraph. The minimal unit for
          syntactic analysis and relationship extraction.

        ontology_mappings:
          primary_class: "nif:Sentence"
          alternative_classes:
            - "crm:E33_Linguistic_Object"

        sentence_properties:
          - property: "sentenceIndex"
            uri: "glam:sentenceIndex"
            range: "xsd:integer"
            description: "Zero-based index within containing paragraph"

      DOC.LST:
        name: "LIST"
        definition: |
          Ordered or unordered enumeration of items sharing parallel structure.
          Includes: HTML