"
appendix_types:
- type: "data_appendix"
description: "Statistical or tabular data"
- type: "document_appendix"
description: "Reproduced primary documents"
- type: "technical_appendix"
description: "Methods, algorithms, specifications"
- type: "glossary_appendix"
description: "Extended terminology (see also DOC.GLO)"
appendix_properties:
- property: "appendixLabel"
uri: "schema:name"
description: "Appendix letter/number (A, B, C or 1, 2, 3)"
- property: "appendixTitle"
uri: "dcterms:title"
description: "Descriptive title"
DOC.GLO:
name: "GLOSSARY"
code: "DOC.GLO"
definition: |
Alphabetical list of terms with definitions. Each entry is a
CONCEPT with term (label) and definition (description).
Domain-agnostic applications:
- Heritage: Manuscript terminology, archaic word lists
- Publishing: Technical glossaries, foreign word lists
- Academic: Disciplinary terminology
- Legal: Legal terms, statutory definitions
- Archives: Archival terminology, provenance terms
Glossaries are CONTROLLED VOCABULARY sources.
ontology_mappings:
primary_class: "skos:ConceptScheme"
alternative_classes:
- "schema:DefinedTermSet"
- "bibo:DocumentPart"
linkml_mapping:
class_uri: "skos:ConceptScheme"
format_mappings:
html:
elements: [".glossary", "dl.glossary", "section[role='doc-glossary']"]
tei:
element: "list[@type='gloss']"
glossary_entry_components:
- component: "GLO.TRM"
name: "Term"
ontology: "skos:prefLabel"
- component: "GLO.DEF"
name: "Definition"
ontology: "skos:definition"
- component: "GLO.SYN"
name: "Synonym"
ontology: "skos:altLabel"
- component: "GLO.REL"
name: "Related Term"
ontology: "skos:related"
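# Illustrative sketch only: one glossary entry expressed with the four
# components above. The key name `example_entry` and the sample values are
# assumptions for illustration, not part of the schema.
example_entry:
  GLO.TRM: "quire"                    # skos:prefLabel
  GLO.DEF: "Set of folded leaves forming one unit of a bound book"  # skos:definition
  GLO.SYN: ["gathering"]              # skos:altLabel
  GLO.REL: ["signature mark"]         # skos:related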
# -----------------------------------------------------------------------
# COMMERCIAL AND BRANDING REGIONS
# -----------------------------------------------------------------------
DOC.ADV:
name: "ADVERTISEMENT"
code: "DOC.ADV"
definition: |
Commercial or promotional content within a document. Important for
historical documents where ads provide dating, pricing, business,
and social context. Distinct from main editorial content.
Domain-agnostic applications:
- Heritage: Historical newspaper ads, trade catalog entries, broadside ads
- Publishing: Book advertisements, periodical ads, classified sections
- Archives: Commercial records, promotional materials
- Web: Banner ads, sponsored content, promotional sections
- Ephemera: Trade cards, handbills, promotional flyers
Advertisements are RICH entity sources for historical research:
business names, addresses, prices, products, and social attitudes.
ontology_mappings:
primary_class: "schema:Advertisement"
alternative_classes:
- "crm:E73_Information_Object"
- "bibo:DocumentPart"
linkml_mapping:
class_uri: "schema:Advertisement"
format_mappings:
html:
elements: [".ad", ".advertisement", "[role='complementary'][aria-label*='sponsor']", "aside.ad"]
pagexml:
type: "TextRegion[@type='advertisement']"
newspaper:
note: "Common in historical newspaper digitization"
advertisement_types:
- type: "display_ad"
description: "Large format advertisement with graphics"
- type: "classified_ad"
description: "Text-only small advertisement"
- type: "trade_listing"
description: "Business directory entry"
- type: "prospectus"
description: "Book/publication advertisement"
- type: "patent_medicine"
description: "Historical medical product ads"
note: "Common in 19th century periodicals"
- type: "auction_notice"
description: "Sale or auction announcement"
- type: "legal_notice"
description: "Required public announcements"
advertisement_properties:
- property: "advertiser"
uri: "schema:sponsor"
description: "Business or person placing ad"
entity_type: "GRP or AGT"
- property: "product"
uri: "schema:itemAdvertised"
description: "Product or service advertised"
- property: "businessAddress"
uri: "schema:address"
description: "Advertiser's address"
entity_type: "TOP"
- property: "price"
uri: "schema:price"
description: "Advertised price"
entity_type: "QTY"
entity_extraction_notes: |
Historical advertisements are TREASURE TROVES:
1. BUSINESS ENTITIES:
- Business names -> GROUP entities
- Proprietor names -> AGENT entities
- Business addresses -> TOPONYM entities
2. HISTORICAL VALUE:
- Dating evidence (product availability)
- Pricing history
- Business locations over time
- Social/cultural attitudes
3. PROVENANCE:
- Distinguish ad claims from editorial claims
- Ads carry a different authority level than editorial content
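# Illustrative sketch only: how a short classified ad might be annotated with
# the advertisement properties above. The sample text is invented, and the keys
# `example_annotation`, `text`, `entities`, and `claim_authority` are
# assumptions, not part of the schema.
example_annotation:
  doc_region: "DOC.ADV"
  advertisement_type: "classified_ad"
  text: "Fine teas at 2s. per lb. J. Smith, 12 High Street."   # invented sample
  entities:
    - text: "J. Smith"
      property: "advertiser"          # schema:sponsor
      entity_type: "AGT"
    - text: "Fine teas"
      property: "product"             # schema:itemAdvertised
    - text: "12 High Street"
      property: "businessAddress"     # schema:address
      entity_type: "TOP"
    - text: "2s. per lb."
      property: "price"               # schema:price
      entity_type: "QTY"
  claim_authority: "advertiser_claim" # hypothetical flag separating ad claims from editorial claims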
DOC.LOG:
name: "LOGO"
code: "DOC.LOG"
definition: |
Visual identity marks: logos, mastheads, colophon marks, printer's
devices, watermarks, seals, and brand identifiers. Important for
attribution and provenance.
Domain-agnostic applications:
- Heritage: Printer's marks, publisher devices, watermarks, seals
- Publishing: Publisher logos, journal mastheads, imprint marks
- Archives: Institutional seals, letterhead logos
- Web: Site logos, brand marks, favicons
- Legal: Notary seals, official stamps, certification marks
Logos identify PRODUCING AGENTS and provide provenance evidence.
ontology_mappings:
primary_class: "schema:ImageObject"
alternative_classes:
- "crm:E37_Mark"
- "crm:E73_Information_Object"
linkml_mapping:
class_uri: "schema:ImageObject"
close_mappings:
- "crm:E37_Mark"
format_mappings:
html:
elements: [".logo", "header img.logo", "[role='banner'] img", ".masthead img"]
pagexml:
type: "GraphicRegion[@type='logo'], GraphicRegion[@type='decoration']"
logo_types:
- type: "publisher_logo"
description: "Publisher's identifying mark"
- type: "printer_device"
description: "Historical printer's identifying mark"
note: "Critical for incunabula identification"
- type: "masthead"
description: "Newspaper/periodical title banner"
- type: "watermark"
description: "Paper manufacturer's mark"
note: "Used for paper dating and provenance"
- type: "seal"
description: "Official or personal seal impression"
entity_type: "AGT or GRP"
- type: "coat_of_arms"
description: "Heraldic device"
entity_type: "AGT or GRP"
- type: "colophon_mark"
description: "Decorative mark in colophon"
- type: "ex_libris"
description: "Bookplate or ownership mark"
note: "Provenance evidence for ownership history"
logo_properties:
- property: "logoOwner"
uri: "schema:creator"
description: "Entity identified by logo"
entity_type: "GRP or AGT"
- property: "logoDescription"
uri: "schema:description"
description: "Visual description of mark"
- property: "logoReference"
uri: "schema:isBasedOn"
description: "Reference to mark catalog/database"
entity_extraction_notes: |
Logos provide PROVENANCE evidence:
1. ATTRIBUTION:
- Printer's devices identify producer
- Publisher logos identify publisher
- Watermarks date paper production
2. OWNERSHIP HISTORY:
- Ex libris marks trace ownership
- Seals indicate institutional provenance
3. VISUAL ANALYSIS required:
- May need image matching to logo databases
- Heraldic interpretation for coats of arms
DOC.PGN:
name: "PAGINATION"
definition: |
Page numbers, folio numbers, and signature marks in printed or manuscript works.
ontology_mappings:
primary_class: "crm:E42_Identifier"
pagination_types:
- type: "page_number"
pagexml: "TextRegion[@type='page-number']"
description: "Arabic or Roman numeral page number"
- type: "folio"
description: "Leaf number with recto/verso (e.g., 23r, 23v)"
- type: "signature_mark"
pagexml: "TextRegion[@type='signature-mark']"
description: "Gathering/quire identifier in manuscripts"
- type: "catch_word"
pagexml: "TextRegion[@type='catch-word']"
description: "Word at page bottom matching next page start"
DOC.BLK:
name: "BLOCK_QUOTE"
definition: |
Extended quotation from another source, typically indented or styled
distinctly from surrounding text.
ontology_mappings:
primary_class: "schema:Quotation"
alternative_classes:
- "crm:E33_Linguistic_Object"
quote_properties:
- property: "quotedFrom"
uri: "glam:quotedFrom"
range: "xsd:anyURI"
owl_mapping: "prov:wasDerivedFrom"
description: "Source of the quotation"
# -----------------------------------------------------------------------
# METADATA AND ADMINISTRATIVE REGIONS
# -----------------------------------------------------------------------
DOC.MTD:
name: "METADATA_BLOCK"
definition: |
Region containing document metadata (author, date, keywords, etc.).
High-value for entity extraction as claims are typically structured.
Includes: HTML <head>, document properties, front matter, colophon.
ontology_mappings:
primary_class: "dcterms:BibliographicResource"
alternative_classes:
- "schema:CreativeWork"
metadata_block_types:
- type: "front_matter"
description: "Title page, copyright, dedication"
- type: "back_matter"
description: "Appendices, bibliography, colophon"
- type: "colophon"
description: "Production details (printer, date, place)"
pagexml: "TextRegion[@type='colophon']"
- type: "document_head"
html: "head"
description: "HTML metadata section"
DOC.ANN:
name: "ANNOTATION_REGION"
definition: |
Region containing annotations or markup added to document.
Distinguished from original content for provenance tracking.
Includes: Editorial additions, transcription notes, TEI annotations.
ontology_mappings:
primary_class: "oa:Annotation"
alternative_classes:
- "crm:E13_Attribute_Assignment"
annotation_properties:
- property: "annotationBody"
uri: "oa:hasBody"
description: "Content of the annotation"
- property: "annotationTarget"
uri: "oa:hasTarget"
description: "Region being annotated"
- property: "annotator"
uri: "oa:annotatedBy"
owl_mapping: "prov:wasAttributedTo"
description: "Agent who created annotation"
# =============================================================================
# SEMANTIC ROLE ENUMERATION
# =============================================================================
layout_semantic_roles:
description: |
Enumeration of semantic roles that document regions can play.
A single region may have multiple roles.
roles:
- role: "PRIMARY_CONTENT"
code: "PRIM"
description: "Main content bearing primary information"
typical_regions: ["DOC.PAR", "DOC.HDR", "DOC.LST"]
- role: "SUPPLEMENTARY"
code: "SUPP"
description: "Additional context or metadata"
typical_regions: ["DOC.SDB", "DOC.FTN", "DOC.CAP", "DOC.APP"]
- role: "NAVIGATIONAL"
code: "NAV"
description: "Aids document navigation"
typical_regions: ["DOC.NAV", "DOC.PGN", "DOC.TOC", "DOC.IDX"]
- role: "STRUCTURAL"
code: "STRC"
description: "Defines document structure"
typical_regions: ["DOC.HDR", "DOC.TTP"]
- role: "REFERENTIAL"
code: "REF"
description: "Points to other resources"
typical_regions: ["DOC.FTN", "DOC.BLK", "DOC.BIB"]
- role: "VISUAL"
code: "VIS"
description: "Non-textual visual content (images, maps, diagrams)"
typical_regions: ["DOC.FIG", "DOC.GAL", "DOC.MAP", "DOC.LOG"]
- role: "AUDIOVISUAL"
code: "AV"
description: "Time-based media content (audio, video)"
typical_regions: ["DOC.AUD", "DOC.VID"]
- role: "INTERACTIVE"
code: "INT"
description: "User-manipulable embedded content"
typical_regions: ["DOC.EMB", "DOC.MAP"]
note: "Interactive maps have both VIS and INT roles"
- role: "METADATA"
code: "META"
description: "Document-level metadata"
typical_regions: ["DOC.MTD", "DOC.COL", "DOC.TTP"]
- role: "SPATIAL"
code: "SPAT"
description: "Geographic or spatial representation"
typical_regions: ["DOC.MAP"]
note: "Distinct from VISUAL; emphasizes coordinate/location semantics"
- role: "FRONT_MATTER"
code: "FRNT"
description: "Preliminary material before main content"
typical_regions: ["DOC.TTP", "DOC.DED", "DOC.TOC"]
- role: "BACK_MATTER"
code: "BACK"
description: "Material following main content"
typical_regions: ["DOC.BIB", "DOC.IDX", "DOC.APP", "DOC.GLO", "DOC.COL"]
- role: "PARATEXTUAL"
code: "PARA"
description: "Content about the document itself (not subject matter)"
typical_regions: ["DOC.TTP", "DOC.COL", "DOC.DED", "DOC.ADV"]
note: "Genette's paratext concept - frames the main text"
- role: "COMMERCIAL"
code: "COMM"
description: "Commercial or promotional content"
typical_regions: ["DOC.ADV", "DOC.LOG"]
note: "Distinguish from editorial content for authority assessment"
- role: "LEXICAL"
code: "LEX"
description: "Vocabulary, terminology, and definitions"
typical_regions: ["DOC.GLO", "DOC.IDX"]
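# Illustrative sketch only: a region carrying several roles at once, as the
# description above allows. DOC.MAP is listed under VIS, INT, and SPAT in the
# role table; the key name `example_assignment` is an assumption.
example_assignment:
  doc_region: "DOC.MAP"
  roles: ["VIS", "INT", "SPAT"]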
# =============================================================================
# STRUCTURAL CONTEXTS FOR CLUSTERING
# =============================================================================
structural_contexts:
description: |
Document regions provide semantic context for entity clustering and relationship
inference. Entities appearing in the same structural context are more likely
to be related or co-referential than entities in distant structures.
This section defines annotation rules for major structural context types.
# ---------------------------------------------------------------------------
# List Contexts
# ---------------------------------------------------------------------------
list_contexts:
description: |
Lists (ul, ol, dl) have special semantics. Items in a list:
- Share a common parent context (the list introduction)
- Often have parallel structure (same entity types)
- May be implicitly related (all members of same group)
annotation_rules:
- rule: "LIST_ITEM_PARALLELISM"
description: |
Entities in sibling list items likely share:
- Same entity TYPE (all museums, all dates, all people)
- Same RELATIONSHIP to list parent (all "member of", all "located in")
- Parallel STRUCTURE (if first item has date, others likely do too)
- rule: "LIST_HEADER_INHERITANCE"
description: |
List items inherit context from the text introducing the list.
"The following museums are members:" + <li>Rijksmuseum</li>
implies that Rijksmuseum has a membership relationship.
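# Illustrative sketch only: the LIST_HEADER_INHERITANCE case from the
# description, written in the same `example` / `inherited_context` shape used
# by the hierarchical inheritance rules. The keys `list_intro`, `list_item`,
# and the relationship name `member_of` are assumptions.
example:
  list_intro: "The following museums are members:"
  list_item: "Rijksmuseum"
  list_item_doc_region: "DOC.LST"
  inherited_context:
    relationship: "member_of"
    source: "list_intro"              # the relationship comes from the introducing text, not the item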
# ---------------------------------------------------------------------------
# Sidebar Contexts
# ---------------------------------------------------------------------------
sidebar_contexts:
description: |
Sidebars, asides, marginalia, infoboxes contain SUPPLEMENTARY information.
Entities in sidebars:
- Provide metadata ABOUT the main content
- May contain structured data (birth dates, locations, identifiers)
- Have different reliability/authority than main narrative
annotation_rules:
- rule: "SIDEBAR_METADATA_EXTRACTION"
description: |
Sidebar content often contains STRUCTURED CLAIMS suitable for
direct property extraction:
- Infobox fields → claim properties
- Marginalia dates → temporal metadata
- Caption text → aboutness relationships
- rule: "SIDEBAR_MAIN_LINKING"
description: |
Link sidebar entities to main content entities when co-referential.
The sidebar "Born: 1606" links to the main text "Rembrandt".
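# Illustrative sketch only: one way to record the SIDEBAR_MAIN_LINKING case
# from the description. All key names below are assumptions, not part of the
# schema.
example:
  sidebar_text: "Born: 1606"
  sidebar_doc_region: "DOC.SDB"
  linked_entity: "Rembrandt"
  linked_entity_doc_region: "DOC.PAR"
  extracted_claim:
    property: "birthDate"             # hypothetical property name
    value: "1606"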
# ---------------------------------------------------------------------------
# Caption Contexts
# ---------------------------------------------------------------------------
caption_contexts:
description: |
Captions (figcaption, TextRegion[@type='caption']) describe VISUAL content.
Entities in captions:
- Describe depicted subjects (people, places, objects)
- Provide dates/locations for the depicted scene
- May differ from main text (image of one thing, text about another)
annotation_rules:
- rule: "CAPTION_VISUAL_BINDING"
description: |
Caption entities are ABOUT the associated figure/image.
This is distinct from main text co-occurrence.
Use relationship: schema:about with target: figure URI.
- rule: "CAPTION_PROVENANCE"
description: |
Captions may have different authorship/dates than main text.
Track caption-specific provenance when available.
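# Illustrative sketch only: a caption entity bound to its figure as described
# in CAPTION_VISUAL_BINDING. The caption text and the figure URI are invented,
# and the key names are assumptions.
example:
  caption_text: "Portrait of Saskia, 1633"    # invented sample
  entity: "Saskia"
  doc_region: "DOC.CAP"
  relationship: "schema:about"
  target: "https://example.org/documents/article_1#fig-2"   # hypothetical figure URI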
# =============================================================================
# FORMAT-SPECIFIC PATH CONVENTIONS
# =============================================================================
format_path_conventions:
description: |
Different document formats require different path syntaxes for locating
entities within document structure. This section defines conventions for
each supported format.
# ---------------------------------------------------------------------------
# PAGE-XML Paths
# ---------------------------------------------------------------------------
page_xml:
description: |
PAGE-XML (used for historical manuscript transcription) organizes content
into TextRegions with type attributes. Critical for manuscript NER.
namespace: "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
prefix: "page"
text_region_types:
- type: "header"
semantic_role: "section_heading"
doc_region: "DOC.HDR"
path_example: "//page:TextRegion[@type='header'][1]"
- type: "paragraph"
semantic_role: "body_content"
doc_region: "DOC.PAR"
path_example: "//page:TextRegion[@type='paragraph'][3]"
- type: "marginalia-left"
semantic_role: "supplementary_note"
doc_region: "DOC.SDB.MRG"
path_example: "//page:TextRegion[@type='marginalia-left'][1]"
- type: "marginalia-right"
semantic_role: "supplementary_note"
doc_region: "DOC.SDB.MRG"
path_example: "//page:TextRegion[@type='marginalia-right'][1]"
- type: "caption"
semantic_role: "figure_description"
doc_region: "DOC.CAP"
path_example: "//page:TextRegion[@type='caption'][1]"
- type: "page-number"
semantic_role: "pagination"
doc_region: "DOC.PGN"
path_example: "//page:TextRegion[@type='page-number'][1]"
- type: "signature-mark"
semantic_role: "gathering_identifier"
doc_region: "DOC.PGN"
path_example: "//page:TextRegion[@type='signature-mark'][1]"
- type: "catch-word"
semantic_role: "gathering_continuity"
doc_region: "DOC.PGN"
path_example: "//page:TextRegion[@type='catch-word'][1]"
- type: "table"
semantic_role: "structured_data"
doc_region: "DOC.TBL"
path_example: "//page:TextRegion[@type='table'][1]"
- type: "footnote"
semantic_role: "reference_note"
doc_region: "DOC.FTN"
path_example: "//page:TextRegion[@type='footnote'][1]"
- type: "table-of-contents"
semantic_role: "navigation"
doc_region: "DOC.TOC"
path_example: "//page:TextRegion[@type='table-of-contents'][1]"
- type: "index"
semantic_role: "navigation"
doc_region: "DOC.IDX"
path_example: "//page:TextRegion[@type='index'][1]"
- type: "title-page"
semantic_role: "front_matter"
doc_region: "DOC.TTP"
path_example: "//page:TextRegion[@type='title-page'][1]"
- type: "colophon"
semantic_role: "production_statement"
doc_region: "DOC.COL"
path_example: "//page:TextRegion[@type='colophon'][1]"
- type: "advertisement"
semantic_role: "commercial_content"
doc_region: "DOC.ADV"
path_example: "//page:TextRegion[@type='advertisement'][1]"
annotation_output:
description: "Include PAGE-XML path in entity provenance"
example:
entity: "Rembrandt"
span_start: 145
span_end: 154
page_xml_path: "//page:Page[@imageFilename='folio_23r.jpg']/page:TextRegion[@id='r1']/page:TextLine[@id='l1']/page:Word[@id='w3']"
text_region_type: "paragraph"
text_region_id: "r1"
doc_region: "DOC.PAR"
# ---------------------------------------------------------------------------
# HTML Paths
# ---------------------------------------------------------------------------
html:
description: |
HTML documents use semantic elements for structure. Critical for web NER.
semantic_elements:
headers:
elements: ["h1", "h2", "h3", "h4", "h5", "h6"]
semantic_role: "section_heading"
doc_region: "DOC.HDR"
hierarchy: "h1 > h2 > h3 > h4 > h5 > h6"
body_content:
elements: ["p", "div.content", "article", "section", "main"]
semantic_role: "primary_content"
doc_region: "DOC.PAR"
supplementary:
elements: ["aside", "nav", "footer"]
semantic_role: "supplementary_content"
doc_regions:
aside: "DOC.SDB"
nav: "DOC.NAV"
footer: "DOC.MTD"
figures:
elements: ["figure", "figcaption", "img"]
semantic_role: "visual_content"
doc_regions:
figure: "DOC.FIG"
figcaption: "DOC.CAP"
lists:
elements: ["ul", "ol", "li", "dl", "dt", "dd"]
semantic_role: "enumerated_content"
doc_region: "DOC.LST"
tables:
elements: ["table", "thead", "tbody", "tr", "th", "td"]
semantic_role: "structured_data"
doc_region: "DOC.TBL"
metadata:
elements: ["meta", "title", "head"]
semantic_role: "document_metadata"
doc_region: "DOC.MTD"
media:
audio:
elements: ["audio", "source[type^='audio']"]
doc_region: "DOC.AUD"
video:
elements: ["video", "source[type^='video']", "iframe[src*='youtube']", "iframe[src*='vimeo']"]
doc_region: "DOC.VID"
embed:
elements: ["iframe", "embed", "object"]
doc_region: "DOC.EMB"
xpath_conventions:
- pattern: "//article/section[2]/h2[1]"
description: "Second section's heading in article"
doc_region: "DOC.HDR"
- pattern: "//table[contains(@class,'infobox')]//th[contains(text(),'Born')]/following-sibling::td[1]"
description: "Birth date cell in Wikipedia-style infobox (value cell next to the 'Born' label)"
doc_region: "DOC.SDB.IBX"
- pattern: "//aside//a[@href]"
description: "Links in sidebar content"
doc_region: "DOC.SDB"
- pattern: "//figure/figcaption"
description: "Caption for figure"
doc_region: "DOC.CAP"
- pattern: "//nav[@aria-label='breadcrumb']//li"
description: "Breadcrumb navigation items"
doc_region: "DOC.NAV"
annotation_output:
example:
entity: "Rijksmuseum"
span_start: 2341
span_end: 2352
xpath: "/html/body/main/article/section[3]/p[2]"
css_selector: "article > section:nth-of-type(3) > p:nth-of-type(2)"
semantic_context: "body_content"
doc_region: "DOC.PAR"
parent_header: "Dutch Museums"
parent_header_path: "/html/body/main/article/section[3]/h2"
# ---------------------------------------------------------------------------
# JSON Paths
# ---------------------------------------------------------------------------
json:
description: |
JSON documents use key paths for structure. Critical for API response NER.
path_notation: "JSONPath (RFC 9535) or dot notation"
common_patterns:
- pattern: "$.results[*].name"
description: "Name field in array of results"
likely_entity: "any named entity"
- pattern: "$.data.institution.address.city"
description: "Nested city within institution data"
likely_entity: "TOP"
- pattern: "$['@context']"
description: "JSON-LD context (for namespace resolution)"
semantic_role: "metadata"
- pattern: "$.metadata.creator"
description: "Document creator in metadata block"
likely_entity: "AGT"
- pattern: "$.features[*].geometry"
description: "GeoJSON geometry objects"
likely_entity: "GEO"
- pattern: "$.features[*].properties.name"
description: "GeoJSON feature names"
likely_entity: "TOP"
semantic_inference:
description: |
JSON keys often encode semantic roles. Use key names to guide entity typing:
key_patterns:
- keys: ["name", "title", "label", "displayName"]
likely_content: "entity names"
- keys: ["date", "created", "founded", "birthDate", "deathDate", "startDate", "endDate"]
likely_entity: "TMP"
- keys: ["location", "address", "place", "city", "country", "coordinates"]
likely_entity: "TOP or GEO"
- keys: ["author", "creator", "owner", "contributor", "artist"]
likely_entity: "AGT"
- keys: ["type", "category", "class", "@type"]
semantic_role: "classification"
- keys: ["id", "identifier", "@id", "uri", "url"]
semantic_role: "identifier"
annotation_output:
example:
entity: "Amsterdam"
json_path: "$.data.museums[0].location.city"
array_index: 0
parent_key: "location"
root_key: "museums"
inferred_type: "TOP"
# ---------------------------------------------------------------------------
# TEI-XML Paths
# ---------------------------------------------------------------------------
tei_xml:
description: |
TEI (Text Encoding Initiative) is the standard for scholarly text encoding.
Critical for digital humanities NER.
namespace: "http://www.tei-c.org/ns/1.0"
prefix: "tei"
structural_elements:
- element: "tei:front"
semantic_role: "front_matter"
doc_regions: ["DOC.TTP", "DOC.DED", "DOC.TOC"]
- element: "tei:body"
semantic_role: "main_content"
doc_region: "DOC.PAR"
- element: "tei:back"
semantic_role: "back_matter"
doc_regions: ["DOC.BIB", "DOC.IDX", "DOC.APP", "DOC.GLO"]
- element: "tei:div[@type='chapter']"
semantic_role: "major_section"
- element: "tei:p"
semantic_role: "paragraph"
doc_region: "DOC.PAR"
- element: "tei:head"
semantic_role: "heading"
doc_region: "DOC.HDR"
- element: "tei:note[@place='margin']"
semantic_role: "marginalia"
doc_region: "DOC.SDB.MRG"
- element: "tei:note[@type='footnote']"
semantic_role: "footnote"
doc_region: "DOC.FTN"
- element: "tei:figure"
semantic_role: "figure"
doc_region: "DOC.FIG"
- element: "tei:figDesc"
semantic_role: "figure_description"
doc_region: "DOC.CAP"
- element: "tei:table"
semantic_role: "table"
doc_region: "DOC.TBL"
- element: "tei:listBibl"
semantic_role: "bibliography"
doc_region: "DOC.BIB"
- element: "tei:list[@type='gloss']"
semantic_role: "glossary"
doc_region: "DOC.GLO"
named_entity_elements:
description: |
TEI provides dedicated elements for named entities. These should be
cross-referenced with GLAM-NER annotations.
elements:
- tei_element: "tei:persName"
glam_ner_type: "AGT.PER"
- tei_element: "tei:orgName"
glam_ner_type: "GRP.ORG"
- tei_element: "tei:placeName"
glam_ner_type: "TOP"
- tei_element: "tei:geogName"
glam_ner_type: "TOP.NAT"
- tei_element: "tei:date"
glam_ner_type: "TMP"
- tei_element: "tei:title"
glam_ner_type: "WRK"
annotation_output:
example:
entity: "Rembrandt van Rijn"
tei_path: "//tei:body/tei:div[@type='chapter'][2]/tei:p[3]/tei:persName[1]"
tei_element: "persName"
doc_region: "DOC.PAR"
existing_tei_annotation: true
# ---------------------------------------------------------------------------
# Plain Text Paths
# ---------------------------------------------------------------------------
plain_text:
description: |
Plain text lacks structural markup. Use line/paragraph detection heuristics.
structure_detection:
- heuristic: "BLANK_LINE_PARAGRAPH"
description: "Two consecutive newlines indicate paragraph break"
regex: "\\n\\n+"
- heuristic: "INDENTATION_STRUCTURE"
description: "Consistent indentation may indicate hierarchy"
regex: "^(\\s+)"
- heuristic: "CAPITALIZATION_HEADERS"
description: "ALL CAPS or Title Case lines may be headers"
patterns:
- all_caps: "^[A-Z][A-Z\\s]+$"
- title_case: "^[A-Z][a-z]+(\\s[A-Z][a-z]+)*$"
- heuristic: "ENUMERATION_LISTS"
description: "Lines starting with numbers/bullets are list items"
patterns:
- numbered: "^\\d+[.)]\\s"
- bulleted: "^[•\\-\\*]\\s"
- lettered: "^[a-z][.)]\\s"
offset_notation:
description: |
For plain text, use character offsets with optional line/paragraph indices.
format: "char:{start}-{end};line:{line};para:{paragraph}"
examples:
- "char:1456-1470;line:23;para:5"
- "char:0-15;line:1;para:1"
annotation_output:
description: "Use character offsets with paragraph/line indices"
example:
entity: "Dr. Jan de Wit"
char_start: 1456
char_end: 1470
line_number: 23
paragraph_index: 5
inferred_section: "unknown"
doc_region: "DOC.PAR"
# =============================================================================
# ENTITY CLUSTERING BY STRUCTURAL PATH
# =============================================================================
clustering_strategies:
description: |
Entities sharing structural context are more likely to be related or
co-referential. This section defines clustering algorithms that leverage
document structure for improved NER and relationship extraction.
# ---------------------------------------------------------------------------
# Path Prefix Clustering
# ---------------------------------------------------------------------------
path_prefix_clustering:
description: |
Entities sharing a common path prefix belong to the same structural unit
and should be clustered for co-reference resolution and relationship extraction.
algorithm:
- step: 1
action: "Extract path for each entity annotation"
output: "entity_path_map"
- step: 2
action: "Compute longest common prefix (LCP) for entity pairs"
output: "pairwise_lcp"
- step: 3
action: "Cluster entities where LCP depth >= threshold"
parameters:
threshold: 2
note: "Depth 2 = same section; adjust for document type"
- step: 4
action: "Within clusters, resolve co-references before cross-cluster"
priority: "intra-cluster > inter-cluster"
example:
entities:
- name: "Rembrandt"
path: "/article/section[2]/p[1]"
type: "AGT.PER"
- name: "Saskia"
path: "/article/section[2]/p[1]"
type: "AGT.PER"
- name: "the painter"
path: "/article/section[2]/p[2]"
type: "AGT.PER"
- name: "Vermeer"
path: "/article/section[3]/p[1]"
type: "AGT.PER"
clusters:
- cluster_id: 1
path_prefix: "/article/section[2]"
entities: ["Rembrandt", "Saskia", "the painter"]
coref_candidates:
- mention: "the painter"
antecedent: "Rembrandt"
confidence: 0.85
rationale: "Same section, definite description"
- cluster_id: 2
path_prefix: "/article/section[3]"
entities: ["Vermeer"]
note: "Separate section, no co-reference with cluster 1"
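# Worked trace of steps 2 and 3 for the entities above (illustrative sketch).
# Depth is counted in path segments, so "/article/section[2]" has depth 2.
# The key name `pairwise_lcp` comes from step 2; the entry layout is an assumption.
pairwise_lcp:
  - pair: ["Rembrandt", "Saskia"]
    lcp: "/article/section[2]/p[1]"
    depth: 3
    clustered: true                   # 3 >= threshold 2
  - pair: ["Rembrandt", "the painter"]
    lcp: "/article/section[2]"
    depth: 2
    clustered: true                   # 2 >= threshold 2
  - pair: ["Rembrandt", "Vermeer"]
    lcp: "/article"
    depth: 1
    clustered: false                  # 1 < threshold 2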
# ---------------------------------------------------------------------------
# Hierarchical Context Inheritance
# ---------------------------------------------------------------------------
hierarchical_inheritance:
description: |
Properties from ancestor nodes propagate to descendant entities.
Header entities establish context inherited by paragraph entities.
inheritance_rules:
- rule: "TEMPORAL_SCOPE"
description: "Date range in header bounds dates in body"
example:
header: "The 17th Century (1600-1699)"
header_path: "/article/section[1]/h2"
body_date: "1642"
body_path: "/article/section[1]/p[3]"
inherited_context:
temporal_scope: "1600-1699"
validation: "1642 falls within 1600-1699"
- rule: "SPATIAL_SCOPE"
description: "Location in header provides default for body entities"
example:
header: "Museums in Amsterdam"
header_path: "/article/section[2]/h2"
body_org: "Rijksmuseum"
body_path: "/article/section[2]/p[1]"
inherited_context:
default_location: "Amsterdam"
relationship: "located_in"
- rule: "TOPIC_SCOPE"
description: "Subject in header provides aboutness for body"
example:
header: "Rembrandt van Rijn"
header_path: "/article/section[3]/h2"
body_text: "He painted The Night Watch"
body_path: "/article/section[3]/p[1]"
inherited_context:
pronoun_antecedent: "Rembrandt van Rijn"
aboutness: "Rembrandt van Rijn"
- rule: "ENTITY_TYPE_SCOPE"
description: "Entity type in header suggests types for body"
example:
header: "Notable Architects"
header_path: "/article/section[4]/h2"
body_names: ["Hendrik Petrus Berlage", "J.J.P. Oud"]
body_path: "/article/section[4]/ul/li"
inherited_context:
likely_type: "AGT.PER"
likely_role: "architect"
# ---------------------------------------------------------------------------
# Cross-Reference Resolution
# ---------------------------------------------------------------------------
cross_reference_resolution:
description: |
Resolve pronouns and definite references using structural proximity.
Closer structural context = higher resolution priority.
resolution_priority:
- priority: 1
scope: "same_sentence"
description: "Check for antecedent in same sentence"
strategy: "Recency + syntactic constraints"
- priority: 2
scope: "same_paragraph"
description: "Check preceding sentences in paragraph"
strategy: "Recency + entity salience"
- priority: 3
scope: "same_section"
description: "Check preceding paragraphs in section"
strategy: "Topic coherence + entity prominence"
- priority: 4
scope: "section_header"
description: "Check governing section header"
strategy: "Header entities are topical anchors"
- priority: 5
scope: "document"
description: "Check document-level prominent entities"
strategy: "Title entities, frequently mentioned entities"
example:
text: |
## Rembrandt van Rijn
The Dutch painter was born in Leiden. He moved to Amsterdam in 1631.
His most famous work is The Night Watch.
resolutions:
- mention: "The Dutch painter"
type: "definite_description"
antecedent: "Rembrandt van Rijn"
resolution_scope: "section_header"
confidence: 0.95
- mention: "He"
type: "pronoun"
antecedent: "'The Dutch painter' -> 'Rembrandt van Rijn'"
resolution_scope: "same_paragraph"
confidence: 0.98
- mention: "His"
type: "possessive_pronoun"
antecedent: "'He' -> 'Rembrandt van Rijn'"
resolution_scope: "same_paragraph"
confidence: 0.98
- mention: "The Night Watch"
type: "named_entity"
antecedent: null
relationship_to: "Rembrandt van Rijn"
relationship_type: "crm:P14i_performed"
note: "Work entity, not a co-reference"
# ---------------------------------------------------------------------------
# Multi-Document Clustering
# ---------------------------------------------------------------------------
multi_document_clustering:
description: |
When annotating corpora with multiple documents, entities may appear
across documents. Cross-document entity linking uses:
- Shared identifiers (Wikidata, VIAF, etc.)
- Name matching with disambiguation
- Contextual similarity
strategies:
- strategy: "IDENTIFIER_LINKING"
description: "Same external identifier = same entity"
priority: 1
example:
doc1_entity: "Rijksmuseum"
doc1_wikidata: "Q190804"
doc2_entity: "Rijksmuseum Amsterdam"
doc2_wikidata: "Q190804"
result: "same_entity"
- strategy: "NAME_MATCHING"
description: "Similar names with compatible context"
priority: 2
methods:
- exact_match: "Identical strings"
- normalized_match: "Case/diacritic normalization"
- alias_match: "Known aliases (from knowledge base)"
- fuzzy_match: "Edit distance < threshold"
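# Illustrative sketch, not part of the original specification: a
# NAME_MATCHING example in the style of the other two strategies. The
# "method" field and the "candidate_match" result value are hypothetical;
# the strategy description still requires compatible context before the
# two mentions are merged.
example:
  doc1_entity: "Musée du Louvre"
  doc2_entity: "musee du louvre"
  method: "normalized_match"   # differs only in case and diacritics
  result: "candidate_match"    # confirm with context before linking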
- strategy: "CONTEXT_SIMILARITY"
description: "Similar co-occurring entities suggest same referent"
priority: 3
example:
doc1: "Rembrandt painted in Amsterdam"
doc2: "The artist worked in the Dutch Republic's capital"
shared_context: "Amsterdam / Dutch capital"
inference: "Likely same person"
# =============================================================================
# PROVENANCE PATH REQUIREMENTS
# =============================================================================
provenance_requirements:
description: |
ALL entity annotations MUST include path information for provenance.
This enables verification, reproducibility, and precise citation.
# ---------------------------------------------------------------------------
# Mandatory Fields
# ---------------------------------------------------------------------------
mandatory_path_fields:
description: |
Required provenance fields for every entity annotation.
required_fields:
- field: "source_document_uri"
type: "URI"
required: true
description: "Identifier for the source document"
examples:
- "https://example.org/documents/manuscript_123.xml"
- "file:///data/archives/letter_001.tei"
- "urn:isbn:978-0-123456-78-9"
- field: "document_format"
type: "enum"
required: true
values: ["PAGE-XML", "HTML", "JSON", "TEI-XML", "PLAIN_TEXT", "PDF", "ALTO-XML"]
description: "Format of source document (determines path syntax)"
- field: "structural_path"
type: "string"
required: true
description: "XPath, JSONPath, or character offset path to entity"
format_examples:
PAGE-XML: "//page:TextRegion[@id='r5']/page:TextLine[@id='l3']"
HTML: "/html/body/article/section[2]/p[1]"
JSON: "$.data.institutions[3].name"
TEI-XML: "//tei:body/tei:div[2]/tei:p[1]/tei:persName[1]"
PLAIN_TEXT: "char:1456-1470;line:23;para:5"
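# Illustrative addition, not in the original: an ALTO-XML path in the same
# style, assuming the "alto" prefix is bound to the ALTO namespace. The
# block and line IDs are hypothetical.
ALTO-XML: "//alto:TextBlock[@ID='b5']/alto:TextLine[@ID='l3']"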
- field: "doc_region"
type: "enum"
required: true
description: "GLAM-NER document region code"
values_reference: "See document_regions section"
examples: ["DOC.PAR", "DOC.HDR", "DOC.CAP", "DOC.SDB.IBX"]
- field: "structural_context"
type: "enum"
required: true
values:
- "header"
- "paragraph"
- "list_item"
- "table_cell"
- "caption"
- "sidebar"
- "marginalia"
- "footnote"
- "metadata"
- "index_entry"
- "bibliography_entry"
- "title_page"
description: "Semantic role of containing structure"
optional_fields:
- field: "governing_header"
type: "object"
required: false
description: "Path and text of governing header (if applicable)"
schema:
path:
type: "string"
description: "XPath/JSONPath to header element"
text:
type: "string"
description: "Header text content"
level:
type: "integer"
description: "Header level (1=h1, 2=h2, etc.)"
- field: "parent_container"
type: "object"
required: false
description: "Immediate parent container (list, table, figure, etc.)"
schema:
path:
type: "string"
type:
type: "string"
description: "Container type (ul, table, figure, etc.)"
- field: "page_reference"
type: "object"
required: false
description: "Physical page information (for paginated documents)"
schema:
page_number:
type: "string"
folio:
type: "string"
description: "Folio reference (e.g., '23r', '45v')"
image_filename:
type: "string"
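# Illustrative sketch, not part of the original specification: the path
# record for a single entity annotation, combining all mandatory fields
# with two optional ones. The key name "example_path_record", the header
# level, and the folio value are assumptions; the other values are reused
# from examples elsewhere in this module.
example_path_record:
  source_document_uri: "https://example.org/documents/manuscript_123.xml"
  document_format: "PAGE-XML"
  structural_path: "//page:TextRegion[@id='r5']/page:TextLine[@id='l3']"
  doc_region: "DOC.PAR"
  structural_context: "paragraph"
  governing_header:
    path: "//page:TextRegion[@id='header1']"
    text: "Dutch Golden Age Artists"
    level: 2
  page_reference:
    folio: "23r"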
# ---------------------------------------------------------------------------
# NIF Context Alignment
# ---------------------------------------------------------------------------
nif_context_alignment:
description: |
Path information aligns with the NIF (NLP Interchange Format) Context
model for annotation interchange. GLAM-NER extends NIF with structural
path properties.
nif_properties:
- property: "nif:beginIndex"
type: "xsd:nonNegativeInteger"
maps_to: "span_start (character offset)"
description: "Start offset within context string"
- property: "nif:endIndex"
type: "xsd:nonNegativeInteger"
maps_to: "span_end (character offset)"
description: "End offset within context string"
- property: "nif:anchorOf"
type: "xsd:string"
maps_to: "surface_form"
description: "Exact text of the annotation"
- property: "nif:sourceUrl"
type: "xsd:anyURI"
maps_to: "source_document_uri"
description: "Source document URL"
- property: "nif:referenceContext"
type: "nif:Context"
maps_to: "document context"
description: "Parent context containing this annotation"
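# Illustrative sketch, not part of the original specification: the
# nif:Context node that a nif:referenceContext value such as
# "https://example.org/doc#char=0,5000" points to. The key name
# "example_nif_context" and the source URL are assumptions; the full
# context text (nif:isString) is omitted for brevity.
example_nif_context:
  "@id": "https://example.org/doc#char=0,5000"
  "@type": "nif:Context"
  "nif:beginIndex": 0
  "nif:endIndex": 5000
  "nif:sourceUrl": "https://example.org/doc"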
glam_ner_extensions:
description: |
GLAM-NER extends NIF with structural provenance properties.
properties:
- property: "glam:structuralPath"
type: "xsd:string"
description: "XPath, JSONPath, or offset path to entity location"
- property: "glam:docRegion"
type: "xsd:string"
description: "GLAM-NER document region code (DOC.*)"
- property: "glam:structuralContext"
type: "xsd:string"
description: "Semantic role of containing structure"
- property: "glam:governingHeader"
type: "xsd:string"
description: "Text of governing section header"
- property: "glam:governingHeaderPath"
type: "xsd:string"
description: "Path to governing header element"
example_nif_output:
"@context":
- "http://persistence.uni-leipzig.org/nlp2rdf/contexts/nif-2.0.json"
- glam: "http://glam-ner.org/ns#"
"@id": "https://example.org/doc#char=145,154"
"@type": "nif:String"
# Standard NIF properties
"nif:anchorOf": "Rembrandt"
"nif:beginIndex": 145
"nif:endIndex": 154
"nif:referenceContext": "https://example.org/doc#char=0,5000"
# GLAM-NER extensions
"glam:structuralPath": "//page:TextRegion[@id='r1']/page:TextLine[@id='l1']"
"glam:docRegion": "DOC.PAR"
"glam:structuralContext": "paragraph"
"glam:governingHeader": "Dutch Golden Age Artists"
"glam:governingHeaderPath": "//page:TextRegion[@id='header1']"
# =============================================================================
# NESTED PROVENANCE MODEL
# =============================================================================
nested_provenance_model:
description: |
GLAM-NER uses a two-layer provenance model:
1. LAYOUT CLAIM: Provenance for structural/path assertions
- Where in the document is this region?
- What type of region is it?
- Who/what identified this region?
2. ENTITY CLAIM: Provenance for entity annotations
- What entity was recognized?
- What type is it?
- Who/what made this annotation?
- What is the confidence?
Layout claims contain entity claims, enabling:
- Separate assessment of structural vs. semantic accuracy
- Different annotators for layout vs. NER
- Tracking provenance at appropriate granularity
# ---------------------------------------------------------------------------
# Layout Claim Schema
# ---------------------------------------------------------------------------
layout_claim_schema:
description: |
A LayoutClaim asserts that a document region exists at a specific path
with a specific semantic type.
properties:
- property: "claim_id"
type: "URI"
required: true
description: "Unique identifier for this claim"
- property: "source_document"
type: "URI"
required: true
description: "Document containing this region"
- property: "structural_path"
type: "string"
required: true
description: "Path to region (format-specific)"
- property: "doc_region"
type: "string"
required: true
description: "GLAM-NER region code (DOC.*)"
- property: "region_text"
type: "string"
required: false
description: "Full text content of region"
- property: "char_offset_start"
type: "integer"
required: true
description: "Character offset of region start within the source document text"
- property: "char_offset_end"
type: "integer"
required: true
description: "Character offset of region end within the source document text"
- property: "provenance"
type: "object"
required: true
schema:
annotator:
type: "string"
description: "Human or system that made this claim"
annotation_date:
type: "datetime"
method:
type: "string"
enum: ["manual", "automatic", "hybrid"]
confidence:
type: "float"
range: [0.0, 1.0]
tool_version:
type: "string"
description: "Version of annotation tool (if automatic)"
- property: "entity_claims"
type: "array"
items: "EntityClaim"
description: "Entity annotations within this region"
example:
claim_id: "urn:glam:layout:doc123:r5"
source_document: "https://example.org/manuscript_001.xml"
structural_path: "//page:TextRegion[@id='r5']"
doc_region: "DOC.PAR"
region_text: "Rembrandt van Rijn was born in Leiden in 1606."
char_offset_start: 1234
char_offset_end: 1280
provenance:
annotator: "Transkribus Layout Model v4.2"
annotation_date: "2024-03-15T10:30:00Z"
method: "automatic"
confidence: 0.92
tool_version: "4.2.1"
entity_claims:
- claim_id: "urn:glam:entity:doc123:r5:e1" # Abbreviated; see the entity_claim_schema example
# ---------------------------------------------------------------------------
# Entity Claim Schema
# ---------------------------------------------------------------------------
entity_claim_schema:
description: |
An EntityClaim asserts that an entity mention exists at a specific
location within a layout region.
properties:
- property: "claim_id"
type: "URI"
required: true
description: "Unique identifier for this claim"
- property: "parent_layout_claim"
type: "URI"
required: true
description: "Reference to containing LayoutClaim"
- property: "surface_form"
type: "string"
required: true
description: "Exact text of entity mention"
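- property: "span_start"
type: "integer"
required: true
description: "Start character offset within the region text (0-based)"
- property: "span_end"
type: "integer"
required: true
description: "End character offset within the region text (exclusive, so span_end - span_start equals the length of surface_form)"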
- property: "entity_type"
type: "string"
required: true
description: "GLAM-NER entity type code"
examples: ["AGT.PER", "TOP.ADM", "TMP.DAT", "WRK.TXT"]
- property: "normalized_value"
type: "string"
required: false
description: "Normalized/canonical form of entity"
- property: "linked_entity"
type: "URI"
required: false
description: "Link to knowledge base entity"
examples:
- "http://www.wikidata.org/entity/Q5598"
- "https://viaf.org/viaf/64013650"
- property: "provenance"
type: "object"
required: true
schema:
annotator:
type: "string"
annotation_date:
type: "datetime"
method:
type: "string"
enum: ["manual", "automatic", "hybrid"]
confidence:
type: "float"
range: [0.0, 1.0]
model:
type: "string"
description: "NER model used (if automatic)"
model_version:
type: "string"
example:
claim_id: "urn:glam:entity:doc123:r5:e1"
parent_layout_claim: "urn:glam:layout:doc123:r5"
surface_form: "Rembrandt van Rijn"
span_start: 0
span_end: 18
entity_type: "AGT.PER"
normalized_value: "Rembrandt Harmenszoon van Rijn"
linked_entity: "http://www.wikidata.org/entity/Q5598"
provenance:
annotator: "spaCy NER + GLAM fine-tuning"
annotation_date: "2024-03-15T11:00:00Z"
method: "automatic"
confidence: 0.97
model: "glam-ner-nl-lg"
model_version: "1.2.0"
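# Illustrative sketch, not part of the original specification: a LayoutClaim
# with its EntityClaims nested inline, reusing the region text from the
# layout_claim_schema example and abridged to the fields relevant to nesting
# (provenance blocks omitted). The key name "example_nested_claims", the
# "Leiden" and "1606" claims, their claim_ids, and the "TOP.ADM" typing of
# Leiden are assumptions. Spans are relative to region_text with an
# exclusive span_end, as in the example above (0-18 for an 18-character
# surface form). Assuming document-level offsets are obtained by adding
# char_offset_start to each span, "Leiden" would fall at 1265-1271.
example_nested_claims:
  claim_id: "urn:glam:layout:doc123:r5"
  source_document: "https://example.org/manuscript_001.xml"
  structural_path: "//page:TextRegion[@id='r5']"
  doc_region: "DOC.PAR"
  region_text: "Rembrandt van Rijn was born in Leiden in 1606."
  char_offset_start: 1234
  entity_claims:
    - claim_id: "urn:glam:entity:doc123:r5:e1"
      parent_layout_claim: "urn:glam:layout:doc123:r5"
      surface_form: "Rembrandt van Rijn"
      span_start: 0
      span_end: 18
      entity_type: "AGT.PER"
    - claim_id: "urn:glam:entity:doc123:r5:e2"   # hypothetical
      parent_layout_claim: "urn:glam:layout:doc123:r5"
      surface_form: "Leiden"
      span_start: 31
      span_end: 37
      entity_type: "TOP.ADM"                     # assumed type code
    - claim_id: "urn:glam:entity:doc123:r5:e3"   # hypothetical
      parent_layout_claim: "urn:glam:layout:doc123:r5"
      surface_form: "1606"
      span_start: 41
      span_end: 45
      entity_type: "TMP.DAT"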
# =============================================================================
# END OF DOCUMENT STRUCTURE MODULE
# =============================================================================