CH-Annotator: Cultural Heritage Entity Annotation Convention

ID: ch_annotator-v1_7_0
Version: 1.7.0
Status: PRODUCTION
Date: 2025-12-02
Renamed: 2025-12-06


Overview

CH-Annotator (Cultural Heritage Annotator) is the project's comprehensive convention for:

  • Named Entity Recognition (NER)
  • Property Extraction
  • Entity Resolution
  • Entity Linking
  • Claim Validation

This convention applies to ALL text sources in the GLAM project.


Naming History

| Date | Name | File |
|------|------|------|
| 2025-12-02 | GLAM-NER v1.7.0-unified | entity_annotation_rules_v1.6.0_unified.yaml |
| 2025-12-06 | CH-Annotator v1.7.0 | ch_annotator-v1_7_0.yaml |

Rename Rationale:

  • "GLAM-NER" was ambiguous (could be confused with a Python NER library)
  • "CH-Annotator" clearly indicates Cultural Heritage domain focus
  • The file name now follows the project's snake_case convention and embeds the version number

File Locations

Primary Files

| File | Description |
|------|-------------|
| data/entity_annotation/ch_annotator-v1_7_0.yaml | MAIN FILE - Complete self-contained convention (2500+ lines) |
| data/entity_annotation/modules/index.yaml | Modular schema index for hypernym imports |

Module Structure

data/entity_annotation/
├── ch_annotator-v1_7_0.yaml          # Main convention file (complete)
├── entity_annotation_rules_v1.6.0_unified.yaml  # Legacy name (deprecated)
└── modules/
    ├── index.yaml                     # Module index
    ├── core/
    │   ├── convention.yaml            # Convention metadata
    │   └── namespaces.yaml            # Ontology prefixes
    ├── hypernyms/
    │   ├── agt.yaml                   # AGENT (AGT)
    │   ├── grp.yaml                   # GROUP (GRP)
    │   ├── top.yaml                   # TOPONYM (TOP)
    │   ├── geo.yaml                   # GEOMETRY (GEO)
    │   ├── tmp.yaml                   # TEMPORAL (TMP)
    │   ├── app.yaml                   # APPELLATION (APP)
    │   ├── rol.yaml                   # ROLE (ROL)
    │   ├── wrk.yaml                   # WORK (WRK)
    │   ├── qty.yaml                   # QUANTITY (QTY)
    │   └── thg.yaml                   # THING (THG)
    ├── processing/
    │   ├── exclusions.yaml
    │   ├── double_tagging.yaml
    │   └── relationships.yaml
    ├── integrations/
    │   ├── pico.yaml                  # PiCo ontology
    │   └── nif_nerd.yaml              # NIF/NERD compatibility
    └── advanced/
        ├── document_structure.yaml    # DOC hypernym (30+ types)
        ├── coreference.yaml
        ├── uncertainty.yaml
        └── tei/                       # TEI P5 modules
            ├── core.yaml
            ├── namesdates.yaml
            ├── msdescription.yaml
            └── linking.yaml

Hypernym Entity Types

CH-Annotator defines 10 hypernym categories (11 including DOCUMENT regions, DOC):

| Code | Hypernym | Description | Primary Ontology Class |
|------|----------|-------------|------------------------|
| AGT | AGENT | Humans, AI agents, animals, fictional beings | crm:E39_Actor |
| GRP | GROUP | Formal/informal collectives of agents | crm:E74_Group |
| TOP | TOPONYM | Place names as nominal references | crm:E53_Place |
| GEO | GEOMETRY | Coordinates, polygons, spatial primitives | geo:Geometry |
| TMP | TEMPORAL | TimeML/TIMEX3 temporal expressions | crm:E52_Time-Span |
| APP | APPELLATION | Names, titles, awards, structured names | crm:E41_Appellation |
| ROL | ROLE | Occupations, honorifics, positions | org:Role |
| WRK | WORK | FRBR Work/Expression/Manifestation/Item | frbroo:F1_Work |
| QTY | QUANTITY | Counts, measurements, currency, ranges | crm:E54_Dimension |
| THG | THING | Artworks, artifacts, events, concepts | crm:E70_Thing |
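As a minimal sketch, the table above can be held as a lookup structure. The names `HYPERNYMS` and `hypernym_class` are hypothetical and not part of the convention files. The lookup resolves only the top-level hypernym; a subtype may declare its own class (GRP.HER, for example, uses a different `class_uri`).

```python
# Hypernym codes, names, and primary ontology classes, copied from the table above.
HYPERNYMS = {
    "AGT": ("AGENT", "crm:E39_Actor"),
    "GRP": ("GROUP", "crm:E74_Group"),
    "TOP": ("TOPONYM", "crm:E53_Place"),
    "GEO": ("GEOMETRY", "geo:Geometry"),
    "TMP": ("TEMPORAL", "crm:E52_Time-Span"),
    "APP": ("APPELLATION", "crm:E41_Appellation"),
    "ROL": ("ROLE", "org:Role"),
    "WRK": ("WORK", "frbroo:F1_Work"),
    "QTY": ("QUANTITY", "crm:E54_Dimension"),
    "THG": ("THING", "crm:E70_Thing"),
}


def hypernym_class(type_code: str) -> str:
    """Return the hypernym-level ontology class for a code such as 'TMP.DAB'.

    Hypothetical helper: only the first dot-separated segment is inspected,
    so subtype-specific class overrides are not taken into account.
    """
    hypernym = type_code.split(".", 1)[0].upper()
    if hypernym not in HYPERNYMS:
        raise ValueError(f"unknown hypernym code: {hypernym!r}")
    return HYPERNYMS[hypernym][1]


if __name__ == "__main__":
    print(hypernym_class("TOP"))
    print(hypernym_class("TMP.DAB"))
```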

Heritage Institution Subtype (GRP.HER)

For heritage custodians specifically:

GRP.HER:
  name: HERITAGE_CUSTODIAN
  description: Heritage institutions managing cultural collections
  class_uri: cpov:PublicOrganisation
  subtypes:
    - GRP.HER.GAL  # Gallery
    - GRP.HER.LIB  # Library
    - GRP.HER.ARC  # Archive
    - GRP.HER.MUS  # Museum
    - GRP.HER.OFF  # Official institution
    - GRP.HER.RES  # Research center
    - GRP.HER.COR  # Corporation
    # ... (matches GLAMORCUBESFIXPHDNT taxonomy)
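The dotted codes above read as a hierarchy (hypernym, subtype, leaf). The following sketch splits such a code into its segments. It assumes at most three dot-separated alphanumeric segments, which matches the examples in this document but is not confirmed by the convention itself. `TypeCode`, `parse_type_code`, and `is_heritage_custodian` are hypothetical names.

```python
from typing import NamedTuple, Optional


class TypeCode(NamedTuple):
    hypernym: str
    subtype: Optional[str]
    leaf: Optional[str]


def parse_type_code(code: str) -> TypeCode:
    """Split a dotted type code such as 'GRP.HER.MUS' into its segments.

    Assumption: one to three alphanumeric segments separated by dots.
    """
    parts = code.strip().upper().split(".")
    if not 1 <= len(parts) <= 3 or not all(p.isalnum() for p in parts):
        raise ValueError(f"malformed type code: {code!r}")
    padded = parts + [None] * (3 - len(parts))
    return TypeCode(*padded)


def is_heritage_custodian(code: str) -> bool:
    """True for GRP.HER and any of its subtypes (GRP.HER.MUS, GRP.HER.LIB, ...)."""
    parsed = parse_type_code(code)
    return parsed.hypernym == "GRP" and parsed.subtype == "HER"


if __name__ == "__main__":
    print(parse_type_code("GRP.HER.MUS"))
    print(is_heritage_custodian("GRP.HER.ARC"))
```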

Digital Humanities Authority Stack

CH-Annotator prioritizes Digital Humanities standards over web-centric NER:

Primary Authorities

| Authority | Usage |
|-----------|-------|
| TEI P5 | Document structure, person/place/org names, temporal expressions |
| CIDOC-CRM 7.1.3 | Cultural heritage entity modeling, events, temporal entities |
| TimeML/TIMEX3 | Temporal expression annotation (DATE, TIME, DURATION, SET) |
| FRBR/LRM | Work/Expression/Manifestation/Item for bibliographic entities |
| GeoSPARQL | Spatial geometry representation in RDF |
| Pleiades | Historical and ancient world toponyms |

Secondary Authorities

| Authority | Usage |
|-----------|-------|
| W3C Org | Organizational structure, roles, memberships |
| RiC-O | Archival description and record relationships |
| PNV | Structured person name components |
| PiCo | Person observations in historical sources |

Deprecated (Interchange Only)

| System | Status |
|--------|--------|
| NERD | Retained for NLP pipeline interchange, NOT authoritative |

Breaking Changes in v1.7.0

Hypernym Renames

| Old Name | New Name | Code | Rationale |
|----------|----------|------|-----------|
| BEING | AGENT | AGT | CIDOC-CRM E39_Actor (includes non-humans) |
| ORGANISATION | GROUP | GRP | CIDOC-CRM E74_Group (formal + informal) |
| PLACE | TOPONYM | TOP | Nominal references only |
| (new) | GEOMETRY | GEO | Coordinate/shape data split from PLACE |
| TEXTUAL_REFERENCE | WORK | WRK | FRBR model instead of nerd:Product |
| (new) | ROLE | ROL | TEI roleName + PiCo concepts |

Temporal Restructuring (TimeML/TIMEX3)

| Type Code | Description |
|-----------|-------------|
| TMP.DAB | Datable (absolute timestamps, fully resolved) |
| TMP.DRL | Deictic/Relative (require context) |
| TMP.DUR | Durations |
| TMP.SET | Recurring/periodic times |
| TMP.RNG | Explicit start-end ranges |
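The five temporal type codes above could be represented as an enumeration so that unknown codes are rejected early. This is only a sketch: `TemporalType` and `temporal_type` are hypothetical names, and the descriptions are copied from the table.

```python
from enum import Enum


class TemporalType(Enum):
    DAB = "Datable (absolute timestamps, fully resolved)"
    DRL = "Deictic/Relative (require context)"
    DUR = "Durations"
    SET = "Recurring/periodic times"
    RNG = "Explicit start-end ranges"

    @property
    def code(self) -> str:
        """Full type code, e.g. 'TMP.DAB'."""
        return f"TMP.{self.name}"


def temporal_type(code: str) -> TemporalType:
    """Look up a TMP subtype from its full code; raise ValueError if unknown."""
    prefix, _, suffix = code.partition(".")
    if prefix != "TMP" or suffix not in TemporalType.__members__:
        raise ValueError(f"not a known temporal type code: {code!r}")
    return TemporalType[suffix]


if __name__ == "__main__":
    t = temporal_type("TMP.RNG")
    print(t.code, "-", t.value)
```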

Claim Provenance Model

Every extracted claim MUST carry five-component provenance (namespace, path, timestamp, agent, context_convention):

claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos        # Ontology prefix
    path: /html/body/h1[1] # XPath/JSONPath to source
    timestamp: "2025-12-06T10:00:00Z"  # ISO 8601
    agent: ch_annotator-v1_7_0         # Extraction model
    context_convention: ch_annotator-v1_7_0  # This convention version
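A check for the five required components might look like the sketch below. The function name `validate_provenance` is hypothetical, and the timestamp check only confirms that the value parses as an ISO 8601 date-time; the convention may impose stricter rules that are not covered here.

```python
from datetime import datetime

# The five provenance components named in the claim example above.
REQUIRED_COMPONENTS = ("namespace", "path", "timestamp", "agent", "context_convention")


def validate_provenance(provenance: dict) -> list:
    """Return a list of problems; an empty list means the provenance passes."""
    errors = [
        f"missing component: {key}"
        for key in REQUIRED_COMPONENTS
        if not provenance.get(key)
    ]
    timestamp = provenance.get("timestamp")
    if timestamp:
        try:
            # fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
            # so normalise it to an explicit UTC offset first.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"timestamp is not ISO 8601: {timestamp!r}")
    return errors


if __name__ == "__main__":
    claim = {
        "claim_type": "full_name",
        "claim_value": "Rijksmuseum Amsterdam",
        "provenance": {
            "namespace": "skos",
            "path": "/html/body/h1[1]",
            "timestamp": "2025-12-06T10:00:00Z",
            "agent": "ch_annotator-v1_7_0",
            "context_convention": "ch_annotator-v1_7_0",
        },
    }
    print(validate_provenance(claim["provenance"]))
```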

Usage in GLAM Project

When to Use CH-Annotator

  • Extracting entities from Claude conversation JSON files
  • Annotating web-scraped heritage institution pages
  • Processing archival finding aids
  • Extracting entities from PDF documents
  • Annotating ISIL registry entries

When NOT to Use CH-Annotator

  • Simple YAML data restructuring (no NER needed)
  • Identifier-only extraction (use regex patterns)
  • Geographic enrichment (use GeoNames directly)
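For the identifier-only case, a plain regular expression is usually enough. The sketch below assumes the identifiers in question are ISIL codes and uses a deliberately simplified shape (a short uppercase prefix, a hyphen, then up to 11 further characters). The pattern is an approximation introduced here for illustration, not taken from the convention, and it does not verify that a prefix is actually registered, so matches are candidates only.

```python
import re

# Simplified ISIL shape (assumption): 1-4 uppercase letters, a hyphen, then
# 1-11 characters drawn from letters, digits, ':', '/' and '-'.
ISIL_PATTERN = re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9:/-]{1,11}\b")


def find_isil_candidates(text: str) -> list:
    """Return all ISIL-shaped substrings; callers should validate them further."""
    return ISIL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Nationaal Archief (ISIL NL-HaNA), The Hague"
    print(find_isil_candidates(sample))
```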

Provenance Reference

When using CH-Annotator, reference it in extraction metadata:

provenance:
  data_source: CONVERSATION_NLP
  extraction_method: ch_annotator-v1_7_0
  extraction_date: "2025-12-06T10:00:00Z"
  confidence_score: 0.92
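One way to produce this metadata consistently is a small builder function. The sketch below is an illustration only: `extraction_provenance` is a hypothetical name, and the requirement that `confidence_score` lies between 0 and 1 is an assumption inferred from the example value, not stated by the convention.

```python
from datetime import datetime, timezone
from typing import Optional

CONVENTION_ID = "ch_annotator-v1_7_0"


def extraction_provenance(
    data_source: str,
    confidence_score: float,
    when: Optional[datetime] = None,
) -> dict:
    """Build the extraction metadata block that references CH-Annotator.

    `when` should be timezone-aware; it defaults to the current UTC time.
    """
    # Assumption: confidence scores are probabilities in the range [0, 1].
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be between 0 and 1")
    moment = when or datetime.now(timezone.utc)
    return {
        "data_source": data_source,
        "extraction_method": CONVENTION_ID,
        "extraction_date": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "confidence_score": confidence_score,
    }


if __name__ == "__main__":
    stamp = datetime(2025, 12, 6, 10, 0, 0, tzinfo=timezone.utc)
    print(extraction_provenance("CONVERSATION_NLP", 0.92, when=stamp))
```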

Integration with LinkML Schema

CH-Annotator aligns with the project's LinkML schema:

| CH-Annotator Type | LinkML Class | Schema File |
|-------------------|--------------|-------------|
| GRP.HER | HeritageCustodian | schemas/core.yaml |
| TOP | Location | schemas/core.yaml |
| APP.IDE | Identifier | schemas/core.yaml |
| TMP.DAB | ChangeEvent.event_date | schemas/provenance.yaml |
| WRK | Collection | schemas/collections.yaml |
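The alignment above can be sketched as a lookup that falls back from a specific code to its nearest listed ancestor, so that GRP.HER.MUS resolves through GRP.HER. That inheritance behaviour is an assumption made for this example, not something the convention states, and `LINKML_ALIGNMENT` and `linkml_target` are hypothetical names.

```python
from typing import Optional, Tuple

# CH-Annotator type -> (LinkML class or slot, schema file), copied from the table above.
LINKML_ALIGNMENT = {
    "GRP.HER": ("HeritageCustodian", "schemas/core.yaml"),
    "TOP": ("Location", "schemas/core.yaml"),
    "APP.IDE": ("Identifier", "schemas/core.yaml"),
    "TMP.DAB": ("ChangeEvent.event_date", "schemas/provenance.yaml"),
    "WRK": ("Collection", "schemas/collections.yaml"),
}


def linkml_target(type_code: str) -> Optional[Tuple[str, str]]:
    """Resolve a type code to its LinkML target via longest-prefix match.

    Assumption: subtypes inherit the mapping of their nearest listed ancestor.
    Returns None when neither the code nor any ancestor is listed.
    """
    parts = type_code.split(".")
    for end in range(len(parts), 0, -1):
        key = ".".join(parts[:end])
        if key in LINKML_ALIGNMENT:
            return LINKML_ALIGNMENT[key]
    return None


if __name__ == "__main__":
    print(linkml_target("GRP.HER.MUS"))
    print(linkml_target("QTY"))
```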

See Also

  • data/entity_annotation/ch_annotator-v1_7_0.yaml - Full convention
  • data/entity_annotation/modules/ - Modular hypernym definitions
  • schemas/20251121/linkml/modules/classes/WebClaim.yaml - Claim schema
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath provenance rules
  • AGENTS.md - Rule 10 (CH-Annotator usage)

Version: 1.0
Last Updated: 2025-12-06