# CH-Annotator: Cultural Heritage Entity Annotation Convention

**ID**: `ch_annotator-v1_7_0`
**Version**: 1.7.0
**Status**: PRODUCTION
**Date**: 2025-12-02
**Renamed**: 2025-12-06

---

## Overview

CH-Annotator (Cultural Heritage Annotator) is the project's comprehensive convention for:

- Named Entity Recognition (NER)
- Property Extraction
- Entity Resolution
- Entity Linking
- Claim Validation

This convention applies to ALL text sources in the GLAM project.

---

## Naming History

| Date | Name | File |
|------|------|------|
| 2025-12-02 | GLAM-NER v1.7.0-unified | `entity_annotation_rules_v1.6.0_unified.yaml` |
| 2025-12-06 | CH-Annotator v1.7.0 | `ch_annotator-v1_7_0.yaml` |

**Rename Rationale**:

- "GLAM-NER" was ambiguous (could be confused with a Python NER library)
- "CH-Annotator" clearly indicates Cultural Heritage domain focus
- File naming now follows project snake_case conventions with version

---

## File Locations

### Primary Files

| File | Description |
|------|-------------|
| `data/entity_annotation/ch_annotator-v1_7_0.yaml` | **MAIN FILE** - Complete self-contained convention (2500+ lines) |
| `data/entity_annotation/modules/index.yaml` | Modular schema index for hypernym imports |

### Module Structure

```
data/entity_annotation/
├── ch_annotator-v1_7_0.yaml                     # Main convention file (complete)
├── entity_annotation_rules_v1.6.0_unified.yaml  # Legacy name (deprecated)
└── modules/
    ├── index.yaml                    # Module index
    ├── core/
    │   ├── convention.yaml           # Convention metadata
    │   └── namespaces.yaml           # Ontology prefixes
    ├── hypernyms/
    │   ├── agt.yaml                  # AGENT (AGT)
    │   ├── grp.yaml                  # GROUP (GRP)
    │   ├── top.yaml                  # TOPONYM (TOP)
    │   ├── geo.yaml                  # GEOMETRY (GEO)
    │   ├── tmp.yaml                  # TEMPORAL (TMP)
    │   ├── app.yaml                  # APPELLATION (APP)
    │   ├── rol.yaml                  # ROLE (ROL)
    │   ├── wrk.yaml                  # WORK (WRK)
    │   ├── qty.yaml                  # QUANTITY (QTY)
    │   └── thg.yaml                  # THING (THG)
    ├── processing/
    │   ├── exclusions.yaml
    │   ├── double_tagging.yaml
    │   └── relationships.yaml
    ├── integrations/
    │   ├── pico.yaml                 # PiCo ontology
    │   └── nif_nerd.yaml             # NIF/NERD compatibility
    └── advanced/
        ├── document_structure.yaml   # DOC hypernym (30+ types)
        ├── coreference.yaml
        ├── uncertainty.yaml
        └── tei/                      # TEI P5 modules
            ├── core.yaml
            ├── namesdates.yaml
            ├── msdescription.yaml
            └── linking.yaml
```

---

## Hypernym Entity Types

CH-Annotator defines **10 hypernym categories** (11 including DOCUMENT regions):

| Code | Hypernym | Description | Primary Ontology Class |
|------|----------|-------------|------------------------|
| **AGT** | AGENT | Humans, AI agents, animals, fictional beings | `crm:E39_Actor` |
| **GRP** | GROUP | Formal/informal collectives of agents | `crm:E74_Group` |
| **TOP** | TOPONYM | Place names as nominal references | `crm:E53_Place` |
| **GEO** | GEOMETRY | Coordinates, polygons, spatial primitives | `geo:Geometry` |
| **TMP** | TEMPORAL | TimeML/TIMEX3 temporal expressions | `crm:E52_Time-Span` |
| **APP** | APPELLATION | Names, titles, awards, structured names | `crm:E41_Appellation` |
| **ROL** | ROLE | Occupations, honorifics, positions | `org:Role` |
| **WRK** | WORK | FRBR Work/Expression/Manifestation/Item | `frbroo:F1_Work` |
| **QTY** | QUANTITY | Counts, measurements, currency, ranges | `crm:E54_Dimension` |
| **THG** | THING | Artworks, artifacts, events, concepts | `crm:E70_Thing` |

### Heritage Institution Subtype (GRP.HER)

For heritage custodians specifically:

```yaml
GRP.HER:
  name: HERITAGE_CUSTODIAN
  description: Heritage institutions managing cultural collections
  class_uri: cpov:PublicOrganisation
  subtypes:
    - GRP.HER.GAL  # Gallery
    - GRP.HER.LIB  # Library
    - GRP.HER.ARC  # Archive
    - GRP.HER.MUS  # Museum
    - GRP.HER.OFF  # Official institution
    - GRP.HER.RES  # Research center
    - GRP.HER.COR  # Corporation
    # ...
    # (matches GLAMORCUBESFIXPHDNT taxonomy)
```

---

## Digital Humanities Authority Stack

CH-Annotator prioritizes Digital Humanities standards over web-centric NER:

### Primary Authorities

| Authority | Usage |
|-----------|-------|
| **TEI P5** | Document structure, person/place/org names, temporal expressions |
| **CIDOC-CRM 7.1.3** | Cultural heritage entity modeling, events, temporal entities |
| **TimeML/TIMEX3** | Temporal expression annotation (DATE, TIME, DURATION, SET) |
| **FRBR/LRM** | Work/Expression/Manifestation/Item for bibliographic entities |
| **GeoSPARQL** | Spatial geometry representation in RDF |
| **Pleiades** | Historical and ancient world toponyms |

### Secondary Authorities

| Authority | Usage |
|-----------|-------|
| **W3C Org** | Organizational structure, roles, memberships |
| **RiC-O** | Archival description and record relationships |
| **PNV** | Structured person name components |
| **PiCo** | Person observations in historical sources |

### Deprecated (Interchange Only)

| System | Status |
|--------|--------|
| **NERD** | Retained for NLP pipeline interchange, NOT authoritative |

---

## Breaking Changes in v1.7.0

### Hypernym Renames

| Old Name | New Name | Code | Rationale |
|----------|----------|------|-----------|
| BEING | AGENT | AGT | CIDOC-CRM E39_Actor (includes non-humans) |
| ORGANISATION | GROUP | GRP | CIDOC-CRM E74_Group (formal + informal) |
| PLACE | TOPONYM | TOP | Nominal references only |
| (new) | GEOMETRY | GEO | Coordinate/shape data split from PLACE |
| TEXTUAL_REFERENCE | WORK | WRK | FRBR model instead of nerd:Product |
| (new) | ROLE | ROL | TEI roleName + PiCo concepts |

### Temporal Restructuring (TimeML/TIMEX3)

| Type Code | Description |
|-----------|-------------|
| TMP.DAB | Datable (absolute timestamps, fully resolved) |
| TMP.DRL | Deictic/Relative (require context) |
| TMP.DUR | Durations |
| TMP.SET | Recurring/periodic times |
| TMP.RNG | Explicit start-end ranges |

---

## Claim Provenance Model

Every
extracted claim MUST have 5-component provenance:

```yaml
claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                           # Ontology prefix
    path: /html/body/h1[1]                    # XPath/JSONPath to source
    timestamp: "2025-12-06T10:00:00Z"         # ISO 8601
    agent: ch_annotator-v1_7_0                # Extraction model
    context_convention: ch_annotator-v1_7_0   # This convention version
```

---

## Usage in GLAM Project

### When to Use CH-Annotator

- Extracting entities from Claude conversation JSON files
- Annotating web-scraped heritage institution pages
- Processing archival finding aids
- Extracting entities from PDF documents
- Annotating ISIL registry entries

### When NOT to Use CH-Annotator

- Simple YAML data restructuring (no NER needed)
- Identifier-only extraction (use regex patterns)
- Geographic enrichment (use GeoNames directly)

### Provenance Reference

When using CH-Annotator, reference it in extraction metadata:

```yaml
provenance:
  data_source: CONVERSATION_NLP
  extraction_method: ch_annotator-v1_7_0
  extraction_date: "2025-12-06T10:00:00Z"
  confidence_score: 0.92
```

---

## Integration with LinkML Schema

CH-Annotator aligns with the project's LinkML schema:

| CH-Annotator Type | LinkML Class | Schema File |
|-------------------|--------------|-------------|
| GRP.HER | HeritageCustodian | schemas/core.yaml |
| TOP | Location | schemas/core.yaml |
| APP.IDE | Identifier | schemas/core.yaml |
| TMP.DAB | ChangeEvent.event_date | schemas/provenance.yaml |
| WRK | Collection | schemas/collections.yaml |

---

## See Also

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - Full convention
- `data/entity_annotation/modules/` - Modular hypernym definitions
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Claim schema
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath provenance rules
- `AGENTS.md` - Rule 10 (CH-Annotator usage)

---

**Version**: 1.0
**Last Updated**: 2025-12-06
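
---

## Example: Checking Claim Provenance

The Claim Provenance Model section requires five provenance components on every extracted claim. The sketch below shows one way such a check might look in Python. It is illustrative only and not part of the convention: the function name `provenance_problems` and the in-memory dict layout are assumptions made for this example, and the project's own validation (for instance against the LinkML `WebClaim` schema) may work differently.

```python
from __future__ import annotations

from datetime import datetime

# The five provenance components named in the Claim Provenance Model section.
REQUIRED_PROVENANCE_KEYS = (
    "namespace",
    "path",
    "timestamp",
    "agent",
    "context_convention",
)


def provenance_problems(claim: dict) -> list[str]:
    """Hypothetical helper: list what is wrong with a claim's provenance.

    An empty list means all five components are present and the
    timestamp parses as ISO 8601.
    """
    provenance = claim.get("provenance")
    if not isinstance(provenance, dict):
        return ["claim has no provenance mapping"]

    problems = [
        f"missing provenance component: {key}"
        for key in REQUIRED_PROVENANCE_KEYS
        if not provenance.get(key)
    ]

    timestamp = provenance.get("timestamp")
    if timestamp:
        try:
            # Older Python versions reject a trailing "Z", so normalize it first.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"timestamp is not ISO 8601: {timestamp!r}")
    return problems


# The example claim from the Claim Provenance Model section, as a Python dict.
example_claim = {
    "claim_type": "full_name",
    "claim_value": "Rijksmuseum Amsterdam",
    "provenance": {
        "namespace": "skos",
        "path": "/html/body/h1[1]",
        "timestamp": "2025-12-06T10:00:00Z",
        "agent": "ch_annotator-v1_7_0",
        "context_convention": "ch_annotator-v1_7_0",
    },
}

if __name__ == "__main__":
    for problem in provenance_problems(example_claim):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a batch extraction report every missing component for a claim in a single pass.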