CH-Annotator: Cultural Heritage Entity Annotation Convention
ID: ch_annotator-v1_7_0
Version: 1.7.0
Status: PRODUCTION
Date: 2025-12-02
Renamed: 2025-12-06
Overview
CH-Annotator (Cultural Heritage Annotator) is the project's comprehensive convention for:
- Named Entity Recognition (NER)
- Property Extraction
- Entity Resolution
- Entity Linking
- Claim Validation
This convention applies to ALL text sources in the GLAM project.
Naming History
| Date | Name | File |
|---|---|---|
| 2025-12-02 | GLAM-NER v1.7.0-unified | entity_annotation_rules_v1.6.0_unified.yaml |
| 2025-12-06 | CH-Annotator v1.7.0 | ch_annotator-v1_7_0.yaml |
Rename Rationale:
- "GLAM-NER" was ambiguous (could be confused with a Python NER library)
- "CH-Annotator" clearly indicates Cultural Heritage domain focus
- File naming now follows project snake_case conventions with version
File Locations
Primary Files
| File | Description |
|---|---|
| data/entity_annotation/ch_annotator-v1_7_0.yaml | MAIN FILE - Complete self-contained convention (2500+ lines) |
| data/entity_annotation/modules/index.yaml | Modular schema index for hypernym imports |
Module Structure
data/entity_annotation/
├── ch_annotator-v1_7_0.yaml # Main convention file (complete)
├── entity_annotation_rules_v1.6.0_unified.yaml # Legacy name (deprecated)
└── modules/
    ├── index.yaml                  # Module index
    ├── core/
    │   ├── convention.yaml         # Convention metadata
    │   └── namespaces.yaml         # Ontology prefixes
    ├── hypernyms/
    │   ├── agt.yaml                # AGENT (AGT)
    │   ├── grp.yaml                # GROUP (GRP)
    │   ├── top.yaml                # TOPONYM (TOP)
    │   ├── geo.yaml                # GEOMETRY (GEO)
    │   ├── tmp.yaml                # TEMPORAL (TMP)
    │   ├── app.yaml                # APPELLATION (APP)
    │   ├── rol.yaml                # ROLE (ROL)
    │   ├── wrk.yaml                # WORK (WRK)
    │   ├── qty.yaml                # QUANTITY (QTY)
    │   └── thg.yaml                # THING (THG)
    ├── processing/
    │   ├── exclusions.yaml
    │   ├── double_tagging.yaml
    │   └── relationships.yaml
    ├── integrations/
    │   ├── pico.yaml               # PiCo ontology
    │   └── nif_nerd.yaml           # NIF/NERD compatibility
    └── advanced/
        ├── document_structure.yaml # DOC hypernym (30+ types)
        ├── coreference.yaml
        ├── uncertainty.yaml
        └── tei/                    # TEI P5 modules
            ├── core.yaml
            ├── namesdates.yaml
            ├── msdescription.yaml
            └── linking.yaml
Hypernym Entity Types
CH-Annotator defines 10 hypernym categories (11 including DOCUMENT regions):
| Code | Hypernym | Description | Primary Ontology Class |
|---|---|---|---|
| AGT | AGENT | Humans, AI agents, animals, fictional beings | crm:E39_Actor |
| GRP | GROUP | Formal/informal collectives of agents | crm:E74_Group |
| TOP | TOPONYM | Place names as nominal references | crm:E53_Place |
| GEO | GEOMETRY | Coordinates, polygons, spatial primitives | geo:Geometry |
| TMP | TEMPORAL | TimeML/TIMEX3 temporal expressions | crm:E52_Time-Span |
| APP | APPELLATION | Names, titles, awards, structured names | crm:E41_Appellation |
| ROL | ROLE | Occupations, honorifics, positions | org:Role |
| WRK | WORK | FRBR Work/Expression/Manifestation/Item | frbroo:F1_Work |
| QTY | QUANTITY | Counts, measurements, currency, ranges | crm:E54_Dimension |
| THG | THING | Artworks, artifacts, events, concepts | crm:E70_Thing |
Heritage Institution Subtype (GRP.HER)
For heritage custodians specifically:
GRP.HER:
  name: HERITAGE_CUSTODIAN
  description: Heritage institutions managing cultural collections
  class_uri: cpov:PublicOrganisation
  subtypes:
    - GRP.HER.GAL  # Gallery
    - GRP.HER.LIB  # Library
    - GRP.HER.ARC  # Archive
    - GRP.HER.MUS  # Museum
    - GRP.HER.OFF  # Official institution
    - GRP.HER.RES  # Research center
    - GRP.HER.COR  # Corporation
    # ... (matches GLAMORCUBESFIXPHDNT taxonomy)
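The dotted type codes above (for example GRP.HER.MUS) can be split into a hypernym code and a subtype path. The following is a minimal sketch, not part of the convention itself: the function names `split_type_code` and `is_heritage_custodian` are hypothetical, the dictionary is copied from the hypernym table, and the DOC hypernym is left out because the table does not list it.

```python
# Hypernym codes with their names and primary ontology classes,
# copied from the hypernym table in this document.
HYPERNYMS = {
    "AGT": ("AGENT", "crm:E39_Actor"),
    "GRP": ("GROUP", "crm:E74_Group"),
    "TOP": ("TOPONYM", "crm:E53_Place"),
    "GEO": ("GEOMETRY", "geo:Geometry"),
    "TMP": ("TEMPORAL", "crm:E52_Time-Span"),
    "APP": ("APPELLATION", "crm:E41_Appellation"),
    "ROL": ("ROLE", "org:Role"),
    "WRK": ("WORK", "frbroo:F1_Work"),
    "QTY": ("QUANTITY", "crm:E54_Dimension"),
    "THG": ("THING", "crm:E70_Thing"),
}


def split_type_code(code):
    """Split a dotted type code into (hypernym code, list of subtype parts).

    Hypothetical helper: raises ValueError when the first part is not one
    of the hypernym codes listed in the table.
    """
    head, *rest = code.strip().upper().split(".")
    if head not in HYPERNYMS:
        raise ValueError(f"unknown hypernym code: {head!r}")
    return head, rest


def is_heritage_custodian(code):
    """Return True for GRP.HER and any of its subtypes (e.g. GRP.HER.MUS)."""
    head, rest = split_type_code(code)
    return head == "GRP" and rest[:1] == ["HER"]
```

For instance, `split_type_code("GRP.HER.MUS")` yields the hypernym code `GRP` with the subtype path `["HER", "MUS"]`, and `HYPERNYMS["GRP"]` then gives the name and ontology class to attach.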
Digital Humanities Authority Stack
CH-Annotator prioritizes Digital Humanities standards over web-centric NER:
Primary Authorities
| Authority | Usage |
|---|---|
| TEI P5 | Document structure, person/place/org names, temporal expressions |
| CIDOC-CRM 7.1.3 | Cultural heritage entity modeling, events, temporal entities |
| TimeML/TIMEX3 | Temporal expression annotation (DATE, TIME, DURATION, SET) |
| FRBR/LRM | Work/Expression/Manifestation/Item for bibliographic entities |
| GeoSPARQL | Spatial geometry representation in RDF |
| Pleiades | Historical and ancient world toponyms |
Secondary Authorities
| Authority | Usage |
|---|---|
| W3C Org | Organizational structure, roles, memberships |
| RiC-O | Archival description and record relationships |
| PNV | Structured person name components |
| PiCo | Person observations in historical sources |
Deprecated (Interchange Only)
| System | Status |
|---|---|
| NERD | Retained for NLP pipeline interchange, NOT authoritative |
Breaking Changes in v1.7.0
Hypernym Renames
| Old Name | New Name | Code | Rationale |
|---|---|---|---|
| BEING | AGENT | AGT | CIDOC-CRM E39_Actor (includes non-humans) |
| ORGANISATION | GROUP | GRP | CIDOC-CRM E74_Group (formal + informal) |
| PLACE | TOPONYM | TOP | Nominal references only |
| (new) | GEOMETRY | GEO | Coordinate/shape data split from PLACE |
| TEXTUAL_REFERENCE | WORK | WRK | FRBR model instead of nerd:Product |
| (new) | ROLE | ROL | TEI roleName + PiCo concepts |
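Annotations produced before v1.7.0 may still carry the old hypernym names. A minimal migration sketch follows, assuming legacy annotations store the old name as a plain string label; the function name `migrate_label` and the review flag are hypothetical. Because coordinate and shape data were split out of PLACE into GEOMETRY, a former PLACE label is mapped to TOP but flagged for manual review.

```python
# Old hypernym name -> new hypernym code, taken from the rename table.
LEGACY_TO_CODE = {
    "BEING": "AGT",
    "ORGANISATION": "GRP",
    "PLACE": "TOP",
    "TEXTUAL_REFERENCE": "WRK",
}

# PLACE was split: nominal references stay in TOP, coordinate/shape data
# moved to GEO, so a PLACE annotation cannot be migrated blindly.
NEEDS_REVIEW = {"PLACE"}


def migrate_label(label):
    """Return (new_code, needs_review) for a legacy hypernym label.

    Labels that were not renamed are returned unchanged (upper-cased).
    """
    key = label.strip().upper()
    return LEGACY_TO_CODE.get(key, key), key in NEEDS_REVIEW
```

The review flag keeps the ambiguity visible instead of silently assigning GEO data to TOP.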
Temporal Restructuring (TimeML/TIMEX3)
| Type Code | Description |
|---|---|
| TMP.DAB | Datable (absolute timestamps, fully resolved) |
| TMP.DRL | Deictic/Relative (require context) |
| TMP.DUR | Durations |
| TMP.SET | Recurring/periodic times |
| TMP.RNG | Explicit start-end ranges |
Claim Provenance Model
Every extracted claim MUST carry all five provenance components (namespace, path, timestamp, agent, context_convention):
claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                          # Ontology prefix
    path: /html/body/h1[1]                   # XPath/JSONPath to source
    timestamp: "2025-12-06T10:00:00Z"        # ISO 8601
    agent: ch_annotator-v1_7_0               # Extraction model
    context_convention: ch_annotator-v1_7_0  # This convention version
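The provenance requirement can be checked mechanically before a claim is stored. The sketch below assumes the claim has already been loaded into a Python dict with the same keys as the YAML example; the helper names `missing_provenance` and `timestamp_is_iso8601` are hypothetical and not part of the convention.

```python
from datetime import datetime

# The five provenance components required for every claim.
REQUIRED_PROVENANCE_KEYS = (
    "namespace",
    "path",
    "timestamp",
    "agent",
    "context_convention",
)


def missing_provenance(claim):
    """Return the required provenance keys that are absent or empty."""
    provenance = claim.get("provenance") or {}
    return [key for key in REQUIRED_PROVENANCE_KEYS if not provenance.get(key)]


def timestamp_is_iso8601(value):
    """Return True if the string parses as an ISO 8601 timestamp.

    A trailing 'Z' is rewritten to '+00:00' so the check also works on
    Python versions whose fromisoformat() does not accept 'Z'.
    """
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return True


# The claim from the example above, as a dict.
example_claim = {
    "claim_type": "full_name",
    "claim_value": "Rijksmuseum Amsterdam",
    "provenance": {
        "namespace": "skos",
        "path": "/html/body/h1[1]",
        "timestamp": "2025-12-06T10:00:00Z",
        "agent": "ch_annotator-v1_7_0",
        "context_convention": "ch_annotator-v1_7_0",
    },
}
```

An empty list from `missing_provenance(example_claim)` means all five components are present; any returned key names identify what the extractor failed to record.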
Usage in GLAM Project
When to Use CH-Annotator
- Extracting entities from Claude conversation JSON files
- Annotating web-scraped heritage institution pages
- Processing archival finding aids
- Extracting entities from PDF documents
- Annotating ISIL registry entries
When NOT to Use CH-Annotator
- Simple YAML data restructuring (no NER needed)
- Identifier-only extraction (use regex patterns)
- Geographic enrichment (use GeoNames directly)
Provenance Reference
When using CH-Annotator, reference it in extraction metadata:
provenance:
  data_source: CONVERSATION_NLP
  extraction_method: ch_annotator-v1_7_0
  extraction_date: "2025-12-06T10:00:00Z"
  confidence_score: 0.92
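One way to produce this metadata consistently is a small builder function. This is a sketch under assumptions: the function name `build_provenance` is hypothetical, and the 0.0 to 1.0 range check on the confidence score is an assumption inferred from the example value, not a rule stated by the convention.

```python
from datetime import datetime, timezone

CONVENTION_ID = "ch_annotator-v1_7_0"


def build_provenance(data_source, confidence_score, when=None):
    """Build the extraction metadata block shown above as a dict.

    `when` should be a timezone-aware datetime; it defaults to the
    current UTC time. The timestamp is rendered in UTC with a 'Z' suffix.
    """
    # Assumption: confidence scores are probabilities in [0.0, 1.0].
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be between 0.0 and 1.0")
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "data_source": data_source,
        "extraction_method": CONVENTION_ID,
        "extraction_date": stamp,
        "confidence_score": confidence_score,
    }
```

Keeping the convention ID in one constant means a later version bump only has to change a single line.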
Integration with LinkML Schema
CH-Annotator aligns with the project's LinkML schema:
| CH-Annotator Type | LinkML Class | Schema File |
|---|---|---|
| GRP.HER | HeritageCustodian | schemas/core.yaml |
| TOP | Location | schemas/core.yaml |
| APP.IDE | Identifier | schemas/core.yaml |
| TMP.DAB | ChangeEvent.event_date | schemas/provenance.yaml |
| WRK | Collection | schemas/collections.yaml |
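The alignment table can be used as a lookup when converting annotations into LinkML instances. The sketch below is illustrative only: the function name `linkml_target` is hypothetical, and the fallback from a subtype to its nearest listed ancestor (for example GRP.HER.MUS resolving to the GRP.HER row) is an assumption rather than documented behaviour.

```python
# CH-Annotator type -> (LinkML class, schema file), from the table above.
LINKML_TARGETS = {
    "GRP.HER": ("HeritageCustodian", "schemas/core.yaml"),
    "TOP": ("Location", "schemas/core.yaml"),
    "APP.IDE": ("Identifier", "schemas/core.yaml"),
    "TMP.DAB": ("ChangeEvent.event_date", "schemas/provenance.yaml"),
    "WRK": ("Collection", "schemas/collections.yaml"),
}


def linkml_target(type_code):
    """Return (LinkML class, schema file) for a type code, or None.

    Tries the full dotted code first, then drops trailing parts one at a
    time until a listed type is found (assumed prefix fallback).
    """
    parts = type_code.strip().upper().split(".")
    while parts:
        hit = LINKML_TARGETS.get(".".join(parts))
        if hit is not None:
            return hit
        parts.pop()
    return None
```

Types without a row in the table, such as a bare GRP or TMP.DUR, return `None` so the caller can decide how to handle unmapped entities.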
See Also
- data/entity_annotation/ch_annotator-v1_7_0.yaml - Full convention
- data/entity_annotation/modules/ - Modular hypernym definitions
- schemas/20251121/linkml/modules/classes/WebClaim.yaml - Claim schema
- .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath provenance rules
- AGENTS.md - Rule 10 (CH-Annotator usage)
Version: 1.0
Last Updated: 2025-12-06