# CH-Annotator: Cultural Heritage Entity Annotation Convention

**ID**: `ch_annotator-v1_7_0`
**Version**: 1.7.0
**Status**: PRODUCTION
**Date**: 2025-12-02
**Renamed**: 2025-12-06

---

## Overview

CH-Annotator (Cultural Heritage Annotator) is the project's comprehensive convention for:

- Named Entity Recognition (NER)
- Property Extraction
- Entity Resolution
- Entity Linking
- Claim Validation

This convention applies to ALL text sources in the GLAM project.

---

## Naming History

| Date | Name | File |
|------|------|------|
| 2025-12-02 | GLAM-NER v1.7.0-unified | `entity_annotation_rules_v1.6.0_unified.yaml` |
| 2025-12-06 | CH-Annotator v1.7.0 | `ch_annotator-v1_7_0.yaml` |

**Rename Rationale**:
- "GLAM-NER" was ambiguous (could be confused with a Python NER library)
- "CH-Annotator" clearly indicates Cultural Heritage domain focus
- File naming now follows project snake_case conventions with version

---

## File Locations

### Primary Files

| File | Description |
|------|-------------|
| `data/entity_annotation/ch_annotator-v1_7_0.yaml` | **MAIN FILE** - Complete self-contained convention (2500+ lines) |
| `data/entity_annotation/modules/index.yaml` | Modular schema index for hypernym imports |

### Module Structure

```
data/entity_annotation/
├── ch_annotator-v1_7_0.yaml                      # Main convention file (complete)
├── entity_annotation_rules_v1.6.0_unified.yaml   # Legacy name (deprecated)
└── modules/
    ├── index.yaml                    # Module index
    ├── core/
    │   ├── convention.yaml           # Convention metadata
    │   └── namespaces.yaml           # Ontology prefixes
    ├── hypernyms/
    │   ├── agt.yaml                  # AGENT (AGT)
    │   ├── grp.yaml                  # GROUP (GRP)
    │   ├── top.yaml                  # TOPONYM (TOP)
    │   ├── geo.yaml                  # GEOMETRY (GEO)
    │   ├── tmp.yaml                  # TEMPORAL (TMP)
    │   ├── app.yaml                  # APPELLATION (APP)
    │   ├── rol.yaml                  # ROLE (ROL)
    │   ├── wrk.yaml                  # WORK (WRK)
    │   ├── qty.yaml                  # QUANTITY (QTY)
    │   └── thg.yaml                  # THING (THG)
    ├── processing/
    │   ├── exclusions.yaml
    │   ├── double_tagging.yaml
    │   └── relationships.yaml
    ├── integrations/
    │   ├── pico.yaml                 # PiCo ontology
    │   └── nif_nerd.yaml             # NIF/NERD compatibility
    └── advanced/
        ├── document_structure.yaml   # DOC hypernym (30+ types)
        ├── coreference.yaml
        ├── uncertainty.yaml
        └── tei/                      # TEI P5 modules
            ├── core.yaml
            ├── namesdates.yaml
            ├── msdescription.yaml
            └── linking.yaml
```

---

## Hypernym Entity Types

CH-Annotator defines **10 hypernym categories** (11 including the DOC hypernym for document regions):

| Code | Hypernym | Description | Primary Ontology Class |
|------|----------|-------------|------------------------|
| **AGT** | AGENT | Humans, AI agents, animals, fictional beings | `crm:E39_Actor` |
| **GRP** | GROUP | Formal/informal collectives of agents | `crm:E74_Group` |
| **TOP** | TOPONYM | Place names as nominal references | `crm:E53_Place` |
| **GEO** | GEOMETRY | Coordinates, polygons, spatial primitives | `geo:Geometry` |
| **TMP** | TEMPORAL | TimeML/TIMEX3 temporal expressions | `crm:E52_Time-Span` |
| **APP** | APPELLATION | Names, titles, awards, structured names | `crm:E41_Appellation` |
| **ROL** | ROLE | Occupations, honorifics, positions | `org:Role` |
| **WRK** | WORK | FRBR Work/Expression/Manifestation/Item | `frbroo:F1_Work` |
| **QTY** | QUANTITY | Counts, measurements, currency, ranges | `crm:E54_Dimension` |
| **THG** | THING | Artworks, artifacts, events, concepts | `crm:E70_Thing` |

### Heritage Institution Subtype (GRP.HER)

For heritage custodians specifically:

```yaml
GRP.HER:
  name: HERITAGE_CUSTODIAN
  description: Heritage institutions managing cultural collections
  class_uri: cpov:PublicOrganisation
  subtypes:
    - GRP.HER.GAL  # Gallery
    - GRP.HER.LIB  # Library
    - GRP.HER.ARC  # Archive
    - GRP.HER.MUS  # Museum
    - GRP.HER.OFF  # Official institution
    - GRP.HER.RES  # Research center
    - GRP.HER.COR  # Corporation
    # ... (matches GLAMORCUBESFIXPHDNT taxonomy)
```
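
The dotted type codes above (`GRP.HER.MUS` and similar) can be handled programmatically with a small parser. The following is a minimal sketch, not part of the convention itself: the names `TypeCode`, `parse_type_code`, and `is_within` are hypothetical, and only the top-level hypernym is checked against the table because the full subtype lists live in the YAML modules.

```python
from dataclasses import dataclass

# Hypernym codes copied from the hypernym table in this document.
HYPERNYMS = {
    "AGT": "AGENT", "GRP": "GROUP", "TOP": "TOPONYM", "GEO": "GEOMETRY",
    "TMP": "TEMPORAL", "APP": "APPELLATION", "ROL": "ROLE", "WRK": "WORK",
    "QTY": "QUANTITY", "THG": "THING",
}


@dataclass(frozen=True)
class TypeCode:
    """Hypothetical helper type: a hypernym code plus its subtype segments."""
    hypernym: str     # e.g. "GRP"
    subtypes: tuple   # e.g. ("HER", "MUS")

    @property
    def hypernym_name(self) -> str:
        return HYPERNYMS[self.hypernym]


def parse_type_code(code: str) -> TypeCode:
    """Split a dotted type code and validate its leading hypernym."""
    head, *rest = code.split(".")
    if head not in HYPERNYMS:
        raise ValueError(f"unknown hypernym code: {head!r}")
    if any(not part for part in rest):
        raise ValueError(f"empty segment in type code: {code!r}")
    return TypeCode(head, tuple(rest))


def is_within(code: str, ancestor: str) -> bool:
    """True if `code` equals `ancestor` or is nested below it."""
    c, a = parse_type_code(code), parse_type_code(ancestor)
    return c.hypernym == a.hypernym and c.subtypes[: len(a.subtypes)] == a.subtypes


museum = parse_type_code("GRP.HER.MUS")
print(museum.hypernym_name, museum.subtypes)  # GROUP ('HER', 'MUS')
print(is_within("GRP.HER.MUS", "GRP.HER"))    # True
```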

---

## Digital Humanities Authority Stack

CH-Annotator prioritizes Digital Humanities standards over web-centric NER:

### Primary Authorities

| Authority | Usage |
|-----------|-------|
| **TEI P5** | Document structure, person/place/org names, temporal expressions |
| **CIDOC-CRM 7.1.3** | Cultural heritage entity modeling, events, temporal entities |
| **TimeML/TIMEX3** | Temporal expression annotation (DATE, TIME, DURATION, SET) |
| **FRBR/LRM** | Work/Expression/Manifestation/Item for bibliographic entities |
| **GeoSPARQL** | Spatial geometry representation in RDF |
| **Pleiades** | Historical and ancient world toponyms |

### Secondary Authorities

| Authority | Usage |
|-----------|-------|
| **W3C Org** | Organizational structure, roles, memberships |
| **RiC-O** | Archival description and record relationships |
| **PNV** | Structured person name components |
| **PiCo** | Person observations in historical sources |

### Deprecated (Interchange Only)

| System | Status |
|--------|--------|
| **NERD** | Retained for NLP pipeline interchange, NOT authoritative |

---

## Breaking Changes in v1.7.0

### Hypernym Renames

| Old Name | New Name | Code | Rationale |
|----------|----------|------|-----------|
| BEING | AGENT | AGT | CIDOC-CRM E39_Actor (includes non-humans) |
| ORGANISATION | GROUP | GRP | CIDOC-CRM E74_Group (formal + informal) |
| PLACE | TOPONYM | TOP | Nominal references only |
| (new) | GEOMETRY | GEO | Coordinate/shape data split from PLACE |
| TEXTUAL_REFERENCE | WORK | WRK | FRBR model instead of nerd:Product |
| (new) | ROLE | ROL | TEI roleName + PiCo concepts |
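
Annotations written against the old names need relabeling. The sketch below shows one way to apply the rename table; it is illustrative only. The function name `migrate_label`, the dotted `NAME.SUBTYPE` label format for old annotations, and the sample subtype `CITY` are assumptions, not something the convention specifies. Because GEOMETRY was split out of PLACE, the sketch flags every former PLACE label for manual review instead of guessing between TOP and GEO.

```python
# Old hypernym name -> new code, copied from the rename table.
RENAMED = {
    "BEING": "AGT",
    "ORGANISATION": "GRP",
    "PLACE": "TOP",
    "TEXTUAL_REFERENCE": "WRK",
}

# PLACE was split into TOPONYM and GEOMETRY, so its labels need a human check.
NEEDS_REVIEW = {"PLACE"}


def migrate_label(old_label: str) -> tuple:
    """Return (new_label, needs_review) for a pre-1.7.0 label.

    Only the leading hypernym is rewritten; any subtype suffix is assumed
    (hypothetically) to carry over unchanged.
    """
    head, sep, tail = old_label.partition(".")
    new_label = RENAMED.get(head, head) + sep + tail
    return new_label, head in NEEDS_REVIEW


print(migrate_label("BEING"))        # ('AGT', False)
print(migrate_label("PLACE.CITY"))   # ('TOP.CITY', True)
```

Labels that are not in the rename table pass through unchanged, so the function is safe to run over data that is already partly migrated.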

### Temporal Restructuring (TimeML/TIMEX3)

| Type Code | Description |
|-----------|-------------|
| TMP.DAB | Datable (absolute timestamps, fully resolved) |
| TMP.DRL | Deictic/Relative (require context) |
| TMP.DUR | Durations |
| TMP.SET | Recurring/periodic times |
| TMP.RNG | Explicit start-end ranges |

---

## Claim Provenance Model

Every extracted claim MUST carry provenance with five components:

```yaml
claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                          # Ontology prefix
    path: /html/body/h1[1]                   # XPath/JSONPath to source
    timestamp: "2025-12-06T10:00:00Z"        # ISO 8601
    agent: ch_annotator-v1_7_0               # Extraction model
    context_convention: ch_annotator-v1_7_0  # This convention version
```
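
A check for the five required components could look like the sketch below. It is a minimal illustration, assuming the claim has already been loaded into a plain `dict` with `provenance` nested under the claim; the function name `provenance_errors` is hypothetical and the project's actual validation may differ.

```python
from datetime import datetime

# The five provenance components named in the example above.
REQUIRED = ("namespace", "path", "timestamp", "agent", "context_convention")


def provenance_errors(claim: dict) -> list:
    """Return a list of problems; an empty list means the provenance is complete."""
    prov = claim.get("provenance") or {}
    errors = [f"missing provenance.{key}" for key in REQUIRED if not prov.get(key)]

    ts = prov.get("timestamp")
    if ts:
        try:
            # fromisoformat() accepts a trailing "Z" only from Python 3.11,
            # so normalize it to an explicit UTC offset first.
            datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"provenance.timestamp is not ISO 8601: {ts!r}")
    return errors


claim = {
    "claim_type": "full_name",
    "claim_value": "Rijksmuseum Amsterdam",
    "provenance": {
        "namespace": "skos",
        "path": "/html/body/h1[1]",
        "timestamp": "2025-12-06T10:00:00Z",
        "agent": "ch_annotator-v1_7_0",
        "context_convention": "ch_annotator-v1_7_0",
    },
}
print(provenance_errors(claim))  # []
```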

---

## Usage in GLAM Project

### When to Use CH-Annotator

- Extracting entities from Claude conversation JSON files
- Annotating web-scraped heritage institution pages
- Processing archival finding aids
- Extracting entities from PDF documents
- Annotating ISIL registry entries

### When NOT to Use CH-Annotator

- Simple YAML data restructuring (no NER needed)
- Identifier-only extraction (use regex patterns)
- Geographic enrichment (use GeoNames directly)
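
As an illustration of the identifier-only case, the sketch below pulls ISIL-like codes out of text with a regular expression. ISIL is chosen only because the document mentions ISIL registry entries. The pattern is a simplified assumption (two-letter country prefix, hyphen, up to 11 identifier characters) and not a full ISO 15511 validator; the function name and sample strings are illustrative.

```python
import re

# Simplified, assumed ISIL shape: two uppercase letters, a hyphen, then
# 1 to 11 characters from letters, digits, ":", "/" and "-".
ISIL_PATTERN = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9][A-Za-z0-9:/-]{0,10}\b")


def find_isil_candidates(text: str) -> list:
    """Return ISIL-like substrings in order of appearance."""
    return ISIL_PATTERN.findall(text)


sample = "Registered as NL-AmRMA; see also US-DLC."
print(find_isil_candidates(sample))  # ['NL-AmRMA', 'US-DLC']
```

Matches are candidates only: a string can have the right shape without being a registered identifier, so a lookup against the relevant registry is still needed before a match is treated as a claim.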

### Provenance Reference

When using CH-Annotator, reference it in extraction metadata:

```yaml
provenance:
  data_source: CONVERSATION_NLP
  extraction_method: ch_annotator-v1_7_0
  extraction_date: "2025-12-06T10:00:00Z"
  confidence_score: 0.92
```

---

## Integration with LinkML Schema

CH-Annotator aligns with the project's LinkML schema:

| CH-Annotator Type | LinkML Class | Schema File |
|-------------------|--------------|-------------|
| GRP.HER | HeritageCustodian | schemas/core.yaml |
| TOP | Location | schemas/core.yaml |
| APP.IDE | Identifier | schemas/core.yaml |
| TMP.DAB | ChangeEvent.event_date | schemas/provenance.yaml |
| WRK | Collection | schemas/collections.yaml |

---

## See Also

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - Full convention
- `data/entity_annotation/modules/` - Modular hypernym definitions
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Claim schema
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath provenance rules
- `AGENTS.md` - Rule 10 (CH-Annotator usage)

---

**Version**: 1.0
**Last Updated**: 2025-12-06