# CH-Annotator Batch Application Report

**Date**: 2025-12-06
**Convention**: ch_annotator-v1_7_0
**Annotation Agent**: opencode-claude-sonnet-4

## Summary

Applied the CH-Annotator (Cultural Heritage Annotator) convention to 25 heritage institution datasets covering North Africa, Europe, Asia, and the Americas. One additional file (Denmark) failed because it was empty.

### Key Statistics

| Metric | Value |
|--------|-------|
| Files Processed | 25 |
| Files Skipped | 0 |
| Files Failed | 1 (Denmark - empty file) |
| **Total Institutions** | **25,224** |
| **Total Claims** | **92,269** |
| Total Lines Generated | ~2.5 million |

### Hypernym Distribution

| Hypernym Code | Description | Count |
|---------------|-------------|-------|
| GRP.HER.LIB | Libraries | 17,931 |
| GRP.HER.MUS | Museums | 5,116 |
| GRP.HER.ARC | Archives | 1,055 |
| GRP.HER | Unknown Heritage | 457 |
| GRP.HER.OFF | Official Institutions | 211 |
| GRP.EDU | Educational | 209 |
| GRP.HER.MIX | Mixed Type | 106 |
| GRP.HER.HOL | Holy Sites | 56 |
| GRP.HER.GAL | Galleries | 43 |
| GRP.HER.RES | Research Centers | 32 |
| GRP.HER.PER | Personal Collections | 8 |

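A distribution like the one above could be tallied by counting the `subtype` code in each record's `entity_classification` block (the block is described under "CH-Annotator Features Applied"). The following is a minimal sketch, assuming records are already loaded as dictionaries; `tally_subtypes`, the `"(missing)"` bucket, and the sample records are hypothetical and are not taken from the batch script.

```python
from collections import Counter


def tally_subtypes(records):
    """Count institutions per hypernym subtype code (e.g. GRP.HER.MUS)."""
    counts = Counter()
    for record in records:
        classification = record.get("ch_annotator", {}).get("entity_classification", {})
        # Assumption: records without a subtype are grouped under a placeholder key
        counts[classification.get("subtype", "(missing)")] += 1
    return counts


# Hypothetical sample records, shaped like the ch_annotator block in this report
sample_records = [
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.MUS"}}},
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.LIB"}}},
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.MUS"}}},
    {"ch_annotator": {}},
]

distribution = tally_subtypes(sample_records)
for code, count in distribution.most_common():
    print(code, count)
```
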
## Datasets Processed

### North Africa (5 datasets, 215 institutions)

- `algeria/algerian_institutions_ch_annotator.yaml` - 19 institutions, 91 claims
- `egypt_institutions_ch_annotator.yaml` - 29 institutions, 92 claims
- `libya/libyan_institutions_ch_annotator.yaml` - 50 institutions, 197 claims
- `morocco/moroccan_institutions_ch_annotator.yaml` - 49 institutions, 151 claims
- `tunisia/tunisian_institutions_enhanced_ch_annotator.yaml` - 68 institutions, 324 claims

### Europe (13 datasets, 12,168 institutions)

- `austria_complete_ch_annotator.yaml` - 223 institutions, 602 claims
- `belarus_complete_ch_annotator.yaml` - 167 institutions, 506 claims
- `belgium_complete_ch_annotator.yaml` - 421 institutions, 943 claims
- `bulgaria_complete_ch_annotator.yaml` - 94 institutions, 313 claims
- `czech_unified_ch_annotator.yaml` - 8,694 institutions, 40,369 claims
- `netherlands_complete_ch_annotator.yaml` - 153 institutions, 571 claims
- `norway/city_archives_ch_annotator.yaml` - 4 institutions, 12 claims
- `norway/county_archives_ch_annotator.yaml` - 6 institutions, 18 claims
- `norway/museums_oslo_ch_annotator.yaml` - 6 institutions, 18 claims
- `switzerland_isil_ch_annotator.yaml` - 2,379 institutions, 4,758 claims
- `georgia_glam_institutions_enriched_ch_annotator.yaml` - 14 institutions, 53 claims
- `great_britain/gb_institutions_enriched_manual_ch_annotator.yaml` - 4 institutions, 15 claims
- `italy/it_institutions_enriched_manual_ch_annotator.yaml` - 3 institutions, 14 claims

### Asia (3 datasets, 12,125 institutions)

- `japan_complete_ch_annotator.yaml` - 12,064 institutions, 40,538 claims
- `vietnamese_glam_institutions_ch_annotator.yaml` - 21 institutions, 71 claims
- `palestinian_heritage_custodians_ch_annotator.yaml` - 40 institutions, 126 claims

### Americas (4 datasets, 716 institutions)

- `argentina_complete_ch_annotator.yaml` - 288 institutions, 901 claims
- `latin_american_institutions_AUTHORITATIVE_ch_annotator.yaml` - 304 institutions, 1,313 claims
- `mexico/mexican_institutions_curated_ch_annotator.yaml` - 117 institutions, 238 claims
- `united_states/us_institutions_enriched_manual_ch_annotator.yaml` - 7 institutions, 35 claims

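As a cross-check, the per-dataset figures listed in this section can be summed and compared with the totals under Key Statistics. The sketch below copies the (institutions, claims) pairs from the lists above; the variable names are illustrative only.

```python
# (institutions, claims) pairs copied from the per-region dataset lists
regions = {
    "North Africa": [(19, 91), (29, 92), (50, 197), (49, 151), (68, 324)],
    "Europe": [
        (223, 602), (167, 506), (421, 943), (94, 313), (8694, 40369),
        (153, 571), (4, 12), (6, 18), (6, 18), (2379, 4758),
        (14, 53), (4, 15), (3, 14),
    ],
    "Asia": [(12064, 40538), (21, 71), (40, 126)],
    "Americas": [(288, 901), (304, 1313), (117, 238), (7, 35)],
}

total_files = sum(len(datasets) for datasets in regions.values())
total_institutions = sum(inst for datasets in regions.values() for inst, _ in datasets)
total_claims = sum(claims for datasets in regions.values() for _, claims in datasets)

print(total_files)         # → 25
print(total_institutions)  # → 25224
print(total_claims)        # → 92269
```

These sums agree with the Files Processed, Total Institutions, and Total Claims rows in Key Statistics.
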
## CH-Annotator Features Applied

Each institution now includes a `ch_annotator` block with:

### 1. Entity Classification

```yaml
entity_classification:
  hypernym: GRP
  hypernym_label: GROUP
  subtype: GRP.HER.MUS
  subtype_label: MUSEUM
  ontology_class: schema:Museum
  alternative_classes:
    - org:FormalOrganization
    - rov:RegisteredOrganization
    - glam:HeritageCustodian
```

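In the example above, the `hypernym` appears to be the first dot-separated segment of the `subtype` code. Here is a minimal sketch under that assumption; `build_entity_classification` and the two label tables are hypothetical and contain only the labels shown in this report.

```python
# Labels shown in this report; a real table would cover every code in the convention
HYPERNYM_LABELS = {"GRP": "GROUP"}
SUBTYPE_LABELS = {"GRP.HER.MUS": "MUSEUM"}


def build_entity_classification(subtype, ontology_class, alternative_classes=()):
    """Build an entity_classification mapping from a subtype code."""
    # Assumption: the hypernym is the leading segment of the subtype code
    hypernym = subtype.split(".")[0]
    return {
        "hypernym": hypernym,
        "hypernym_label": HYPERNYM_LABELS.get(hypernym),
        "subtype": subtype,
        "subtype_label": SUBTYPE_LABELS.get(subtype),
        "ontology_class": ontology_class,
        "alternative_classes": list(alternative_classes),
    }


block = build_entity_classification(
    "GRP.HER.MUS",
    "schema:Museum",
    ["org:FormalOrganization", "rov:RegisteredOrganization", "glam:HeritageCustodian"],
)
print(block["hypernym"], block["subtype_label"])  # → GRP MUSEUM
```
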
### 2. Extraction Provenance (5-Component Model)

```yaml
extraction_provenance:
  namespace: glam
  path: /conversations/{conversation_id}
  timestamp: 2025-11-09T00:00:00Z
  agent: claude-conversation  # Original extraction agent
  context_convention: ch_annotator-v1_7_0
```

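The five components are the five keys shown in the block above. A simple completeness check might look like the following sketch; `missing_provenance_components` is a hypothetical helper, not part of the batch script.

```python
# The five keys that appear in the extraction_provenance example above
REQUIRED_COMPONENTS = ("namespace", "path", "timestamp", "agent", "context_convention")


def missing_provenance_components(block):
    """Return the names of required components that are absent or empty."""
    return [name for name in REQUIRED_COMPONENTS if not block.get(name)]


example = {
    "namespace": "glam",
    "path": "/conversations/{conversation_id}",
    "timestamp": "2025-11-09T00:00:00Z",
    "agent": "claude-conversation",
    "context_convention": "ch_annotator-v1_7_0",
}
print(missing_provenance_components(example))  # → []
```
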
### 3. Annotation Provenance

```yaml
annotation_provenance:
  annotation_agent: opencode-claude-sonnet-4  # Model applying CH-Annotator
  annotation_date: 2025-12-06T21:47:33.634752+00:00
  annotation_method: retroactive CH-Annotator application via batch script
  source_file: algerian_institutions_ghcid.yaml
```

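The `annotation_date` format above matches what Python's `datetime.isoformat()` produces for a timezone-aware UTC timestamp. The sketch below assembles such a block on that basis; `build_annotation_provenance` is a hypothetical name, and the `now` parameter exists only to make the example reproducible.

```python
from datetime import datetime, timezone


def build_annotation_provenance(source_file, agent="opencode-claude-sonnet-4", now=None):
    """Build an annotation_provenance mapping for one source file."""
    # Use a timezone-aware UTC timestamp so isoformat() includes the +00:00 offset
    stamp = now if now is not None else datetime.now(timezone.utc)
    return {
        "annotation_agent": agent,
        "annotation_date": stamp.isoformat(),
        "annotation_method": "retroactive CH-Annotator application via batch script",
        "source_file": source_file,
    }


fixed_time = datetime(2025, 12, 6, 21, 47, 33, 634752, tzinfo=timezone.utc)
provenance = build_annotation_provenance("algerian_institutions_ghcid.yaml", now=fixed_time)
print(provenance["annotation_date"])  # → 2025-12-06T21:47:33.634752+00:00
```
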
### 4. Entity Claims

Each institution has claims for:

- `full_name` (skos:prefLabel)
- `institution_type` (rdf:type)
- `located_in_city` (schema:addressLocality)
- `wikidata_id` (owl:sameAs), when available
- `ghcid` (glam:ghcid), when available

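The claim list above pairs each claim type with a predicate, and two of the claims are optional. One way this could be expressed is sketched below. The shape of each claim record (`claim_type`, `predicate`, `value`) and the `build_claims` helper are assumptions for illustration, and the sample values are invented.

```python
# Claim types and predicates as listed in this report
CLAIM_PREDICATES = {
    "full_name": "skos:prefLabel",
    "institution_type": "rdf:type",
    "located_in_city": "schema:addressLocality",
    "wikidata_id": "owl:sameAs",
    "ghcid": "glam:ghcid",
}


def build_claims(values):
    """Emit one claim per known claim type that has a non-empty value."""
    return [
        {"claim_type": claim_type, "predicate": predicate, "value": values[claim_type]}
        for claim_type, predicate in CLAIM_PREDICATES.items()
        if values.get(claim_type) not in (None, "")
    ]


# Hypothetical institution with no Wikidata ID or GHCID available
claims = build_claims({
    "full_name": "Example Museum",
    "institution_type": "MUSEUM",
    "located_in_city": "Example City",
    "wikidata_id": None,
})
print(len(claims))  # → 3
```
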
## Scripts Created

| Script | Purpose |
|--------|---------|
| `scripts/apply_ch_annotator_algeria.py` | Original Algeria-specific script |
| `scripts/apply_ch_annotator_batch.py` | Generalized batch processing script |

## Agent Field Correction

The original script incorrectly hardcoded `agent: claude-3.5-sonnet`. This was corrected by splitting the value into two fields:

1. **extraction_provenance.agent**: `claude-conversation` (a generic identifier for conversation-based extractions, since Claude exports do not specify the model version)
2. **annotation_provenance.annotation_agent**: `opencode-claude-sonnet-4` (the model that actually applied the CH-Annotator annotations)

This separation distinguishes between:

- The **original extraction agent**, which performed the NER/entity extraction from conversations
- The **annotation agent**, which applied the CH-Annotator convention retroactively

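The correction described above can be sketched as a small function that rewrites both agent fields on a `ch_annotator` block. This report does not show how the batch script performs the fix, so `correct_agent_fields` is a hypothetical illustration, and it assumes the hardcoded value originally sat in `extraction_provenance.agent`.

```python
import copy


def correct_agent_fields(block,
                         extraction_agent="claude-conversation",
                         annotation_agent="opencode-claude-sonnet-4"):
    """Return a copy of a ch_annotator block with the two agent roles separated."""
    fixed = copy.deepcopy(block)  # leave the caller's data untouched
    fixed.setdefault("extraction_provenance", {})["agent"] = extraction_agent
    fixed.setdefault("annotation_provenance", {})["annotation_agent"] = annotation_agent
    return fixed


# Assumed shape of a block written by the original script
old_block = {"extraction_provenance": {"agent": "claude-3.5-sonnet"}}
new_block = correct_agent_fields(old_block)
print(new_block["extraction_provenance"]["agent"])            # → claude-conversation
print(new_block["annotation_provenance"]["annotation_agent"])  # → opencode-claude-sonnet-4
```
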
## Next Steps

1. **Validate** annotated files against the LinkML schema
2. **Integrate** CH-Annotator metadata into the RDF export pipeline
3. **Apply** to remaining country-specific batch files (Chile batches 1-20, Brazil batches, etc.)
4. **Update** extraction pipelines to generate CH-Annotator metadata natively (not retroactively)

## Files Location

All CH-Annotator enhanced files are located at:

```
/Users/kempersc/apps/glam/data/instances/*_ch_annotator.yaml
```

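Several of the datasets listed earlier sit in country subdirectories (for example `algeria/` and `norway/`), so a single-level `*` pattern may not match every file. The sketch below uses a recursive search on the assumption that those subdirectories live under the instances directory; `find_annotated_files` is a hypothetical helper.

```python
from pathlib import Path


def find_annotated_files(root):
    """Return all *_ch_annotator.yaml files under root, including subdirectories."""
    return sorted(Path(root).rglob("*_ch_annotator.yaml"))


if __name__ == "__main__":
    # Path taken from this report; adjust for the local checkout
    for path in find_annotated_files("/Users/kempersc/apps/glam/data/instances"):
        print(path)
```
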
## Convention Reference

- Convention ID: `ch_annotator-v1_7_0`
- Convention File: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- Documentation: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- Quick Reference: `docs/CH_ANNOTATOR_QUICK_REFERENCE.md`