glam/reports/CH_ANNOTATOR_BATCH_APPLICATION_REPORT.md
2025-12-07 00:26:01 +01:00

# CH-Annotator Batch Application Report
**Date**: 2025-12-06
**Convention**: ch_annotator-v1_7_0
**Annotation Agent**: opencode-claude-sonnet-4
## Summary
Successfully applied the CH-Annotator (Cultural Heritage Annotator) convention to 25 heritage institution datasets across multiple countries and regions.
### Key Statistics
| Metric | Value |
|--------|-------|
| Files Processed | 25 |
| Files Skipped | 0 |
| Files Failed | 1 (Denmark - empty file) |
| **Total Institutions** | **25,224** |
| **Total Claims** | **92,269** |
| Total Lines Generated | ~2.5 million |
### Hypernym Distribution
| Hypernym Code | Description | Count |
|---------------|-------------|-------|
| GRP.HER.LIB | Libraries | 17,931 |
| GRP.HER.MUS | Museums | 5,116 |
| GRP.HER.ARC | Archives | 1,055 |
| GRP.HER | Unknown Heritage | 457 |
| GRP.HER.OFF | Official Institutions | 211 |
| GRP.EDU | Educational | 209 |
| GRP.HER.MIX | Mixed Type | 106 |
| GRP.HER.HOL | Holy Sites | 56 |
| GRP.HER.GAL | Galleries | 43 |
| GRP.HER.RES | Research Centers | 32 |
| GRP.HER.PER | Personal Collections | 8 |
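As a quick consistency check, the hypernym counts can be summed and compared with the institution total from the Key Statistics table. The snippet below is only a sketch: the numbers are copied by hand from the two tables, and the variable names are illustrative rather than taken from the batch script.

```python
# Counts copied from the Hypernym Distribution table in this report.
hypernym_counts = {
    "GRP.HER.LIB": 17_931,
    "GRP.HER.MUS": 5_116,
    "GRP.HER.ARC": 1_055,
    "GRP.HER": 457,
    "GRP.HER.OFF": 211,
    "GRP.EDU": 209,
    "GRP.HER.MIX": 106,
    "GRP.HER.HOL": 56,
    "GRP.HER.GAL": 43,
    "GRP.HER.RES": 32,
    "GRP.HER.PER": 8,
}

# Total Institutions from the Key Statistics table.
total_institutions = 25_224

# Every institution should carry exactly one hypernym code.
hypernym_total = sum(hypernym_counts.values())
assert hypernym_total == total_institutions

# The most frequent code, useful for spotting a skewed distribution.
largest_code = max(hypernym_counts, key=hypernym_counts.get)
print(largest_code, hypernym_counts[largest_code])
```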
## Datasets Processed
### North Africa (5 datasets, 215 institutions)
- `algeria/algerian_institutions_ch_annotator.yaml` - 19 institutions, 91 claims
- `egypt_institutions_ch_annotator.yaml` - 29 institutions, 92 claims
- `libya/libyan_institutions_ch_annotator.yaml` - 50 institutions, 197 claims
- `morocco/moroccan_institutions_ch_annotator.yaml` - 49 institutions, 151 claims
- `tunisia/tunisian_institutions_enhanced_ch_annotator.yaml` - 68 institutions, 324 claims
### Europe (13 datasets, 12,168 institutions)
- `austria_complete_ch_annotator.yaml` - 223 institutions, 602 claims
- `belarus_complete_ch_annotator.yaml` - 167 institutions, 506 claims
- `belgium_complete_ch_annotator.yaml` - 421 institutions, 943 claims
- `bulgaria_complete_ch_annotator.yaml` - 94 institutions, 313 claims
- `czech_unified_ch_annotator.yaml` - 8,694 institutions, 40,369 claims
- `netherlands_complete_ch_annotator.yaml` - 153 institutions, 571 claims
- `norway/city_archives_ch_annotator.yaml` - 4 institutions, 12 claims
- `norway/county_archives_ch_annotator.yaml` - 6 institutions, 18 claims
- `norway/museums_oslo_ch_annotator.yaml` - 6 institutions, 18 claims
- `switzerland_isil_ch_annotator.yaml` - 2,379 institutions, 4,758 claims
- `georgia_glam_institutions_enriched_ch_annotator.yaml` - 14 institutions, 53 claims
- `great_britain/gb_institutions_enriched_manual_ch_annotator.yaml` - 4 institutions, 15 claims
- `italy/it_institutions_enriched_manual_ch_annotator.yaml` - 3 institutions, 14 claims
### Asia (3 datasets, 12,125 institutions)
- `japan_complete_ch_annotator.yaml` - 12,064 institutions, 40,538 claims
- `vietnamese_glam_institutions_ch_annotator.yaml` - 21 institutions, 71 claims
- `palestinian_heritage_custodians_ch_annotator.yaml` - 40 institutions, 126 claims
### Americas (4 datasets, 716 institutions)
- `argentina_complete_ch_annotator.yaml` - 288 institutions, 901 claims
- `latin_american_institutions_AUTHORITATIVE_ch_annotator.yaml` - 304 institutions, 1,313 claims
- `mexico/mexican_institutions_curated_ch_annotator.yaml` - 117 institutions, 238 claims
- `united_states/us_institutions_enriched_manual_ch_annotator.yaml` - 7 institutions, 35 claims
## CH-Annotator Features Applied
Each institution now includes a `ch_annotator` block with:
### 1. Entity Classification
```yaml
entity_classification:
  hypernym: GRP
  hypernym_label: GROUP
  subtype: GRP.HER.MUS
  subtype_label: MUSEUM
  ontology_class: schema:Museum
  alternative_classes:
    - org:FormalOrganization
    - rov:RegisteredOrganization
    - glam:HeritageCustodian
```
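The classification block above could be produced by a small lookup keyed on institution type. This is a minimal sketch, not the batch script's actual logic: the `SUBTYPES` table and the `classify` function are hypothetical, only `schema:Museum` is confirmed by this report, and the library and archive ontology classes are assumptions based on schema.org. Unrecognized types fall back to `GRP.HER`, matching the "Unknown Heritage" row in the distribution table.

```python
# Hypothetical lookup: subtype codes come from the hypernym table in this report.
# Only schema:Museum appears in the report; the other two classes are assumed.
SUBTYPES = {
    "MUSEUM": ("GRP.HER.MUS", "schema:Museum"),
    "LIBRARY": ("GRP.HER.LIB", "schema:Library"),  # assumed class
    "ARCHIVE": ("GRP.HER.ARC", "schema:ArchiveOrganization"),  # assumed class
}

# Alternative classes exactly as shown in the YAML example.
ALTERNATIVE_CLASSES = [
    "org:FormalOrganization",
    "rov:RegisteredOrganization",
    "glam:HeritageCustodian",
]


def classify(institution_type):
    """Build an entity_classification block for one institution type."""
    label = (institution_type or "").strip().upper()
    known = label in SUBTYPES
    subtype, ontology_class = SUBTYPES.get(label, ("GRP.HER", None))
    return {
        "entity_classification": {
            "hypernym": "GRP",
            "hypernym_label": "GROUP",
            "subtype": subtype,
            # "UNKNOWN" is a placeholder label chosen for this sketch.
            "subtype_label": label if known else "UNKNOWN",
            "ontology_class": ontology_class,
            "alternative_classes": list(ALTERNATIVE_CLASSES),
        }
    }


museum_block = classify("museum")
unknown_block = classify("botanical garden")
```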
### 2. Extraction Provenance (5-Component Model)
```yaml
extraction_provenance:
  namespace: glam
  path: /conversations/{conversation_id}
  timestamp: 2025-11-09T00:00:00Z
  agent: claude-conversation  # Original extraction agent
  context_convention: ch_annotator-v1_7_0
```
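One way to keep the five components together is a small immutable record. The sketch below assumes nothing beyond the field names and values in the YAML example; the `ExtractionProvenance` class and the `extraction_provenance_for` helper are hypothetical names, not part of the batch script.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractionProvenance:
    """The five provenance components shown in the YAML example."""

    namespace: str
    path: str
    timestamp: str
    agent: str
    context_convention: str


def extraction_provenance_for(conversation_id, timestamp):
    """Fill in the components for one source conversation (illustrative helper)."""
    record = ExtractionProvenance(
        namespace="glam",
        path=f"/conversations/{conversation_id}",
        timestamp=timestamp,
        agent="claude-conversation",
        context_convention="ch_annotator-v1_7_0",
    )
    return {"extraction_provenance": asdict(record)}


# "abc123" is a made-up conversation id used only for this example.
example = extraction_provenance_for("abc123", "2025-11-09T00:00:00Z")
```

Freezing the dataclass prevents a later processing step from silently overwriting the extraction agent.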
### 3. Annotation Provenance
```yaml
annotation_provenance:
  annotation_agent: opencode-claude-sonnet-4  # Model applying CH-Annotator
  annotation_date: 2025-12-06T21:47:33.634752+00:00
  annotation_method: retroactive CH-Annotator application via batch script
  source_file: algerian_institutions_ghcid.yaml
```
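The `annotation_date` format above matches what Python's `datetime.isoformat()` produces for a timezone-aware UTC timestamp. A minimal sketch of generating the block follows; `build_annotation_provenance` is a hypothetical helper, and the optional `now` parameter exists only to make the output reproducible in tests.

```python
from datetime import datetime, timezone


def build_annotation_provenance(source_file, now=None):
    """Return an annotation_provenance block stamped with a UTC time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "annotation_provenance": {
            "annotation_agent": "opencode-claude-sonnet-4",
            "annotation_date": now.isoformat(),
            "annotation_method": "retroactive CH-Annotator application via batch script",
            "source_file": source_file,
        }
    }


# Reproduce the timestamp from the YAML example with a fixed clock.
fixed_clock = datetime(2025, 12, 6, 21, 47, 33, 634752, tzinfo=timezone.utc)
block = build_annotation_provenance("algerian_institutions_ghcid.yaml", now=fixed_clock)
print(block["annotation_provenance"]["annotation_date"])
```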
### 4. Entity Claims
Each institution has claims for:
- `full_name` (skos:prefLabel)
- `institution_type` (rdf:type)
- `located_in_city` (schema:addressLocality)
- `wikidata_id` (owl:sameAs) - when available
- `ghcid` (glam:ghcid) - when available
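The claim list above can be sketched as a mapping from claim type to predicate, with a claim emitted only when a value is present. This is an illustration under assumptions: the report does not show the claim record layout, so the `claim_type`, `predicate`, and `value` keys are hypothetical, as is the assumption that the input record uses the claim names as its keys.

```python
# Claim types and predicates as listed in this report.
CLAIM_PREDICATES = {
    "full_name": "skos:prefLabel",
    "institution_type": "rdf:type",
    "located_in_city": "schema:addressLocality",
    "wikidata_id": "owl:sameAs",
    "ghcid": "glam:ghcid",
}


def build_claims(institution):
    """Emit one claim per available field; missing or empty fields are skipped."""
    claims = []
    for claim_type, predicate in CLAIM_PREDICATES.items():
        value = institution.get(claim_type)
        if value is None or value == "":
            continue
        # Hypothetical claim layout; the real convention may differ.
        claims.append(
            {"claim_type": claim_type, "predicate": predicate, "value": value}
        )
    return claims


# Made-up institution without a Wikidata ID or GHCID.
sample = {
    "full_name": "Example Museum",
    "institution_type": "MUSEUM",
    "located_in_city": "Algiers",
}
sample_claims = build_claims(sample)
```

Skipping absent fields is consistent with the per-dataset figures, where the number of claims per institution varies between files.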
## Scripts Created
| Script | Purpose |
|--------|---------|
| `scripts/apply_ch_annotator_algeria.py` | Original Algeria-specific script |
| `scripts/apply_ch_annotator_batch.py` | Generalized batch processing script |
## Agent Field Correction
The original script incorrectly hardcoded `agent: claude-3.5-sonnet`. This was corrected to:
1. **extraction_provenance.agent**: `claude-conversation` (generic identifier for conversation-based extractions, since Claude exports don't specify model version)
2. **annotation_provenance.annotation_agent**: `opencode-claude-sonnet-4` (the model actually applying CH-Annotator annotations)
This separation distinguishes between:
- The **original extraction agent** (which performed the NER/entity extraction from conversations)
- The **annotation agent** (which applied the CH-Annotator convention retroactively)
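A correction pass over already-annotated records might look like the sketch below. It assumes the hardcoded value ended up in `extraction_provenance.agent`, which the report implies but does not state outright; the function name and constants are illustrative.

```python
import copy

LEGACY_AGENT = "claude-3.5-sonnet"  # value the original script hardcoded
EXTRACTION_AGENT = "claude-conversation"
ANNOTATION_AGENT = "opencode-claude-sonnet-4"


def correct_agent_fields(ch_annotator):
    """Return a corrected copy of a ch_annotator block, leaving the input untouched."""
    fixed = copy.deepcopy(ch_annotator)

    # Assumption: the legacy value was written to extraction_provenance.agent.
    extraction = fixed.setdefault("extraction_provenance", {})
    if extraction.get("agent") == LEGACY_AGENT:
        extraction["agent"] = EXTRACTION_AGENT

    # The annotation agent is recorded separately from the extraction agent.
    annotation = fixed.setdefault("annotation_provenance", {})
    annotation["annotation_agent"] = ANNOTATION_AGENT
    return fixed


legacy_block = {"extraction_provenance": {"namespace": "glam", "agent": LEGACY_AGENT}}
corrected_block = correct_agent_fields(legacy_block)
```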
## Next Steps
1. **Validate** annotated files against LinkML schema
2. **Integrate** CH-Annotator metadata into RDF export pipeline
3. **Apply** to remaining country-specific batch files (Chile batches 1-20, Brazil batches, etc.)
4. **Update** extraction pipelines to generate CH-Annotator metadata natively (not retroactively)
## Files Location
All CH-Annotator enhanced files are located at:
```
/Users/kempersc/apps/glam/data/instances/*_ch_annotator.yaml
```
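Several datasets listed earlier live in country subdirectories (for example `algeria/` and `norway/`), so a recursive search is one way to collect every annotated file. The helper below is a sketch using only the standard library; `find_annotated_files` is a hypothetical name, and the directory is passed in rather than hardcoded.

```python
from pathlib import Path


def find_annotated_files(instances_dir):
    """Return all *_ch_annotator.yaml files under instances_dir, sorted by path."""
    root = Path(instances_dir)
    # rglob descends into subdirectories, unlike a flat glob on the top level.
    return sorted(root.rglob("*_ch_annotator.yaml"))
```

For example, `find_annotated_files("data/instances")` would be run from the repository root, assuming that relative layout.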
## Convention Reference
- Convention ID: `ch_annotator-v1_7_0`
- Convention File: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- Documentation: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- Quick Reference: `docs/CH_ANNOTATOR_QUICK_REFERENCE.md`