# CH-Annotator Batch Application Report

**Date**: 2025-12-06

**Convention**: ch_annotator-v1_7_0

**Annotation Agent**: opencode-claude-sonnet-4

## Summary

Successfully applied the CH-Annotator (Cultural Heritage Annotator) convention to 25 heritage institution datasets across multiple countries and regions.

### Key Statistics

| Metric | Value |
|--------|-------|
| Files Processed | 25 |
| Files Skipped | 0 |
| Files Failed | 1 (Denmark - empty file) |
| **Total Institutions** | **25,224** |
| **Total Claims** | **92,269** |
| Total Lines Generated | ~2.5 million |

### Hypernym Distribution

| Hypernym Code | Description | Count |
|---------------|-------------|-------|
| GRP.HER.LIB | Libraries | 17,931 |
| GRP.HER.MUS | Museums | 5,116 |
| GRP.HER.ARC | Archives | 1,055 |
| GRP.HER | Unknown Heritage | 457 |
| GRP.EDU | Educational | 209 |
| GRP.HER.OFF | Official Institutions | 211 |
| GRP.HER.MIX | Mixed Type | 106 |
| GRP.HER.HOL | Holy Sites | 56 |
| GRP.HER.GAL | Galleries | 43 |
| GRP.HER.RES | Research Centers | 32 |
| GRP.HER.PER | Personal Collections | 8 |

## Datasets Processed

### North Africa (5 datasets, 215 institutions)

- `algeria/algerian_institutions_ch_annotator.yaml` - 19 institutions, 91 claims
- `egypt_institutions_ch_annotator.yaml` - 29 institutions, 92 claims
- `libya/libyan_institutions_ch_annotator.yaml` - 50 institutions, 197 claims
- `morocco/moroccan_institutions_ch_annotator.yaml` - 49 institutions, 151 claims
- `tunisia/tunisian_institutions_enhanced_ch_annotator.yaml` - 68 institutions, 324 claims

### Europe (13 datasets, 12,168 institutions)

- `austria_complete_ch_annotator.yaml` - 223 institutions, 602 claims
- `belarus_complete_ch_annotator.yaml` - 167 institutions, 506 claims
- `belgium_complete_ch_annotator.yaml` - 421 institutions, 943 claims
- `bulgaria_complete_ch_annotator.yaml` - 94 institutions, 313 claims
- `czech_unified_ch_annotator.yaml` - 8,694 institutions, 40,369 claims
- `netherlands_complete_ch_annotator.yaml` - 153 institutions, 571 claims
- `norway/city_archives_ch_annotator.yaml` - 4 institutions, 12 claims
- `norway/county_archives_ch_annotator.yaml` - 6 institutions, 18 claims
- `norway/museums_oslo_ch_annotator.yaml` - 6 institutions, 18 claims
- `switzerland_isil_ch_annotator.yaml` - 2,379 institutions, 4,758 claims
- `georgia_glam_institutions_enriched_ch_annotator.yaml` - 14 institutions, 53 claims
- `great_britain/gb_institutions_enriched_manual_ch_annotator.yaml` - 4 institutions, 15 claims
- `italy/it_institutions_enriched_manual_ch_annotator.yaml` - 3 institutions, 14 claims

### Asia (3 datasets, 12,125 institutions)

- `japan_complete_ch_annotator.yaml` - 12,064 institutions, 40,538 claims
- `vietnamese_glam_institutions_ch_annotator.yaml` - 21 institutions, 71 claims
- `palestinian_heritage_custodians_ch_annotator.yaml` - 40 institutions, 126 claims

### Americas (4 datasets, 716 institutions)

- `argentina_complete_ch_annotator.yaml` - 288 institutions, 901 claims
- `latin_american_institutions_AUTHORITATIVE_ch_annotator.yaml` - 304 institutions, 1,313 claims
- `mexico/mexican_institutions_curated_ch_annotator.yaml` - 117 institutions, 238 claims
- `united_states/us_institutions_enriched_manual_ch_annotator.yaml` - 7 institutions, 35 claims

## CH-Annotator Features Applied

Each institution now includes a `ch_annotator` block with:

### 1. Entity Classification

```yaml
entity_classification:
  hypernym: GRP
  hypernym_label: GROUP
  subtype: GRP.HER.MUS
  subtype_label: MUSEUM
  ontology_class: schema:Museum
  alternative_classes:
    - org:FormalOrganization
    - rov:RegisteredOrganization
    - glam:HeritageCustodian
```

### 2. Extraction Provenance (5-Component Model)

```yaml
extraction_provenance:
  namespace: glam
  path: /conversations/{conversation_id}
  timestamp: 2025-11-09T00:00:00Z
  agent: claude-conversation  # Original extraction agent
  context_convention: ch_annotator-v1_7_0
```

### 3. Annotation Provenance

```yaml
annotation_provenance:
  annotation_agent: opencode-claude-sonnet-4  # Model applying CH-Annotator
  annotation_date: 2025-12-06T21:47:33.634752+00:00
  annotation_method: retroactive CH-Annotator application via batch script
  source_file: algerian_institutions_ghcid.yaml
```

### 4. Entity Claims

Each institution has claims for:

- `full_name` (skos:prefLabel)
- `institution_type` (rdf:type)
- `located_in_city` (schema:addressLocality)
- `wikidata_id` (owl:sameAs) - when available
- `ghcid` (glam:ghcid) - when available

## Scripts Created

| Script | Purpose |
|--------|---------|
| `scripts/apply_ch_annotator_algeria.py` | Original Algeria-specific script |
| `scripts/apply_ch_annotator_batch.py` | Generalized batch processing script |

## Agent Field Correction

The original script incorrectly hardcoded `agent: claude-3.5-sonnet`. This was corrected to:

1. **extraction_provenance.agent**: `claude-conversation` (generic identifier for conversation-based extractions, since Claude exports don't specify the model version)
2. **annotation_provenance.annotation_agent**: `opencode-claude-sonnet-4` (the model actually applying CH-Annotator annotations)

This separation distinguishes between:

- The **original extraction agent** (which performed the NER/entity extraction from conversations)
- The **annotation agent** (which applied the CH-Annotator convention retroactively)

## Next Steps

1. **Validate** annotated files against the LinkML schema
2. **Integrate** CH-Annotator metadata into the RDF export pipeline
3. **Apply** to remaining country-specific batch files (Chile batches 1-20, Brazil batches, etc.)
4. **Update** extraction pipelines to generate CH-Annotator metadata natively (not retroactively)

## Files Location

All CH-Annotator enhanced files are located at:

```
/Users/kempersc/apps/glam/data/instances/*_ch_annotator.yaml
```

## Convention Reference

- Convention ID: `ch_annotator-v1_7_0`
- Convention File: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- Documentation: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- Quick Reference: `docs/CH_ANNOTATOR_QUICK_REFERENCE.md`
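
## Illustrative Sketch: Separating the Two Provenance Blocks

The split between the extraction agent and the annotation agent described under Agent Field Correction could look roughly like the following in Python. This is a minimal sketch and not the code in `scripts/apply_ch_annotator_batch.py`: the function name `build_ch_annotator_block`, the input field names (`name`, `city`, and so on), the `entity_claims` key, and the shape of each claim dict are all assumptions made for illustration. Only the provenance keys, agent identifiers, and claim predicates come from this report.

```python
from datetime import datetime, timezone

CONVENTION = "ch_annotator-v1_7_0"

# Claim name -> (input field, predicate). The predicates are listed in this
# report; the input field names are ASSUMED, not taken from the real datasets.
CLAIM_FIELDS = {
    "full_name": ("name", "skos:prefLabel"),
    "institution_type": ("institution_type", "rdf:type"),
    "located_in_city": ("city", "schema:addressLocality"),
    "wikidata_id": ("wikidata_id", "owl:sameAs"),
    "ghcid": ("ghcid", "glam:ghcid"),
}


def build_ch_annotator_block(record, source_file, conversation_id, extracted_at):
    """Build a ch_annotator-style dict for one record (hypothetical layout)."""
    claims = [
        {"claim_type": claim, "predicate": predicate, "value": record[field]}
        for claim, (field, predicate) in CLAIM_FIELDS.items()
        if record.get(field)  # claims with a missing or empty value are skipped
    ]
    return {
        "extraction_provenance": {
            "namespace": "glam",
            "path": f"/conversations/{conversation_id}",
            "timestamp": extracted_at,
            "agent": "claude-conversation",  # original extraction agent
            "context_convention": CONVENTION,
        },
        "annotation_provenance": {
            "annotation_agent": "opencode-claude-sonnet-4",  # agent applying the convention
            "annotation_date": datetime.now(timezone.utc).isoformat(),
            "annotation_method": "retroactive CH-Annotator application via batch script",
            "source_file": source_file,
        },
        "entity_claims": claims,  # ASSUMED key name
    }


# Made-up record, for illustration only.
example = build_ch_annotator_block(
    {"name": "Example Museum", "institution_type": "MUSEUM", "city": "Algiers"},
    source_file="algerian_institutions_ghcid.yaml",
    conversation_id="example-conversation",
    extracted_at="2025-11-09T00:00:00Z",
)
```

Keeping the two agents in separate blocks means a later re-annotation only needs to rewrite `annotation_provenance`, while the record of who performed the original extraction stays untouched.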