CH-Annotator Batch Application Report
Date: 2025-12-06
Convention: ch_annotator-v1_7_0
Annotation Agent: opencode-claude-sonnet-4
Summary
Successfully applied the CH-Annotator (Cultural Heritage Annotator) convention to 25 heritage institution datasets across multiple countries and regions.
Key Statistics
| Metric | Value |
|---|---|
| Files Processed | 25 |
| Files Skipped | 0 |
| Files Failed | 1 (Denmark - empty file) |
| Total Institutions | 25,224 |
| Total Claims | 92,269 |
| Total Lines Generated | ~2.5 million |
Hypernym Distribution
| Hypernym Code | Description | Count |
|---|---|---|
| GRP.HER.LIB | Libraries | 17,931 |
| GRP.HER.MUS | Museums | 5,116 |
| GRP.HER.ARC | Archives | 1,055 |
| GRP.HER | Unknown Heritage | 457 |
| GRP.HER.OFF | Official Institutions | 211 |
| GRP.EDU | Educational | 209 |
| GRP.HER.MIX | Mixed Type | 106 |
| GRP.HER.HOL | Holy Sites | 56 |
| GRP.HER.GAL | Galleries | 43 |
| GRP.HER.RES | Research Centers | 32 |
| GRP.HER.PER | Personal Collections | 8 |
Datasets Processed
North Africa (5 datasets, 215 institutions)
- algeria/algerian_institutions_ch_annotator.yaml - 19 institutions, 91 claims
- egypt_institutions_ch_annotator.yaml - 29 institutions, 92 claims
- libya/libyan_institutions_ch_annotator.yaml - 50 institutions, 197 claims
- morocco/moroccan_institutions_ch_annotator.yaml - 49 institutions, 151 claims
- tunisia/tunisian_institutions_enhanced_ch_annotator.yaml - 68 institutions, 324 claims
Europe (13 datasets, 12,168 institutions)
- austria_complete_ch_annotator.yaml - 223 institutions, 602 claims
- belarus_complete_ch_annotator.yaml - 167 institutions, 506 claims
- belgium_complete_ch_annotator.yaml - 421 institutions, 943 claims
- bulgaria_complete_ch_annotator.yaml - 94 institutions, 313 claims
- czech_unified_ch_annotator.yaml - 8,694 institutions, 40,369 claims
- netherlands_complete_ch_annotator.yaml - 153 institutions, 571 claims
- norway/city_archives_ch_annotator.yaml - 4 institutions, 12 claims
- norway/county_archives_ch_annotator.yaml - 6 institutions, 18 claims
- norway/museums_oslo_ch_annotator.yaml - 6 institutions, 18 claims
- switzerland_isil_ch_annotator.yaml - 2,379 institutions, 4,758 claims
- georgia_glam_institutions_enriched_ch_annotator.yaml - 14 institutions, 53 claims
- great_britain/gb_institutions_enriched_manual_ch_annotator.yaml - 4 institutions, 15 claims
- italy/it_institutions_enriched_manual_ch_annotator.yaml - 3 institutions, 14 claims
Asia (3 datasets, 12,125 institutions)
- japan_complete_ch_annotator.yaml - 12,064 institutions, 40,538 claims
- vietnamese_glam_institutions_ch_annotator.yaml - 21 institutions, 71 claims
- palestinian_heritage_custodians_ch_annotator.yaml - 40 institutions, 126 claims
Americas (4 datasets, 716 institutions)
- argentina_complete_ch_annotator.yaml - 288 institutions, 901 claims
- latin_american_institutions_AUTHORITATIVE_ch_annotator.yaml - 304 institutions, 1,313 claims
- mexico/mexican_institutions_curated_ch_annotator.yaml - 117 institutions, 238 claims
- united_states/us_institutions_enriched_manual_ch_annotator.yaml - 7 institutions, 35 claims
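A quick way to cross-check the per-dataset figures against the Key Statistics table is to sum them by region. The sketch below is illustrative only: the counts are copied from the lists above, and the names REGIONS and region_totals are hypothetical, not taken from the batch scripts.

```python
# (institutions, claims) per dataset, copied from the lists in this report.
REGIONS = {
    "North Africa": [(19, 91), (29, 92), (50, 197), (49, 151), (68, 324)],
    "Europe": [
        (223, 602), (167, 506), (421, 943), (94, 313), (8694, 40369),
        (153, 571), (4, 12), (6, 18), (6, 18), (2379, 4758),
        (14, 53), (4, 15), (3, 14),
    ],
    "Asia": [(12064, 40538), (21, 71), (40, 126)],
    "Americas": [(288, 901), (304, 1313), (117, 238), (7, 35)],
}


def region_totals(regions):
    """Return {region: (datasets, institutions, claims)}."""
    return {
        name: (len(rows), sum(r[0] for r in rows), sum(r[1] for r in rows))
        for name, rows in regions.items()
    }


totals = region_totals(REGIONS)
grand_datasets = sum(t[0] for t in totals.values())
grand_institutions = sum(t[1] for t in totals.values())
grand_claims = sum(t[2] for t in totals.values())

for name, (datasets, institutions, claims) in totals.items():
    print(f"{name}: {datasets} datasets, {institutions:,} institutions, {claims:,} claims")
```

Summed this way, the listed files give 25 datasets, 25,224 institutions and 92,269 claims in total, which agrees with the Key Statistics table.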
CH-Annotator Features Applied
Each institution now includes a ch_annotator block with:
1. Entity Classification
entity_classification:
  hypernym: GRP
  hypernym_label: GROUP
  subtype: GRP.HER.MUS
  subtype_label: MUSEUM
  ontology_class: schema:Museum
  alternative_classes:
    - org:FormalOrganization
    - rov:RegisteredOrganization
    - glam:HeritageCustodian
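As a rough illustration of how such a block could be assembled, the sketch below derives the hypernym from the first segment of the dotted subtype code. The function name, its signature, and the HYPERNYM_LABELS lookup are assumptions made for this example; the batch script may build the block differently.

```python
# Only the GRP -> GROUP pair appears in this report; any other hypernym
# labels would have to come from the convention file (assumption).
HYPERNYM_LABELS = {"GRP": "GROUP"}


def build_entity_classification(subtype, subtype_label, ontology_class,
                                alternative_classes=()):
    """Build an entity_classification mapping from a dotted subtype code."""
    hypernym = subtype.split(".", 1)[0]  # "GRP.HER.MUS" -> "GRP"
    return {
        "hypernym": hypernym,
        "hypernym_label": HYPERNYM_LABELS.get(hypernym, hypernym),
        "subtype": subtype,
        "subtype_label": subtype_label,
        "ontology_class": ontology_class,
        "alternative_classes": list(alternative_classes),
    }


museum = build_entity_classification(
    "GRP.HER.MUS",
    "MUSEUM",
    "schema:Museum",
    ["org:FormalOrganization", "rov:RegisteredOrganization", "glam:HeritageCustodian"],
)
```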
2. Extraction Provenance (5-Component Model)
extraction_provenance:
  namespace: glam
  path: /conversations/{conversation_id}
  timestamp: 2025-11-09T00:00:00Z
  agent: claude-conversation  # Original extraction agent
  context_convention: ch_annotator-v1_7_0
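The five components can be sketched as a small builder that fills the path template from a conversation ID. This is a hypothetical helper written for illustration; the function name, parameters, and the "example-id" value are not from the batch script.

```python
def build_extraction_provenance(conversation_id, timestamp,
                                agent="claude-conversation",
                                convention="ch_annotator-v1_7_0"):
    """Return the five extraction-provenance components as a mapping."""
    return {
        "namespace": "glam",
        "path": f"/conversations/{conversation_id}",
        "timestamp": timestamp,
        "agent": agent,
        "context_convention": convention,
    }


# "example-id" is a placeholder, not a real conversation ID.
extraction = build_extraction_provenance("example-id", "2025-11-09T00:00:00Z")
```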
3. Annotation Provenance
annotation_provenance:
  annotation_agent: opencode-claude-sonnet-4  # Model applying CH-Annotator
  annotation_date: 2025-12-06T21:47:33.634752+00:00
  annotation_method: retroactive CH-Annotator application via batch script
  source_file: algerian_institutions_ghcid.yaml
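The annotation_date above has the shape of a timezone-aware UTC timestamp in ISO 8601 form. A minimal sketch of producing such a block, assuming a hypothetical helper named build_annotation_provenance (the real script's structure is not shown in this report):

```python
from datetime import datetime, timezone


def build_annotation_provenance(source_file,
                                agent="opencode-claude-sonnet-4",
                                now=None):
    """Return an annotation_provenance mapping stamped with a UTC time."""
    moment = now or datetime.now(timezone.utc)
    return {
        "annotation_agent": agent,
        "annotation_date": moment.isoformat(),
        "annotation_method": "retroactive CH-Annotator application via batch script",
        "source_file": source_file,
    }


# A fixed time is passed here only to make the example reproducible.
fixed = datetime(2025, 12, 6, 21, 47, 33, 634752, tzinfo=timezone.utc)
annotation = build_annotation_provenance("algerian_institutions_ghcid.yaml", now=fixed)
```

Passing the clock in as a parameter keeps the helper deterministic under test while defaulting to the current UTC time in a batch run.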
4. Entity Claims
Each institution has claims for:
- full_name (skos:prefLabel)
- institution_type (rdf:type)
- located_in_city (schema:addressLocality)
- wikidata_id (owl:sameAs) - when available
- ghcid (glam:ghcid) - when available
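The claim list above can be sketched as a lookup from claim type to property, with claims emitted only for values that are present. The record keys, the output field names (claim_type, property, value), and the sample values are assumptions for illustration; the actual claim structure is defined by the convention file.

```python
# Claim type -> property, as listed in this report.
CLAIM_PROPERTIES = {
    "full_name": "skos:prefLabel",
    "institution_type": "rdf:type",
    "located_in_city": "schema:addressLocality",
    "wikidata_id": "owl:sameAs",
    "ghcid": "glam:ghcid",
}


def build_claims(record):
    """Return one claim per known field that has a non-empty value."""
    claims = []
    for claim_type, prop in CLAIM_PROPERTIES.items():
        value = record.get(claim_type)
        if value:  # missing or empty values produce no claim
            claims.append({"claim_type": claim_type, "property": prop, "value": value})
    return claims


# Made-up record: no wikidata_id or ghcid, so only three claims result.
sample = {
    "full_name": "Example Museum",
    "institution_type": "MUSEUM",
    "located_in_city": "Example City",
    "wikidata_id": None,
}
claims = build_claims(sample)
```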
Scripts Created
| Script | Purpose |
|---|---|
| scripts/apply_ch_annotator_algeria.py | Original Algeria-specific script |
| scripts/apply_ch_annotator_batch.py | Generalized batch processing script |
Agent Field Correction
The original script incorrectly hardcoded agent: claude-3.5-sonnet. This was corrected to:
- extraction_provenance.agent: claude-conversation (generic identifier for conversation-based extractions, since Claude exports don't specify model version)
- annotation_provenance.annotation_agent: opencode-claude-sonnet-4 (the model actually applying CH-Annotator annotations)
This separation distinguishes between:
- The original extraction agent (which performed the NER/entity extraction from conversations)
- The annotation agent (which applied the CH-Annotator convention retroactively)
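A correction of this kind could be applied to already-annotated records along the following lines. This is a sketch under stated assumptions: the report does not say which field held the hardcoded value, so the code assumes it was extraction_provenance.agent, and the function name correct_agents is hypothetical.

```python
import copy

LEGACY_AGENT = "claude-3.5-sonnet"            # value the original script hardcoded
EXTRACTION_AGENT = "claude-conversation"
ANNOTATION_AGENT = "opencode-claude-sonnet-4"


def correct_agents(ch_annotator):
    """Return a copy of a ch_annotator block with the two agents separated."""
    fixed = copy.deepcopy(ch_annotator)  # leave the input untouched
    extraction = fixed.setdefault("extraction_provenance", {})
    # Assumption: the legacy value was stored in extraction_provenance.agent.
    if extraction.get("agent") == LEGACY_AGENT:
        extraction["agent"] = EXTRACTION_AGENT
    annotation = fixed.setdefault("annotation_provenance", {})
    annotation["annotation_agent"] = ANNOTATION_AGENT
    return fixed


legacy = {"extraction_provenance": {"namespace": "glam", "agent": LEGACY_AGENT}}
corrected = correct_agents(legacy)
```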
Next Steps
- Validate annotated files against LinkML schema
- Integrate CH-Annotator metadata into RDF export pipeline
- Apply to remaining country-specific batch files (Chile batches 1-20, Brazil batches, etc.)
- Update extraction pipelines to generate CH-Annotator metadata natively (not retroactively)
Files Location
All CH-Annotator enhanced files are located at:
/Users/kempersc/apps/glam/data/instances/*_ch_annotator.yaml
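To enumerate the annotated files programmatically, something like the sketch below may help. Because several datasets listed in this report sit in per-country subdirectories, it assumes a recursive search is wanted; the helper name find_annotated_files is hypothetical, and the demo runs against a throwaway directory rather than the real data path.

```python
import tempfile
from pathlib import Path


def find_annotated_files(root):
    """Return sorted paths of *_ch_annotator.yaml files under root, recursively."""
    return sorted(Path(root).rglob("*_ch_annotator.yaml"))


# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "algeria").mkdir()
    (base / "algeria" / "algerian_institutions_ch_annotator.yaml").touch()
    (base / "austria_complete_ch_annotator.yaml").touch()
    (base / "algerian_institutions_ghcid.yaml").touch()  # source file, not matched
    found = [p.relative_to(base).as_posix() for p in find_annotated_files(base)]

print(found)
```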
Convention Reference
- Convention ID: ch_annotator-v1_7_0
- Convention File: data/entity_annotation/ch_annotator-v1_7_0.yaml
- Documentation: .opencode/CH_ANNOTATOR_CONVENTION.md
- Quick Reference: docs/CH_ANNOTATOR_QUICK_REFERENCE.md