CH-Annotator Batch Application Report

Date: 2025-12-06
Convention: ch_annotator-v1_7_0
Annotation Agent: opencode-claude-sonnet-4

Summary

The CH-Annotator (Cultural Heritage Annotator) convention was applied to 25 heritage institution datasets across multiple countries and regions. One file (Denmark) failed because it was empty.

Key Statistics

| Metric | Value |
| --- | --- |
| Files Processed | 25 |
| Files Skipped | 0 |
| Files Failed | 1 (Denmark - empty file) |
| Total Institutions | 25,224 |
| Total Claims | 92,269 |
| Total Lines Generated | ~2.5 million |

Hypernym Distribution

| Hypernym Code | Description | Count |
| --- | --- | --- |
| GRP.HER.LIB | Libraries | 17,931 |
| GRP.HER.MUS | Museums | 5,116 |
| GRP.HER.ARC | Archives | 1,055 |
| GRP.HER | Unknown Heritage | 457 |
| GRP.HER.OFF | Official Institutions | 211 |
| GRP.EDU | Educational | 209 |
| GRP.HER.MIX | Mixed Type | 106 |
| GRP.HER.HOL | Holy Sites | 56 |
| GRP.HER.GAL | Galleries | 43 |
| GRP.HER.RES | Research Centers | 32 |
| GRP.HER.PER | Personal Collections | 8 |
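
A distribution like this one can be recomputed from the annotated records. The sketch below assumes each record has already been loaded into a Python dict and carries the `ch_annotator.entity_classification.subtype` layout described in this report. The function name, the `UNCLASSIFIED` fallback, and the sample records are hypothetical.

```python
from collections import Counter

def tally_subtypes(institutions):
    """Count institutions per CH-Annotator subtype code."""
    counts = Counter()
    for inst in institutions:
        classification = inst.get("ch_annotator", {}).get("entity_classification", {})
        # Records without a subtype are counted under a fallback key (assumption).
        counts[classification.get("subtype", "UNCLASSIFIED")] += 1
    return counts

# Made-up sample records, for illustration only.
sample = [
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.LIB"}}},
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.MUS"}}},
    {"ch_annotator": {"entity_classification": {"subtype": "GRP.HER.LIB"}}},
]
for code, count in tally_subtypes(sample).most_common():
    print(code, count)
```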

Datasets Processed

North Africa (5 datasets, 215 institutions)

  • algeria/algerian_institutions_ch_annotator.yaml - 19 institutions, 91 claims
  • egypt_institutions_ch_annotator.yaml - 29 institutions, 92 claims
  • libya/libyan_institutions_ch_annotator.yaml - 50 institutions, 197 claims
  • morocco/moroccan_institutions_ch_annotator.yaml - 49 institutions, 151 claims
  • tunisia/tunisian_institutions_enhanced_ch_annotator.yaml - 68 institutions, 324 claims

Europe (13 datasets, 12,168 institutions)

  • austria_complete_ch_annotator.yaml - 223 institutions, 602 claims
  • belarus_complete_ch_annotator.yaml - 167 institutions, 506 claims
  • belgium_complete_ch_annotator.yaml - 421 institutions, 943 claims
  • bulgaria_complete_ch_annotator.yaml - 94 institutions, 313 claims
  • czech_unified_ch_annotator.yaml - 8,694 institutions, 40,369 claims
  • netherlands_complete_ch_annotator.yaml - 153 institutions, 571 claims
  • norway/city_archives_ch_annotator.yaml - 4 institutions, 12 claims
  • norway/county_archives_ch_annotator.yaml - 6 institutions, 18 claims
  • norway/museums_oslo_ch_annotator.yaml - 6 institutions, 18 claims
  • switzerland_isil_ch_annotator.yaml - 2,379 institutions, 4,758 claims
  • georgia_glam_institutions_enriched_ch_annotator.yaml - 14 institutions, 53 claims
  • great_britain/gb_institutions_enriched_manual_ch_annotator.yaml - 4 institutions, 15 claims
  • italy/it_institutions_enriched_manual_ch_annotator.yaml - 3 institutions, 14 claims

Asia (3 datasets, 12,125 institutions)

  • japan_complete_ch_annotator.yaml - 12,064 institutions, 40,538 claims
  • vietnamese_glam_institutions_ch_annotator.yaml - 21 institutions, 71 claims
  • palestinian_heritage_custodians_ch_annotator.yaml - 40 institutions, 126 claims

Americas (4 datasets, 716 institutions)

  • argentina_complete_ch_annotator.yaml - 288 institutions, 901 claims
  • latin_american_institutions_AUTHORITATIVE_ch_annotator.yaml - 304 institutions, 1,313 claims
  • mexico/mexican_institutions_curated_ch_annotator.yaml - 117 institutions, 238 claims
  • united_states/us_institutions_enriched_manual_ch_annotator.yaml - 7 institutions, 35 claims
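
The per-file counts can be cross-checked against the totals in Key Statistics by summing them per region. The sketch below copies the (institutions, claims) pairs from the lists above; the helper name `region_totals` is hypothetical.

```python
# (institutions, claims) per file, copied from the dataset lists in this report.
DATASETS = {
    "North Africa": [(19, 91), (29, 92), (50, 197), (49, 151), (68, 324)],
    "Europe": [(223, 602), (167, 506), (421, 943), (94, 313), (8694, 40369),
               (153, 571), (4, 12), (6, 18), (6, 18), (2379, 4758),
               (14, 53), (4, 15), (3, 14)],
    "Asia": [(12064, 40538), (21, 71), (40, 126)],
    "Americas": [(288, 901), (304, 1313), (117, 238), (7, 35)],
}

def region_totals(datasets):
    """Return {region: (file_count, institutions, claims)}."""
    return {
        region: (len(rows), sum(i for i, _ in rows), sum(c for _, c in rows))
        for region, rows in datasets.items()
    }

totals = region_totals(DATASETS)
grand_files = sum(t[0] for t in totals.values())
grand_institutions = sum(t[1] for t in totals.values())
grand_claims = sum(t[2] for t in totals.values())
print(grand_files, grand_institutions, grand_claims)  # 25 25224 92269
```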

CH-Annotator Features Applied

Each institution now includes a ch_annotator block with:

1. Entity Classification

entity_classification:
  hypernym: GRP
  hypernym_label: GROUP
  subtype: GRP.HER.MUS
  subtype_label: MUSEUM
  ontology_class: schema:Museum
  alternative_classes:
    - org:FormalOrganization
    - rov:RegisteredOrganization
    - glam:HeritageCustodian
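
As a rough illustration of how such a block could be assembled, the sketch below derives the hypernym from the first segment of the dotted subtype code. The function name is hypothetical, and only the `GRP` to `GROUP` label is taken from this report; labels for other hypernyms are not known here.

```python
def build_entity_classification(subtype, subtype_label, ontology_class,
                                alternative_classes=()):
    """Assemble an entity_classification block from a dotted subtype code."""
    # The hypernym is the first segment of the code, e.g. "GRP" in "GRP.HER.MUS".
    hypernym = subtype.split(".")[0]
    # Only GRP -> GROUP appears in the report; unknown codes fall back to the code itself.
    hypernym_labels = {"GRP": "GROUP"}
    return {
        "hypernym": hypernym,
        "hypernym_label": hypernym_labels.get(hypernym, hypernym),
        "subtype": subtype,
        "subtype_label": subtype_label,
        "ontology_class": ontology_class,
        "alternative_classes": list(alternative_classes),
    }

block = build_entity_classification(
    "GRP.HER.MUS", "MUSEUM", "schema:Museum",
    ["org:FormalOrganization", "rov:RegisteredOrganization", "glam:HeritageCustodian"],
)
print(block["hypernym"], block["hypernym_label"], block["subtype"])
```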

2. Extraction Provenance (5-Component Model)

extraction_provenance:
  namespace: glam
  path: /conversations/{conversation_id}
  timestamp: 2025-11-09T00:00:00Z
  agent: claude-conversation  # Original extraction agent
  context_convention: ch_annotator-v1_7_0
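
A simple completeness check for the five components might look like the sketch below. It assumes the block has been loaded into a dict with the keys shown above; the function name is hypothetical.

```python
# The five components of the extraction provenance block shown above.
REQUIRED_COMPONENTS = ("namespace", "path", "timestamp", "agent", "context_convention")

def missing_provenance_components(provenance):
    """Return the names of required components that are absent or empty."""
    return [key for key in REQUIRED_COMPONENTS if not provenance.get(key)]

example = {
    "namespace": "glam",
    "path": "/conversations/{conversation_id}",
    "timestamp": "2025-11-09T00:00:00Z",
    "agent": "claude-conversation",
    "context_convention": "ch_annotator-v1_7_0",
}
print(missing_provenance_components(example))  # []
print(missing_provenance_components({"namespace": "glam"}))
```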

3. Annotation Provenance

annotation_provenance:
  annotation_agent: opencode-claude-sonnet-4  # Model applying CH-Annotator
  annotation_date: 2025-12-06T21:47:33.634752+00:00
  annotation_method: retroactive CH-Annotator application via batch script
  source_file: algerian_institutions_ghcid.yaml

4. Entity Claims

Each institution has claims for:

  • full_name (skos:prefLabel)
  • institution_type (rdf:type)
  • located_in_city (schema:addressLocality)
  • wikidata_id (owl:sameAs) - when available
  • ghcid (glam:ghcid) - when available
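
The "when available" rule can be sketched as follows. The claim-type-to-property mapping is taken from the list above; the output keys (`claim_type`, `property`, `value`) and the sample record are hypothetical and may not match the structure the batch script actually writes.

```python
# Claim type -> property, as listed above.
CLAIM_PROPERTIES = {
    "full_name": "skos:prefLabel",
    "institution_type": "rdf:type",
    "located_in_city": "schema:addressLocality",
    "wikidata_id": "owl:sameAs",
    "ghcid": "glam:ghcid",
}

def build_claims(values):
    """Build one claim per claim type that has a non-empty value."""
    return [
        {"claim_type": claim_type, "property": prop, "value": values[claim_type]}
        for claim_type, prop in CLAIM_PROPERTIES.items()
        if values.get(claim_type)
    ]

claims = build_claims({
    "full_name": "Example Museum",    # made-up record for illustration
    "institution_type": "MUSEUM",
    "located_in_city": "Example City",
    "wikidata_id": None,              # not available, so no claim is emitted
})
print(len(claims))  # 3
```

Skipping empty values in this way means the number of claims per institution varies between records.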

Scripts Created

| Script | Purpose |
| --- | --- |
| scripts/apply_ch_annotator_algeria.py | Original Algeria-specific script |
| scripts/apply_ch_annotator_batch.py | Generalized batch processing script |

Agent Field Correction

The original script incorrectly hardcoded agent: claude-3.5-sonnet. This was corrected to:

  1. extraction_provenance.agent: claude-conversation (generic identifier for conversation-based extractions, since Claude exports don't specify model version)
  2. annotation_provenance.annotation_agent: opencode-claude-sonnet-4 (the model actually applying CH-Annotator annotations)

This separation distinguishes between:

  • The original extraction agent (which performed the NER/entity extraction from conversations)
  • The annotation agent (which applied the CH-Annotator convention retroactively)
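
A minimal sketch of this separation, assuming the field names and values shown earlier in this report: the function below is hypothetical and is not the code in `scripts/apply_ch_annotator_batch.py`. It builds the two blocks side by side so that each agent is recorded in its own place.

```python
from datetime import datetime, timezone

def build_provenance(conversation_id, extraction_timestamp, source_file):
    """Return (extraction_provenance, annotation_provenance) with separate agents."""
    extraction = {
        "namespace": "glam",
        "path": f"/conversations/{conversation_id}",
        "timestamp": extraction_timestamp,
        "agent": "claude-conversation",                  # who extracted the entity
        "context_convention": "ch_annotator-v1_7_0",
    }
    annotation = {
        "annotation_agent": "opencode-claude-sonnet-4",  # who applied CH-Annotator
        "annotation_date": datetime.now(timezone.utc).isoformat(),
        "annotation_method": "retroactive CH-Annotator application via batch script",
        "source_file": source_file,
    }
    return extraction, annotation

extraction, annotation = build_provenance(
    "example-conversation-id",  # placeholder, not a real conversation ID
    "2025-11-09T00:00:00Z",
    "algerian_institutions_ghcid.yaml",
)
print(extraction["agent"], annotation["annotation_agent"])
```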

Next Steps

  1. Validate annotated files against LinkML schema
  2. Integrate CH-Annotator metadata into RDF export pipeline
  3. Apply to remaining country-specific batch files (Chile batches 1-20, Brazil batches, etc.)
  4. Update extraction pipelines to generate CH-Annotator metadata natively (not retroactively)

Files Location

All CH-Annotator enhanced files are located at:

/Users/kempersc/apps/glam/data/instances/*_ch_annotator.yaml
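
Several of the files listed under Datasets Processed sit in subdirectories (for example `algeria/`), which a single-level `*` glob would not reach. The sketch below shows one way to collect them recursively, assuming those relative paths are rooted at the instances directory; the function name is hypothetical.

```python
from pathlib import Path

def find_annotated_files(instances_dir):
    """Return all *_ch_annotator.yaml files under instances_dir, sorted."""
    root = Path(instances_dir)
    if not root.is_dir():
        return []
    # rglob descends into subdirectories such as algeria/ or norway/.
    return sorted(root.rglob("*_ch_annotator.yaml"))
```

Calling `find_annotated_files("data/instances")` from the repository root would then return both top-level and per-country files.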

Convention Reference

  • Convention ID: ch_annotator-v1_7_0
  • Convention File: data/entity_annotation/ch_annotator-v1_7_0.yaml
  • Documentation: .opencode/CH_ANNOTATOR_CONVENTION.md
  • Quick Reference: docs/CH_ANNOTATOR_QUICK_REFERENCE.md