glam/schemas/20251121/linkml/modules/classes/VideoAnnotation.yaml
kempersc b34992b1d3 Migrate all 293 class files to ontology-aligned slots
Extends migration to all class types (museums, libraries, galleries, etc.)

New slots added to class_metadata_slots.yaml:
- RiC-O: rico_record_set_type, rico_organizational_principle,
  rico_has_or_had_holder, rico_note
- Multilingual: label_de, label_es, label_fr, label_nl, label_it, label_pt
- Scope: scope_includes, scope_excludes, custodian_only,
  organizational_level, geographic_restriction
- Notes: privacy_note, preservation_note, legal_note

Migration script now handles 30+ annotation types.
All migrated schemas pass linkml-validate.

Total: 387 class files now use proper slots instead of annotations.
2026-01-06 12:24:54 +01:00

id: https://nde.nl/ontology/hc/class/VideoAnnotation
name: video_annotation_class
title: Video Annotation Class
imports:
- linkml:types
- ./VideoTextContent
- ./VideoTimeSegment
- ../slots/class_metadata_slots
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  oa: http://www.w3.org/ns/oa#
  as: https://www.w3.org/ns/activitystreams#
default_prefix: hc
classes:
  VideoAnnotation:
    is_a: VideoTextContent
    class_uri: oa:Annotation
    abstract: true
    description: |
      Abstract base class for computer vision and multimodal video annotations.

      **DEFINITION**:
      VideoAnnotation represents structured information derived from visual
      analysis of video content. This includes:

      | Subclass | Analysis Type | Output |
      |----------|---------------|--------|
      | VideoSceneAnnotation | Shot/scene detection | Scene boundaries, types |
      | VideoObjectAnnotation | Object detection | Objects, faces, logos |
      | VideoOCRAnnotation | Text extraction | On-screen text (OCR) |

      **RELATIONSHIP TO W3C WEB ANNOTATION**:
      VideoAnnotation aligns with the W3C Web Annotation Data Model:

      ```turtle
      :annotation a oa:Annotation ;
        oa:hasBody :detection_result ;
        oa:hasTarget [
          oa:hasSource :video ;
          oa:hasSelector [
            a oa:FragmentSelector ;
            dcterms:conformsTo <http://www.w3.org/TR/media-frags/> ;
            rdf:value "t=30,35"
          ]
        ] ;
        oa:motivatedBy oa:classifying .
      ```

      **FRAME-BASED ANALYSIS**:
      Unlike audio transcription (a continuous stream), video annotation is
      typically frame-based:
      - `frame_sample_rate`: frames analyzed per second (e.g., 1 fps, 5 fps)
      - `total_frames_analyzed`: total frames processed
      - Higher sample rates yield more detections but at higher compute cost
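      As a worked sketch (hypothetical values, not part of the schema): for a
      60-second clip, the frame count follows directly from the sample rate:

      ```yaml
      # duration_seconds × frame_sample_rate = total_frames_analyzed
      - {frame_sample_rate: 1.0,  total_frames_analyzed: 60}
      - {frame_sample_rate: 5.0,  total_frames_analyzed: 300}
      - {frame_sample_rate: 25.0, total_frames_analyzed: 1500}
      ```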
      **DETECTION THRESHOLDS**:
      CV models output confidence scores. Thresholds filter noise:

      | Threshold | Use Case |
      |-----------|----------|
      | 0.9+ | High precision, production display |
      | 0.7-0.9 | Balanced, general use |
      | 0.5-0.7 | High recall, research/review |
      | < 0.5 | Raw output, needs filtering |

      **MODEL ARCHITECTURE TRACKING**:
      Different model architectures have different characteristics:

      | Architecture | Examples | Strengths |
      |--------------|----------|-----------|
      | CNN | ResNet, VGG | Fast inference, good for objects |
      | Transformer | ViT, CLIP | Better context, multimodal |
      | Hybrid | DETR, Swin | Balance of speed and accuracy |

      **HERITAGE INSTITUTION CONTEXT**:
      Video annotations enable:
      - **Discovery**: find videos containing specific objects/artworks
      - **Accessibility**: scene descriptions for visually impaired users
      - **Research**: analysis of visual content at scale
      - **Preservation**: documentation of visual content as text
      - **Linking**: connections from detected artworks to collection records

      **CIDOC-CRM E13_Attribute_Assignment**:
      Annotations are attribute assignments: they assert properties about
      video segments. The CV model or human annotator is the assigning agent.
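      To make the slots concrete, a minimal instance of a concrete subclass
      might look like the following (hypothetical data; segment keys follow
      the VideoTimeSegment examples used in this file):

      ```yaml
      annotation_type: OBJECT_DETECTION
      annotation_motivation: CLASSIFYING
      model_architecture: Transformer
      model_task: detection
      frame_sample_rate: 1.0
      total_frames_analyzed: 1800
      detection_threshold: 0.5
      detection_count: 342
      includes_bounding_boxes: true
      includes_segmentation_masks: false
      annotation_segments:
      - start_seconds: 30.0
        end_seconds: 35.0
        segment_text: Night Watch painting visible
      ```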
    exact_mappings:
    - oa:Annotation
    close_mappings:
    - crm:E13_Attribute_Assignment
    related_mappings:
    - as:Activity
    - schema:ClaimReview
    slots:
    - annotation_motivation
    - annotation_segments
    - annotation_type
    - detection_count
    - detection_threshold
    - frame_sample_rate
    - includes_bounding_boxes
    - includes_segmentation_masks
    - keyframe_extraction
    - model_architecture
    - model_task
    - specificity_annotation
    - template_specificity
    - total_frames_analyzed
    slot_usage:
      annotation_type:
        slot_uri: dcterms:type
        description: |
          High-level type classification for this annotation.
          Dublin Core: type for resource categorization.

          **Standard Types**:
          - SCENE_DETECTION: Shot/scene boundary detection
          - OBJECT_DETECTION: Object, face, logo detection
          - OCR: Text-in-video extraction
          - ACTION_RECOGNITION: Human action detection
          - SEMANTIC_SEGMENTATION: Pixel-level classification
          - MULTIMODAL: Combined audio+visual analysis
        range: AnnotationTypeEnum
        required: true
        examples:
        - value: OBJECT_DETECTION
          description: Object and face detection annotation
      annotation_segments:
        slot_uri: oa:hasBody
        description: |
          List of temporal segments with detection results.
          Web Annotation: hasBody links annotation to its content.
          Each segment contains:
          - Time boundaries (start/end)
          - Detection text/description
          - Per-segment confidence
          Reuses VideoTimeSegment for consistent temporal modeling.
        range: VideoTimeSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
        - value: '[{start_seconds: 30.0, end_seconds: 35.0, segment_text: ''Night Watch painting visible''}]'
          description: Object detection segment
      detection_threshold:
        slot_uri: hc:detectionThreshold
        description: |
          Minimum confidence threshold used for detection filtering.
          Detections below this threshold were excluded from results.
          Range: 0.0 to 1.0

          **Common Values**:
          - 0.5: Standard threshold (balanced)
          - 0.7: High precision mode
          - 0.3: High recall mode (includes uncertain detections)
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
        - value: 0.5
          description: Standard detection threshold
      detection_count:
        slot_uri: hc:detectionCount
        description: |
          Total number of detections across all analyzed frames.
          Useful for:
          - Understanding annotation density
          - Quality assessment
          - Performance metrics
          Note: May be higher than annotation_segments count if segments
          are aggregated or filtered.
        range: integer
        required: false
        minimum_value: 0
        examples:
        - value: 342
          description: 342 total detections found
      frame_sample_rate:
        slot_uri: hc:frameSampleRate
        description: |
          Number of frames analyzed per second of video.

          **Common Values**:
          - 1.0: One frame per second (efficient)
          - 5.0: Five frames per second (balanced)
          - 30.0: Every frame at 30 fps (thorough but expensive)
          - 0.1: One frame every 10 seconds (overview only)
          Higher rates catch more content but increase compute cost.
        range: float
        required: false
        minimum_value: 0.0
        examples:
        - value: 1.0
          description: Analyzed 1 frame per second
      total_frames_analyzed:
        slot_uri: hc:totalFramesAnalyzed
        description: |
          Total number of video frames that were analyzed.
          Calculated as: video_duration_seconds × frame_sample_rate
          Useful for:
          - Understanding analysis coverage
          - Cost estimation
          - Reproducibility
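          A worked example of the formula (hypothetical values):

          ```yaml
          # 30-minute video = 1800 s; 1800 s × 1.0 frames/s = 1800 frames
          frame_sample_rate: 1.0
          total_frames_analyzed: 1800
          ```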
        range: integer
        required: false
        minimum_value: 0
        examples:
        - value: 1800
          description: Analyzed 1,800 frames (30 min video at 1 fps)
      keyframe_extraction:
        slot_uri: hc:keyframeExtraction
        description: |
          Whether keyframe extraction was used instead of uniform sampling.

          **Keyframe extraction** selects visually distinct frames
          (scene changes, significant motion) rather than uniform intervals.
          - true: Keyframes extracted (variable frame selection)
          - false: Uniform sampling at frame_sample_rate
          Keyframe extraction is more efficient but may miss content
          between scene changes.
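          The two modes, sketched on a hypothetical 10-second clip
          (timestamps are illustrative only):

          ```yaml
          # Uniform sampling at 0.5 fps: frames at t = 0, 2, 4, 6, 8
          - {keyframe_extraction: false, frame_sample_rate: 0.5}
          # Keyframe extraction: frames at scene changes only, e.g. t = 0.0, 3.2, 7.9
          - {keyframe_extraction: true}
          ```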
        range: boolean
        required: false
        examples:
        - value: true
          description: Used keyframe extraction
      model_architecture:
        slot_uri: hc:modelArchitecture
        description: |
          Architecture type of the CV/ML model used.

          **Common Architectures**:
          - CNN: Convolutional Neural Network (ResNet, VGG, EfficientNet)
          - Transformer: Vision Transformer (ViT, Swin, CLIP)
          - Hybrid: Combined architectures (DETR, ConvNeXt)
          - RNN: Recurrent (for temporal analysis)
          - GAN: Generative (for reconstruction tasks)
          Useful for understanding model capabilities and limitations.
        range: string
        required: false
        examples:
        - value: Transformer
          description: Vision Transformer architecture
        - value: CNN
          description: Convolutional Neural Network
      model_task:
        slot_uri: hc:modelTask
        description: |
          Specific task the model was trained for.

          **Common Tasks**:
          - classification: Image/frame classification
          - detection: Object detection with bounding boxes
          - segmentation: Pixel-level classification
          - captioning: Image/video captioning
          - embedding: Feature extraction for similarity
          A model's task determines its output format.
        range: string
        required: false
        examples:
        - value: detection
          description: Object detection task
        - value: captioning
          description: Video captioning task
      includes_bounding_boxes:
        slot_uri: hc:includesBoundingBoxes
        description: |
          Whether annotation includes spatial bounding box coordinates.
          Bounding boxes define rectangular regions in frames where
          objects/faces/text were detected.
          Format typically: [x, y, width, height] or [x1, y1, x2, y2]
          - true: Spatial coordinates available in segment data
          - false: Only temporal information (no spatial)
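          The two box formats encode the same region; corners follow from
          origin plus size (field names and pixel values are illustrative):

          ```yaml
          # [x, y, width, height] form
          bounding_box: [120, 45, 300, 200]
          # equivalent [x1, y1, x2, y2] form: x2 = 120 + 300, y2 = 45 + 200
          bounding_box_corners: [120, 45, 420, 245]
          ```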
        range: boolean
        required: false
        examples:
        - value: true
          description: Includes bounding box coordinates
      includes_segmentation_masks:
        slot_uri: hc:includesSegmentationMasks
        description: |
          Whether annotation includes pixel-level segmentation masks.
          Segmentation masks provide precise object boundaries
          (more detailed than bounding boxes).
          - true: Pixel masks available (typically as separate files)
          - false: No segmentation data
          Masks are memory-intensive; often stored externally.
        range: boolean
        required: false
        examples:
        - value: false
          description: No segmentation masks included
      annotation_motivation:
        slot_uri: oa:motivatedBy
        description: |
          The motivation or purpose for creating this annotation.
          Web Annotation: motivatedBy describes why annotation was created.

          **Standard Motivations** (from W3C Web Annotation):
          - classifying: Categorizing content
          - describing: Adding description
          - identifying: Identifying depicted things
          - tagging: Adding tags/keywords
          - linking: Linking to external resources

          **Heritage-Specific**:
          - accessibility: For accessibility services
          - discovery: For search/discovery
          - preservation: For digital preservation
        range: AnnotationMotivationEnum
        required: false
        examples:
        - value: CLASSIFYING
          description: Annotation for classification purposes
      specificity_annotation:
        range: SpecificityAnnotation
        inlined: true
      template_specificity:
        range: TemplateSpecificityScores
        inlined: true
    comments:
    - Abstract base for all CV/multimodal video annotations
    - Extends VideoTextContent with frame-based analysis parameters
    - W3C Web Annotation compatible structure
    - Supports both temporal and spatial annotation
    - Tracks detection thresholds and model architecture
    see_also:
    - https://www.w3.org/TR/annotation-model/
    - http://www.cidoc-crm.org/cidoc-crm/E13_Attribute_Assignment
    - https://iiif.io/api/presentation/3.0/
enums:
  AnnotationTypeEnum:
    description: |
      Types of video annotation based on analysis method.
    permissible_values:
      SCENE_DETECTION:
        description: Shot and scene boundary detection
      OBJECT_DETECTION:
        description: Object, face, and logo detection
      OCR:
        description: Optical character recognition (text-in-video)
      ACTION_RECOGNITION:
        description: Human action and activity detection
      SEMANTIC_SEGMENTATION:
        description: Pixel-level semantic classification
      POSE_ESTIMATION:
        description: Human body pose detection
      EMOTION_RECOGNITION:
        description: Facial emotion/expression analysis
      MULTIMODAL:
        description: Combined audio-visual analysis
      CAPTIONING:
        description: Automated video captioning/description
      CUSTOM:
        description: Custom annotation type
  AnnotationMotivationEnum:
    description: |
      Motivation for creating annotation (W3C Web Annotation aligned).
    permissible_values:
      CLASSIFYING:
        description: Categorizing or classifying content
        meaning: oa:classifying
      DESCRIBING:
        description: Adding descriptive information
        meaning: oa:describing
      IDENTIFYING:
        description: Identifying depicted entities
        meaning: oa:identifying
      TAGGING:
        description: Adding tags or keywords
        meaning: oa:tagging
      LINKING:
        description: Linking to external resources
        meaning: oa:linking
      COMMENTING:
        description: Adding commentary
        meaning: oa:commenting
      ACCESSIBILITY:
        description: Providing accessibility support
      DISCOVERY:
        description: Enabling search and discovery
      PRESERVATION:
        description: Supporting digital preservation
      RESEARCH:
        description: Supporting research and analysis
slots:
  annotation_type:
    description: High-level type of video annotation
    range: AnnotationTypeEnum
  annotation_segments:
    description: List of temporal segments with detection results
    range: VideoTimeSegment
    multivalued: true
  detection_threshold:
    description: Minimum confidence threshold for detection filtering
    range: float
  detection_count:
    description: Total number of detections found
    range: integer
  frame_sample_rate:
    description: Frames analyzed per second of video
    range: float
  total_frames_analyzed:
    description: Total number of frames analyzed
    range: integer
  keyframe_extraction:
    description: Whether keyframe extraction was used
    range: boolean
  model_architecture:
    description: Architecture type of CV/ML model (CNN, Transformer, etc.)
    range: string
  model_task:
    description: Specific task the model was trained for
    range: string
  includes_bounding_boxes:
    description: Whether annotation includes spatial bounding boxes
    range: boolean
  includes_segmentation_masks:
    description: Whether annotation includes pixel segmentation masks
    range: boolean
  annotation_motivation:
    description: Motivation for creating annotation (W3C Web Annotation)
    range: AnnotationMotivationEnum