glam/schemas/20251121/linkml/modules/classes/VideoAnnotation.yaml
kempersc 51554947a0 feat(schema): Add video content schema with comprehensive examples
Video Schema Classes (9 files):
- VideoPost, VideoComment: Social media video modeling
- VideoTextContent: Base class for text content extraction
- VideoTranscript, VideoSubtitle: Text with timing and formatting
- VideoTimeSegment: Time code handling with ISO 8601 duration
- VideoAnnotation: Base annotation with W3C Web Annotation alignment
- VideoAnnotationTypes: Scene, Object, OCR detection annotations
- VideoChapter, VideoChapterList: Navigation and chapter structure
- VideoAudioAnnotation: Speaker diarization, music, sound events

Enumerations (12 enums):
- VideoDefinitionEnum, LiveBroadcastStatusEnum
- TranscriptFormatEnum, SubtitleFormatEnum, SubtitlePositionEnum
- AnnotationTypeEnum, AnnotationMotivationEnum
- DetectionLevelEnum, SceneTypeEnum, TransitionTypeEnum, TextTypeEnum
- ChapterSourceEnum, AudioEventTypeEnum, SoundEventTypeEnum, MusicTypeEnum

Examples (904 lines, 10 comprehensive heritage-themed examples):
- Rijksmuseum virtual tour chapters (5 chapters with heritage entity refs)
- Operation Night Watch documentary chapters (5 chapters)
- VideoAudioAnnotation: curator interview, exhibition promo, museum lecture

All examples reference real heritage entities with Wikidata IDs:
Q5598 (Rembrandt), Q41264 (Vermeer), Q219831 (The Night Watch)
2025-12-16 20:03:17 +01:00


# Video Annotation Class
# Abstract base class for computer vision and multimodal video annotations
#
# Part of Heritage Custodian Ontology v0.9.5
#
# HIERARCHY:
#   E73_Information_Object (CIDOC-CRM)
#   │
#   └── VideoTextContent (abstract base)
#       │
#       ├── VideoTranscript (audio-derived)
#       │   │
#       │   └── VideoSubtitle (time-coded captions)
#       │
#       └── VideoAnnotation (this class - ABSTRACT)
#           │
#           ├── VideoSceneAnnotation (scene/shot detection)
#           ├── VideoObjectAnnotation (object/face/logo detection)
#           └── VideoOCRAnnotation (text-in-video extraction)
#
# DESIGN RATIONALE:
# VideoAnnotation is the abstract parent for all annotations derived from
# visual analysis of video content. Unlike VideoTranscript (audio-derived),
# these annotations come from computer vision, multimodal AI, or manual
# visual analysis.
#
# Key differences from transcript branch:
# - Frame-based rather than audio-based analysis
# - Spatial information (bounding boxes, regions)
# - Detection thresholds and frame sampling
# - Multiple detection types per segment
#
# ONTOLOGY ALIGNMENT:
# - W3C Web Annotation (oa:Annotation) for annotation structure
# - CIDOC-CRM E13_Attribute_Assignment for attribution activities
# - IIIF Presentation API for spatial/temporal selectors
id: https://nde.nl/ontology/hc/class/VideoAnnotation
name: video_annotation_class
title: Video Annotation Class

imports:
  - linkml:types
  - ./VideoTextContent
  - ./VideoTimeSegment

prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  oa: http://www.w3.org/ns/oa#
  as: https://www.w3.org/ns/activitystreams#

default_prefix: hc
classes:
  VideoAnnotation:
    is_a: VideoTextContent
    class_uri: oa:Annotation
    abstract: true
    description: |
      Abstract base class for computer vision and multimodal video annotations.

      **DEFINITION**:
      VideoAnnotation represents structured information derived from visual
      analysis of video content. This includes:

      | Subclass | Analysis Type | Output |
      |----------|---------------|--------|
      | VideoSceneAnnotation | Shot/scene detection | Scene boundaries, types |
      | VideoObjectAnnotation | Object detection | Objects, faces, logos |
      | VideoOCRAnnotation | Text extraction | On-screen text (OCR) |

      **RELATIONSHIP TO W3C WEB ANNOTATION**:
      VideoAnnotation aligns with the W3C Web Annotation Data Model:

      ```turtle
      :annotation a oa:Annotation ;
        oa:hasBody :detection_result ;
        oa:hasTarget [
          oa:hasSource :video ;
          oa:hasSelector [
            a oa:FragmentSelector ;
            dcterms:conformsTo <http://www.w3.org/TR/media-frags/> ;
            rdf:value "t=30,35"
          ]
        ] ;
        oa:motivatedBy oa:classifying .
      ```

      **FRAME-BASED ANALYSIS**:
      Unlike audio transcription (a continuous stream), video annotation is
      typically frame-based:
      - `frame_sample_rate`: Frames analyzed per second (e.g., 1 fps, 5 fps)
      - `total_frames_analyzed`: Total frames processed
      - Higher sample rates yield more detections but higher compute cost

      **DETECTION THRESHOLDS**:
      CV models output confidence scores. Thresholds filter noise:

      | Threshold | Use Case |
      |-----------|----------|
      | 0.9+ | High precision, production display |
      | 0.7-0.9 | Balanced, general use |
      | 0.5-0.7 | High recall, research/review |
      | < 0.5 | Raw output, needs filtering |

      **MODEL ARCHITECTURE TRACKING**:
      Different model architectures have different characteristics:

      | Architecture | Examples | Strengths |
      |--------------|----------|-----------|
      | CNN | ResNet, VGG | Fast inference, good for objects |
      | Transformer | ViT, CLIP | Better context, multimodal |
      | Hybrid | DETR, Swin | Balance of speed and accuracy |

      **HERITAGE INSTITUTION CONTEXT**:
      Video annotations enable:
      - **Discovery**: Find videos containing specific objects/artworks
      - **Accessibility**: Scene descriptions for visually impaired users
      - **Research**: Analyze visual content at scale
      - **Preservation**: Document visual content as text
      - **Linking**: Connect detected artworks to collection records

      **CIDOC-CRM E13_Attribute_Assignment**:
      Annotations are attribute assignments: they assert properties about
      video segments. The CV model or human annotator is the assigning agent.
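      For illustration, an instance of a concrete subclass (here a
      VideoObjectAnnotation record) could look like the sketch below;
      slot names are the ones defined in this file, values are illustrative:

      ```yaml
      annotation_type: OBJECT_DETECTION
      annotation_motivation: IDENTIFYING
      detection_threshold: 0.7
      detection_count: 342
      frame_sample_rate: 1.0
      total_frames_analyzed: 1800
      keyframe_extraction: false
      model_architecture: "Transformer"
      model_task: "detection"
      includes_bounding_boxes: true
      includes_segmentation_masks: false
      ```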
    exact_mappings:
      - oa:Annotation
    close_mappings:
      - crm:E13_Attribute_Assignment
    related_mappings:
      - as:Activity
      - schema:ClaimReview
    slots:
      # Annotation structure
      - annotation_type
      - annotation_segments
      # Detection parameters
      - detection_threshold
      - detection_count
      # Frame analysis
      - frame_sample_rate
      - total_frames_analyzed
      - keyframe_extraction
      # Model details
      - model_architecture
      - model_task
      # Spatial information
      - includes_bounding_boxes
      - includes_segmentation_masks
      # Annotation motivation
      - annotation_motivation
    slot_usage:
      annotation_type:
        slot_uri: dcterms:type
        description: |
          High-level type classification for this annotation.
          Dublin Core: type for resource categorization.

          **Standard Types**:
          - SCENE_DETECTION: Shot/scene boundary detection
          - OBJECT_DETECTION: Object, face, logo detection
          - OCR: Text-in-video extraction
          - ACTION_RECOGNITION: Human action detection
          - SEMANTIC_SEGMENTATION: Pixel-level classification
          - MULTIMODAL: Combined audio+visual analysis
        range: AnnotationTypeEnum
        required: true
        examples:
          - value: "OBJECT_DETECTION"
            description: "Object and face detection annotation"
      annotation_segments:
        slot_uri: oa:hasBody
        description: |
          List of temporal segments with detection results.
          Web Annotation: hasBody links annotation to its content.

          Each segment contains:
          - Time boundaries (start/end)
          - Detection text/description
          - Per-segment confidence

          Reuses VideoTimeSegment for consistent temporal modeling.
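          For illustration, a segment list could be serialized as follows
          (slot names assumed from the VideoTimeSegment usage elsewhere in
          this file; values are illustrative):

          ```yaml
          annotation_segments:
            - start_seconds: 30.0
              end_seconds: 35.0
              segment_text: "Night Watch painting visible"
          ```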
        range: VideoTimeSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 30.0, end_seconds: 35.0, segment_text: 'Night Watch painting visible'}]"
            description: "Object detection segment"
      detection_threshold:
        slot_uri: hc:detectionThreshold
        description: |
          Minimum confidence threshold used for detection filtering.
          Detections below this threshold were excluded from results.
          Range: 0.0 to 1.0

          **Common Values**:
          - 0.5: Standard threshold (balanced)
          - 0.7: High precision mode
          - 0.3: High recall mode (includes uncertain detections)
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
          - value: 0.5
            description: "Standard detection threshold"
      detection_count:
        slot_uri: hc:detectionCount
        description: |
          Total number of detections across all analyzed frames.

          Useful for:
          - Understanding annotation density
          - Quality assessment
          - Performance metrics

          Note: May be higher than the annotation_segments count if segments
          are aggregated or filtered.
        range: integer
        required: false
        minimum_value: 0
        examples:
          - value: 342
            description: "342 total detections found"
      frame_sample_rate:
        slot_uri: hc:frameSampleRate
        description: |
          Number of frames analyzed per second of video.

          **Common Values**:
          - 1.0: One frame per second (efficient)
          - 5.0: Five frames per second (balanced)
          - 30.0: Every frame at 30 fps (thorough but expensive)
          - 0.1: One frame every 10 seconds (overview only)

          Higher rates catch more content but increase compute cost.
        range: float
        required: false
        minimum_value: 0.0
        examples:
          - value: 1.0
            description: "Analyzed 1 frame per second"
      total_frames_analyzed:
        slot_uri: hc:totalFramesAnalyzed
        description: |
          Total number of video frames that were analyzed.
          Calculated as: video_duration_seconds × frame_sample_rate

          Useful for:
          - Understanding analysis coverage
          - Cost estimation
          - Reproducibility
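          For illustration, a 30-minute (1,800 s) video sampled at 1.0 fps
          gives:

          ```yaml
          frame_sample_rate: 1.0
          total_frames_analyzed: 1800  # 1800 s × 1.0 fps
          ```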
        range: integer
        required: false
        minimum_value: 0
        examples:
          - value: 1800
            description: "Analyzed 1,800 frames (30 min video at 1 fps)"
      keyframe_extraction:
        slot_uri: hc:keyframeExtraction
        description: |
          Whether keyframe extraction was used instead of uniform sampling.

          **Keyframe extraction** selects visually distinct frames
          (scene changes, significant motion) rather than uniform intervals.
          - true: Keyframes extracted (variable frame selection)
          - false: Uniform sampling at frame_sample_rate

          Keyframe extraction is more efficient but may miss content
          between scene changes.
        range: boolean
        required: false
        examples:
          - value: true
            description: "Used keyframe extraction"
      model_architecture:
        slot_uri: hc:modelArchitecture
        description: |
          Architecture type of the CV/ML model used.

          **Common Architectures**:
          - CNN: Convolutional Neural Network (ResNet, VGG, EfficientNet)
          - Transformer: Vision Transformer (ViT, Swin, CLIP)
          - Hybrid: Combined architectures (DETR, ConvNeXt)
          - RNN: Recurrent (for temporal analysis)
          - GAN: Generative (for reconstruction tasks)

          Useful for understanding model capabilities and limitations.
        range: string
        required: false
        examples:
          - value: "Transformer"
            description: "Vision Transformer architecture"
          - value: "CNN"
            description: "Convolutional Neural Network"
      model_task:
        slot_uri: hc:modelTask
        description: |
          Specific task the model was trained for.

          **Common Tasks**:
          - classification: Image/frame classification
          - detection: Object detection with bounding boxes
          - segmentation: Pixel-level classification
          - captioning: Image/video captioning
          - embedding: Feature extraction for similarity

          A model's task determines its output format.
        range: string
        required: false
        examples:
          - value: "detection"
            description: "Object detection task"
          - value: "captioning"
            description: "Video captioning task"
      includes_bounding_boxes:
        slot_uri: hc:includesBoundingBoxes
        description: |
          Whether annotation includes spatial bounding box coordinates.
          Bounding boxes define rectangular regions in frames where
          objects/faces/text were detected.
          Format typically: [x, y, width, height] or [x1, y1, x2, y2]
          - true: Spatial coordinates available in segment data
          - false: Only temporal information (no spatial)
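          For illustration, the same region expressed in both conventions
          (field names are hypothetical; this schema does not prescribe a
          coordinate format):

          ```yaml
          bbox_xywh: [120, 80, 300, 200]  # [x, y, width, height]
          bbox_xyxy: [120, 80, 420, 280]  # [x1, y1, x2, y2] = [x, y, x+w, y+h]
          ```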
        range: boolean
        required: false
        examples:
          - value: true
            description: "Includes bounding box coordinates"
      includes_segmentation_masks:
        slot_uri: hc:includesSegmentationMasks
        description: |
          Whether annotation includes pixel-level segmentation masks.
          Segmentation masks provide precise object boundaries
          (more detailed than bounding boxes).
          - true: Pixel masks available (typically as separate files)
          - false: No segmentation data

          Masks are memory-intensive; often stored externally.
        range: boolean
        required: false
        examples:
          - value: false
            description: "No segmentation masks included"
      annotation_motivation:
        slot_uri: oa:motivatedBy
        description: |
          The motivation or purpose for creating this annotation.
          Web Annotation: motivatedBy describes why the annotation was created.

          **Standard Motivations** (from W3C Web Annotation):
          - classifying: Categorizing content
          - describing: Adding description
          - identifying: Identifying depicted things
          - tagging: Adding tags/keywords
          - linking: Linking to external resources

          **Heritage-Specific**:
          - accessibility: For accessibility services
          - discovery: For search/discovery
          - preservation: For digital preservation
        range: AnnotationMotivationEnum
        required: false
        examples:
          - value: "CLASSIFYING"
            description: "Annotation for classification purposes"
    comments:
      - "Abstract base for all CV/multimodal video annotations"
      - "Extends VideoTextContent with frame-based analysis parameters"
      - "W3C Web Annotation compatible structure"
      - "Supports both temporal and spatial annotation"
      - "Tracks detection thresholds and model architecture"
    see_also:
      - "https://www.w3.org/TR/annotation-model/"
      - "http://www.cidoc-crm.org/cidoc-crm/E13_Attribute_Assignment"
      - "https://iiif.io/api/presentation/3.0/"
# ============================================================================
# Enumerations
# ============================================================================
enums:
  AnnotationTypeEnum:
    description: |
      Types of video annotation based on analysis method.
    permissible_values:
      SCENE_DETECTION:
        description: Shot and scene boundary detection
      OBJECT_DETECTION:
        description: Object, face, and logo detection
      OCR:
        description: Optical character recognition (text-in-video)
      ACTION_RECOGNITION:
        description: Human action and activity detection
      SEMANTIC_SEGMENTATION:
        description: Pixel-level semantic classification
      POSE_ESTIMATION:
        description: Human body pose detection
      EMOTION_RECOGNITION:
        description: Facial emotion/expression analysis
      MULTIMODAL:
        description: Combined audio-visual analysis
      CAPTIONING:
        description: Automated video captioning/description
      CUSTOM:
        description: Custom annotation type
  AnnotationMotivationEnum:
    description: |
      Motivation for creating annotation (W3C Web Annotation aligned).
    permissible_values:
      CLASSIFYING:
        description: Categorizing or classifying content
        meaning: oa:classifying
      DESCRIBING:
        description: Adding descriptive information
        meaning: oa:describing
      IDENTIFYING:
        description: Identifying depicted entities
        meaning: oa:identifying
      TAGGING:
        description: Adding tags or keywords
        meaning: oa:tagging
      LINKING:
        description: Linking to external resources
        meaning: oa:linking
      COMMENTING:
        description: Adding commentary
        meaning: oa:commenting
      ACCESSIBILITY:
        description: Providing accessibility support
      DISCOVERY:
        description: Enabling search and discovery
      PRESERVATION:
        description: Supporting digital preservation
      RESEARCH:
        description: Supporting research and analysis
# ============================================================================
# Slot Definitions
# ============================================================================
slots:
  annotation_type:
    description: High-level type of video annotation
    range: AnnotationTypeEnum
  annotation_segments:
    description: List of temporal segments with detection results
    range: VideoTimeSegment
    multivalued: true
  detection_threshold:
    description: Minimum confidence threshold for detection filtering
    range: float
  detection_count:
    description: Total number of detections found
    range: integer
  frame_sample_rate:
    description: Frames analyzed per second of video
    range: float
  total_frames_analyzed:
    description: Total number of frames analyzed
    range: integer
  keyframe_extraction:
    description: Whether keyframe extraction was used
    range: boolean
  model_architecture:
    description: Architecture type of CV/ML model (CNN, Transformer, etc.)
    range: string
  model_task:
    description: Specific task the model was trained for
    range: string
  includes_bounding_boxes:
    description: Whether annotation includes spatial bounding boxes
    range: boolean
  includes_segmentation_masks:
    description: Whether annotation includes pixel segmentation masks
    range: boolean
  annotation_motivation:
    description: Motivation for creating annotation (W3C Web Annotation)
    range: AnnotationMotivationEnum