glam/schemas/20251121/linkml/modules/classes/VideoAudioAnnotation.yaml
kempersc 51554947a0 feat(schema): Add video content schema with comprehensive examples
Video Schema Classes (9 files):
- VideoPost, VideoComment: Social media video modeling
- VideoTextContent: Base class for text content extraction
- VideoTranscript, VideoSubtitle: Text with timing and formatting
- VideoTimeSegment: Time code handling with ISO 8601 duration
- VideoAnnotation: Base annotation with W3C Web Annotation alignment
- VideoAnnotationTypes: Scene, Object, OCR detection annotations
- VideoChapter, VideoChapterList: Navigation and chapter structure
- VideoAudioAnnotation: Speaker diarization, music, sound events

Enumerations (15 enums):
- VideoDefinitionEnum, LiveBroadcastStatusEnum
- TranscriptFormatEnum, SubtitleFormatEnum, SubtitlePositionEnum
- AnnotationTypeEnum, AnnotationMotivationEnum
- DetectionLevelEnum, SceneTypeEnum, TransitionTypeEnum, TextTypeEnum
- ChapterSourceEnum, AudioEventTypeEnum, SoundEventTypeEnum, MusicTypeEnum

Examples (904 lines, 10 comprehensive heritage-themed examples):
- Rijksmuseum virtual tour chapters (5 chapters with heritage entity refs)
- Operation Night Watch documentary chapters (5 chapters)
- VideoAudioAnnotation: curator interview, exhibition promo, museum lecture

All examples reference real heritage entities with Wikidata IDs:
Q5598 (Rembrandt), Q41264 (Vermeer), Q219831 (The Night Watch)
2025-12-16 20:03:17 +01:00


# Video Audio Annotation Class
# Models audio event detection in video content (speech, music, silence, diarization)
#
# Part of Heritage Custodian Ontology v0.9.10
#
# HIERARCHY:
# VideoAnnotation (abstract base)
# │
# ├── VideoSceneAnnotation (scene/shot detection)
# ├── VideoObjectAnnotation (object/face/logo detection)
# ├── VideoOCRAnnotation (text-in-video extraction)
# └── VideoAudioAnnotation (this class)
# - Speech detection and diarization
# - Music detection and classification
# - Sound event detection
# - Silence/noise detection
#
# HERITAGE INSTITUTION USE CASES:
# - Speaker identification in curator interviews
# - Music detection in promotional videos
# - Silence detection for video quality analysis
# - Language detection for multilingual content
# - Applause/audience reaction in lecture recordings
# - Sound effects in exhibition media
#
# ONTOLOGY ALIGNMENT:
# - W3C Web Annotation for annotation structure
# - CIDOC-CRM E13_Attribute_Assignment for attribution
# - W3C Media Ontology for audio properties
# - Speech-to-Text standards for diarization
id: https://nde.nl/ontology/hc/class/VideoAudioAnnotation
name: video_audio_annotation_class
title: Video Audio Annotation Class
imports:
- linkml:types
- ./VideoAnnotation
- ./VideoTimeSegment
prefixes:
linkml: https://w3id.org/linkml/
hc: https://nde.nl/ontology/hc/
schema: http://schema.org/
dcterms: http://purl.org/dc/terms/
prov: http://www.w3.org/ns/prov#
crm: http://www.cidoc-crm.org/cidoc-crm/
oa: http://www.w3.org/ns/oa#
ma: http://www.w3.org/ns/ma-ont#
wikidata: http://www.wikidata.org/entity/
default_prefix: hc
# ============================================================================
# Classes
# ============================================================================
classes:
VideoAudioAnnotation:
is_a: VideoAnnotation
class_uri: hc:VideoAudioAnnotation
abstract: false
description: |
Annotation for audio events detected in video content.
**DEFINITION**:
VideoAudioAnnotation captures structured information derived from audio
analysis of video content. This includes speech, music, silence, and
various sound events.
**AUDIO ANALYSIS TYPES**:
| Type | Description | Use Case |
|------|-------------|----------|
| **Speech Detection** | Identify spoken segments | Transcript alignment |
| **Speaker Diarization** | Who spoke when | Interview navigation |
| **Music Detection** | Identify musical segments | Content classification |
| **Sound Events** | Applause, laughter, etc. | Audience engagement |
| **Silence Detection** | Find quiet segments | Quality assessment |
| **Language Detection** | Identify spoken languages | Multilingual content |
**SPEAKER DIARIZATION**:
Diarization answers "who spoke when":
```
0:00-0:15 Speaker 1 (Curator)
0:15-0:45 Speaker 2 (Artist)
0:45-1:00 Speaker 1 (Curator)
1:00-1:30 Speaker 3 (Museum Director)
```
Heritage applications:
- Navigate to specific speakers in interviews
- Count speaking time per person
- Identify unnamed speakers for annotation
- Build speaker databases for recognition
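The timeline above could be serialized with the diarization slots of this class. A minimal sketch (all values illustrative; confidence scores are hypothetical):
```yaml
diarization_enabled: true
speaker_count: 3
diarization_segments:
  - diarization_start_seconds: 0.0
    diarization_end_seconds: 15.0
    diarization_speaker_id: spk_001
    diarization_speaker_label: Curator
    diarization_confidence: 0.94
  - diarization_start_seconds: 15.0
    diarization_end_seconds: 45.0
    diarization_speaker_id: spk_002
    diarization_speaker_label: Artist
    diarization_confidence: 0.91
```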
**MUSIC DETECTION**:
Music detection classifies audio segments as containing music:
| Category | Examples |
|----------|----------|
| **Background music** | Documentary soundtracks |
| **Featured music** | Concert recordings, performances |
| **Historical music** | Archival recordings |
| **Licensed music** | Rights-managed content |
Music segments may also include:
- Genre classification (classical, jazz, folk)
- Mood/tempo analysis
- Fingerprinting for identification
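A documentary with a classical background score might be annotated as follows (a sketch with illustrative values):
```yaml
music_detected: true
music_segments:
  - music_start_seconds: 0.0
    music_end_seconds: 30.0
    music_type: BACKGROUND
    music_genre: classical
    music_segment_confidence: 0.88
    is_background: true
music_genres_detected:
  - classical
```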
**SOUND EVENT DETECTION**:
Non-speech, non-music audio events:
| Event Type | Heritage Context |
|------------|------------------|
| APPLAUSE | Lecture recordings, openings |
| LAUGHTER | Tour guides, educational content |
| CROWD_NOISE | Event documentation |
| DOOR/FOOTSTEPS | Ambient archive recordings |
| NATURE_SOUNDS | Outdoor heritage site recordings |
| MACHINERY | Industrial heritage, conservation |
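For example, a lecture recording that ends in applause could be annotated as (illustrative values):
```yaml
sound_events_detected: true
sound_event_types:
  - APPLAUSE
  - CROWD_NOISE
audio_event_segments:
  - start_seconds: 3540.0
    end_seconds: 3555.0
    segment_text: "Applause at end of lecture"
```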
**LANGUAGE DETECTION**:
Multilingual heritage content requires language identification:
```yaml
speech_segments:
- start: 0.0
end: 120.0
language: nl
speaker_id: speaker_001
- start: 120.0
end: 240.0
language: en
speaker_id: speaker_001 # Same speaker, switched language
```
**AUDIO QUALITY ANALYSIS**:
Audio quality metrics for preservation and accessibility:
| Metric | Description | Threshold |
|--------|-------------|-----------|
| SNR | Signal-to-noise ratio | > 20 dB good |
| Clipping | Peak distortion | None ideal |
| Noise floor | Background noise level | < -50 dB good |
| Frequency response | Bandwidth | Full-range ideal |
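These metrics map directly onto the quality slots of this class, e.g. (illustrative values):
```yaml
audio_quality_score: 0.85
snr_db: 25.0
noise_floor_db: -45.0
has_clipping: false
```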
**HERITAGE INSTITUTION USE CASES**:
| Content Type | Audio Analysis Need |
|--------------|---------------------|
| Oral histories | Diarization, transcription alignment |
| Curator interviews | Speaker identification, language |
| Virtual tours | Background music, voiceover detection |
| Lecture recordings | Audience reactions, Q&A segments |
| Conservation videos | Narration vs demonstration audio |
| Archival footage | Speech recovery, noise reduction |
**RELATIONSHIP TO VideoTranscript**:
VideoAudioAnnotation is complementary to VideoTranscript:
- **VideoTranscript**: The text content of speech (WHAT was said)
- **VideoAudioAnnotation**: Audio structure (WHO spoke, music, sounds)
Together they provide complete audio understanding:
```
VideoAudioAnnotation: Speaker 1 spoke 0:00-0:15
VideoTranscript: "Welcome to the Rijksmuseum..." (0:00-0:15)
→ Combined: Curator said "Welcome to the Rijksmuseum..."
```
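A minimal sketch of a complete instance combining several of the analyses above (all values illustrative):
```yaml
primary_audio_event_type: MIXED
speech_detected: true
speech_language: nl
languages_detected: [nl, en]
diarization_enabled: true
speaker_count: 2
music_detected: true
sound_events_detected: false
silence_total_seconds: 4.5
audio_quality_score: 0.82
```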
exact_mappings:
- hc:VideoAudioAnnotation
close_mappings:
- ma:AudioTrack
- crm:E13_Attribute_Assignment
related_mappings:
- wikidata:Q11028 # Speech
- wikidata:Q638 # Music
slots:
# Audio event detection
- audio_event_segments
- primary_audio_event_type
# Speech analysis
- speech_detected
- speech_segments
- speech_language
- speech_language_confidence
- languages_detected
# Speaker diarization
- diarization_enabled
- diarization_segments
- speaker_count
- speaker_labels
# Music detection
- music_detected
- music_segments
- music_genres_detected
- music_confidence
# Sound events
- sound_events_detected
- sound_event_types
# Silence/noise
- silence_segments
- silence_total_seconds
- noise_floor_db
# Audio quality
- audio_quality_score
- snr_db
- has_clipping
slot_usage:
audio_event_segments:
slot_uri: oa:hasBody
description: |
Time-coded segments with detected audio events.
Web Annotation: hasBody links annotation to content.
Each segment contains:
- Start/end time boundaries
- Event type (SPEECH, MUSIC, SILENCE, etc.)
- Confidence score
- Additional metadata (speaker ID, language, etc.)
Segments may overlap (e.g., speech over background music).
range: VideoTimeSegment
multivalued: true
required: false
inlined_as_list: true
examples:
- value: "[{start_seconds: 0.0, end_seconds: 15.0, segment_text: 'Speech detected - Speaker 1'}]"
description: "Speech detection segment"
primary_audio_event_type:
slot_uri: dcterms:type
description: |
The primary type of audio analysis performed.
Dublin Core: type for categorization.
**Types**:
- SPEECH: Speech detection and diarization
- MUSIC: Music detection and classification
- SOUND_EVENTS: Environmental sound detection
- MIXED: Multiple analysis types combined
range: AudioEventTypeEnum
required: true
examples:
- value: "SPEECH"
description: "Primary focus on speech analysis"
speech_detected:
slot_uri: hc:speechDetected
description: |
Whether speech was detected in the video audio.
High-level flag for presence of speech content.
- true: At least one speech segment detected
- false: No speech detected (music-only, silent, etc.)
range: boolean
required: false
examples:
- value: true
description: "Speech is present in video"
speech_segments:
slot_uri: hc:speechSegments
description: |
Detailed speech segments with speaker and language info.
Each segment represents continuous speech from one speaker.
Used for:
- Transcript alignment
- Speaker navigation
- Language segmentation
range: SpeechSegment
multivalued: true
required: false
inlined_as_list: true
examples:
- value: "[{segment_start_seconds: 0.0, segment_end_seconds: 15.0, speaker_id: 'spk_001', segment_language: 'nl'}]"
description: "Dutch speech from speaker 1"
speech_language:
slot_uri: dcterms:language
description: |
Primary language of speech content (ISO 639-1 code).
Dublin Core: language for primary language.
For multilingual content, this is the predominant language.
See `languages_detected` for all languages.
range: string
required: false
examples:
- value: "nl"
description: "Dutch is primary language"
- value: "en"
description: "English is primary language"
speech_language_confidence:
slot_uri: hc:languageConfidence
description: |
Confidence score for language detection (0.0-1.0).
Higher confidence when:
- Longer speech segments
- Clear audio quality
- Distinct language features
Lower confidence when:
- Short utterances
- Background noise
- Code-switching
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
examples:
- value: 0.95
description: "High confidence language detection"
languages_detected:
slot_uri: hc:languagesDetected
description: |
All languages detected in speech (ISO 639-1 codes).
Heritage content often includes multiple languages:
- Exhibition videos with translations
- Interviews with multilingual speakers
- Historical content with period languages
Ordered by speaking time (most spoken first).
range: string
multivalued: true
required: false
examples:
- value: "[nl, en, de]"
description: "Dutch, English, and German detected"
diarization_enabled:
slot_uri: hc:diarizationEnabled
description: |
Whether speaker diarization was performed.
Diarization = identifying distinct speakers and their segments.
- true: Speaker IDs assigned to speech segments
- false: Speech detected but speakers not distinguished
range: boolean
required: false
examples:
- value: true
description: "Diarization was performed"
diarization_segments:
slot_uri: hc:diarizationSegments
description: |
Detailed diarization results with speaker assignments.
Each segment identifies:
- Time boundaries
- Speaker ID (anonymous: "spk_001", "spk_002")
- Optional speaker name (if identified)
- Confidence score
Enables "who spoke when" analysis.
range: DiarizationSegment
multivalued: true
required: false
inlined_as_list: true
examples:
- value: "[{diarization_start_seconds: 0.0, diarization_end_seconds: 15.0, diarization_speaker_id: 'spk_001', diarization_speaker_label: 'Curator'}]"
description: "Curator speaking for first 15 seconds"
speaker_count:
slot_uri: hc:speakerCount
description: |
Number of distinct speakers detected.
Useful for:
- Interview classification (1 = monologue, 2+ = dialog)
- Content type inference
- Accessibility planning
range: integer
required: false
minimum_value: 0
examples:
- value: 3
description: "Three distinct speakers detected"
speaker_labels:
slot_uri: hc:speakerLabels
description: |
Labels or names assigned to detected speakers.
May be:
- Anonymous: ["Speaker 1", "Speaker 2"]
- Identified: ["Dr. Taco Dibbits", "Interviewer"]
- Role-based: ["Curator", "Artist", "Host"]
Ordered by total speaking time (longest first).
range: string
multivalued: true
required: false
examples:
- value: "[Curator, Artist, Museum Director]"
description: "Three identified speakers"
music_detected:
slot_uri: hc:musicDetected
description: |
Whether music was detected in the audio.
- true: Musical content detected (any amount)
- false: No music detected (speech-only, silence)
range: boolean
required: false
examples:
- value: true
description: "Music present in video"
music_segments:
slot_uri: hc:musicSegments
description: |
Time segments containing music.
Each segment includes:
- Time boundaries
- Music type (background, featured)
- Genre classification (if detected)
- Confidence score
range: MusicSegment
multivalued: true
required: false
inlined_as_list: true
examples:
- value: "[{music_start_seconds: 0.0, music_end_seconds: 30.0, music_type: 'BACKGROUND', music_genre: 'classical'}]"
description: "Classical background music"
music_genres_detected:
slot_uri: hc:musicGenresDetected
description: |
Music genres detected in audio.
**Common Heritage Genres**:
- classical: Art music, orchestral
- baroque: Period-specific classical
- jazz: Jazz performances
- folk: Traditional/folk music
- ambient: Background/atmospheric
- electronic: Modern electronic music
range: string
multivalued: true
required: false
examples:
- value: "[classical, baroque]"
description: "Classical and baroque music detected"
music_confidence:
slot_uri: hc:musicConfidence
description: |
Overall confidence of music detection (0.0-1.0).
Average confidence across all music segments.
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
examples:
- value: 0.88
description: "High confidence music detection"
sound_events_detected:
slot_uri: hc:soundEventsDetected
description: |
Whether non-speech, non-music sound events were detected.
Sound events include applause, laughter, environmental sounds, etc.
range: boolean
required: false
examples:
- value: true
description: "Sound events detected"
sound_event_types:
slot_uri: hc:soundEventTypes
description: |
Types of sound events detected.
**Heritage-Relevant Events**:
- APPLAUSE: Lecture endings, openings
- LAUGHTER: Tour guide humor
- CROWD_NOISE: Event atmosphere
- FOOTSTEPS: Gallery ambiance
- NATURE_SOUNDS: Outdoor heritage sites
- BELLS: Church/temple recordings
range: SoundEventTypeEnum
multivalued: true
required: false
examples:
- value: "[APPLAUSE, CROWD_NOISE]"
description: "Applause and crowd sounds detected"
silence_segments:
slot_uri: hc:silenceSegments
description: |
Time segments containing silence or very low audio.
Silence detection useful for:
- Finding pauses between segments
- Quality assessment (unexpected silence)
- Identifying chapter/scene boundaries
A typical threshold is audio below -40 dB for more than 2 seconds.
range: VideoTimeSegment
multivalued: true
required: false
inlined_as_list: true
examples:
- value: "[{start_seconds: 45.0, end_seconds: 48.0}]"
description: "3-second silence"
silence_total_seconds:
slot_uri: hc:silenceTotalSeconds
description: |
Total duration of silence in the video (seconds).
High silence percentage may indicate:
- Extended pauses
- Silent segments (B-roll without audio)
- Audio issues
range: float
required: false
minimum_value: 0.0
examples:
- value: 15.5
description: "15.5 seconds of total silence"
noise_floor_db:
slot_uri: hc:noiseFloorDb
description: |
Background noise floor level in decibels.
**Quality Guidelines**:
- < -60 dB: Excellent (studio quality)
- -60 to -40 dB: Good (professional recording)
- -40 to -30 dB: Acceptable (field recording)
- > -30 dB: Poor (noisy environment)
range: float
required: false
examples:
- value: -45.0
description: "Good quality, moderate noise floor"
audio_quality_score:
slot_uri: hc:audioQualityScore
description: |
Overall audio quality score (0.0-1.0).
Composite score based on:
- Signal-to-noise ratio
- Clipping presence
- Frequency response
- Clarity of speech
**Interpretation**:
- > 0.8: High quality, suitable for all uses
- 0.6-0.8: Good quality, minor issues
- 0.4-0.6: Acceptable, some degradation
- < 0.4: Poor quality, may need enhancement
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
examples:
- value: 0.85
description: "High audio quality"
snr_db:
slot_uri: hc:snrDb
description: |
Signal-to-noise ratio in decibels.
Higher is better:
- > 30 dB: Excellent
- 20-30 dB: Good
- 10-20 dB: Acceptable
- < 10 dB: Poor (speech intelligibility affected)
range: float
required: false
examples:
- value: 25.0
description: "Good signal-to-noise ratio"
has_clipping:
slot_uri: hc:hasClipping
description: |
Whether audio clipping (peak distortion) was detected.
Clipping occurs when audio exceeds maximum level:
- true: Clipping detected (distortion present)
- false: No clipping (clean audio)
Clipping is permanent quality loss.
range: boolean
required: false
examples:
- value: false
description: "No clipping detected"
comments:
- "Audio event detection for video content"
- "Supports speech, music, silence, and sound event detection"
- "Speaker diarization for interview navigation"
- "Language detection for multilingual heritage content"
- "Audio quality metrics for preservation assessment"
see_also:
- "https://www.w3.org/TR/annotation-model/"
- "https://arxiv.org/abs/2111.08085" # Speaker diarization survey
# ============================================================================
# Supporting Classes
# ============================================================================
SpeechSegment:
class_uri: hc:SpeechSegment
description: |
A speech segment with speaker and language information.
Extends VideoTimeSegment with speech-specific metadata.
slots:
- segment_start_seconds
- segment_end_seconds
- speaker_id
- speaker_label
- segment_language
- segment_confidence
- speech_text
slot_usage:
segment_start_seconds:
slot_uri: ma:hasStartTime
description: Start time in seconds
range: float
required: true
minimum_value: 0.0
segment_end_seconds:
slot_uri: ma:hasEndTime
description: End time in seconds
range: float
required: true
minimum_value: 0.0
speaker_id:
slot_uri: hc:speakerId
description: |
Unique identifier for the speaker.
Format: "spk_001", "spk_002", etc. (anonymous)
Or: "taco_dibbits" (identified)
range: string
required: false
speaker_label:
slot_uri: schema:name
description: Human-readable speaker name or role
range: string
required: false
segment_language:
slot_uri: dcterms:language
description: Language of speech in this segment (ISO 639-1)
range: string
required: false
segment_confidence:
slot_uri: hc:confidence
description: Confidence score for this segment (0.0-1.0)
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
speech_text:
slot_uri: hc:speechText
description: |
Transcript text for this segment (if available).
Links to VideoTranscript for full transcript.
range: string
required: false
DiarizationSegment:
class_uri: hc:DiarizationSegment
description: |
A diarization segment identifying speaker and time boundaries.
Focused on "who spoke when" rather than transcript content.
slots:
- diarization_start_seconds
- diarization_end_seconds
- diarization_speaker_id
- diarization_speaker_label
- diarization_confidence
- is_overlapping
slot_usage:
diarization_start_seconds:
slot_uri: ma:hasStartTime
description: Start time in seconds
range: float
required: true
minimum_value: 0.0
diarization_end_seconds:
slot_uri: ma:hasEndTime
description: End time in seconds
range: float
required: true
minimum_value: 0.0
diarization_speaker_id:
slot_uri: hc:speakerId
description: Anonymous speaker identifier (spk_001, spk_002, etc.)
range: string
required: true
diarization_speaker_label:
slot_uri: schema:name
description: Optional identified name or role
range: string
required: false
diarization_confidence:
slot_uri: hc:confidence
description: Diarization confidence (0.0-1.0)
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
is_overlapping:
slot_uri: hc:isOverlapping
description: |
Whether this segment overlaps with another speaker.
Overlapping speech occurs when multiple people speak simultaneously.
range: boolean
required: false
MusicSegment:
class_uri: hc:MusicSegment
description: |
A segment of detected music with classification.
slots:
- music_start_seconds
- music_end_seconds
- music_type
- music_genre
- music_segment_confidence
- is_background
slot_usage:
music_start_seconds:
slot_uri: ma:hasStartTime
description: Start time in seconds
range: float
required: true
minimum_value: 0.0
music_end_seconds:
slot_uri: ma:hasEndTime
description: End time in seconds
range: float
required: true
minimum_value: 0.0
music_type:
slot_uri: dcterms:type
description: Type of music (BACKGROUND, FEATURED, ARCHIVAL)
range: MusicTypeEnum
required: false
music_genre:
slot_uri: hc:genre
description: Detected music genre
range: string
required: false
music_segment_confidence:
slot_uri: hc:confidence
description: Music detection confidence (0.0-1.0)
range: float
required: false
minimum_value: 0.0
maximum_value: 1.0
is_background:
slot_uri: hc:isBackground
description: |
Whether music is background (under speech) vs featured.
- true: Music is background/ambient
- false: Music is primary audio
range: boolean
required: false
# ============================================================================
# Enumerations
# ============================================================================
enums:
AudioEventTypeEnum:
description: |
Types of audio events detected in video.
permissible_values:
SPEECH:
description: Speech/voice detection and analysis
MUSIC:
description: Music detection and classification
SILENCE:
description: Silence or very low audio
SOUND_EVENT:
description: Non-speech, non-music sound events
NOISE:
description: Noise detection (for quality assessment)
MIXED:
description: Multiple audio event types analyzed
SoundEventTypeEnum:
description: |
Types of non-speech, non-music sound events.
permissible_values:
APPLAUSE:
description: Clapping, applause
LAUGHTER:
description: Laughter from audience or speakers
CROWD_NOISE:
description: General crowd/audience noise
FOOTSTEPS:
description: Walking, footsteps
DOOR:
description: Door opening/closing sounds
NATURE_SOUNDS:
description: Birds, wind, water, etc.
TRAFFIC:
description: Vehicles, urban sounds
BELLS:
description: Church bells, temple bells, etc.
MACHINERY:
description: Industrial, mechanical sounds
COUGHING:
description: Coughing, clearing throat
PAPER:
description: Paper rustling
TYPING:
description: Keyboard typing
PHONE:
description: Phone ringing or notification
MUSIC_INSTRUMENT:
description: Individual instrument sounds
OTHER:
description: Other sound event type
MusicTypeEnum:
description: |
Types of music presence in audio.
permissible_values:
BACKGROUND:
description: Background/ambient music under other content
FEATURED:
description: Primary audio is music (performance, recording)
ARCHIVAL:
description: Historical/archival music recording
INTRO_OUTRO:
description: Opening or closing music/jingle
TRANSITION:
description: Music used for scene transitions
DIEGETIC:
description: Music from within the scene (radio, live performance)
NON_DIEGETIC:
description: Music added in post-production
# ============================================================================
# Slot Definitions
# ============================================================================
slots:
# Audio event slots
audio_event_segments:
description: Time-coded segments with detected audio events
range: VideoTimeSegment
multivalued: true
primary_audio_event_type:
description: Primary type of audio analysis performed
range: AudioEventTypeEnum
# Speech slots
speech_detected:
description: Whether speech was detected
range: boolean
speech_segments:
description: Detailed speech segments with speaker info
range: SpeechSegment
multivalued: true
speech_language:
description: Primary language of speech (ISO 639-1)
range: string
speech_language_confidence:
description: Confidence of language detection
range: float
languages_detected:
description: All languages detected in speech
range: string
multivalued: true
# Diarization slots
diarization_enabled:
description: Whether speaker diarization was performed
range: boolean
diarization_segments:
description: Detailed diarization results
range: DiarizationSegment
multivalued: true
speaker_count:
description: Number of distinct speakers detected
range: integer
speaker_labels:
description: Labels or names for detected speakers
range: string
multivalued: true
# Music slots
music_detected:
description: Whether music was detected
range: boolean
music_segments:
description: Time segments containing music
range: MusicSegment
multivalued: true
music_genres_detected:
description: Music genres detected
range: string
multivalued: true
music_confidence:
description: Overall music detection confidence
range: float
# Sound event slots
sound_events_detected:
description: Whether sound events were detected
range: boolean
sound_event_types:
description: Types of sound events detected
range: SoundEventTypeEnum
multivalued: true
# Silence/noise slots
silence_segments:
description: Time segments with silence
range: VideoTimeSegment
multivalued: true
silence_total_seconds:
description: Total silence duration
range: float
noise_floor_db:
description: Background noise floor in dB
range: float
# Audio quality slots
audio_quality_score:
description: Overall audio quality (0.0-1.0)
range: float
snr_db:
description: Signal-to-noise ratio in dB
range: float
has_clipping:
description: Whether audio clipping was detected
range: boolean
# SpeechSegment slots
segment_start_seconds:
description: Segment start time
range: float
segment_end_seconds:
description: Segment end time
range: float
speaker_id:
description: Speaker identifier
range: string
speaker_label:
description: Speaker name or role
range: string
segment_language:
description: Language of segment
range: string
segment_confidence:
description: Segment confidence score
range: float
speech_text:
description: Transcript text for segment
range: string
# DiarizationSegment slots
diarization_start_seconds:
description: Diarization segment start
range: float
diarization_end_seconds:
description: Diarization segment end
range: float
diarization_speaker_id:
description: Speaker ID in diarization
range: string
diarization_speaker_label:
description: Speaker label in diarization
range: string
diarization_confidence:
description: Diarization confidence
range: float
is_overlapping:
description: Whether segment has overlapping speech
range: boolean
# MusicSegment slots
music_start_seconds:
description: Music segment start
range: float
music_end_seconds:
description: Music segment end
range: float
music_type:
description: Type of music presence
range: MusicTypeEnum
music_genre:
description: Detected music genre
range: string
music_segment_confidence:
description: Music segment confidence
range: float
is_background:
description: Whether music is background
range: boolean