# Video Audio Annotation Class
#
# Models audio event detection in video content (speech, music, silence, diarization)
#
# Part of Heritage Custodian Ontology v0.9.10
#
# HIERARCHY:
#
#   VideoAnnotation (abstract base)
#   │
#   ├── VideoSceneAnnotation (scene/shot detection)
#   ├── VideoObjectAnnotation (object/face/logo detection)
#   ├── VideoOCRAnnotation (text-in-video extraction)
#   └── VideoAudioAnnotation (this class)
#         - Speech detection and diarization
#         - Music detection and classification
#         - Sound event detection
#         - Silence/noise detection
#
# HERITAGE INSTITUTION USE CASES:
# - Speaker identification in curator interviews
# - Music detection in promotional videos
# - Silence detection for video quality analysis
# - Language detection for multilingual content
# - Applause/audience reaction in lecture recordings
# - Sound effects in exhibition media
#
# ONTOLOGY ALIGNMENT:
# - W3C Web Annotation for annotation structure
# - CIDOC-CRM E13_Attribute_Assignment for attribution
# - W3C Media Ontology for audio properties
# - Speech-to-Text standards for diarization

id: https://nde.nl/ontology/hc/class/VideoAudioAnnotation
name: video_audio_annotation_class
title: Video Audio Annotation Class

imports:
  - linkml:types
  - ./VideoAnnotation
  - ./VideoTimeSegment

prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  oa: http://www.w3.org/ns/oa#
  ma: http://www.w3.org/ns/ma-ont#
  wd: http://www.wikidata.org/entity/

default_prefix: hc

# ============================================================================
# Classes
# ============================================================================

classes:

  VideoAudioAnnotation:
    is_a: VideoAnnotation
    class_uri: hc:VideoAudioAnnotation
    abstract: false
    description: |
      Annotation for audio events detected in video content.

      **DEFINITION**:

      VideoAudioAnnotation captures structured information derived from audio
      analysis of video content. This includes speech, music, silence, and
      various sound events.

      **AUDIO ANALYSIS TYPES**:

      | Type | Description | Use Case |
      |------|-------------|----------|
      | **Speech Detection** | Identify spoken segments | Transcript alignment |
      | **Speaker Diarization** | Who spoke when | Interview navigation |
      | **Music Detection** | Identify musical segments | Content classification |
      | **Sound Events** | Applause, laughter, etc. | Audience engagement |
      | **Silence Detection** | Find quiet segments | Quality assessment |
      | **Language Detection** | Identify spoken languages | Multilingual content |

      **SPEAKER DIARIZATION**:

      Diarization answers "who spoke when":

      ```
      0:00-0:15  Speaker 1 (Curator)
      0:15-0:45  Speaker 2 (Artist)
      0:45-1:00  Speaker 1 (Curator)
      1:00-1:30  Speaker 3 (Museum Director)
      ```

      Heritage applications:
      - Navigate to specific speakers in interviews
      - Count speaking time per person
      - Identify unnamed speakers for annotation
      - Build speaker databases for recognition
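
      A minimal sketch of the "count speaking time per person" use case,
      assuming diarization results are available as plain
      `(start_s, end_s, speaker)` tuples (names are illustrative, not part
      of the schema):

      ```python
      from collections import defaultdict

      def speaking_time(segments):
          """Total seconds spoken per speaker, most-spoken first."""
          totals = defaultdict(float)
          for start, end, speaker in segments:
              totals[speaker] += end - start
          # Most-spoken first, matching the ordering used by speaker_labels
          return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

      segments = [
          (0.0, 15.0, "Curator"),
          (15.0, 45.0, "Artist"),
          (45.0, 60.0, "Curator"),
          (60.0, 90.0, "Museum Director"),
      ]
      # speaking_time(segments) → 30.0 s each for all three speakers
      ```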

      **MUSIC DETECTION**:

      Music detection classifies audio segments as containing music:

      | Category | Examples |
      |----------|----------|
      | **Background music** | Documentary soundtracks |
      | **Featured music** | Concert recordings, performances |
      | **Historical music** | Archival recordings |
      | **Licensed music** | Rights-managed content |

      Music segments may also include:
      - Genre classification (classical, jazz, folk)
      - Mood/tempo analysis
      - Fingerprinting for identification

      **SOUND EVENT DETECTION**:

      Non-speech, non-music audio events:

      | Event Type | Heritage Context |
      |------------|------------------|
      | APPLAUSE | Lecture recordings, openings |
      | LAUGHTER | Tour guides, educational content |
      | CROWD_NOISE | Event documentation |
      | DOOR/FOOTSTEPS | Ambient archive recordings |
      | NATURE_SOUNDS | Outdoor heritage site recordings |
      | MACHINERY | Industrial heritage, conservation |

      **LANGUAGE DETECTION**:

      Multilingual heritage content requires language identification:

      ```yaml
      speech_segments:
        - start: 0.0
          end: 120.0
          language: nl
          speaker_id: speaker_001
        - start: 120.0
          end: 240.0
          language: en
          speaker_id: speaker_001  # Same speaker, switched language
      ```

      **AUDIO QUALITY ANALYSIS**:

      Audio quality metrics for preservation and accessibility:

      | Metric | Description | Threshold |
      |--------|-------------|-----------|
      | SNR | Signal-to-noise ratio | > 20 dB good |
      | Clipping | Peak distortion | None ideal |
      | Noise floor | Background noise level | < -50 dB good |
      | Frequency response | Bandwidth | Full-range ideal |

      **HERITAGE INSTITUTION USE CASES**:

      | Content Type | Audio Analysis Need |
      |--------------|---------------------|
      | Oral histories | Diarization, transcription alignment |
      | Curator interviews | Speaker identification, language |
      | Virtual tours | Background music, voiceover detection |
      | Lecture recordings | Audience reactions, Q&A segments |
      | Conservation videos | Narration vs demonstration audio |
      | Archival footage | Speech recovery, noise reduction |

      **RELATIONSHIP TO VideoTranscript**:

      VideoAudioAnnotation is complementary to VideoTranscript:

      - **VideoTranscript**: The text content of speech (WHAT was said)
      - **VideoAudioAnnotation**: Audio structure (WHO spoke, music, sounds)

      Together they provide complete audio understanding:

      ```
      VideoAudioAnnotation: Speaker 1 spoke 0:00-0:15
      VideoTranscript:      "Welcome to the Rijksmuseum..." (0:00-0:15)
      → Combined:           Curator said "Welcome to the Rijksmuseum..."
      ```
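
      The combination step can be sketched as a time-overlap join
      (hypothetical in-memory tuples; real data would come from
      `diarization_segments` and the VideoTranscript segments):

      ```python
      def attribute_quotes(diarization, transcript):
          """Pair each transcript line with the speaker whose
          diarization segment overlaps it the most."""
          attributed = []
          for t_start, t_end, text in transcript:
              best, best_overlap = None, 0.0
              for d_start, d_end, speaker in diarization:
                  overlap = min(t_end, d_end) - max(t_start, d_start)
                  if overlap > best_overlap:
                      best, best_overlap = speaker, overlap
              attributed.append((best, text))
          return attributed

      diarization = [(0.0, 15.0, "Curator"), (15.0, 45.0, "Artist")]
      transcript = [(0.0, 15.0, "Welcome to the Rijksmuseum...")]
      # attribute_quotes(diarization, transcript)
      # → [("Curator", "Welcome to the Rijksmuseum...")]
      ```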

    exact_mappings:
      - hc:VideoAudioAnnotation

    close_mappings:
      - ma:AudioTrack
      - crm:E13_Attribute_Assignment

    related_mappings:
      - wd:Q11028  # Speech
      - wd:Q638  # Music

    slots:
      # Audio event detection
      - audio_event_segments
      - primary_audio_event_type

      # Speech analysis
      - speech_detected
      - speech_segments
      - speech_language
      - speech_language_confidence
      - languages_detected

      # Speaker diarization
      - diarization_enabled
      - diarization_segments
      - speaker_count
      - speaker_labels

      # Music detection
      - music_detected
      - music_segments
      - music_genres_detected
      - music_confidence

      # Sound events
      - sound_events_detected
      - sound_event_types

      # Silence/noise
      - silence_segments
      - silence_total_seconds
      - noise_floor_db

      # Audio quality
      - audio_quality_score
      - snr_db
      - has_clipping

    slot_usage:
      audio_event_segments:
        slot_uri: oa:hasBody
        description: |
          Time-coded segments with detected audio events.

          Web Annotation: hasBody links annotation to content.

          Each segment contains:
          - Start/end time boundaries
          - Event type (SPEECH, MUSIC, SILENCE, etc.)
          - Confidence score
          - Additional metadata (speaker ID, language, etc.)

          Segments may overlap (e.g., speech over background music).
        range: VideoTimeSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 0.0, end_seconds: 15.0, segment_text: 'Speech detected - Speaker 1'}]"
            description: "Speech detection segment"

      primary_audio_event_type:
        slot_uri: dcterms:type
        description: |
          The primary type of audio analysis performed.

          Dublin Core: type for categorization.

          **Types** (see AudioEventTypeEnum):
          - SPEECH: Speech detection and diarization
          - MUSIC: Music detection and classification
          - SILENCE: Silence or very low audio
          - SOUND_EVENT: Environmental sound detection
          - NOISE: Noise detection (quality assessment)
          - MIXED: Multiple analysis types combined
        range: AudioEventTypeEnum
        required: true
        examples:
          - value: "SPEECH"
            description: "Primary focus on speech analysis"

      speech_detected:
        slot_uri: hc:speechDetected
        description: |
          Whether speech was detected in the video audio.

          High-level flag for presence of speech content.

          - true: At least one speech segment detected
          - false: No speech detected (music-only, silent, etc.)
        range: boolean
        required: false
        examples:
          - value: true
            description: "Speech is present in video"

      speech_segments:
        slot_uri: hc:speechSegments
        description: |
          Detailed speech segments with speaker and language info.

          Each segment represents continuous speech from one speaker.

          Used for:
          - Transcript alignment
          - Speaker navigation
          - Language segmentation
        range: SpeechSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 0.0, end_seconds: 15.0, speaker_id: 'spk_001', language: 'nl'}]"
            description: "Dutch speech from speaker 1"

      speech_language:
        slot_uri: dcterms:language
        description: |
          Primary language of speech content (ISO 639-1 code).

          Dublin Core: language for primary language.

          For multilingual content, this is the predominant language.
          See `languages_detected` for all languages.
        range: string
        required: false
        examples:
          - value: "nl"
            description: "Dutch is primary language"
          - value: "en"
            description: "English is primary language"

      speech_language_confidence:
        slot_uri: hc:languageConfidence
        description: |
          Confidence score for language detection (0.0-1.0).

          Higher confidence when:
          - Longer speech segments
          - Clear audio quality
          - Distinct language features

          Lower confidence when:
          - Short utterances
          - Background noise
          - Code-switching
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
          - value: 0.95
            description: "High confidence language detection"

      languages_detected:
        slot_uri: hc:languagesDetected
        description: |
          All languages detected in speech (ISO 639-1 codes).

          Heritage content often includes multiple languages:
          - Exhibition videos with translations
          - Interviews with multilingual speakers
          - Historical content with period languages

          Ordered by speaking time (most spoken first).
        range: string
        multivalued: true
        required: false
        examples:
          - value: "[nl, en, de]"
            description: "Dutch, English, and German detected"

      diarization_enabled:
        slot_uri: hc:diarizationEnabled
        description: |
          Whether speaker diarization was performed.

          Diarization = identifying distinct speakers and their segments.

          - true: Speaker IDs assigned to speech segments
          - false: Speech detected but speakers not distinguished
        range: boolean
        required: false
        examples:
          - value: true
            description: "Diarization was performed"

      diarization_segments:
        slot_uri: hc:diarizationSegments
        description: |
          Detailed diarization results with speaker assignments.

          Each segment identifies:
          - Time boundaries
          - Speaker ID (anonymous: "spk_001", "spk_002")
          - Optional speaker name (if identified)
          - Confidence score

          Enables "who spoke when" analysis.
        range: DiarizationSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 0.0, end_seconds: 15.0, speaker_id: 'spk_001', speaker_label: 'Curator'}]"
            description: "Curator speaking for first 15 seconds"

      speaker_count:
        slot_uri: hc:speakerCount
        description: |
          Number of distinct speakers detected.

          Useful for:
          - Interview classification (1 = monologue, 2+ = dialog)
          - Content type inference
          - Accessibility planning
        range: integer
        required: false
        minimum_value: 0
        examples:
          - value: 3
            description: "Three distinct speakers detected"

      speaker_labels:
        slot_uri: hc:speakerLabels
        description: |
          Labels or names assigned to detected speakers.

          May be:
          - Anonymous: ["Speaker 1", "Speaker 2"]
          - Identified: ["Dr. Taco Dibbits", "Interviewer"]
          - Role-based: ["Curator", "Artist", "Host"]

          Ordered by speaking time (most speaking time first).
        range: string
        multivalued: true
        required: false
        examples:
          - value: "[Curator, Artist, Museum Director]"
            description: "Three identified speakers"

      music_detected:
        slot_uri: hc:musicDetected
        description: |
          Whether music was detected in the audio.

          - true: Musical content detected (any amount)
          - false: No music detected (speech-only, silence)
        range: boolean
        required: false
        examples:
          - value: true
            description: "Music present in video"

      music_segments:
        slot_uri: hc:musicSegments
        description: |
          Time segments containing music.

          Each segment includes:
          - Time boundaries
          - Music type (background, featured)
          - Genre classification (if detected)
          - Confidence score
        range: MusicSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 0.0, end_seconds: 30.0, music_type: 'BACKGROUND', genre: 'classical'}]"
            description: "Classical background music"

      music_genres_detected:
        slot_uri: hc:musicGenresDetected
        description: |
          Music genres detected in audio.

          **Common Heritage Genres**:
          - classical: Art music, orchestral
          - baroque: Period-specific classical
          - jazz: Jazz performances
          - folk: Traditional/folk music
          - ambient: Background/atmospheric
          - electronic: Modern electronic music
        range: string
        multivalued: true
        required: false
        examples:
          - value: "[classical, baroque]"
            description: "Classical and baroque music detected"

      music_confidence:
        slot_uri: hc:musicConfidence
        description: |
          Overall confidence of music detection (0.0-1.0).

          Average confidence across all music segments.
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
          - value: 0.88
            description: "High confidence music detection"

      sound_events_detected:
        slot_uri: hc:soundEventsDetected
        description: |
          Whether non-speech, non-music sound events were detected.

          Sound events include applause, laughter, environmental sounds, etc.
        range: boolean
        required: false
        examples:
          - value: true
            description: "Sound events detected"

      sound_event_types:
        slot_uri: hc:soundEventTypes
        description: |
          Types of sound events detected.

          **Heritage-Relevant Events**:
          - APPLAUSE: Lecture endings, openings
          - LAUGHTER: Tour guide humor
          - CROWD_NOISE: Event atmosphere
          - FOOTSTEPS: Gallery ambiance
          - NATURE_SOUNDS: Outdoor heritage sites
          - BELLS: Church/temple recordings
        range: SoundEventTypeEnum
        multivalued: true
        required: false
        examples:
          - value: "[APPLAUSE, CROWD_NOISE]"
            description: "Applause and crowd sounds detected"

      silence_segments:
        slot_uri: hc:silenceSegments
        description: |
          Time segments containing silence or very low audio.

          Silence detection is useful for:
          - Finding pauses between segments
          - Quality assessment (unexpected silence)
          - Identifying chapter/scene boundaries

          Typical threshold: audio below -40 dB for more than 2 seconds.
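
          An illustrative sketch of that rule (hypothetical helper; assumes
          a per-frame loudness envelope in dB and a fixed frame duration):

          ```python
          def find_silence(levels_db, frame_s, threshold_db=-40.0, min_s=2.0):
              """(start, end) spans where the level stays below
              threshold_db for at least min_s seconds."""
              spans, run_start = [], None
              for i, level in enumerate(levels_db):
                  if level < threshold_db:
                      if run_start is None:
                          run_start = i
                  elif run_start is not None:
                      if (i - run_start) * frame_s >= min_s:
                          spans.append((run_start * frame_s, i * frame_s))
                      run_start = None
              n = len(levels_db)
              if run_start is not None and (n - run_start) * frame_s >= min_s:
                  spans.append((run_start * frame_s, n * frame_s))
              return spans

          # 0.5 s frames: 2 s of speech, 3 s quiet, 1 s of speech
          levels = [-20, -20, -20, -20, -55, -55, -55, -55, -55, -55, -18, -18]
          # find_silence(levels, 0.5) → [(2.0, 5.0)]
          ```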
        range: VideoTimeSegment
        multivalued: true
        required: false
        inlined_as_list: true
        examples:
          - value: "[{start_seconds: 45.0, end_seconds: 48.0}]"
            description: "3-second silence"

      silence_total_seconds:
        slot_uri: hc:silenceTotalSeconds
        description: |
          Total duration of silence in the video (seconds).

          High silence percentage may indicate:
          - Extended pauses
          - Silent segments (B-roll without audio)
          - Audio issues
        range: float
        required: false
        minimum_value: 0.0
        examples:
          - value: 15.5
            description: "15.5 seconds of total silence"

      noise_floor_db:
        slot_uri: hc:noiseFloorDb
        description: |
          Background noise floor level in decibels.

          **Quality Guidelines**:
          - < -60 dB: Excellent (studio quality)
          - -60 to -40 dB: Good (professional recording)
          - -40 to -30 dB: Acceptable (field recording)
          - > -30 dB: Poor (noisy environment)
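
          The bands above can be expressed directly (illustrative helper,
          not part of the schema):

          ```python
          def noise_floor_rating(db):
              """Map a noise floor in dB to the quality bands listed above."""
              if db < -60:
                  return "excellent"
              if db < -40:
                  return "good"
              if db < -30:
                  return "acceptable"
              return "poor"

          # noise_floor_rating(-45.0) → "good"
          ```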
        range: float
        required: false
        examples:
          - value: -45.0
            description: "Good quality, moderate noise floor"

      audio_quality_score:
        slot_uri: hc:audioQualityScore
        description: |
          Overall audio quality score (0.0-1.0).

          Composite score based on:
          - Signal-to-noise ratio
          - Clipping presence
          - Frequency response
          - Clarity of speech

          **Interpretation**:
          - > 0.8: High quality, suitable for all uses
          - 0.6-0.8: Good quality, minor issues
          - 0.4-0.6: Acceptable, some degradation
          - < 0.4: Poor quality, may need enhancement
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
          - value: 0.85
            description: "High audio quality"

      snr_db:
        slot_uri: hc:snrDb
        description: |
          Signal-to-noise ratio in decibels.

          Higher is better:
          - > 30 dB: Excellent
          - 20-30 dB: Good
          - 10-20 dB: Acceptable
          - < 10 dB: Poor (speech intelligibility affected)
        range: float
        required: false
        examples:
          - value: 25.0
            description: "Good signal-to-noise ratio"

      has_clipping:
        slot_uri: hc:hasClipping
        description: |
          Whether audio clipping (peak distortion) was detected.

          Clipping occurs when audio exceeds the maximum level:
          - true: Clipping detected (distortion present)
          - false: No clipping (clean audio)

          Clipping is permanent quality loss.
        range: boolean
        required: false
        examples:
          - value: false
            description: "No clipping detected"

    comments:
      - "Audio event detection for video content"
      - "Supports speech, music, silence, and sound event detection"
      - "Speaker diarization for interview navigation"
      - "Language detection for multilingual heritage content"
      - "Audio quality metrics for preservation assessment"

    see_also:
      - "https://www.w3.org/TR/annotation-model/"
      - "https://arxiv.org/abs/2111.08085"  # Speaker diarization survey

# ============================================================================
# Supporting Classes
# ============================================================================

  SpeechSegment:
    class_uri: hc:SpeechSegment
    description: |
      A speech segment with speaker and language information.

      Extends VideoTimeSegment with speech-specific metadata.
    slots:
      - segment_start_seconds
      - segment_end_seconds
      - speaker_id
      - speaker_label
      - segment_language
      - segment_confidence
      - speech_text
    slot_usage:
      segment_start_seconds:
        slot_uri: ma:hasStartTime
        description: Start time in seconds
        range: float
        required: true
        minimum_value: 0.0

      segment_end_seconds:
        slot_uri: ma:hasEndTime
        description: End time in seconds
        range: float
        required: true
        minimum_value: 0.0

      speaker_id:
        slot_uri: hc:speakerId
        description: |
          Unique identifier for the speaker.

          Format: "spk_001", "spk_002", etc. (anonymous)
          Or: "taco_dibbits" (identified)
        range: string
        required: false

      speaker_label:
        slot_uri: schema:name
        description: Human-readable speaker name or role
        range: string
        required: false

      segment_language:
        slot_uri: dcterms:language
        description: Language of speech in this segment (ISO 639-1)
        range: string
        required: false

      segment_confidence:
        slot_uri: hc:confidence
        description: Confidence score for this segment (0.0-1.0)
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0

      speech_text:
        slot_uri: hc:speechText
        description: |
          Transcript text for this segment (if available).

          Links to VideoTranscript for full transcript.
        range: string
        required: false

  DiarizationSegment:
    class_uri: hc:DiarizationSegment
    description: |
      A diarization segment identifying speaker and time boundaries.

      Focused on "who spoke when" rather than transcript content.
    slots:
      - diarization_start_seconds
      - diarization_end_seconds
      - diarization_speaker_id
      - diarization_speaker_label
      - diarization_confidence
      - is_overlapping
    slot_usage:
      diarization_start_seconds:
        slot_uri: ma:hasStartTime
        description: Start time in seconds
        range: float
        required: true
        minimum_value: 0.0

      diarization_end_seconds:
        slot_uri: ma:hasEndTime
        description: End time in seconds
        range: float
        required: true
        minimum_value: 0.0

      diarization_speaker_id:
        slot_uri: hc:speakerId
        description: Anonymous speaker identifier (spk_001, spk_002, etc.)
        range: string
        required: true

      diarization_speaker_label:
        slot_uri: schema:name
        description: Optional identified name or role
        range: string
        required: false

      diarization_confidence:
        slot_uri: hc:confidence
        description: Diarization confidence (0.0-1.0)
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0

      is_overlapping:
        slot_uri: hc:isOverlapping
        description: |
          Whether this segment overlaps with another speaker.

          Overlapping speech occurs when multiple people speak simultaneously.
        range: boolean
        required: false

  MusicSegment:
    class_uri: hc:MusicSegment
    description: |
      A segment of detected music with classification.
    slots:
      - music_start_seconds
      - music_end_seconds
      - music_type
      - music_genre
      - music_segment_confidence
      - is_background
    slot_usage:
      music_start_seconds:
        slot_uri: ma:hasStartTime
        description: Start time in seconds
        range: float
        required: true
        minimum_value: 0.0

      music_end_seconds:
        slot_uri: ma:hasEndTime
        description: End time in seconds
        range: float
        required: true
        minimum_value: 0.0

      music_type:
        slot_uri: dcterms:type
        description: Type of music (BACKGROUND, FEATURED, ARCHIVAL)
        range: MusicTypeEnum
        required: false

      music_genre:
        slot_uri: hc:genre
        description: Detected music genre
        range: string
        required: false

      music_segment_confidence:
        slot_uri: hc:confidence
        description: Music detection confidence (0.0-1.0)
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0

      is_background:
        slot_uri: hc:isBackground
        description: |
          Whether music is background (under speech) vs featured.

          - true: Music is background/ambient
          - false: Music is primary audio
        range: boolean
        required: false

# ============================================================================
# Enumerations
# ============================================================================

enums:

  AudioEventTypeEnum:
    description: |
      Types of audio events detected in video.
    permissible_values:
      SPEECH:
        description: Speech/voice detection and analysis
      MUSIC:
        description: Music detection and classification
      SILENCE:
        description: Silence or very low audio
      SOUND_EVENT:
        description: Non-speech, non-music sound events
      NOISE:
        description: Noise detection (for quality assessment)
      MIXED:
        description: Multiple audio event types analyzed

  SoundEventTypeEnum:
    description: |
      Types of non-speech, non-music sound events.
    permissible_values:
      APPLAUSE:
        description: Clapping, applause
      LAUGHTER:
        description: Laughter from audience or speakers
      CROWD_NOISE:
        description: General crowd/audience noise
      FOOTSTEPS:
        description: Walking, footsteps
      DOOR:
        description: Door opening/closing sounds
      NATURE_SOUNDS:
        description: Birds, wind, water, etc.
      TRAFFIC:
        description: Vehicles, urban sounds
      BELLS:
        description: Church bells, temple bells, etc.
      MACHINERY:
        description: Industrial, mechanical sounds
      COUGHING:
        description: Coughing, clearing throat
      PAPER:
        description: Paper rustling
      TYPING:
        description: Keyboard typing
      PHONE:
        description: Phone ringing or notification
      MUSIC_INSTRUMENT:
        description: Individual instrument sounds
      OTHER:
        description: Other sound event type

  MusicTypeEnum:
    description: |
      Types of music presence in audio.
    permissible_values:
      BACKGROUND:
        description: Background/ambient music under other content
      FEATURED:
        description: Primary audio is music (performance, recording)
      ARCHIVAL:
        description: Historical/archival music recording
      INTRO_OUTRO:
        description: Opening or closing music/jingle
      TRANSITION:
        description: Music used for scene transitions
      DIEGETIC:
        description: Music from within the scene (radio, live performance)
      NON_DIEGETIC:
        description: Music added in post-production

# ============================================================================
# Slot Definitions
# ============================================================================

slots:

  # Audio event slots
  audio_event_segments:
    description: Time-coded segments with detected audio events
    range: VideoTimeSegment
    multivalued: true

  primary_audio_event_type:
    description: Primary type of audio analysis performed
    range: AudioEventTypeEnum

  # Speech slots
  speech_detected:
    description: Whether speech was detected
    range: boolean

  speech_segments:
    description: Detailed speech segments with speaker info
    range: SpeechSegment
    multivalued: true

  speech_language:
    description: Primary language of speech (ISO 639-1)
    range: string

  speech_language_confidence:
    description: Confidence of language detection
    range: float

  languages_detected:
    description: All languages detected in speech
    range: string
    multivalued: true

  # Diarization slots
  diarization_enabled:
    description: Whether speaker diarization was performed
    range: boolean

  diarization_segments:
    description: Detailed diarization results
    range: DiarizationSegment
    multivalued: true

  speaker_count:
    description: Number of distinct speakers detected
    range: integer

  speaker_labels:
    description: Labels or names for detected speakers
    range: string
    multivalued: true

  # Music slots
  music_detected:
    description: Whether music was detected
    range: boolean

  music_segments:
    description: Time segments containing music
    range: MusicSegment
    multivalued: true

  music_genres_detected:
    description: Music genres detected
    range: string
    multivalued: true

  music_confidence:
    description: Overall music detection confidence
    range: float

  # Sound event slots
  sound_events_detected:
    description: Whether sound events were detected
    range: boolean

  sound_event_types:
    description: Types of sound events detected
    range: SoundEventTypeEnum
    multivalued: true

  # Silence/noise slots
  silence_segments:
    description: Time segments with silence
    range: VideoTimeSegment
    multivalued: true

  silence_total_seconds:
    description: Total silence duration
    range: float

  noise_floor_db:
    description: Background noise floor in dB
    range: float

  # Audio quality slots
  audio_quality_score:
    description: Overall audio quality (0.0-1.0)
    range: float

  snr_db:
    description: Signal-to-noise ratio in dB
    range: float

  has_clipping:
    description: Whether audio clipping was detected
    range: boolean

  # SpeechSegment slots
  segment_start_seconds:
    description: Segment start time
    range: float

  segment_end_seconds:
    description: Segment end time
    range: float

  speaker_id:
    description: Speaker identifier
    range: string

  speaker_label:
    description: Speaker name or role
    range: string

  segment_language:
    description: Language of segment
    range: string

  segment_confidence:
    description: Segment confidence score
    range: float

  speech_text:
    description: Transcript text for segment
    range: string

  # DiarizationSegment slots
  diarization_start_seconds:
    description: Diarization segment start
    range: float

  diarization_end_seconds:
    description: Diarization segment end
    range: float

  diarization_speaker_id:
    description: Speaker ID in diarization
    range: string

  diarization_speaker_label:
    description: Speaker label in diarization
    range: string

  diarization_confidence:
    description: Diarization confidence
    range: float

  is_overlapping:
    description: Whether segment has overlapping speech
    range: boolean

  # MusicSegment slots
  music_start_seconds:
    description: Music segment start
    range: float

  music_end_seconds:
    description: Music segment end
    range: float

  music_type:
    description: Type of music presence
    range: MusicTypeEnum

  music_genre:
    description: Detected music genre
    range: string

  music_segment_confidence:
    description: Music segment confidence
    range: float

  is_background:
    description: Whether music is background
    range: boolean