glam/schemas/20251121/linkml/modules/classes/VideoTextContent.yaml
kempersc 4034c2a00a Refactor schema slots across multiple classes to improve consistency and clarity
- Removed unused slots from TaxonomicAuthority, TechnicalFeature, TelevisionArchive, TentativeWorldHeritageSite, Threat, TimeSpan, Title, TradeRegister, TradeUnionArchive, TradeUnionArchiveRecordSetType, TransferEvent, UNESCODomain, UnitIdentifier, UniversityArchive, UnspecifiedType, UserCommunity, Venue, Vereinsarchiv, Verlagsarchiv, VerlagsarchivRecordSetType, Version, Verwaltungsarchiv, VideoAnnotationTypes, VideoAudioAnnotation, VideoFrame, VideoPost, VideoSubtitle, VideoTextContent, Warehouse, WebArchive, WebClaim, WebClaimsBlock, WebLink, WebPortal, WebPortalTypes, WomensArchives, WordCount, WorldHeritageSite, WritingSystem, and XPathScore.
- Introduced new slot is_or_was_retrieved_at for tracking data retrieval timestamps.
2026-01-31 00:28:09 +01:00


id: https://nde.nl/ontology/hc/class/VideoTextContent
name: video_text_content_class
title: Video Text Content Class
imports:
- linkml:types
- ../enums/GenerationMethodEnum
- ../slots/content_title
- ../slots/has_or_had_language
- ../slots/has_or_had_quantity
- ../slots/has_or_had_score
- ../slots/is_or_was_generated_by
- ../slots/is_or_was_verified_by
- ../slots/is_verified
- ../slots/model_provider
- ../slots/model_version
- ../slots/overall_confidence
- ../slots/processing_duration_seconds
- ../slots/source_video
- ../slots/source_video_url
- ../slots/specificity_annotation
- ../slots/temporal_extent
- ./Methodology
- ./Quantity
- ./SpecificityAnnotation
- ./TemplateSpecificityScore
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./TimeSpan
- ./Verifier
- ./VideoPost
- ./GenerationEvent
- ./Language
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  skos: http://www.w3.org/2004/02/skos/core#
  oa: http://www.w3.org/ns/oa#
default_prefix: hc
classes:
  VideoTextContent:
    class_uri: crm:E73_Information_Object
    abstract: true
    description: |
      Abstract base class for all textual/derived content from videos.

      **DEFINITION**:

      VideoTextContent is the abstract parent for all text that is extracted,
      transcribed, or derived from video content. This includes:

      | Subclass | Source | Description |
      |----------|--------|-------------|
      | VideoTranscript | Audio | Full text transcription of spoken content |
      | VideoSubtitle | Audio | Time-coded caption entries (SRT/VTT) |
      | VideoAnnotation | Visual | CV/multimodal-derived descriptions |

      **PROVENANCE REQUIREMENTS**:

      All video-derived text MUST include comprehensive provenance:

      1. **Source**: Which video was processed (`source_video`)
      2. **Method**: How the content was generated (`is_or_was_generated_by`, formerly `generation_method`)
      3. **Agent**: Who/what generated it (`generated_by`)
      4. **Time**: When it was generated (`generation_timestamp`)
      5. **Version**: Tool/model version (`model_version`)
      6. **Quality**: Overall confidence (`overall_confidence`)

      **PROV-O ALIGNMENT**:

      Maps to W3C PROV-O for provenance tracking:

      ```turtle
      :transcript a hc:VideoTranscript ;
        prov:wasGeneratedBy :asr_activity ;
        prov:wasAttributedTo :whisper_model ;
        prov:generatedAtTime "2025-12-01T10:00:00Z" ;
        prov:wasDerivedFrom :source_video .
      ```

      **CIDOC-CRM E73_Information_Object**:

      - E73 is the base for all identifiable immaterial items
      - Includes texts, computer programs, songs, recipes
      - VideoTextContent instances are E73 instances derived from a video (itself an E73)

      **GENERATION METHODS**:

      | Method | Description | Typical Confidence |
      |--------|-------------|--------------------|
      | ASR_AUTOMATIC | Automatic speech recognition | 0.75-0.95 |
      | ASR_ENHANCED | ASR with post-processing | 0.85-0.98 |
      | MANUAL_TRANSCRIPTION | Human transcription | 0.98-1.0 |
      | MANUAL_CORRECTION | Human-corrected ASR | 0.95-1.0 |
      | CV_AUTOMATIC | Computer vision detection | 0.60-0.90 |
      | MULTIMODAL | Combined audio+visual AI | 0.70-0.95 |
      | OCR | Optical character recognition | 0.80-0.98 |
      | PLATFORM_PROVIDED | From YouTube/Vimeo API | 0.85-0.95 |

      **HERITAGE INSTITUTION CONTEXT**:

      Video text content is critical for:

      - **Accessibility**: Deaf/HoH users need accurate captions
      - **Discovery**: Full-text search over video collections
      - **Preservation**: Text outlasts video format obsolescence
      - **Research**: Analyzing spoken content at scale
      - **Translation**: Multilingual access to heritage content

      **LANGUAGE SUPPORT**:

      - `has_or_had_language` (formerly `content_language`): Primary language of the text content
      - May differ from the video's default_audio_language if translated
      - ISO 639-1 codes (e.g., "nl", "en", "de")
    exact_mappings:
    - crm:E73_Information_Object
    close_mappings:
    - prov:Entity
    related_mappings:
    - schema:CreativeWork
    - dcterms:Text
    slots:
    - has_or_had_language
    - content_title
    - generated_by
    - is_or_was_generated_by
    - temporal_extent
    - is_verified
    - model_provider
    - model_version
    - overall_confidence
    - processing_duration_seconds
    - source_video
    - source_video_url
    - specificity_annotation
    - has_or_had_score
    - is_or_was_verified_by
    - has_or_had_quantity
    slot_usage:
      source_video:
        range: string
        required: true
        examples:
        - value: FbIoC-Owy-M
          description: YouTube video ID as source reference
      source_video_url:
        range: uri
        required: false
        examples:
        - value: https://www.youtube.com/watch?v=FbIoC-Owy-M
          description: Full YouTube video URL
      has_or_had_language:
        range: Language
        required: true
        inlined: true
        multivalued: true
        description: |
          Language of the content.
          MIGRATED from content_language (2026-01-28).
        examples:
        - value:
            iso_639_1: "nl"
            language_name: "Dutch"
          description: Dutch language content
        - value:
            iso_639_1: "en"
            language_name: "English"
          description: English translation
      content_title:
        range: string
        required: false
        examples:
        - value: De Vrijheidsroute Ep.3 - Dutch Transcript
          description: Descriptive title for transcript
      generated_by:
        range: string
        required: true
        examples:
        - value: openai/whisper-large-v3
          description: OpenAI Whisper ASR model
        - value: YouTube Auto-captions
          description: Platform-provided captions
        - value: manual:curator@rijksmuseum.nl
          description: Human transcriber
      is_or_was_generated_by:
        description: |
          Method used to generate this text content.
          MIGRATED from generation_method per Rule 53.
          Uses GenerationEvent linking to Methodology (was GenerationMethodEnum).
        range: GenerationEvent
        required: true
        inlined: true
        examples:
        - value:
            has_or_had_methodology:
              methodology_type: ASR_AUTOMATIC
              has_or_had_label: Automatic Speech Recognition
          description: Automatic speech recognition
        - value:
            has_or_had_methodology:
              methodology_type: MANUAL_TRANSCRIPTION
              has_or_had_label: Manual Transcription
          description: Human transcription
      temporal_extent:
        description: |
          Verification date using CIDOC-CRM TimeSpan.
          MIGRATED from verification_date per slot_fixes.yaml (Rule 53).
          Use begin_of_the_begin for the verification timestamp.
        range: TimeSpan
        inlined: true
        required: false
        examples:
        - value:
            begin_of_the_begin: '2025-12-02T15:00:00Z'
          description: Verified December 2, 2025
      model_version:
        range: string
        required: false
        examples:
        - value: large-v3
          description: Whisper model version
        - value: v2.3.1
          description: Software version number
      model_provider:
        range: string
        required: false
        examples:
        - value: OpenAI
          description: Model provider
        - value: Google Cloud
          description: Cloud service provider
      overall_confidence:
        range: float
        required: false
        minimum_value: 0.0
        maximum_value: 1.0
        examples:
        - value: 0.92
          description: High confidence ASR output
      is_verified:
        range: boolean
        required: false
        ifabsent: 'false'
        examples:
        - value: true
          description: Human-verified transcript
      is_or_was_verified_by:
        range: Verifier
        required: false
        inlined: true
        description: |
          Who verified the text content.
          MIGRATED from verified_by slot (2026-01-14) per Rule 53.
          Uses Verifier class for structured verifier with name, type, and URI.
        examples:
        - value:
            verifier_name: curator@rijksmuseum.nl
            verifier_type: PERSON
          description: Staff member who verified
      processing_duration_seconds:
        range: float
        required: false
        minimum_value: 0.0
        examples:
        - value: 45.3
          description: Processed in 45.3 seconds
      has_or_had_quantity:
        range: Quantity
        required: false
        multivalued: true
        inlined: true
        inlined_as_list: true
        description: |
          Quantitative measurements of the text content.
          MIGRATED: word_count (2026-01-14) and character_count (2026-01-18) per Rule 53.
          Uses Quantity class for structured quantity with value, type, and unit.
          Can represent word count, character count, or other text metrics.
        examples:
        - value:
          - quantity_value: 1523
            quantity_type: WORD_COUNT
            has_or_had_measurement_unit:
              has_or_had_type: WORD
              has_or_had_symbol: words
            has_or_had_description: Word count in transcript
          - quantity_value: 8742
            quantity_type: CHARACTER_COUNT
            has_or_had_measurement_unit:
              has_or_had_type: CHARACTER
              has_or_had_symbol: chars
            has_or_had_description: Character count including spaces
          description: Text metrics (word and character count)
    comments:
    - Abstract base for all video-derived text content
    - Comprehensive PROV-O provenance tracking
    - Confidence scoring for AI-generated content
    - Verification workflow support
    - Critical for heritage accessibility and discovery
    see_also:
    - https://www.w3.org/TR/prov-o/
    - http://www.cidoc-crm.org/cidoc-crm/E73_Information_Object
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
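
# Illustrative only: a minimal sketch of instance data for a concrete subclass
# (VideoTranscript is assumed here), assembled from the example values given in
# slot_usage above. It is kept as comments so the schema itself is unchanged.
# The exact nesting expected by Language, GenerationEvent, TimeSpan, Verifier,
# and Quantity is an assumption; check those class definitions before relying
# on this shape.
#
# source_video: FbIoC-Owy-M
# source_video_url: https://www.youtube.com/watch?v=FbIoC-Owy-M
# content_title: De Vrijheidsroute Ep.3 - Dutch Transcript
# has_or_had_language:
# - iso_639_1: nl
#   language_name: Dutch
# generated_by: openai/whisper-large-v3
# is_or_was_generated_by:
#   has_or_had_methodology:
#     methodology_type: ASR_AUTOMATIC
#     has_or_had_label: Automatic Speech Recognition
# model_provider: OpenAI
# model_version: large-v3
# overall_confidence: 0.92          # must stay within 0.0-1.0
# is_verified: true
# is_or_was_verified_by:
#   verifier_name: curator@rijksmuseum.nl
#   verifier_type: PERSON
# temporal_extent:
#   begin_of_the_begin: '2025-12-02T15:00:00Z'
# processing_duration_seconds: 45.3
# has_or_had_quantity:
# - quantity_value: 1523
#   quantity_type: WORD_COUNT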