Language as the Medium: Multimodal Video Classification through text only

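Summary

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos.
Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information.
Our method leverages the extensive knowledge learned by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and auditory modalities obtained from BLIP-2, Whisper and ImageBind.
Without any additional fine-tuning of video-text models or datasets, we demonstrate that available LLMs can use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context.
Evaluations on popular action recognition benchmarks, such as UCF-101 and Kinetics, show that these context-rich descriptions can be used successfully in video understanding tasks.
This method points to a promising new research direction in multimodal classification, showing how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.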

Abstract (original)

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for “sight” or “hearing” and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
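
The abstract describes feeding per-modality text (frame captions, a speech transcript, audio labels) to an LLM and reading a class label back. The sketch below illustrates that flow in a minimal, hedged form; it is not the authors' implementation. The prompt layout, the function names `build_prompt`, `parse_label` and `classify_video`, and the stub `fake_llm` are all assumptions made for illustration, and the captioning, transcription and audio-tagging models are assumed to have already produced the input strings.

```python
from typing import Callable, Optional, Sequence


def build_prompt(
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
) -> str:
    """Assemble one text prompt from per-modality descriptions.

    The layout is hypothetical; the source does not specify a prompt format.
    """
    lines = [
        "You are given text descriptions of a video.",
        "Frame captions: " + ("; ".join(frame_captions) or "(none)"),
        "Speech transcript: " + (transcript or "(none)"),
        "Audio tags: " + (", ".join(audio_tags) or "(none)"),
        "Choose the single best action label from: " + ", ".join(class_names),
        "Answer with the label only.",
    ]
    return "\n".join(lines)


def parse_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    """Map free-form LLM output to a known label, or None if none is mentioned."""
    text = response.lower()
    matches = [name for name in class_names if name.lower() in text]
    # Prefer the longest match so a more specific label wins over a substring.
    return max(matches, key=len) if matches else None


def classify_video(
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Zero-shot classification: the LLM only ever sees text."""
    prompt = build_prompt(frame_captions, transcript, audio_tags, class_names)
    return parse_label(llm(prompt), class_names)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption); returns a fixed answer."""
    return "Label: PlayingGuitar"


if __name__ == "__main__":
    labels = ["PlayingGuitar", "Surfing", "Typing"]  # illustrative label set
    result = classify_video(
        ["a man sitting on a stool holding a guitar"],
        "",
        ["guitar strumming"],
        labels,
        fake_llm,
    )
    print(result)
```

In practice `fake_llm` would be replaced by a call to whichever LLM is available; keeping it as a plain callable leaves the sketch independent of any particular model or API.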

arXiv information

Authors	Laura Hanu, Anita L. Verő, James Thewlis
Published	2023-09-19 17:32:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
