Valley: Video Assistant with Large Language model Enhanced abilitY

Summary

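Large language models (LLMs), with their remarkable conversational capabilities, have emerged as AI assistants that can handle both visual and textual modalities.
However, their effectiveness in joint video and language understanding has not been extensively explored.
In this paper, we introduce Valley, a multimodal foundation model designed for enhanced video comprehension and instruction following.
To this end, we construct two datasets, Valley-702k and Valley-instruct-73k, covering a diverse range of video-text alignment and video-based instruction tasks such as multi-shot captions, long video descriptions, action recognition, and causal inference.
We then adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding.
In addition, we train Valley in two phases: the first phase trains only the projection module so that the LLM can understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability.
Extensive experiments show that Valley has the potential to serve as an effective video assistant that simplifies complex video-understanding scenarios.
Our code and data are published anonymously at https://github.com/valley-vl/Valley.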
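
The two-phase schedule described in the summary can be pictured as a choice of which components receive gradient updates in each phase. The sketch below is only an illustration in plain Python, not the authors' implementation: the names `Module`, `ValleyLikeModel`, and `configure_phase` are hypothetical, and keeping the vision encoder frozen in both phases is an assumption, since the abstract only states which parts are trained.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a model component with a trainable flag."""
    name: str
    trainable: bool = False


@dataclass
class ValleyLikeModel:
    """Illustrative container; the component names are assumptions."""
    vision_encoder: Module = field(default_factory=lambda: Module("vision_encoder"))
    projection: Module = field(default_factory=lambda: Module("projection"))
    llm: Module = field(default_factory=lambda: Module("llm"))

    def modules(self):
        return [self.vision_encoder, self.projection, self.llm]


def configure_phase(model, phase):
    """Set trainable flags for a phase and return the trainable component names."""
    if phase == 1:
        # Phase 1: only the projection module is trained.
        trainable = {"projection"}
    elif phase == 2:
        # Phase 2: the projection module and the LLM are trained jointly.
        trainable = {"projection", "llm"}
    else:
        raise ValueError("phase must be 1 or 2")
    # Assumption: the vision encoder stays frozen in both phases.
    for module in model.modules():
        module.trainable = module.name in trainable
    return [module.name for module in model.modules() if module.trainable]


if __name__ == "__main__":
    model = ValleyLikeModel()
    for phase in (1, 2):
        print(phase, configure_phase(model, phase))
```

In a real training framework the same idea would typically be expressed by enabling or disabling gradients on the corresponding parameter groups before each phase.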

Summary (original)

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In the paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM’s capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.

arXiv information

Authors	Ruipu Luo, Ziwang Zhao, Min Yang, Zheming Yang, Minghui Qiu, Tao Wang, Zhongyu Wei, Yanhao Wang, Cen Chen
Published	2025-03-17 13:51:51+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
