LinVT: Empower Your Image-level Large Language Model to Understand Videos

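Summary

Large language models (LLMs) are widely used across many tasks, which motivates us to develop an LLM-based assistant for video.
Rather than training from scratch, we propose a module that converts any well-trained image-based LLM into a video LLM once it has been trained on video data.
To better adapt image LLMs to video processing, we introduce two design principles: a linear transformation that preserves the original vision-language alignment, and condensation of representative information from redundant video content.
Based on these principles, we propose the Linear Video Tokenizer (LinVT), a plug-and-play module that enables existing image LLMs to understand video.
We benchmark LinVT with six recent visual LLMs (Aquila, Blip-3, InternVL2, Mipha, Molmo, and Qwen2-VL), demonstrating its high compatibility.
LinVT-based LLMs achieve state-of-the-art performance across a range of video benchmarks, showing the effectiveness of LinVT for multimodal video understanding.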

Abstract (original)

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
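The two design principles can be illustrated with a small sketch. It is not the LinVT architecture, which the abstract does not specify. It only assumes that "linear transformation" means each condensed video token is a weighted combination of the input visual tokens, so the outputs stay in the same embedding space the image LLM was aligned to. The names `condense_tokens` and `score_rows` are hypothetical, and the scores are fixed by hand here, whereas a real module would learn them.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condense_tokens(frame_tokens, score_rows):
    """Condense N visual tokens into K tokens via convex combinations.

    frame_tokens: list of N vectors (each a list of D floats).
    score_rows:   K lists of N raw scores (hypothetical; learned in practice).
    Each output token is a weighted average of the inputs, so no
    non-linear mapping is applied to the token embeddings themselves.
    """
    dim = len(frame_tokens[0])
    condensed = []
    for scores in score_rows:
        weights = softmax(scores)
        token = [
            sum(w * tok[d] for w, tok in zip(weights, frame_tokens))
            for d in range(dim)
        ]
        condensed.append(token)
    return condensed


# Four toy visual tokens of dimension 2, condensed into two video tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = [
    [0.0, 0.0, 0.0, 0.0],  # uniform weights: plain average of all tokens
    [5.0, 0.0, 0.0, 0.0],  # weights concentrated on the first token
]
video_tokens = condense_tokens(tokens, scores)
```

With uniform scores the first output is the plain average of the inputs, while the second stays close to the first input token. That is one simple way to picture condensing redundant content into fewer tokens.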

arXiv information

Authors	Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
Published	2024-12-06 17:04:42+00:00
