InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Summary

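This paper aims to improve the performance of video multimodal large language models (MLLMs) through long and rich context (LRC) modeling.
To that end, we develop a new version, InternVideo2.5, which focuses on strengthening the original MLLM's ability to perceive fine-grained details in videos and to capture long-form temporal structure.
Specifically, our approach incorporates dense vision task annotations into the MLLM using direct preference optimization, and builds compact spatiotemporal representations through adaptive hierarchical token compression.
Experimental results show that this LRC design substantially improves video MLLM results on mainstream video understanding benchmarks (both short and long), enables the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and lets it master specialized vision capabilities such as object tracking and segmentation.
Our work highlights the importance of multimodal context richness (length and fineness) in strengthening an MLLM's innate abilities (focus and memory), and offers new insights for future research on video MLLMs.
Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5.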

Summary (original)

This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs’ ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM’s innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
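
The abstract names direct preference optimization (DPO) as the way dense vision task annotations are incorporated, but does not give the objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is the generic textbook form of DPO, not the authors' implementation; the function names, the argument names, and the default `beta` are all assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are summed log-probabilities of a full response under the
    trainable policy or the frozen reference model. `beta` (hypothetical
    default) scales how strongly the policy is pushed away from the reference.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # The policy already favors the chosen response, so the loss is below log(2).
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.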

arXiv information

Authors	Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang
Published	2025-01-21 18:59:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
