V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Summary

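Vision-Language Models (VLMs) show promise across a range of multimodal tasks, but they struggle in long-context scenarios, especially tasks involving videos, high-resolution images, or long image-text documents.
The authors first carry out an empirical analysis of the long-context capability of VLMs using augmented long-context multimodal datasets.
The analysis finds that directly applying the positional encoding mechanism used for text tokens to visual tokens is suboptimal, and that VLM performance drops sharply once the position encoding exceeds the model's context window.
To address this, they propose Variable Visual Position Encoding (V2PE), a positional encoding approach that uses variable, smaller position increments for visual tokens so that long multimodal sequences can be handled more efficiently.
Their experiments show that V2PE improves the ability of VLMs to understand and reason over long multimodal contexts.
They also combine V2PE with the augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2.
The fine-tuned model performs strongly on both standard and long-context multimodal tasks.
Notably, when the sequence length of the training data is increased to 256K tokens, the model can process multimodal sequences of up to 1M tokens, which highlights its potential for real-world long-context applications.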
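The core idea of V2PE, a smaller position increment for visual tokens than for text tokens, can be illustrated with a short sketch. This is not the authors' implementation: the function name `position_ids`, the boolean mask input, the rule that each token's type decides the step taken after it, and the value `delta = 0.25` are all assumptions chosen for illustration.

```python
from typing import List


def position_ids(is_visual: List[bool], delta: float = 1.0) -> List[float]:
    """Assign a position index to every token in a mixed sequence.

    Hypothetical helper: after a text token the position advances by 1,
    after a visual token it advances by `delta` (0 < delta <= 1).
    delta = 1 reproduces ordinary consecutive position indices.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")
    ids: List[float] = []
    pos = 0.0
    for visual in is_visual:
        ids.append(pos)
        # Visual tokens consume less of the position range than text tokens.
        pos += delta if visual else 1.0
    return ids


# Toy sequence: 2 text tokens, 8 visual tokens, 2 text tokens.
mask = [False] * 2 + [True] * 8 + [False] * 2

standard = position_ids(mask, delta=1.0)
compressed = position_ids(mask, delta=0.25)

print(standard[-1])    # largest index with the standard scheme
print(compressed[-1])  # largest index with the smaller visual increment
```

With the smaller increment, the same twelve tokens span a narrower range of position indices, so a long run of visual tokens is less likely to push positions past the context window the model was trained on.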

Abstract (original)

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model’s context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs’ ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.

arXiv information

Authors	Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu
Published	2024-12-12 18:59:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
