Introducing Visual Perception Token into Multimodal Large Language Model

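Summary

To use visual information, a multimodal large language model (MLLM) relies on the perception process of its vision encoder.
The completeness and accuracy of that visual perception strongly affect performance on spatial reasoning, fine-grained understanding, and other tasks.
However, MLLMs still lack the ability to control their own visual perception autonomously, for example by selectively re-examining specific regions of an image or by focusing on information related to specific object categories.
This work proposes the concept of the Visual Perception Token, which aims to give an MLLM a mechanism for controlling its visual perception process.
Two types of Visual Perception Token are designed: the Region Selection Token and the Vision Re-Encoding Token.
The MLLM generates these tokens autonomously, just as it generates text, and uses them to trigger additional visual perception actions.
The Region Selection Token explicitly identifies specific regions of an image that need further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception.
Extensive experiments show the benefits of these tokens for spatial reasoning, fine-grained understanding, and other tasks.
On average, introducing Visual Perception Tokens improves a 2B model's performance by 23.6%, raising its score from 0.572 to 0.708, and the resulting model even outperforms a 7B-parameter model (score 0.624) by 13.4%.
The repository is available at https://github.com/yu-rp/VisualPerceptionToken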
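
To make the Region Selection Token idea more concrete, the following is a minimal sketch of how a generated region token could be turned into an image crop for a second perception pass. It is an illustration under stated assumptions, not the authors' implementation: the `<region> x_min y_min x_max y_max </region>` format, the 8 x 8 grid, and the helper names `find_region`, `cells_to_pixels`, and `crop` are all hypothetical and are not taken from the summary or the repository.

```python
import re

# Hypothetical token format, assumed for illustration only:
#   <region> x_min y_min x_max y_max </region>
# Coordinates are cell indices on a GRID x GRID partition of the image.
GRID = 8
REGION_PATTERN = re.compile(
    r"<region>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*</region>"
)


def find_region(generated_text):
    """Return (x_min, y_min, x_max, y_max) in grid cells, or None if absent."""
    match = REGION_PATTERN.search(generated_text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def cells_to_pixels(cells, width, height, grid=GRID):
    """Map inclusive grid-cell bounds to a pixel box (left, top, right, bottom)."""
    x_min, y_min, x_max, y_max = cells
    left = x_min * width // grid
    top = y_min * height // grid
    right = (x_max + 1) * width // grid
    bottom = (y_max + 1) * height // grid
    return left, top, right, bottom


def crop(image, box):
    """Crop a row-major image (a list of pixel rows) to the given pixel box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# Usage: a 16 x 16 dummy image and a model output containing a region token.
image = [[row * 16 + col for col in range(16)] for row in range(16)]
output = "The sign is too small to read. <region> 2 4 3 5 </region>"

cells = find_region(output)
if cells is not None:
    box = cells_to_pixels(cells, width=16, height=16)
    patch = crop(image, box)
    # In a real system the patch would be re-encoded by the vision encoder
    # and fed back to the model; here we only report its size.
    print(box, len(patch), len(patch[0]))
```

In this sketch the crop is what triggers the "additional visual perception action": the selected patch would be passed through the vision encoder again at higher effective resolution. The Vision Re-Encoding Token works differently (its hidden states act as the control signal) and is not covered by this example.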

Summary (original)

To utilize visual information, Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even outperforms a 7B parameter model by 13.4% (from 0.624). Please check out our repo https://github.com/yu-rp/VisualPerceptionToken
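
The percentages in the abstract are relative gains, which can be reproduced approximately from the quoted scores. The short sketch below recomputes them; the function name `relative_gain` is introduced here for illustration. With the rounded scores it yields roughly 23.8% and 13.5%, slightly above the reported 23.6% and 13.4%. A plausible explanation, stated here as an assumption, is that the reported figures were computed from unrounded scores.

```python
def relative_gain(new_score, base_score):
    """Relative improvement of new_score over base_score, in percent."""
    return (new_score - base_score) / base_score * 100.0


# Scores quoted in the abstract (rounded to three decimals).
score_2b_with_tokens = 0.708
score_2b_baseline = 0.572
score_7b_baseline = 0.624

gain_vs_2b = relative_gain(score_2b_with_tokens, score_2b_baseline)
gain_vs_7b = relative_gain(score_2b_with_tokens, score_7b_baseline)

print(f"gain over 2B baseline: {gain_vs_2b:.1f}%")
print(f"gain over 7B baseline: {gain_vs_7b:.1f}%")
```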

arXiv information

Authors	Runpeng Yu, Xinyin Ma, Xinchao Wang
Published	2025-02-24 18:56:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
