FlexAttention for Efficient High-Resolution Vision-Language Models

Summary

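Current high-resolution vision-language models (VLMs) encode an image as high-resolution image tokens and use every one of those tokens when computing attention, which significantly increases the computational cost.
To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models.
Specifically, a high-resolution image is encoded as both high-resolution and low-resolution tokens, and only the low-resolution tokens plus a few selected high-resolution tokens are used to compute the attention map, which greatly reduces the computational cost.
The high-resolution tokens are chosen by a high-resolution selection module, which retrieves the tokens of relevant regions based on an input attention map.
The selected high-resolution tokens are concatenated with the low-resolution tokens and the text tokens and fed into a hierarchical self-attention layer, which produces an attention map that can be used for the next step of high-resolution token selection.
The hierarchical self-attention step and the high-resolution token selection step are repeated for each attention layer.
Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (for example, by roughly 9% relative on V* Bench and roughly 7% on TextVQA) while reducing the computational cost by nearly 40%.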

Abstract (original)

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module which could retrieve tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated to the low-resolution tokens and text tokens, and input to a hierarchical self-attention layer which produces an attention map that could be used for the next-step high-resolution token selection. The hierarchical self-attention process and high-resolution token selection process are performed iteratively for each attention layer. Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs (e.g., relatively ~9% in V* Bench, ~7% in TextVQA), while also significantly reducing the computational cost by nearly 40%.
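
The selection-then-concatenation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `select_high_res_tokens` and `build_sequence`, the top-k rule over low-resolution cells, and the integer scale between the two grids are all assumptions made for illustration. In the paper the attention map comes from the hierarchical self-attention layer and the selection is repeated per layer; here the map is simply given as input.

```python
def select_high_res_tokens(attn_map, high_res_tokens, k):
    """Return the high-res tokens lying under the k most-attended low-res cells.

    attn_map:        h x w nested list of attention scores over the low-res grid
    high_res_tokens: (h*s) x (w*s) nested list of tokens, s = integer scale
    Hypothetical helper: the top-k-cells rule is an assumption, not the paper's.
    """
    h, w = len(attn_map), len(attn_map[0])
    s = len(high_res_tokens) // h  # high-res cells per low-res cell, per axis
    cells = [(r, c) for r in range(h) for c in range(w)]
    # sorted() is stable, so tied scores keep row-major order.
    top = sorted(cells, key=lambda rc: -attn_map[rc[0]][rc[1]])[:k]
    selected = []
    for r, c in top:
        # Gather the s x s block of high-res tokens under this low-res cell.
        for i in range(r * s, (r + 1) * s):
            for j in range(c * s, (c + 1) * s):
                selected.append(high_res_tokens[i][j])
    return selected


def build_sequence(low_res_tokens, selected_high_res, text_tokens):
    """Concatenate low-res, selected high-res, and text tokens (order assumed)."""
    return list(low_res_tokens) + list(selected_high_res) + list(text_tokens)


# Toy data: 2x2 low-res grid, 4x4 high-res grid (scale 2); tokens are labels.
attn_map = [[0.10, 0.70],
            [0.05, 0.15]]
high_res = [[f"hr{i}{j}" for j in range(4)] for i in range(4)]
low_res = ["lr00", "lr01", "lr10", "lr11"]
text = ["what", "is", "written", "?"]

selected = select_high_res_tokens(attn_map, high_res, k=1)
sequence = build_sequence(low_res, selected, text)
print(selected)       # ['hr02', 'hr03', 'hr12', 'hr13']
print(len(sequence))  # 12, versus 4 + 16 + 4 = 24 if every high-res token were kept
```

In this toy case the sequence passed to attention is half as long as it would be with all high-resolution tokens, which is the kind of saving the abstract refers to; the actual ratio depends on the grid sizes and on how many tokens are selected.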

arXiv information

Authors	Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan
Published	2024-07-29 17:59:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
