CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Abstract

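Recent vision-language models have progressed far beyond earlier expectations.
However, their computational cost has also grown dramatically with this rapid development, especially for large models.
This makes model acceleration critical in resource-constrained scenarios.
Although acceleration has been studied extensively for unimodal models, it remains relatively under-explored for multimodal models, especially vision-language Transformers.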
To pursue more efficient and accessible vision-language Transformers, this paper introduces \textbf{Cross}-\textbf{G}uided \textbf{E}nsemble of \textbf{T}okens (\textbf{\emph{CrossGET}}), a universal acceleration framework for vision-language Transformers.
The framework adaptively combines tokens through real-time cross-modal guidance, achieving substantial acceleration while maintaining high performance.
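\textit{CrossGET} has two key innovations: 1) \textit{Cross-Guided Matching and Ensemble}.
\textit{CrossGET} incorporates cross-modal guided token matching and ensemble to exploit cross-modal information effectively, introducing only cross-modal tokens with negligible extra parameters.
2) \textit{Complete-Graph Soft Matching}.
In contrast to the existing bipartite soft matching approach, \textit{CrossGET} introduces a complete-graph soft matching policy that yields more reliable token-matching results while remaining parallelizable and efficient.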
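Extensive experiments are conducted on a range of vision-language tasks, including image-text retrieval, visual reasoning, image captioning, and visual question answering.
Performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the effectiveness and versatility of the proposed \textit{CrossGET} framework.
The code will be available at \url{https://github.com/sdc17/CrossGET}.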
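
The abstract contrasts its complete-graph policy with "the existing bipartite soft matching approach" but does not spell out either algorithm. As a rough illustration only, the following is a minimal sketch of one common way a bipartite matching-and-averaging step can be written. It is not the authors' code and includes no cross-modal guidance; the function names `cosine` and `bipartite_merge`, the parameter `r`, the alternating split, and the plain averaging are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens with a bipartite matching sketch.

    tokens: list of feature vectors (lists of floats).
    r: number of tokens to remove (hypothetical parameter name).
    """
    # Split tokens into two disjoint sets by alternating position (assumption).
    src = tokens[0::2]
    dst = tokens[1::2]
    if not src or not dst or r <= 0:
        return [list(t) for t in tokens]

    # Each source token proposes an edge to its most similar destination token.
    edges = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        edges.append((sims[j], i, j))

    # Keep only the r strongest edges.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged = edges[:r]
    merged_src = {i for _, i, _ in merged}

    # Ensemble step: average each destination token with the sources merged into it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for _, i, j in merged:
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    new_dst = [[x / c for x in s] for s, c in zip(sums, counts)]

    # Unmerged source tokens pass through unchanged, followed by all destinations.
    kept_src = [list(s) for i, s in enumerate(src) if i not in merged_src]
    return kept_src + new_dst
```

Calling `bipartite_merge(tokens, 1)` on four tokens returns three. In this sketch two tokens that land in the same set can never be merged with each other, however similar they are; a policy that considers all token pairs would not have that restriction, which is presumably part of the motivation for the complete-graph variant.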

Abstract (original)

Recent vision-language models have achieved tremendous progress far beyond what we ever expected. However, their computational costs are also dramatically growing with rapid development, especially for the large models. It makes model acceleration exceedingly critical in a scenario of limited resources. Although extensively studied for unimodal models, the acceleration for multimodal models, especially the vision-language Transformers, is relatively under-explored. To pursue more efficient and accessible vision-language Transformers, this paper introduces \textbf{Cross}-\textbf{G}uided \textbf{E}nsemble of \textbf{T}okens (\textbf{\emph{CrossGET}}), a universal acceleration framework for vision-language Transformers. This framework adaptively combines tokens through real-time, cross-modal guidance, thereby achieving substantial acceleration while keeping high performance. \textit{CrossGET} has two key innovations: 1) \textit{Cross-Guided Matching and Ensemble}. \textit{CrossGET} incorporates cross-modal guided token matching and ensemble to exploit cross-modal information effectively, only introducing cross-modal tokens with negligible extra parameters. 2) \textit{Complete-Graph Soft Matching}. In contrast to the existing bipartite soft matching approach, \textit{CrossGET} introduces a complete-graph soft matching policy to achieve more reliable token-matching results while maintaining parallelizability and high efficiency. Extensive experiments are conducted on various vision-language tasks, including image-text retrieval, visual reasoning, image captioning, and visual question answering. Performance on both classic multimodal architectures and emerging multimodal LLMs demonstrate the effectiveness and versatility of the proposed \textit{CrossGET} framework. The code will be at \url{https://github.com/sdc17/CrossGET}.

arXiv information

Authors	Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang
Published	2023-11-24 18:39:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
