MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling

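Summary

Video-and-language understanding has many industrial applications, including video question answering, text-video retrieval, and multi-label classification.
Existing video-and-language understanding methods generally rely on heavy multi-modal encoders and feature fusion modules, which consume large amounts of GPU memory.
In particular, they struggle with the dense video frames and long text that are common in industrial applications.
This paper proposes MuLTI, a highly accurate and memory-efficient video-and-language understanding model that achieves efficient and effective feature fusion through feature sampling and attention modules.
As a result, MuLTI can process longer sequences with limited GPU memory.
Next, an attention-based adapter is introduced into the encoders; it fine-tunes the shallow features and improves model performance with low GPU memory consumption.
Finally, to further improve performance, a new pretraining task named Multiple Choice Modeling is introduced to bridge the task gap between pretraining and downstream tasks and to strengthen the model's ability to align video and text.
Benefiting from the efficient feature fusion module, the attention-based adapter, and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets.
The implementation and pretrained models will be released.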
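
To make the memory argument concrete, the following back-of-the-envelope sketch counts attention-score entries for two fusion schemes. It is an illustration under stated assumptions, not MuLTI's actual sampler: the function names, the token counts, and the idea of condensing video features to a fixed number of sampled tokens before fusion are all hypothetical choices made for this example.

```python
def full_fusion_scores(num_video_tokens: int, num_text_tokens: int) -> int:
    """Score entries when video and text tokens are concatenated and fused
    with self-attention (one head, one layer)."""
    total = num_video_tokens + num_text_tokens
    return total * total


def sampled_fusion_scores(
    num_video_tokens: int, num_text_tokens: int, num_sampled_tokens: int
) -> int:
    """Score entries for a hypothetical two-step scheme:
    1. a fixed set of query tokens cross-attends to all video tokens;
    2. text tokens cross-attend to the sampled tokens only.
    """
    sampling = num_sampled_tokens * num_video_tokens
    fusion = num_text_tokens * num_sampled_tokens
    return sampling + fusion


# Assumed sizes: 16 frames x 196 patches of video, 512 text tokens,
# and 16 sampled tokens. None of these numbers come from the paper.
video_tokens = 16 * 196
full = full_fusion_scores(video_tokens, 512)
sampled = sampled_fusion_scores(video_tokens, 512, 16)
print(f"full fusion: {full} entries, sampled fusion: {sampled} entries")
```

The count grows quadratically with sequence length in the first scheme and only linearly in the second, which is the general reason a sampling step can make longer inputs fit in a fixed memory budget.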

Summary (original)

Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which consume large amounts of GPU memory. Especially, they have difficulty dealing with dense video frames or long text that are prevalent in industrial applications. In this paper, we propose MuLTI, a highly accurate and memory-efficient video-and-language understanding model that achieves efficient and effective feature fusion through feature sampling and attention modules. Therefore, MuLTI can handle longer sequences with limited GPU memory. Then, we introduce an attention-based adapter to the encoders, which finetunes the shallow features to improve the model’s performance with low GPU memory consumption. Finally, to further improve the model’s performance, we introduce a new pretraining task named Multiple Choice Modeling to bridge the task gap between pretraining and downstream tasks and enhance the model’s ability to align the video and the text. Benefiting from the efficient feature fusion module, the attention-based adapter and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
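
The abstract does not say how Multiple Choice Modeling samples are built. One plausible reading, sketched here purely as an assumption, is to turn a video-caption pair into a multiple-choice item: the paired caption is the correct option and captions from other videos act as distractors. The function `build_multiple_choice_item` and its parameters are hypothetical names introduced for this illustration.

```python
import random


def build_multiple_choice_item(captions, target_index, num_options=4, rng=None):
    """Return (options, label) for one video.

    captions     -- list of captions, one per video in the batch
    target_index -- index of the caption paired with the current video
    num_options  -- total number of options, including the correct one
    rng          -- optional random.Random instance for reproducibility
    """
    if rng is None:
        rng = random.Random()
    if not 2 <= num_options <= len(captions):
        raise ValueError("num_options must be between 2 and len(captions)")

    # Distractors are captions that belong to other videos.
    distractor_pool = [i for i in range(len(captions)) if i != target_index]
    option_indices = rng.sample(distractor_pool, num_options - 1)
    option_indices.append(target_index)
    rng.shuffle(option_indices)

    options = [captions[i] for i in option_indices]
    label = option_indices.index(target_index)  # position of the correct option
    return options, label


captions = [
    "a dog runs on the beach",
    "a chef slices vegetables",
    "two people play chess",
    "a train arrives at a station",
    "a child flies a kite",
]
options, label = build_multiple_choice_item(captions, target_index=2, rng=random.Random(0))
assert options[label] == captions[2]
```

A model trained on such items would predict `label` from the video and the list of options, which resembles the answer-selection format of downstream video question answering.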

arXiv information

Authors	Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi
Published	2023-03-10 05:22:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
