OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

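Summary

This paper introduces OmniVL, a new foundation model that supports both image-language and video-language tasks with a single universal architecture.
It uses one unified transformer-based visual encoder for image and video inputs alike, which makes joint image-language and video-language pretraining possible.
The authors show, for the first time, that this paradigm benefits both image and video tasks, in contrast to the conventional one-directional transfer (e.g., using image-language data to help video-language tasks).
To this end, they propose a decoupled joint pretraining of image-language and video-language that effectively decomposes vision-language modeling into spatial and temporal dimensions and improves performance on both image and video tasks.
They also introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together.
As a result, both supervised and noisily supervised pretraining data are used as fully as possible.
Without adding any task-specific adaptors, OmniVL simultaneously supports visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning).
OmniVL is evaluated on a wide range of downstream tasks and achieves state-of-the-art or competitive results at comparable model size and data scale.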
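
The idea of one contrastive objective covering both caption-style pairs and label-style pairs can be illustrated with a small sketch. This is not the paper's UniVLC implementation, which the abstract does not spell out. It is a minimal label-aware contrastive loss, assuming that pairs sharing a label id count as positives, that each image-text or video-text pair gets its own unique id, and that embeddings are already L2-normalised. The function name `unified_contrastive_loss` and the default `temperature=0.07` are hypothetical choices made for illustration.

```python
import math


def unified_contrastive_loss(visual, text, labels, temperature=0.07):
    """Symmetric, label-aware contrastive loss over one batch (illustrative only).

    visual, text: lists of L2-normalised embedding vectors, one per pair.
    labels: one integer id per pair. Pairs that share an id are treated as
        positives (assumption: caption pairs get unique ids, class-label
        pairs reuse the class id).
    """
    n = len(visual)
    # Similarity of every visual embedding to every text embedding.
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text]
        for v in visual
    ]

    def directional(logits):
        total = 0.0
        for i in range(n):
            row = logits[i]
            # Numerically stable log-sum-exp over the row.
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            positives = [j for j in range(n) if labels[j] == labels[i]]
            # Average negative log-likelihood over all positives of item i.
            total += -sum(row[j] - log_z for j in positives) / len(positives)
        return total / n

    # Transpose to get the text-to-visual direction.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (directional(sim) + directional(sim_t))


# Two caption-style pairs (unique ids): reduces to the usual InfoNCE form.
aligned = unified_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   labels=[0, 1])
print(aligned)
```

With unique ids only the diagonal entries are positives, so the loss matches a standard image-text contrastive loss. When several samples share a class id, all of them are pulled toward the same label text, which is how supervised and noisily supervised data can share one objective in this sketch.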

Summary (original)

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., use image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.

arXiv information

Authors	Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
Published	2022-09-15 17:59:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
