One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

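Summary

Current multimodal models for Vision and Language (V+L) tasks mostly repurpose Vision Encoders (VEs) as feature extractors.
Many VEs with different architectures, trained on different data and objectives, are publicly available, but they were not designed for downstream V+L tasks.
Even so, most current work assumes that a \textit{single} pre-trained VE can serve as a general-purpose encoder.
In this work, we evaluate whether the information stored in different VEs is complementary, that is, whether giving the model features from multiple VEs improves performance on a target task.
We experiment exhaustively with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns.
Our results and analyses suggest that diverse VEs complement each other, which improves downstream V+L task performance. The gains are not a simple ensemble effect (that is, adding more encoders does not always improve performance).
We show that future VEs that are explicitly \textit{designed} for V+L tasks, rather than \textit{repurposed}, have the potential to improve performance on target V+L tasks.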

Summary (original)

Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs — of different architectures, trained on different data and objectives — are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a \textit{single} pre-trained VE can serve as a general-purpose encoder. In this work, we evaluate whether the information stored within different VEs is complementary, i.e. if providing the model with features from multiple VEs can improve the performance on a target task. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our results and analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, where the improvements are not due to simple ensemble effects (i.e. the performance does not always improve when increasing the number of encoders). We demonstrate that future VEs, which are not \textit{repurposed}, but explicitly \textit{designed} for V+L tasks, have the potential of improving performance on the target V+L tasks.
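The abstract does not say how features from several VEs are combined or how VE-dropout is applied, so the following is only a minimal sketch under stated assumptions: each encoder yields a sequence of feature vectors already projected to a common width, the sequences are concatenated, and dropout removes an encoder's whole sequence at random. The function name `combine_features` and the parameter `drop_prob` are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List, Optional

# One encoder's output: a sequence of feature vectors (assumed same width).
Features = List[List[float]]


def combine_features(
    per_encoder: Dict[str, Features],
    drop_prob: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Features:
    """Concatenate feature sequences from several encoders.

    Assumption (not from the paper): each encoder's whole sequence is
    dropped independently with probability drop_prob, and at least one
    encoder is always kept so the model never sees an empty input.
    """
    if rng is None:
        rng = random.Random()
    names = list(per_encoder)
    if not names:
        raise ValueError("need at least one encoder")

    # Decide, encoder by encoder, whether its features survive this step.
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(names)]

    # Concatenate the surviving sequences in the original encoder order.
    combined: Features = []
    for name in kept:
        combined.extend(per_encoder[name])
    return combined
```

With `drop_prob=0.0` the function simply concatenates all encoders' features, which corresponds to the multi-VE setting described above; a non-zero `drop_prob` would be used during training only.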

arXiv information

Authors	Gregor Geigle, Chen Liu, Jonas Pfeiffer, Iryna Gurevych
Published	2022-10-12 16:31:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
