A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

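Abstract

Recently there has been a surge of computer vision models that perform many tasks and consist of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer).
However, most of this work simply presents a single system and its results, leaving many questions about the design decisions and trade-offs of such systems unanswered.
In this work, we aim to provide such answers.
We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision, including classification, captioning, visual question answering, and optical character recognition.
Through extensive systematic experiments, we study the effects of task and data mixture, training and regularization hyperparameters, conditioning type and specificity, modality combination, and more.
Importantly, we compare these to well-tuned single-task baselines to highlight the cost incurred by multi-tasking.
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
We call this setup locked-image tuning with decoder (LiT-decoder).
It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.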
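
The "locked" part of this setup can be illustrated with a deliberately tiny sketch: the encoder weight is never updated, and only the decoder weights receive gradient steps. Everything here is an assumption made for illustration (the scalar `encode`, the linear `decode`, the squared-error loss, and all values); it is not the paper's architecture, which uses a ViT encoder and an autoregressive Transformer decoder.

```python
# Toy illustration only: a scalar "encoder" and a linear "decoder".
# All names and values are hypothetical, not taken from the paper.

def encode(image, enc_w):
    # Frozen feature extractor: one feature per pixel value.
    return [enc_w * px for px in image]

def decode(features, dec_w):
    # Tiny trainable head: weighted sum of the features.
    return sum(w * f for w, f in zip(dec_w, features))

def train_step(image, target, enc_w, dec_w, lr=0.1):
    features = encode(image, enc_w)
    error = decode(features, dec_w) - target
    # Gradient of 0.5 * error**2 w.r.t. each decoder weight is error * feature.
    new_dec_w = [w - lr * error * f for w, f in zip(dec_w, features)]
    # The encoder weight is returned unchanged: it is "locked".
    return enc_w, new_dec_w, 0.5 * error ** 2

enc_w, dec_w = 2.0, [0.0, 0.0]
image, target = [0.5, 1.0], 3.0
losses = []
for _ in range(20):
    enc_w, dec_w, loss = train_step(image, target, enc_w, dec_w)
    losses.append(loss)
```

In this toy the loss shrinks while `enc_w` stays at its initial value, which is the only property the sketch is meant to show.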

Abstract (original)

There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions regarding design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answers. We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision, including classification, captioning, visual question answering, and optical character recognition. Through extensive systematic experiments, we study the effects of task and data mixture, training and regularization hyperparameters, conditioning type and specificity, modality combination, and more. Importantly, we compare these to well-tuned single-task baselines to highlight the cost incurred by multi-tasking. A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well. We call this setup locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.

arXiv information

Authors	Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
Published	2023-03-30 13:42:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
