Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Summary

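We propose Unified-IO, a model that performs a wide variety of AI tasks, ranging from classical computer vision tasks such as pose estimation, object detection, depth estimation, and image generation, through vision-and-language tasks such as region captioning and referring expression comprehension, to natural language processing tasks such as question answering and paraphrasing.
Developing a single unified model for so many tasks poses unique challenges, because the inputs and outputs associated with each task are heterogeneous: RGB images, per-pixel maps, binary masks, bounding boxes, and language.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
This representation, shared by all tasks, lets us train a single transformer-based architecture jointly on more than 80 diverse datasets from the vision and language fields.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results on 16 diverse benchmarks, including NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific or benchmark-specific fine-tuning.
Demos for Unified-IO are available at https://unified-io.allenai.org.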
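
To make the idea of turning a structured output into discrete vocabulary tokens more concrete, here is a minimal Python sketch that quantizes a bounding box into location tokens and back. It is an illustration only, not the authors' implementation: the bin count `NUM_BINS`, the `<loc_i>` token naming, and the helper names `box_to_tokens`, `tokens_to_box`, and `encode_detection` are all assumptions introduced for this example.

```python
from typing import List, Sequence, Tuple

# Assumed number of coordinate bins; chosen for illustration only.
NUM_BINS = 1000


def box_to_tokens(box: Sequence[float],
                  image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel box (x0, y0, x1, y1) into four hypothetical <loc_i> tokens."""
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        # Normalize to [0, 1], then map to an integer bin in [0, num_bins - 1].
        frac = min(max(value / scale, 0.0), 1.0)
        idx = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<loc_{idx}>")
    return tokens


def tokens_to_box(tokens: Sequence[str],
                  image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Map four <loc_i> tokens back to approximate pixel coordinates (bin centers)."""
    width, height = image_size
    scales = (width, height, width, height)
    coords = []
    for token, scale in zip(tokens, scales):
        idx = int(token[len("<loc_"):-1])
        coords.append((idx + 0.5) / num_bins * scale)
    return tuple(coords)


def encode_detection(label: str,
                     box: Sequence[float],
                     image_size: Tuple[int, int]) -> List[str]:
    """Represent one detection as a flat token sequence: location tokens, then label words."""
    return box_to_tokens(box, image_size) + label.split()


if __name__ == "__main__":
    tokens = encode_detection("traffic light", (160, 120, 320, 240), (640, 480))
    print(tokens)
    print(tokens_to_box(tokens[:4], (640, 480)))
```

Under these assumptions, a box and its label become one flat sequence of tokens, so a sequence-to-sequence model can emit detections with the same decoder it uses for text. The round trip is lossy: each coordinate is recovered only to within one bin width.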

Summary (original)

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression comprehension, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task or benchmark specific fine-tuning. Demos for Unified-IO are available at https://unified-io.allenai.org.

arXiv information

Authors	Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
Published	2022-06-17 17:53:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
