VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Summary

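We introduce VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework.
Unlike conventional MLLMs, which are limited to text output, VisionLLM v2 substantially broadens the range of applications.
It excels not only at conventional visual question answering (VQA) but also at open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing.
To this end, we propose a new information transmission mechanism called "super link" as the medium connecting the MLLM to task-specific decoders.
It allows task information and gradient feedback to be passed flexibly between the MLLM and multiple downstream decoders, and it also effectively resolves training conflicts in multi-task scenarios.
In addition, to support a diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks.
As a result, our model can be jointly trained end-to-end on hundreds of vision-language tasks and can generalize to these tasks with a single set of shared parameters through different user prompts, achieving performance comparable to task-specific models.
We believe VisionLLM v2 offers a new perspective on the generalization of MLLMs.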

Abstract (original)

We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed ‘super link’, as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be joint-trained end-to-end on hundreds of vision language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

arXiv information

Authors	Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai
Published	2024-06-12 16:44:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
