OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

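Summary

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding.
However, they lack reasoning abilities and cannot be controlled through text instructions.
In contrast, large vision-language multimodal models show strong vision-based conversation and reasoning capabilities, but they lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction.
This paper proposes OMG-LLaVA, a new and elegant framework that combines strong pixel-level visual understanding with reasoning abilities.
It accepts a variety of visual and text prompts, enabling flexible user interaction.
Specifically, a universal segmentation method serves as the visual encoder, integrating image information, perception priors, and visual prompts into the visual tokens provided to the LLM.
The LLM is responsible for understanding the user's text instructions and for producing text responses and pixel-level segmentation results based on the visual information.
The authors propose perception prior embedding to better integrate perception priors with image features.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks.
Rather than using an LLM to connect separate specialist models, the work aims at end-to-end training with one encoder, one decoder, and one LLM.
The code and model have been released for further research.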

Summary (original)

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user’s text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
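
The abstract names perception prior embedding but does not spell out how it works. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the segmentation encoder yields per-pixel features, a set of object query vectors, and per-object mask logits, and it adds to each pixel feature a softmax-weighted mix of the object queries. The function name, argument names, shapes, and the softmax weighting are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def perception_prior_embedding(pixel_features, object_queries, mask_logits):
    """Fuse object-level priors into pixel-level features (hypothetical sketch).

    pixel_features: list of P vectors of length D (one per pixel).
    object_queries: list of N vectors of length D (one per object).
    mask_logits:    N x P nested list; mask_logits[n][p] scores how strongly
                    pixel p belongs to object n.
    Returns a list of P vectors of length D.
    """
    dim = len(pixel_features[0])
    tokens = []
    for p, feature in enumerate(pixel_features):
        # Assumption: per-pixel weights come from a softmax across objects.
        weights = softmax([row[p] for row in mask_logits])
        # Weighted mix of object queries acts as the prior for this pixel.
        prior = [
            sum(w * query[d] for w, query in zip(weights, object_queries))
            for d in range(dim)
        ]
        # Assumption: the prior is added to the pixel feature.
        tokens.append([f + g for f, g in zip(feature, prior)])
    return tokens
```

Under these assumptions, a pixel that clearly belongs to one object receives roughly that object's query as its prior, while an ambiguous pixel receives a blend of the candidate queries.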

arXiv information

Authors	Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
Published	2024-06-27 17:59:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
