Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

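Summary

Multi-modal large language models (MLLMs) have shown remarkable capabilities across a variety of multi-modal tasks.
Nevertheless, their performance on fine-grained image understanding tasks remains limited.
To address this, the paper proposes a new framework for strengthening the fine-grained image understanding abilities of MLLMs.
Specifically, it presents a new method for constructing an instruction-tuning dataset at low cost by leveraging the annotations in existing datasets.
It also introduces a self-consistent bootstrapping method that extends existing dense object annotations into high-quality pairs of referring expressions and bounding boxes.
Together, these methods make it possible to generate high-quality instruction data covering a wide range of the fundamental abilities needed for fine-grained image perception.
The authors further argue that the visual encoder should be tuned during instruction tuning to narrow the gap between full-image perception and fine-grained image perception.
Experimental results demonstrate the superior performance of the method.
For instance, the model shows a 5.2% accuracy improvement over Qwen-VL on GQA and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val.
It also attains the top rank on the MMBench leaderboard.
This performance is achieved by training only on publicly available data, which makes it easy to reproduce.
The models, datasets, and code are publicly available at https://github.com/SY-Xuan/Pink.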

Abstract (original)

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in various multi-modal tasks. Nevertheless, their performance in fine-grained image understanding tasks is still limited. To address this issue, this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically, we present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. A self-consistent bootstrapping method is also introduced to extend existing dense object annotations into high-quality referring-expression-bounding-box pairs. These methods enable the generation of high-quality instruction data which includes a wide range of fundamental abilities essential for fine-grained image perception. Moreover, we argue that the visual encoder should be tuned during instruction tuning to mitigate the gap between full image perception and fine-grained image perception. Experimental results demonstrate the superior performance of our method. For instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We also attain the top rank on the leaderboard of MMBench. This promising performance is achieved by training on only publicly available data, making it easily reproducible. The models, datasets, and codes are publicly available at https://github.com/SY-Xuan/Pink.
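The abstract does not say how the self-consistent bootstrapping step decides which referring-expression-bounding-box pairs to keep. One plausible reading, offered only as a sketch, is a round-trip check: describe an annotated box, ground that description back to a box, and keep the pair only when the two boxes overlap enough. The names `describe`, `ground`, and `iou_threshold`, the 0.5 default, and the `(x1, y1, x2, y2)` box layout are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Callable, List, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def self_consistent_pairs(
    boxes: List[Box],
    describe: Callable[[Box], str],   # hypothetical: box -> referring expression
    ground: Callable[[str], Box],     # hypothetical: referring expression -> box
    iou_threshold: float = 0.5,       # assumed cutoff, not from the paper
) -> List[Tuple[str, Box]]:
    """Keep (expression, box) pairs that survive the round trip."""
    kept = []
    for box in boxes:
        expression = describe(box)
        predicted = ground(expression)
        # The pair counts as self-consistent when grounding the generated
        # expression lands close to the annotated box it was generated from.
        if iou(box, predicted) >= iou_threshold:
            kept.append((expression, box))
    return kept


if __name__ == "__main__":
    # Toy stand-ins for the two model calls, used only to exercise the filter.
    expressions = {(0, 0, 10, 10): "the red cup", (20, 20, 30, 30): "the dog"}
    groundings = {"the red cup": (1, 1, 10, 10), "the dog": (0, 0, 5, 5)}
    print(self_consistent_pairs(list(expressions), expressions.get, groundings.get))
```

In a real pipeline both callables would be calls to the model being bootstrapped, and the threshold would trade dataset size against label quality.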

arXiv information

Authors	Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang
Published	2023-11-21 10:32:13+00:00

Source and services used

arxiv.jp, Google
