Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

Summary

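Despite recent progress in the general visual instruction-following ability of multimodal large language models (MLLMs), they still suffer from serious problems when they must give accurate and detailed responses to visual instructions: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of objects' attribute details.
Intuitive solutions include increasing the size and quality of the data or using a larger foundation model.
These are effective in mitigating the problems, but at the high cost of collecting vast amounts of new data and introducing significantly larger models.
Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process performed by the multimodal connector.
In this paper, we first identify limitations of multimodal connectors that stem from insufficient training data.
Motivated by this, we propose enhancing the mapping with retrieval-augmented tag tokens, which carry rich object-aware information such as object names and attributes.
With Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks.
Furthermore, we demonstrate TUNA's zero-shot capability when it is provided with specific datastores.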
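
The general idea of retrieving tags from a datastore can be illustrated with a toy sketch. The abstract does not describe how TUNA retrieves or encodes its tag tokens, so everything below is an assumption made for illustration only: the datastore layout (embedding plus tag list), the cosine-similarity ranking, the function names `retrieve_tags` and `build_prompt`, and the choice to place tags in a text prompt rather than feed them through a multimodal connector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tags(query, datastore, k=2):
    """Return de-duplicated tags from the k entries most similar to the query.

    Hypothetical helper: `datastore` is a list of (embedding, tags) pairs.
    """
    ranked = sorted(datastore, key=lambda e: cosine(query, e[0]), reverse=True)
    tags = []
    for _, entry_tags in ranked[:k]:
        for tag in entry_tags:
            if tag not in tags:
                tags.append(tag)
    return tags

def build_prompt(instruction, tags):
    """Hypothetical helper: prepend retrieved tags to the instruction text."""
    return f"Tags: {', '.join(tags)}\nInstruction: {instruction}"

# Toy datastore with made-up 3-dimensional "image embeddings".
DATASTORE = [
    ([1.0, 0.0, 0.0], ["red bicycle", "street"]),
    ([0.0, 1.0, 0.0], ["golden retriever", "grass"]),
    ([0.7, 0.7, 0.0], ["bicycle", "helmet"]),
]

tags = retrieve_tags([0.9, 0.1, 0.0], DATASTORE, k=2)
prompt = build_prompt("Describe the image in detail.", tags)
print(prompt)
```

Swapping in a different datastore changes the retrieved tags without retraining anything, which is one way to read the zero-shot claim in the summary, though the paper itself should be consulted for the actual mechanism.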

Abstract (original)

Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of object’s attributed details. Intuitive solutions include improving the size and quality of data or using larger foundation models. They show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Driven by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.

arXiv information

Authors	Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li
Published	2024-11-12 05:33:05+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
