X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

Summary

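Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs).
This paper introduces a simple yet effective cross-modality framework, built on top of frozen LLMs, that integrates various modalities without extensive modality-specific customization.
To facilitate instruction-modality fine-tuning, we collect high-quality instruction-tuning data in an automatic and scalable manner, consisting of 24,000 QA samples for audio and 250,000 QA samples for 3D.
By leveraging instruction-aware representations, our model performs comparably with state-of-the-art models without requiring extensive modality-specific pre-training or customization.
Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, even though each modality projection is trained individually.
To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, consisting of 9K audio-video QA samples and 28K image-3D QA samples, which requires the model to reason discriminatively across disparate input modalities.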

Abstract (original)

Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization. To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, composed of 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably with leading-edge counterparts without the need of extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, despite each modality projection being trained individually. To study the model’s cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.

arXiv information

Authors	Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Published	2023-11-30 18:43:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
