Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Summary

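Growing demand for intelligent systems that can interpret and reason about visual content calls for large multi-modal models (LMMs) that are not only accurate but also capable of explicit reasoning.
This paper presents a new approach for giving an LMM the ability to reason explicitly from visual content and textual instructions.
We introduce a system that can ask a question to acquire the knowledge it needs, which improves the robustness and explainability of the reasoning process.
Our method includes building a new dataset, generated by a large language model (LLM), that is designed to promote chain-of-thought reasoning combined with a question-asking mechanism.
We designed an LMM with strong region awareness to address the intricate requirements of image-text alignment.
The model is trained in three stages: large-scale image-text alignment on large-scale datasets, then instruction tuning, then fine-tuning focused on chain-of-thought reasoning.
The results mark a step toward a more robust, accurate, and interpretable LMM that reasons explicitly and proactively seeks information when faced with ambiguous visual input.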

Abstract (original)

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of Large Multi-Modal Models (LMMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to imbue an LMM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. Our method comprises the development of a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We designed an LMM with strong region-awareness capabilities to address the intricate requirements of image-text alignment. The model undergoes a three-stage training phase, starting with large-scale image-text alignment using large-scale datasets, followed by instruction tuning, and fine-tuning with a focus on chain-of-thought reasoning. The results demonstrate a stride toward a more robust, accurate, and interpretable LMM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.

arXiv information

Authors	Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada
Published	2024-01-18 14:21:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
