Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild

Summary

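Complex visual reasoning remains an important challenge today.
It is usually tackled with methods such as Chain of Thought (COT) and visual instruction tuning.
However, how to combine these two methods organically for greater gains has not yet been worked out.
Problems such as hallucination and high training cost also still need to be addressed.
In this work, we devise an innovative multi-round training and reasoning framework suited to lightweight Multimodal Large Language Models (MLLMs).
Our self-questioning approach heuristically guides the MLLM to focus on visual clues relevant to the target problem, reducing hallucinations and strengthening the model's ability to describe fine-grained image details.
This ultimately enables the model to perform well on complex visual reasoning and question-answering tasks.
We name this framework Socratic Questioning (SQ).
To facilitate future research, we created CapQA, a multimodal mini-dataset of 1,000 images of fine-grained activities, for visual instruction tuning and evaluation. The proposed SQ method led to a 31.2% improvement in the hallucination score.
Extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning, and hallucination mitigation.
Our model and code will be made publicly available.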

Abstract (original)

Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (COT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Also, issues like hallucinations and high training cost still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model’s ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. Our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ’s remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning and hallucination mitigation. Our model and code will be publicly available.

arXiv information

Authors	Wanpeng Hu, Haodi Liu, Lin Chen, Feng Zhou, Changming Xiao, Qi Yang, Changshui Zhang
Published	2025-01-07 02:55:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
