IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

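Summary

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs).
However, these models still fall short on zero-shot reasoning tasks that require multi-step inference.
To handle such tasks, previous work resorts to a divide-and-conquer pipeline.
In this paper, we argue that previous efforts have several inherent shortcomings: 1) they rely on domain-specific sub-question decomposition models;
2) they force the model to predict a final answer even when the sub-questions or sub-answers provide insufficient information.
We address these limitations with IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs).
Specifically, IdealGPT uses an LLM to generate sub-questions, a VLM to provide the corresponding sub-answers, and another LLM to reason toward the final answer.
These three modules repeat the divide-and-conquer procedure until the model is confident about the final answer to the main question.
We evaluate IdealGPT on multiple challenging VL reasoning tasks in a zero-shot setting.
In particular, IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
Code is available at https://github.com/Hxyou/IdealGPT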

Abstract (original)

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
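The iterative procedure described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function `iterative_decompose`, the three callables (`ask`, `answer`, `reason`), the `max_rounds` cap, and the toy stand-ins for the LLMs and the VLM are all hypothetical names and assumptions introduced here. The abstract does not say what happens when confidence is never reached; this sketch simply returns `None`.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases; none of these names come from the IdealGPT code.
Evidence = List[Tuple[str, str]]                    # (sub-question, sub-answer) pairs
Questioner = Callable[[str, Evidence], List[str]]   # stands in for the sub-question LLM
Answerer = Callable[[str], str]                     # stands in for the VLM (image bound inside)
Reasoner = Callable[[str, Evidence], Optional[str]] # stands in for the reasoning LLM


def iterative_decompose(
    main_question: str,
    ask: Questioner,
    answer: Answerer,
    reason: Reasoner,
    max_rounds: int = 4,  # assumed cap; not specified in the abstract
) -> Tuple[Optional[str], int]:
    """Repeat ask -> answer -> reason until the reasoner returns an answer."""
    evidence: Evidence = []
    for round_idx in range(1, max_rounds + 1):
        # 1) Generate new sub-questions given what is already known.
        for sub_q in ask(main_question, evidence):
            # 2) Answer each sub-question and record the pair.
            evidence.append((sub_q, answer(sub_q)))
        # 3) Try to conclude; None means "not confident yet".
        final = reason(main_question, evidence)
        if final is not None:
            return final, round_idx
    return None, max_rounds


# Toy stand-ins so the sketch runs without any model.
def toy_ask(question: str, evidence: Evidence) -> List[str]:
    asked = {sub_q for sub_q, _ in evidence}
    pool = ["What is the person holding?", "Is the ground wet?"]
    return [sub_q for sub_q in pool if sub_q not in asked][:1]


def toy_answer(sub_q: str) -> str:
    facts = {"What is the person holding?": "an umbrella", "Is the ground wet?": "yes"}
    return facts.get(sub_q, "unknown")


def toy_reason(question: str, evidence: Evidence) -> Optional[str]:
    answers = {sub_a for _, sub_a in evidence}
    # Confident only once both pieces of evidence are available.
    return "It is raining." if {"an umbrella", "yes"} <= answers else None


result = iterative_decompose(
    "Why is the person carrying that object?", toy_ask, toy_answer, toy_reason
)
print(result)
```

With the toy stand-ins, the reasoner declines to answer after the first round and concludes after the second, which mirrors the "iterate until confident" behavior the abstract describes. In a real setting the three callables would wrap model calls, and the answerer would also receive the image.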

arXiv information

Authors	Haoxuan You,Zhecan Wang,Rui Sun,Long Chen,Gengyu Wang,Hammad A. Ayyubi,Kai-Wei Chang,Shih-Fu Chang
Published	2025-04-11 07:26:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
