Improve Vision Language Model Chain-of-thought Reasoning

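Summary

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is important for improving interpretability and trustworthiness.
However, current training recipes lack robust CoT reasoning data and rely on datasets dominated by short annotations with minimal rationales.
This work shows that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses.
To address this, we propose a two-part approach.
First, we distill rationales from the GPT-4o model to enrich the training data, then fine-tune VLMs on it to improve their CoT performance.
Second, we apply reinforcement learning to further calibrate reasoning quality.
Specifically, we build positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with the annotated short answers.
Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning ability.
Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets, along with better generalization to direct answer prediction.
This work highlights the importance of incorporating detailed rationales into training and of using reinforcement learning to strengthen the reasoning capabilities of VLMs.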

Summary (original)

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model’s reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
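
The pair construction and Direct Preference Optimization step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names `build_preference_pairs` and `dpo_loss`, the normalized exact-match rule for deciding correctness, the choice to pair every correct chain with every incorrect one, and the value `beta=0.1` are all assumptions made for the example. The loss itself is the standard per-pair DPO objective, the negative log-sigmoid of the scaled difference in log-probability ratios between the policy and a frozen reference model.

```python
import math


def build_preference_pairs(samples, gold_answer):
    """Split sampled reasoning chains into (positive, negative) pairs.

    samples: list of (reasoning_chain, predicted_answer) for one question.
    gold_answer: the annotated short answer for that question.
    Assumption: correctness is decided by normalized exact match.
    """
    def normalize(answer):
        return answer.strip().lower()

    gold = normalize(gold_answer)
    positives = [chain for chain, pred in samples if normalize(pred) == gold]
    negatives = [chain for chain, pred in samples if normalize(pred) != gold]
    # Assumption: every correct chain is paired with every incorrect chain.
    return [(pos, neg) for pos in positives for neg in negatives]


def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    Each argument is the summed log-probability of a full reasoning chain
    under the trainable policy or the frozen reference model.
    """
    margin = beta * (
        (policy_logp_pos - ref_logp_pos) - (policy_logp_neg - ref_logp_neg)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical samples for a question whose annotated short answer is "4".
samples = [
    ("There are 2 red and 2 blue blocks, so 4.", "4"),
    ("I count 5 blocks in the image, so 5.", "5"),
]
pairs = build_preference_pairs(samples, "4")
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; raising the policy's log-probability of the correct chain relative to the incorrect one lowers it. In practice the log-probabilities would come from a VLM conditioned on the image and question, which this sketch leaves out.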

arXiv information

Authors	Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
Published	2024-10-21 17:00:06+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
