Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Abstract

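Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have shown strong language reasoning capabilities across a wide range of scenarios.
However, this fine-tuning paradigm may not efficiently learn optimal decision-making agents for multi-step, goal-directed tasks in interactive environments.
To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL).
Specifically, our framework provides a task description and prompts the VLM to generate chain-of-thought (CoT) reasoning, so that the VLM can efficiently explore the intermediate reasoning steps that lead to the final text-based action.
Next, the free-form text output is parsed into an executable action, which interacts with the environment to obtain goal-directed task rewards.
Finally, our framework uses these task rewards to fine-tune the entire VLM with RL.
Empirically, we show that the proposed framework strengthens the decision-making capabilities of VLM agents across a variety of tasks, enabling 7b models to outperform commercial models such as GPT4-V and Gemini.
Furthermore, we find that CoT reasoning is a crucial component of the performance gain: removing it significantly degrades the overall performance of the method.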
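
The loop described above (prompt, generate CoT text, parse it into an action, step the environment, record the reward) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the toy LineWorld environment, the "action: ..." output format, the stub_policy function that stands in for a real VLM, and the random fallback for unparseable output. The RL update that would consume the collected rewards is left out.

```python
import random
import re

# Hypothetical action set for a toy task (not from the paper).
LEGAL_ACTIONS = ("left", "right", "stay")


def parse_action(text):
    """Extract the last 'action: <name>' from free-form text.

    The 'action:' line format is an assumed convention. Returns None
    when no legal action can be recovered from the text.
    """
    matches = re.findall(r'action\s*:\s*"?([a-z]+)"?', text.lower())
    if matches and matches[-1] in LEGAL_ACTIONS:
        return matches[-1]
    return None


class LineWorld:
    """Toy environment: start at position 0, reach position `goal`."""

    def __init__(self, goal=3, max_steps=10):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0

    def reset(self):
        self.pos = 0
        self.t = 0
        return self.pos

    def step(self, action):
        self.pos += {"left": -1, "right": 1, "stay": 0}[action]
        self.t += 1
        reached = self.pos == self.goal
        done = reached or self.t >= self.max_steps
        reward = 1.0 if reached else 0.0  # goal-directed task reward
        return self.pos, reward, done


def stub_policy(task_description, observation):
    """Stand-in for the VLM: reasoning text followed by an action line."""
    return (
        f"thoughts: I am at {observation}; the goal lies to the right.\n"
        "action: right"
    )


def collect_episode(env, policy, task_description, rng):
    """Roll out one episode and return (obs, text, action, reward) tuples."""
    trajectory = []
    obs = env.reset()
    done = False
    while not done:
        output = policy(task_description, obs)
        action = parse_action(output)
        if action is None:
            # Assumed fallback: sample a legal action when parsing fails.
            action = rng.choice(LEGAL_ACTIONS)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, output, action, reward))
        obs = next_obs
    return trajectory


if __name__ == "__main__":
    episode = collect_episode(
        LineWorld(goal=3), stub_policy, "Reach position 3.", random.Random(0)
    )
    for obs, _, action, reward in episode:
        print(obs, action, reward)
```

In a real setup, the stored text outputs and rewards would be passed to an RL algorithm that updates the model's parameters; the abstract does not say which algorithm is used, so that step is omitted here.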

Abstract (original)

Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.

arXiv information

Authors	Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Published	2024-05-16 17:50:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
