Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

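Summary

Advances in large language models (LLMs) have driven the development of multi-modal agents, in which a model acts as a controller that calls external tools, offering a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for strong tool-usage reasoning. To preserve data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, and then apply query-file and trajectory verifiers. Using this data synthesis pipeline, we collect MM-Traj, a dataset of 20K tasks with tool-usage trajectories. We then develop the T3-Agent through trajectory tuning of VLMs for tool usage on MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent achieves consistent improvements with two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming the untrained VLMs by 20%. This indicates that the proposed data synthesis pipeline yields high-quality data for tool-usage capabilities.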

Summary (original)

The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. Then, we develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs: MiniCPM-V-8.5B and Qwen2-VL-7B, which outperforms untrained VLMs by 20%, showing the effectiveness of the proposed data synthesis pipeline, leading to high-quality data for tool-usage capabilities.
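The generate-then-verify idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Task` class, the `synthesize` function, and every callable passed to it are hypothetical names introduced here, the generators are plain stubs standing in for prompts to a model such as GPT-4o mini, and the ordering (query-file check before trajectory generation) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """One synthetic tool-usage task (hypothetical structure)."""
    query: str
    files: List[str]
    trajectory: List[str] = field(default_factory=list)


def synthesize(
    generate_query: Callable[[], str],
    generate_files: Callable[[str], List[str]],
    generate_trajectory: Callable[[str, List[str]], List[str]],
    verify_query_files: Callable[[str, List[str]], bool],
    verify_trajectory: Callable[[Task], bool],
    num_candidates: int,
) -> List[Task]:
    """Generate candidate tasks and keep only those passing both verifiers."""
    kept: List[Task] = []
    for _ in range(num_candidates):
        query = generate_query()
        files = generate_files(query)
        # Assumed ordering: discard bad query-file pairs before
        # spending effort on a trajectory.
        if not verify_query_files(query, files):
            continue
        task = Task(query, files, generate_trajectory(query, files))
        if verify_trajectory(task):
            kept.append(task)
    return kept


# Toy stubs; a real pipeline would call a model here instead.
queries = iter(["count the objects in the image", "", "summarize the report"])

dataset = synthesize(
    generate_query=lambda: next(queries),
    generate_files=lambda q: [f"file_{len(q)}.png"] if q else [],
    generate_trajectory=lambda q, fs: [f"load {fs[0]}", "answer"],
    verify_query_files=lambda q, fs: bool(q) and bool(fs),
    verify_trajectory=lambda t: bool(t.trajectory) and t.trajectory[-1] == "answer",
    num_candidates=3,
)
```

In this toy run the empty query fails the query-file check, so only two of the three candidates are kept. Passing the generators and verifiers as callables keeps the filtering logic independent of whichever model or rules back them.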

arXiv information

Authors	Zhi Gao,Bofei Zhang,Pengxiang Li,Xiaojian Ma,Tao Yuan,Yue Fan,Yuwei Wu,Yunde Jia,Song-Chun Zhu,Qing Li
Published	2025-02-03 12:56:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
