AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

要約

ビジョン言語モデル（VLM）は、自律運転の約束を示していますが、幻覚との闘い、非効率的な推論、および限られた実世界の検証は、正確な知覚と堅牢な段階的な推論を妨げます。
これを克服するために、自律運転タスクの動的なエージェントスタイルのツールの呼び出しと初めて統合（COT）推論を初めて統合する先駆的な統一フレームワークであるAgentHinkを紹介します。
AgentThinkのコアイノベーションには、次のものが含まれます。（i）構造化されたデータ生成が含まれます。自動運転ツールライブラリを確立して、多様な運転シナリオのためのツール使用を明示的に組み込んだ構造化された自己検証の推論データを自動的に構築すること。
（ii）自律的なツールの呼び出しの機能をVLMに装備するために、グループ相対ポリシー最適化（GRPO）を備えた監視付き微調整（SFT）を使用した2段階のトレーニングパイプライン。
（iii）エージェントスタイルのツール使用評価。モデルのツールの呼び出しと利用を厳密に評価するための新しいマルチツール評価プロトコルを導入します。
DrivelMM-O1ベンチマークでの実験により、AgentHinkが全体的な推論スコアを53.91％増加させ、回答の精度を33.54％増加させ、推論の質と一貫性を著しく改善します。
さらに、さまざまなベンチマークにわたるアブレーション研究と堅牢なゼロショット/少数のショット一般化実験は、その強力な機能を強調しています。
これらの調査結果は、信頼できるツールを意識する自律運転モデルを開発するための有望な軌跡を強調しています。

要約(オリジナル)

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

arxiv情報

著者	Kangan Qian,Sicong Jiang,Yang Zhong,Ziang Luo,Zilin Huang,Tianze Zhu,Kun Jiang,Mengmeng Yang,Zheng Fu,Jinyu Miao,Yining Shi,He Zhe Lim,Li Liu,Tianbao Zhou,Huang Yu,Yifei Hu,Guang Li,Guang Chen,Hao Ye,Lijun Sun,Diange Yang
発行日	2025-06-12 06:27:06+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー