Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

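Abstract

This study explores the potential of Multimodal Large Language Models (MLLMs) to improve embodied decision-making in agents. Large Language Models (LLMs) have been widely used for their advanced reasoning ability and vast world knowledge, while MLLMs such as GPT4-Vision add stronger visual understanding and reasoning. We investigate whether state-of-the-art MLLMs can handle embodied decision-making end to end, and whether collaboration between LLMs and MLLMs can improve decisions. To answer these questions, we introduce a new benchmark, PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. We also propose HOLMES, a multi-agent cooperation framework that lets LLMs use MLLMs and APIs to gather multimodal information for informed decision-making. Comparing end-to-end embodied decision-making with HOLMES on our benchmark, we find that the GPT4-Vision model shows strong end-to-end embodied decision-making ability and outperforms GPT4-HOLMES in average decision accuracy (+3%). However, this level of performance is exclusive to the latest GPT4-Vision model, which surpasses the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs such as GPT4-Vision are promising for decision-making in embodied agents and open new avenues for MLLM research.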

Abstract (original)

In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research.

arXiv information

Authors	Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Tianyu Liu, Baobao Chang
Published	2023-10-03 14:13:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
