Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

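Summary

This study explores the potential of Multimodal Large Language Models (MLLMs) to improve the embodied decision-making processes of agents.
Large Language Models (LLMs) are widely used for their advanced reasoning skills and vast world knowledge, whereas MLLMs such as GPT4-Vision offer enhanced visual understanding and reasoning capabilities.
We investigate whether state-of-the-art MLLMs can handle embodied decision-making end to end, and whether collaboration between LLMs and MLLMs can enhance decision-making.
To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action.
In addition, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making.
Comparing end-to-end embodied decision-making with HOLMES on our benchmark, we find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in average decision accuracy (+3%).
However, this performance is exclusive to the latest GPT4-Vision model, which surpasses the open-source state-of-the-art MLLM by 26%.
Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents and open new avenues for MLLM research.
Code and data are available at https://github.com/pkunlp-icler/PCA-EVAL/.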

Summary (original)

In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research. Code and data are open at https://github.com/pkunlp-icler/PCA-EVAL/.

arXiv information

Authors	Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
Published	2023-11-28 11:23:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
