MEIA: Towards Realistic Multimodal Interaction and Manipulation for Embodied Robots

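Summary

With the rapid growth in large language model development, embodied intelligence is attracting increasing attention.
Nevertheless, prior work on embodied intelligence has typically encoded scenes or historical memory in a single modality, either visual or linguistic, which complicates aligning the model's action planning with embodied control.
To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which can translate high-level tasks expressed in natural language into a sequence of executable actions.
Specifically, we propose a novel Multimodal Environment Memory (MEM) module that facilitates the integration of embodied control with large models through a visual-language memory of the scene.
This capability lets MEIA generate executable action plans based on diverse requirements and the robot's capabilities.
Furthermore, with the help of a large language model, we construct an embodied question answering dataset based on a dynamic virtual cafe environment.
In this virtual environment, we run several experiments using multiple large models through zero-shot learning, and carefully design scenarios for a variety of situations.
The experimental results show promising performance for MEIA across a range of embodied interactive tasks.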

Abstract (original)

With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in a unimodal manner, either visual or linguistic, which complicates the alignment of the model’s action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module, facilitating the integration of embodied control with large models through the visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot’s capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of the large language model. In this virtual environment, we conduct several experiments, utilizing multiple large models through zero-shot learning, and carefully design scenarios for various situations. The experimental results showcase the promising performance of our MEIA in various embodied interactive tasks.

arXiv information

Authors	Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
Published	2024-04-26 13:13:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
