Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions

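Summary

Vision-Language-Action (VLA) models have recently become highly prominent in robotics.
By leveraging vision-language foundation models trained on large-scale internet data, a VLA model can generate robot actions directly from visual observations and human instructions through a single end-to-end neural network.
Despite their effectiveness, current VLA models typically accept only one form of human prompt, language instructions, which may constrain their applicability in open-ended human-robot interaction.
For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on a whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions.
To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions.
Extensive results show that OE-VLA not only achieves performance comparable to traditional VLA models with language input but also delivers impressive results across four additional categories of open-ended tasks.
The proposed methodology could significantly expand the applications of VLA models across a variety of everyday scenarios and facilitate human-robot interaction.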

Abstract (original)

Vision-Language-Action (VLA) models have recently become highly prominent in the field of robotics. Leveraging vision-language foundation models trained on large-scale internet data, the VLA model can generate robotic actions directly from visual observations and human instructions through a single end-to-end neural network. Despite their effectiveness, current VLA models usually accept only one form of human prompting, language instructions, which may constrain their applicability in open-ended human-robot interactions. For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on the whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions. To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions. Extensive results demonstrate that our OE-VLA not only achieves comparable performance to traditional VLA models with linguistic input but also delivers impressive results across four additional categories of open-ended tasks. The proposed methodology could significantly expand the applications of VLA models across various everyday scenarios and facilitate human-robot interaction.

arXiv information

Authors	Wei Zhao, Gongsheng Li, Zhefei Gong, Pengxiang Ding, Han Zhao, Donglin Wang
Published	2025-05-16 13:12:08+00:00

Source and services used

arxiv.jp, Google