VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Abstract

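Vision-language-action models (VLAs) have become increasingly popular in robot manipulation thanks to their end-to-end design and strong performance.
However, existing VLAs rely heavily on vision-language models (VLMs) that support only text-based instructions, neglecting speech, a more natural modality for human-robot interaction.
Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation.
Moreover, transcription discards non-semantic information in the raw speech, such as the voiceprint, which may be crucial for robots to complete customized tasks successfully.
To overcome these challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model.
VLAS allows the robot to understand spoken commands through inner speech-text alignment and to produce the corresponding actions to fulfill the task.
We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which gives VLAS the ability to interact multimodally across text, image, speech, and robot actions.
Going a step further, we design a voice retrieval-augmented generation (RAG) paradigm that enables the model to effectively handle tasks requiring individual-specific knowledge.
Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.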

Abstract (original)

Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involves a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.

arXiv information

Authors	Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
Published	2025-02-19 07:53:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
