Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

Summary

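This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation.
The system integrates a vision module that detects ingredient bottles and reads their labels, a speech-to-text module that interprets user commands, and a language model that generates task-specific robot instructions.
Force-torque (FT) sensors measure the quantity of liquid poured, so that ingredient proportions stay accurate during mixing.
The architecture also includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism for handling ingredient availability issues, and bimanual robotic arms for dexterous manipulation.
In experimental evaluation, the components achieved high success rates: the speech-to-text module reached 93% in noisy environments, the vision module reached 91% for object and label detection in cluttered environments, and the anomaly module identified 95% of discrepancies between detected ingredients and recipe requirements.
The overall system achieved a 100% success rate in preparing cocktails, from recipe formulation to action generation.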
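Two of the steps described above, measuring poured liquid from force readings and flagging ingredients that the vision module did not find, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how FT readings are converted to volume or how mismatches are detected. The function names `poured_volume_ml` and `missing_ingredients`, the assumption that the sensor reports the vertical force of a held bottle in newtons, and the default density of 1.0 g/ml are all hypothetical choices made for illustration.

```python
GRAVITY_M_S2 = 9.80665  # standard gravity, used to convert force to mass


def poured_volume_ml(force_before_n, force_after_n, density_g_per_ml=1.0):
    """Estimate poured volume from the drop in vertical force on a held bottle.

    Assumption: the force difference is due only to liquid leaving the bottle.
    """
    delta_force_n = force_before_n - force_after_n
    mass_g = delta_force_n / GRAVITY_M_S2 * 1000.0  # N -> kg -> g
    return mass_g / density_g_per_ml


def missing_ingredients(recipe, detected_labels):
    """Return recipe ingredients that have no matching detected label.

    Matching is a simple case-insensitive string comparison (an assumption).
    """
    detected = {label.strip().lower() for label in detected_labels}
    return sorted(name for name in recipe if name.strip().lower() not in detected)


if __name__ == "__main__":
    # Hypothetical recipe: ingredient name -> target volume in ml.
    recipe = {"Gin": 50, "Tonic": 100, "Lime juice": 10}
    print(missing_ingredients(recipe, ["gin", "TONIC "]))
    print(round(poured_volume_ml(9.80665, 4.903325), 1))
```

In a real system the density would differ per ingredient and the force signal would need filtering while the arm is moving; the sketch ignores both.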

Abstract (original)

This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force Torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components, with the speech-to-text module achieving a 93% success rate in noisy environments, the vision module attaining a 91% success rate in object and label detection in cluttered environment, the anomaly module successfully identified 95% of discrepancies between detected ingredients and recipe requirements, and the system achieved an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.

arXiv information

Authors	Muhamamd Haris Khan, Selamawit Asfaw, Dmitrii Iarchuk, Miguel Altamirano Cabrera, Luis Moreno, Issatay Tokmurziyev, Dzmitry Tsetserukou
Published	2025-01-12 20:07:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
