Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

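Summary

Human-robot collaboration requires robots to carry out actions for new tasks by following human instructions, given only finite prior knowledge.
Human experts can share their knowledge of how to perform a task with a robot through the multi-modal instructions in their demonstrations, which show a sequence of short-horizon steps toward a long-horizon goal.
This paper presents a method for generating robot action sequences from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions called dynamic movement primitives (DMPs), and (2) style-transfer-based training that combines multi-task learning with video captioning and weakly supervised learning with a semantic classifier to exploit unpaired video-action data.
The authors built a system in which an arm robot performs various cooking actions by executing a DMP sequence acquired from a cooking video with the audio-visual Transformer.
Experiments with the Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences by 2.3 times in METEOR score over a baseline video-to-action Transformer.
The model achieved a task success rate of 32% when given task knowledge of the object.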

Abstract (original)

To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, where an arm robot executes a DMP sequence acquired from a cooking video using the audio-visual Transformer. Experiments with Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences by 2.3 times the METEOR score obtained with a baseline video-to-action Transformer. The model achieved 32% of the task success rate with the task knowledge of the object.
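As background on the term "dynamic movement primitive", the sketch below integrates the textbook one-dimensional discrete DMP: a critically damped spring toward a goal plus a phase-gated forcing term. It is illustrative only and is not the authors' implementation; the abstract does not describe how DMPs are parameterized. The function name `rollout_dmp`, the gain values, and the basis-function layout are all assumptions chosen for the example.

```python
import math


def rollout_dmp(y0, goal, weights=None, tau=1.0, dt=0.001, n_steps=1000,
                alpha=25.0, beta=6.25, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps.

    All parameter names and default gains are assumptions for this sketch.
    beta = alpha / 4 makes the spring-damper part critically damped.
    """
    weights = list(weights) if weights else []
    n = len(weights)
    # Basis-function centers placed along the decaying phase variable x.
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    # Narrower basis functions where the centers are packed more densely.
    widths = [(n ** 2) / (c ** 2) for c in centers]

    x, y, z = 1.0, y0, 0.0  # phase, position, scaled velocity
    trajectory = [y]
    for _ in range(n_steps):
        # Forcing term: normalized weighted sum of Gaussians, gated by x
        # so that it vanishes as the phase decays to zero.
        forcing = 0.0
        if n:
            psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
            weighted = sum(p * w for p, w in zip(psi, weights))
            forcing = weighted / (sum(psi) + 1e-10) * x * (goal - y0)
        # Transformation system (spring-damper toward the goal plus forcing).
        dz = (alpha * (beta * (goal - y) - z) + forcing) / tau
        dy = z / tau
        # Canonical system (exponentially decaying phase).
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


if __name__ == "__main__":
    plain = rollout_dmp(0.0, 1.0)
    shaped = rollout_dmp(0.0, 1.0, weights=[50.0] * 5)
    print("final position without forcing:", plain[-1])
    print("position at step 200, plain vs shaped:", plain[200], shaped[200])
```

With no weights the trajectory simply settles at the goal; non-zero weights reshape the path on the way there, which is what lets a single primitive encode a distinct motion.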

arXiv information

Authors	Chiori Hori,Puyuan Peng,David Harwath,Xinyu Liu,Kei Ota,Siddarth Jain,Radu Corcodel,Devesh Jha,Diego Romeres,Jonathan Le Roux
Published	2023-06-27 17:37:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
