TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

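Summary

We address key limitations of existing datasets and models for task-oriented hand-object interaction video generation, an important approach to producing video demonstrations for robotic imitation learning.
Current datasets, such as Ego4D, often suffer from inconsistent viewpoints and misaligned interactions, which lowers video quality and limits their applicability to precise imitation learning tasks.
To this end, we introduce TASTE-Rob, a pioneering large-scale dataset of 100,856 egocentric hand-object interaction videos.
Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity.
Fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob yields realistic object interactions, although occasional inconsistencies were observed in hand grasping postures.
To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in the generated videos.
The curated dataset, combined with the specialized pose-refinement framework, delivers notable performance gains in generating high-quality, task-oriented hand-object interaction videos, leading to superior generalizable robotic manipulation.
The TASTE-Rob dataset will be made publicly available upon publication to foster further progress in the field.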

Summary (original)

We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach of generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce TASTE-Rob — a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, resulting in achieving superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.

arXiv information

Authors	Hongxiang Zhao, Xingchen Liu, Mutian Xu, Yiming Hao, Weikai Chen, Xiaoguang Han
Published	2025-03-14 14:09:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
