Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

Summary

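Learning a generalizable bimanual manipulation policy is very hard for embodied agents because the action space is large and the two arms must move in coordination.
Existing approaches rely on Vision-Language-Action (VLA) models to obtain bimanual policies.
However, knowledge transferred from single-arm datasets or pre-trained VLA models often fails to generalize, mainly because bimanual data is scarce and single-arm and bimanual manipulation differ fundamentally.
This paper proposes a new bimanual foundation policy: leading text-to-video models are fine-tuned to predict robot trajectories, and a lightweight diffusion policy is trained to generate actions.
Because text-to-video models lack embodied knowledge, the authors introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models, both derived from a pre-trained text-to-video model.
Specifically, optical flow serves as an intermediate variable that gives a concise representation of subtle motion between images.
The text-to-flow model predicts optical flow that makes the intent of the language instruction concrete, and the flow-to-video model uses this flow for fine-grained video prediction.
The method mitigates the language ambiguity of single-stage text-to-video prediction and, by avoiding direct use of low-level actions, significantly reduces the amount of robot data required.
For the experiments, the authors collect high-quality manipulation data on a real dual-arm robot, and results in both simulation and the real world demonstrate the method's effectiveness.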
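
To make the role of optical flow as an intermediate variable more concrete, here is a minimal toy sketch of what a flow field encodes: one displacement vector per pixel that says where that pixel moves in the next frame. This is not the authors' method; the paper's text-to-flow and flow-to-video stages are learned generative models. The names `Frame`, `Flow`, and `warp_forward`, the integer-valued flow, and the rule that background pixels are skipped are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical toy types (not from the paper):
Frame = List[List[int]]             # grayscale image, indexed frame[y][x]
Flow = List[List[Tuple[int, int]]]  # per-pixel integer displacement (dx, dy)


def warp_forward(frame: Frame, flow: Flow, background: int = 0) -> Frame:
    """Move every non-background pixel by its flow vector.

    Pixels that would land outside the image are dropped, and background
    pixels are skipped so they cannot overwrite a pixel that moved.
    """
    height, width = len(frame), len(frame[0])
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            value = frame[y][x]
            if value == background:
                continue
            dx, dy = flow[y][x]
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                out[ny][nx] = value
    return out


# A 3x4 frame with one bright pixel at (x=1, y=1).
frame = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
]

# Zero flow everywhere except the bright pixel, which moves one step right.
flow = [[(0, 0)] * 4 for _ in range(3)]
flow[1][1] = (1, 0)

next_frame = warp_forward(frame, flow)
# next_frame == [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]]
```

The flow field here describes the motion with a single non-zero vector, while the frames themselves need every pixel. That compactness is the property the summary points to when it calls optical flow a concise representation of motion between images.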

Abstract (original)

Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.

arXiv information

Authors	Chenyou Fan, Fangzheng Yan, Chenjia Bai, Jiepeng Wang, Chi Zhang, Zhen Wang, Xuelong Li
Published	2025-05-30 03:01:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
