Video Language Planning

Summary

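We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, building on recent advances in large generative models pretrained on Internet-scale data.
To this end, we propose video language planning (VLP), an algorithm built around a tree search procedure in which we (i) train vision-language models to serve as both policies and value functions, and (ii) use text-to-video models as dynamics models.
VLP takes a long-horizon task instruction and the current image observation as input, and outputs a long video plan that gives a detailed multimodal (video and language) specification of how to complete the final task.
VLP scales with the computation budget: more computation time yields better video plans. It can synthesize long-horizon video plans across a range of robotics domains, from multi-object rearrangement to multi-camera bi-arm dexterous manipulation.
The generated video plans can be translated into real robot actions through goal-conditioned policies that are conditioned on each intermediate frame of the generated video.
Experiments show that VLP substantially improves long-horizon task success rates over prior methods on both simulated and real robots (across 3 hardware platforms).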
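
The tree search described above can be illustrated with a small, runnable toy. This is only a sketch under stated assumptions, not the authors' implementation: the abstract says only that VLP is a tree search, so the beam-style pruning, the function name plan_search, the Plan class, and the parameters horizon, branch, and beam are all hypothetical. Integer states stand in for image observations, and three toy functions stand in for the vision-language policy, the text-to-video dynamics model, and the vision-language value function.

```python
from dataclasses import dataclass
from typing import Callable, List

State = int  # toy stand-in for an image observation (assumption)


@dataclass
class Plan:
    states: List[State]  # stand-in for the frames of the video plan
    steps: List[str]     # language sub-goals chosen along the way
    value: float = 0.0   # value-function score of the last state


def plan_search(
    start: State,
    instruction: str,
    policy: Callable[[State, str, int], List[str]],
    dynamics: Callable[[State, str], State],
    value_fn: Callable[[State, str], float],
    horizon: int = 3,
    branch: int = 3,
    beam: int = 2,
) -> Plan:
    """Beam-pruned tree search (hypothetical simplification of VLP)."""
    beams = [Plan([start], [])]
    for _ in range(horizon):
        candidates = []
        for plan in beams:
            current = plan.states[-1]
            # Policy role: propose language sub-goals for the current state.
            for sub_goal in policy(current, instruction, branch):
                # Dynamics role: predict the outcome of following the sub-goal.
                nxt = dynamics(current, sub_goal)
                # Value role: score progress toward the overall instruction.
                score = value_fn(nxt, instruction)
                candidates.append(
                    Plan(plan.states + [nxt], plan.steps + [sub_goal], score)
                )
        if not candidates:
            break  # the policy proposed nothing; keep the current beams
        # Keep only the highest-valued branches for the next expansion.
        candidates.sort(key=lambda c: c.value, reverse=True)
        beams = candidates[:beam]
    return beams[0]


# Toy stand-ins for the learned models (all hypothetical).
def toy_policy(state: State, instruction: str, k: int) -> List[str]:
    return ["+1", "+2", "-1"][:k]


def toy_dynamics(state: State, sub_goal: str) -> State:
    return state + int(sub_goal)


def toy_value(state: State, instruction: str) -> float:
    goal = int(instruction.split()[-1])  # "reach 5" -> 5
    return -abs(goal - state)


best = plan_search(0, "reach 5", toy_policy, toy_dynamics, toy_value)
print(best.steps, best.states)
```

Raising branch, beam, or horizon expands more of the tree at a higher compute cost, which is one simple way to picture the claim that more computation time improves the plan. In the toy, the returned states play the role of the intermediate frames that a goal-conditioned policy would be conditioned on.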

Summary (original)

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).

arXiv information

Authors	Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson
Published	2023-10-16 17:48:45+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
