Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?

Summary

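Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions.
Compared with language, images usually convey more detailed and less ambiguous information.
This study proposes Learning from the Void (LfVoid), a method that uses pre-trained text-to-image models and advanced image editing techniques to guide robot learning.
Given a natural language instruction, LfVoid edits the original observation to obtain a goal image, for example by "wiping" a stain off a table.
LfVoid then trains an ensembled goal discriminator on the generated images to provide reward signals that guide a reinforcement learning agent toward the goal.
LfVoid can learn with zero in-domain training on expert demonstrations or true goal observations (the void) because it draws on knowledge from web-scale generative models.
We evaluate LfVoid on three simulated tasks and validate its feasibility in the corresponding real-world scenarios.
We also offer insights into the key considerations for effectively integrating visual generative models into robot learning workflows.
We posit that our work is an initial step toward the broader application of pre-trained visual generative models in robotics.
Project page: https://lfvoid-rl.github.io/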

Summary (original)

Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information with more details and less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages the power of pre-trained text-to-image models and advanced image editing techniques to guide robot learning. Given natural language instructions, LfVoid can edit the original observations to obtain goal images, such as ‘wiping’ a stain off a table. Subsequently, LfVoid trains an ensembled goal discriminator on the generated image to provide reward signals for a reinforcement learning agent, guiding it to achieve the goal. The ability of LfVoid to learn with zero in-domain training on expert demonstrations or true goal observations (the void) is attributed to the utilization of knowledge from web-scale generative models. We evaluate LfVoid across three simulated tasks and validate its feasibility in the corresponding real-world scenarios. In addition, we offer insights into the key considerations for the effective integration of visual generative models into robot learning workflows. We posit that our work represents an initial step towards the broader application of pre-trained visual generative models in the robotics field. Our project page: https://lfvoid-rl.github.io/.

arXiv information

Authors	Jialu Gao, Kaizhe Hu, Guowei Xu, Huazhe Xu
Published	2023-07-15 16:03:08+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
