FLARE: Robot Learning with Implicit World Modeling

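Summary

We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation Alignment ($\textbf{FLARE}$), a framework that integrates predictive latent world modeling into robot policy learning.
By aligning features from a diffusion transformer with latent embeddings of future observations, $\textbf{FLARE}$ lets a diffusion transformer policy anticipate latent representations of future observations, so that it can reason about long-term consequences while generating actions.
$\textbf{FLARE}$ is remarkably lightweight: it requires only a minimal architectural change, adding a few tokens to a standard vision-language-action (VLA) model, yet it delivers substantial performance gains.
Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%.
Moreover, $\textbf{FLARE}$ unlocks co-training with human egocentric video demonstrations that have no action labels, significantly improving policy generalization to a novel object with unseen geometry from as few as one robot demonstration.
These results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.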

Summary (original)

We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation Alignment ($\textbf{FLARE}$), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, $\textbf{FLARE}$ enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, $\textbf{FLARE}$ requires only minimal architectural modifications — adding a few tokens to standard vision-language-action (VLA) models — yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, $\textbf{FLARE}$ unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as a single robot demonstration. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
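The alignment idea in the abstract can be illustrated with a small sketch. The abstract does not state which loss is used, so everything below is an assumption for illustration only: the alignment term is taken to be one minus the mean cosine similarity between the policy's features at the added tokens and the latent embeddings of future observations, and the names `alignment_loss`, `total_loss`, and the `weight` value are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(predicted_tokens, target_embeddings):
    """Assumed alignment term: 1 - mean cosine similarity over paired vectors.

    predicted_tokens: policy features at the added tokens (hypothetical name).
    target_embeddings: latent embeddings of future observations.
    """
    if len(predicted_tokens) != len(target_embeddings):
        raise ValueError("token and target counts must match")
    sims = [
        cosine_similarity(p, t)
        for p, t in zip(predicted_tokens, target_embeddings)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(action_loss, predicted_tokens, target_embeddings, weight=0.2):
    """Action objective plus weighted alignment term (weight is hypothetical)."""
    return action_loss + weight * alignment_loss(predicted_tokens, target_embeddings)
```

In this sketch the alignment term acts only as an auxiliary objective on top of the action loss. Under the same assumption, a video without action labels could still contribute through the alignment term alone, which is one way to read the co-training claim in the abstract.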

arXiv information

Authors	Ruijie Zheng,Jing Wang,Scott Reed,Johan Bjorck,Yu Fang,Fengyuan Hu,Joel Jang,Kaushil Kundalia,Zongyu Lin,Loic Magne,Avnish Narayan,You Liang Tan,Guanzhi Wang,Qi Wang,Jiannan Xiang,Yinzhen Xu,Seonghyeon Ye,Jan Kautz,Furong Huang,Yuke Zhu,Linxi Fan
Published	2025-05-21 15:33:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
