iVideoGPT: Interactive VideoGPTs are Scalable World Models

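Summary

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity makes it difficult to harness recent advances in video generative models to develop world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, so that agents can interact with the model through next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Its scalable architecture makes it possible to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks. On these tasks, which include action-conditioned video prediction, visual planning, and model-based reinforcement learning, iVideoGPT achieves performance competitive with state-of-the-art methods. The work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.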

Summary (original)

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals–visual observations, actions, and rewards–into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.

arXiv information

Authors	Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
Published	2024-06-02 09:44:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
