VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Summary

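This work investigates whether a deep generative model can learn complex knowledge from visual input alone, in contrast to the prevailing focus on text-based models such as large language models (LLMs).
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities on video-based Go and robotic control tasks.
Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition.
To improve both the efficiency and the efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld.
Remarkably, VideoWorld reaches a 5-dan professional level on Video-GoBench with a model of only 300 million parameters, without relying on the search algorithms or reward mechanisms typical of reinforcement learning.
In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models on CALVIN and RLBench.
This study opens new avenues for knowledge acquisition from visual data, and all code, data, and models are open-sourced for further research.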

Abstract (original)

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.

arXiv information

Authors	Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Published	2025-03-05 14:44:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
