A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

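Summary

Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
Through systematic evaluation, we find that DINO and iBOT outperform MAE across visuomotor control and perception tasks, but struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations.
This investigation indicates that the ability to form object-centric representations from non-object-centric robotics datasets is the key to success for PVMs.
Motivated by this finding, we designed SlotMIM, a method that induces object-centric representations in two ways: a semantic bottleneck that reduces the number of prototypes to encourage the emergence of objectness, and a cross-view consistency regularization that encourages multiview invariance.
Our experiments cover pre-training on object-centric, scene-centric, web-crawled, and ego-centric data.
Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
When scaled up with million-scale datasets, the method also demonstrates superior data efficiency and scalability.
Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.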

Abstract (original)

Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
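
The two ingredients named in the abstract, a small set of prototypes and a cross-view consistency term, can be illustrated with a toy NumPy sketch. This is not the SlotMIM implementation (see the repository linked above for that). Every function name, the cosine-similarity assignment, the cross-entropy form of the consistency term, and all sizes and temperatures are assumptions made purely for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_assignments(patch_feats, prototypes, temp=0.1):
    """Soft-assign each patch feature to one of K prototypes.

    Assumption: cosine similarity followed by a temperature softmax.
    A small K acts as the "bottleneck" in this toy version.
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(prototypes).T  # (N, K)
    return softmax(sims / temp)

def cross_view_consistency(assign_a, assign_b, eps=1e-8):
    """Mean cross-entropy between assignments of corresponding patches.

    Assumption: patch i in view A corresponds to patch i in view B.
    """
    return float(-(assign_a * np.log(assign_b + eps)).sum(axis=1).mean())

# Hypothetical sizes: 16 patches, 32-dim features, only 8 prototypes.
rng = np.random.default_rng(0)
num_patches, dim, num_prototypes = 16, 32, 8
prototypes = rng.normal(size=(num_prototypes, dim))

view_a = rng.normal(size=(num_patches, dim))
view_b = view_a + 0.05 * rng.normal(size=(num_patches, dim))  # perturbed second view
unrelated = rng.normal(size=(num_patches, dim))               # features of another image

p_a = prototype_assignments(view_a, prototypes)
p_b = prototype_assignments(view_b, prototypes)
p_u = prototype_assignments(unrelated, prototypes)

loss_matched = cross_view_consistency(p_a, p_b)
loss_unrelated = cross_view_consistency(p_a, p_u)
print(f"matched views: {loss_matched:.3f}, unrelated features: {loss_unrelated:.3f}")
```

In this toy setting the consistency term is lower for two views of the same patches than for unrelated features, which is the behavior a multiview-invariance regularizer is meant to reward.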

arXiv information

Authors	Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Published	2025-03-10 06:18:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
