Feature Extractor or Decision Maker: Rethinking the Role of Visual Encoders in Visuomotor Policies

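Summary

End-to-end (E2E) visuomotor policies are usually treated as a unified whole, but recent approaches that pretrain the visual encoder on out-of-domain (OOD) data cleanly separate the encoder from the rest of the network, which is then called the policy.
We propose Visual Alignment Testing, an experimental framework designed to evaluate whether this functional separation is valid.
Our results show that in E2E-trained models the visual encoder actively contributes to decision-making, a capability that results from motor data supervision, which contradicts the assumed functional separation.
In contrast, OOD-pretrained models, whose encoders lack this capability, show an average performance drop of 42% in our benchmark results relative to the state-of-the-art performance achieved by E2E policies.
We believe this initial exploration of the visual encoder's role can be a first step toward guiding future pretraining methods that address decision-making ability, such as task-conditioned or context-aware encoders.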

Abstract (original)

An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches using out-of-domain (OOD) data to pretrain the visual encoder have cleanly separated the visual encoder from the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making resulting from motor data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, where encoders lack this capability, experience an average performance drop of 42% in our benchmark results, compared to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of visual encoders’ role can provide a first step towards guiding future pretraining methods to address their decision-making ability, such as developing task-conditioned or context-aware encoders.

arXiv information

Authors	Ruiyu Wang, Zheyu Zhuang, Shutong Jin, Nils Ingelhag, Danica Kragic, Florian T. Pokorny
Published	2025-05-14 11:40:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
