Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Abstract

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that the resulting visual representations can enhance training for both reinforcement learning and imitation learning, highlighting the effectiveness of our integrated approach to semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions, although still out-of-domain, can significantly improve performance due to their close alignment with robotic tasks. These findings show the capability to reduce reliance on annotated or robot-specific action datasets and the potential to build on existing visual encoders to accelerate training and improve generalizability.

arXiv information

Authors	Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas
Published	2025-05-27 09:56:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
