Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

Summary

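The rapidly evolving field of robotics needs methods that can fuse multiple sensory modalities.
In particular, when a robot interacts with physical objects, combining visual and tactile data effectively is key to understanding and handling the complex dynamics of the physical world, and it allows more nuanced, adaptive responses to changing environments.
Nevertheless, much of the early work on integrating these two modalities relied on supervised methods that use human-labeled datasets. This paper introduces MViTac, a new methodology that uses contrastive learning to integrate vision and touch in a self-supervised manner.
Using both sensory inputs, MViTac learns representations with intra-modality and inter-modality losses, which improves material property classification and grasp prediction.
Through a series of experiments, we demonstrate the effectiveness of our method and its advantage over existing state-of-the-art self-supervised and supervised methods.
We evaluate the methodology on two distinct tasks: material classification and grasp success prediction.
Our results show that, as evidenced by linear probing evaluation, MViTac yields improved modality encoders and more robust representations.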

Abstract (original)

The rapidly evolving field of robotics necessitates methods that can facilitate the fusion of multiple modalities. Specifically, when it comes to interacting with tangible objects, effectively combining visual and tactile sensory data is key to understanding and navigating the complex dynamics of the physical world, enabling a more nuanced and adaptable response to changing environments. Nevertheless, much of the earlier work in merging these two sensory modalities has relied on supervised methods utilizing datasets labeled by humans. This paper introduces MViTac, a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion. By availing both sensory inputs, MViTac leverages intra- and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction. Through a series of experiments, we showcase the effectiveness of our method and its superiority over existing state-of-the-art self-supervised and supervised techniques. In evaluating our methodology, we focus on two distinct tasks: material classification and grasping success prediction. Our results indicate that MViTac facilitates the development of improved modality encoders, yielding more robust representations as evidenced by linear probing assessments.
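
The abstract mentions intra- and inter-modality contrastive losses but does not give their form. The following is a minimal sketch of how such losses might be combined, not the authors' implementation. It assumes an InfoNCE-style objective, an unweighted sum of the terms, and a temperature of 0.1; the function names (`info_nce`, `combined_loss`) and the inputs (`vis_a`, `vis_b`, `tac_a`, `tac_b`, standing for embeddings of two augmented views per modality) are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] form the positive pair; every other key in the
    batch acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def combined_loss(vis_a, vis_b, tac_a, tac_b, temperature=0.1):
    """Assumed combination: unweighted sum of intra- and inter-modality terms."""
    # Intra-modality: two views of the same modality should agree.
    intra = info_nce(vis_a, vis_b, temperature) + info_nce(tac_a, tac_b, temperature)
    # Inter-modality: vision and touch from the same sample should agree,
    # computed in both directions.
    inter = info_nce(vis_a, tac_a, temperature) + info_nce(tac_a, vis_a, temperature)
    return intra + inter
```

In this sketch, matched visual and tactile embeddings give a loss close to zero, while mismatched pairings are penalized. A real pipeline would obtain the embeddings from trainable encoders and backpropagate through the loss with an autodiff framework.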

arXiv information

Authors	Vedant Dave, Fotios Lygerakis, Elmar Rueckert
Published	2024-01-22 15:11:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
