ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Summary

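Tactile sensing provides essential local information, such as texture, compliance, and force, that complements visual perception.
Despite recent progress in visuotactile representation learning, it remains challenging to fuse these modalities and to generalize across tasks and environments without relying heavily on pre-trained vision-language models.
Moreover, existing methods do not study positional encodings, and so overlook the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations.
We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile inputs to learn task-agnostic representations for visuotactile perception.
Our approach uses a novel multi-scale positional encoding scheme to capture intra-modal structure while also modeling cross-modal cues.
Unlike prior work, we provide provable guarantees for visuotactile fusion: we show that our encodings are injective, rigid-motion-equivariant, and information-preserving, and we validate these properties empirically.
Experiments on multiple large-scale real-world datasets show that ViTaPEs not only outperforms state-of-the-art baselines across a range of recognition tasks but also generalizes zero-shot to unseen, out-of-domain scenarios.
We further demonstrate the transfer-learning strength of ViTaPEs on a robotic grasping task, where it outperforms state-of-the-art baselines at predicting grasp success.
Project page: https://sites.google.com/view/vitapes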

Abstract (original)

Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures, while simultaneously modeling cross-modal cues. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, rigid-motion-equivariant, and information-preserving, validating these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes

arXiv information

Authors	Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
Published	2025-05-26 14:19:29+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
