VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation

Abstract

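Although vision-language models have advanced significantly, their application to language-conditioned robotic manipulation remains underexplored, especially for contact-rich tasks that go beyond visually dominant pick-and-place scenarios.
To bridge this gap, we introduce the Vision-Tactile-Language-Action (VTLA) model, a novel framework that enables robust policy generation in contact-intensive scenarios by effectively integrating visual and tactile inputs through cross-modal language grounding.
A low-cost multimodal dataset is constructed in a simulation environment, containing vision-tactile-action-instruction pairs designed specifically for the fingertip insertion task.
Furthermore, we introduce Direct Preference Optimization (DPO) to provide regression-like supervision for the VTLA model, effectively bridging the gap between the classification-based next-token prediction loss and continuous robotic tasks.
Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multimodal baselines (TLA/VLA), achieving success rates above 90% on unseen peg shapes.
Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model.
For supplementary videos and results, please visit the project website: https://sites.google.com/view/vtla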

Abstract (original)

While vision-language models have advanced significantly, their application in language-conditioned robotic manipulation is still underexplored, especially for contact-rich tasks that extend beyond visually dominant pick-and-place scenarios. To bridge this gap, we introduce Vision-Tactile-Language-Action model, a novel framework that enables robust policy generation in contact-intensive scenarios by effectively integrating visual and tactile inputs through cross-modal language grounding. A low-cost, multi-modal dataset has been constructed in a simulation environment, containing vision-tactile-action-instruction pairs specifically designed for the fingertip insertion task. Furthermore, we introduce Direct Preference Optimization (DPO) to offer regression-like supervision for the VTLA model, effectively bridging the gap between classification-based next token prediction loss and continuous robotic tasks. Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multi-modal baselines (TLA/VLA), achieving over 90% success rates on unseen peg shapes. Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model. For supplementary videos and results, please visit our project website: https://sites.google.com/view/vtla
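The abstract says DPO supplies regression-like supervision on top of a classification-style next-token loss. The sketch below shows the standard DPO loss for a single preference pair, written with the Python standard library only. How the authors actually build their preference pairs is not stated in the abstract; the helper `make_preference_pair`, which prefers the candidate action closest to a demonstrated action, is a hypothetical illustration, and the value `beta=0.1` is likewise an assumption.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    outputs under the trained policy and a frozen reference model.
    beta=0.1 is an assumed value, not taken from the paper.
    """
    # Implicit reward of each output: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


def make_preference_pair(candidates, expert_action):
    """Hypothetical helper (an assumption, not the paper's procedure):
    return (chosen, rejected) as the candidate actions nearest to and
    farthest from the demonstrated action in Euclidean distance.
    """
    def distance(action):
        return math.sqrt(sum((a - e) ** 2 for a, e in zip(action, expert_action)))

    ranked = sorted(candidates, key=distance)
    return ranked[0], ranked[-1]
```

Under this assumption, the distance-based ranking is what carries the continuous error signal: a token sequence that decodes to a nearly correct action is preferred over one that is far off, even though both are equally "wrong" to a plain cross-entropy loss.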

arXiv information

Authors	Chaofan Zhang, Peng Hao, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, Shuo Wang
Published	2025-05-14 17:29:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
