RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

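Summary

As robotics advances toward more complex multimodal interaction and manipulation tasks, the integration of advanced vision-language models (VLMs) has become a key driver of the field.
Despite progress in current methods, fusing depth and RGB information in 3D environments and executing tasks guided by language instructions remain challenging.
To address these challenges, we extend the existing RoboFlamingo framework and introduce RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance.
Our work fuses RGB and depth information by combining a pre-trained Vision Transformer (ViT) with a resampling technique, and closely aligns the fused representation with language cues for stronger multimodal understanding.
The novelty of RoboFlamingo-Plus lies in adapting the inputs for depth data processing, using a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for feature integration.
With these improvements, RoboFlamingo-Plus not only gains a deeper understanding of 3D environments but also performs complex, language-guided tasks in challenging settings.
Experimental results show that RoboFlamingo-Plus improves robotic manipulation performance by 10-20% over current methods.
Code and model weights are publicly available at RoboFlamingo-Plus.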
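
The abstract names three ingredients: depth input adaptation, a resampler for each modality, and cross-attention for fusion. The sketch below shows one way they could fit together. It is a minimal NumPy illustration, not the authors' implementation. The function names (`depth_to_three_channels`, `cross_attention`, `fuse_rgb_depth`), the single-head attention, the random weights, the choice to replicate depth across three channels, and the residual fusion step are all assumptions made for illustration. The resampler is modeled as Perceiver-style latent queries attending over patch tokens, and random arrays stand in for ViT outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_to_three_channels(depth):
    """Assumed input adaptation: scale an H x W depth map to [0, 1]
    and repeat it across 3 channels so an RGB-style encoder can read it."""
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    d = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    return np.repeat(d[..., None], 3, axis=-1)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over context rows."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


def fuse_rgb_depth(rgb_tokens, depth_tokens, latents, params):
    """Resample each modality to a fixed number of latents, then let the
    RGB latents query the depth latents (hypothetical fusion layout)."""
    rgb_latents, _ = cross_attention(latents, rgb_tokens, *params["rgb"])
    depth_latents, _ = cross_attention(latents, depth_tokens, *params["depth"])
    fused, weights = cross_attention(rgb_latents, depth_latents, *params["fuse"])
    # Residual connection keeps the RGB features and adds depth information.
    return rgb_latents + fused, weights


rng = np.random.default_rng(0)
dim, n_patches, n_latents = 16, 49, 8


def make_params():
    # Random projection matrices (W_q, W_k, W_v); a real model would learn these.
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]


params = {name: make_params() for name in ("rgb", "depth", "fuse")}
rgb_tokens = rng.normal(size=(n_patches, dim))    # stand-in for ViT RGB patch tokens
depth_tokens = rng.normal(size=(n_patches, dim))  # stand-in for ViT depth patch tokens
latents = rng.normal(size=(n_latents, dim))       # stand-in for learned resampler queries

fused, weights = fuse_rgb_depth(rgb_tokens, depth_tokens, latents, params)
print(fused.shape)  # (8, 16)
```

In this layout the resampler compresses a variable number of patch tokens into a fixed set of latents, so the fused output has the same shape regardless of image resolution. In a Flamingo-style model, tokens of this kind would then be consumed by the language model's own cross-attention layers.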

Summary (original)

As robotic technologies advance towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and executing tasks guided by linguistic instructions. In response to these challenges, we have enhanced the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our research achieves a nuanced fusion of RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning this combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, leveraging a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for optimal feature integration. These improvements allow RoboFlamingo-Plus to not only deeply understand 3D environments but also easily perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus boosts robotic manipulation by 10-20% over current methods, marking a significant advancement. Code and model weights are publicly available at RoboFlamingo-Plus.

arXiv information

Author	Sheng Wang
Publication date	2025-03-25 10:01:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
