DM2RM: Dual-Mode Multimodal Ranking for Target Objects and Receptacles Based on Open-Vocabulary Instructions

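Abstract

This study aims to develop a domestic service robot (DSR) that can carry everyday objects to a specified piece of furniture by following open-vocabulary instructions.
Few existing methods handle mobile manipulation tasks with open-vocabulary instructions in an image retrieval setting, and most of them do not identify both the target object and the receptacle.
We propose the Dual-Mode Multimodal Ranking model (DM2RM), a single model built on multimodal foundation models that can retrieve images of both target objects and receptacles.
We introduce a switching mechanism that uses a mode token and phrase identification via a large language model to switch the embedding space according to the prediction target.
To evaluate DM2RM, we construct a new dataset containing real-world images collected from hundreds of building-scale environments and crowd-sourced instructions with referring expressions.
The evaluation results show that the proposed DM2RM outperforms previous approaches on standard metrics in image retrieval settings.
Furthermore, we demonstrate DM2RM on a standardized real-world DSR platform, including fetch-and-carry actions, where it achieves a task success rate of 82% despite the zero-shot transfer setting.
Demonstration videos, code, and other materials are available at https://kkrr10.github.io/dm2rm/.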

Abstract (original)

In this study, we aim to develop a domestic service robot (DSR) that, guided by open-vocabulary instructions, can carry everyday objects to the specified pieces of furniture. Few existing methods handle mobile manipulation tasks with open-vocabulary instructions in the image retrieval setting, and most do not identify both the target objects and the receptacles. We propose the Dual-Mode Multimodal Ranking model (DM2RM), which enables images of both the target objects and receptacles to be retrieved using a single model based on multimodal foundation models. We introduce a switching mechanism that leverages a mode token and phrase identification via a large language model to switch the embedding space based on the prediction target. To evaluate the DM2RM, we construct a novel dataset including real-world images collected from hundreds of building-scale environments and crowd-sourced instructions with referring expressions. The evaluation results show that the proposed DM2RM outperforms previous approaches in terms of standard metrics in image retrieval settings. Furthermore, we demonstrate the application of the DM2RM on a standardized real-world DSR platform including fetch-and-carry actions, where it achieves a task success rate of 82% despite the zero-shot transfer setting. Demonstration videos, code, and more materials are available at https://kkrr10.github.io/dm2rm/.
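
The switching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the instruction has already been split into a target-object phrase and a receptacle phrase (the paper uses a large language model for this step), and it replaces the multimodal foundation model embeddings with hand-written toy vectors. The names PHRASE_EMBEDDINGS, MODE_OFFSETS, IMAGE_EMBEDDINGS, and rank_images are hypothetical, and modeling the mode token as an additive offset is an assumption made only for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings of the two phrases identified in one instruction,
# e.g. "the red mug" (target) and "the kitchen table" (receptacle).
PHRASE_EMBEDDINGS = {
    "target": [1.0, 0.0, 0.2],
    "receptacle": [0.0, 1.0, 0.2],
}

# Assumption: the mode token is modeled as an additive offset that shifts
# the query toward the part of the embedding space for that mode.
MODE_OFFSETS = {
    "target": [0.5, 0.0, 0.0],
    "receptacle": [0.0, 0.5, 0.0],
}

# Hypothetical toy image embeddings for three candidate images.
IMAGE_EMBEDDINGS = {
    "mug.jpg": [0.9, 0.1, 0.1],
    "table.jpg": [0.1, 0.9, 0.1],
    "sofa.jpg": [0.2, 0.6, 0.7],
}


def rank_images(mode, images=IMAGE_EMBEDDINGS):
    """Rank candidate images for the given mode ("target" or "receptacle")."""
    phrase = PHRASE_EMBEDDINGS[mode]
    offset = MODE_OFFSETS[mode]
    # Build the mode-conditioned query vector.
    query = [p + m for p, m in zip(phrase, offset)]
    # Sort image names by descending cosine similarity to the query.
    return sorted(images, key=lambda name: cosine(query, images[name]), reverse=True)


if __name__ == "__main__":
    print(rank_images("target"))      # ['mug.jpg', 'sofa.jpg', 'table.jpg']
    print(rank_images("receptacle"))  # ['table.jpg', 'sofa.jpg', 'mug.jpg']
```

The same candidate pool is ranked twice, once per mode, so a single model serves both retrieval tasks; only the mode-conditioned query changes.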

arXiv information

Authors	Ryosuke Korekata, Kanta Kaneda, Shunya Nagashima, Yuto Imai, Komei Sugiura
Published	2024-08-15 03:34:02+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google