Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

Summary

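Assistive teleoperation, in which control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse, unstructured environments.
A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intents from the user's control inputs and to assist the user with the correct actions.
Existing methods are either limited to simple, predefined scenarios or restricted to the task-specific data distributions seen at training time, which limits their support for real-world assistance.
We introduce Casper, an assistive teleoperation system that leverages the commonsense knowledge embedded in pre-trained vision language models (VLMs) for real-time intent inference and flexible skill execution.
Casper incorporates an open-world perception module for generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that uses commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that extends the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks.
Extensive empirical evaluation, including human studies and system ablations, shows that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.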

Abstract (original)

Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.

arXiv information

Authors	Huihan Liu, Rutav Shah, Shuijing Liu, Jack Pittenger, Mingyo Seo, Yuchen Cui, Yonatan Bisk, Roberto Martín-Martín, Yuke Zhu
Published	2025-06-17 17:06:43+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
