2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

Summary

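When interacting with objects, humans effectively reason about which regions of an object are viable for the intended action, i.e., the object's affordance regions.
They can also account for subtle differences in object regions depending on the task to be performed and on whether one or two hands are needed.
However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation.
In this work, we propose a framework for extracting affordance data from human activity video datasets.
The extracted 2HANDS dataset contains precise object affordance region segmentations, together with affordance class labels in the form of narrations of the activity performed.
The data also covers bimanual actions, i.e., two hands coordinating and interacting with one or more objects.
We present 2HandedAfforder, a VLM-based affordance prediction model trained on this dataset, and show that it outperforms baselines in affordance region segmentation across a variety of activities.
Finally, through demonstrations in robotic manipulation scenarios, we show that the predicted affordance regions are actionable, i.e., they can be used by an agent performing a task.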

Summary (original)

When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.
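The abstract does not describe the dataset format or the evaluation metrics. As a minimal sketch, assuming a sample pairs a narration-derived affordance label with an optional region mask per hand, the Python below shows one way such bimanual annotations could be represented and scored with intersection over union (IoU), a common region-segmentation metric. The class and function names and the per-hand scoring rule are hypothetical and are not taken from the paper or the 2HANDS release.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mask type: a 2D grid of booleans, True = affordance pixel.
Mask = List[List[bool]]


@dataclass
class BimanualAffordanceSample:
    """Illustrative record layout only; NOT the actual 2HANDS schema."""
    narration: str               # affordance class label taken from the activity narration
    left_mask: Optional[Mask]    # region for the left hand, None if the left hand is unused
    right_mask: Optional[Mask]   # region for the right hand, None if the right hand is unused

    def is_bimanual(self) -> bool:
        # Bimanual when both hands have an affordance region.
        return self.left_mask is not None and self.right_mask is not None


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p and g
            union += p or g
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def sample_iou(pred: BimanualAffordanceSample, gt: BimanualAffordanceSample) -> float:
    """Mean IoU over hands used by either prediction or ground truth (hypothetical rule)."""
    scores = []
    for p, g in ((pred.left_mask, gt.left_mask), (pred.right_mask, gt.right_mask)):
        if p is None and g is None:
            continue                     # hand unused in both: not scored
        if p is None or g is None:
            scores.append(0.0)           # missed or spurious hand region
        else:
            scores.append(mask_iou(p, g))
    return sum(scores) / len(scores) if scores else 1.0


if __name__ == "__main__":
    gt = [[False, True, True],
          [False, True, True]]
    pred = [[True, True, False],
            [False, True, False]]
    print(mask_iou(pred, gt))  # 2 shared pixels / 5 in the union = 0.4
    gt_s = BimanualAffordanceSample("open jar", left_mask=gt, right_mask=gt)
    pred_s = BimanualAffordanceSample("open jar", left_mask=pred, right_mask=None)
    print(gt_s.is_bimanual(), sample_iou(pred_s, gt_s))  # True 0.2
```

Scoring each hand separately, rather than merging both masks, keeps a prediction that finds the right object region but assigns it to only one hand from scoring as well as a correct two-handed prediction.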

arXiv information

Authors	Marvin Heidinger, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki
Published	2025-03-13 06:35:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
