SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories

Abstract

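When humans grasp an object, they naturally form a trajectory in their minds to manipulate it for a specific task.
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively in the physical world.
We introduce SIGHT, a new task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image and a brief language-based task description.
Prior work on hand-object trajectory generation typically relies on textual input that lacks explicit grounding to the target object, or assumes access to 3D object meshes.
We propose SIGHT-Fusion, a new diffusion-based, image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database and enforcing geometric hand-object interaction constraints through a novel inference-time diffusion guidance.
We benchmark our model on the HOI4D and H2O datasets, adapting relevant baselines to this new task.
Experiments show superior performance in the diversity and quality of the generated trajectories, as well as on hand-object interaction geometry metrics.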

Abstract (original)

When humans grasp an object, they naturally form trajectories in their minds to manipulate it for specific tasks. Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively within the physical world. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image and a brief language-based task description. Prior work on hand-object trajectory generation typically relies on textual input that lacks explicit grounding to the target object, or assumes access to 3D object meshes, which are often considerably more difficult to obtain than 2D images. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database and enforcing geometric hand-object interaction constraints via a novel inference-time diffusion guidance. We benchmark our model on the HOI4D and H2O datasets, adapting relevant baselines for this novel task. Experiments demonstrate our superior performance in the diversity and quality of generated trajectories, as well as in hand-object interaction geometry metrics.

arXiv information

Authors	Alexey Gavryushin, Alexandros Delitzas, Luc Van Gool, Marc Pollefeys, Kaichun Mo, Xi Wang
Published	2025-05-29 17:11:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
