HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

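Abstract

How can we predict the future interaction trajectories of human hands in a scene, given a high-level, colloquial task specification in natural language?
In this paper, we extend the classic hand trajectory prediction task to two tasks that involve explicit or implicit language queries.
The proposed tasks require a broad understanding of everyday human activities and the ability to reason about what should happen next given cues from the current scene.
We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP).
We make these tasks solvable by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level egocentric hand trajectories.
Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversation.
Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and that it can effectively use world knowledge to reason about low-level human hand trajectories based on the provided context.
Our website contains code and detailed video results: https://www.chenbao.tech/handsonvlm/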

Abstract (original)

How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to two tasks involving explicit or implicit language queries. Our proposed tasks require extensive understanding of human daily activities and reasoning abilities about what should be happening next given cues from the current scene. We also develop new benchmarks to evaluate the proposed two tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We enable solving these tasks by integrating high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on proposed tasks, and demonstrates its ability to effectively utilize world knowledge for reasoning about low-level human hand trajectories based on the provided context. Our website contains code and detailed video results https://www.chenbao.tech/handsonvlm/

arXiv information

Authors	Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj
Published	2024-12-18 15:19:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
