GestureGPT: Zero-shot Interactive Gesture Understanding and Grounding with Large Language Model Agents

Abstract

Current gesture recognition systems primarily focus on identifying gestures within a predefined set, leaving a gap in connecting these gestures to interactive GUI elements or system functions (e.g., linking a ‘thumb-up’ gesture to a ‘like’ button). We introduce GestureGPT, a novel zero-shot gesture understanding and grounding framework that leverages large language models (LLMs). Gesture descriptions are formulated from the hand landmark coordinates extracted from gesture videos and fed into our dual-agent dialogue system. A gesture agent deciphers these descriptions and queries the interaction context (e.g., interface, history, gaze data), which a context agent organizes and provides. After iterative exchanges, the gesture agent discerns the user's intent and grounds it to an interactive function. We validated the gesture description module using public first-view and third-view gesture datasets and tested the whole system in two real-world settings: video streaming and smart home IoT control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video streaming and 90.78% for smart home tasks, showing the potential of the new gesture understanding paradigm.
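
The abstract reports Top-5 grounding accuracy without spelling out how it is computed. As a rough illustration only, the sketch below uses the usual definition of top-k accuracy: a trial counts as correct when the intended function appears among the first k candidates. It assumes the system returns a ranked list of candidate functions per gesture, which the abstract does not confirm; the function names and trial data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truth: Sequence[str],
                   k: int = 5) -> float:
    """Fraction of trials whose true function is among the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one trial is required")
    hits = sum(1 for cands, target in zip(ranked, truth) if target in cands[:k])
    return hits / len(truth)


# Hypothetical trials: each inner list is one ranked candidate list
# (best guess first) for a single gesture in a video-streaming interface.
ranked_predictions = [
    ["like", "pause", "volume_up", "next_video", "share", "mute"],
    ["mute", "pause", "like", "share", "next_video", "volume_up"],
    ["pause", "like", "share", "mute", "next_video", "volume_up"],
    ["share", "like", "pause", "mute", "next_video", "volume_up"],
]
# Hypothetical intended function for each trial.
ground_truth = ["like", "volume_up", "pause", "pause"]

top1 = top_k_accuracy(ranked_predictions, ground_truth, k=1)
top5 = top_k_accuracy(ranked_predictions, ground_truth, k=5)
print(f"Top-1: {top1:.2%}, Top-5: {top5:.2%}")
```

In this toy data the second trial misses even at k = 5 because the intended function is ranked sixth, which is why Top-5 stays below 100%.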

arXiv information

Authors	Xin Zeng, Xiaoyu Wang, Tengxiang Zhang, Chun Yu, Shengdong Zhao, Yiqiang Chen
Published	2023-10-20 04:13:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
