Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

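Summary

Human intention-based systems let robots perceive and interpret user actions so that they can interact with people and adapt proactively to their behavior.
Intention prediction is therefore pivotal to natural interaction with social robots in human-designed environments.
This paper examines the use of Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot.
It proposes a novel multimodal approach that, within a hierarchical architecture, integrates the user's non-verbal cues (hand gestures, body poses, and facial expressions) with environment states and the user's verbal cues to predict user intentions.
An evaluation of five LLMs shows their potential to reason about verbal and non-verbal user cues, drawing on context understanding and real-world knowledge to support intention prediction during collaboration on a task with a social robot.
Video: https://youtu.be/tBJHfAuzohI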
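
As a rough illustration of how verbal cues, non-verbal cues, and environment state might be merged into a single LLM query, here is a minimal Python sketch. It is not the authors' implementation: the `Observation` class, its field names, the two-level split, and the prompt wording are all assumptions made for this example, and no particular LLM API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical container for one interaction step (all names illustrative)."""
    gesture: str             # e.g. "pointing at the apple"
    body_pose: str           # e.g. "leaning forward"
    facial_expression: str   # e.g. "neutral"
    utterance: str           # transcribed user speech, may be empty
    environment: dict[str, str] = field(default_factory=dict)  # object -> location


def describe_nonverbal(obs: Observation) -> str:
    """Lower level (assumed): condense the non-verbal cues into one text line."""
    return (f"Gesture: {obs.gesture}; pose: {obs.body_pose}; "
            f"expression: {obs.facial_expression}")


def build_intention_prompt(obs: Observation) -> str:
    """Upper level (assumed): merge non-verbal summary, speech, and scene."""
    scene = ", ".join(
        f"{name} is {where}" for name, where in sorted(obs.environment.items())
    )
    speech = obs.utterance or "(no speech)"
    return "\n".join([
        "You assist a robot in a collaborative object categorization task.",
        f"Scene: {scene}",
        f"Non-verbal cues: {describe_nonverbal(obs)}",
        f"User said: {speech}",
        "Question: what does the user intend the robot to do next?",
    ])
```

The returned string could then be passed to whichever LLM is under evaluation. Keeping the non-verbal description in its own function is only one way to read "hierarchical" and is a design choice of this sketch.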

Summary (original)

Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI

arXiv information

Authors	Hassan Ali, Philipp Allgeuer, Stefan Wermter
Published	2025-04-08 10:48:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
