TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication

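Summary

As human-robot collaboration advances, effective robot control requires natural and flexible communication methods.
Traditional methods that rely on a single modality or on rigid rules struggle with noisy or misaligned data, and with object descriptions that do not exactly match the predefined object names (e.g., "Pick that red object").
We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation from fused voice and gesture inputs.
Our approach merges the multimodal data into a single unified sentence, which is then processed by the language model.
We employ probabilistic embeddings to handle uncertainty and integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects, or vague verbal cues such as "this").
We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information.
Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios that require more contextual knowledge, enabling more robust and flexible human-robot communication.
Code and datasets are available at http://imitrob.ciirc.cvut.cz/publications/transformerger.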

Summary (original)

As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data as well as with object descriptions that do not perfectly fit the predefined object names (e.g. ‘Pick that red object’). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like ‘this’). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.
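
To make the idea of merging multimodal data into a single unified sentence more concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the `Token` class, the `merge_to_sentence` function, the `min_prob` threshold, the bracketed annotation format, and all sample values are assumptions introduced only for this example. It interleaves recognised words and pointing-gesture candidates by timestamp and keeps their confidences in the text, so that a downstream language model could in principle weigh uncertain inputs.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One observed input (hypothetical structure, not from the paper)."""
    time: float      # seconds since the start of the interaction
    text: str        # recognised word, or name of a pointed-at object
    prob: float      # recogniser confidence in [0, 1]
    modality: str    # "voice" or "gesture"


def merge_to_sentence(voice, gesture, min_prob=0.2):
    """Interleave voice and gesture tokens by time into one annotated sentence.

    Tokens with confidence below `min_prob` are dropped. The remaining
    tokens keep their confidence in the output text.
    """
    kept = [t for t in voice + gesture if t.prob >= min_prob]
    # sorted() is stable, so tokens with equal timestamps keep input order.
    kept = sorted(kept, key=lambda t: t.time)
    parts = []
    for t in kept:
        if t.modality == "gesture":
            parts.append(f"[points at {t.text} p={t.prob:.2f}]")
        else:
            parts.append(f"{t.text}({t.prob:.2f})")
    return " ".join(parts)


# Made-up sample inputs: a spoken command and an ambiguous pointing gesture.
voice = [
    Token(0.0, "pick", 0.95, "voice"),
    Token(0.4, "that", 0.90, "voice"),
    Token(0.7, "red", 0.60, "voice"),
    Token(1.0, "object", 0.85, "voice"),
]
gesture = [
    Token(0.5, "red_cup", 0.70, "gesture"),
    Token(0.5, "red_box", 0.25, "gesture"),
    Token(0.5, "blue_cup", 0.05, "gesture"),
]

sentence = merge_to_sentence(voice, gesture)
print(sentence)
```

In this toy input the gesture is ambiguous between two red objects, and both candidates survive in the merged sentence with their confidences, while the unlikely `blue_cup` candidate is filtered out. Resolving the remaining ambiguity against the scene is the part the abstract attributes to the transformer model, which this sketch does not attempt.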

arXiv information

Authors	Petr Vanc, Karla Stepanova
Published	2025-04-02 13:15:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

