SATO: Stable Text-to-Motion Framework

Summary

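Are text-to-motion models robust?
Recent progress in text-to-motion models stems mainly from more accurate predictions of specific actions.
However, the text modality typically relies solely on a pre-trained Contrastive Language-Image Pretraining (CLIP) model.
Our research reveals a significant problem with text-to-motion models: their predictions are often inconsistent, yielding vastly different or even incorrect poses for semantically similar or identical text inputs.
In this paper, we analyze the underlying causes of this instability and establish a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module.
We therefore introduce a formal framework to address this issue, which we call the Stable Text-to-Motion Framework (SATO).
SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and balancing the trade-off between accuracy and robustness.
We present a methodology for constructing a SATO that satisfies both attention stability and prediction stability.
To verify model stability, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML.
Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy.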
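
The two stability notions named in the summary (attention and prediction) can be made concrete with a small sketch. The code below is not taken from the paper: the function names, the top-k overlap score, and the per-frame Euclidean distance are illustrative assumptions about how one might compare a prompt with its synonym-perturbed variant. It also assumes both prompts tokenize to the same length with aligned positions, and that motions are given as equal-length lists of pose vectors.

```python
import math
from typing import Sequence, Set


def top_k_indices(weights: Sequence[float], k: int) -> Set[int]:
    """Return the positions of the k largest attention weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k])


def top_k_overlap(attn_a: Sequence[float], attn_b: Sequence[float], k: int) -> float:
    """Fraction of top-k token positions shared by two attention vectors.

    Hypothetical stability score: 1.0 means both prompts attend to the
    same k positions, 0.0 means the top-k sets are disjoint.
    """
    if k <= 0 or k > min(len(attn_a), len(attn_b)):
        raise ValueError("k must be between 1 and the shorter vector length")
    shared = top_k_indices(attn_a, k) & top_k_indices(attn_b, k)
    return len(shared) / k


def mean_pose_distance(
    motion_a: Sequence[Sequence[float]], motion_b: Sequence[Sequence[float]]
) -> float:
    """Mean Euclidean distance between corresponding frames of two motions."""
    if len(motion_a) != len(motion_b) or not motion_a:
        raise ValueError("motions must be non-empty and have the same frame count")
    return sum(math.dist(a, b) for a, b in zip(motion_a, motion_b)) / len(motion_a)


# Made-up attention weights for an original prompt and a synonym-perturbed one.
attn_original = [0.05, 0.50, 0.30, 0.10, 0.05]
attn_perturbed = [0.40, 0.10, 0.35, 0.10, 0.05]
overlap = top_k_overlap(attn_original, attn_perturbed, k=2)  # 0.5

# Made-up two-frame motions with two-dimensional poses.
motion_original = [[0.0, 0.0], [1.0, 1.0]]
motion_perturbed = [[3.0, 4.0], [1.0, 1.0]]
distance = mean_pose_distance(motion_original, motion_perturbed)  # 2.5
```

A low overlap together with a large pose distance would indicate the kind of instability the summary describes; a stable model would be expected to keep the overlap high and the distance small under synonym substitution.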

Summary (original)

Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance.

arXiv information

Authors	Wenshuo Chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen
Published	2024-05-02 16:50:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
