MotionGPT: Human Motion as a Foreign Language

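Summary

Although pre-trained large language models continue to advance, building a unified model for language and other multimodal data such as motion remains challenging and has so far been left unexplored.
Fortunately, human motion exhibits a semantic coupling similar to that of human language and is often regarded as a form of body language.
Fusing language data with large-scale motion models makes possible a motion-language pre-training that improves performance on motion-related tasks.
Based on this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model for handling multiple motion-related tasks.
Specifically, we apply discrete vector quantization to human motion and convert 3D motion into motion tokens, much as word tokens are generated.
Building on this "motion vocabulary", we treat human motion as a specific language and perform language modeling on both motion and text in a unified manner.
Furthermore, inspired by prompt learning, we pre-train MotionGPT on a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks.
Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-betweening.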

Summary (original)

Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this ‘motion vocabulary’, we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
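
The tokenization idea in the abstract, mapping continuous motion to discrete tokens that a language model can read alongside text, can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D codebook, the feature values, the function names, and the `<motion_id_k>` token format are all hypothetical, and a real system would learn the codebook (for example with a VQ-VAE) rather than hard-code it. The sketch only shows the nearest-codebook lookup at the heart of vector quantization.

```python
import math


def quantize_motion(frames, codebook):
    """Return, for each frame, the index of the nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook vector.
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def to_token_strings(tokens):
    """Render token indices as text-like symbols (hypothetical format)."""
    return [f"<motion_id_{t}>" for t in tokens]


# Hypothetical toy data: a 3-entry codebook and 4 motion feature vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 0.05)]

motion_tokens = quantize_motion(frames, codebook)

# Motion tokens and words can now share one sequence, like a prompt.
prompt = "Describe the motion: " + " ".join(to_token_strings(motion_tokens))

print(motion_tokens)
print(prompt)
```

Once motion is expressed as indices into a fixed vocabulary, the same next-token modeling used for text can in principle be applied to mixed motion-and-text sequences, which is the premise the abstract describes.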

arXiv information

Authors	Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, Tao Chen
Published	2023-06-26 15:53:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
