Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Models Using Implicit Feedback from Pre-training Demonstrations

Summary

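Recent advances in LLMs have revolutionized motion generation models in embodied applications.
LLM-style autoregressive motion generation models benefit from training scalability, but a discrepancy remains between their token-prediction objective and human preferences.
As a result, models pre-trained solely with a token-prediction objective often generate behaviors that deviate from what humans would prefer, which makes post-training preference alignment crucial for producing human-preferred motions.
Unfortunately, post-training alignment requires extensive preference rankings of motions generated by the pre-trained model, and these are costly to annotate, especially in multi-agent settings.
Recently, interest has grown in leveraging pre-training demonstrations to generate preference data for post-training alignment at scale.
However, these methods often adopt an adversarial assumption and treat all samples generated by the pre-trained model as unpreferred examples.
This adversarial approach overlooks the valuable signal provided by preference rankings among the model's own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors.
In this work, instead of treating all generated samples as equally bad, we leverage the implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model's generations, providing more nuanced preference alignment guidance at zero human cost.
We apply the approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of the behaviors generated by the pre-trained model.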

Summary (original)

Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer, making post-training preference alignment crucial for producing human-preferred motions. Unfortunately, post-training alignment requires extensive preference rankings of motions generated by the pre-trained model, which are costly to annotate, especially in multi-agent settings. Recently, there has been growing interest in leveraging pre-training demonstrations to scalably generate preference data for post-training alignment. However, these methods often adopt an adversarial assumption, treating all pre-trained model-generated samples as unpreferred examples. This adversarial approach overlooks the valuable signal provided by preference rankings among the model’s own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model’s generations, offering more nuanced preference alignment guidance with zero human cost. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of pre-trained model’s generated behaviors, making a lightweight 1M motion generation model comparable to SOTA large imitation-based models by relying solely on implicit feedback from pre-training demonstrations, without additional post-training human preference annotations or high computational costs.
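The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the implicit preference is measured, so this example assumes a simple stand-in: rollouts closer to the demonstration (by mean Euclidean displacement across agents and timesteps) are treated as more preferred. The names `scene_distance`, `build_preference_pairs`, and the `margin` parameter are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import hypot

# A trajectory is a list of (x, y) waypoints; a scene holds one trajectory per agent.
Trajectory = list[tuple[float, float]]
Scene = list[Trajectory]


def scene_distance(rollout: Scene, demo: Scene) -> float:
    """Mean Euclidean displacement between a rollout and the demonstration.

    This distance is an assumed proxy for the implicit preference; the
    paper's actual measure is not described in the abstract.
    """
    total, count = 0.0, 0
    for agent_rollout, agent_demo in zip(rollout, demo):
        for (x1, y1), (x2, y2) in zip(agent_rollout, agent_demo):
            total += hypot(x1 - x2, y1 - y2)
            count += 1
    return total / count


def build_preference_pairs(rollouts: list[Scene], demo: Scene, margin: float = 0.0):
    """Rank rollouts by distance to the demo and return (preferred, unpreferred) index pairs."""
    dists = [scene_distance(r, demo) for r in rollouts]
    ranking = sorted(range(len(rollouts)), key=lambda i: dists[i])
    pairs = []
    # combinations() preserves ranking order, so `a` is always at least as close as `b`.
    for a, b in combinations(ranking, 2):
        if dists[b] - dists[a] > margin:  # skip near-ties
            pairs.append((a, b))
    return pairs


# Toy single-agent scene: one demonstration and three sampled rollouts.
demo = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
rollouts = [
    [[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]],  # drifts far from the demo
    [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)]],  # stays close to the demo
    [[(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]],  # in between
]
pairs = build_preference_pairs(rollouts, demo)
print(pairs)
```

Unlike the adversarial setup, where every rollout would only be paired against the demonstration as the unpreferred side, the pairs here compare the model's own generations with each other. Such pairs could then feed a contrastive preference objective during post-training.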

arXiv information

Authors	Ran Tian, Kratarth Goel
Published	2025-03-25 23:02:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
