SDPO: Segment-Level Direct Preference Optimization for Social Agents

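Abstract

Social agents powered by large language models (LLMs) can simulate human social behavior, but they handle complex, goal-oriented social dialogue poorly. Direct Preference Optimization (DPO) has proven effective at aligning LLM behavior with human preferences across a range of agent tasks. Existing DPO-based approaches for multi-turn dialogue fall into turn-level and session-level methods. Turn-level methods are too fine-grained because they focus only on individual turns, while session-level methods are too coarse-grained and often introduce training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within an interaction to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark show that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs such as GPT-4o, highlighting SDPO's potential to improve the social intelligence of LLM-based agents. Code and data are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.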

Abstract (original)

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
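
The abstract does not give SDPO's objective, so the following is only a rough sketch of how a preference loss might be restricted to a segment of turns. It uses the standard DPO loss for a single preference pair; the helper names (`segment_logprob`, `segment_dpo_loss`), the representation of a dialogue as a list of per-turn log-probabilities, and the segment boundaries are all hypothetical and not taken from the paper or its repository.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given summed log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large |x|.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_logprob(turn_logprobs, start, end):
    """Sum per-turn log-probabilities over the turns in [start, end)."""
    return sum(turn_logprobs[start:end])


def segment_dpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, start, end, beta=0.1):
    """Hypothetical segment-level variant: only turns in [start, end) contribute.

    Each argument before `start` is a list of per-turn log-probabilities for the
    preferred (pos) or dispreferred (neg) dialogue under the policy or the
    reference model. Using the same boundaries for both dialogues is an
    assumption made for this illustration.
    """
    return dpo_loss(
        segment_logprob(policy_pos, start, end),
        segment_logprob(policy_neg, start, end),
        segment_logprob(ref_pos, start, end),
        segment_logprob(ref_neg, start, end),
        beta=beta,
    )
```

In this sketch, setting `start` and `end` to cover a single turn corresponds to a turn-level loss and covering every turn corresponds to a session-level loss; a segment sits between the two granularities.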

arXiv information

Authors	Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Published	2025-01-03 14:09:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
