AIPO: Improving Training Objective for Iterative Preference Optimization

Summary

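Preference Optimization (PO) is gaining popularity as an alternative to Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs).
Recent work that aligns LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training, both in academic settings and for proprietary trained models such as Llama3.
Despite this success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) because of the iterative nature of the process.
In this work, we study iterative preference optimization with synthetic data.
We share the findings and analysis gathered while building the iterative preference optimization pipeline.
More specifically, we discuss the length exploitation issue in iterative preference optimization and propose a training objective for it, namely Agreement-aware Iterative Preference Optimization (AIPO).
To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard.
Our implementation and model checkpoints will be made available at https://github.com/bytedance/AIPO.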

Abstract (original)

Preference Optimization (PO), is gaining popularity as an alternative choice of Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training for both academic settings and proprietary trained models such as Llama3. Despite its success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process. In this work, we study iterative preference optimization with synthetic data. We share the findings and analysis along the way of building the iterative preference optimization pipeline. More specifically, we discuss the length exploitation issue during iterative preference optimization and propose our training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO). To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard. Our implementation and model checkpoints will be made available at https://github.com/bytedance/AIPO.

arXiv information

Authors	Yaojie Shen, Xinyao Wang, Yulei Niu, Ying Zhou, Lexin Tang, Libo Zhang, Fan Chen, Longyin Wen
Published	2024-09-13 14:03:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
