Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Abstract

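Self-play methods have proven remarkably successful at strengthening model capabilities across a variety of domains.
In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only improves the performance of Large Language Models (LLMs) but also overcomes the limitations of the traditional Bradley-Terry (BT) model assumption by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game.
However, existing methods either guarantee only average-iterate convergence, which incurs high storage and inference costs, or converge to the NE of a regularized game and therefore fail to reflect true human preferences accurately.
This paper introduces Magnetic Preference Optimization (MPO), a new approach that achieves last-iterate convergence to the NE of the original game and so overcomes the limitations of existing methods.
MPO builds on Magnetic Mirror Descent (MMD) and attains a linear convergence rate, which makes it particularly well suited to fine-tuning LLMs.
To ensure that the algorithm is both theoretically sound and practically viable, the authors present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting.
Empirical results show that MPO can significantly improve LLM performance, highlighting the potential of self-play methods for alignment.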

Abstract (original)

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
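The abstract does not spell out the update rule, so the following is only a toy sketch of the idea, not the authors' implementation. It assumes the closed-form MMD step under a negative-entropy mirror map, and it assumes that the magnet (reference policy) is periodically reset to the current policy so that the iterates approach the NE of the original game rather than the regularized one. The 3x3 preference matrix `P`, the function names `mmd_step` and `self_play`, and all hyperparameter values are hypothetical and chosen purely for illustration. The preferences are cyclic, so no Bradley-Terry reward can represent them, and the NE of the induced constant-sum game is the uniform policy.

```python
import numpy as np

# Hypothetical cyclic preference matrix (illustration only):
# P[i, j] = probability that response i is preferred to response j.
# P + P.T == 1, so the induced two-player game is constant-sum.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])


def mmd_step(pi, magnet, eta, alpha):
    """One assumed MMD step with a negative-entropy mirror map.

    Solves argmax_p  eta * <q, p> - eta * alpha * KL(p, magnet) - KL(p, pi),
    where q[i] is the win rate of response i against the current policy.
    """
    q = P @ pi
    logits = (np.log(pi) + alpha * eta * np.log(magnet) + eta * q) / (1.0 + alpha * eta)
    logits -= logits.max()  # subtract the max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()


def self_play(pi0, eta=0.5, alpha=0.2, outer=40, inner=100):
    """Self-play loop: run MMD against a fixed magnet, then refresh the magnet."""
    pi = np.asarray(pi0, dtype=float)
    magnet = pi.copy()
    for _ in range(outer):
        for _ in range(inner):
            pi = mmd_step(pi, magnet, eta, alpha)
        # Assumption: resetting the magnet to the current policy removes the
        # bias toward the old reference, so the last iterate keeps moving
        # toward the equilibrium of the unregularized game.
        magnet = pi.copy()
    return pi


final_policy = self_play([0.8, 0.15, 0.05])
print(final_policy)
```

In this toy game the last iterate itself ends up close to the uniform policy, so no averaging over stored past policies is needed. A real RLHF setup would replace the explicit matrix with sampled preference comparisons and the tabular policy with an LLM.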

arXiv information

Authors	Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J. Su, Yaodong Yang
Published	2024-12-20 16:26:58+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
