Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

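Summary

Direct Preference Optimization (DPO) is effective at substantially improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarization, and alignment.
DPO uses pairs of preferred and dispreferred data to model the relative probability of choosing one response over another.
In this work, we first show theoretically that the standard DPO loss can lead to a reduction in the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases.
We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, particularly datasets in which the edit distance between pairs of completions is low.
Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure that avoids this failure mode.
Surprisingly, we also find that DPOP significantly outperforms DPO across a variety of datasets and downstream tasks, including datasets with high edit distances between completions.
By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance.
Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and is the first open-source LLM to exceed an average accuracy of 80%.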

Abstract (original)

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model’s likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and becomes the first open-source LLM to surpass an average accuracy of 80%.
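
The failure mode described in the abstract can be illustrated with a small numeric sketch. The abstract itself gives no formulas, so everything below is an assumption: `dpo_loss` follows the commonly cited per-example DPO form (negative log-sigmoid of the scaled difference of log-ratios against a reference model), and `dpop_style_loss` adds a penalty of the form max(0, reference log-likelihood minus policy log-likelihood) on the preferred completion, which is my reading of how DPOP is usually presented rather than something stated on this page. The names `beta`, `lam`, and all log-probability values are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (assumed standard form).

    logp_w / logp_l are the policy's sequence log-probabilities for the
    preferred and dispreferred completions; ref_* are the reference model's.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def dpop_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """DPO plus a penalty when the preferred completion becomes less likely
    than under the reference model (assumed form; lam is a hypothetical value).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    penalty = max(0.0, ref_logp_w - logp_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


# Hypothetical reference log-probabilities for one preference pair.
ref_w, ref_l = -10.0, -10.0

# Policy before training matches the reference.
before = (-10.0, -10.0)
# After training, BOTH completions are less likely, but the dispreferred
# one has fallen further, so the relative probability still improved.
after = (-12.0, -20.0)

dpo_before = dpo_loss(*before, ref_w, ref_l)
dpo_after = dpo_loss(*after, ref_w, ref_l)
dpop_before = dpop_style_loss(*before, ref_w, ref_l)
dpop_after = dpop_style_loss(*after, ref_w, ref_l)

print("DPO  loss before/after:", dpo_before, dpo_after)
print("DPOP loss before/after:", dpop_before, dpop_after)
```

With these made-up numbers, the DPO loss goes down even though the preferred completion's log-probability dropped from -10 to -12, which is the behaviour the abstract warns about. The penalised variant goes up for the same update, so an optimizer minimising it would be discouraged from moving in that direction.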

arXiv information

Authors	Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Published	2024-02-20 18:42:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
