Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

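Summary

Direct Preference Optimisation (DPO) is effective at substantially improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. DPO uses pairs of preferred and dispreferred data to model the relative probability of choosing one response over another. In this work, we first show theoretically that the standard DPO loss can reduce the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these findings, we design DPO-Positive (DPOP), a new loss function and training procedure that avoids this failure mode. Surprisingly, DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, the DPOP-tuned model outperforms the DPO-tuned model (all else being equal) on benchmarks that are independent of the fine-tuning data, such as MT-Bench. Finally, we use DPOP to create and open-source Smaug-34B and Smaug-72B; the latter became the first open-source LLM to exceed an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.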

Abstract (original)

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model’s likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
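The failure mode described in the abstract can be illustrated with a small numeric sketch. The abstract does not state any formulas, so everything below is an assumption: `dpo_loss` uses the commonly cited form of the DPO objective for a single preference pair, and `penalised_loss` adds a penalty whenever the preferred completion's log-likelihood falls below the reference model's, which is one plausible reading of how a DPOP-style loss could avoid the problem. The function names, the log-likelihood values, and the settings `beta=0.1` and `lam=5.0` are illustrative only and are not taken from this page.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Commonly cited DPO loss for one (preferred, dispreferred) pair.

    Only the *difference* of the two log-ratios enters the loss.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def penalised_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """Assumed DPOP-style variant (hypothetical form, not from this page).

    Subtracts a penalty that is positive only when the policy assigns the
    preferred completion a lower log-likelihood than the reference model.
    """
    penalty = max(0.0, ref_logp_w - logp_w)
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l) - lam * penalty
    return -log_sigmoid(beta * margin)


# Illustrative sequence log-likelihoods under the frozen reference model.
ref_w, ref_l = -10.0, -10.0

# Step 0: the policy equals the reference model.
dpo_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)
pen_before = penalised_loss(-10.0, -10.0, ref_w, ref_l)

# Later step: BOTH completions became less likely, but the dispreferred
# one fell faster, so the relative probability still improved.
dpo_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)
pen_after = penalised_loss(-12.0, -20.0, ref_w, ref_l)

print("DPO loss:       ", dpo_before, "->", dpo_after)
print("Penalised loss: ", pen_before, "->", pen_after)
```

In this toy setting the plain DPO loss goes down even though the preferred completion's log-likelihood dropped from -10 to -12, while the penalised variant goes up and so would push against that drop.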

arXiv information

Authors	Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Published	2024-07-03 13:46:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
