Understanding the Logic of Direct Preference Alignment through Logic

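Summary

Recent direct preference alignment (DPA) algorithms, such as DPO, have shown great promise in aligning large language models with human preferences.
This has motivated many new variants of the original DPO loss, but understanding the differences between these proposals, and developing new DPA loss functions, remains difficult because there is no technical and conceptual framework for reasoning about the underlying semantics of these algorithms.
This paper attempts to remedy this by formalizing DPA losses in terms of discrete reasoning problems.
Specifically, we ask: given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics?
How do the semantics of two losses relate to each other?
We propose a novel formalism for characterizing preference losses for both single-model and reference-model-based approaches, and identify symbolic forms for a number of commonly used DPA variants.
Further, we show how this formal view of preference learning sheds new light on both the size and the structure of the DPA loss landscape, making it possible not only to rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles.
We hope our framework and findings will provide useful guidance to those working on human-AI alignment.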

Abstract (original)

Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? How do the semantics of two losses relate to each other? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.

arXiv information

Authors	Kyle Richardson, Vivek Srikumar, Ashish Sabharwal
Published	2024-12-23 16:23:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
