Understanding the Logic of Direct Preference Alignment through Logic

Summary

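Recent direct preference alignment (DPA) algorithms, such as DPO, have shown great promise for aligning large language models with human preferences.
This has motivated many new variants of the original DPO loss, but understanding how these recent proposals differ, and developing new DPA loss functions, remains difficult because there is no technical and conceptual framework for reasoning about the semantics underlying these algorithms.
In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems.
Specifically, we ask: given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics?
We propose a new formalism for characterizing preference losses for both single-model and reference-model-based approaches, and identify symbolic forms for many commonly used DPA variants.
We further show that this formal view of preference learning sheds new light on both the size and the structure of the DPA loss landscape, making it possible not only to characterize rigorously the relationships between recent loss proposals but also to explore the landscape systematically and derive new loss functions from first principles.
We hope our framework and findings will provide useful guidance to those working on human-AI alignment.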

Abstract (original)

Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
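For readers unfamiliar with the loss the abstract refers to, the following is a minimal sketch of the standard DPO objective for a single preference pair, not the symbolic formalism proposed in the paper. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be sequence log-probabilities of the preferred ("winner") and dispreferred ("loser") responses under the policy and the reference model.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    All arguments are hypothetical names for plain floats:
    logp_w / logp_l         -- policy log-probs of winner / loser
    ref_logp_w / ref_logp_l -- reference-model log-probs of winner / loser
    beta                    -- scaling factor (assumed default)
    """
    # Margin between the policy-vs-reference log-ratios of winner and loser.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large |z|.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss decreases as the policy favors the preferred response more strongly than the reference does.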

arXiv information

Authors	Kyle Richardson, Vivek Srikumar, Ashish Sabharwal
Published	2025-03-27 17:30:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
