Aligning language models with human preferences

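Summary

Language models (LMs) trained on vast amounts of text data can acquire advanced skills such as summarization, question answering, and code generation.
However, they also exhibit behaviors that violate human preferences: for example, they can generate offensive content or falsehoods, or perpetuate social biases.
This thesis explores several approaches to aligning LMs with human preferences.
First, it argues that aligning an LM can be viewed as Bayesian inference, that is, conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2).
Conditioning on human preferences can be implemented in many ways.
Chapter 3 investigates the relation between two approaches to finetuning a pretrained LM using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching.
It shows that RLHF can be seen as a special case of distribution matching, while distribution matching is strictly more general.
Chapter 4 shows how to extend distribution matching to conditional language models.
Finally, Chapter 5 explores a different route: conditioning an LM on human preferences already during pretraining.
The author shows that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning.
Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.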
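
The Bayesian reading of alignment can be illustrated on a toy example. The sketch below is not taken from the thesis: it assumes the RLHF objective is the usual KL-regularized one (expected reward minus beta times the KL divergence from the pretrained LM), and it uses a hypothetical three-outcome "language model" with made-up probabilities and rewards. Under that assumption, the optimal policy is the prior reweighted by exp(reward / beta), which has the form of a Bayesian posterior and can serve as a target distribution for distribution matching.

```python
import math


def posterior(prior, reward, beta):
    """Return p(x) proportional to prior(x) * exp(reward(x) / beta)."""
    weights = {x: p * math.exp(reward[x] / beta) for x, p in prior.items()}
    z = sum(weights.values())  # normalizing constant
    return {x: w / z for x, w in weights.items()}


def kl_regularized_objective(policy, prior, reward, beta):
    """Compute E_policy[reward] - beta * KL(policy || prior)."""
    expected_reward = sum(policy[x] * reward[x] for x in policy)
    kl = sum(
        policy[x] * math.log(policy[x] / prior[x])
        for x in policy
        if policy[x] > 0
    )
    return expected_reward - beta * kl


# Hypothetical toy setup: three whole outputs instead of token sequences.
prior = {"helpful": 0.5, "neutral": 0.3, "offensive": 0.2}
reward = {"helpful": 1.0, "neutral": 0.0, "offensive": -2.0}
beta = 0.5  # assumed strength of the KL penalty

target = posterior(prior, reward, beta)
best = kl_regularized_objective(target, prior, reward, beta)

# Two alternative policies to compare against the reweighted prior.
uniform = {x: 1.0 / len(prior) for x in prior}
unchanged = dict(prior)

for name, policy in [("target", target), ("uniform", uniform), ("prior", unchanged)]:
    score = kl_regularized_objective(policy, prior, reward, beta)
    print(f"{name}: objective = {score:.4f}")
```

In this toy setting the reweighted prior scores at least as high as the other two policies, and its objective value equals beta times the log of the normalizing constant. A smaller beta moves more probability mass toward high-reward outputs, while a larger beta keeps the policy closer to the pretrained prior.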

Summary (original)

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching but distributional matching is strictly more general. In chapter 4, I show how to extend the distribution matching to conditional language models. Finally, in chapter 5 I explore a different root: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.

arXiv information

Author	Tomasz Korbak
Published	2024-04-18 12:55:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
