Decoding-time Realignment of Language Models

Abstract

Aligning language models with human preferences is crucial for reducing errors and biases in these models. Alignment techniques, such as reinforcement learning from human feedback (RLHF), are typically cast as optimizing a tradeoff between human preference rewards and a proximity regularization term that encourages staying close to the unaligned model. Selecting an appropriate level of regularization is critical: insufficient regularization can lead to reduced model capabilities due to reward hacking, whereas excessive regularization hinders alignment. Traditional methods for finding the optimal regularization level require retraining multiple models with varying regularization strengths. This process, however, is resource-intensive, especially for large models. To address this challenge, we propose decoding-time realignment (DeRa), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. DeRa enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models. It also enhances the efficiency of hyperparameter tuning by enabling the identification of effective regularization strengths using a validation dataset.
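The abstract does not spell out how the transition between the unaligned and aligned models is computed. The following is a minimal sketch under one assumption: that the two models' next-token logits are linearly interpolated at decoding time with a mixing weight. The names `realign_logits`, `next_token_distribution`, and `lam` are hypothetical and chosen for illustration only. With `lam = 0` the sketch reproduces the unaligned (reference) distribution, with `lam = 1` the aligned one, and intermediate values blend the two without any retraining.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def realign_logits(ref_logits, aligned_logits, lam):
    """Blend per-token logits of the reference and aligned models.

    Assumption (not stated in the abstract): the blend is a linear
    interpolation, lam * aligned + (1 - lam) * reference.
    """
    if len(ref_logits) != len(aligned_logits):
        raise ValueError("both models must share the same vocabulary size")
    return [lam * a + (1.0 - lam) * r
            for r, a in zip(ref_logits, aligned_logits)]


def next_token_distribution(ref_logits, aligned_logits, lam):
    """Next-token probabilities under the blended logits."""
    return softmax(realign_logits(ref_logits, aligned_logits, lam))


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary (made-up numbers).
    ref = [2.0, 1.0, 0.0]
    aligned = [0.0, 1.0, 3.0]
    for lam in (0.0, 0.5, 1.0):
        probs = next_token_distribution(ref, aligned, lam)
        print(lam, [round(p, 3) for p in probs])
```

In practice the two logit vectors would come from running both models on the same prefix at every decoding step, and a few values of `lam` could be compared on a validation set to pick an effective regularization strength, as the abstract describes.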

arXiv information

Authors	Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
Published	2024-02-05 13:31:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
