Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Summary

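As LLMs are applied more widely and become ubiquitous, there is a growing need to align their responses with the preferences of users and stakeholders.
Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment.
However, such parameter tuning is known to interfere with model performance on many tasks.
It also makes it difficult to keep up with shifting user preferences.
Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time.
However, most such methods fail to strike the right balance between exploration and exploitation of reward, often because their formulation conflates the two, and so they fail to give well-aligned responses.
To remedy this, we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is realized by periodically replacing poorly rewarded generations with well-rewarded ones.
Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench.
Our implementation will be available at https://darwin-alignment.github.io.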
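The decoupling described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (evolutionary_decode, toy_generate, toy_mutate, toy_reward), the population size, the number of rounds, and the choice to score every response against the original instruction are all hypothetical. The generator, mutator, and reward model are deterministic stand-ins for an LLM and a learned reward model.

```python
from typing import Callable, List, Tuple

# (instruction used for decoding, response, reward)
Candidate = Tuple[str, str, float]

# Hypothetical mutation phrases; a real system would rewrite instructions with an LLM.
HINTS = ["Be concise.", "Give an example.", "Explain step by step."]


def toy_generate(instruction: str) -> str:
    """Stand-in for LLM decoding: echoes the instruction it was given."""
    return "Response to: " + instruction


def toy_mutate(instruction: str, seed: int) -> str:
    """Stand-in for instruction mutation: appends one hint chosen by seed."""
    return instruction + " " + HINTS[seed % len(HINTS)]


def toy_reward(original: str, response: str) -> float:
    """Stand-in for a reward model: counts distinct hints the response covers."""
    return float(sum(h in response for h in HINTS))


def evolutionary_decode(
    instruction: str,
    generate: Callable[[str], str],
    mutate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    population: int = 4,
    rounds: int = 3,
    keep: int = 2,
) -> Candidate:
    def score(ins: str) -> Candidate:
        resp = generate(ins)
        # Assumption: reward is measured against the ORIGINAL instruction.
        return (ins, resp, reward(instruction, resp))

    # Initial population: the original instruction plus mutated variants.
    candidates: List[Candidate] = [score(instruction)]
    candidates += [score(mutate(instruction, i)) for i in range(1, population)]

    for r in range(rounds):
        # Exploitation: drop poorly rewarded generations, keep the best ones.
        candidates.sort(key=lambda c: c[2], reverse=True)
        survivors = candidates[:keep]
        # Exploration: refill the population by decoding from mutated instructions.
        children: List[Candidate] = []
        i = 0
        while len(survivors) + len(children) < population:
            parent = survivors[i % len(survivors)]
            children.append(score(mutate(parent[0], r * population + i)))
            i += 1
        candidates = survivors + children

    return max(candidates, key=lambda c: c[2])


best = evolutionary_decode("Explain recursion.", toy_generate, toy_mutate, toy_reward)
print(best[2], best[0])
```

In this sketch the two aspects live in separate steps of the loop: mutation only widens the set of instructions that are decoded, while replacement only filters by reward, so either can be tuned without touching the other.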

Abstract (original)

The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward (often due to the conflated formulation of these two aspects) to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.

arXiv information

Authors	Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
Published	2024-06-24 17:36:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
