Variational Best-of-N Alignment

Summary

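Best-of-N (BoN) is a widely used and effective algorithm for aligning language models with human preferences. At inference time, N samples are drawn from the language model, and the sample that the reward model scores highest is returned as the output. One strategy for making BoN cheaper at inference time is to fine-tune the language model to mimic what BoN does during inference. To this end, we derive the distribution induced by the BoN algorithm and propose fine-tuning the language model to minimize the backward KL divergence to that BoN distribution. Because the approach is analogous to mean-field variational inference, we call it variational BoN (vBoN). Experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and that our variational approximation comes closest to BoN in performance while surpassing models fine-tuned with the standard KL-constrained RL objective. In the controlled generation task, vBoN appears on the Pareto frontier of reward and KL divergence more often than the other alignment methods. In the summarization task, vBoN achieves high reward values across a range of sampling temperatures.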
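
The two steps described above can be written compactly. The notation here is an assumption for illustration and does not appear in the summary itself: \pi_{\mathrm{ref}} is the language model being sampled, r is the reward model, \pi_{\mathrm{bon}} is the distribution induced by BoN, and \pi_\theta is the fine-tuned model. "Backward KL" is read in the usual variational-inference sense, with the fine-tuned model as the first argument.

```latex
% BoN at inference time: draw N samples and keep the highest-reward one.
y_1, \dots, y_N \sim \pi_{\mathrm{ref}}, \qquad
y_{\mathrm{bon}} = \operatorname*{arg\,max}_{y \in \{y_1, \dots, y_N\}} r(y)

% vBoN fine-tuning: minimize the backward KL divergence to the BoN distribution.
\theta^{\star}
  = \operatorname*{arg\,min}_{\theta}\;
    \mathrm{KL}\!\left(\pi_\theta \,\middle\|\, \pi_{\mathrm{bon}}\right)
  = \operatorname*{arg\,min}_{\theta}\;
    \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \pi_\theta(y) - \log \pi_{\mathrm{bon}}(y)\right]
```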

Summary (original)

Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures.
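
As a minimal sketch of the procedure the abstract describes, the Python below implements BoN sampling and, for a toy setting, the exact distribution that BoN induces. Everything concrete here is hypothetical and chosen only for illustration: the three-output "language model" `REF_PROBS`, the reward table `REWARDS`, and the function names. The closed form used in `bon_distribution` is the standard law of the maximum of N i.i.d. draws and assumes a finite output set with distinct rewards; it is not claimed to be the form derived in the paper.

```python
import random


def best_of_n(sample, reward, n, rng):
    """Draw n candidates and return the one with the highest reward."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


def bon_distribution(ref_probs, rewards, n):
    """Exact BoN distribution over a finite set with distinct rewards.

    P(BoN = y) = P(r <= r(y))^n - P(r < r(y))^n, i.e. the probability that
    the maximum-reward item among n i.i.d. draws is exactly y.
    """
    out = {}
    for y, r_y in rewards.items():
        below = sum(p for z, p in ref_probs.items() if rewards[z] < r_y)
        at_or_below = below + ref_probs[y]
        out[y] = at_or_below ** n - below ** n
    return out


# Hypothetical toy reference model over three outputs, with a toy reward.
REF_PROBS = {"bad": 0.5, "ok": 0.3, "good": 0.2}
REWARDS = {"bad": 0.0, "ok": 1.0, "good": 2.0}


def sample_ref(rng):
    """Sample one output from the toy reference model."""
    outputs = list(REF_PROBS)
    weights = [REF_PROBS[y] for y in outputs]
    return rng.choices(outputs, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    exact = bon_distribution(REF_PROBS, REWARDS, n)
    draws = [best_of_n(sample_ref, REWARDS.get, n, rng) for _ in range(20000)]
    # Print exact BoN probability next to the empirical frequency.
    for y in REF_PROBS:
        print(y, round(exact[y], 4), round(draws.count(y) / len(draws), 4))
```

In this toy setting with N = 4, the probability of the highest-reward output rises from 0.2 under the reference model to 1 - 0.8^4 = 0.5904 under BoN, at the cost of four samples per returned output. That per-output cost is what fine-tuning toward the BoN distribution aims to remove.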

arXiv information

Authors	Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
Published	2025-03-03 11:08:15+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, DeepL
