LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

Summary

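Reward models (RMs) play a key role in aligning LLMs with human preferences, improving performance by ranking outputs during inference or iterative training.
However, how well an RM generalizes to a new task is often not known in advance (for example, some RMs may be better at scoring creative writing than mathematical reasoning).
Using a single fixed RM throughout LLM training can therefore be suboptimal.
Moreover, optimizing an LLM against multiple RMs simultaneously can be prohibitively compute-intensive and difficult, because conflicting signals from different RMs can degrade performance.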
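To address these challenges, this work introduces LASeR (Learning to Adaptively Select Rewards), which iteratively trains an LLM with multiple RMs. Framed as a multi-armed bandit problem, it selects the best-suited RM for each instance and uses it to rank outputs and generate preference data.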
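Results on commonsense and mathematical reasoning tasks show that LASeR improves iterative LLM optimization by optimizing for multiple RMs: it raises the absolute average accuracy of Llama-3-8B across three datasets by 2.67% over training with ensemble RM scores, while also training more efficiently (e.g., a 2x speedup).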
In addition, on WildChat, a benchmark of instruction-following prompts, Llama-3-8B trained with LASeR reaches a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs.
Extended to long-context generation tasks with Llama-3-8B, LASeR used with best-of-n sampling improves over random RM selection by an average of 2.64 F1 on single-document QA and 2.42 F1 on multi-document QA.
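LASeR is robust to noisy rewards and generalizes to multiple settings.
Finally, LASeR's RM selection varies with the underlying task or instance, and the authors verify that multiple RMs produce conflicting preferences, which LASeR can mitigate.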

Abstract (original)

Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR’s RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
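
The abstract frames RM selection as a multi-armed bandit problem but does not say which bandit algorithm is used. As a rough illustration only, the sketch below treats each RM as an arm of a standard UCB1 bandit. The class `UCB1Selector`, the function `simulate`, and the Bernoulli feedback standing in for "how useful was the chosen RM" are all hypothetical names and assumptions made for this example, not the authors' method.

```python
import math
import random


class UCB1Selector:
    """UCB1 bandit over a fixed set of arms (here: hypothetical reward models)."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms   # how often each arm was chosen
        self.means = [0.0] * n_arms  # running mean of observed feedback per arm
        self.c = c                   # exploration strength (assumed value)

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.c * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(true_quality, steps=2000, seed=0):
    """Simulate selection where arm i yields feedback 1 with probability true_quality[i].

    In a real training loop the feedback would come from the LLM update itself;
    the Bernoulli draw here is only a stand-in for this sketch.
    """
    rng = random.Random(seed)
    selector = UCB1Selector(len(true_quality))
    for _ in range(steps):
        arm = selector.select()
        reward = 1.0 if rng.random() < true_quality[arm] else 0.0
        selector.update(arm, reward)
    return selector.counts
```

Calling `simulate([0.2, 0.5, 0.8])` returns how many times each simulated RM was picked. The point of the bandit framing is that selections concentrate on the arm that gives the most useful feedback while the others are still explored occasionally, rather than every RM being scored on every instance.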

arXiv information

Authors	Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Published	2024-10-02 16:46:38+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
