Inference Time Alignment with Reward-Guided Tree Search

Summary

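Inference-time computation methods improve the performance of large language models (LLMs) by spending additional compute at inference.
Common techniques such as Best-of-N sampling, majority voting, and variants of tree-search algorithms have proven effective at boosting LLM performance.
These approaches strategically trade increased computational resources for improved model responses.
In this work, we propose DARWIN, an inference-time alignment method that uses the guidance of a reward model to achieve alignment through reward-guided tree search.
Empirical evidence indicates that our method outperforms other inference-time alignment methods, such as Best-of-N and ARGS, on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench.
Furthermore, we show that our inference-time approach achieves performance comparable to preference-tuned models on both benchmarks, highlighting the effectiveness of trading inference-time compute for better performance.
We have released our code at https://github.com/declare-lab/darwin.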

Summary (original)

Inference-time computation methods enhance the performance of Large Language Models (LLMs) by leveraging additional computational resources to achieve superior results. Common techniques, such as Best-of-N sampling, Majority Voting, and variants of tree-search algorithms, have proven to be effective in boosting the performance of LLMs. These approaches strategically trade increased computational resources for improved model responses. In this work, we propose DARWIN, an inference-time alignment method that leverages the guidance of a reward model to achieve alignment through a reward-guided tree search. Empirical evidence indicates that our method outperforms other inference-time alignment methods such as Best-of-N and ARGS on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Furthermore, we show that our inference-time approach achieves performance comparable to preference-tuned models on both benchmarks, highlighting the effectiveness of trading inference-time compute for enhanced performance during inference. We have released our code at https://github.com/declare-lab/darwin.
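To make the compute-for-quality trade concrete, here is a minimal sketch of Best-of-N sampling, one of the baselines the abstract names. It is not DARWIN and is not taken from the released code. The names `best_of_n`, `make_toy_generator`, and `toy_reward` are hypothetical, and the generator and reward function are toy stand-ins for an LLM sampler and a reward model.

```python
import random
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


def make_toy_generator(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM sampler: picks a canned reply at random."""
    rng = random.Random(seed)
    replies = [
        "ok",
        "Sure, here is a short answer.",
        "Here is a detailed, step-by-step answer.",
    ]
    return lambda prompt: rng.choice(replies)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: longer replies score higher."""
    return float(len(response))


if __name__ == "__main__":
    # Larger n means more generation and scoring calls, i.e. more inference compute.
    print(best_of_n("Explain tree search.", make_toy_generator(), toy_reward, n=8))
```

Raising `n` increases the number of generation and reward calls linearly, which is the extra inference-time compute being traded for a better-scoring response. A reward-guided tree search, as the abstract describes it, uses the reward model to guide the search itself rather than only ranking finished samples. The details of that procedure are not given here.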

arXiv information

Authors	Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
Published	2024-11-26 12:13:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
