Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

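Abstract

Large language models (LLMs) perform impressively, but they cannot adapt quickly to human preferences without retraining.
This work introduces Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, so no model parameters need to be updated.
Instead of relying on purely numerical rewards, TPO converts reward signals into textual critiques and uses those critiques as textual rewards to refine its response iteratively.
Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics show that TPO progressively improves alignment with human preferences.
Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct.
TPO also scales efficiently with both search width and search depth during inference.
Case studies illustrate how TPO exploits an LLM's innate ability to interpret reward signals and act on them.
These findings establish TPO as a practical, lightweight alternative for test-time preference optimization that achieves alignment on the fly.
The code is publicly available at https://github.com/yafuly/TPO.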
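
The loop described in the abstract (score candidates, turn the reward signal into a textual critique, revise, repeat) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `tpo_sketch`, its parameters `generate`, `score`, `width`, and `depth`, the prompt templates, and the toy stand-ins are all hypothetical names chosen here. Contrasting the best and worst candidates to produce the critique is also an assumption of this sketch; the abstract only states that reward signals are translated into textual critiques. No model weights are touched at any point.

```python
from typing import Callable, List, Tuple


def tpo_sketch(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: one call to the frozen policy LLM
    score: Callable[[str, str], float],  # hypothetical: reward model, (prompt, response) -> score
    width: int = 3,                      # candidates sampled per round ("search width")
    depth: int = 2,                      # number of refinement rounds ("search depth")
) -> Tuple[str, float]:
    """Return the best-scoring response found and its score."""
    # Round 0: sample `width` candidates from the unmodified model and score them.
    pool: List[Tuple[float, str]] = []
    for _ in range(width):
        response = generate(prompt)
        pool.append((score(prompt, response), response))

    for _ in range(depth):
        # Assumption: the numerical reward is used only to pick a best/worst pair.
        best = max(pool, key=lambda item: item[0])[1]
        worst = min(pool, key=lambda item: item[0])[1]

        # Ask the model itself to explain the preference in words (the textual critique).
        critique = generate(
            f"CRITIQUE\nPrompt: {prompt}\nChosen: {best}\nRejected: {worst}"
        )

        # Sample `width` revisions conditioned on the critique and score them.
        for _ in range(width):
            revised = generate(
                f"REVISE\nPrompt: {prompt}\nFeedback: {critique}\n{best}"
            )
            pool.append((score(prompt, revised), revised))

    best_score, best_response = max(pool, key=lambda item: item[0])
    return best_response, best_score


# Toy stand-ins so the sketch runs without any model; they are not real LLM calls.
def toy_generate(text: str) -> str:
    if text.startswith("CRITIQUE"):
        return "add more detail"
    if text.startswith("REVISE"):
        draft = text.rsplit("\n", 1)[-1]  # the draft is the last line of the request
        return draft + " +detail"
    return "draft"


def toy_score(prompt: str, response: str) -> float:
    # Pretend the reward model simply prefers responses with more detail.
    return float(response.count("+detail"))
```

With the toy stand-ins, `tpo_sketch("Explain X", toy_generate, toy_score, width=2, depth=2)` improves the draft once per round, which mirrors the idea that more refinement depth can yield a higher-scoring response. In a real setting, `generate` and `score` would wrap an actual LLM and reward model.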

Abstract (original)

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.

arXiv information

Authors	Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
Published	2025-01-22 14:15:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
