Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

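Summary

Advances in reasoning ability have enabled large language models (LLMs) to excel at complex tasks.
However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, and often encourage unnecessarily long reasoning chains that waste tokens.
To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs that lets a model achieve optimal reasoning with fewer tokens.
Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward: it quantifies the episode-wise information gain in the parameters and requires no extra annotations or task-specific evaluators.
We propose a method to estimate this reward quickly, based on PAC-Bayes bounds and the Fisher information matrix.
Theoretical analysis shows that the method significantly reduces computational complexity while keeping estimation accuracy high.
By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning so that it makes the most of each episode and achieves effective updates.
Empirical results on a range of reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, improving both reasoning effectiveness and efficiency.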
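
The idea of rewarding each episode's contribution while penalizing excessive updates can be illustrated with a toy sketch. This is not the estimator from the paper, whose exact form the abstract does not give. It assumes a diagonal Fisher approximation, uses the standard second-order result that the KL divergence between nearby parameter settings is roughly 0.5 · δᵀFδ, and combines it with a hypothetical per-episode gain in the probability of a correct answer. The names `fisher_quadratic_cost`, `episode_rewards`, and the weight `beta` are illustrative assumptions.

```python
from typing import List, Sequence


def fisher_quadratic_cost(delta: Sequence[float],
                          fisher_diag: Sequence[float]) -> float:
    """Second-order KL proxy for a parameter update: 0.5 * sum_i F_i * delta_i^2.

    Assumes a diagonal Fisher information matrix (an illustrative simplification).
    """
    if len(delta) != len(fisher_diag):
        raise ValueError("delta and fisher_diag must have the same length")
    return 0.5 * sum(f * d * d for d, f in zip(delta, fisher_diag))


def episode_rewards(correct_probs: Sequence[float],
                    deltas: Sequence[Sequence[float]],
                    fisher_diag: Sequence[float],
                    beta: float = 0.1) -> List[float]:
    """Toy dense reward: per-episode gain minus a weighted update cost.

    correct_probs holds len(deltas) + 1 values: the (hypothetical) probability
    of a correct answer before the first episode and after each episode.
    """
    if len(correct_probs) != len(deltas) + 1:
        raise ValueError("need one more probability than episodes")
    rewards = []
    for k, delta in enumerate(deltas):
        gain = correct_probs[k + 1] - correct_probs[k]    # episode contribution
        cost = fisher_quadratic_cost(delta, fisher_diag)  # size of the update
        rewards.append(gain - beta * cost)
    return rewards


# Made-up numbers: the second episode adds little and moves the parameters a lot.
rewards = episode_rewards(
    correct_probs=[0.2, 0.5, 0.55],
    deltas=[[0.1, -0.2], [0.4, 0.4]],
    fisher_diag=[1.0, 2.0],
    beta=0.5,
)
```

With these invented inputs the first episode receives a positive reward and the second a negative one, which is the qualitative behavior the summary describes: episodes that contribute little but cause large updates are discouraged.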

Summary (original)

Large language models (LLMs) excel at complex tasks thanks to advances in reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs to make the models achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward, i.e., quantifies the episode-wise information gain in parameters, requiring no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity with high estimation accuracy. By immediately rewarding each episode’s contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to maximize the use of each episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.

arXiv information

Authors	Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong
Published	2025-05-15 15:40:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
