Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization




Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, for example through self-correction and long chains of thought. While promising for problem solving, advanced long-reasoning-chain models exhibit an undesirable single-modal behavior: even trivial questions receive unnecessarily long, tedious chains of thought. In this work, we propose a way to make models aware of inference budgets by formulating this as utility maximization subject to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to “understand” the difficulty of queries and allocate inference budget to the harder ones. Under different inference budgets, our best models achieve absolute improvements of 4.14% and 5.74% (relative improvements of 8.08% and 11.2%) on MATH500 using 2.16x and 4.32x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately 2x those of self-consistency under the same budgets.
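The budget-allocation idea in the abstract can be illustrated with a toy problem. This is NOT the IBPO algorithm, which fine-tunes an LLM policy; it is a minimal sketch, assuming we already know (hypothetical, made-up) per-query accuracies with a short answer and with long reasoning, and a cap on the average inference cost per query. The names `Query`, `allocate`, the costs, and all numbers are illustrative assumptions. Because upgrading any query to long reasoning costs the same extra amount, spending the spare budget on the queries with the largest accuracy gains is optimal for this toy version.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    p_short: float  # hypothetical: expected accuracy with a short answer
    p_long: float   # hypothetical: expected accuracy with long reasoning


def allocate(queries, short_cost, long_cost, avg_budget):
    """Choose which queries get long reasoning so that the mean cost per
    query stays within avg_budget while expected accuracy is maximized.

    Toy model only: every query costs at least short_cost, and upgrading
    one to long reasoning costs (long_cost - short_cost) extra.
    """
    n = len(queries)
    extra = long_cost - short_cost
    spare = avg_budget * n - short_cost * n  # budget left after short answers
    if spare < 0:
        raise ValueError("budget is below the cost of answering every query briefly")
    max_upgrades = int(spare // extra) if extra > 0 else n
    # Equal upgrade cost for every query, so rank purely by accuracy gain.
    ranked = sorted(queries, key=lambda q: q.p_long - q.p_short, reverse=True)
    chosen = {q.name for q in ranked[:max_upgrades] if q.p_long > q.p_short}
    expected = sum(q.p_long if q.name in chosen else q.p_short for q in queries)
    return chosen, expected / n


# Hypothetical queries: long reasoning barely helps the easy ones.
queries = [
    Query("easy", 0.95, 0.97),
    Query("medium", 0.60, 0.80),
    Query("hard", 0.20, 0.55),
    Query("arith", 0.90, 0.92),
]

# Average budget of 2.5 cost units per query: room for two upgrades.
chosen, accuracy = allocate(queries, short_cost=1, long_cost=4, avg_budget=2.5)
print(sorted(chosen), round(accuracy, 4))
```

With these made-up numbers, the extra budget goes to the "hard" and "medium" queries, mirroring the abstract's claim that a budget-aware model should spend long reasoning on harder queries rather than on trivial ones.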


Authors: Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang
Published: 2025-01-31 16:06:26+00:00
arXiv: arxiv_id (pdf)


Category: cs.AI