Learning to Reason via Self-Iterative Process Feedback for Small Language Models

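Summary

Small language models (SLMs) are more efficient, cheaper to run, and easier to customize than large language models (LLMs), but they often underperform in specific areas such as reasoning.
Previous methods for improving SLM reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals. As a result, SLMs become overconfident on the limited supervision signals, which restricts their abilities.
This study therefore enables SLMs to learn to reason from self-iterative feedback.
Using odds ratio preference optimization (ORPO), we fine-tune and align SLMs with positive and negative signals that the SLMs generate themselves.
We also introduce process supervision for the rewards used in preference alignment, through sampling-based inference simulation and process reward models.
Compared with supervised fine-tuning (SFT), our method improves Gemma-2B by 12.43 (Acc) on GSM8K and by 3.95 (Pass@1) on MBPP.
The proposed method also shows better out-of-domain generalization on MMLU_Math and HumanEval.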
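
The abstract names ORPO but does not state its objective. As a rough illustration, the sketch below follows the commonly cited ORPO formulation: a negative log-likelihood term on the preferred response plus a weighted odds-ratio penalty. The function names, the use of length-averaged log-probabilities as inputs, and the default weight `lam=0.1` are assumptions made for this example, not details taken from the paper.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO-style loss for one (chosen, rejected) response pair.

    Inputs are length-averaged log-probabilities of each response under the
    model being trained (an assumption of this sketch).
    """
    # Supervised term: negative log-likelihood of the preferred response.
    nll = -avg_logp_chosen
    # Log of the odds ratio between the preferred and the rejected response.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term
```

Under this formulation the loss is small when the model already assigns a higher probability to the preferred response, and it grows when the rejected response is the more likely one. No separate reference model is involved.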

Summary (original)

Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs’ reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.
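
The abstract mentions process supervision through sampling-based inference simulation but gives no procedure. One common way to realize that idea, shown here purely as an assumption, is to score each reasoning step by the fraction of sampled continuations from that step that reach a correct final answer. The names `estimate_step_rewards`, `rollout`, and `is_correct`, and the toy stand-ins below, are hypothetical.

```python
from typing import Callable, List


def estimate_step_rewards(
    steps: List[str],
    rollout: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_samples: int = 8,
) -> List[float]:
    """Score each step by the share of sampled continuations that end correctly."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        # Simulate n_samples completions from this prefix and count the correct ones.
        hits = sum(1 for _ in range(n_samples) if is_correct(rollout(prefix)))
        rewards.append(hits / n_samples)
    return rewards


# Toy stand-ins (hypothetical): a real setup would sample continuations from the SLM.
def toy_rollout(prefix: List[str]) -> str:
    # Deterministic stub: any prefix containing a wrong step yields a wrong answer.
    return "0" if any("wrong" in step for step in prefix) else "42"


def toy_is_correct(answer: str) -> bool:
    return answer == "42"


toy_rewards = estimate_step_rewards(
    ["compute 6 * 7", "wrong: add 1", "report result"],
    toy_rollout,
    toy_is_correct,
    n_samples=4,
)
```

Step scores of this kind could then be used to rank candidate solutions, with higher-scoring ones treated as positive signals and lower-scoring ones as negative signals for preference alignment. How the paper combines them with its process reward models is not specified in the abstract.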

arXiv information

Authors	Kaiyuan Chen, Jin Wang, Xuejie Zhang
Published	2024-12-11 14:05:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
