Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models

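Summary

Reasoning abilities of smaller and larger language models are aligned mainly through supervised fine-tuning (SFT) on demonstrations generated by robust large language models (LLMs).
Although these approaches yield better-performing models, they do not generalize strongly enough, because training relies only on the provided demonstrations.
This paper proposes Self-refine Instruction-tuning, a method that elicits smaller language models to self-refine their abilities.
The approach is a two-stage process: reasoning abilities are first transferred from LLMs to small language models (SLMs) via instruction-tuning on demonstrations provided by the LLMs, and the instructed models then self-refine their abilities through preference optimization strategies.
In particular, the second phase applies refinement heuristics based on the Direct Preference Optimization algorithm: the SLMs are elicited to produce a series of reasoning paths by automatically sampling the generated responses and assigning rewards using ground truths from the LLMs.
Results on commonsense and math reasoning tasks show that this approach significantly outperforms instruction-tuning in both in-domain and out-of-domain scenarios, aligning the reasoning abilities of smaller and larger language models.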

Abstract (original)

The alignments of reasoning abilities between smaller and larger Language Models are largely conducted via Supervised Fine-Tuning (SFT) using demonstrations generated from robust Large Language Models (LLMs). Although these approaches deliver more performant models, they do not show sufficiently strong generalization ability as the training only relies on the provided demonstrations. In this paper, we propose the Self-refine Instruction-tuning method that elicits Smaller Language Models to self-refine their abilities. Our approach is based on a two-stage process, where reasoning abilities are first transferred between LLMs and Small Language Models (SLMs) via Instruction-tuning on demonstrations provided by LLMs, and then the instructed models Self-refine their abilities through preference optimization strategies. In particular, the second phase operates refinement heuristics based on the Direct Preference Optimization algorithm, where the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses and providing rewards using ground truths from the LLMs. Results obtained on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-domain scenarios, aligning the reasoning abilities of Smaller and Larger Language Models.
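
The second phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that sampled responses whose final answer matches the LLM-provided ground truth are treated as preferred and the rest as rejected, and it uses the standard Direct Preference Optimization loss on sequence log-probabilities. The names `build_preference_pairs` and `dpo_loss`, the `"answer"` field, and the default `beta=0.1` are hypothetical choices made for illustration.

```python
import math
from itertools import product


def build_preference_pairs(samples, ground_truth):
    """Pair every sampled response that matches the ground truth (chosen)
    with every sampled response that does not (rejected).

    Assumption: each sample is a dict with a hypothetical "answer" field.
    """
    chosen = [s for s in samples if s["answer"] == ground_truth]
    rejected = [s for s in samples if s["answer"] != ground_truth]
    return list(product(chosen, rejected))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the model being trained
    (policy) and under the frozen reference model (here, this would be the
    instruction-tuned SLM from the first phase; that mapping is an assumption).
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss decreases as the policy assigns relatively more probability to the chosen response than the reference model does, which is what pushes the SLM toward reasoning paths that reach the ground-truth answer.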

arXiv information

Authors	Leonardo Ranaldi, Andrè Freitas
Published	2024-05-01 09:10:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
