Preference Optimization for Reasoning with Pseudo Feedback

Summary

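Preference optimization methods such as Direct Preference Optimization (DPO) are frequently used to strengthen the reasoning ability of large language models (LLMs) in domains such as mathematical reasoning and coding, usually after supervised fine-tuning.
These methods depend on high-quality labels for reasoning tasks in order to build preference pairs.
However, reasoning datasets with human-verified labels are in limited supply.
This work introduces a new approach that generates pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as evaluation against associated test cases.
Two forms of test-case-based pseudo feedback are explored: one generated by frontier LLMs, and one obtained by extending self-consistency to multiple test cases.
Experiments on both mathematical reasoning and coding tasks, using pseudo feedback for preference optimization, show improvements on both.
Specifically, with Mathstral-7B as the base model, the MATH score improves from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview.
On GSM8K and College Math, scores rise from 85.6 to 90.3 and from 34.3 to 42.3, respectively.
Building on Deepseek-coder-7B-v1.5, the method reaches a score of 24.6 on LiveCodeBench (up from 21.1), surpassing Claude-3-Haiku.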
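
The idea of extending self-consistency to multiple test cases can be illustrated with a small sketch. This is not the paper's implementation: the function names, the majority-vote rule, and the `margin` threshold are all assumptions made for illustration. Each candidate solution is run on every test input, the majority output per input is taken as the pseudo label, and solutions are paired as (chosen, rejected) when their pseudo pass rates differ by at least the margin.

```python
from collections import Counter


def pseudo_labels(outputs):
    """outputs[i][j] is the output of candidate solution i on test input j.

    Returns one pseudo label per test input: the most common output
    across all candidates (self-consistency applied per test case).
    """
    n_tests = len(outputs[0])
    labels = []
    for j in range(n_tests):
        votes = Counter(row[j] for row in outputs)
        labels.append(votes.most_common(1)[0][0])
    return labels


def pass_rates(outputs, labels):
    """Fraction of test inputs on which each candidate matches the pseudo label."""
    return [
        sum(out == lab for out, lab in zip(row, labels)) / len(labels)
        for row in outputs
    ]


def preference_pairs(solutions, rates, margin=0.5):
    """Build (chosen, rejected) pairs whose pass rates differ by >= margin.

    The margin is a hypothetical hyperparameter, not a value from the paper.
    """
    pairs = []
    for i, chosen in enumerate(solutions):
        for j, rejected in enumerate(solutions):
            if rates[i] - rates[j] >= margin:
                pairs.append((chosen, rejected))
    return pairs
```

The resulting pairs could then be passed to any preference optimization trainer such as DPO; how the paper filters or weights pairs is not stated in this summary.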

Abstract (original)

Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
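
For readers unfamiliar with DPO, the standard objective that preference pairs feed into is reproduced here as background. The notation is the usual one from the DPO literature and is not taken from this abstract: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference policy, $\beta$ a temperature hyperparameter, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses. Whether the paper uses this loss unmodified is not stated here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

With pseudo feedback, $y_w$ and $y_l$ would be chosen by test-case outcomes rather than by human-verified labels.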

arXiv information

Authors	Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
Published	2025-02-14 09:32:11+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
