Preference Optimization for Reasoning with Pseudo Feedback

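Summary

Preference optimization techniques such as Direct Preference Optimization (DPO) are frequently used, typically after supervised fine-tuning, to strengthen the reasoning capabilities of large language models (LLMs) in domains such as mathematical reasoning and coding.
These methods rely on high-quality labels for reasoning tasks in order to generate preference pairs.
However, reasoning datasets with human-verified labels are in limited supply.
In this work, we introduce a new approach that generates pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as evaluation against associated test cases.
We explore two forms of test-case-based pseudo feedback: one generated by frontier LLMs, and the other obtained by extending self-consistency to multiple test cases.
Using pseudo feedback for preference optimization, we run experiments on both mathematical reasoning and coding tasks and observe improvements on both.
Specifically, with Mathstral-7B as the base model, we improve the MATH result from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview.
On GSM8K and College Math, scores rise from 85.6 to 90.3 and from 34.3 to 42.3, respectively.
Building on Deepseek-coder-7B-v1.5, we reach a score of 24.6 on LiveCodeBench (up from 21.1), surpassing Claude-3-Haiku.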
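
The idea of extending self-consistency to multiple test cases can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: candidate solutions are plain Python callables, the pseudo label for each test input is the majority output across candidates, and the function names (`run_safely`, `pseudo_labels`, `pass_rates`, `preference_pairs`) and the `margin` threshold are hypothetical rather than taken from the paper.

```python
from collections import Counter


def run_safely(solution, test_input):
    """Run one candidate on one input; a crash counts as 'no output'."""
    try:
        return repr(solution(test_input))  # repr makes outputs hashable and comparable
    except Exception:
        return None


def pseudo_labels(solutions, test_inputs):
    """Majority-vote the output for each test input across all candidates."""
    outputs = [[run_safely(s, x) for x in test_inputs] for s in solutions]
    labels = []
    for j in range(len(test_inputs)):
        votes = Counter(row[j] for row in outputs if row[j] is not None)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return outputs, labels


def pass_rates(outputs, labels):
    """Fraction of pseudo test cases each candidate agrees with (labels must be non-empty)."""
    return [
        sum(o is not None and o == l for o, l in zip(row, labels)) / len(labels)
        for row in outputs
    ]


def preference_pairs(rates, margin=0.5):
    """(chosen, rejected) index pairs whose pass rates differ by at least `margin` (assumed threshold)."""
    n = len(rates)
    return [(i, j) for i in range(n) for j in range(n) if rates[i] - rates[j] >= margin]


if __name__ == "__main__":
    # Three sampled "solutions" to the toy task "square the input"; the third is wrong.
    sampled = [lambda x: x * x, lambda x: x ** 2, lambda x: x * 2]
    outputs, labels = pseudo_labels(sampled, [0, 1, 2, 3])
    rates = pass_rates(outputs, labels)
    print(rates)                    # [1.0, 1.0, 0.5]
    print(preference_pairs(rates))  # [(0, 2), (1, 2)]
```

No ground-truth outputs are needed here: the wrong candidate is ranked lower only because it disagrees with the majority on some inputs. The same property means a majority of wrong candidates would produce wrong pseudo labels, which is why this is feedback in a "pseudo" sense.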

Summary (original)

Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
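
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities of a chosen and a rejected response under the trained policy and a frozen reference model. The abstract does not say which DPO variant or hyperparameters the authors use, so the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions only.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    All arguments are summed log-probabilities of a full response given the prompt.
    `beta` is an assumed, illustrative value, not one reported in the abstract.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy prefers the chosen response more than the reference does: the loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is how preference pairs derived from pseudo feedback would shape training.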

arXiv information

Authors	Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
Published	2024-11-25 12:44:02+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
