Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

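Summary

Recent advances in preference optimization have shown significant potential for improving the mathematical reasoning capabilities of large language models (LLMs).
Current approaches build high-quality pairwise preference data using outcome-based criteria such as answer correctness or consistency, but they fundamentally neglect the internal logical coherence of responses.
To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: surface-level answer correctness and intrinsic token-level probability consistency across responses.
Extensive experiments show that PCPO consistently outperforms existing outcome-only approaches across a diverse range of LLMs and benchmarks.
Our code is publicly available at https://github.com/YunqiaoYang/PCPO.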

Summary (original)

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.

arXiv information

Authors	Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Published	2025-05-29 15:20:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
