Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

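Summary

Recent advances in chain-of-thought (CoT) decoding let large language models (LLMs) generate explicit logical reasoning paths for complex problem solving.
However, studies show that these paths are not always deliberate or optimal.
The tree-of-thought (ToT) method uses tree search to explore the reasoning space extensively, finding better reasoning paths that CoT decoding may overlook.
This deliberation comes at the cost of substantially higher inference complexity.
This work demonstrates that fine-tuning an LLM on the search tree constructed by ToT lets CoT achieve comparable or better performance, avoiding that inference burden.
This is achieved through Chain of Preference Optimization (CPO), in which the LLM is fine-tuned to align each step of its CoT reasoning path with the corresponding ToT step, using the preference information inherent in the tree-search process.
Extensive experiments show that CPO significantly improves LLM performance on a variety of complex problems, including question answering, fact verification, and arithmetic reasoning.
The code is available at https://github.com/sail-sg/CPO.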
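
The step-level preference idea in the abstract can be illustrated with a small sketch. It is one plausible reading of "preference information inherent in the tree-search process", not the authors' implementation: at each depth of a ToT search tree, the thought that was kept on the final path is treated as preferred and its discarded siblings as dispreferred. The `ThoughtNode` class, its `selected` flag, and `collect_preference_pairs` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ThoughtNode:
    """One candidate reasoning step in a hypothetical ToT search tree."""
    text: str
    selected: bool = False  # assumption: True if the search kept this thought on the final path
    children: List["ThoughtNode"] = field(default_factory=list)


def collect_preference_pairs(
    root: ThoughtNode, question: str
) -> List[Tuple[str, str, str]]:
    """Return (prompt, preferred_step, dispreferred_step) triples.

    Walks down the selected path. At every depth, the selected child is
    paired against each unselected sibling, and the prompt is the question
    followed by the selected steps so far.
    """
    pairs: List[Tuple[str, str, str]] = []
    prefix = question
    node = root
    while node.children:
        chosen = [child for child in node.children if child.selected]
        if not chosen:
            break  # the search stopped here; nothing more to compare
        winner = chosen[0]
        for sibling in node.children:
            if sibling is not winner:
                pairs.append((prefix, winner.text, sibling.text))
        prefix = prefix + "\n" + winner.text
        node = winner
    return pairs


# Toy tree: two candidate thoughts per step, one kept at each depth.
example_root = ThoughtNode("", selected=True, children=[
    ThoughtNode("Step 1: A", selected=True, children=[
        ThoughtNode("Step 2: A1", selected=True),
        ThoughtNode("Step 2: A2"),
    ]),
    ThoughtNode("Step 1: B"),
])
example_pairs = collect_preference_pairs(example_root, "Q: example question")
```

Triples of this shape (prompt, chosen, rejected) are the usual input for pairwise preference objectives. Whether CPO builds its pairs exactly this way should be checked against the repository linked above.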

Abstract (original)

The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at https://github.com/sail-sg/CPO.

arXiv information

Authors	Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
Published	2024-06-13 14:07:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
