Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning

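Summary

Large language models (LLMs) have shown increasing proficiency in solving mathematical reasoning problems.
However, many current open-source LLMs still frequently make calculation errors and semantic understanding errors in their intermediate reasoning steps.
This work proposes PROVE, a simple yet effective framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before the final answers are aggregated.
Instead of relying on vanilla majority voting, the approach rejects solutions whose corresponding program output is inconsistent with the generated solution, and aggregates only those validated by Python programs.
The authors ran extensive experiments on 13 open-source LLMs from various model families and sizes, ranging from 0.5B to 13B parameters, across seven math benchmarks.
They demonstrate that PROVE consistently outperforms vanilla majority voting as a heuristic for solving mathematical reasoning tasks across all datasets and model sizes.
Notably, PROVE raises accuracy on the GSM8K benchmark from 48.85% to 53.83% for Qwen2-0.5B-Instruct, from 65.66% to 73.01% for Llama-3.2-1B-Instruct, from 73.39% to 79.61% for Gemma-2-2b-it, and from 41.32% to 59.51% for Llama-2-7B-chat.
The code is available at https://github.com/declare-lab/prove.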
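
The filtering-then-voting idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from the linked repository: the names `majority_vote`, `verified_vote`, and `Candidate` are hypothetical, answers are compared as plain strings, and the fallback to ordinary majority voting when no candidate passes verification is an assumption made here so the function always returns something.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Hypothetical representation: each candidate pairs the final answer of a
# natural-language solution with the output of a Python program derived from
# that solution (None if the program failed to run or produced nothing).
Candidate = Tuple[str, Optional[str]]


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most common answer, or None if there are no answers."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def verified_vote(candidates: Sequence[Candidate]) -> Optional[str]:
    """Vote only over candidates whose program output matches their answer."""
    verified = [
        answer
        for answer, program_output in candidates
        if program_output is not None and program_output == answer
    ]
    if verified:
        return majority_vote(verified)
    # Assumption: if nothing is verified, fall back to plain majority voting.
    return majority_vote([answer for answer, _ in candidates])


# Toy data: five sampled solutions for one problem.
candidates = [
    ("18", "18"),   # program agrees with the solution
    ("20", "18"),   # program disagrees, so the solution is rejected
    ("20", None),   # program failed, so the solution is rejected
    ("20", "22"),   # program disagrees, so the solution is rejected
    ("18", "18"),   # program agrees with the solution
]

print(majority_vote([answer for answer, _ in candidates]))  # 20
print(verified_vote(candidates))                            # 18
```

In this toy data, plain majority voting picks "20" because three of the five samples agree on it, while the verified vote picks "18" because only those two samples are backed by a matching program output. A practical version would likely compare answers numerically with a tolerance rather than as strings.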

Abstract (original)

Large language models (LLMs) have shown increasing proficiency in solving mathematical reasoning problems. However, many current open-source LLMs often still make calculation and semantic understanding errors in their intermediate reasoning steps. In this work, we propose PROVE, a simple yet effective framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before aggregating the final answers. Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution, aggregating only those validated by Python programs. We conducted extensive experiments on 13 open-source LLMs from various model families and sizes, ranging from 0.5B to 13B parameters, across seven math benchmarks. We demonstrate that PROVE consistently outperforms vanilla majority voting as a heuristic for solving mathematical reasoning tasks across all datasets and model sizes. Notably, PROVE increases accuracy on the GSM8K benchmark from 48.85% to 53.83% for Qwen2-0.5B-Instruct, from 65.66% to 73.01% for Llama-3.2-1B-Instruct, from 73.39% to 79.61% for Gemma-2-2b-it, and from 41.32% to 59.51% for Llama-2-7B-chat. Our codes are available at https://github.com/declare-lab/prove.

arXiv information

Authors	Vernon Y. H. Toh, Deepanway Ghosal, Soujanya Poria
Published	2024-10-16 14:24:55+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
