CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

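Summary

The ability of large language models (LLMs) to critique and refine their reasoning is crucial for their use in evaluation, feedback provision, and self-improvement.
This paper introduces CriticBench, a comprehensive benchmark designed to assess how well LLMs critique and correct their reasoning across a variety of tasks.
CriticBench covers five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic.
It compiles 15 datasets and incorporates responses from three LLM families.
Using CriticBench, we evaluate and analyze the performance of 17 LLMs in generation, critique, and correction reasoning, that is, GQC reasoning.
Our findings reveal: (1) a linear relationship among GQC capabilities, with critique-focused training markedly improving performance;
(2) task-dependent variation in correction effectiveness, with logic-oriented tasks being easier to correct;
(3) inconsistencies in GQC knowledge that decrease as model size increases; and
(4) an intriguing inter-model critiquing dynamic, in which stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in self-critique.
We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research on LLM critique and self-improvement.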

Summary (original)

The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs’ abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
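
To make the three GQC quantities concrete, the following is a minimal sketch of how generation, critique, and correction scores could be tallied for one model. The abstract does not state the exact metrics or data format, so everything here is an assumption made for illustration: the `Record` fields, the use of plain accuracy for generation and correction, and the use of F1 on flagging incorrect responses for critique.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One benchmark item. Every field name here is hypothetical."""
    generation_correct: bool     # model's own answer matched the reference
    response_is_correct: bool    # gold label of the response being critiqued
    critique_says_correct: bool  # model's verdict on that response
    correction_correct: bool     # model's revised answer matched the reference


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def critique_f1(records):
    """F1 for flagging incorrect responses (the metric choice is an assumption)."""
    tp = sum(1 for r in records
             if not r.response_is_correct and not r.critique_says_correct)
    fp = sum(1 for r in records
             if r.response_is_correct and not r.critique_says_correct)
    fn = sum(1 for r in records
             if not r.response_is_correct and r.critique_says_correct)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def gqc_scores(records):
    """Return one score per GQC stage for a list of records."""
    return {
        "generation": accuracy(r.generation_correct for r in records),
        "critique": critique_f1(records),
        "correction": accuracy(r.correction_correct for r in records),
    }


# Toy data, invented purely to exercise the functions above.
records = [
    Record(True, True, True, True),
    Record(False, False, False, True),
    Record(False, False, True, False),
    Record(True, True, False, True),
]
scores = gqc_scores(records)
print(scores)
```

Computing such a triple for several models would allow the generation, critique, and correction scores to be compared against one another, which is the kind of relationship the first finding describes.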

arXiv information

Authors	Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang
Published	2024-03-08 15:15:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
