Competition-Level Problems Are Effective Evaluators of LLMs

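Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities, but there is ongoing debate about these capabilities and about potential data contamination. This paper aims to evaluate the reasoning ability of LLMs, specifically on recent competition-level programming problems from Codeforces. We first comprehensively evaluate GPT-4's zero-shot performance from several perspectives, including problem release time, difficulty, and error type. Surprisingly, GPT-4's zero-shot performance falls off like a cliff on problems released after September 2021, consistently across all difficulty levels and problem types, which points to potential data contamination and to the difficulty existing LLMs have in solving unseen complex reasoning problems. We further examine several approaches, including fine-tuning, Chain-of-Thought prompting, and simplification of problem descriptions, but none of them consistently mitigates the problem. Through this work, we emphasize the importance of this excellent data source for assessing the genuine reasoning ability of LLMs, and we hope to encourage the development of LLMs with stronger reasoning and better generalization.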

Abstract (original)

Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and about the potential data contamination problem. This paper aims to evaluate the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems in Codeforces, which are expert-crafted and unique, requiring deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4’s perceived zero-shot performance on this task, considering various aspects such as problems’ release time, difficulties, and types of errors encountered. Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems, which shows the potential data contamination as well as the challenges for any existing LLM in solving unseen complex reasoning problems. We further explore various approaches such as fine-tuning, Chain-of-Thought prompting, and problem description simplification; unfortunately, none of them is able to consistently mitigate the challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and aim to foster the development of LLMs with stronger reasoning abilities and better generalization in the future.

arXiv information

Authors	Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, Weizhu Chen
Published	2023-12-04 18:58:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
