S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

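Abstract

We introduce S1-Bench, a new benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive System 1 thinking over deliberative System 2 reasoning.
LRMs have achieved major breakthroughs on complex reasoning tasks through explicit chains of thought, but their reliance on deep analytical thinking may limit their System 1 thinking capabilities.
Moreover, no benchmark currently exists to evaluate LRMs' performance on tasks that require such capabilities.
To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions spanning multiple domains and languages, designed specifically to assess LRMs' performance on such tasks.
A comprehensive evaluation of 22 LRMs reveals markedly lower efficiency, with outputs averaging 15.5 times longer than those of traditional small LLMs.
In addition, LRMs often identify the correct answer early but continue to deliberate unnecessarily, and some models even produce numerous errors.
These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development still needed to achieve balanced dual-system thinking that adapts appropriately to task complexity.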

Abstract (original)

We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models’ (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 thinking capabilities. Moreover, a lack of benchmark currently exists to evaluate LRMs’ performance in tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs’ performance in such tasks. Our comprehensive evaluation of 22 LRMs reveals significant lower efficiency tendencies, with outputs averaging 15.5 times longer than those of traditional small LLMs. Additionally, LRMs often identify correct answers early but continue unnecessary deliberation, with some models even producing numerous errors. These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development needed to achieve balanced dual-system thinking capabilities that can adapt appropriately to task complexity.

arXiv information

Authors	Wenyuan Zhang,Shuaiyi Nie,Xinghua Zhang,Zefeng Zhang,Tingwen Liu
Published	2025-04-14 16:13:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
