EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

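Abstract

As language models master existing reasoning benchmarks, new challenges are needed to evaluate their cognitive frontiers.
Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, which makes them a unique testbed for evaluating frontier language models.
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes a model's ability to perform implicit knowledge synthesis and multi-step deductive reasoning.
Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information in order to uncover solution paths.
The benchmark comprises 1184 puzzles of varying complexity, each typically requiring teams of skilled solvers hours to days to complete, with unambiguous, verifiable solutions that enable efficient evaluation.
State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than on other difficult benchmarks such as Humanity's Last Exam, revealing models' shortcomings when challenged with problems that require unstructured and lateral reasoning.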

Abstract (original)

As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models’ ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity — each typically requiring teams of skilled solvers hours to days to complete — with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity’s Last Exam, unveiling models’ shortcomings when challenged with problems requiring unstructured and lateral reasoning.

arXiv information

Authors	Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, Dan Hendrycks
Published	2025-02-14 16:40:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
