CryptoX: Compositional Reasoning Evaluation of Large Language Models

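Summary

Compositional reasoning has long been regarded as critical to the generalization and emergent intelligence of large language models (LLMs).
Yet despite the many reasoning-related benchmarks, existing benchmarks rarely study or quantify the compositional reasoning ability of LLMs.
In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptography to quantify the compositional reasoning ability of LLMs.
Building on CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation.
We run detailed experiments with CryptoBench on widely used open-source and closed-source LLMs, and the results reveal a large gap between the two groups.
We also conduct thorough mechanistic interpretability experiments to reveal the internal mechanism of compositional reasoning in LLMs, which involves decomposing a problem into subproblems, reasoning about each subproblem, and summarizing the subproblem conclusions.
Through analysis based on CryptoBench, we highlight the value of studying compositional reasoning independently and stress the need to strengthen this ability in LLMs.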
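
To make the idea of combining a benchmark question with cryptography concrete, here is a minimal sketch. It is an illustration only, not the authors' implementation: the abstract does not say which ciphers or prompt formats CryptoX uses, so the Caesar cipher, the function names (caesar_encode, encode_question, build_prompt), and the prompt wording are all assumptions. The point is that a model must first decode the encoded words and then solve the original question, which composes two reasoning steps.

```python
import random
import string


def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter by `shift` positions, preserving case.

    The Caesar cipher is an assumed stand-in; the source does not name a cipher.
    """
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # digits, punctuation, and spaces pass through
    return "".join(out)


def encode_question(question: str, num_words: int, shift: int = 3, seed: int = 0) -> str:
    """Encode `num_words` randomly chosen words of a benchmark question."""
    words = question.split()
    rng = random.Random(seed)  # seeded so the transformation is reproducible
    candidates = [i for i, w in enumerate(words) if any(c.isalpha() for c in w)]
    chosen = set(rng.sample(candidates, min(num_words, len(candidates))))
    return " ".join(
        caesar_encode(w, shift) if i in chosen else w for i, w in enumerate(words)
    )


def build_prompt(question: str, num_words: int, shift: int = 3) -> str:
    """Wrap the partially encoded question with the decoding rule (hypothetical format)."""
    encoded = encode_question(question, num_words, shift)
    return (
        f"Some words below were encoded with a Caesar cipher (shift {shift}). "
        "Decode them, then answer the question.\n"
        f"{encoded}"
    )


if __name__ == "__main__":
    print(build_prompt("What is two plus three", num_words=2))
```

In this sketch, raising num_words increases how much decoding must happen before the underlying question can be answered, so comparing accuracy at num_words = 0 with accuracy at higher values would give one rough way to separate decoding-plus-reasoning from reasoning alone.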

Abstract (original)

The compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in the existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks and cryptography to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs’ compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.

arXiv information

Authors	Jiajun Shi, Chaoren Wei, Liqun Yang, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen
Published	2025-03-12 13:17:27+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
