Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

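Summary

Quantitative reasoning is an essential skill for analyzing data, yet evaluation of this ability remains limited.
To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, which evaluates the capability of large language models in statistical and causal reasoning with real-world data.
The benchmark consists of a carefully constructed dataset of 411 questions, each accompanied by data sheets drawn from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data versus text, we supplement the benchmark with QRText, an auxiliary set of 290 text-only questions.
We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods, including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants, on a range of models.
The strongest model, GPT-4, achieves an accuracy of 58%, leaving substantial room for improvement.
Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, achieves the highest accuracy at 37%.
Our analysis shows that models have difficulty with data analysis and causal reasoning, and struggle to use causal knowledge and the provided data simultaneously.
Code and data are available at https://github.com/xxxiaol/QRData.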

Summary (original)

Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models’ capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models’ quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which has a large room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle in using causal knowledge and provided data simultaneously. Code and data are in https://github.com/xxxiaol/QRData.

arXiv information

Authors	Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, Yansong Feng
Published	2024-02-27 16:15:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
