How well do LLMs reason over tabular data, really?

Summary

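Large language models (LLMs) excel at natural language tasks, but little is known about how well they reason over tabular data.
Prior analyses rely on evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries.
Moreover, our understanding of how robust LLMs are to realistic variations in tabular inputs is limited.
We therefore ask whether general-purpose LLMs can really reason over tabular data, and focus on two questions: 1) are the tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can an LLM's performance on analytical tabular queries be evaluated realistically?
Building on a recent tabular reasoning benchmark, we first expose the shortcomings of its multiple-choice prompt evaluation strategy and of commonly used free-form text metrics such as SacreBleu and BERT-score.
We show that an LLM-as-a-judge procedure yields more reliable performance insights and reveals a significant deficit in the tabular reasoning performance of LLMs.
Next, we extend the tabular inputs to reflect three characteristics that are common in practice: 1) missing values, 2) duplicate entities, and 3) structural variations.
Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer under these variations, underscoring the importance of improving their robustness to realistic tabular inputs.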

Abstract (original)

Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM’s realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM’s performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
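
The three input characteristics named in the abstract (missing values, duplicate entities, structural variations) can be illustrated with a minimal sketch. This is not the authors' perturbation procedure: the function names, the toy table, and the choice of column reordering as the structural variation are all assumptions made here for illustration.

```python
import random

def add_missing_values(rows, fraction, rng):
    """Hypothetical helper: set roughly `fraction` of all cells to None."""
    out = [dict(r) for r in rows]  # copy so the input table is not mutated
    cells = [(i, k) for i, r in enumerate(out) for k in r]
    for i, k in rng.sample(cells, int(len(cells) * fraction)):
        out[i][k] = None
    return out

def add_duplicate_entities(rows, n_dupes, rng):
    """Hypothetical helper: append copies of random rows, then shuffle row order."""
    out = [dict(r) for r in rows]
    out += [dict(rng.choice(rows)) for _ in range(n_dupes)]
    rng.shuffle(out)
    return out

def shuffle_columns(rows, rng):
    """Assumed structural variation: reorder columns, keeping row values intact."""
    cols = list(rows[0])
    rng.shuffle(cols)
    return [{c: r[c] for c in cols} for r in rows]

rng = random.Random(0)  # fixed seed so the perturbations are repeatable
table = [  # toy data, invented for this sketch
    {"city": "Amsterdam", "year": 2020, "sales": 120},
    {"city": "Berlin", "year": 2020, "sales": 95},
    {"city": "Paris", "year": 2021, "sales": 143},
]
with_missing = add_missing_values(table, 0.2, rng)
with_dupes = add_duplicate_entities(table, 2, rng)
reordered = shuffle_columns(table, rng)
```

Each perturbed table could then be serialized into a prompt and the model's answer compared against the answer on the unperturbed table, which is one way to probe the robustness the abstract refers to.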

arXiv information

Authors	Cornelius Wolff, Madelon Hulsebos
Published	2025-06-02 15:39:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
