DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

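Summary

Large language models (LLMs) and large vision-language models (LVLMs) have demonstrated strong language and vision reasoning abilities, sparking the recent trend of building agents for targeted applications such as shopping assistants and AI software engineers.
Many data science benchmarks have recently been proposed to investigate how these models perform in the data science domain.
However, because of their simplified settings, existing data science benchmarks still fall short of real-world data science applications.
To bridge this gap, the authors introduce DSBench, a comprehensive benchmark designed to evaluate data science agents on realistic tasks.
The benchmark contains 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
DSBench provides a realistic setting by covering long contexts, multimodal task backgrounds, reasoning over large data files and multi-table structures, and end-to-end execution of data modeling tasks.
An evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks: the best agent solves only 34.12% of the data analysis tasks and achieves a Relative Performance Gap (RPG) of 34.74%.
These findings underscore the need for further progress toward more practical, intelligent, and autonomous data science agents.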

Abstract (original)

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
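The abstract reports a Relative Performance Gap (RPG) without defining it. As a rough illustration only, the sketch below assumes that RPG averages, over tasks, the agent's score normalized between a baseline score and the best known score, clipped at zero. This formula and the names `relative_performance_gap`, `agent`, `baseline`, and `best` are assumptions made for illustration, not taken from the text above; the paper's exact definition may differ.

```python
from typing import Sequence


def relative_performance_gap(
    agent: Sequence[float],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Assumed RPG: mean over tasks of max((agent - baseline) / (best - baseline), 0).

    All three sequences hold one score per task, in the same task order.
    This definition is an assumption for illustration, not confirmed by the abstract.
    """
    if not agent or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("score lists must be non-empty and of equal length")

    total = 0.0
    for p, b, g in zip(agent, baseline, best):
        if g == b:
            # Normalization is undefined when best and baseline coincide.
            raise ValueError("best and baseline scores must differ for every task")
        # Clip at zero so an agent below the baseline contributes nothing.
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent)


if __name__ == "__main__":
    # Hypothetical scores for two tasks (not data from the paper).
    print(relative_performance_gap([0.75, 0.5], [0.5, 0.5], [1.0, 1.0]))  # 0.25
```

Under this assumed definition, a value of 1.0 would mean the agent matches the best known score on every task, and 0.0 would mean it never beats the baseline.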

arXiv information

Authors	Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu
Published	2024-09-12 02:08:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
