Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

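Abstract

Real-world enterprise text-to-SQL workflows involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations ranging from data transformation to analytics.
We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases.
The databases in Spider 2.0 are sourced from real data applications, often contain more than 1,000 columns, and are stored in local or cloud database systems such as BigQuery and Snowflake.
We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases.
This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges.
Our evaluations show that a code agent framework based on o1-preview successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD.
The results on Spider 2.0 show that although language models have demonstrated remarkable performance in code generation, especially on prior text-to-SQL benchmarks, they require significant improvement to achieve adequate performance for real-world enterprise usage.
Progress on Spider 2.0 represents a crucial step towards developing intelligent, autonomous code agents for real-world enterprise settings.
The code, baseline models, and data are available at https://spider2-sql.github.io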

Abstract (original)

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation — especially in prior text-to-SQL benchmarks — they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io

arXiv information

Authors	Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu
Published	2025-03-17 16:10:45+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
