FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark

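Abstract

Text-to-SQL technology has become crucial across industries for translating natural language into SQL queries, letting non-technical users perform complex data operations.
As these systems grow more sophisticated, the need for accurate evaluation methods has increased.
However, we found that Execution Accuracy (EX), the most promising evaluation metric, still yields a substantial share of false positives and false negatives when compared with human evaluation.
This paper therefore introduces FLEX (False-Less EXecution), a new approach that evaluates text-to-SQL systems by using large language models (LLMs) to emulate human expert-level evaluation of SQL queries.
Our method shows significantly higher agreement with human expert judgments, improving Cohen's kappa from 61 to 78.17.
Re-evaluating the top-performing models on the Spider and BIRD benchmarks with FLEX reveals substantial shifts in the performance rankings: correcting false positives lowers average performance by 3.15, while addressing false negatives raises it by 6.07.
This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems and may reshape our understanding of state-of-the-art performance in this field.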
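To make the false-positive problem concrete, here is a minimal sketch of an execution-based comparison, assuming a simplified definition in which two queries "match" when they return the same rows on one database instance. The `execution_match` helper, the `singer` table, and both queries are hypothetical illustrations, not taken from the paper or from any benchmark's evaluation script.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Simplified EX check (an assumption, not a benchmark's exact script):
    the queries match if they return the same rows, ignoring row order."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 30), ("Bob", 41), ("Cy", 25)],
)

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE age > 40"  # wrong threshold

# On this data both queries return only Bob, so the wrong query passes.
false_positive = execution_match(conn, pred, gold)

# One extra row exposes the difference between the two queries.
conn.execute("INSERT INTO singer VALUES ('Dee', 35)")
after_new_row = execution_match(conn, pred, gold)

print(false_positive, after_new_row)
```

The predicted query is semantically wrong, yet it passes until the data happens to contain a row that separates the two conditions. A check that depends only on the current database contents can therefore accept incorrect queries.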

Abstract (original)

Text-to-SQL technology has become crucial for translating natural language into SQL queries in various industries, enabling non-technical users to perform complex data operations. The need for accurate evaluation methods has increased as these systems have grown more sophisticated. However, we found that the Execution Accuracy (EX), the most promising evaluation metric, still shows a substantial portion of false positives and negatives compared to human evaluation. Thus, this paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems using large language models (LLMs) to emulate human expert-level evaluation of SQL queries. Our method shows significantly higher agreement with human expert judgments, improving Cohen’s kappa from 61 to 78.17. Re-evaluating top-performing models on the Spider and BIRD benchmarks using FLEX reveals substantial shifts in performance rankings, with an average performance decrease of 3.15 due to false positive corrections and an increase of 6.07 from addressing false negatives. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
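Agreement with human judgments is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that formula. The `cohen_kappa` helper and the two label lists are invented for illustration and are not data from the paper.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = query judged correct, 0 = judged incorrect).
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
metric = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

kappa = cohen_kappa(human, metric)
print(round(kappa * 100, 2))
```

The final line multiplies by 100 on the assumption that the abstract's figures (61 and 78.17) are kappa values on a percentage scale.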

arXiv information

Authors	Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, Hyunsouk Cho
Published	2024-10-01 05:55:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
