Performance Evaluation of Large Language Models in Statistical Programming

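Summary

The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automated statistical analysis.
However, the validity and quality of the generated code must be systematically evaluated before it is widely adopted.
Despite the growing prominence of LLMs, comprehensive evaluations of the statistical code they generate remain scarce in the literature.
This paper evaluates the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis.
The study uses a set of statistical analysis tasks that span diverse statistical topics and datasets.
Each task includes a problem description, dataset information, and human-verified SAS code.
Human experts comprehensively assess the quality of the LLM-generated SAS code on five criteria: correctness, effectiveness, readability, executability, and the accuracy of the output results.
Analysis of the rating scores shows that LLMs are useful for generating syntactically correct code, but they struggle with tasks that require deep domain understanding and may produce redundant or incorrect results.
The study offers insights into the capabilities and limitations of LLMs in statistical programming and provides guidance for future advances in AI-assisted coding systems for statistical analysis.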

Abstract (original)

The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of these generated codes need to be systematically evaluated before they can be widely adopted. Despite their growing prominence, a comprehensive evaluation of statistical code generated by LLMs remains scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.

arXiv information

Authors	Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
Published	2025-02-18 18:37:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
