The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers

Summary

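Evaluation of large language models (LLMs) for code has relied primarily on static benchmarks, including HumanEval (Chen et al., 2021). As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks translate into gains in programmer productivity when coding with LLMs, including time spent coding. In addition to static benchmarks, we investigate the utility of preference metrics that might be used as proxies for LLM helpfulness, such as code acceptance rates and copy rates. To do so, we introduce RealHumanEval, a web interface that measures the ability of LLMs to assist programmers through autocomplete or chat support. We conducted a user study (N=213) using RealHumanEval in which users interacted with six LLMs of varying base model performance. Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity. In contrast, we find that programmer preferences do not correlate with their actual performance, which motivates the need for better, human-centric proxy signals. We also open-source RealHumanEval to enable human-centric evaluation of new models, and provide the study data to facilitate efforts to improve code models.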

Summary (original)

Evaluation of large language models (LLMs) for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), which measure the ability of LLMs to generate complete code that passes unit tests. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks translate to gains in programmer productivity when coding with LLMs, including time spent coding. In addition to static benchmarks, we investigate the utility of preference metrics that might be used as proxies to measure LLM helpfulness, such as code acceptance or copy rates. To do so, we introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=213) using RealHumanEval in which users interacted with six LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional — a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better, human-centric proxy signals. We also open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models.

arXiv information

Authors	Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag
Published	2024-04-03 15:20:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
