Are Large Language Models Memorizing Bug Benchmarks?

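Summary

Large language models (LLMs) have become integral to a range of software engineering tasks, including code generation, bug detection, and repair.
To evaluate model performance in these domains, numerous bug benchmarks containing real bugs from software projects have been developed.
However, there is growing concern within the software engineering community that, because of the risk of data leakage, these benchmarks may not reliably reflect true LLM performance.
Despite this concern, little research has been conducted to quantify the impact of potential leakage.
This paper systematically evaluates popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks.
To identify potential leakage, it uses multiple metrics, including a study of benchmark membership in commonly used training datasets and analyses of negative log-likelihood and n-gram accuracy.
The findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization on widely used benchmarks such as Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, show limited signs of leakage.
These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess model capabilities.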
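
The two model-side metrics named in the summary, n-gram accuracy and negative log-likelihood, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `predict_next` callback (a stand-in for a real model's greedy decoding), the evenly spaced start positions, and the exact-match criterion are all hypothetical choices made for the example.

```python
import math
from typing import Callable, List, Sequence


def ngram_accuracy(
    tokens: Sequence[str],
    predict_next: Callable[[Sequence[str], int], Sequence[str]],
    n: int = 5,
    num_starts: int = 4,
) -> float:
    """Fraction of sampled positions where the model reproduces the next n tokens exactly.

    `predict_next(prefix, n)` is a hypothetical stand-in for a model call that
    returns the n tokens it would generate after `prefix`.
    """
    if len(tokens) <= n:
        raise ValueError("need more than n tokens to leave room for a prefix")
    # Evenly spaced start positions (an assumption); each leaves at least one
    # token of context before it and n tokens to predict after it.
    last_start = len(tokens) - n
    step_count = max(num_starts - 1, 1)
    starts = sorted({1 + i * (last_start - 1) // step_count for i in range(num_starts)})
    hits = 0
    for s in starts:
        predicted = list(predict_next(tokens[:s], n))
        if predicted == list(tokens[s:s + n]):
            hits += 1
    return hits / len(starts)


def mean_negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Average of -log p over the probabilities a model assigned to each true token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Toy usage: a "model" that has memorized the sample versus one that has not.
sample: List[str] = list("abcdefghij")


def memorizing_model(prefix: Sequence[str], n: int) -> Sequence[str]:
    return sample[len(prefix):len(prefix) + n]


def clueless_model(prefix: Sequence[str], n: int) -> Sequence[str]:
    return ["?"] * n


memorized_score = ngram_accuracy(sample, memorizing_model, n=3)
clueless_score = ngram_accuracy(sample, clueless_model, n=3)
```

Under this reading, a high n-gram accuracy together with a low mean negative log-likelihood on benchmark files would be consistent with memorization, although neither number on its own proves that a benchmark was in the training data.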

Summary (original)

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like LLaMa 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.

arXiv information

Authors	Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, Claire Le Goues
Published	2024-11-20 13:46:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
