Design choices made by LLM-based test generators prevent them from finding bugs

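Summary

Research and commercial tools that use large language models (LLMs) for automated test case generation are growing in number.
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs, or whether they unintentionally validate faulty code.
Given that bugs are exposed only by failing test cases, it asks: can these tools truly achieve the intended objectives of software testing when their test oracles are designed to pass?
Evaluating the tools on real, human-written buggy code shows that LLM-generated tests can fail to detect bugs and, more alarmingly, that the tools' design can make matters worse by validating bugs in the generated test suite and rejecting bug-revealing tests.
These findings raise important questions about the validity of the design behind LLM-based test generation tools and about their impact on software quality and test suite reliability.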

Abstract (original)

There is an increasing amount of research and commercial tools for automated test case generation using Large Language Models (LLMs). This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code. Considering bugs are only exposed by failing test cases, we explore the question: can these tools truly achieve the intended objectives of software testing when their test oracles are designed to pass? Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests. These findings raise important questions about the validity of the design behind LLM-based test generation tools and their impact on software quality and test suite reliability.
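
The concern in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `is_leap_year`, `generate_passing_test`, and `keep_if_passing` are hypothetical names, and the code does not reproduce how Codium CoverAgent or CoverUp actually work. It only shows how an oracle taken from observed behavior, combined with a filter that keeps passing tests, can validate a bug and discard the test that would reveal it.

```python
def is_leap_year(year: int) -> bool:
    """Deliberately buggy: ignores the century rules of the Gregorian calendar."""
    return year % 4 == 0


def generate_passing_test(func, arg):
    """Hypothetical generator: uses the observed output as the expected value."""
    observed = func(arg)

    def test():
        # The oracle mirrors current behavior, so it passes by construction.
        assert func(arg) == observed

    return test


def keep_if_passing(test) -> bool:
    """Hypothetical filter: keep a test only if it passes on the current code."""
    try:
        test()
        return True
    except AssertionError:
        return False


def bug_revealing_test():
    # 1900 is not a leap year, so this assertion fails on the buggy code.
    assert is_leap_year(1900) is False


# The generated test asserts is_leap_year(1900) == True, which encodes the bug.
validating_test = generate_passing_test(is_leap_year, 1900)

kept_validating = keep_if_passing(validating_test)
kept_revealing = keep_if_passing(bug_revealing_test)
```

Under these assumptions, the test that encodes the bug is kept and the test that exposes it is dropped, which is the failure mode the abstract describes.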

arXiv information

Authors	Noble Saji Mathews, Meiyappan Nagappan
Published	2024-12-18 18:33:26+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
