Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3

Abstract

Determining the most effective Large Language Model for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
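
The abstract names precision, recall, and F1 score as the evaluation metrics but does not spell out how they are computed. The following is a minimal sketch of the standard definitions, assuming each code sample carries a set of annotated smell labels and the model returns a set of detected labels. The function name, the label strings, and the sample data are hypothetical illustrations, not taken from the study.

```python
from typing import Set, Tuple


def precision_recall_f1(expected: Set[str], detected: Set[str]) -> Tuple[float, float, float]:
    """Compute standard precision, recall, and F1 over two sets of smell labels."""
    true_positives = len(expected & detected)
    # Precision: share of detected smells that are actually annotated.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of annotated smells that were detected.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one code sample (not data from the study).
expected = {"long_method", "god_class", "duplicate_code", "magic_number"}
detected = {"long_method", "god_class", "dead_code"}

p, r, f1 = precision_recall_f1(expected, detected)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this made-up case two of the three detected smells are correct and two of the four annotated smells are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. The same function could be applied per smell type or per category to mirror the three levels of analysis the abstract describes, though how the authors aggregate scores is not stated here.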

arXiv information

Authors	Ahmed R. Sadik, Siddhata Govind
Published	2025-04-22 16:44:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
