Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

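Summary

Code large language models (Code LLMs) are increasingly deployed in real-world applications, so evaluating them is critical.
Conventional accuracy measures a Code LLM's performance on a set of individual tasks, but it overlooks the model's self-consistency across different tasks.
Intuitively, a trustworthy model should be self-consistent both when it generates a natural language specification for its own code and when it generates code for its own specification.
A failure to preserve self-consistency reveals a lack of understanding of the semantics shared by natural language and programming language, and therefore undermines the model's trustworthiness.
In this paper, we first formally define the self-consistency of Code LLMs and then design IdentityChain, a framework that effectively and efficiently evaluates a model's self-consistency and conventional accuracy at the same time.
We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed an aspect distinct from conventional accuracy.
Furthermore, by demonstrating three major weaknesses that we identify in current models using IdentityChain, we show that it can serve as a model debugging tool for exposing the weaknesses of Code LLMs.
Our code is available at https://github.com/marcusm117/IdentityChain.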

Abstract (original)

Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
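The round trip described in the abstract (code to specification and back to code) can be illustrated with a minimal sketch. This is not the IdentityChain implementation or its API. The names `self_consistent_steps`, `toy_to_code`, `consistent_to_spec`, and `drifting_to_spec` are hypothetical, the "models" are hard-coded toy functions standing in for a Code LLM, and agreement of outputs on a few inputs is assumed here as a simple proxy for semantic equivalence.

```python
from typing import Callable, List, Sequence

# A "program" is modelled as a plain Python function from int to int.
Program = Callable[[int], int]


def outputs(program: Program, inputs: Sequence[int]) -> List[int]:
    """Run a program on every test input and collect the results."""
    return [program(x) for x in inputs]


def self_consistent_steps(
    spec: str,
    to_code: Callable[[str], Program],   # hypothetical: model writes code from a spec
    to_spec: Callable[[Program], str],   # hypothetical: model describes its own code
    inputs: Sequence[int],
    chain_length: int = 3,
) -> int:
    """Count how many spec -> code round trips preserve the initial behaviour."""
    program = to_code(spec)
    reference = outputs(program, inputs)
    for step in range(chain_length):
        spec = to_spec(program)      # model summarises its own code
        program = to_code(spec)      # model re-implements its own summary
        if outputs(program, inputs) != reference:
            return step              # behaviour drifted at this step
    return chain_length


# Toy stand-ins for a Code LLM (assumptions for illustration only).
def double(x: int) -> int:
    return 2 * x


def square(x: int) -> int:
    return x * x


def toy_to_code(spec: str) -> Program:
    return double if "double" in spec else square


def consistent_to_spec(program: Program) -> str:
    return "double the input" if program is double else "square the input"


def drifting_to_spec(program: Program) -> str:
    # Always mis-describes the code, so the chain breaks immediately.
    return "square the input"


TEST_INPUTS = [0, 1, 2, 3]
stable = self_consistent_steps("double the input", toy_to_code, consistent_to_spec, TEST_INPUTS)
broken = self_consistent_steps("double the input", toy_to_code, drifting_to_spec, TEST_INPUTS)
print(stable, broken)  # prints "3 0"
```

Note that the first program can be correct for the initial specification in both runs, while only one run stays consistent along the chain. This is the sense in which self-consistency is a different property from conventional accuracy.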

arXiv information

Authors	Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray
Published	2024-01-16 14:03:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
