Patched RTC: evaluating LLMs for diverse software development tasks

Abstract

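This paper introduces Patched Round-Trip Correctness (Patched RTC), a new evaluation technique for large language models (LLMs) applied to a variety of software development tasks, with a particular focus on "outer loop" activities such as bug fixing, code review, and documentation updates.
Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, providing a self-evaluating framework that measures the consistency and robustness of model responses without human intervention.
The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, and presents Patched RTC as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation.
Patched RTC is implemented in an open-source framework called patchwork, which enables transparent evaluation during inference across various patchflows.
Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks show that Patched RTC effectively distinguishes model performance and task difficulty.
The paper also explores the effect of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.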

Abstract (original)

This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on ‘outer loop’ activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.

arXiv information

Author	Asankhaya Sharma
Published	2024-07-23 15:12:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
