PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion

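Summary

As reliance on large language models (LLMs) to carry out user instructions grows, a comprehensive understanding of their robustness when completing complex tasks in real-world situations is needed.
To address this need, we propose the PowerPoint Task Completion Robustness benchmark (PPTC-R), which measures the robustness of LLMs to the user's PPT task instructions and to the software version.
Specifically, we construct adversarial user instructions by attacking the original instructions at the sentence, semantic, and multi-language levels.
To assess the robustness of language models to software versions, we vary the number of provided APIs to simulate both the newest-version and earlier-version settings.
We then test 3 closed-source and 4 open-source LLMs on a benchmark that incorporates these robustness settings, to evaluate how such deviations affect the API calls the LLMs make to complete tasks.
On our benchmark, GPT-4 shows the highest performance and strong robustness, particularly in the version-update and multilingual settings.
However, all LLMs lose their robustness when confronted with multiple challenges (e.g., multi-turn) simultaneously, which leads to significant performance drops.
We further analyze the robustness behavior and error causes of LLMs on the benchmark, which gives researchers insight into LLM robustness in task completion and into developing more robust LLMs and agents.
The code and data are released at https://github.com/ZekaiGalaxy/PPTCR.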
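
The software-version setting described above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the API names, the helper functions `earlier_version`, `newer_version`, and `build_prompt`, and the prompt layout are all hypothetical, and it assumes that an earlier version is simulated by withholding some APIs and a newer version by adding some.

```python
import random

# Hypothetical API names for illustration only; not the benchmark's real API list.
BASE_APIS = ["create_slide", "insert_text", "set_font_size",
             "insert_picture", "delete_slide"]
EXTRA_APIS = ["insert_video", "add_animation"]


def earlier_version(apis, n_removed, seed=0):
    """Simulate an earlier software version by withholding n_removed APIs."""
    if not 0 <= n_removed <= len(apis):
        raise ValueError("n_removed must be between 0 and len(apis)")
    rng = random.Random(seed)  # seeded so the same subset is withheld each run
    removed = set(rng.sample(apis, n_removed))
    return [api for api in apis if api not in removed]


def newer_version(apis, extra_apis):
    """Simulate a newer software version by adding APIs not already present."""
    return apis + [api for api in extra_apis if api not in apis]


def build_prompt(instruction, apis):
    """Assemble a prompt that lists the available APIs before the instruction."""
    api_lines = "\n".join(f"- {api}" for api in apis)
    return f"Available APIs:\n{api_lines}\n\nInstruction: {instruction}"


if __name__ == "__main__":
    old_apis = earlier_version(BASE_APIS, n_removed=2)
    new_apis = newer_version(BASE_APIS, EXTRA_APIS)
    print(build_prompt("Add a title slide.", old_apis))
    print(len(BASE_APIS), len(old_apis), len(new_apis))
```

Keeping the instruction fixed while only the API list changes isolates the effect of the version setting from the effect of instruction-level perturbations.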

Summary (original)

The growing dependence on Large Language Models (LLMs) for finishing user instructions necessitates a comprehensive understanding of their robustness to complex task completion in real-world situations. To address this critical need, we propose the PowerPoint Task Completion Robustness benchmark (PPTC-R) to measure LLMs’ robustness to the user PPT task instruction and software version. Specifically, we construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. To assess the robustness of Language Models to software versions, we vary the number of provided APIs to simulate both the newest version and earlier version settings. Subsequently, we test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates these robustness settings, aiming to evaluate how deviations impact LLMs’ API calls for task completion. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark, particularly in the version update and the multilingual settings. However, we find that all LLMs lose their robustness when confronted with multiple challenges (e.g., multi-turn) simultaneously, leading to significant performance drops. We further analyze the robustness behavior and error reasons of LLMs in our benchmark, which provide valuable insights for researchers to understand the LLM’s robustness in task completion and develop more robust LLMs and agents. We release the code and data at https://github.com/ZekaiGalaxy/PPTCR.

arXiv information

Authors	Zekai Zhang, Yiduo Guo, Yaobo Liang, Dongyan Zhao, Nan Duan
Published	2024-03-06 15:33:32+00:00

Source and services used

arxiv.jp, Google
