RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

Summary

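Tool learning has attracted wide interest as a key means of interaction between large language models (LLMs) and the physical world.
Current research mainly emphasizes the ability of LLMs to use tools in well-structured environments, while overlooking their stability when faced with the unavoidable noise of the real world.
To fill this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning.
Specifically, we establish five external environments with different levels of noise (Clean, Slight, Medium, Heavy, and Union) and analyze model resilience in depth across three critical phases: tool selection, parameter identification, and content filling.
Experiments with six widely used models highlight the urgent need to strengthen the robustness of LLMs in tool learning.
For example, GPT-4's performance drops significantly from 80.00 to 58.10 even when manual accuracy shows no substantial change.
More surprisingly, the noise correction capability inherent in the GPT family paradoxically hinders its adaptability when facing mild noise.
In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to strengthen the robustness of LLMs in tool learning.
The code and data are available at https://github.com/Junjie-Ye/RoTBench.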
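
The three phases named in the summary (tool selection, parameter identification, content filling) can be pictured as successive checks on a single tool call. The following is a minimal sketch, not the benchmark's actual scoring code: the `ToolCall` type, the `score_call` function, and the rule that a later phase only counts when the earlier ones pass are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical container: a tool name plus its keyword arguments."""
    tool: str
    arguments: dict = field(default_factory=dict)


def score_call(predicted: ToolCall, reference: ToolCall) -> dict:
    """Check one predicted call against a reference call, phase by phase.

    Assumption: phases are cumulative, so a later phase is only
    counted as correct when every earlier phase is correct.
    """
    # Phase 1: did the model pick the right tool?
    tool_selection = predicted.tool == reference.tool

    # Phase 2: did it identify exactly the right parameter names?
    parameter_identification = (
        tool_selection
        and set(predicted.arguments) == set(reference.arguments)
    )

    # Phase 3: did it fill every parameter with the right value?
    content_filling = (
        parameter_identification
        and predicted.arguments == reference.arguments
    )

    return {
        "tool_selection": tool_selection,
        "parameter_identification": parameter_identification,
        "content_filling": content_filling,
    }


# Hypothetical example: right tool and parameter names, one wrong value.
reference = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
predicted = ToolCall("get_weather", {"city": "Paris", "unit": "kelvin"})
result = score_call(predicted, reference)
```

Here `result` marks the first two phases as correct and content filling as incorrect, which is the kind of per-phase breakdown that lets a drop in performance be attributed to a specific stage.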

Summary (original)

Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs’ capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model’s resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.

arXiv information

Authors	Junjie Ye, Yilong Wu, Songyang Gao, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang
Published	2024-01-16 12:45:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
