The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

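Summary

Following multiple instructions is an important capability for large language models (LLMs).
Evaluating this capability poses significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias, where the order of the instructions affects model performance, and (iii) a lack of objectively verifiable tasks.
To address these problems, we introduce a benchmark designed to evaluate a model's ability to follow multiple instructions through sequential instruction following (SIFo) tasks.
In SIFo, whether multiple instructions were completed successfully can be verified by examining only the final instruction.
Our benchmark evaluates instruction following with four tasks (text modification, question answering, mathematics, and security rule following), each of which assesses a different aspect of sequential instruction following.
Our evaluation of popular closed-source and open-source LLMs shows that more recent, larger models significantly outperform older, smaller ones on the SIFo tasks, which validates the benchmark's effectiveness.
All models struggle to follow sequences of instructions, which hints at an important lack of robustness in today's language models.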

Abstract (original)

Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models’ abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark’s effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today’s language models.
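
The idea that a whole chain of instructions can be verified from the final step alone can be illustrated with a toy example. The sketch below is not the benchmark's code or data: the three instructions, the sample sentence, and the helper names `apply_in_sequence` and `is_correct` are all hypothetical, chosen only to show why an error in an early step surfaces in the last answer when each instruction operates on the output of the previous one.

```python
from typing import Callable, List

# Hypothetical toy instructions (not taken from the SIFo benchmark).
# Each instruction operates on the output of the previous one.
def replace_cat_with_dog(text: str) -> str:
    return text.replace("cat", "dog")

def append_quickly(text: str) -> str:
    return text + " quickly"

def uppercase_dog(text: str) -> str:
    # Only has an effect if the first instruction was carried out.
    return text.replace("dog", "DOG")

def apply_in_sequence(text: str, steps: List[Callable[[str], str]]) -> str:
    """Apply every instruction in order and return the final text."""
    for step in steps:
        text = step(text)
    return text

def is_correct(final_answer: str, source: str,
               steps: List[Callable[[str], str]]) -> bool:
    """Check only the final answer against the reference chain."""
    return final_answer.strip() == apply_in_sequence(source, steps)

SOURCE = "the cat runs"
STEPS = [replace_cat_with_dog, append_quickly, uppercase_dog]

# Reference result of following all three instructions in order.
reference = apply_in_sequence(SOURCE, STEPS)

# A response that skipped the first instruction fails the final check,
# even though only the last output is inspected.
skipped_first = apply_in_sequence(SOURCE, STEPS[1:])

print(is_correct(reference, SOURCE, STEPS))
print(is_correct(skipped_first, SOURCE, STEPS))
```

Because the third instruction depends on the result of the first, a response that skips or mishandles an earlier step cannot reach the reference answer, so a single comparison at the end is enough in this toy setting.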

arXiv information

Authors	Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
Published	2024-06-28 15:34:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
