LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

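Summary

Large language models have traditionally been fine-tuned on large instruction datasets.
However, recent studies suggest that small, high-quality datasets are sufficient for general-purpose instruction following.
This lack of consensus on fine-tuning best practices is partly due to rapidly diverging approaches to LLM evaluation.
In this study, we ask whether a small number of diverse fine-tuning samples can improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation.
We fine-tune the open-source MPT-7B and MPT-30B models on instruction fine-tuning datasets of various sizes, ranging from 1,000 to 60,000 samples.
We find that subsets of 1k to 6k instruction fine-tuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation.
Finally, we show that mixing textbook-style and open-ended QA fine-tuning datasets optimizes performance on both evaluation paradigms.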

Abstract (original)

Large Language Models are traditionally finetuned on large instruction datasets. However recent studies suggest that small, high-quality datasets can suffice for general purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small amount of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks, and on open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.

arXiv information

Authors	Aditi Jha, Sam Havens, Jeremey Dohmann, Alex Trott, Jacob Portes
Published	2023-11-22 03:37:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
