How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

Summary

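We investigate the predictability of large language model (LLM) capabilities. Given records of past experiments that vary the model family, number of parameters, task, and number of in-context examples, can we accurately predict LLM performance on a new experiment configuration?
Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation).
We study the performance prediction problem using experiment records from BIG-bench.
On a random train-test split, an MLP-based predictor achieves an $R^2$ score above 95%, indicating that the experiment records contain learnable patterns.
We then formulate the problem of searching for "small-bench," an informative subset of BIG-bench tasks from which performance on the full set can be maximally recovered.
We find a subset that is as informative as BIG-bench Hard for evaluating new model families while being $3\times$ smaller.
In addition, we find competitive subsets by clustering the task representations learned by the MLP-based predictor and selecting tasks close to the cluster centroids, which highlights the importance of task diversity when constructing "small-bench."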
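
The centroid-based selection described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which clustering algorithm is used, so plain k-means with a deterministic farthest-first initialization stands in for it, and the task names and 2-D vectors are toy values standing in for the task representations learned by the MLP-based predictor.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    centroids = farthest_first_init(points, k)
    assignments = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: assignments did not change
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def select_small_bench(task_embeddings, k):
    """Pick, for each cluster, the task whose embedding is closest to the centroid."""
    names = sorted(task_embeddings)
    points = [task_embeddings[n] for n in names]
    centroids, assignments = kmeans(points, k)
    selected = []
    for c in range(k):
        members = [n for n, a in zip(names, assignments) if a == c]
        if members:
            selected.append(min(
                members,
                key=lambda n: squared_distance(task_embeddings[n], centroids[c])))
    return sorted(selected)


# Hypothetical toy embeddings: three well-separated groups of three tasks.
toy_embeddings = {
    "task_a1": (0, 0), "task_a2": (1, 0), "task_a3": (-1, 0),
    "task_b1": (10, 10), "task_b2": (11, 10), "task_b3": (9, 10),
    "task_c1": (20, 0), "task_c2": (21, 0), "task_c3": (19, 0),
}
print(select_small_bench(toy_embeddings, k=3))
```

Because one task is taken from each cluster, the selected subset covers distinct regions of the representation space, which is one way to read the abstract's point about task diversity.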

Abstract (original)

We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM performance on new experiment configurations? Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation). We study the performance prediction problem on experiment records from BIG-bench. On a random train-test split, an MLP-based predictor achieves an $R^2$ score greater than 95%, indicating the presence of learnable patterns within the experiment records. We then formulate the problem of searching for ‘small-bench,’ an informative subset of BIG-bench tasks from which the performance on the full set can be maximally recovered. We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller. Additionally, we find competitive subsets by clustering task representations learned by our MLP-based predictor and selecting tasks close to cluster centroids, highlighting the importance of task diversity in constructing ‘small-bench.’

arXiv information

Authors	Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia
Published	2023-10-31 17:27:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
