Stronger Random Baselines for In-Context Learning

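Summary

Evaluating the in-context learning classification performance of language models is challenging because of small dataset sizes, extensive prompt selection using the validation set, and intentionally difficult tasks that lead to near-random performance.
The standard random baseline (the expected accuracy of guessing labels uniformly at random) is stable when the evaluation set is used only once or when the dataset is large.
To account for the common practice of reusing validation sets and for existing small datasets, the authors propose a stronger random baseline: the expected maximum accuracy across multiple random classifiers.
When the best prompt demonstrations are chosen across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20% of the few-shot results that exceed the standard baseline do not exceed this stronger random baseline.
When held-out test sets are available, the stronger baseline is also a better predictor of held-out performance than the standard baseline, which avoids unnecessary test set evaluations.
This maximum random baseline is an easily calculated drop-in replacement for the standard baseline.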

Summary (original)

Evaluating the in-context learning classification performance of language models poses challenges due to small dataset sizes, extensive prompt-selection using the validation set, and intentionally difficult tasks that lead to near-random performance. The standard random baseline — the expected accuracy of guessing labels uniformly at random — is stable when the evaluation set is used only once or when the dataset is large. We account for the common practice of validation set reuse and existing small datasets with a stronger random baseline: the expected maximum accuracy across multiple random classifiers. When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20\% of the few-shot results that exceed the standard baseline do not exceed this stronger random baseline. When held-out test sets are available, this stronger baseline is also a better predictor of held-out performance than the standard baseline, avoiding unnecessary test set evaluations. This maximum random baseline provides an easily calculated drop-in replacement for the standard baseline.
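
As a rough illustration of the "expected maximum accuracy across multiple random classifiers", the sketch below computes that quantity under an assumption that is not spelled out in the abstract: each of the `t` random classifiers guesses independently and uniformly, so its number of correct answers on `n` examples follows a Binomial(n, 1/num_labels) distribution. The function names and parameters are hypothetical and are not taken from the authors' code. The calculation uses the standard identity P(max <= k) = F(k)^t for independent draws with CDF F.

```python
from math import comb


def binomial_cdf(n, p):
    """Return a list F where F[k] = P(X <= k) for X ~ Binomial(n, p)."""
    cdf, total = [], 0.0
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1 - p) ** (n - k)
        cdf.append(min(total, 1.0))  # guard against floating-point drift above 1
    return cdf


def max_random_baseline(n, num_labels, t):
    """Expected best accuracy among t independent uniform-random classifiers.

    Assumes n evaluation examples and num_labels equally likely labels.
    Intended for small datasets; very large n can overflow float conversion.
    """
    p = 1.0 / num_labels
    cdf = binomial_cdf(n, p)
    expected, prev = 0.0, 0.0
    for k in range(n + 1):
        cur = cdf[k] ** t  # P(max of t draws <= k)
        expected += (k / n) * (cur - prev)  # P(max == k) times accuracy k/n
        prev = cur
    return expected


if __name__ == "__main__":
    # With t = 1 this reduces to the standard random baseline 1 / num_labels.
    for t in (1, 10, 100):
        print(t, round(max_random_baseline(100, 2, t), 4))
```

With `t = 1` the function returns the standard baseline, and the value grows as more prompts (treated here as random classifiers) are tried on the same validation set, which is the effect the abstract describes.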

arXiv information

Authors	Gregory Yauney, David Mimno
Published	2024-04-19 17:30:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
