Data Similarity is Not Enough to Explain Language Model Performance

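Summary

Large language models achieve high performance on many, though not all, downstream tasks.
The interaction between pretraining data and task data is generally assumed to determine this variance: a task whose data is closer to a model's pretraining data is assumed to be easier for that model.
Through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks, we test whether distributional and example-specific similarity measures (embedding-based, token-based, and model-based) correlate with language model performance.
Similarity correlates with performance on multilingual datasets, but on other benchmarks we find, surprisingly, that the similarity metrics are not correlated with accuracy, or even with one another.
This suggests that the relationship between pretraining data and downstream tasks is more complex than is commonly assumed.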
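
To make the kind of test described in the summary concrete, the following is a minimal sketch of one token-based variant: score each task by the cosine similarity between its unigram distribution and that of a pretraining sample, then compute a rank correlation between those scores and per-task accuracy. This is not the authors' pipeline. The function names, the whitespace tokenizer, the choice of cosine similarity and Spearman correlation, and all toy texts and accuracy values are assumptions made purely for illustration.

```python
from collections import Counter
import math


def unigram_dist(texts):
    """Whitespace-token unigram distribution (an assumed, simplistic tokenizer)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy inputs: hypothetical placeholders, NOT data or results from the paper.
pretraining_sample = ["the cat sat on the mat", "language models predict the next token"]
tasks = {
    "task_a": ["the cat chased the next token"],
    "task_b": ["protein folding energy landscape"],
    "task_c": ["models predict the mat"],
}
toy_accuracy = {"task_a": 0.70, "task_b": 0.65, "task_c": 0.80}

pre_dist = unigram_dist(pretraining_sample)
names = sorted(tasks)
similarity = [cosine_similarity(pre_dist, unigram_dist(tasks[n])) for n in names]
accuracy = [toy_accuracy[n] for n in names]
print("Spearman correlation (toy data):", spearman(similarity, accuracy))
```

A rank correlation is used here because it does not assume a linear relationship between similarity and accuracy. Embedding-based or model-based measures would replace `cosine_similarity` over unigram counts with a different scoring function while the correlation step stays the same.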

Summary (original)

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model’s pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.

arXiv information

Authors	Gregory Yauney, Emily Reif, David Mimno
Published	2023-11-15 14:48:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
