Paloma: A Benchmark for Evaluating Language Model Fit

Abstract

Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains – varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma) measures LM fit to 585 text domains, ranging from nytimes.com to r/depression on Reddit. We invite submissions to our benchmark and organize results by comparability based on compliance with guidelines such as removal of benchmark contamination from pretraining. Submissions can also record parameter and training token counts, so that Pareto efficiency can be compared for performance as a function of these measures of cost. We populate our benchmark with results from 6 baselines pretrained on popular corpora. In case studies, we demonstrate analyses that are possible with Paloma, such as finding that pretraining without data beyond Common Crawl leads to inconsistent fit to many domains.
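
To make the idea of per-domain fit concrete, the following is a minimal sketch of computing perplexity separately for each domain rather than over one pooled corpus. It assumes per-token natural-log probabilities are already available from some model; the function name `perplexity_by_domain`, the record layout, and the choice to aggregate at the token level within a domain are illustrative assumptions, not the benchmark's actual code or protocol.

```python
import math
from collections import defaultdict


def perplexity_by_domain(records):
    """Return {domain: perplexity} from (domain, token_log_probs) pairs.

    Hypothetical helper: `records` is an iterable of (domain, log_probs),
    where log_probs holds natural-log probabilities, one per token.
    Perplexity is exp(mean negative log-likelihood per token), pooled
    over all tokens of a domain (an assumed aggregation).
    """
    total_nll = defaultdict(float)   # summed negative log-likelihood
    total_tokens = defaultdict(int)  # token count per domain
    for domain, log_probs in records:
        total_nll[domain] -= sum(log_probs)
        total_tokens[domain] += len(log_probs)
    return {
        domain: math.exp(total_nll[domain] / total_tokens[domain])
        for domain in total_nll
        if total_tokens[domain] > 0
    }


# Toy data: the model assigns probability 0.5 to every token in one
# domain and 0.25 to every token in the other.
records = [
    ("nytimes.com", [math.log(0.5)] * 4),
    ("r/depression", [math.log(0.25)] * 3),
    ("nytimes.com", [math.log(0.5)] * 2),
]
per_domain = perplexity_by_domain(records)
```

In this toy case the two domains come out at perplexities of roughly 2 and 4, a gap that a single pooled figure would hide.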

arXiv information

Authors	Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge
Published	2023-12-16 19:12:45+00:00
