Evaluating Large Language Models for Generalization and Robustness via Data Compression

Abstract

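Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of creating benchmarks.
To address these, we propose an evaluation approach based on lossless data compression that tests how well a model's predictive ability generalizes after its training cutoff.
Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split it into a training period and a testing period according to each model's training data cutoff.
We measure: 1) compression performance on the testing period, as a measure of generalization to unseen data; and
2) the performance gap between the training and testing periods, as a measure of robustness.
Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data.
We find that the compression rate of many models deteriorates significantly after their cutoff date, while models such as Mistral and Llama-2 strike a good balance between performance and robustness.
The results also suggest that models struggle to generalize on news and code data but perform especially well on arXiv papers.
We also find that context size and tokenization implementation have a large impact on overall compression performance.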

Abstract (original)

Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address this, we propose a lossless data compression-based evaluation approach that tests how models’ predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models’ training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing period as a measure of robustness. Our experiments test 14 representative large language models with various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression rate of many models reduces significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find the context size and tokenization implementation have a big impact on the overall compression performance.
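The two measurements described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the compression rate is defined as ideal coded size divided by raw size, where the coded size is the sum of negative log2 token probabilities (the length an arithmetic coder would approach), and it assumes the robustness gap is the difference between the mean test-period and mean training-period rates. The function names, the `'YYYY-MM'` month keys, and the sample numbers are all hypothetical.

```python
import math
from statistics import mean


def compressed_bits(token_probs):
    """Ideal code length in bits for a token sequence: -sum(log2 p).

    Assumption: token_probs holds the probability the model assigned
    to each token that actually occurred in the text.
    """
    return -sum(math.log2(p) for p in token_probs)


def compression_rate(token_probs, raw_bytes):
    """Coded size divided by raw size (assumed definition).

    Under this definition a smaller value means better compression.
    """
    return compressed_bits(token_probs) / (8 * raw_bytes)


def generalization_and_robustness(rate_by_month, cutoff):
    """Split monthly rates at the model's training cutoff.

    rate_by_month maps 'YYYY-MM' strings to compression rates;
    cutoff is the last month counted as training period.
    Returns (mean test-period rate, test-minus-training gap).
    """
    train = [r for m, r in rate_by_month.items() if m <= cutoff]
    test = [r for m, r in rate_by_month.items() if m > cutoff]
    test_rate = mean(test)
    gap = test_rate - mean(train)
    return test_rate, gap


if __name__ == "__main__":
    # Made-up monthly rates for illustration only.
    rates = {"2023-01": 0.10, "2023-02": 0.10, "2023-03": 0.12, "2023-04": 0.14}
    test_rate, gap = generalization_and_robustness(rates, cutoff="2023-02")
    print(f"test-period rate: {test_rate:.3f}, gap: {gap:.3f}")
```

Comparing zero-padded `'YYYY-MM'` strings lexicographically orders the months correctly, which keeps the split free of date parsing. A real evaluation would obtain `token_probs` from a language model's next-token distribution over each month's text.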

arXiv information

Authors	Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin
Published	2024-02-01 18:56:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
