Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

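Summary

Neural machine translation models require very large amounts of data and compute.
However, not all data points contribute equally to model training and generalization.
Data pruning, which removes low-value data points, can drastically reduce the compute budget without a significant drop in model performance.
This paper proposes a new data pruning technique, Checkpoints Across Time (CAT), which leverages early model training dynamics to identify the data points most relevant to model performance.
CAT is benchmarked against several data pruning techniques, including COMET-QE, LASER, and LaBSE.
CAT outperforms these benchmarks on Indo-European languages across multiple test sets.
When applied to English-German, English-French, and English-Swahili translation tasks, CAT achieves performance comparable to training on the full dataset while pruning up to 50% of the training data.
Inspecting the data points that CAT selects shows that it tends to favor longer sentences and sentences containing unique or rare words.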

Summary (original)

Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
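
The abstract does not spell out how CAT scores data points, so the following is only a minimal sketch of the general idea of pruning by early training dynamics. It assumes, purely for illustration, that each training example has a perplexity recorded at a few early checkpoints and that examples whose perplexity varies most across those checkpoints are kept. The function name `select_by_early_dynamics`, the variance-based score, and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
import math
from statistics import pvariance


def select_by_early_dynamics(perplexities, keep_fraction=0.5):
    """Return sorted indices of the examples to keep.

    perplexities: one list per training example, holding that example's
        perplexity at each early checkpoint (assumed to be precomputed).
    keep_fraction: share of the data to retain; 0.5 mirrors the
        "up to 50%" pruning level mentioned in the abstract.

    The score used here (population variance across checkpoints) is an
    illustrative assumption, not the scoring rule confirmed by the source.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not perplexities:
        return []

    # Score each example by how much its perplexity changes early in training.
    scores = [pvariance(per_example) for per_example in perplexities]

    # Keep the highest-scoring examples, rounding the count up.
    n_keep = math.ceil(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four toy examples, three early checkpoints each (made-up numbers).
toy_perplexities = [
    [10.0, 10.0, 10.0],   # no change across checkpoints
    [50.0, 20.0, 5.0],    # large change
    [30.0, 29.0, 28.0],   # small change
    [100.0, 40.0, 10.0],  # largest change
]
kept = select_by_early_dynamics(toy_perplexities, keep_fraction=0.5)
```

With the toy numbers above, `kept` holds the indices of the two examples whose perplexity moves the most. In a real pipeline the perplexities would come from evaluating saved checkpoints on the training set, and the retained indices would define the pruned training subset.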

arXiv information

Authors	Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, Sara Hooker
Published	2024-06-21 12:30:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
