Investigating the Impact of Data Selection Strategies on Language Model Performance

Abstract

Data selection is critical for improving language model performance, particularly when a training dataset must be aligned with a desired target distribution. This study examines how different data selection methods and feature types affect model performance. We evaluate whether selecting data subsets influences downstream tasks, whether n-gram features improve alignment with the target distribution, and whether embedding-based neural features provide complementary benefits. Through comparative experiments using baseline random selection methods and distribution-aligned approaches, we provide insights into the interplay between data selection strategies and model training efficacy. All code for this study is available in the GitHub repository at https://github.com/jgu13/HIR-Hybrid-Importance-Resampling-for-Language-Models.
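
The abstract does not spell out how n-gram features are used to align a training subset with a target distribution. One common way to do this is importance resampling over hashed n-gram features, and the sketch below illustrates that general idea only. It is not the authors' implementation: every function name, the bucket count, and the smoothing value are assumptions made for illustration, and the embedding-based features mentioned in the abstract are not covered.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 1 << 16  # hypothetical size of the hashed feature space


def hashed_ngram_counts(text, max_n=2):
    """Count unigrams and bigrams of a text, hashed into NUM_BUCKETS buckets."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            counts[zlib.crc32(ngram.encode("utf-8")) % NUM_BUCKETS] += 1
    return counts


def fit_bucket_log_probs(texts, smoothing=1.0):
    """Fit a smoothed bag-of-hashed-n-grams model; return per-bucket log-probabilities."""
    totals = Counter()
    for text in texts:
        totals.update(hashed_ngram_counts(text))
    denom = sum(totals.values()) + smoothing * NUM_BUCKETS
    return [math.log((totals[b] + smoothing) / denom) for b in range(NUM_BUCKETS)]


def log_importance_weight(text, target_lp, raw_lp):
    """Log-ratio of the text's likelihood under the target model versus the raw model."""
    counts = hashed_ngram_counts(text)
    return sum(c * (target_lp[b] - raw_lp[b]) for b, c in counts.items())


def select_subset(raw_texts, target_texts, k, seed=0):
    """Sample k raw texts without replacement, in proportion to importance weight."""
    rng = random.Random(seed)
    target_lp = fit_bucket_log_probs(target_texts)
    raw_lp = fit_bucket_log_probs(raw_texts)
    keyed = []
    for text in raw_texts:
        # Gumbel top-k trick: perturb each log-weight with Gumbel noise and keep
        # the k largest keys, which samples without replacement by weight.
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keyed.append((log_importance_weight(text, target_lp, raw_lp) + gumbel, text))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in keyed[:k]]
```

A random-selection baseline of the kind the abstract mentions would correspond to `random.Random(seed).sample(raw_texts, k)`, which ignores the target texts entirely. Comparing the two selections on a downstream task is the kind of experiment the abstract describes.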

arXiv information

Authors	Jiayao Gu, Liting Chen, Yihong Li
Published	2025-01-07 14:38:49+00:00
