Noisy Self-Training with Synthetic Queries for Dense Retrieval

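Abstract

Existing neural retrieval models show promising results when training data is abundant, and their performance keeps improving as training data increases, but collecting high-quality annotated data is prohibitively costly.
To this end, we introduce a novel noisy self-training framework combined with synthetic queries, and show that neural retrievers can be improved in a self-evolving manner without relying on any external models.
Experimental results show that our method consistently improves over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (e.g., BEIR) retrieval benchmarks.
Additional analysis of low-resource settings reveals that our method is data efficient and outperforms competitive baselines with as little as 30% of the labelled training data.
Further extending the framework to reranker training demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}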

Abstract (original)

Although existing neural retrieval models reveal promising results when training data is abundant and the performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines, with as little as 30% of labelled training data. Further extending the framework for reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}
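The abstract does not spell out the training procedure, so the following is only a minimal, generic sketch of one noisy self-training step for a dense retriever, not the authors' implementation. It assumes a teacher retriever scores candidate passages for a synthetic query by dot product, turns the scores into soft pseudo-labels, and a student is trained to match them while its query embedding is perturbed with dropout-style noise. All function names, the dropout noise, and the KL objective are assumptions made for illustration.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot-product relevance score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def add_noise(vec, drop_prob, rng):
    """Dropout-style noise (assumed here); requires 0 <= drop_prob < 1."""
    scale = 1.0 - drop_prob
    return [0.0 if rng.random() < drop_prob else x / scale for x in vec]


def pseudo_labels(teacher_query, passages, temperature=1.0):
    """Teacher step: soft relevance labels over candidate passages."""
    return softmax([dot(teacher_query, p) for p in passages], temperature)


def student_loss(student_query, passages, soft_labels, drop_prob=0.1, rng=None):
    """Student step: KL(teacher || student) computed on a noised query."""
    rng = rng or random.Random(0)
    noisy_query = add_noise(student_query, drop_prob, rng)
    student_probs = softmax([dot(noisy_query, p) for p in passages])
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(soft_labels, student_probs)
        if t > 0
    )


# Hypothetical toy embeddings for one synthetic query and three passages.
passages = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = pseudo_labels([2.0, 0.0], passages)
loss = student_loss([0.0, 2.0], passages, labels, drop_prob=0.0)
```

In a real system the embeddings would come from neural encoders and the loss would be minimised by gradient descent; repeating the loop with the trained student as the next teacher gives the iterative scheme that "self-training" usually refers to.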

arXiv information

Authors	Fan Jiang, Tom Drummond, Trevor Cohn
Published	2023-11-27 06:19:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
