German Text Embedding Clustering Benchmark

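Summary

This work introduces a benchmark that evaluates how well German text embeddings cluster across different domains.
The benchmark is motivated by the growing use of clustered neural text embeddings in tasks that require grouping texts (such as topic modeling) and by the need for German resources in existing benchmarks.
We provide an initial analysis of a range of pre-trained monolingual and multilingual models, evaluated on the outcomes of different clustering algorithms.
Strong performers are found among both the monolingual and the multilingual models.
Reducing the dimensionality of the embeddings can further improve clustering.
In addition, we run experiments with continued pre-training of German BERT models to estimate the benefit of this additional training.
Our experiments suggest that significant performance gains are possible for short texts.
All code and datasets are publicly available.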

Abstract (original)

This work introduces a benchmark assessing the performance of clustering German text embeddings in different domains. This benchmark is driven by the increasing use of clustering neural text embeddings in tasks that require the grouping of texts (such as topic modeling) and the need for German resources in existing benchmarks. We provide an initial analysis for a range of pre-trained mono- and multilingual models evaluated on the outcome of different clustering algorithms. Results include strong performing mono- and multilingual models. Reducing the dimensions of embeddings can further improve clustering. Additionally, we conduct experiments with continued pre-training for German BERT models to estimate the benefits of this additional training. Our experiments suggest that significant performance improvements are possible for short text. All code and datasets are publicly available.
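
The evaluation idea described above (cluster the embeddings, compare the clusters with known labels, then repeat after dimensionality reduction) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the clustering algorithms, the reduction method, or the metric, so k-means, PCA, and V-measure from scikit-learn are stand-ins, and the "embeddings" are synthetic points rather than output of a real German text model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import v_measure_score

# Synthetic stand-in for text embeddings (assumption): 300 points,
# 64 dimensions, 4 groups playing the role of topics.
embeddings, labels = make_blobs(
    n_samples=300, n_features=64, centers=4, cluster_std=2.0, random_state=0
)


def cluster_score(vectors, true_labels, n_clusters, seed=0):
    """Cluster with k-means and score agreement with the gold labels."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = model.fit_predict(vectors)
    # V-measure lies in [0, 1]; 1 means clusters match the labels exactly.
    return v_measure_score(true_labels, predicted)


# Score on the full-dimensional vectors.
score_full = cluster_score(embeddings, labels, n_clusters=4)

# Score after reducing to 8 dimensions with PCA (illustrative choice).
reduced = PCA(n_components=8, random_state=0).fit_transform(embeddings)
score_reduced = cluster_score(reduced, labels, n_clusters=4)

print(f"V-measure, 64 dims: {score_full:.3f}")
print(f"V-measure,  8 dims: {score_reduced:.3f}")
```

On synthetic data like this the two scores may be nearly identical; the sketch only shows the shape of the comparison, not the benchmark's results. With real embeddings, `embeddings` would be replaced by the model's output vectors and `labels` by the dataset's category labels.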

arXiv information

Authors	Silvan Wehrli, Bert Arnrich, Christopher Irrgang
Published	2024-01-05 08:42:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
