MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification

Abstract

eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large-scale label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that the label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We thus propose label2vec to effectively train semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. In fine-tuning the pre-trained encoder Transformer, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. Besides the fine-tuned dense text embeddings, we also extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. As for speed, MatchXML outperforms the competing methods on all six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.
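
As a rough illustration of the label2vec idea mentioned in the abstract, the sketch below builds Skip-gram (center, context) pairs from the label sets of training samples. It assumes that each sample's label set plays the role of a sentence and that every co-occurring label counts as context; the abstract does not spell out these details. The function name `label_skipgram_pairs` and the toy label names are hypothetical and are not taken from the MatchXML source code.

```python
from itertools import permutations


def label_skipgram_pairs(label_sets):
    """Build (center, context) label pairs for Skip-gram training.

    Assumption: the context window spans the whole label set of a sample,
    so every other label of the same sample is a context label.
    """
    pairs = []
    for labels in label_sets:
        unique = sorted(set(labels))  # drop duplicates, fix a stable order
        # All ordered pairs of distinct labels within one sample.
        pairs.extend(permutations(unique, 2))
    return pairs


# Toy data: three text samples with hypothetical label names.
samples = [
    ["python", "machine-learning"],
    ["python", "pandas", "dataframe"],
    ["machine-learning"],  # a single label yields no pairs
]

pairs = label_skipgram_pairs(samples)
for center, context in pairs:
    print(center, "->", context)
```

Pairs of this kind could then be passed to any Skip-gram trainer to obtain dense label vectors, which is the role the abstract assigns to label2vec before the label tree is built.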

arXiv information

Authors	Hui Ye, Rajshekhar Sunderraman, Shihao Ji
Published	2024-03-11 14:50:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
