Two to Five Truths in Non-Negative Matrix Factorization

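Abstract

This paper examines the role of matrix scaling applied to a matrix of counts when building a topic model with non-negative matrix factorization.
It presents a scaling inspired by the normalized Laplacian (NL) for graphs that can greatly improve the quality of a non-negative matrix factorization.
The results parallel those of the spectral graph clustering work of \cite{Priebe:2019}, whose authors proved that adjacency spectral embedding (ASE) spectral clustering is more likely to discover core-periphery partitions, while Laplacian spectral embedding (LSE) is more likely to discover affinity partitions.
In text analysis, non-negative matrix factorization (NMF) is typically applied to a matrix of co-occurrence counts of "contexts" and "terms".
The matrix scaling inspired by LSE gives significant improvement for text topic models on a variety of datasets.
The paper illustrates how much matrix scaling in NMF can improve topic model quality on three datasets for which human annotation is available.
Measured by the adjusted Rand index (ARI), a measure of cluster similarity, there is an increase of 50\% for Twitter data and over 200\% for a newsgroup dataset compared with using raw counts, which is the analogue of ASE.
For clean data, such as those from the Document Understanding Conference, NL gives over 40\% improvement over ASE.
The paper concludes with some analysis of this phenomenon and of the connections between this scaling and other matrix scaling methods.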

Abstract (original)

In this paper, we explore the role of matrix scaling on a matrix of counts when building a topic model using non-negative matrix factorization. We present a scaling inspired by the normalized Laplacian (NL) for graphs that can greatly improve the quality of a non-negative matrix factorization. The results parallel those in the spectral graph clustering work of \cite{Priebe:2019}, where the authors proved that adjacency spectral embedding (ASE) spectral clustering was more likely to discover core-periphery partitions and Laplacian spectral embedding (LSE) was more likely to discover affinity partitions. In text analysis, non-negative matrix factorization (NMF) is typically used on a matrix of co-occurrence counts of “contexts” and “terms”. The matrix scaling inspired by LSE gives significant improvement for text topic models in a variety of datasets. We illustrate the dramatic difference that matrix scaling in NMF can make to the quality of a topic model on three datasets where human annotation is available. Using the adjusted Rand index (ARI), a measure of cluster similarity, we see an increase of 50\% for Twitter data and over 200\% for a newsgroup dataset versus using counts, which is the analogue of ASE. For clean data, such as those from the Document Understanding Conference, NL gives over 40\% improvement over ASE. We conclude with some analysis of this phenomenon and some connections of this scaling with other matrix scaling methods.
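
The abstract does not spell out the scaling or the evaluation code, so the following is only an illustrative sketch under stated assumptions. It assumes the NL-inspired scaling takes the form D_r^{-1/2} A D_c^{-1/2}, where A is the count matrix and D_r, D_c hold its row and column sums, by analogy with the normalized graph Laplacian; the paper's exact definition may differ. The function names `nl_scale` and `adjusted_rand_index` are hypothetical, and the ARI is computed with the standard pair-counting formula.

```python
import math
from collections import Counter


def nl_scale(counts):
    """Scale a non-negative count matrix A to D_r^{-1/2} A D_c^{-1/2}.

    ASSUMPTION: this form is chosen by analogy with the normalized
    Laplacian for graphs; it is not taken from the paper itself.
    """
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    scaled = []
    for i, row in enumerate(counts):
        # A positive entry implies positive row and column sums,
        # so the division is safe; zero entries stay zero.
        scaled.append([
            a / math.sqrt(row_sums[i] * col_sums[j]) if a else 0.0
            for j, a in enumerate(row)
        ])
    return scaled


def _pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) / 2


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical
    joint = Counter(zip(labels_a, labels_b))
    index = sum(_pairs(v) for v in joint.values())
    sum_a = sum(_pairs(v) for v in Counter(labels_a).values())
    sum_b = sum(_pairs(v) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / _pairs(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (index - expected) / (max_index - expected)
```

In a workflow like the one the abstract describes, the output of `nl_scale` would be passed to an NMF routine in place of the raw counts, and `adjusted_rand_index` would compare the resulting topic assignments against human annotations.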

arXiv information

Authors	John M. Conroy, Neil P Molino, Brian Baughman, Rod Gomez, Ryan Kaliszewski, Nicholas A. Lines
Published	2023-09-05 16:14:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
