Detecting out-of-distribution text using topological features of transformer-based language models

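Abstract

We attempt to detect out-of-distribution (OOD) text samples by applying Topological Data Analysis (TDA) to the attention maps of transformer-based language models.
We evaluate the proposed TDA-based approach to out-of-distribution detection on BERT, a transformer-based language model, and compare it with a more traditional OOD approach based on BERT CLS embeddings.
We found that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates on near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.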

Abstract (original)

We attempt to detect out-of-distribution (OOD) text samples through applying Topological Data Analysis (TDA) to attention maps in transformer-based language models. We evaluate our proposed TDA-based approach for out-of-distribution detection on BERT, a transformer-based language model, and compare it to a more traditional OOD approach based on BERT CLS embeddings. We found that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.
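
The abstract does not say which topological features are extracted from the attention maps. As a rough illustration only, and not the authors' method, the sketch below treats a single attention matrix as a weighted graph over tokens, keeps the edges whose weight reaches a threshold, and counts connected components (Betti-0) and independent cycles (Betti-1) of the resulting graph. The function names, the symmetrisation by element-wise maximum, and the threshold values are all assumptions made for this example.

```python
import numpy as np


def attention_graph_betti(attn, threshold):
    """Return (b0, b1) of the undirected graph obtained by thresholding attn.

    Hypothetical helper for illustration; not taken from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    # Assumption: symmetrise by keeping the stronger of the two directions.
    sym = np.maximum(attn, attn.T)

    # Union-find over tokens to count connected components.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = 0
    components = n
    for i in range(n):
        for j in range(i + 1, n):  # off-diagonal pairs only, no self-loops
            if sym[i, j] >= threshold:
                edges += 1
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    components -= 1

    b0 = components
    # For a graph, the number of independent cycles is E - V + C.
    b1 = edges - n + components
    return b0, b1


def topological_features(attn, thresholds=(0.1, 0.25, 0.5)):
    """Concatenate (b0, b1) over several thresholds into one feature vector."""
    feats = []
    for t in thresholds:
        feats.extend(attention_graph_betti(attn, t))
    return np.array(feats, dtype=float)
```

In a full pipeline, vectors like these would be computed per attention head and passed to some OOD scoring rule; the choice of that rule is left open here because the abstract does not specify it.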

arXiv information

Authors	Andres Pollano, Anupam Chaudhuri, Anj Simmons
Published	2023-11-22 02:04:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
