Contextual Categorization Enhancement through LLMs Latent-Space

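Summary

Managing the semantic quality of categorization in large text datasets such as Wikipedia poses major challenges in terms of complexity and cost.
This paper proposes using transformer models to distill semantic information from the texts of the Wikipedia dataset and its associated categories into a latent space.
Building on these encodings, we then examine several approaches for assessing and strengthening the semantic identity of the categories.
Our graphical approach relies on convex hulls, and our hierarchical approach relies on Hierarchical Navigable Small Worlds (HNSW).
To address the information loss caused by dimensionality reduction, we tune the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories.
This function acts as a filter built around a contextual category and retrieves items with a given Reconsideration Probability (RP).
Retrieving high-RP items gives database administrators a tool for improving data groupings by providing recommendations and identifying outliers within a contextual framework.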

Summary (original)

Managing the semantic quality of the categorization in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset and its associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by Convex Hull, while we utilize Hierarchical Navigable Small Worlds (HNSWs) for the hierarchical approach. As a solution to the information loss caused by the dimensionality reduction, we modulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function represents a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items serves as a tool for database administrators to improve data groupings by providing recommendations and identifying outliers within a contextual framework.
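
The abstract names an exponential decay function over Euclidean distances but does not give its exact form, so the following is only a minimal sketch under stated assumptions. It assumes RP = exp(-decay_rate * d), where d is the Euclidean distance from an item's encoding to the centroid of a category's encodings. The function names, the `decay_rate` and `threshold` parameters, and the use of a centroid as the category's reference point are all hypothetical and are not taken from the paper.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reconsideration_probability(item, center, decay_rate=1.0):
    """Assumed form (not from the paper): RP = exp(-decay_rate * distance)."""
    distance = math.dist(item, center)  # Euclidean distance
    return math.exp(-decay_rate * distance)


def high_rp_items(candidates, category_vectors, threshold=0.5, decay_rate=1.0):
    """Return (name, RP) pairs with RP >= threshold, highest RP first.

    candidates: dict mapping an item name to its encoding.
    category_vectors: encodings of the items already in the category.
    """
    center = centroid(category_vectors)
    scored = [
        (name, reconsideration_probability(vector, center, decay_rate))
        for name, vector in candidates.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D encodings; real transformer encodings have hundreds of dimensions.
category = [[0.0, 0.0], [2.0, 0.0]]
candidates = {"near": [1.0, 0.0], "far": [1.0, 10.0]}
print(high_rp_items(candidates, category, threshold=0.5))
```

In this sketch RP falls smoothly from 1 at the category centroid toward 0 as distance grows, so `threshold` controls how wide the filter around the category is. Because the distance is computed on the full encodings rather than on a reduced projection, no information is lost to dimensionality reduction at this step.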

arXiv information

Authors	Zineddine Bettouche, Anas Safi, Andreas Fischer
Published	2024-04-25 09:20:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
