ClusterChat: Multi-Feature Search for Corpus Exploration

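Summary

Exploring large-scale text corpora is a significant challenge in the biomedical, finance, and legal domains, where vast numbers of documents are published continuously.
Traditional search methods, such as keyword-based search, often retrieve documents in isolation, which limits the user's ability to inspect corpus-wide trends and relationships.
ClusterChat (demo video and source code are available at https://github.com/achouhan93/ClusterChat) is an open-source system for corpus exploration that combines embedding-based document clustering with lexical and semantic search, timeline-driven exploration, and corpus- and document-level question answering (QA).
The system is validated with two case studies on a PubMed dataset of four million abstracts, which show that ClusterChat enhances corpus exploration by delivering context-aware insights while remaining scalable and responsive on large document collections.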

Abstract (original)

Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user’s ability to easily inspect corpus-wide trends and relationships. We present ClusterChat (The demo video and source code are available at: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.
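To make the combination of cluster-based organization, lexical search, and semantic search more concrete, here is a minimal Python sketch. It is not ClusterChat's implementation: the function names (`embed`, `cluster`, `lexical_search`, `semantic_search`), the toy documents, and the similarity threshold are all hypothetical. A bag-of-words count vector stands in for the textual embeddings, and a greedy threshold-based grouping stands in for whatever clustering method the system actually uses.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a textual embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexical_search(docs, keyword):
    """Return indices of documents that contain the exact keyword token."""
    kw = keyword.lower()
    return [i for i, d in enumerate(docs) if kw in d.lower().split()]

def semantic_search(docs, query, top_k=2):
    """Return up to top_k document indices ranked by similarity to the query."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(d)), i) for i, d in enumerate(docs)),
        key=lambda s: (-s[0], s[1]),
    )
    return [i for score, i in scored[:top_k] if score > 0]

def cluster(docs, threshold=0.3):
    """Greedy single-pass grouping (an assumption, not the paper's method):
    attach each document to the first cluster whose seed document is
    similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices; the first is the seed
    for i, d in enumerate(docs):
        for c in clusters:
            if cosine(embed(d), embed(docs[c[0]])) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical miniature corpus, for illustration only.
DOCS = [
    "insulin therapy for type 2 diabetes",
    "diabetes risk and insulin resistance",
    "deep learning for tumor imaging",
    "tumor imaging with deep neural networks",
]

if __name__ == "__main__":
    print(cluster(DOCS))
    print(lexical_search(DOCS, "insulin"))
    print(semantic_search(DOCS, "imaging of tumor"))
```

On this toy corpus the two diabetes documents and the two imaging documents end up in separate groups, the keyword lookup returns only exact token matches, and the similarity ranking can surface documents that share vocabulary with the query without containing it verbatim. A real system would replace `embed` with a learned embedding model and use a scalable index rather than a linear scan.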

arXiv information

Authors	Ashish Chouhan, Saifeldin Mandour, Michael Gertz
Published	2025-06-17 14:18:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
