Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language

Summary

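Domain-specific languages that rely heavily on specialized terminology often fall into the category of low-resource languages.
Collecting test datasets in a narrow domain is time-consuming and requires skilled annotators with domain knowledge and training in the annotation task.
This study addresses the challenge of automatically collecting test datasets for evaluating semantic search in the low-resource, domain-specific German of the process industry.
Our approach proposes an end-to-end annotation pipeline that runs from automated query generation to score reassessment of query-document pairs.
To overcome the lack of text encoders trained on the German chemistry domain, we explore the principle of an ensemble of "weak" text encoders trained on general-knowledge datasets.
We combine individual relevance scores from diverse models to retrieve document candidates, together with relevance scores generated by an LLM, aiming to reach consensus on query-document alignment.
Evaluation results show that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics.
These findings suggest that ensemble learning can effectively adapt semantic search systems to specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.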

Abstract (original)

Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automated collecting test datasets to evaluate semantic search in low-resource domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline for automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained in the German chemistry domain, we explore a principle of an ensemble of ‘weak’ text encoders trained on common knowledge datasets. We combine individual relevance scores from diverse models to retrieve document candidates and relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems for specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.
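
The abstract does not say how the individual relevance scores are combined, so the following is only a minimal sketch of one common way to build such an ensemble: min-max normalize each model's scores for a query, then average them per document. The function names, the encoder names, the sample scores, and the choice of mean fusion are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean


def min_max_normalize(scores):
    """Rescale one model's scores for a single query to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tied: this model carries no ranking signal.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def ensemble_scores(per_model_scores):
    """Average normalized scores across models (assumed fusion rule).

    A document that a model did not retrieve counts as 0.0 for that model.
    """
    normalized = [min_max_normalize(s) for s in per_model_scores]
    docs = set().union(*(n.keys() for n in normalized))
    return {doc: mean(n.get(doc, 0.0) for n in normalized) for doc in docs}


def rank(scores):
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical scores from three "weak" encoders for one query.
# The scales differ on purpose, which is why normalization comes first.
encoder_a = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.1}
encoder_b = {"doc1": 12.0, "doc2": 15.0, "doc3": 3.0}
encoder_c = {"doc1": 0.7, "doc2": 0.2}

combined = ensemble_scores([encoder_a, encoder_b, encoder_c])
ranking = rank(combined)
```

Averaging after normalization keeps a single model with a wide score range from dominating the result. A rank-based rule such as reciprocal rank fusion would be a reasonable alternative when the score distributions differ strongly between encoders.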

arXiv information

Authors	Anastasia Zhukova, Christian E. Matt, Bela Gipp
Published	2024-12-13 09:47:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
