CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

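Summary

Large language models (LLMs) are transforming a wide range of domains, but verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction.
To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method.
CE reduces each LLM answer to a single embedding vector using a powerful modern embedding model such as SFR-Embedding-Mistral.
Earlier methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, which forced them to operate at token or sentence granularity.
In contrast, CE performs fast, semantically rich comparisons directly at the level of the whole answer, overcoming key limitations in both accuracy and scalability.
We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge).
Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks.
We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.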

Summary (original)

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
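
The whole-answer comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: toy_embed is a hypothetical stand-in (a hashed bag-of-words vector) for a real embedding model such as SFR-Embedding-Mistral, and scoring a set of answers by their mean pairwise cosine similarity is an assumption made here, since the abstract does not say how the comparisons are aggregated. The function names and the dim parameter are likewise illustrative.

```python
import hashlib
import math
from itertools import combinations


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model.

    Maps a whole answer to ONE vector (hashed bag-of-words, L2-normalised).
    A real setup would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # Stable hash so the same token always lands in the same bucket.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_consistency(answers, embed=toy_embed):
    """Mean pairwise cosine similarity between whole-answer embeddings.

    Assumption of this sketch: answers sampled for the same prompt that
    agree with each other score high, and divergent answers score low.
    """
    if len(answers) < 2:
        raise ValueError("need at least two answers to compare")
    vectors = [embed(a) for a in answers]  # one vector per answer
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims)
```

Calling answer_consistency on several answers sampled for the same prompt returns a single score. With a real embedding model passed as the embed argument, a low score would suggest that the answers disagree and may deserve closer inspection; the toy embedding only captures word overlap, so it shows the data flow rather than the semantic quality the abstract attributes to CE.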

arXiv information

Authors	Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Published	2025-06-04 14:57:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
