Potential and Perils of Large Language Models as Judges of Unstructured Textual Data

Abstract

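Rapid advances in large language models (LLMs) have unlocked remarkable capabilities for processing and summarizing unstructured text data.
This matters for the analysis of rich, open-ended datasets such as survey responses, where LLMs are expected to distill key themes and sentiments efficiently.
However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can LLMs be trusted to represent accurately the perspectives contained in these text-based datasets?
LLMs excel at generating human-like summaries, but there is a risk that their outputs inadvertently diverge from the actual content of the original responses.
Discrepancies between LLM-generated outputs and the themes actually present in the data could lead to flawed decision-making, with far-reaching consequences for organizations.
This research investigates how effective LLMs are as judge models for evaluating the thematic alignment of summaries generated by other LLMs.
We used an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as LLM judges.
The LLM-as-judge approach was compared with human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating it as a scalable alternative to traditional human-centric evaluation methods.
Our findings show that LLMs as judges offer a scalable solution comparable to human raters, although humans may still be better at detecting subtle, context-specific nuances.
This research contributes to the growing body of knowledge on AI-assisted text analysis.
We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across different contexts and use cases.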

Abstract (original)

Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon’s Titan Express, Nova Pro, and Meta’s Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen’s kappa, Spearman’s rho, and Krippendorff’s alpha, validating a scalable alternative to traditional human-centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI-assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.
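
To make the agreement comparison concrete, here is a minimal sketch of Cohen's kappa, one of the three statistics the abstract names for comparing an LLM judge with a human rater. The function name, the binary labels, and the ratings are all hypothetical and chosen only for illustration; they are not data or code from the paper, and the paper's actual rating scale is not stated in the abstract.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings: 1 = summary theme judged aligned with the responses, 0 = not aligned.
human_ratings = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge_ratings = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]

kappa = cohens_kappa(human_ratings, judge_ratings)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw percent agreement for the agreement expected by chance, which is why it is preferred over simple accuracy when one label dominates. Spearman's rho and Krippendorff's alpha address ordinal rank correlation and multi-rater agreement respectively and are not covered by this sketch.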

arXiv information

Authors	Rewina Bedemariam, Natalie Perez, Sreyoshi Bhaduri, Satya Kapoor, Alex Gil, Elizabeth Conjar, Ikkei Itoku, David Theil, Aman Chadha, Naumaan Nayyar
Published	2025-01-14 14:49:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
