Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

Abstract

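Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients.
Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear.
Prior evaluations have generally relied on automated scores that do not measure understandability directly, or on subjective Likert-scale ratings from convenience samples with limited generalizability.
To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants.
We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness, and through objective multiple-choice comprehension and recall measures of reader understanding.
Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments.
Our findings indicate that while LLMs can generate PLSs that are indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension.
Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs.
This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes.
Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.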

Abstract (original)

Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

arXiv information

Authors	Yue Guo, Jae Ho Sohn, Gondy Leroy, Trevor Cohen
Published	2025-05-15 15:31:17+00:00

Source and services used

arxiv.jp, Google
