Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

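Summary

As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential for evaluating their performance before implementation.
Existing instruments for assessing provider documentation quality are often ill-suited to the complexity of LLM-generated text and lack validation on real-world data.
The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries.
Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b).
Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity.
Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability.
The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability.
Factor analysis identified a 4-factor model explaining 58% of the variance, with factors representing organization, clarity, accuracy, and utility.
Substantive validity was supported by correlations between note length and the scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037).
Discriminant validity was supported by the instrument distinguishing high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and to facilitate safer integration of LLMs into healthcare workflows.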

Abstract (original)

As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach’s alpha for structural validity, inter-rater reliability (ICC and Krippendorff’s alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach’s alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
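
The internal-consistency statistic reported above, Cronbach's alpha, can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `cronbach_alpha`, the array layout (one row per rated summary, one column per instrument item), and the synthetic ratings are all assumptions made for illustration, and the synthetic data is unrelated to the study's results.

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (n_observations, n_items) score matrix.

    Hypothetical helper for illustration; rows are rated summaries,
    columns are instrument items.
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    # Sample variance of each item across observations.
    item_variances = scores.var(axis=0, ddof=1)
    # Sample variance of the per-observation total score.
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)


# Synthetic ratings (assumption, not study data): nine items that share
# one latent quality signal plus independent noise.
rng = np.random.default_rng(0)
latent_quality = rng.normal(size=(200, 1))
ratings = latent_quality + 0.8 * rng.normal(size=(200, 9))

alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha on synthetic ratings: {alpha:.3f}")
```

Alpha approaches 1 when items move together across summaries and falls toward 0 when they vary independently, which is why a high value is read as evidence that the items measure a common construct.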

arXiv information

Authors	Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris G. Ebby, Jillian Gorskic, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar
Published	2025-01-15 17:47:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
