NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

要約

この作業では、国際的な看護コードから蒸留された5つのコアバリューディメンションで構成される看護価値アライメントの最初のベンチマークを紹介します。
ベンチマークは、さまざまな層の3つの病院で5か月の縦断的フィールドスタディを通じて収集された1,100の実世界の看護行動インスタンスで構成されています。
これらのインスタンスには、5人の臨床看護師が注釈を付けられ、その後、倫理極性が逆になったLLM生成の反事実を拡張します。
各オリジナルのケースは、付属されたバリオールバージョンとペアリングされているため、簡単なレベルのデータセットを構成する2,200のラベル付きインスタンスが得られます。
敵対的な複雑さを高めるために、各インスタンスは、コンテキストのキューと微妙な誤解を招く信号を埋め込むダイアログベースの形式にさらに変換され、ハードレベルのデータセットが生成されます。
23の最先端（SOTA）LLMを看護価値との調整について評価します。
私たちの調査結果は、3つの重要な洞察を明らかにしています。（1）DeepSeek-V3は、簡単なレベルのデータセット（94.55）で最高のパフォーマンスを達成します。Claude3.5Sonnetは、ハードレベルのデータセット（89.43）の他のモデルよりも優れており、医療LLMを大幅に上回ります。
（2）正義は一貫して評価するのが最も困難な看護価値の次元です。
（3）コンテキスト内学習により、アライメントが大幅に向上します。
この作業は、臨床環境で価値に敏感なLLMS開発の基盤を提供することを目的としています。
データセットとコードは、https：//huggingface.co/datasets/ben012345/nurvaluesで入手できます。

要約(オリジナル)

This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The benchmark comprises 1,100 real-world nursing behavior instances collected through a five-month longitudinal field study across three hospitals of varying tiers. These instances are annotated by five clinical nurses and then augmented with LLM-generated counterfactuals with reversed ethic polarity. Each original case is paired with a value-aligned and a value-violating version, resulting in 2,200 labeled instances that constitute the Easy-Level dataset. To increase adversarial complexity, each instance is further transformed into a dialogue-based format that embeds contextual cues and subtle misleading signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA) LLMs on their alignment with nursing values. Our findings reveal three key insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level dataset (94.55), where Claude 3.5 Sonnet outperforms other models on the Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2) Justice is consistently the most difficult nursing value dimension to evaluate; and (3) in-context learning significantly improves alignment. This work aims to provide a foundation for value-sensitive LLMs development in clinical settings. The dataset and the code are available at https://huggingface.co/datasets/Ben012345/NurValues.

arxiv情報

著者	Ben Yao,Qiuchi Li,Yazhou Zhang,Siyu Yang,Bohan Zhang,Prayag Tiwari,Jing Qin
発行日	2025-05-13 16:46:25+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー