Sample Size in Natural Language Processing within Healthcare Research

Abstract

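Sample size calculation is an essential step in most data-driven disciplines.
A sufficiently large sample helps ensure that the population is represented and determines the precision of estimates.
This holds for most quantitative studies, including those that use machine learning methods such as natural language processing, where free text is used to generate predictions and classify instances of text.
In healthcare, a shortage of sufficiently large corpora of previously collected data can be a limiting factor when determining sample sizes for new studies.
This paper addresses the problem by offering sample size recommendations for text classification tasks in the healthcare domain.
Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents by the presence or absence of Unspecified Essential Hypertension, the most common diagnosis code in the database.
Simulations were run with several classifiers across a range of sample sizes and class proportions.
The procedure was repeated for a comparatively less common diagnosis code in the database, diabetes mellitus without mention of complication.
The K-nearest neighbours classifier gave better results with smaller sample sizes, whereas support vector machines and BERT models gave better results with larger sample sizes.
Overall, a sample size above 1000 was sufficient to produce decent performance metrics.
The simulations in this study provide guidelines for choosing appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data.
The methodology can be adapted for sample size estimation with other datasets.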
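
The simulation design described above varies two things: how many documents are used and what share of them carry the diagnosis code. A minimal sketch of that sampling step follows, using only the Python standard library. The function names, the grid values, and the binary 0/1 label encoding are assumptions made for illustration; the abstract does not state the actual sizes, proportions, or code used by the authors.

```python
import random
from itertools import product


def subsample(labels, n, pos_fraction, seed=0):
    """Return shuffled indices of a random subsample of size n in which
    round(n * pos_fraction) documents are positive (label == 1)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n * pos_fraction)
    n_neg = n - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough documents for this size/proportion")
    # Sample each class without replacement, then mix the two groups.
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


# Hypothetical grid values, chosen only to illustrate the idea.
SAMPLE_SIZES = [100, 500, 1000]
POS_FRACTIONS = [0.1, 0.3, 0.5]


def simulation_grid(labels, sizes=SAMPLE_SIZES, fractions=POS_FRACTIONS, seed=0):
    """Yield (n, pos_fraction, indices) for every size/proportion pair."""
    for n, frac in product(sizes, fractions):
        yield n, frac, subsample(labels, n, frac, seed)
```

Each yielded index list would then be used to train and evaluate a classifier (for example K-nearest neighbours, a support vector machine, or a BERT model), so that performance can be compared across grid cells.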

Abstract (original)

Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code within the database of diabetes mellitus without mention of complication. Smaller sample sizes resulted in better results when using a K-nearest neighbours classifier, whereas larger sample sizes provided better results with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide decent performance metrics. The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be modified for sample size estimates calculations with other datasets.

arXiv information

Authors	Jaya Chaturvedi, Diana Shamsutdinova, Felix Zimmer, Sumithra Velupillai, Daniel Stahl, Robert Stewart, Angus Roberts
Published	2023-09-05 13:42:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
