Evaluating Large Language Models for Public Health Classification and Extraction Tasks

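Abstract

Advances in large language models (LLMs) have generated significant interest in their potential to support human experts across a range of domains, including public health.
In this work, we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text.
We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs on processing text related to health burden, epidemiological risk factors, and public health interventions.
We evaluate eleven open-weight LLMs (7 to 123 billion parameters) across all tasks using zero-shot in-context learning.
We find that Llama-3.3-70B-Instruct is the highest-performing model, achieving the best results on 8 of 16 tasks (measured by micro-F1 score).
Performance varies significantly across tasks: all open-weight LLMs score below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve more than 80% micro-F1 on others, such as GI Illness Classification.
For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find results comparable to those of Llama-3.3-70B-Instruct.
Overall, based on these initial results, we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free-text sources and to support public health surveillance, research, and interventions.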

Abstract (original)

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find comparable results to Llama-3.3-70B-Instruct. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions.
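
The abstract reports results as micro-F1 scores but does not spell out the computation. As a minimal sketch, assuming the standard definition (true positives, false positives, and false negatives are pooled over all examples and labels before F1 is computed), it could look like the following. The function name `micro_f1` and the label strings are hypothetical and used only for illustration; they are not taken from the paper.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label predictions.

    gold, pred: lists of label sets, one set per document.
    Counts are pooled across all documents and labels, then
    F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the annotation
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels, for illustration only.
gold = [{"gi_illness"}, {"gi_illness", "respiratory"}, set()]
pred = [{"gi_illness"}, {"gi_illness"}, {"respiratory"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

Because counts are pooled before averaging, frequent labels weigh more heavily than rare ones; a macro average would instead weight every label equally.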

arXiv information

Authors	Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou, Jo Hardstaff, Kathryn Hopkins, Y-Ling Chi, Galena Kuyumdzhieva, Lesley Larkin, Samuel Collins, Hamish Mohammed, Thomas Finnie, Luke Hounsome, Michael Borowitz, Steven Riley
Published	2025-02-19 14:11:10+00:00

Source and services used

arxiv.jp, Google
