Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models

Summary

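The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations.
Instead, human evaluations that extend traditional qualitative techniques from the natural language generation literature are required.
One recent best practice is to use A/B-testing frameworks, which capture human evaluators' preferences for specific models.
This paper describes a human evaluation experiment, carried out at Elsevier, that focuses on the biomedical domain (health, biology, chemistry/pharmacology).
In it, a large but not massive (8.8B-parameter) decoder-only foundational transformer, trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets, is compared against OpenAI's GPT-3.5-turbo and Meta's foundational 7B-parameter Llama 2 model on multiple criteria.
Although inter-rater reliability (IRR) scores were generally low, the results indicate a preference for GPT-3.5-turbo, and hence for models that have conversational abilities, are very large, and were trained on very large datasets.
At the same time, they indicate that for less massive models, training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.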
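
As a rough illustration of the two quantities the abstract refers to, the sketch below tallies pairwise A/B preferences into per-model win rates and computes an agreement score between two raters. It is not the authors' procedure: the abstract does not say which IRR statistic was used, so Cohen's kappa is an assumption here, and the function names, model labels, and votes are all hypothetical.

```python
from collections import Counter


def win_rates(votes):
    """Return each model's share of wins over the comparisons it appeared in.

    votes: list of (model_a, model_b, winner) tuples, where winner is
    one of the two models shown to the evaluator (hypothetical format).
    """
    wins = Counter()
    appearances = Counter()
    for model_a, model_b, winner in votes:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two raters who labelled the same items.

    Assumption: the source does not name its IRR statistic; kappa is
    used here only as a common example of one.
    """
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_1)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up votes for illustration only; these are not results from the paper.
example_votes = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", "model_z"),
    ("model_y", "model_z", "model_z"),
]
example_rates = win_rates(example_votes)

# Two hypothetical raters choosing the preferred answer for four prompts.
rater_1 = ["model_x", "model_x", "model_y", "model_y"]
rater_2 = ["model_x", "model_y", "model_y", "model_y"]
example_kappa = cohen_kappa(rater_1, rater_2)
```

A kappa near 0 means agreement is no better than chance, which is one way generally low IRR scores could show up even when the aggregate preference points clearly at one model.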

Summary (original)

The quality and capabilities of large language models cannot be currently fully assessed with automated, benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from natural language generation literature are required. One recent best-practice consists in using A/B-testing frameworks, which capture preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared to OpenAI’s GPT-3.5-turbo and Meta’s foundational 7B parameter Llama 2 model against multiple criteria. Results indicate — even if IRR scores were generally low — a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large and were trained on very large datasets. But at the same time, indicate that for less massive models training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.

arXiv information

Authors	Camilo Thorne, Christian Druckenbrodt, Kinga Szarkowska, Deepika Goyal, Pranita Marajan, Vijay Somanath, Corey Harper, Mao Yan, Tony Scerri
Published	2024-09-09 10:30:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google