PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models

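Summary

As large language models (LLMs) spread across many fields, there is an urgent need for better NLP benchmarks that cover all the knowledge an individual discipline requires.
Many current benchmarks for foundation models emphasize a broad range of subjects, but they often fail to include every critical subject or to cover the professional knowledge each one demands.
Because LLM performance varies across subjects and knowledge areas, this gap produces skewed results.
To address the problem, the authors present psybench, the first comprehensive Chinese evaluation suite that covers all the knowledge required for graduate entrance exams.
psybench uses multiple-choice questions to evaluate in depth a model's strengths and weaknesses in psychology.
The findings show significant differences in performance across the sections of a single subject, which highlights the risk of skewed results when the knowledge in a test set is not balanced.
Notably, only the ChatGPT model reaches an average accuracy above $70\%$, so there is still plenty of room for improvement.
The authors expect psybench to support thorough evaluation of foundation models' strengths and weaknesses and to assist practical application in the field of psychology.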

Summary (original)

As Large Language Models (LLMs) are becoming prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of individual discipline. Many contemporary benchmarks for foundational models emphasize a broad range of subjects but often fall short in presenting all the critical subjects and encompassing necessary professional knowledge of them. This shortfall has led to skewed results, given that LLMs exhibit varying performance across different subjects and knowledge areas. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. psybench offers a deep evaluation of a model’s strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still plenty of room for improvement. We expect that psybench will help to conduct thorough evaluations of base models’ strengths and weaknesses and assist in practical application in the field of psychology.

arXiv information

Authors	Junlei Zhang, Hongliang He, Nirui Song, Shuyuan He, Shuai Zhang, Huachuan Qiu, Anqi Li, Lizhi Ma, Zhenzhong Lan
Published	2023-11-17 03:17:05+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
