NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Summary

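We introduce NewsBench, a new evaluation framework for systematically assessing the editorial capabilities of large language models (LLMs) in Chinese journalism.
The benchmark dataset covers four facets of writing proficiency and six facets of safety adherence. It consists of 1,267 manually and carefully designed test samples, posed as multiple-choice and short-answer questions, spanning five editorial tasks across 24 news domains.
To measure performance, we propose separate GPT-4-based automatic evaluation protocols that score LLM generations on short-answer questions, one for writing proficiency and one for safety adherence; both are validated by their high correlation with human evaluations.
Using this evaluation framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese.
The experimental results highlight GPT-4 and ERNIE Bot as the top performers, yet reveal a relative deficiency in journalistic safety adherence on creative writing tasks.
Our findings also underscore the need for stronger ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.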
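As a rough illustration of what validating an automatic protocol against human evaluations can look like, the sketch below computes a Pearson correlation between automatic scores and human ratings. The summary does not say which correlation statistic the authors used, so Pearson is an assumption here, and the function name `pearson` and the toy score lists are hypothetical, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy scores for five generated answers (NOT from the paper).
auto_scores = [3, 4, 2, 5, 1]   # scores assigned by an automatic judge
human_scores = [3, 5, 2, 4, 1]  # ratings assigned by human annotators

print(pearson(auto_scores, human_scores))  # 0.9
```

A value close to 1 would indicate that the automatic judge ranks answers much as human annotators do; a rank-based statistic such as Spearman's would be an equally plausible choice.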

Summary (original)

We present NewsBench, a novel evaluation framework to systematically assess the capabilities of Large Language Models (LLMs) for editorial capabilities in Chinese journalism. Our constructed benchmark dataset is focused on four facets of writing proficiency and six facets of safety adherence, and it comprises manually and carefully designed 1,267 test samples in the types of multiple choice questions and short answer questions for five editorial tasks in 24 news domains. To measure performances, we propose different GPT-4 based automatic evaluation protocols to assess LLM generations for short answer questions in terms of writing proficiency and safety adherence, and both are validated by the high correlations with human evaluations. Based on the systematic evaluation framework, we conduct a comprehensive analysis of ten popular LLMs which can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.

arXiv information

Authors	Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo
Published	2024-06-04 14:50:58+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
