NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

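Summary

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale, standardized benchmarking of Norwegian generative language models (LMs).
NorEval consists of 24 high-quality, human-created datasets, five of which were created from scratch.
Unlike existing Norwegian benchmarks, NorEval covers a broad range of task categories targeting Norwegian language understanding and generation, establishes human baselines, and addresses both official written standards of Norwegian: Bokmål and Nynorsk.
All datasets, together with a collection of over 100 human-written prompts, are integrated into LM Evaluation Harness to ensure flexible and reproducible evaluation.
The paper describes the design of NorEval and presents results from benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios.
The benchmark, evaluation framework, and annotation materials are publicly available.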

Summary (original)

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.

arXiv information

Authors	Vladislav Mikhailov, Tita Enstad, David Samuel, Hans Christian Farsethås, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid
Published	2025-04-10 13:44:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
