HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

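Abstract

Language models (LMs) used as conversational assistants have recently become popular tools that help people accomplish a variety of tasks.
They are typically obtained by adapting LMs pretrained on general-domain text sequences through further instruction-tuning and, in some cases, preference optimisation methods.
Ideally, such LMs would be evaluated using human judgement, but this does not scale.
Automatic evaluation that relies on auxiliary LMs as judges and/or on knowledge-based tasks does scale, but it struggles to assess conversational ability and adherence to instructions.
To help accelerate the development of LMs as conversational assistants, we propose a new automatic evaluation task, HumanRankEval (HRE).
It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans.
To perform evaluation, HRE ranks these answers by their log-likelihood under the LM's distribution and then calculates the correlation with the corresponding human rankings.
We support HRE's efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes.
We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.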

Abstract (original)

Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM’s distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE’s efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
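The evaluation step described in the abstract (rank each question's answers by log-likelihood under the LM, then correlate with the human ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the correlation coefficient, so a Spearman-style rank correlation is assumed here, the per-question scores are simply averaged, and the log-likelihoods are hard-coded hypothetical numbers rather than values produced by a real LM. The function names (`ranks`, `rank_correlation`, `hre_style_score`) are illustrative only.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks (highest value gets rank 1), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_correlation(log_likelihoods, human_scores):
    """Assumed metric: Pearson correlation of the two rank vectors (Spearman)."""
    return pearson(ranks(log_likelihoods), ranks(human_scores))


def hre_style_score(questions):
    """Average the per-question rank correlations (an assumed aggregation)."""
    per_question = [
        rank_correlation(q["log_likelihoods"], q["human_scores"])
        for q in questions
    ]
    return sum(per_question) / len(per_question)


# Hypothetical data: each question has several human-scored answers and the
# log-likelihood a model assigned to each answer.
questions = [
    {"log_likelihoods": [-12.5, -20.1, -35.7], "human_scores": [42, 17, 3]},
    {"log_likelihoods": [-30.2, -18.4, -25.0], "human_scores": [9, 1, 5]},
]

print(hre_style_score(questions))
```

In the first hypothetical question the model's ordering matches the human ordering, while in the second it disagrees on which answer is best and which is worst, so the averaged score falls between the two extremes. In practice the log-likelihoods would come from scoring each answer with the LM under evaluation.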

arXiv information

Authors	Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci
Published	2024-05-15 08:47:26+00:00

Source and services used

arxiv.jp, Google
