SimulBench: Evaluating Language Models with Creative Simulation Tasks

要約

SimulBench は、Linux ターミナルとして機能したり、ユーザーとテキストゲームをプレイしたりするなど、クリエイティブなシミュレーションシナリオの多様なコレクションにわたって大規模言語モデル (LLM) を評価するように設計されたベンチマークです。
これらのシミュレーションタスクは LLM の一般的なインテリジェンスの効果的な尺度として機能しますが、既存のベンチマークに組み込まれることはほとんどありません。
大きな課題は、ユーザーと AI の間のシミュレーションタスクのマルチラウンドインタラクティブな性質を維持しながら、さまざまな LLM を公平にテストするための評価フレームワークを開発することです。
この問題に取り組むには、固定 LLM をユーザーエージェントとして使用し、LLM と連携して、さまざまなタスクの下で最初にダイアログを収集することをお勧めします。
次に、さまざまなターゲット LLM を評価するために、困難な対話スクリプトが抽出されます。
\DataName{} の自動評価を容易にするために、GPT-4 が評価者として採用され、複数ターンの対話スクリプトが与えられたターゲット LLM によって生成される最終応答の品質をレビューする任務を負います。
私たちの包括的な実験は、これらのシミュレーションタスクがその独特の性質により引き続き重大な課題をもたらしていることを示し、独自のモデルと最先端のオープン LLM との間のギャップを示しています。
たとえば、GPT-4-turbo は、LLaMA-3-70b-Chat よりも 18.55\% 多いケースで優れています。

要約(オリジナル)

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on \DataName{}, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55\% more cases.

arxiv情報

著者	Qi Jia,Xiang Yue,Tianyu Zheng,Jie Huang,Bill Yuchen Lin
発行日	2024-09-11 21:53:20+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

SimulBench: Evaluating Language Models with Creative Simulation Tasks

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー