PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

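Abstract

Recent evaluations of large language models (LLMs) have centered on testing their zero-shot and few-shot capabilities on basic natural language tasks and their ability to translate instructions into tool APIs. However, how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses LLMs' ability to create and edit PPT files based on user instructions. The benchmark contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match evaluation system, which judges whether an LLM has completed an instruction from the prediction file rather than from the label API sequence, so it supports the varied API sequences that LLMs generate. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving just 6% session accuracy. We find three main causes of error in our benchmark: error accumulation across the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.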

Abstract (original)

Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs’ ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
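
The abstract's point that PPTX-Match judges the prediction file rather than the label API sequence can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the dict-based slide state and the functions add_text_box, set_text, and set_color are hypothetical and are not the PPTC API or the actual PPTX-Match implementation. The sketch only shows why comparing final states accepts API sequences that differ from the label yet produce the same result.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy slide state: shape name -> attribute dict.
# This is NOT the PPTC file format, only an illustration.
Slide = Dict[str, Dict[str, str]]
Call = Tuple[Callable[..., None], Tuple[str, ...]]


def add_text_box(slide: Slide, name: str) -> None:
    """Hypothetical API: create an empty text box."""
    slide[name] = {"text": "", "color": "black"}


def set_text(slide: Slide, name: str, text: str) -> None:
    """Hypothetical API: set the text of an existing shape."""
    slide[name]["text"] = text


def set_color(slide: Slide, name: str, color: str) -> None:
    """Hypothetical API: set the font color of an existing shape."""
    slide[name]["color"] = color


def run(calls: List[Call]) -> Slide:
    """Execute an API sequence on an empty slide and return the final state."""
    slide: Slide = {}
    for fn, args in calls:
        fn(slide, *args)
    return slide


def names(calls: List[Call]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Flatten a call list into comparable (function name, args) pairs."""
    return [(fn.__name__, args) for fn, args in calls]


# Label sequence and a predicted sequence that reorders two calls.
label: List[Call] = [
    (add_text_box, ("title",)),
    (set_text, ("title", "Q3 results")),
    (set_color, ("title", "red")),
]
prediction: List[Call] = [
    (add_text_box, ("title",)),
    (set_color, ("title", "red")),
    (set_text, ("title", "Q3 results")),
]

# Sequence-level comparison rejects the prediction ...
sequence_match = names(label) == names(prediction)
# ... while comparing the resulting states accepts it.
state_match = run(label) == run(prediction)
```

In this toy case the sequence comparison fails while the state comparison succeeds, which is the property the abstract attributes to evaluating on the prediction file.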

arXiv information

Authors	Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Duan Nan
Published	2023-11-03 08:06:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
