PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

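Summary

Recent evaluations of large language models (LLMs) have focused on zero-shot and few-shot performance on basic natural language tasks and on the ability to translate instructions into tool API calls.
How well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment, however, has not been investigated.
To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses an LLM's ability to create and edit PPT files from user instructions.
It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations.
We also propose the PPTX-Match evaluation system, which judges whether an LLM completed an instruction from the predicted file rather than from the label API sequence, and therefore supports the varied API sequences that LLMs generate.
We evaluate three closed-source LLMs and six open-source LLMs.
GPT-4 outperforms the other LLMs with 75.1% accuracy in single-turn dialogue testing, but it struggles to complete entire sessions, reaching only 6% session accuracy.
The benchmark reveals three main causes of error: error accumulation across a multi-turn session, processing of long PPT templates, and multi-modality perception.
These pose major challenges for future LLM and agent systems.
The data, code, and evaluation system for PPTC are released at https://github.com/gydpku/PPTC.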
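
The gap between the 75.1% single-turn figure and the 6% session figure is easier to read with the two metrics side by side. The following is a minimal sketch, assuming a session counts as completed only when every turn in it is completed; that definition, the function names, and the outcome data are illustrative assumptions, not taken from the PPTC code.

```python
from typing import List


def turn_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of individual turns completed correctly."""
    turns = [ok for session in sessions for ok in session]
    return sum(turns) / len(turns)


def session_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of sessions in which every turn was completed correctly.

    Assumption: one failed turn fails the whole session.
    """
    return sum(all(session) for session in sessions) / len(sessions)


# Made-up outcomes for three sessions (True = turn completed correctly).
sessions = [
    [True, True, False, True],
    [True, True, True],
    [True, False, True, True, True],
]

print(turn_accuracy(sessions))     # 10 of 12 turns are correct
print(session_accuracy(sessions))  # 1 of 3 sessions is fully correct
```

Under this assumption a single failed turn is enough to fail a session, so session accuracy can sit far below turn accuracy even when most turns succeed.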

Summary (original)

Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs’ ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
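
The idea of evaluating the prediction file rather than the label API sequence can be illustrated with a toy example. This is only a sketch under stated assumptions: the `(attribute, value)` call format, the dictionary standing in for a PPT file, and the `execute` helper are all hypothetical and are not the PPTC API or the PPTX-Match implementation.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, str]   # hypothetical API call: (attribute, value)
Slide = Dict[str, str]   # toy stand-in for the resulting PPT file


def execute(calls: List[Call]) -> Slide:
    """Apply each toy call in order and return the resulting slide state."""
    slide: Slide = {}
    for attribute, value in calls:
        slide[attribute] = value
    return slide


# Two call sequences that differ in order but produce the same slide.
label_calls = [("title", "Q3 Results"), ("font", "Arial")]
predicted_calls = [("font", "Arial"), ("title", "Q3 Results")]

sequence_match = predicted_calls == label_calls
file_match = execute(predicted_calls) == execute(label_calls)

print(sequence_match, file_match)  # False True
```

Comparing call sequences would reject the prediction here, while comparing the resulting state accepts it, which is the motivation the abstract gives for supporting various LLM-generated API sequences.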

arXiv information

Authors	Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Nan Duan
Published	2023-11-07 10:13:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
