Benchmarking Agentic Workflow Generation

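Abstract

Large language models (LLMs) can handle a wide range of tasks and have driven significant progress on reasoning and planning tasks. A crucial step in this process is decomposing complex problems into executable workflows.
Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards.
To address these limitations, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents.
Through comprehensive evaluations across different types of LLMs, we find a distinct gap between the sequence planning and graph planning capabilities of LLM agents; even GPT-4 shows a gap of around 15%.
We also train two open-source models and evaluate their generalization abilities on held-out tasks.
Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less time during inference.
Code and dataset are available at https://github.com/zjunlp/WorFBench.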
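
The abstract names subsequence matching as one ingredient of WorFEval but gives no details. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes workflow steps are compared by exact string equality and scored with precision, recall, and F1 over the longest common subsequence. The function names `lcs_length` and `sequence_f1` and the example steps are hypothetical.

```python
from typing import List, Tuple


def lcs_length(pred: List[str], gold: List[str]) -> int:
    """Length of the longest common subsequence of two step lists."""
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:  # assumption: steps match only on exact equality
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def sequence_f1(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted step order against a reference."""
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(pred)  # share of predicted steps in correct order
    recall = matched / len(gold)     # share of reference steps recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: two steps are swapped relative to the reference.
gold_steps = ["find recipe", "buy ingredients", "preheat oven", "mix batter", "bake"]
pred_steps = ["find recipe", "preheat oven", "buy ingredients", "mix batter", "bake"]
precision, recall, f1 = sequence_f1(pred_steps, gold_steps)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the swap costs one matched step under a strict linear order. If the reference were a graph in which the two swapped steps are independent, both orders could be valid, which is one way to see why sequence-level and graph-level scores can diverge.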

Abstract (original)

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent’s workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at https://github.com/zjunlp/WorFBench.

arXiv information

Authors	Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Published	2024-10-30 14:49:49+00:00

Source and services used

arxiv.jp, Google
