Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

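Abstract

Large language models (LLMs) perform remarkably well across a wide range of tasks, which has prompted researchers to develop diverse evaluation benchmarks.
However, existing benchmarks typically measure how well an LLM answers individual questions and overlook the complex interactions found in real-world applications.
This paper introduces Compound Question Synthesis (CQ-Syn) to build the Compound-QA benchmark, which focuses on compound questions made up of multiple sub-questions.
The benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy.
It covers five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion.
It evaluates LLM capability along three dimensions: understanding, reasoning, and knowledge.
An evaluation of eight open-source LLMs on Compound-QA reveals distinct patterns in their responses to compound questions, which are significantly worse than their responses to non-compound questions.
The paper also investigates several methods for improving LLM performance on compound questions.
The results indicate that these approaches significantly improve the models' comprehension and reasoning on compound questions.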

Abstract (original)

Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, existing benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. In this paper, we introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark, focusing on compound questions with multiple sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions including understanding, reasoning, and knowledge. Our assessment of eight open-source LLMs using Compound-QA reveals distinct patterns in their responses to compound questions, which are significantly poorer than those to non-compound questions. Additionally, we investigate various methods to enhance LLMs performance on compound questions. The results indicate that these approaches significantly improve the models’ comprehension and reasoning abilities on compound questions.

arXiv information

Authors	Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen
Published	2024-11-15 13:12:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
