FinanceBench: A New Benchmark for Financial Question Answering

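Summary

FinanceBench is a first-of-its-kind test suite for evaluating how well LLMs perform on open-book financial question answering (QA).
It comprises 10,231 questions about publicly traded companies, each with a corresponding answer and evidence string.
The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios.
They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.
We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2, and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench and manually review their answers (n=2,400).
The cases are available as open source.
We show that existing LLMs have clear limitations for financial QA.
Notably, GPT-4-Turbo used with a retrieval system answered incorrectly or refused to answer 81% of questions.
Augmentation techniques, such as using a longer context window to feed in relevant evidence, improve performance, but they are unrealistic for enterprise settings because of increased latency, and they cannot support larger financial documents.
We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.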
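
The review count (n=2,400) follows from every one of the 16 configurations answering each of the 150 sampled cases, and the headline 81% figure is the share of answers that were either incorrect or refused. The sketch below illustrates that arithmetic only. The label names, function names, and the toy data are assumptions made for illustration and are not taken from the FinanceBench release.

```python
from collections import Counter

# Hypothetical label names for a manually reviewed answer (assumption,
# not the labeling scheme published with FinanceBench).
FAILURE_LABELS = {"incorrect", "refused"}


def review_count(num_configs: int, num_cases: int) -> int:
    """Answers to review when every configuration answers every case."""
    return num_configs * num_cases


def failure_rate(labels: list[str]) -> float:
    """Share of answers labeled as incorrect or refused."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    failures = sum(counts[label] for label in FAILURE_LABELS)
    return failures / len(labels)


# 16 model configurations x 150 sampled cases, as stated in the summary.
print(review_count(16, 150))

# Toy data, not real results: one correct answer out of four.
toy_labels = ["correct", "incorrect", "refused", "incorrect"]
print(failure_rate(toy_labels))
```

Counting refusals together with incorrect answers mirrors how the summary reports the GPT-4-Turbo result; a stricter or looser definition of failure would change the rate.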

Summary (original)

FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer to serve as a minimum performance standard. We test 16 state of the art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.

arXiv information

Authors	Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen
Published	2023-11-20 17:28:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
