SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation

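Summary

Consider the problem: “If one man and one woman can produce one child in one year, how many children will one woman and three men produce in 0.5 years?”
Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer “0.5,” which makes no sense.
These models sometimes recognize the unrealistic nature of the question, but in many cases (8 out of 10 trials) they give the nonsensical answer “0.5 child.”
Temporal variation has also been observed: once an LLM answers correctly (by recognizing that the question is faulty), its subsequent responses are more likely to reflect that understanding.
However, this behavior is not consistent.
Questions of this kind motivated the development of SciFaultyQA, a dataset of science questions in which the questions themselves are intentionally faulty.
We observed that LLMs often go on to answer these flawed questions without recognizing their inherent problems, producing results that are logically or scientifically invalid.
By analyzing such patterns, we developed a new method for generating synthetic datasets to evaluate and benchmark how well various LLMs identify these flawed questions.
We also developed new approaches to reduce these errors.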

Abstract (original)

Consider the problem: “If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?” Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer “0.5,” which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials), they provide the nonsensical answer of “0.5 child.” Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding. However, this is inconsistent. These types of questions have motivated us to develop a dataset of science questions, SciFaultyQA, where the questions themselves are intentionally faulty. We observed that LLMs often proceed to answer these flawed questions without recognizing their inherent issues, producing results that are logically or scientifically invalid. By analyzing such patterns, we developed a novel method for generating synthetic datasets to evaluate and benchmark the performance of various LLMs in identifying these flawed questions. We have also developed novel approaches to reduce the errors.

arXiv information

Author	Debarshi Kundu
Published	2024-12-16 17:11:48+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
