AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Summary

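Large language models (LLMs) are being explored for applications in scientific research, including synthesizing literature, answering research questions, generating research ideas, and conducting computational experiments.
Ultimately, our goal is for them to help scientists derive novel scientific insights.
In many areas of science, such insights often arise from processing and visualizing data to understand its patterns.
However, evaluating whether an LLM-mediated scientific workflow produces outputs that convey the correct scientific insights is challenging and has not been addressed in past work.
We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain.
AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots.
Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotations by five professional astronomers.
Using AstroVisBench, we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.
This evaluation provides a strong end-to-end evaluation for AI scientists and offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.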

Summary (original)

Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model’s ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.

arXiv information

Authors	Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Published	2025-05-28 14:54:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
