Can Large Language Models Understand Symbolic Graphics Programs?

Summary

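Assessing the capabilities of large language models (LLMs) is often difficult, in part because it is hard to find tasks they have not encountered during training.
We take a step toward addressing this challenge by turning to a new task: understanding symbolic graphics programs, a popular representation of graphics content that generates visual data procedurally.
LLMs have shown promise for program synthesis, but do they understand symbolic graphics programs?
Unlike conventional programs, symbolic graphics programs can be translated into graphics content.
Here, we characterize an LLM's understanding of symbolic programs in terms of its ability to answer questions about the graphics content.
The task is challenging because the questions are hard to answer from the symbolic program alone, yet, as we verify through a human experiment, easy to answer from the corresponding graphics content.
To understand symbolic programs, an LLM may need the ability to imagine what the corresponding graphics content looks like without direct access to the rendered visual content.
We use this task to evaluate LLMs by building a large benchmark for the semantic understanding of symbolic graphics programs.
Because the benchmark is built from program-graphics correspondence, it requires minimal human effort.
We evaluate current LLMs on the benchmark to give a preliminary assessment of their ability to reason about visual scenes from programs.
We find that the task distinguishes existing LLMs, and that models considered good at reasoning perform better.
Finally, we introduce Symbolic Instruction Tuning (SIT) to improve this ability.
Specifically, we query GPT4-o with questions and images generated from symbolic programs.
The resulting data are then used to fine-tune an LLM.
We also find that SIT data can improve the general instruction-following ability of LLMs.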
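
To make the task setup concrete, here is a minimal Python sketch of what one benchmark item and one SIT training record might look like. Everything in it is an assumption for illustration: the SVG snippet, the question, the field names, and the helpers `build_prompt`, `accuracy`, and `make_sit_record` are hypothetical and are not taken from the paper's benchmark or code. The point it shows is that the model under test sees only the program text, while the answer is defined by the rendered image.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BenchmarkItem:
    """One hypothetical multiple-choice item (field names are assumptions)."""
    program: str        # symbolic graphics program, here SVG source text
    question: str       # question about the rendered graphics content
    options: List[str]  # answer choices, labelled A, B, C, ... in order
    answer: str         # label of the correct choice, e.g. "B"


def build_prompt(item: BenchmarkItem) -> str:
    """Format a text-only prompt: the model never sees the rendered image."""
    lines = [
        "Read the following SVG program and answer the question.",
        "",
        item.program,
        "",
        f"Question: {item.question}",
    ]
    for label, option in zip("ABCDEFGH", item.options):
        lines.append(f"{label}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(items: List[BenchmarkItem], predict: Callable[[str], str]) -> float:
    """Fraction of items where the predicted letter matches the answer label."""
    correct = sum(
        predict(build_prompt(item)).strip().upper().startswith(item.answer)
        for item in items
    )
    return correct / len(items)


def make_sit_record(
    program: str,
    question: str,
    answer_from_image: Callable[[str, str], str],
) -> Dict[str, str]:
    """Pair the program text with an answer obtained from the rendered image.

    `answer_from_image` stands in for rendering the program and querying a
    vision-language model; it is a placeholder, not a real API.
    """
    return {
        "instruction": f"{program}\n\n{question}",
        "output": answer_from_image(program, question),
    }


# A toy item: a single red circle (hypothetical, not from the benchmark).
item = BenchmarkItem(
    program=(
        '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
        '<circle cx="50" cy="50" r="40" fill="#ff0000"/></svg>'
    ),
    question="What color is the shape in the image?",
    options=["blue", "red", "green", "yellow"],
    answer="B",
)
prompt = build_prompt(item)
```

A real evaluation would replace the `predict` callable with a call to the LLM under test, and `answer_from_image` with a renderer plus a vision-language model query; both are left as plain callables here so the sketch stays self-contained.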

Summary (original)

Assessing the capabilities of large language models (LLMs) is often challenging, in part, because it is hard to find tasks to which they have not been exposed during training. We take one step to address this challenge by turning to a new task: focusing on symbolic graphics programs, which are a popular representation for graphics content that procedurally generates visual data. LLMs have shown exciting promise towards program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated to graphics content. Here, we characterize an LLM’s understanding of symbolic programs in terms of their ability to answer questions related to the graphics content. This task is challenging as the questions are difficult to answer from the symbolic programs alone — yet, they would be easy to answer from the corresponding graphics content as we verify through a human experiment. To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. This benchmark is built via program-graphics correspondence, hence requiring minimal human efforts. We evaluate current LLMs on our benchmark to elucidate a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs and models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT4-o with questions and images generated by symbolic programs. Such data are then used to finetune an LLM. We also find that SIT data can improve the general instruction following ability of LLMs.

arXiv information

Authors	Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Published	2024-08-15 17:59:57+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
