DesignQA: A Multimodal Benchmark for Evaluating Large Language Models’ Understanding of Engineering Documentation

Summary

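This research introduces DesignQA, a novel benchmark for evaluating how well multimodal large language models (MLLMs) comprehend and apply engineering requirements in technical documentation.
Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data, including textual design requirements, CAD images, and engineering drawings, derived from the Formula SAE student competition.
Unlike many existing MLLM benchmarks, DesignQA contains document-grounded visual questions in which the input image and the input document come from different sources.
The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on the tasks engineers perform when designing according to requirements.
We evaluate models that were state of the art at the time of writing, such as GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5, against the benchmark, and uncover existing gaps in MLLMs' ability to interpret complex engineering documentation.
The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings.
These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of designing according to technical documentation.
This benchmark lays a foundation for future advances in AI-supported engineering design processes.
DesignQA is publicly available at https://github.com/anniedoris/design_qa/.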

Abstract (original)

This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data, including textual design requirements, CAD images, and engineering drawings, derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing) like GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5 against the benchmark, and our study uncovers the existing gaps in MLLMs’ abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of design according to technical documentation. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.

arXiv information

Authors	Anna C. Doris, Daniele Grandi, Ryan Tomich, Md Ferdous Alam, Mohammadmehdi Ataei, Hyunmin Cheong, Faez Ahmed
Published	2024-08-23 17:19:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
