Enhancing Subtask Performance of Multi-modal Large Language Model

Summary

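A multi-modal large language model (MLLM) is a model that extends a large language model (LLM) with the ability to process and reason over multi-modal data.
Current MLLMs typically use an LLM to decompose a task into multiple subtasks, assign each subtask to an individual pre-trained model, and then use the LLM to integrate the subtask results into the final task result.
In real-world practice, a large project is commonly split into smaller sub-projects, with different teams providing corresponding solutions or results.
The project owner then decides which solution or result to use, so that each subtask, and consequently the whole project, achieves the best possible outcome.
Inspired by this, the study considers selecting multiple pre-trained models to complete the same subtask.
Combining the results of multiple pre-trained models yields the optimal subtask result and thereby improves the performance of the MLLM.
Specifically, the study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, then invokes these models in parallel to process the input data and generate the corresponding subtask results.
Finally, the LLM compares the results from the multiple pre-trained models for the same subtask and chooses the best one as the result for that subtask.
Extensive experiments are conducted using GPT-4-annotated datasets and human-annotated datasets.
Results across various evaluation metrics demonstrate the effectiveness of the proposed approach.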
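
The select, invoke-in-parallel, and compare procedure described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two captioning functions, `run_candidates`, `select_best`, and `longest_output_judge` are hypothetical names, and the judge is a trivial placeholder standing in for the LLM comparison step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-ins for pre-trained models that solve the same subtask
# (here: image captioning). Real models would replace these functions.
def caption_model_a(image_path: str) -> str:
    return "a dog"

def caption_model_b(image_path: str) -> str:
    return "a brown dog running on a beach"

def run_candidates(models: Dict[str, Callable[[str], str]], data: str) -> Dict[str, str]:
    """Invoke every candidate model on the same input in parallel."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}

def select_best(results: Dict[str, str], judge: Callable[[Dict[str, str]], str]) -> str:
    """Ask a judge to name the best candidate and return that candidate's output."""
    winner = judge(results)
    if winner not in results:
        raise ValueError(f"judge returned unknown candidate: {winner!r}")
    return results[winner]

# Placeholder judge (assumption): prefers the longest output. In the paper this
# role is played by an LLM that compares the candidate results.
def longest_output_judge(results: Dict[str, str]) -> str:
    return max(results, key=lambda name: len(results[name]))

models = {"model_a": caption_model_a, "model_b": caption_model_b}
results = run_candidates(models, "example.jpg")
best = select_best(results, longest_output_judge)
print(best)
```

Passing the judge in as a callable keeps the comparison step swappable, so an LLM-backed judge could replace the placeholder without changing the parallel invocation code.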

Abstract (original)

Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into multiple subtasks, then employing individual pre-trained models to complete specific subtasks, and ultimately utilizing LLMs to integrate the results of each subtasks to obtain the results of the task. In real-world scenarios, when dealing with large projects, it is common practice to break down the project into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each subtask and, consequently, for the entire project. Inspired by this, this study considers selecting multiple pre-trained models to complete the same subtask. By combining the results from multiple pre-trained models, the optimal subtask result is obtained, enhancing the performance of the MLLM. Specifically, this study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, and then invokes these models in parallel to process input data and generate corresponding subtask results. Finally, the results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask. Extensive experiments are conducted in this study using GPT-4 annotated datasets and human-annotated datasets. The results of various evaluation metrics adequately demonstrate the effectiveness of the proposed approach in this paper.

arXiv information

Authors	Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu
Published	2023-08-31 05:37:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
