Aligning Multimodal LLM with Human Preference: A Survey

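Abstract (translated)

Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training.
Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data.
However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed.
This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals.
Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges.
This paper aims to provide a comprehensive and systematic review of alignment algorithms for MLLMs.
Specifically, it explores four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, audio, and extended multimodal applications;
(2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations;
(3) the benchmarks used to evaluate alignment algorithms;
(4) a discussion of potential future directions for the development of alignment algorithms.
This work seeks to help researchers organize current advancements in the field and inspire better alignment methods.
The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.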

Abstract (original)

Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.

arXiv information

Authors	Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan
Published	2025-03-18 17:59:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
