Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

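Abstract

Faces and humans are important elements of social interaction and appear widely in everyday photos and videos. A deep understanding of faces and humans therefore allows multi-modal assistants to improve response quality and broaden their range of applications. At present, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy comprising three levels of abilities. Next, based on this taxonomy, we collect images and annotations from datasets publicly available in the face and human community and build a semi-automatic data pipeline to generate problems for the new benchmark. Finally, the resulting Face-Human-Bench consists of a development set of 900 problems and a test set of 1800 problems, supporting both English and Chinese. Using Face-Human-Bench, we evaluate 25 leading multi-modal large language models (MLLMs), focusing on the correlation between abilities, the effect of the relative position of targets on performance, and the effect of Chain of Thought (CoT) prompting on performance. In addition, inspired by multi-modal agents, we explore which abilities of MLLMs need to be supplemented by specialist models.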

Abstract (original)

Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.

arXiv information

Authors	Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Published	2025-01-02 13:05:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
