Organizing Unstructured Image Collections using Natural Language

Summary

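Organizing unstructured visual data into semantic clusters is a key challenge in computer vision.
Traditional deep clustering (DC) approaches produce only a single partition of the data, whereas multiple clustering (MC) methods address this limitation by uncovering several distinct clustering solutions.
The rise of large language models (LLMs) and multimodal LLMs (MLLMs) has strengthened MC by letting users define clustering criteria in natural language.
However, manually specifying criteria for large datasets is impractical.
This work introduces Semantic Multiple Clustering (SMC), a task that aims to automatically discover clustering criteria from large image collections and to uncover interpretable substructures without human input.
The authors' framework, Text Driven Semantic Multiple Clustering (TeDeSC), uses text as a proxy to reason over a large image collection as a whole, discover partitioning criteria expressed in natural language, and reveal semantic substructures.
To evaluate TeDeSC, the authors introduce the COCO-4c and Food-4c benchmarks, each containing four grouping criteria and ground-truth annotations.
They apply TeDeSC to several applications, such as discovering biases and analyzing the popularity of social media images, demonstrating its utility as a tool for automatically organizing image collections and revealing novel insights.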

Abstract (original)

Organizing unstructured visual data into semantic clusters is a key challenge in computer vision. Traditional deep clustering (DC) approaches focus on a single partition of data, while multiple clustering (MC) methods address this limitation by uncovering distinct clustering solutions. The rise of large language models (LLMs) and multimodal LLMs (MLLMs) has enhanced MC by allowing users to define clustering criteria in natural language. However, manually specifying criteria for large datasets is impractical. In this work, we introduce the task Semantic Multiple Clustering (SMC) that aims to automatically discover clustering criteria from large image collections, uncovering interpretable substructures without requiring human input. Our framework, Text Driven Semantic Multiple Clustering (TeDeSC), uses text as a proxy to concurrently reason over large image collections, discover partitioning criteria, expressed in natural language, and reveal semantic substructures. To evaluate TeDeSC, we introduce the COCO-4c and Food-4c benchmarks, each containing four grouping criteria and ground-truth annotations. We apply TeDeSC to various applications, such as discovering biases and analyzing social media image popularity, demonstrating its utility as a tool for automatically organizing image collections and revealing novel insights.

arXiv information

Authors	Mingxuan Liu, Zhun Zhong, Jun Li, Gianni Franchi, Subhankar Roy, Elisa Ricci
Published	2024-10-07 17:21:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
