Why Sample Space Matters: Keyframe Sampling Optimization for LiDAR-based Place Recognition




Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component for successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems. Conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data, because they rely on fixed sampling intervals or work directly in 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors. Through experimental validation across multiple datasets and descriptor frameworks, we demonstrate that the proposed method can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
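As a rough illustration of what sampling in descriptor space (rather than at fixed intervals or in 3D space) can look like, the sketch below greedily drops frames whose descriptors are near-duplicates of already retained keyframes. It is a simplified stand-in and not the authors' method: the function names, the Euclidean metric, and the `min_dist` threshold are all assumptions made for illustration (the abstract states that the proposed approach needs no parameter tuning, which this sketch does not reproduce).

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sample_keyframes(descriptors, min_dist):
    """Greedy descriptor-space sampling (hypothetical helper).

    A frame is kept only if its descriptor is at least `min_dist` away from
    every keyframe kept so far, so near-duplicate descriptors are dropped.
    Returns the indices of the retained frames.
    """
    kept = []
    for i, d in enumerate(descriptors):
        if all(l2(d, descriptors[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept

# Toy 2D "descriptors": frames 1 and 3 nearly duplicate frames 0 and 2.
descs = [(0.0, 0.0), (0.05, 0.0), (1.0, 0.0), (1.0, 0.02), (0.0, 1.0)]
keyframes = sample_keyframes(descs, min_dist=0.5)
print(keyframes)  # prints [0, 2, 4]
```

Real place-recognition descriptors have hundreds of dimensions; the toy vectors here only make the redundancy easy to see.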


Authors: Nikolaos Stathoulopoulos, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos
Published: 2024-10-03 16:29:47+00:00
Learning 3D Perception from Others’ Predictions




Accurate 3D object detection in real-world environments requires a large amount of high-quality annotated data. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground truth to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units’ predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated samples, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars’ predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units’ predictions.
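The distance-based curriculum can be pictured as training on progressively larger subsets of the received predictions, nearest first. The sketch below shows only that ordering idea. The data layout, the threshold values, and the function name are hypothetical, the distance is assumed to be the one between the ego car and the reference unit that produced each prediction, and the self-training and refinement steps mentioned in the abstract are omitted.

```python
def curriculum_stages(pseudo_labels, thresholds):
    """Yield growing training subsets, nearest reference predictions first.

    pseudo_labels: list of (label_id, distance_in_metres) tuples, where the
        distance is assumed to be between the ego car and the reference unit.
    thresholds: increasing distance cut-offs, one per curriculum stage.
    """
    for t in thresholds:
        # Each stage keeps every pseudo label within the current cut-off.
        yield [label_id for label_id, dist in pseudo_labels if dist <= t]

labels = [("a", 5.0), ("b", 22.0), ("c", 48.0), ("d", 12.5)]
stages = list(curriculum_stages(labels, thresholds=[10.0, 30.0, 60.0]))
for stage_index, subset in enumerate(stages):
    print(stage_index, subset)
```

Each later stage is a superset of the earlier one, so the detector would first see the predictions most likely to share its viewpoint before the noisier, more distant ones are added.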


Authors: Jinsu Yoo, Zhenyang Feng, Tai-Yu Pan, Yihong Sun, Cheng Perng Phoo, Xiangyu Chen, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao
Published: 2024-10-03 16:31:28+00:00
Measuring and Improving Persuasiveness of Generative Models




LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs’ impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models’ persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California’s SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI’s societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at https://bit.ly/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.


Authors: Somesh Singh, Yaman K Singla, Harini SI, Balaji Krishnamurthy
Published: 2024-10-03 16:36:35+00:00
Evaluating Perceptual Distance Models by Fitting Binomial Distributions to Two-Alternative Forced Choice Data




The two-alternative forced choice (2AFC) experimental method is popular in the visual perception literature, where practitioners aim to understand how human observers perceive distances within triplets made of a reference image and two distorted versions. In the past, such experiments were conducted in controlled environments, with triplets sharing images, so it was possible to rank the perceived quality. This ranking would then be used to evaluate perceptual distance models against the experimental data. Recently, crowd-sourced perceptual datasets have emerged, with no images shared between triplets, making ranking infeasible. Evaluating perceptual distance models using this data reduces the judgements on a triplet to a binary decision, namely, whether the distance model agrees with the human decision, which is suboptimal and prone to misleading conclusions. Instead, we statistically model the underlying decision-making process during 2AFC experiments using a binomial distribution. Having enough empirical data, we estimate a smooth and consistent distribution of the judgements on the reference-distorted distance plane, according to each distance model. By applying maximum likelihood, we estimate the parameter of the local binomial distribution, and a global measurement of the expected log-likelihood of the measured responses. We calculate meaningful and well-founded metrics for the distance model, beyond the mere prediction accuracy as percentage agreement, even with variable numbers of judgements per triplet. These are key advantages over both classical and neural network methods.
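To make the binomial modelling concrete, the sketch below computes the maximum-likelihood estimate of the binomial parameter for a single triplet and the log-likelihood of the observed judgements under a given parameter. It is a minimal per-triplet illustration only: the paper's local estimation over the reference-distorted distance plane is not reproduced, and the function names and the example counts (7 of 10 observers choosing the first image) are made up.

```python
import math

def binomial_mle(k, n):
    """MLE of the binomial parameter: the fraction of n judgements choosing image A."""
    return k / n

def binomial_log_likelihood(k, n, p):
    """Log-probability of k out of n judgements for image A under Binomial(n, p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))

# Hypothetical triplet: 7 of 10 observers preferred distorted image A.
p_hat = binomial_mle(7, 10)
ll_fit = binomial_log_likelihood(7, 10, p_hat)
ll_chance = binomial_log_likelihood(7, 10, 0.5)
print(p_hat, ll_fit, ll_chance)
```

Because the log-likelihood uses all `n` judgements, a triplet with a 7-to-3 split and one with a 10-to-0 split are scored differently, whereas a plain agree/disagree accuracy would treat both the same.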


Authors: Alexander Hepburn, Raul Santos-Rodriguez, Javier Portilla
Published: 2024-10-03 17:10:22+00:00
Generalizing Medical Image Representations via Quaternion Wavelet Networks




Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations in the data provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be used with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be used as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets, diverse image analysis, and synthesis tasks including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework, which improves network performance while being flexible enough to be adopted in manifold scenarios and robust to domain shifts. The full code is available at: https://github.com/ispamm/QWT.


Authors: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini, Danilo Comminiello
Published: 2024-10-03 17:13:41+00:00
Lie Algebra Canonicalization: Equivariant Neural Operators under arbitrary Lie Groups




The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC’s efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models.


Authors: Zakhar Shumaylov, Peter Zaika, James Rowbottom, Ferdia Sherry, Melanie Weber, Carola-Bibiane Schönlieb
Published: 2024-10-03 17:21:30+00:00
ControlAR: Controllable Image Generation with Autoregressive Models




Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. A natural approach, inspired by advancements in Large Language Models, is to tokenize control images and prefill the resulting tokens into the autoregressive model before decoding image tokens, but it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. First, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model’s efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at https://github.com/hustvl/ControlAR.
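The contrast between prefilling and per-token fusion can be sketched in a few lines. With prefilling, control tokens are prepended and lengthen the sequence. With per-token fusion, each control embedding is combined with the image embedding at the same position, much as a positional encoding would be. The sketch below assumes the fusion is a plain element-wise addition, which the abstract does not specify, and `fuse_tokens` is a hypothetical helper, not part of the ControlAR code base.

```python
def fuse_tokens(image_tokens, control_tokens):
    """Add each control embedding to the image embedding at the same position.

    Both arguments are lists of equal-length embedding vectors (lists of floats).
    Element-wise addition is an assumption made for illustration.
    """
    if len(image_tokens) != len(control_tokens):
        raise ValueError("image and control sequences must have the same length")
    return [[i + c for i, c in zip(img, ctl)]
            for img, ctl in zip(image_tokens, control_tokens)]

image = [[1.0, 2.0], [3.0, 4.0]]
control = [[0.5, 0.5], [-1.0, 1.0]]

fused = fuse_tokens(image, control)   # sequence length stays at 2
prefilled = control + image           # prefilling doubles the length to 4
print(fused, len(prefilled))
```

Keeping the sequence length unchanged is one plausible reason the fused variant avoids the extra cost that prefilling incurs.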


Authors: Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Published: 2024-10-03 17:28:07+00:00
LLaVA-Critic: Learning to Evaluate Multimodal Models




We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model’s effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.


Authors: Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, Chunyuan Li
Published: 2024-10-03 17:36:33+00:00
Video Instruction Tuning With Synthetic Data




The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.


Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
Published: 2024-10-03 17:36:49+00:00
AlzhiNet: Traversing from 2DCNN to 3DCNN, Towards Early Detection and Diagnosis of Alzheimer’s Disease




Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with increasing prevalence among the aging population, necessitating early and accurate diagnosis for effective disease management. In this study, we present a novel hybrid deep learning framework that integrates both 2D Convolutional Neural Networks (2D-CNN) and 3D Convolutional Neural Networks (3D-CNN), along with a custom loss function and volumetric data augmentation, to enhance feature extraction and improve classification performance in AD diagnosis. According to extensive experiments, AlzhiNet outperforms standalone 2D and 3D models, highlighting the importance of combining these complementary representations of data. The depth and quality of 3D volumes derived from the augmented 2D slices also significantly influence the model’s performance. The results indicate that carefully selecting weighting factors in hybrid predictions is imperative for achieving optimal results. Our framework has been validated on Magnetic Resonance Imaging (MRI) data from the Kaggle and MIRIAD datasets, obtaining accuracies of 98.9% and 99.99%, respectively, with an AUC of 100%. Furthermore, AlzhiNet was studied under a variety of perturbation scenarios on the Alzheimer’s Kaggle dataset, including Gaussian noise, brightness, contrast, salt and pepper noise, color jitter, and occlusion. The results show that AlzhiNet is more robust to perturbations than ResNet-18, making it an excellent choice for real-world applications. This approach represents a promising advancement in the early diagnosis and treatment planning for Alzheimer’s disease.
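The role of the weighting factors in the hybrid prediction can be illustrated with a simple blend of the class probabilities from the two branches. The sketch assumes a convex combination with a single weight `w`; the abstract does not give the actual fusion rule or loss, so the function names, the formula, and the example probabilities are illustrative only.

```python
def hybrid_prediction(p2d, p3d, w):
    """Blend class probabilities from a 2D and a 3D model.

    w is the weight on the 2D branch and (1 - w) the weight on the 3D branch.
    The convex combination is an assumption made for illustration.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [w * a + (1.0 - w) * b for a, b in zip(p2d, p3d)]

def predicted_class(probs):
    """Index of the highest blended probability."""
    return max(range(len(probs)), key=probs.__getitem__)

p2d = [0.6, 0.4]  # hypothetical 2D-CNN output
p3d = [0.2, 0.8]  # hypothetical 3D-CNN output
blended = hybrid_prediction(p2d, p3d, w=0.5)
print(blended, predicted_class(blended))
```

In this toy case the predicted class flips from 1 to 0 as `w` moves from 0.5 to 1.0, which shows why the choice of weight matters when the two branches disagree.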


Authors: Romoke Grace Akindele, Samuel Adebayo, Paul Shekonya Kanda, Ming Yu
Published: 2024-10-03 17:37:18+00:00
