jarxiv | Japanese arxiv | Page 1192

Hierarchical Contextual Manifold Alignment for Structuring Latent Representations in Large Language Models

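Posted: March 26, 2025. Author: jarxiv

Summary

The organization of latent token representations plays an important role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that add computational overhead.
A hierarchical alignment method is introduced that restructures token embeddings without changing the core model weights, so that representational distributions stay coherent across different linguistic contexts.
Experimental evaluation demonstrated improvements in rare-token retrieval, adversarial robustness, and long-range dependency tracking, underscoring the benefit of hierarchical structure in mitigating inconsistencies in latent space organization.
A comparative analysis against conventional fine-tuning and embedding perturbation methods showed that hierarchical restructuring preserves computational efficiency while achieving measurable gains in representation quality.
The structural refinement introduced by the alignment process improved contextual stability across a variety of linguistic tasks, reduced inconsistencies in token proximity relationships, and improved the interpretability of language generation.
A detailed computational assessment confirmed that the realignment process adds minimal inference overhead, so the representational improvements do not compromise model efficiency.
The findings reinforce the broader importance of structured representation learning and show that hierarchical embedding modification can serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.

Summary (original)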

The organization of latent token representations plays a crucial role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that introduce additional computational overhead. A hierarchical alignment method was introduced to restructure token embeddings without altering core model weights, ensuring that representational distributions maintained coherence across different linguistic contexts. Experimental evaluations demonstrated improvements in rare token retrieval, adversarial robustness, and long-range dependency tracking, highlighting the advantages of hierarchical structuring in mitigating inconsistencies in latent space organization. The comparative analysis against conventional fine-tuning and embedding perturbation methods revealed that hierarchical restructuring maintained computational efficiency while achieving measurable gains in representation quality. Structural refinements introduced through the alignment process resulted in improved contextual stability across varied linguistic tasks, reducing inconsistencies in token proximity relationships and enhancing interpretability in language generation. A detailed computational assessment confirmed that the realignment process introduced minimal inference overhead, ensuring that representational improvements did not compromise model efficiency. The findings reinforced the broader significance of structured representation learning, illustrating that hierarchical embedding modifications could serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.

arXiv information

Authors	Meiquan Dong, Haoran Liu, Yan Huang, Zixuan Feng, Jianhong Tang, Ruoxi Wang
Published	2025-03-25 13:13:51+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

Hierarchical Lexical Manifold Projection in Large Language Models: A Novel Mechanism for Multi-Scale Semantic Representation

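Posted: March 26, 2025. Author: jarxiv

Summary

Integrating structured hierarchical embeddings into transformer-based architectures introduces a refined approach to lexical representation that preserves multi-scale semantic relationships without sacrificing computational efficiency.
A projection mechanism that maps tokens onto a structured manifold improves lexical alignment and makes word representations more adaptable across diverse linguistic tasks.
The structured encoding framework keeps hierarchical embeddings coherent across levels of abstraction, enabling stable transitions between localized syntactic features and global semantic structures.
Experimental evaluation indicates that hierarchical embeddings consistently outperform conventional token representations, improving accuracy on linguistic benchmarks while maintaining lower computational overhead.
Comparative analysis across multiple domains highlights the ability of hierarchical embeddings to retain contextual consistency, particularly in specialized language applications where structured lexical alignment is essential.
Statistical assessment further shows that hierarchical embeddings are more robust under perturbation, so linguistic structures remain stable across adversarial text modifications.
Integrating the hierarchical projections with transformer attention mechanisms improves contextual adaptation, so token representations are adjusted dynamically according to varying linguistic distributions.
The refined hierarchical organization of the embeddings offers greater interpretability in lexical modeling and promotes stronger generalization across diverse text processing tasks.

Summary (original)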

The integration of structured hierarchical embeddings into transformer-based architectures introduces a refined approach to lexical representation, ensuring that multi-scale semantic relationships are preserved without compromising computational efficiency. A projection mechanism that maps tokens onto a structured manifold provides improved lexical alignment, enhancing the adaptability of word representations across diverse linguistic tasks. The structured encoding framework ensures that hierarchical embeddings maintain coherence across varying abstraction levels, allowing for stable transitions between localized syntactic features and global semantic structures. Experimental evaluations indicate that hierarchical embeddings consistently outperform conventional token representations, improving accuracy in linguistic benchmarks while maintaining lower computational overhead. Comparative analysis across multiple domains highlights the ability of hierarchical embeddings to retain contextual consistency, particularly in specialized language applications where structured lexical alignment is essential. Statistical assessments further demonstrate that hierarchical embeddings exhibit enhanced robustness under perturbation conditions, ensuring that linguistic structures remain stable across adversarial text modifications. The integration of hierarchical projections with transformer attention mechanisms enables improved contextual adaptation, ensuring that token representations are dynamically adjusted based on varying linguistic distributions. The refined hierarchical organization of embeddings provides greater interpretability in lexical modeling, facilitating enhanced generalization capabilities across diverse text processing tasks.

arXiv information

Authors	Natasha Martus, Sebastian Crowther, Maxwell Dorrington, Jonathan Applethwaite, Edgar Tillinghurst, Quentin Birkenshaw, Lukas Petrov, Constance Willoughby
Published	2025-03-25 13:16:10+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

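Posted: March 26, 2025. Author: jarxiv

Summary

AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality, challenging reasoning problems.
The problems are collected from numerous open-source datasets and have undergone semantic deduplication and meticulous cleaning to eliminate test set contamination.
All responses in the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have passed rigorous verification procedures.
Mathematical problems are verified by checking against reference answers, code problems are verified with test cases, and other tasks are evaluated with a reward model.
The AM-Distill-Qwen-32B model, trained with only simple supervised fine-tuning (SFT) on this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench.
In addition, the AM-Distill-Qwen-72B model also surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks.
We are releasing these 1.4 million problems and their corresponding responses to the research community with the aim of fostering the development of powerful reasoning-oriented large language models (LLMs).
The dataset is published at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.

Summary (original)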

The AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks as well. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset was published at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.

arXiv information

Authors	Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li
Published	2025-03-25 13:19:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification

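Posted: March 26, 2025. Author: jarxiv

Summary

Large vision language models (LVLMs) have shown remarkable capabilities in multimodal tasks such as visual question answering and image captioning.
However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucination, remain an unsolved problem for the trustworthiness of LVLMs.
To address this problem, recent work has proposed incorporating computationally costly large (vision) language models to detect hallucinations at the sentence or subsentence level.
In this work, we introduce MetaToken, a lightweight binary classifier that detects hallucinations at the token level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs.
MetaToken can be applied to any open-source LVLM without any knowledge of ground truth data, providing calibrated detection of hallucinations.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.

Summary (original)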

Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucinations, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data providing a calibrated detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.

arXiv information

Authors	Laura Fieback, Jakob Spiegelberg, Hanno Gottschalk
Published	2025-03-25 13:27:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL, cs.CV, cs.LG

Vocabulary-level Memory Efficiency for Language Model Fine-tuning

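Posted: March 26, 2025. Author: jarxiv

Summary

The extensive memory footprint of language model (LM) fine-tuning poses a challenge for researchers and practitioners alike.
LMs use an embedding matrix to represent extensive vocabularies, and it accounts for a substantial proportion of the model parameters.
While previous work on memory-efficient fine-tuning has focused on minimizing the number of trainable parameters, reducing the memory footprint of the embedding matrix has yet to be explored.
We first demonstrate that a significant proportion of the vocabulary remains unused during fine-tuning.
We then propose a simple yet effective approach that leverages this finding to minimize memory usage.
We show that our approach substantially reduces memory usage across a wide range of models and tasks.
Notably, our approach does not affect downstream task performance while allowing more efficient use of computational resources.

Summary (original)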

The extensive memory footprint of language model (LM) fine-tuning poses a challenge for both researchers and practitioners. LMs use an embedding matrix to represent extensive vocabularies, forming a substantial proportion of the model parameters. While previous work towards memory-efficient fine-tuning has focused on minimizing the number of trainable parameters, reducing the memory footprint of the embedding matrix has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused during fine-tuning. We then propose a simple yet effective approach that leverages this finding to minimize memory usage. We show that our approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach does not impact downstream task performance, while allowing more efficient use of computational resources.
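
As a rough illustration of the idea in this abstract (not the authors' implementation), the sketch below assumes the approach amounts to keeping only the embedding rows for token ids that occur in the fine-tuning data and remapping the ids accordingly. The function names `build_id_map`, `shrink_embeddings`, and `remap` are hypothetical, and plain Python lists stand in for a real embedding matrix.

```python
def build_id_map(tokenized_dataset):
    """Collect the token ids that occur in the data and map old ids to compact new ids."""
    used_ids = sorted({tok for seq in tokenized_dataset for tok in seq})
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    return used_ids, old_to_new


def shrink_embeddings(embedding_matrix, used_ids):
    """Keep only the embedding rows for token ids that are actually used."""
    return [embedding_matrix[i] for i in used_ids]


def remap(sequence, old_to_new):
    """Rewrite a token sequence in terms of the compact ids."""
    return [old_to_new[tok] for tok in sequence]


# Toy example (assumed data): a 10-token vocabulary with 2-dimensional embeddings.
full_embeddings = [[float(i), float(i) * 0.5] for i in range(10)]
dataset = [[3, 7, 3], [9, 7]]

used_ids, old_to_new = build_id_map(dataset)
small_embeddings = shrink_embeddings(full_embeddings, used_ids)
remapped = [remap(seq, old_to_new) for seq in dataset]
print(len(full_embeddings), "->", len(small_embeddings), "rows")
```

In this toy setup only three of the ten rows survive. A real system would also need to map ids back to the full vocabulary after fine-tuning, which the sketch leaves out.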

arXiv information

Authors	Miles Williams, Nikolaos Aletras
Published	2025-03-25 13:30:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

Exploring Cultural Nuances in Emotion Perception Across 15 African Languages

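Posted: March 26, 2025. Author: jarxiv

Summary

Understanding how emotions are expressed across languages is essential for building culturally aware and inclusive NLP systems.
However, emotion expression in African languages is understudied, which limits the development of effective emotion detection tools for these languages.
In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages.
We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variation.
Our findings reveal diverse language-specific patterns of emotional expression: Somali texts are typically longer, while others, such as IsiZulu and Algerian Arabic, show more concise emotional expression.
We observe a higher prevalence of negative sentiment in several Nigerian languages compared with lower negativity in languages such as IsiXhosa.
Further, emotion co-occurrence analysis shows strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections.
Intensity distributions show multimodal patterns with significant variation between language families.
Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin show wider intensity ranges.
These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.

Summary (original)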

Understanding how emotions are expressed across languages is vital for building culturally-aware and inclusive NLP systems. However, emotion expression in African languages is understudied, limiting the development of effective emotion detection tools in these languages. In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages. We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variations. Our findings reveal diverse language-specific patterns in emotional expression — with Somali texts typically longer, while others like IsiZulu and Algerian Arabic show more concise emotional expression. We observe a higher prevalence of negative sentiment in several Nigerian languages compared to lower negativity in languages like IsiXhosa. Further, emotion co-occurrence analysis demonstrates strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections. Intensity distributions show multimodal patterns with significant variations between language families; Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin demonstrate wider intensity ranges. These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.

arXiv information

Authors	Ibrahim Said Ahmad, Shiran Dudy, Tadesse Destaw Belay, Idris Abdulmumin, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Kenneth Church
Published	2025-03-25 13:30:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

A multitask transformer to sign language translation using motion gesture primitives

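Posted: March 26, 2025. Author: jarxiv

Summary

The lack of effective communication with the deaf population represents the main social gap for this community.
Furthermore, sign language, the main communication tool of deaf people, is unlettered; that is, it has no formal written representation.
As a result, the main challenge today is automatic translation between spatiotemporal sign representations and natural text language.
Recent approaches are based on encoder-decoder architectures, in which the most relevant strategies integrate attention modules to enhance non-linear correspondences. In addition, many of these approaches require complex training and architectural schemes to achieve reasonable predictions because they lack intermediate text projections.
However, they are still limited by the redundant background information in the video sequences.
This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation.
The proposed approach also includes a dense motion representation that enhances gestures and captures kinematic information, a key component of sign language.
This representation makes it possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation.
The proposed approach outperforms the state of the art on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% on split 1 and 14.64% on split 2.

Summary (original)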

The absence of effective communication with the deaf population represents the main social gap in this community. Furthermore, sign language, the main deaf communication tool, is unlettered, i.e., there is no formal written representation. In consequence, the main challenge today is the automatic translation between spatiotemporal sign representation and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences. Besides, many of these approximations require complex training and architectural schemes to achieve reasonable predictions, because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. From this representation it is possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state-of-the-art evaluated on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% in split 1, and a BLEU-4 of 14.64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11.58%.

arXiv information

Authors	Fredy Alejandro Mendoza López, Jefferson Rodriguez, Fabio Martínez
Published	2025-03-25 13:53:25+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

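Posted: March 26, 2025. Author: jarxiv

Summary

Large language models (LLMs) have shown impressive versatility as general-purpose models.
However, their broad applicability comes with a high computational overhead, particularly in autoregressive decoding, where each step requires a forward pass.
In domain-specific settings, general-purpose capabilities are unnecessary and can be traded for efficiency.
In this work, we take a novel perspective on domain adaptation, reducing latency and computational cost by adapting the vocabulary to focused domains of interest.
We introduce AdaptiVocab, an end-to-end approach to vocabulary adaptation designed to enhance LLM efficiency in low-resource domains.
AdaptiVocab can be applied to any tokenizer and architecture. It modifies the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation.
AdaptiVocab initializes the new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be run efficiently on a single GPU.
We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance.
Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.

Summary (original)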

Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes at a high-cost computational overhead, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
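
The embedding initialization mentioned in this abstract can be pictured with a small sketch. The abstract only says that new n-token embeddings are an exponentially weighted combination of existing embeddings, so everything else here is an assumption made for illustration: the names `exp_weights` and `init_ngram_embedding`, the `rate` parameter, the normalization to weights that sum to 1, and the `favor_last` switch that chooses which end of the n-gram gets the largest weight.

```python
import math


def exp_weights(n, rate=1.0, favor_last=True):
    """Return n exponentially growing weights, normalized to sum to 1 (assumed scheme)."""
    raw = [math.exp(rate * i) for i in range(n)]
    if not favor_last:
        raw.reverse()  # put the largest weight on the first constituent token instead
    total = sum(raw)
    return [r / total for r in raw]


def init_ngram_embedding(token_embeddings, rate=1.0, favor_last=True):
    """Combine the embeddings of an n-gram's constituent tokens into one vector."""
    weights = exp_weights(len(token_embeddings), rate, favor_last)
    dim = len(token_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]


# Toy example (assumed data): a 3-token n-gram with 2-dimensional embeddings.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = exp_weights(3)
new_embedding = init_ngram_embedding(token_embeddings)
print(weights, new_embedding)
```

Because the weights are normalized, the new vector stays within the range spanned by the constituent embeddings, which is one plausible reason to start fine-tuning from such a combination rather than from random values.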

arXiv information

Authors	Itay Nakash, Nitay Calderon, Eyal Ben David, Elad Hoffer, Roi Reichart
Published	2025-03-25 14:18:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation

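Posted: March 26, 2025. Author: jarxiv

Summary

This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT).
The goal of the task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems.
The task covers 10 target languages with English as the source.
In this paper, we describe the different systems we employed, detail our results, and discuss the insights gained from our experiments.

Summary (original)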

This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.

arXiv information

Authors	Abdulhamid Abubakar, Hamidatu Abdulkadir, Ibrahim Rabiu Abdullahi, Abubakar Auwal Khalid, Ahmad Mustapha Wali, Amina Aminu Umar, Maryam Bala, Sani Abdullahi Sani, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Vukosi Marivate
Published	2025-03-25 14:29:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

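Posted: March 26, 2025. Author: jarxiv

Summary

Large-scale contrastive vision-language pre-training has brought significant progress in visual representation learning.
Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced in \cite{radford2021learning} that directly learns to align images with raw text in an open-vocabulary setting.
On downstream tasks, a carefully chosen text prompt is used to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples.
Whereas prompt tuning acts on the textual inputs, we propose CLIP-Adapter, which fine-tunes with feature adapters on either the visual or the language branch.
Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and blends them with the original pre-trained features in a residual style. As a result, CLIP-Adapter can outperform context optimization while keeping a simple design.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.

Summary (original)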

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in \cite{radford2021learning} to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach. Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.
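
The bottleneck-plus-residual-blending idea described in this abstract can be sketched in a few lines. This is an illustrative toy, not the released CLIP-Adapter code: the two bias-free linear layers with ReLU, the blending ratio `alpha`, the random weights, and the helper names `linear`, `relu`, and `adapter_blend` are all assumptions chosen to keep the example self-contained.

```python
import random


def linear(x, w):
    """Apply a bias-free linear layer: x has n entries, w has n rows and m columns."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def adapter_blend(feature, w_down, w_up, alpha):
    """Project down to a bottleneck, back up, then blend with the original feature."""
    hidden = relu(linear(feature, w_down))   # bottleneck layer (assumed ReLU activation)
    adapted = relu(linear(hidden, w_up))     # back to the original feature dimension
    # Residual-style blending: alpha weights the new features, 1 - alpha the original.
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(adapted, feature)]


# Toy example (assumed sizes): 4-dimensional feature, 2-dimensional bottleneck.
random.seed(0)
dim, bottleneck = 4, 2
w_down = [[random.uniform(-1, 1) for _ in range(bottleneck)] for _ in range(dim)]
w_up = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(bottleneck)]
feature = [0.5, -0.2, 0.1, 0.9]

blended = adapter_blend(feature, w_down, w_up, alpha=0.2)
print(blended)
```

With `alpha` set to 0 the function returns the pre-trained feature unchanged, which shows how the blend lets the adapter add new information without discarding what the frozen model already encodes.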

arXiv information

Authors	Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao
Published	2025-03-25 14:34:04+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CL, cs.CV
