jarxiv | Japanese arxiv | Page 1386

JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments

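Posted: March 12, 2025 | Author: jarxiv

Summary

This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal information retrieval (LIR).
The dataset is freely available and consists of 16,045 jurisprudential documents from the Brazilian Federal Court of Accounts, along with 150 queries annotated with relevance judgments.
It addresses the scarcity of Portuguese-language LIR datasets with query relevance annotations.
The queries fall into three groups: real user keyword-based queries, synthetic keyword-based queries, and synthetic question-based queries.
Relevance judgments were produced through a hybrid approach that combines LLM-based scoring with validation by domain experts.
We used JurisTCU in 14 experiments covering lexical search (document expansion methods) and semantic search (BERT-based and OpenAI embeddings).
Document expansion methods significantly improve standard BM25 search on this dataset, with gains exceeding 45% in P@10, R@10, and nDCG@10 on short keyword-based queries.
Among the embedding models, the OpenAI models produced the best results, with gains of approximately 70% in P@10, R@10, and nDCG@10 for short keyword-based queries, suggesting that these dense embeddings capture semantic relationships in this domain beyond what lexical matching provides.
Besides offering the Portuguese-language IR research community a dataset suitable for evaluating search systems, the results also help improve a search system that is highly relevant to Brazilian citizens.

Abstract (original)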

This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal information retrieval (LIR). The dataset is freely available and consists of 16,045 jurisprudential documents from the Brazilian Federal Court of Accounts, along with 150 queries annotated with relevance judgments. It addresses the scarcity of Portuguese-language LIR datasets with query relevance annotations. The queries are organized into three groups: real user keyword-based queries, synthetic keyword-based queries, and synthetic question-based queries. Relevance judgments were produced through a hybrid approach combining LLM-based scoring with expert domain validation. We used JurisTCU in 14 experiments using lexical search (document expansion methods) and semantic search (BERT-based and OpenAI embeddings). We show that the document expansion methods significantly improve the performance of standard BM25 search on this dataset, with improvements exceeding 45% in P@10, R@10, and nDCG@10 metrics when evaluating short keyword-based queries. Among the embedding models, the OpenAI models produced the best results, with improvements of approximately 70% in P@10, R@10, and nDCG@10 metrics for short keyword-based queries, suggesting that these dense embeddings capture semantic relationships in this domain, surpassing the reliance on lexical terms. Besides offering a dataset for the Portuguese-language IR research community, suitable for evaluating search systems, the results also contribute to enhancing a search system highly relevant to Brazilian citizens.
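The three metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names, the toy document IDs, and the use of linear gains in nDCG are all assumptions, and the abstract does not say which nDCG gain formulation was used.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG@k with linear gains (an assumption); gains maps doc_id -> grade."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judgments for one query: doc_id -> graded relevance.
gains = {"d1": 3, "d3": 1, "d9": 2}
relevant = {doc_id for doc_id, grade in gains.items() if grade > 0}
ranking = ["d1", "d2", "d3", "d4"]

print(precision_at_k(ranking, relevant, k=2))
print(recall_at_k(ranking, relevant, k=4))
print(ndcg_at_k(ranking, gains, k=4))
```

The improvements quoted in the abstract are relative changes in these scores between a baseline such as BM25 and an alternative system, averaged over queries.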

arXiv information

Authors	Leandro Carísio Fernandes,Leandro dos Santos Ribeiro,Marcos Vinícius Borela de Castro,Leonardo Augusto da Silva Pacheco,Edans Flávius de Oliveira Sandes
Published	2025-03-11 12:39:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL, cs.IR

OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning

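Posted: March 12, 2025 | Author: jarxiv

Summary

In this paper, we analyze and empirically show that relevance learned for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios.
To bridge this gap, we introduce OpenRAG, a RAG framework optimized end-to-end by tuning the retriever to capture in-context relevance, which enables adaptation to diverse and evolving needs.
Extensive experiments across a wide range of tasks show that OpenRAG, by tuning the retriever end-to-end, yields a consistent 4.0% improvement over the original retriever and outperforms existing state-of-the-art retrievers by 2.1%.
Our results also indicate that, for some tasks, an end-to-end tuned 0.2B retriever can achieve larger improvements than RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of the approach for enhancing RAG systems.

Abstract (original)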

In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.

arXiv information

Authors	Jiawei Zhou,Lei Chen
Published	2025-03-11 13:04:05+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.IR

Detect, Investigate, Judge and Determine: A Knowledge-guided Framework for Few-shot Fake News Detection

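Posted: March 12, 2025 | Author: jarxiv

Summary

Few-shot fake news detection (FS-FND) aims to distinguish inaccurate news from real news in extremely low-resource scenarios.
The task has attracted growing attention because of the widespread dissemination and harmful impact of fake news on social media.
Large language models (LLMs) have demonstrated competitive performance thanks to their rich prior knowledge and strong in-context learning abilities.
However, existing methods face significant limitations, such as understanding ambiguity and information scarcity, which substantially undermine the potential of LLMs.
To address these shortcomings, we propose the Dual-perspective Knowledge-guided Fake News Detection (DKFND) model, designed to enhance LLMs from both inside and outside perspectives.
Specifically, DKFND first identifies the knowledge concepts of each news article through a Detection Module.
An Investigation Module then retrieves valuable inside and outside information about the current news item, and a Judge Module evaluates the relevance and confidence of that information.
Finally, a Determination Module derives two respective predictions and obtains the final result.
Extensive experiments on two public datasets show the effectiveness of the proposed method, particularly in low-resource settings.

Abstract (original)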

Few-Shot Fake News Detection (FS-FND) aims to distinguish inaccurate news from real ones in extremely low-resource scenarios. This task has garnered increased attention due to the widespread dissemination and harmful impact of fake news on social media. Large Language Models (LLMs) have demonstrated competitive performance with the help of their rich prior knowledge and excellent in-context learning abilities. However, existing methods face significant limitations, such as Understanding Ambiguity and Information Scarcity, which significantly undermine the potential of LLMs. To address these shortcomings, we propose a Dual-perspective Knowledge-guided Fake News Detection (DKFND) model, designed to enhance LLMs from both inside and outside perspectives. Specifically, DKFND first identifies the knowledge concepts of each news article through a Detection Module. Subsequently, DKFND creatively designs an Investigation Module to retrieve inside and outside valuable information concerning the current news, followed by another Judge Module to evaluate its relevance and confidence. Finally, a Determination Module further derives two respective predictions and obtains the final result. Extensive experiments on two public datasets show the efficacy of our proposed method, particularly in low-resource settings.

arXiv information

Authors	Ye Liu,Jiajun Zhu,Xukai Liu,Haoyu Tang,Yanghai Zhang,Kai Zhang,Xiaofang Zhou,Enhong Chen
Published	2025-03-11 13:06:04+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL

Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information

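Posted: March 12, 2025 | Author: jarxiv

Summary

The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and to contribute to the broader debate on the use of automated means for veracity identification.
To this end, we use an AI auditing methodology that systematically evaluates the performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts about a large set of statements (16,513) fact-checked by professional journalists.
Specifically, we use topic modeling and regression analysis to investigate which factors (e.g., the topic of the prompt or the LLM type) affect the evaluation of true, false, and mixed statements.
Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than the other models, overall performance across models remains modest.
Notably, the models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, which suggests possible guardrails that may enhance accuracy on these topics.
The major implication of our findings is that using LLMs for fact-checking poses significant challenges, including substantial variation in performance across LLMs and unequal output quality for specific topics, which can be attributed to deficits in training data.
Our research highlights the potential and limitations of LLMs in political fact-checking and suggests avenues for further improvement in guardrails as well as fine-tuning.

Abstract (original)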

The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use an AI auditing methodology that systematically evaluates the performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g., topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges for using LLMs for fact-checking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.

arXiv information

Authors	Elizaveta Kuznetsova,Ilaria Vitulano,Mykola Makhortykh,Martha Stolze,Tomas Nagy,Victoria Vziatysheva
Published	2025-03-11 13:06:40+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.CY

Towards Zero-Shot Multimodal Machine Translation

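Posted: March 12, 2025 | Author: jarxiv

Summary

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images).
However, this type of data is costly to collect, which limits the extension of MMT to other language pairs for which no such data exists.
In this work, we propose a method that bypasses the need for fully supervised data to train MMT systems, using multimodal English data only.
The method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs.
We evaluate on standard MMT benchmarks and on the recently released CoMMuTE, a contrastive benchmark that measures how well models use images to disambiguate English sentences.
We obtain disambiguation performance close to that of state-of-the-art MMT models trained additionally on fully supervised examples.
To show that the method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian, and Chinese.
We further show that classifier-free guidance lets us control the trade-off between disambiguation capability and translation fidelity at inference time, without any additional data.
Our code, data, and trained models are publicly accessible.

Abstract (original)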

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
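Two ingredients named in the abstract, the Kullback-Leibler divergence and classifier-free guidance, can be illustrated on toy next-token logits. The sketch below uses the commonly cited guidance form (text-only logits plus a scaled difference to the image-conditioned logits). Whether ZeroMMT applies exactly this form is an assumption, and the function names, the `gamma` parameter, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def guided_logits(text_only_logits, multimodal_logits, gamma):
    """Generic classifier-free guidance form (an assumption, not the paper's code).

    gamma = 0 reproduces the text-only model, gamma = 1 the image-conditioned
    model, and gamma > 1 pushes further toward what the image changes.
    """
    return [t + gamma * (m - t)
            for t, m in zip(text_only_logits, multimodal_logits)]

# Toy logits over a three-word vocabulary (hypothetical values).
text_only = [2.0, 1.0, 0.0]
with_image = [1.0, 2.5, 0.0]

print(kl_divergence(softmax(text_only), softmax(with_image)))
for gamma in (0.0, 1.0, 2.0):
    probs = softmax(guided_logits(text_only, with_image, gamma))
    print(gamma, [round(p, 3) for p in probs])
```

Raising `gamma` shifts probability toward the token the image favors, which is one way to picture the trade-off between disambiguation and fidelity to the text-only translation.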

arXiv information

Authors	Matthieu Futeral,Cordelia Schmid,Benoît Sagot,Rachel Bawden
Published	2025-03-11 13:07:09+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL

VAGUE: Visual Contexts Clarify Ambiguous Expressions

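Posted: March 12, 2025 | Author: jarxiv

Summary

Human communication often relies on visual cues to resolve ambiguity.
While humans can integrate these cues intuitively, AI systems often find it challenging to engage in sophisticated multimodal reasoning.
We introduce VAGUE, a benchmark that evaluates the ability of multimodal AI systems to integrate visual context for intent disambiguation.
VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is apparent only with the visual context.
The dataset spans both staged, complex scenes (Visual Commonsense Reasoning) and natural, personal scenes (Ego4D), ensuring diversity.
Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
While performance consistently improves as more visual cues are introduced, overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning.
Analysis of failure cases shows that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not reason with them effectively.
We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.

Abstract (original)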

Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems’ ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker’s true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.

arXiv information

Authors	Heejeong Nam,Jinwoo Ahn,Keummin Ka,Jiwan Chung,Youngjae Yu
Published	2025-03-11 13:29:47+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.CV

Hysteresis Activation Function for Efficient Inference

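Posted: March 12, 2025 | Author: jarxiv

Summary

The widely used ReLU is favored for its hardware efficiency, as its implementation at inference is a one-bit sign case, yet it suffers from issues such as the “dying ReLU” problem during training.
Traditional approaches to mitigating this issue often introduce more complex and less hardware-friendly activation functions.
In this work, we propose the Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the “dying ReLU” problem with minimal complexity.
Unlike traditional activation functions, which use fixed thresholds for training and inference, HeLU employs a variable threshold that refines backpropagation.
This refined mechanism allows simpler activation functions to achieve performance competitive with their more complex counterparts without introducing unnecessary complexity or requiring inductive biases.
Empirical evaluations show that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.

Abstract (original)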

The widely used ReLU is favored for its hardware efficiency, as the implementation at inference is a one bit sign case, yet suffers from issues such as the “dying ReLU” problem, where during training, neurons fail to activate and constantly remain at zero, as highlighted by Lu et al. Traditional approaches to mitigate this issue often introduce more complex and less hardware-friendly activation functions. In this work, we propose a Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the “dying ReLU” problem with minimal complexity. Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation. This refined mechanism allows simpler activation functions to achieve competitive performance comparable to their more complex counterparts without introducing unnecessary complexity or requiring inductive biases. Empirical evaluations demonstrate that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.
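One way to picture a threshold that differs between the forward and backward pass is the scalar sketch below. The abstract only states that HeLU uses a variable threshold that refines backpropagation. The direction of the shift, the parameter name `alpha`, and its value are assumptions made for illustration, not the published definition.

```python
def helu_forward(x):
    """Forward pass: identical to ReLU, so inference stays a sign check."""
    return x if x > 0.0 else 0.0

def relu_backward(x, upstream_grad):
    """Standard ReLU gradient: blocked for every non-positive input."""
    return upstream_grad if x > 0.0 else 0.0

def helu_backward(x, upstream_grad, alpha=0.05):
    """Backward pass with a shifted threshold (hypothetical alpha).

    Inputs slightly below zero still receive gradient, which gives a unit
    that has just gone inactive a chance to recover during training.
    """
    return upstream_grad if x > -alpha else 0.0

for x in (-1.0, -0.02, 0.3):
    print(x, helu_forward(x), relu_backward(x, 1.0), helu_backward(x, 1.0))
```

In this sketch the two functions disagree only inside the band between `-alpha` and zero, so the inference cost is unchanged and only training behavior differs.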

arXiv information

Authors	Moshe Kimhi,Idan Kashani,Avi Mendelson,Chaim Baskin
Published	2025-03-11 13:41:59+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.LG, cs.NE

Decoding Echo Chambers: LLM-Powered Simulations Revealing Polarization in Social Networks

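Posted: March 12, 2025 | Author: jarxiv

Summary

The impact of social media on critical issues such as echo chambers needs to be addressed, as these phenomena can have disruptive consequences for society.
Traditional research often oversimplifies emotional tendencies and opinion evolution into numbers and formulas, neglecting that news and communication are conveyed through text, which limits these approaches.
In this work, we therefore propose an LLM-based simulation of social opinion networks to evaluate and counter polarization phenomena.
We first construct three typical network structures to simulate different characteristics of social interactions.
Agents then interact based on recommendation algorithms and update their strategies through reasoning and analysis.
By comparing these interactions with the classic Bounded Confidence Model (BCM) and the Friedkin-Johnsen (FJ) model, and by using echo chamber-related indices, we demonstrate the effectiveness of the framework in simulating opinion dynamics and reproducing phenomena such as opinion polarization and echo chambers.
We propose two mitigation methods, active and passive nudges, that can help reduce echo chambers, specifically within language-based simulations.
We hope this work will offer valuable insights and guidance for mitigating social polarization.

Abstract (original)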

The impact of social media on critical issues such as echo chambers needs to be addressed, as these phenomena can have disruptive consequences for our society. Traditional research often oversimplifies emotional tendencies and opinion evolution into numbers and formulas, neglecting that news and communication are conveyed through text, which limits these approaches. Hence, in this work, we propose an LLM-based simulation for the social opinion network to evaluate and counter polarization phenomena. We first construct three typical network structures to simulate different characteristics of social interactions. Then, agents interact based on recommendation algorithms and update their strategies through reasoning and analysis. By comparing these interactions with the classic Bounded Confidence Model (BCM), the Friedkin Johnsen (FJ) model, and using echo chamber-related indices, we demonstrate the effectiveness of our framework in simulating opinion dynamics and reproducing phenomena such as opinion polarization and echo chambers. We propose two mitigation methods, active and passive nudges, that can help reduce echo chambers, specifically within language-based simulations. We hope our work will offer valuable insights and guidance for social polarization mitigation.
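The two numeric baselines mentioned in the abstract can be sketched as single update steps. This is a minimal illustration under stated assumptions: the bounded confidence step follows the Hegselmann-Krause averaging variant (the abstract does not say which variant was used), the Friedkin-Johnsen step uses the textbook form with per-agent susceptibility, and all function names and toy values are hypothetical.

```python
def bounded_confidence_step(opinions, epsilon):
    """Each agent averages the opinions within distance epsilon of its own."""
    updated = []
    for own in opinions:
        close = [other for other in opinions if abs(other - own) <= epsilon]
        updated.append(sum(close) / len(close))  # close always contains own
    return updated

def friedkin_johnsen_step(opinions, initial, weights, susceptibility):
    """x_i <- s_i * sum_j w_ij * x_j + (1 - s_i) * x_i(0).

    weights is a row-stochastic influence matrix; susceptibility[i] in [0, 1]
    controls how far agent i moves away from its initial opinion.
    """
    updated = []
    for i, row in enumerate(weights):
        social = sum(w * x for w, x in zip(row, opinions))
        updated.append(susceptibility[i] * social
                       + (1.0 - susceptibility[i]) * initial[i])
    return updated

# Two nearby agents converge while the distant one stays isolated.
print(bounded_confidence_step([0.0, 0.1, 0.9], epsilon=0.2))

# Agent 0 is fully open to influence, agent 1 is fully stubborn.
start = [0.0, 1.0]
print(friedkin_johnsen_step(start, start, [[0.5, 0.5], [0.5, 0.5]], [1.0, 0.0]))
```

Both models reduce an opinion to a single number per agent, which is the simplification the abstract contrasts with text-based LLM agents.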

arXiv information

Authors	Chenxi Wang,Zongfang Liu,Dequan Yang,Xiuying Chen
Published	2025-03-11 13:44:27+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.SI

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

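Posted: March 12, 2025 | Author: jarxiv

Summary

Long-context LLMs have enabled numerous downstream applications but have also introduced significant challenges related to computational and memory efficiency.
To address these challenges, optimizations for long-context inference have been developed, centered on the KV cache.
However, existing benchmarks often evaluate single requests only, neglecting the full lifecycle of the KV cache in real-world use.
This oversight is particularly critical because KV cache reuse has been widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic.
To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading.
Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task.
With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression.
The evaluation is conducted on 8 long-context LLMs.
Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly.
Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage while retaining strong performance.
We additionally identify attention distribution shift issues in long-generation scenarios.
https://aka.ms/SCBench.

Abstract (original)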

Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request settings, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading. Specifically, SCBench uses test examples with shared context, ranging 12 tasks with two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
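The shared-context setting can be pictured with a toy cache that pays the pre-filling cost once per context and reuses the result for later queries. This is a conceptual sketch only: the class, its methods, and the token-list stand-in for real key/value tensors are hypothetical and do not reflect the SCBench code or any inference framework's API.

```python
class SharedContextCache:
    """Toy stand-in for KV cache reuse: one cached entry per shared context."""

    def __init__(self):
        self._store = {}
        self.prefill_calls = 0

    def _prefill(self, context):
        # Placeholder for the expensive pre-filling pass over a long context;
        # a real system would produce key/value tensors, not a token list.
        self.prefill_calls += 1
        return context.split()

    def ask(self, context, query):
        # Generate the cache on first use, then load it for follow-up queries.
        if context not in self._store:
            self._store[context] = self._prefill(context)
        cached_tokens = self._store[context]
        return {"query": query, "cached_tokens": len(cached_tokens)}

cache = SharedContextCache()
shared = "a long shared document that several follow-up queries refer to"
first = cache.ask(shared, "What is the document about?")
second = cache.ask(shared, "Who refers to it?")
print(cache.prefill_calls, first["cached_tokens"], second["cached_tokens"])
```

A single-request benchmark would only ever exercise the first call, which is the gap the abstract describes: methods that compress or drop cache entries are stressed by the second and later queries.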

arXiv information

Authors	Yucheng Li,Huiqiang Jiang,Qianhui Wu,Xufang Luo,Surin Ahn,Chengruidong Zhang,Amir H. Abdi,Dongsheng Li,Jianfeng Gao,Yuqing Yang,Lili Qiu
Published	2025-03-11 14:02:04+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.LG

Stick to Facts: Towards Fidelity-oriented Product Description Generation

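Posted: March 12, 2025 | Author: jarxiv

Summary

Unlike other text generation tasks, product description generation makes it vitally important to generate faithful descriptions that stick to the product attribute information.
However, little attention has been paid to this problem.
To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG).
Because product attribute information is always conveyed by entity words, FPDG takes the entity label of each word into account.
Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, which takes both the embedding and the entity label of each word as input.
Second, we establish a keyword memory that stores entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels.
Experiments on a large-scale real-world product description dataset show that the model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations.
Specifically, FPDG increases the fidelity of the generated descriptions by 25%.

Abstract (original)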

Different from other text generation tasks, in product description generation, it is of vital importance to generate faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is always conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, taking both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.
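The keyword memory described in the abstract, with entity labels as keys and keywords as values, resembles key-value attention. The sketch below shows that general mechanism with dot-product scoring. The scoring function, the vector sizes, and all names are assumptions for illustration; the abstract does not specify how FPDG computes its attention.

```python
import math

def attend_keyword_memory(query, label_keys, keyword_values):
    """Score the query against entity-label keys, then mix keyword values.

    Returns (attention_weights, context_vector). Dot-product scoring is an
    assumption; other scoring functions would fit the same structure.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in label_keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(keyword_values[0])
    context = [sum(w * value[d] for w, value in zip(weights, keyword_values))
               for d in range(dim)]
    return weights, context

# Hypothetical 2-d embeddings: two entity labels, each paired with a keyword.
label_keys = [[1.0, 0.0], [0.0, 1.0]]
keyword_values = [[1.0, 2.0], [3.0, 4.0]]
decoder_state = [10.0, 0.0]  # strongly matches the first entity label

weights, context = attend_keyword_memory(decoder_state, label_keys, keyword_values)
print([round(w, 4) for w in weights], [round(c, 4) for c in context])
```

Because the decoder state matches the first label, nearly all attention weight falls on the first keyword, so the context vector is close to that keyword's embedding.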

arXiv information

Authors	Zhangming Chan,Xiuying Chen,Yongliang Wang,Juntao Li,Zhiqiang Zhang,Kun Gai,Dongyan Zhao,Rui Yan
Published	2025-03-11 14:04:24+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL
