jarxiv | Japanese arxiv

JTCSE: Joint Tensor-Modulus Constraints and Cross-Attention for Unsupervised Contrastive Learning of Sentence Embeddings

Posted: May 8, 2025 | Author: jarxiv

Abstract

Unsupervised contrastive learning has become a hot research topic in natural language processing. Existing works usually aim to constrain the orientation distribution of the representations of positive and negative samples in the high-dimensional semantic space, but the semantic representation tensor possesses both modulus and orientation features; by ignoring the modulus feature, existing works leave contrastive learning insufficient. Therefore, we first propose a training objective that imposes modulus constraints on the semantic representation tensor, to strengthen the alignment between positive samples in contrastive learning. Second, BERT-like models suffer from the phenomenon of sinking attention, which leads to a lack of attention to the CLS token that aggregates semantic information. In response, we propose a cross-attention structure among the twin-tower ensemble models to enhance the model's attention to the CLS token and optimize the quality of CLS pooling. Combining these two motivations, we propose JTCSE, a new Joint Tensor representation modulus constraint and Cross-attention unsupervised contrastive learning Sentence Embedding representation framework. We evaluate it on seven semantic text similarity computation tasks, and the experimental results show that JTCSE's twin-tower ensemble model and single-tower distillation model outperform the other baselines and become the current SOTA. In addition, we have conducted an extensive zero-shot downstream task evaluation, which shows that JTCSE outperforms other baselines overall on more than 130 tasks.
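The difference between orientation and modulus can be illustrated with a small sketch. The function names and the relative norm-gap penalty below are assumptions made for illustration, not the training objective defined in the paper: the cosine term only sees direction, while the second term only sees vector length.

```python
import math

def cosine_similarity(u, v):
    """Orientation agreement between two embeddings (ignores their length)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def modulus_gap(u, v):
    """Relative difference in vector length (hypothetical penalty term)."""
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    return abs(norm_u - norm_v) / max(norm_u, norm_v)

# Two "positive" embeddings that point the same way but differ in length.
anchor = [1.0, 2.0, 2.0]      # Euclidean norm 3
positive = [2.0, 4.0, 4.0]    # Euclidean norm 6, same direction

direction_score = cosine_similarity(anchor, positive)
length_penalty = modulus_gap(anchor, positive)
print(direction_score, length_penalty)
```

A purely orientation-based objective would treat this pair as perfectly aligned, whereas a modulus-aware term still registers a gap between them.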

arXiv information

Authors: Tianyu Zong, Hongzhu Yi, Bingkang Shi, Yuanxiang Wang, Jungang Xu
Published: 2025-05-07 01:11:50+00:00

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG

Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation

Posted: May 8, 2025 | Author: jarxiv

Abstract

Despite significant advancements in automated report generation, the opaqueness of text interpretability continues to cast doubt on the reliability of the content produced. This paper introduces a novel approach to identify specific image features in X-ray images that influence the outputs of report generation models. Specifically, we propose the Cyclic Vision-Language Manipulator (CVLM), a module that generates a manipulated X-ray from an original X-ray and its report from a designated report generator. The essence of CVLM is that cycling manipulated X-rays back to the report generator produces altered reports aligned with the alterations pre-injected into the reports for X-ray generation, hence the term 'cyclic manipulation'. This process allows direct comparison between original and manipulated X-rays, clarifying the critical image features driving changes in reports and enabling model users to assess the reliability of the generated texts. Empirical evaluations demonstrate that CVLM can identify more precise and reliable features compared to existing explanation methods, significantly enhancing the transparency and applicability of AI-generated reports.

arXiv information

Authors: Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Zhiling Yue, Yijian Gao, Junzhi Ning, Zhi Li, Simon Walsh, Guang Yang
Published: 2025-05-07 01:51:52+00:00

Categories: cs.AI, cs.CL, cs.CV, cs.LG

SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation

Posted: May 8, 2025 | Author: jarxiv

Abstract

Efficient path planning in robotics, particularly within large-scale, dynamic environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability in dynamic scenarios hinder real-time deployment on edge devices. We present SmallPlan, a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scale 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance and number of trials. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics.

arXiv information

Authors: Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Kentaro Inui, Dezhen Song
Published: 2025-05-07 02:00:27+00:00

Categories: cs.CL, cs.RO

LLAMAPIE: Proactive In-Ear Conversation Assistants

Posted: May 8, 2025 | Author: jarxiv

Abstract

We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
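The two-model pipeline described in the abstract can be sketched as a cheap gate that runs on every turn and a generator that runs only when the gate fires. Everything named below (`proactive_step`, `toy_gate`, `toy_generator`, the trigger phrase) is a hypothetical stand-in for illustration; the actual models and decision criteria are not specified in the abstract.

```python
from typing import Callable, List, Optional

def proactive_step(
    history: List[str],
    should_respond: Callable[[List[str]], bool],
    generate: Callable[[List[str]], str],
) -> Optional[str]:
    """Run the small gate first; call the larger generator only when it fires."""
    if not should_respond(history):
        return None  # stay silent so the conversation is not interrupted
    return generate(history)

# Toy stand-ins (assumptions): real models would replace both functions.
def toy_gate(history: List[str]) -> bool:
    # Fires when the latest utterance looks like a memory lapse.
    return "what was" in history[-1].lower()

def toy_generator(history: List[str]) -> str:
    return "hint: (short whispered reminder)"

dialogue = ["Nice to meet you.", "What was the name of that restaurant?"]
# Replay the dialogue turn by turn, as a streaming assistant would see it.
whispers = [
    proactive_step(dialogue[: i + 1], toy_gate, toy_generator)
    for i in range(len(dialogue))
]
print(whispers)
```

Keeping the gate small is what would make per-turn, on-device evaluation affordable in a design like this; the expensive generator is invoked only for the turns that need it.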

arXiv information

Authors: Tuochao Chen, Nicholas Batchelder, Alisa Liu, Noah Smith, Shyamnath Gollakota
Published: 2025-05-07 02:08:56+00:00

Categories: cs.CL, cs.LG, eess.AS

Advancing and Benchmarking Personalized Tool Invocation for LLMs

Posted: May 8, 2025 | Author: jarxiv

Abstract

Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct PTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is publicly available at https://github.com/hyfshadow/PTBench.
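The Profile-dependent Query task can be pictured with a minimal sketch: parameters the query leaves out are looked up in the user profile, and anything still missing is reported. The function name, the tool parameters, and the profile fields are all hypothetical; in the paper the inference is performed by the model, not by a dictionary lookup.

```python
from typing import Dict, List, Tuple

def fill_from_profile(
    call_args: Dict[str, str],
    required: List[str],
    profile: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Complete a tool call with profile values for parameters the query omitted."""
    filled = dict(call_args)  # never mutate the caller's arguments
    missing = []
    for name in required:
        if name in filled:
            continue  # the query already supplied this parameter
        if name in profile:
            filled[name] = profile[name]
        else:
            missing.append(name)  # cannot be inferred; would need a follow-up
    return filled, missing

# Query: "Order my usual pizza" gives the item but no address or payment method.
query_args = {"item": "margherita"}
required_params = ["item", "address", "payment_method", "delivery_time"]
user_profile = {"address": "12 Example Street", "payment_method": "card"}

filled_args, unresolved = fill_from_profile(query_args, required_params, user_profile)
print(filled_args, unresolved)
```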

arXiv information

Authors: Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
Published: 2025-05-07 02:25:20+00:00

Categories: cs.AI, cs.CL

Natural Language Generation in Healthcare: A Review of Methods and Applications

Posted: May 8, 2025 | Author: jarxiv

Abstract

Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.

arXiv information

Authors: Mengxian Lyu, Xiaohan Li, Ziyi Chen, Jinqian Pan, Cheng Peng, Sankalp Talankar, Yonghui Wu
Published: 2025-05-07 02:25:29+00:00

Categories: cs.CL

Large Language Models Are Struggle to Cope with Unreasonability in Math Problems

Posted: May 8, 2025 | Author: jarxiv

Abstract

Recent research has demonstrated LLMs' impressive performance in math and reasoning. However, the capacity of LLMs to address math problems under unconventional conditions, such as internal inconsistencies and flawed assumptions, remains largely unexplored. In this paper, we propose a novel benchmark, Unreasonable Math Problem (UMP), designed to assess LLMs' ability to recognize and respond to unreasonability in math problems. The benchmark consists of a carefully curated collection of unreasonable math questions across diverse types. Based on extensive experiments covering 19 LLMs, we observe that even state-of-the-art models such as GPT-4o achieve only limited performance of 0.6 on UMP, while reasoning models such as DeepSeek-R1 are prone to overthinking and instability. We further explore strategies for improving the recognition of unreasonable inputs, shedding light on both the possibilities and limitations of LLMs in this challenging setting.

arXiv information

Authors: Jingyuan Ma, Damai Dai, Zihang Yuan, Rui li, Weilin Luo, Bin Wang, Qun Liu, Lei Sha, Zhifang Sui
Published: 2025-05-07 03:14:49+00:00

Categories: cs.CL

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models

Posted: May 8, 2025 | Author: jarxiv

Abstract

Recently, large language models (LLMs) such as GPT-4 have exhibited remarkable conversational abilities, enabling them to engage in dynamic and contextually relevant dialogues across a wide range of topics. However, given a long conversation, these chatbots fail to recall past information and tend to generate inconsistent responses. To address this, we propose to recursively generate summaries/memory using LLMs to enhance long-term memory ability. Specifically, our method first stimulates LLMs to memorize small dialogue contexts and then recursively produces new memory using the previous memory and the following contexts. Finally, the chatbot can easily generate a highly consistent response with the help of the latest memory. We evaluate our method on both open and closed LLMs, and experiments on a widely used public dataset show that our method can generate more consistent responses in long-context conversations. We also show that our strategy nicely complements both long-context (e.g., 8K and 16K) and retrieval-enhanced LLMs, further improving long-term dialogue performance. Notably, our method is a potential solution for enabling LLMs to model extremely long contexts. The code and scripts will be released later.
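The recursion described above amounts to folding dialogue chunks into a running memory, where each new memory is produced from the previous memory and the next chunk. The sketch below shows only that control flow; `toy_summarize` is a hypothetical rule-based stand-in for the LLM call, and the function names are assumptions rather than the authors' code.

```python
from typing import Callable, List

def recursive_memory(
    sessions: List[List[str]],
    summarize: Callable[[str, List[str]], str],
) -> str:
    """Fold dialogue chunks into one memory: new = summarize(previous, chunk)."""
    memory = ""
    for chunk in sessions:
        memory = summarize(memory, chunk)
    return memory

def toy_summarize(memory: str, chunk: List[str]) -> str:
    """Stand-in for an LLM prompt: keep utterances that state a fact with ' is '."""
    facts = [utterance for utterance in chunk if " is " in utterance]
    # Drop the empty initial memory so no leading space is produced.
    return " ".join(part for part in [memory] + facts if part)

sessions = [
    ["Hi there!", "My dog is called Bo."],
    ["How are you?", "My favorite food is ramen."],
]
memory = recursive_memory(sessions, toy_summarize)
print(memory)
```

Because only the latest memory and the current chunk are passed to the summarizer at each step, the input size stays bounded no matter how long the conversation grows.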

arXiv information

Authors: Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, Liang Ding
Published: 2025-05-07 03:31:08+00:00

Categories: cs.AI, cs.CL

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Posted: May 8, 2025 | Author: jarxiv

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues, and how their capabilities can be effectively enhanced, remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source framework designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses a second model to generate patches for the identified files. To mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches and train the two models of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models with scores of 22.0% and 30.2%, respectively. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on Lite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally, our approach requires only two model calls per instance, making it significantly more efficient than existing methods. These results highlight the effectiveness of SWE-Fixer in real-world code-fixing scenarios. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.
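The coarse stage of the retrieval module can be illustrated with a plain Okapi BM25 ranking of files against an issue text. This is a minimal sketch under stated assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, a smoothed IDF variant, and made-up file contents. It does not reproduce SWE-Fixer's implementation, and the fine-grained re-ranking by the lightweight model is not shown.

```python
import math
from collections import Counter
from typing import Dict, List

def bm25_rank(query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return document names sorted by Okapi BM25 score, best match first."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long files that match only by bulk.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical repository contents, flattened to token strings.
files = {
    "parser.py": "def parse_date string return datetime parse error on timezone",
    "cli.py": "def main args print usage",
    "utils.py": "def slugify string return string lower",
}
ranking = bm25_rank("timezone parse error", files)
print(ranking)
```

In a coarse-to-fine setup, only the top few files from such a ranking would be handed to the more expensive model, which keeps the number of model calls per issue small.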

arXiv information

Authors: Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen
Published: 2025-05-07 04:06:41+00:00

Categories: cs.CL

Advancements and limitations of LLMs in replicating human color-word associations

Posted: May 8, 2025 | Author: jarxiv

Abstract

Color-word associations play a fundamental role in human cognition and design applications. Large Language Models (LLMs) have become widely available and have demonstrated intelligent behaviors in various benchmarks with natural conversation skills. However, their ability to replicate human color-word associations remains understudied. We compared multiple generations of LLMs (from GPT-3 to GPT-4o) against human color-word associations using data collected from over 10,000 Japanese participants, involving 17 colors and 80 words (10 words from each of eight categories) in Japanese. Our findings reveal a clear progression in LLM performance across generations, with GPT-4o achieving the highest accuracy in predicting the best-voted word for each color and category. However, the highest median performance was approximately 50% even for GPT-4o with visual inputs (chance level of 10%). Moreover, we found performance variations across word categories and colors: while LLMs tended to excel in categories such as Rhythm and Landscape, they struggled with categories such as Emotions. Interestingly, color discrimination ability estimated from our color-word association data showed high correlation with human color discrimination patterns, consistent with previous studies. Thus, despite reasonable alignment in basic color discrimination, humans and LLMs still diverge systematically in the words they assign to those colors. Our study highlights both the advancements in LLM capabilities and their persistent limitations, raising the possibility of systematic differences in semantic memory structures between humans and LLMs in representing color-word associations.

arXiv information

Authors: Makoto Fukushima, Shusuke Eshita, Hiroshige Fukuhara
Published: 2025-05-07 04:50:21+00:00

Categories: cs.CL, cs.CV, cs.GR, cs.HC
