BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Abstract

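As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real time has become a key yardstick for their reasoning and retrieval competence.
Existing benchmarks such as BrowseComp focus on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems, most notably the Chinese one.
To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark built to comprehensively evaluate LLM agents on the Chinese web.
BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains.
Each question is reverse-engineered from a short, objective, and easily verifiable answer (for example, a date, a number, or a proper noun).
A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness.
We benchmark over 20 state-of-the-art language models and agentic search systems on BrowseComp-ZH.
Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy below 10%, and only a handful exceed 20%.
Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%.
These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies but also sophisticated reasoning and information reconciliation, capabilities that current models still struggle to master.
The dataset, construction guidelines, and benchmark results are publicly available at https://github.com/PALIN2018/BrowseComp-ZH.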
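
Because every question has a short, verifiable answer, the accuracy figures quoted above can be understood as the fraction of questions answered correctly. The following is a minimal sketch of such a scorer, assuming normalized exact-match grading; the abstract does not describe the benchmark's actual grading procedure, so `normalize`, `accuracy`, and the toy data are hypothetical and purely illustrative.

```python
import unicodedata


def normalize(answer: str) -> str:
    """Fold full-width characters (NFKC), trim whitespace, and lowercase."""
    return unicodedata.normalize("NFKC", answer).strip().lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("no questions to score")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Invented toy data, not taken from the benchmark.
references = ["1998", "杭州", "42"]
predictions = ["１９９８", "上海", " 42 "]

score = accuracy(predictions, references)
print(f"accuracy = {score:.1%}")
```

NFKC normalization is used here because Chinese web text often mixes full-width and half-width digits and punctuation; a real evaluation would likely need more lenient matching (aliases, date formats) than this sketch provides.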

Abstract (original)

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems — most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI’s DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation — capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

arXiv information

Authors	Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
Published	2025-05-01 05:02:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
