xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

Summary

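We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity.
Existing benchmarks often focus on isolated technical skills and may not accurately reflect the economic value that agents deliver in professional settings.
To address this, xbench targets commercially significant domains with evaluation tasks defined by industry professionals.
Our framework creates metrics that strongly correlate with productivity value, enables prediction of Technology-Market Fit (TMF), and facilitates tracking of product capabilities over time.
As our initial implementations, we present two benchmarks: Recruitment and Marketing.
For Recruitment, we collect 50 tasks from real-world headhunting business scenarios to evaluate agents' abilities in company mapping, information retrieval, and talent sourcing.
For Marketing, we assess agents' ability to match influencers with advertiser needs, evaluating performance across 50 advertiser requirements using a curated pool of 836 candidate influencers.
We present initial evaluation results for leading contemporary agents, establishing a baseline for these professional domains.
Our continuously updated evalsets and evaluations are available at https://xbench.org.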

Abstract (original)

We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity. While existing benchmarks often focus on isolated technical skills, they may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks defined by industry professionals. Our framework creates metrics that strongly correlate with productivity value, enables prediction of Technology-Market Fit (TMF), and facilitates tracking of product capabilities over time. As our initial implementations, we present two benchmarks: Recruitment and Marketing. For Recruitment, we collect 50 tasks from real-world headhunting business scenarios to evaluate agents’ abilities in company mapping, information retrieval, and talent sourcing. For Marketing, we assess agents’ ability to match influencers with advertiser needs, evaluating their performance across 50 advertiser requirements using a curated pool of 836 candidate influencers. We present initial evaluation results for leading contemporary agents, establishing a baseline for these professional domains. Our continuously updated evalsets and evaluations are available at https://xbench.org.

arXiv information

Authors	Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo
Published	2025-06-16 16:16:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
