CRMArenaPro

Name: CRMArenaPro
Creator: maas
Published: 2025-09-01 16:42:21
License: 暂无描述

魔搭社区2025-09-01 更新2025-08-16 收录

下载链接：

https://modelscope.cn/datasets/Salesforce/CRMArenaPro

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for CRMArena-Pro - [Dataset Description](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#dataset-description) - [Paper Information](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#paper-information) - [Citation](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#citation) ## Dataset Description [CRMArena-Pro](https://arxiv.org/abs/2505.18878) is a benchmark for evaluating LLM agents' ability to perform real-world work tasks in realistic environment. It expands on CRMArena with nineteen expert-validated tasks across sales, service, and "configure, price, and quote" (CPQ) processes, for both Business-to-Business (B2B) and Business-to-Customer (B2C) scenarios. CRMArena-Pro distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. The benchmark aims to provide a holistic and realistic assessment of LLM agents in diverse professional settings, addressing the scarcity of public, realistic business data and the limitations of existing benchmarks in terms of fidelity and coverage. ### Fields Below, we illustrate the fields in each instance. - `answer`: The ground truth answer. - `task`: The task name. - `metadata`: The metadata for the query/task. These are supposed to be part of the system prompt. - `query`: The query that LLM agents should respond to. ## Paper Information - Paper: https://arxiv.org/abs/2505.18878 - Code: https://github.com/SalesforceAIResearch/CRMArena/ ## Citation ```bibtex @inproceedings{huang-etal-2025-crmarena, title = "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments", author = "Huang, Kung-Hsiang and Prabhakar, Akshara and Dhawan, Sidharth and Mao, Yixin and Wang, Huan and Savarese, Silvio and Xiong, Caiming and Laban, Philippe and Wu, Chien-Sheng", booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)", year = "2025", } @article{huang-etal-2025-crmarena-pro, title = "CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions", author = "Huang, Kung-Hsiang and Prabhakar, Akshara and Thorat, Onkar and Agarwal, Divyansh and Choubey, Prafulla Kumar and Mao, Yixin and Savarese, Silvio and Xiong, Caiming and Wu, Chien-Sheng", journal = "arXiv preprint arXiv:2505.18878", year = "2025", } ``` ## Ethical Considerations This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

# CRMArena-Pro 数据集卡片 - [数据集说明](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#dataset-description) - [论文信息](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#paper-information) - [引用信息](https://huggingface.co/datasets/Salesforce/CRMArenaPro/blob/main/README.md#citation) ## 数据集说明 [CRMArena-Pro](https://arxiv.org/abs/2505.18878) 是一款用于评估大语言模型（Large Language Model，LLM）智能体在真实环境中完成现实工作任务能力的基准测试集。该数据集在CRMArena的基础上进行扩展，新增了19项经过专家验证的任务，覆盖销售、服务以及"配置-定价-报价"（CPQ）流程，涵盖企业对企业（B2B）与企业对消费者（B2C）两类业务场景。CRMArena-Pro的显著特色在于融入了由多样化角色引导的多轮交互机制，以及严谨的保密意识评估环节。本基准测试旨在为各类专业场景下的LLM智能体提供全面且贴合现实的评估方案，以解决当前公开真实业务数据匮乏、以及现有基准测试在真实性与覆盖范围上存在的局限。 ### 数据字段下文将逐一说明每条数据实例中的字段： - `answer`：基准真值答案（标准答案） - `task`：任务名称 - `metadata`：查询/任务的元数据，此类内容将作为系统提示词的组成部分 - `query`：LLM智能体需要作出响应的查询内容 ## 论文信息 - 论文链接：https://arxiv.org/abs/2505.18878 - 代码链接：https://github.com/SalesforceAIResearch/CRMArena/ ## 引用信息 bibtex @inproceedings{huang-etal-2025-crmarena, title = "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments", author = "Huang, Kung-Hsiang and Prabhakar, Akshara and Dhawan, Sidharth and Mao, Yixin and Wang, Huan and Savarese, Silvio and Xiong, Caiming and Laban, Philippe and Wu, Chien-Sheng", booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)", year = "2025", } @article{huang-etal-2025-crmarena-pro, title = "CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions", author = "Huang, Kung-Hsiang and Prabhakar, Akshara and Thorat, Onkar and Agarwal, Divyansh and Choubey, Prafulla Kumar and Mao, Yixin and Savarese, Silvio and Xiong, Caiming and Wu, Chien-Sheng", journal = "arXiv preprint arXiv:2505.18878", year = "2025", } ## 伦理考量本数据集仅用于支撑学术论文的研究工作。我们的模型、数据集与代码并未针对所有下游应用场景进行专门设计与评估。我们强烈建议用户在部署该模型前，针对准确性、安全性与公平性相关的潜在问题开展评估与优化。我们鼓励用户充分考虑人工智能的普遍局限性，遵守适用法律法规，并在选择应用场景时遵循最佳实践，尤其针对那些错误或不当使用可能严重影响民众生活、权利或安全的高风险场景。如需获取更多应用场景相关指导，请参考我们的AUP与AI AUP。

提供机构：

maas

创建时间：

2025-08-15

5,000+

优质数据集

54 个

任务类型

进入经典数据集