
bdallhrajh371/EnterpriseOps-Gym

Hugging Face, updated 2026-04-05, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/bdallhrajh371/EnterpriseOps-Gym
Resource description:
---
license: cc-by-nc-4.0
dataset_info:
- config_name: oracle
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 287514
    num_examples: 61
  - name: csm
    num_bytes: 1463009
    num_examples: 103
  - name: drive
    num_bytes: 469510
    num_examples: 64
  - name: email
    num_bytes: 466399
    num_examples: 67
  - name: hr
    num_bytes: 1691979
    num_examples: 102
  - name: hybrid
    num_bytes: 1270996
    num_examples: 88
  - name: itsm
    num_bytes: 1422432
    num_examples: 103
  - name: teams
    num_bytes: 1305140
    num_examples: 61
  download_size: 991194
  dataset_size: 8376979
- config_name: plus_10_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 291906
    num_examples: 59
  - name: csm
    num_bytes: 1465984
    num_examples: 102
  - name: drive
    num_bytes: 481171
    num_examples: 64
  - name: email
    num_bytes: 478847
    num_examples: 67
  - name: hr
    num_bytes: 1714711
    num_examples: 102
  - name: hybrid
    num_bytes: 1287591
    num_examples: 88
  - name: itsm
    num_bytes: 1447378
    num_examples: 103
  - name: teams
    num_bytes: 1145920
    num_examples: 52
  download_size: 982687
  dataset_size: 8313508
- config_name: plus_15_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 296762
    num_examples: 59
  - name: csm
    num_bytes: 1473824
    num_examples: 102
  - name: drive
    num_bytes: 486100
    num_examples: 64
  - name: email
    num_bytes: 484265
    num_examples: 67
  - name: hr
    num_bytes: 1726316
    num_examples: 102
  - name: hybrid
    num_bytes: 1295972
    num_examples: 88
  - name: itsm
    num_bytes: 1460550
    num_examples: 103
  - name: teams
    num_bytes: 1150823
    num_examples: 52
  download_size: 989853
  dataset_size: 8374612
- config_name: plus_5_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 286286
    num_examples: 59
  - name: csm
    num_bytes: 1450583
    num_examples: 102
  - name: drive
    num_bytes: 475636
    num_examples: 64
  - name: email
    num_bytes: 473047
    num_examples: 67
  - name: hr
    num_bytes: 1703152
    num_examples: 102
  - name: hybrid
    num_bytes: 1279121
    num_examples: 88
  - name: itsm
    num_bytes: 1434621
    num_examples: 103
  - name: teams
    num_bytes: 1140547
    num_examples: 52
  download_size: 981152
  dataset_size: 8242993
configs:
- config_name: oracle
  data_files:
  - split: calendar
    path: oracle/calendar-*
  - split: csm
    path: oracle/csm-*
  - split: drive
    path: oracle/drive-*
  - split: email
    path: oracle/email-*
  - split: hr
    path: oracle/hr-*
  - split: hybrid
    path: oracle/hybrid-*
  - split: itsm
    path: oracle/itsm-*
  - split: teams
    path: oracle/teams-*
- config_name: plus_10_tools
  data_files:
  - split: calendar
    path: plus_10_tools/calendar-*
  - split: csm
    path: plus_10_tools/csm-*
  - split: drive
    path: plus_10_tools/drive-*
  - split: email
    path: plus_10_tools/email-*
  - split: hr
    path: plus_10_tools/hr-*
  - split: hybrid
    path: plus_10_tools/hybrid-*
  - split: itsm
    path: plus_10_tools/itsm-*
  - split: teams
    path: plus_10_tools/teams-*
- config_name: plus_15_tools
  data_files:
  - split: calendar
    path: plus_15_tools/calendar-*
  - split: csm
    path: plus_15_tools/csm-*
  - split: drive
    path: plus_15_tools/drive-*
  - split: email
    path: plus_15_tools/email-*
  - split: hr
    path: plus_15_tools/hr-*
  - split: hybrid
    path: plus_15_tools/hybrid-*
  - split: itsm
    path: plus_15_tools/itsm-*
  - split: teams
    path: plus_15_tools/teams-*
- config_name: plus_5_tools
  data_files:
  - split: calendar
    path: plus_5_tools/calendar-*
  - split: csm
    path: plus_5_tools/csm-*
  - split: drive
    path: plus_5_tools/drive-*
  - split: email
    path: plus_5_tools/email-*
  - split: hr
    path: plus_5_tools/hr-*
  - split: hybrid
    path: plus_5_tools/hybrid-*
  - split: itsm
    path: plus_5_tools/itsm-*
  - split: teams
    path: plus_5_tools/teams-*
---

<div align="center">
  <h1><img src="assets/csmgym.png" alt="Logo" width="48" style="vertical-align:middle; margin-right:8px;" /> EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</h1>
  <p>
    <a href="https://enterpriseops-gym.github.io/"><img src="https://img.shields.io/badge/Website-blue?logo=google-chrome&logoColor=white" /></a>
    <a href="https://arxiv.org/abs/2603.13594"><img src="https://img.shields.io/badge/Paper-red?logo=arxiv&logoColor=white" /></a>
    <a href="https://github.com/ServiceNow/EnterpriseOps-Gym"><img src="https://img.shields.io/badge/GitHub-black?logo=github" /></a>
  </p>
  <p><i>EnterpriseOps-Gym is a containerized, resettable enterprise simulation benchmark for evaluating LLM agents on stateful, multi-step planning and tool use across realistic enterprise workflows</i></p>
</div>

<div align="center"><img src="assets/teaser.png" alt="EnterpriseOps-Gym Overview" width="80%" /></div>

## About

**EnterpriseOps-Gym** is a large-scale benchmark for evaluating the agentic planning and tool-use capabilities of LLM agents across enterprise operations. It comprises **1,150 expert-curated tasks** spanning **8 enterprise domains**, each running against live containerized MCP servers backed by realistic, fully synthetic databases.

Unlike static QA benchmarks, EnterpriseOps-Gym evaluates agents on **final environment state** using SQL verifiers, meaning agents are rewarded for achieving the correct outcome, not for following a rigid action sequence. Tasks require long-horizon multi-step reasoning, strict policy compliance, and precise tool invocation under complex data dependencies.

> **Best model performance: 34.1% success rate** - leaving significant headroom for future research.

## Key Features

- 🛠️ **512 tools** across 8 enterprise domains
- 🗄️ **164 database tables** with avg 1.7 foreign-key dependencies per table
- 🔢 **9.15 avg steps** per task (up to 34), with **5.3 avg verification conditions**
- 📏 **89k avg context length** per task
- 🔒 Tasks enforce **access control, policy compliance, and referential integrity**
- ✅ Evaluation is **outcome-based** via executable SQL verifiers, not action-sequence matching
- 🐳 Fully **containerized** sandbox, reproducible and isolated per task run

## Evaluation Framework

The evaluation code is available at [ServiceNow/EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym).
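
To make the outcome-based evaluation described under About more concrete, here is a minimal sketch. It is not the benchmark's verifier code: the `tickets` table, the verifier queries, and the `run_verifiers` helper are all hypothetical, and an in-memory SQLite database stands in for the containerized MCP servers. It only illustrates that two different action sequences can pass the same final-state checks while an incomplete one fails.

```python
import sqlite3


def fresh_db():
    """Create a hypothetical starting state: one open, unassigned ticket."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, assignee TEXT)"
    )
    conn.execute("INSERT INTO tickets VALUES (1, 'open', NULL)")
    return conn


def run_verifiers(conn, verifiers):
    """Return True only if every (query, expected) pair matches the final state."""
    for query, expected in verifiers:
        row = conn.execute(query).fetchone()
        if row is None or row[0] != expected:
            return False
    return True


# Hypothetical verification conditions on the final database state.
verifiers = [
    ("SELECT status FROM tickets WHERE id = 1", "resolved"),
    ("SELECT assignee FROM tickets WHERE id = 1", "alice"),
]

# Trajectory A: a single combined update.
conn_a = fresh_db()
conn_a.execute("UPDATE tickets SET status = 'resolved', assignee = 'alice' WHERE id = 1")

# Trajectory B: two separate updates, in a different order.
conn_b = fresh_db()
conn_b.execute("UPDATE tickets SET assignee = 'alice' WHERE id = 1")
conn_b.execute("UPDATE tickets SET status = 'resolved' WHERE id = 1")

# Trajectory C: stops early, so the assignee is never set.
conn_c = fresh_db()
conn_c.execute("UPDATE tickets SET status = 'resolved' WHERE id = 1")

result_a = run_verifiers(conn_a, verifiers)
result_b = run_verifiers(conn_b, verifiers)
result_c = run_verifiers(conn_c, verifiers)
print(result_a, result_b, result_c)
```

Because only the final state is checked, trajectories A and B are scored identically even though their action sequences differ.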
The framework supports:

- **Multiple orchestrators**: ReAct, Planner-ReAct, Decomposing Planner
- **Multiple LLM providers**: Anthropic, OpenAI, Azure OpenAI, Google Gemini, DeepSeek, vLLM, and more
- **Parallel execution** via [Ray](https://www.ray.io/) for large-scale runs
- **Automatic scoring** with per-task and per-mode breakdowns

```python
from datasets import load_dataset

# Load the "teams" split (domain) of the "oracle" configuration (mode).
ds = load_dataset("ServiceNow-AI/EnterpriseOps-Gym", "oracle", split="teams")
```

## Domain Information

The dataset is organized by **domain** (split) and **mode** (configuration subset).

### Domains

| Domain | Tasks | Avg Steps | Max Steps | Tools |
|--------|------:|----------:|----------:|------:|
| Calendar | 100 | 7.05 | 17 | 37 |
| CSM | 186 | 12.10 | 27 | 89 |
| Drive | 105 | 8.68 | 29 | 55 |
| Email | 104 | 6.25 | 22 | 79 |
| HR | 184 | 10.54 | 34 | 89 |
| ITSM | 181 | 9.00 | 31 | 93 |
| Teams | 100 | 9.41 | 18 | 70 |
| Hybrid | 155 | 7.79 | 19 | Multi-domain |
| **Total** | **1,115** | **9.15** | **34** | **512** |

### Modes (Tool-Set Configurations)

Each mode controls the set of tools exposed to the agent, simulating realistic tool-retrieval scenarios:

| Mode | Description |
|------|-------------|
| `oracle` | Only the exact tools needed for the task |
| `plus_5_tools` | Oracle tools + 5 randomly sampled distractor tools |
| `plus_10_tools` | Oracle tools + 10 randomly sampled distractor tools |
| `plus_15_tools` | Oracle tools + 15 randomly sampled distractor tools |

## Field Descriptions

Each row in the dataset corresponds to one task instance and contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `string` | Unique identifier for the task |
| `domain` | `string` | Domain name (e.g., `teams`, `csm`, `hr`) |
| `system_prompt` | `string` | Agent role definition and domain-specific policies |
| `user_prompt` | `string` | Natural language task instruction |
| `verifiers` | `string` (JSON) | Array of SQL-based outcome verification scripts that check final environment state |
| `gym_servers_config` | `string` (JSON) | MCP server configuration(s) specifying which containerized gym server(s) to connect to |
| `selected_tools` | `list[string]` | Names of tools available to the agent in this mode |

## Example Use Cases

**EnterpriseOps-Gym** can be used for:

- **Benchmarking LLM agents** on realistic enterprise workflows across IT, HR, CRM, and collaboration domains
- **Evaluating tool-use and planning** under long-horizon, multi-step, policy-constrained settings
- **Studying tool retrieval robustness** by comparing oracle vs. distractor-augmented tool modes
- **Developing new orchestration strategies**: the framework natively supports ReAct, Planner-ReAct, and Decomposing Planner
- **Studying failure modes** of state-of-the-art models on high-complexity enterprise tasks (best model: 34.1%)
- **Extending the benchmark** with new domains, tasks, or verifiers using the released Docker sandbox infrastructure

## Citation

```bibtex
@misc{malay2026enterpriseopsgymenvironmentsevaluationsstateful,
  title={EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings},
  author={Shiva Krishna Reddy Malay and Shravan Nayak and Jishnu Sethumadhavan Nair and Sagar Davasam and Aman Tiwari and Sathwik Tejaswi Madhusudhan and Sridhar Krishna Nemala and Srinivas Sunkara and Sai Rajeswar},
  year={2026},
  eprint={2603.13594},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.13594},
}
```
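
As a rough illustration of how the tool-set modes relate to each other, the sketch below separates oracle tools from distractors for a single task. The tool names and both helper functions are made up for the example. The assumption that a `plus_N_tools` row contains every oracle tool plus exactly N others is inferred from the mode table, not verified against the data.

```python
# Number of distractor tools each mode is described as adding.
MODE_DISTRACTORS = {
    "oracle": 0,
    "plus_5_tools": 5,
    "plus_10_tools": 10,
    "plus_15_tools": 15,
}


def distractor_tools(oracle_tools, mode_tools):
    """Return the tools exposed in a mode that are not among the oracle tools."""
    oracle = set(oracle_tools)
    return [tool for tool in mode_tools if tool not in oracle]


def is_consistent(mode, oracle_tools, mode_tools):
    """Check the assumed invariant: all oracle tools present, plus exactly N extras."""
    missing = set(oracle_tools) - set(mode_tools)
    extras = distractor_tools(oracle_tools, mode_tools)
    return not missing and len(extras) == MODE_DISTRACTORS[mode]


# Hypothetical `selected_tools` values for the same task in two modes.
oracle_tools = ["create_event", "list_events"]
plus_5_tools = oracle_tools + [
    "send_email",
    "create_ticket",
    "upload_file",
    "post_message",
    "lookup_employee",
]

extras = distractor_tools(oracle_tools, plus_5_tools)
ok = is_consistent("plus_5_tools", oracle_tools, plus_5_tools)
print(extras, ok)
```

Pairing rows across configurations by `task_id` is also an assumption. The split sizes in the metadata differ between configurations (for example, `calendar` has 61 rows in `oracle` and 59 in each `plus_*` configuration), so some oracle tasks may have no counterpart in the distractor modes.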

## Additional Field Descriptions

The dataset schema also defines four fields that the Field Descriptions table does not cover:

| Field | Type | Description |
|-------|------|-------------|
| `restricted_tools` | `list` | Tools that are restricted (disabled) for the agent in this mode; empty in the published configurations |
| `mcp_endpoint` | `string` | Access endpoint of the MCP server |
| `number_of_runs` | `int64` | Maximum number of runs allowed for the task |
| `reset_database_between_runs` | `bool` | Whether the database state is reset between task runs |
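
Since `verifiers` and `gym_servers_config` are stored as JSON strings, a row usually needs decoding before use. The following is a minimal sketch of that step, assuming a row shaped like the schema above. The `TaskSpec` class, the `parse_row` helper, and the contents of the sample JSON values are hypothetical, because the card does not document the inner JSON structure.

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TaskSpec:
    """Decoded view of one dataset row (hypothetical helper type)."""
    task_id: str
    domain: str
    selected_tools: List[str]
    number_of_runs: int
    reset_database_between_runs: bool
    gym_servers_config: Any  # decoded JSON; inner structure is not documented here
    verifiers: Any           # decoded JSON; inner structure is not documented here


def parse_row(row):
    """Decode the JSON-string fields of a row and keep the run settings."""
    return TaskSpec(
        task_id=row["task_id"],
        domain=row["domain"],
        selected_tools=list(row["selected_tools"]),
        number_of_runs=row["number_of_runs"],
        reset_database_between_runs=row["reset_database_between_runs"],
        gym_servers_config=json.loads(row["gym_servers_config"]),
        verifiers=json.loads(row["verifiers"]),
    )


# Hypothetical row: field names follow the schema, values are invented.
sample_row = {
    "task_id": "demo-001",
    "domain": "teams",
    "selected_tools": ["post_message", "list_channels"],
    "number_of_runs": 3,
    "reset_database_between_runs": True,
    "gym_servers_config": json.dumps([{"name": "teams-server"}]),
    "verifiers": json.dumps([{"query": "SELECT COUNT(*) FROM messages", "expected": 1}]),
}

spec = parse_row(sample_row)
print(spec.task_id, spec.number_of_runs, len(spec.verifiers))
```

The same function could be applied to rows returned by `load_dataset`, provided the real JSON payloads decode cleanly.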
Provided by: bdallhrajh371