ServiceNow-AI/EnterpriseOps-Gym

Name: ServiceNow-AI/EnterpriseOps-Gym
Creator: ServiceNow-AI
Published: 2026-04-30 15:27:14
License: apache-2.0

Hugging Face · Updated 2026-04-30 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/ServiceNow-AI/EnterpriseOps-Gym

Resource overview:

---
license: apache-2.0
dataset_info:
- config_name: oracle
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 287514
    num_examples: 61
  - name: csm
    num_bytes: 1463009
    num_examples: 103
  - name: drive
    num_bytes: 469510
    num_examples: 64
  - name: email
    num_bytes: 466399
    num_examples: 67
  - name: hr
    num_bytes: 1691979
    num_examples: 102
  - name: hybrid
    num_bytes: 1270996
    num_examples: 88
  - name: itsm
    num_bytes: 1422432
    num_examples: 103
  - name: teams
    num_bytes: 1305140
    num_examples: 61
  download_size: 991194
  dataset_size: 8376979
- config_name: plus_10_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 291906
    num_examples: 59
  - name: csm
    num_bytes: 1465984
    num_examples: 102
  - name: drive
    num_bytes: 481171
    num_examples: 64
  - name: email
    num_bytes: 478847
    num_examples: 67
  - name: hr
    num_bytes: 1714711
    num_examples: 102
  - name: hybrid
    num_bytes: 1287591
    num_examples: 88
  - name: itsm
    num_bytes: 1447378
    num_examples: 103
  - name: teams
    num_bytes: 1145920
    num_examples: 52
  download_size: 982687
  dataset_size: 8313508
- config_name: plus_15_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 296762
    num_examples: 59
  - name: csm
    num_bytes: 1473824
    num_examples: 102
  - name: drive
    num_bytes: 486100
    num_examples: 64
  - name: email
    num_bytes: 484265
    num_examples: 67
  - name: hr
    num_bytes: 1726316
    num_examples: 102
  - name: hybrid
    num_bytes: 1295972
    num_examples: 88
  - name: itsm
    num_bytes: 1460550
    num_examples: 103
  - name: teams
    num_bytes: 1150823
    num_examples: 52
  download_size: 989853
  dataset_size: 8374612
- config_name: plus_5_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 286286
    num_examples: 59
  - name: csm
    num_bytes: 1450583
    num_examples: 102
  - name: drive
    num_bytes: 475636
    num_examples: 64
  - name: email
    num_bytes: 473047
    num_examples: 67
  - name: hr
    num_bytes: 1703152
    num_examples: 102
  - name: hybrid
    num_bytes: 1279121
    num_examples: 88
  - name: itsm
    num_bytes: 1434621
    num_examples: 103
  - name: teams
    num_bytes: 1140547
    num_examples: 52
  download_size: 981152
  dataset_size: 8242993
configs:
- config_name: oracle
  data_files:
  - split: calendar
    path: oracle/calendar-*
  - split: csm
    path: oracle/csm-*
  - split: drive
    path: oracle/drive-*
  - split: email
    path: oracle/email-*
  - split: hr
    path: oracle/hr-*
  - split: hybrid
    path: oracle/hybrid-*
  - split: itsm
    path: oracle/itsm-*
  - split: teams
    path: oracle/teams-*
- config_name: plus_10_tools
  data_files:
  - split: calendar
    path: plus_10_tools/calendar-*
  - split: csm
    path: plus_10_tools/csm-*
  - split: drive
    path: plus_10_tools/drive-*
  - split: email
    path: plus_10_tools/email-*
  - split: hr
    path: plus_10_tools/hr-*
  - split: hybrid
    path: plus_10_tools/hybrid-*
  - split: itsm
    path: plus_10_tools/itsm-*
  - split: teams
    path: plus_10_tools/teams-*
- config_name: plus_15_tools
  data_files:
  - split: calendar
    path: plus_15_tools/calendar-*
  - split: csm
    path: plus_15_tools/csm-*
  - split: drive
    path: plus_15_tools/drive-*
  - split: email
    path: plus_15_tools/email-*
  - split: hr
    path: plus_15_tools/hr-*
  - split: hybrid
    path: plus_15_tools/hybrid-*
  - split: itsm
    path: plus_15_tools/itsm-*
  - split: teams
    path: plus_15_tools/teams-*
- config_name: plus_5_tools
  data_files:
  - split: calendar
    path: plus_5_tools/calendar-*
  - split: csm
    path: plus_5_tools/csm-*
  - split: drive
    path: plus_5_tools/drive-*
  - split: email
    path: plus_5_tools/email-*
  - split: hr
    path: plus_5_tools/hr-*
  - split: hybrid
    path: plus_5_tools/hybrid-*
  - split: itsm
    path: plus_5_tools/itsm-*
  - split: teams
    path: plus_5_tools/teams-*
---

<div align="center">
<h1><img src="assets/csmgym.png" alt="Logo" width="48" style="vertical-align:middle; margin-right:8px;" /> EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</h1>
<p>
<a href="https://enterpriseops-gym.github.io/"><img src="https://img.shields.io/badge/Website-blue?logo=google-chrome&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2603.13594"><img src="https://img.shields.io/badge/Paper-red?logo=arxiv&logoColor=white" /></a>
<a href="https://github.com/ServiceNow/EnterpriseOps-Gym"><img src="https://img.shields.io/badge/GitHub-black?logo=github" /></a>
</p>
<p><i>EnterpriseOps-Gym is a containerized, resettable enterprise simulation benchmark for evaluating LLM agents on stateful, multi-step planning and tool use across realistic enterprise workflows</i></p>
</div>

<div align="center"><img src="assets/teaser.png" alt="EnterpriseOps-Gym Overview" width="80%"
/></div>

## About

**EnterpriseOps-Gym** is a large-scale benchmark for evaluating the agentic planning and tool-use capabilities of LLM agents across enterprise operations. It comprises **1,150 expert-curated tasks** spanning **8 enterprise domains**, each running against live containerized MCP servers backed by realistic, fully synthetic databases.

Unlike static QA benchmarks, EnterpriseOps-Gym evaluates agents on **final environment state** using SQL verifiers - meaning agents are rewarded for achieving the correct outcome, not for following a rigid action sequence. Tasks require long-horizon multi-step reasoning, strict policy compliance, and precise tool invocation under complex data dependencies.

> **Best model performance: 34.1% success rate** - leaving significant headroom for future research.

## Key Features

- 🛠️ **512 tools** across 8 enterprise domains
- 🗄️ **164 database tables** with avg 1.7 foreign-key dependencies per table
- 🔢 **9.15 avg steps** per task (up to 34), with **5.3 avg verification conditions**
- 📏 **89k avg context length** per task
- 🔒 Tasks enforce **access control, policy compliance, and referential integrity**
- ✅ Evaluation is **outcome-based** via executable SQL verifiers, not action-sequence matching
- 🐳 Fully **containerized** sandbox, reproducible and isolated per task run

## Evaluation Framework

The evaluation code is available at [ServiceNow/EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym).
The framework supports:

- **Multiple orchestrators**: ReAct, Planner-ReAct, Decomposing Planner
- **Multiple LLM providers**: Anthropic, OpenAI, Azure OpenAI, Google Gemini, DeepSeek, vLLM, and more
- **Parallel execution** via [Ray](https://www.ray.io/) for large-scale runs
- **Automatic scoring** with per-task and per-mode breakdowns

```python
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/EnterpriseOps-Gym", "oracle", split="teams")
```

## Domain Information

The dataset is organized by **domain** (split) and **mode** (configuration subset).

### Domains

| Domain | Tasks | Avg Steps | Max Steps | Tools |
|--------|------:|----------:|----------:|------:|
| Calendar | 100 | 7.05 | 17 | 37 |
| CSM | 186 | 12.10 | 27 | 89 |
| Drive | 105 | 8.68 | 29 | 55 |
| Email | 104 | 6.25 | 22 | 79 |
| HR | 184 | 10.54 | 34 | 89 |
| ITSM | 181 | 9.00 | 31 | 93 |
| Teams | 100 | 9.41 | 18 | 70 |
| Hybrid | 155 | 7.79 | 19 | Multi-domain |
| **Total** | **1,115** | **9.15** | **34** | **512** |

### Modes (Tool-Set Configurations)

Each mode controls the set of tools exposed to the agent, simulating realistic tool-retrieval scenarios:

| Mode | Description |
|------|-------------|
| `oracle` | Only the exact tools needed for the task |
| `plus_5_tools` | Oracle tools + 5 randomly sampled distractor tools |
| `plus_10_tools` | Oracle tools + 10 randomly sampled distractor tools |
| `plus_15_tools` | Oracle tools + 15 randomly sampled distractor tools |

## Field Descriptions

Each row in the dataset corresponds to one task instance and contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `string` | Unique identifier for the task |
| `domain` | `string` | Domain name (e.g., `teams`, `csm`, `hr`) |
| `system_prompt` | `string` | Agent role definition and domain-specific policies |
| `user_prompt` | `string` | Natural language task instruction |
| `verifiers` | `string` (JSON) | Array of SQL-based outcome verification scripts that check final environment state |
| `gym_servers_config` | `string` (JSON) | MCP server configuration(s) specifying which containerized gym server(s) to connect to |
| `selected_tools` | `list[string]` | Names of tools available to the agent in this mode |

## Example Use Cases

**EnterpriseOps-Gym** can be used for:

- **Benchmarking LLM agents** on realistic enterprise workflows across IT, HR, CRM, and collaboration domains
- **Evaluating tool-use and planning** under long-horizon, multi-step, policy-constrained settings
- **Studying tool retrieval robustness** by comparing oracle vs. distractor-augmented tool modes
- **Developing new orchestration strategies**: the framework natively supports ReAct, Planner-ReAct, and Decomposing Planner
- **Studying failure modes** of state-of-the-art models on high-complexity enterprise tasks (best model: 34.1%)
- **Extending the benchmark** with new domains, tasks, or verifiers using the released Docker sandbox infrastructure

## Citation

```bibtex
@misc{malay2026enterpriseopsgymenvironmentsevaluationsstateful,
  title={EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings},
  author={Shiva Krishna Reddy Malay and Shravan Nayak and Jishnu Sethumadhavan Nair and Sagar Davasam and Aman Tiwari and Sathwik Tejaswi Madhusudhan and Sridhar Krishna Nemala and Srinivas Sunkara and Sai Rajeswar},
  year={2026},
  eprint={2603.13594},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.13594},
}
```

EnterpriseOps-Gym is a large-scale benchmark for evaluating the agentic planning and tool-use capabilities of LLM agents across enterprise operations. It comprises 1,150 expert-curated tasks spanning 8 enterprise domains, each running against live containerized MCP servers backed by realistic, fully synthetic databases. Agents are scored on the final environment state using SQL verifiers, so credit goes to reaching the correct outcome rather than to following a fixed action sequence. Four tool-set configurations (oracle, plus_5_tools, plus_10_tools, and plus_15_tools) simulate different tool-retrieval scenarios.
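The dataset card lists `verifiers` and `gym_servers_config` as JSON-encoded strings, so a task row needs one decoding step before those fields can be inspected. The following is a minimal sketch of that step. The sample row and the keys inside its JSON payloads (`sql`, `expected`, `name`) are hypothetical placeholders, since the inner schema is not documented here.

```python
import json
from typing import Any

# Fields the dataset card describes as `string` (JSON).
JSON_FIELDS = ("verifiers", "gym_servers_config")


def decode_task(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a task row with its JSON-string fields parsed."""
    task = dict(row)
    for field in JSON_FIELDS:
        value = task.get(field)
        if isinstance(value, str):
            task[field] = json.loads(value)
    return task


# Hypothetical row: the inner keys below are invented for illustration
# and are not taken from the dataset.
sample_row = {
    "task_id": "demo-001",
    "domain": "teams",
    "selected_tools": ["create_channel", "post_message"],
    "verifiers": json.dumps([{"sql": "SELECT COUNT(*) FROM channels", "expected": 1}]),
    "gym_servers_config": json.dumps([{"name": "teams-gym"}]),
}

task = decode_task(sample_row)
print(type(task["verifiers"]).__name__, len(task["verifiers"]))
```

The original row is left untouched, so the raw strings remain available if they need to be passed to other tooling unchanged.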

Provider:

ServiceNow-AI

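Curated Summary

Dataset Introduction

Construction

EnterpriseOps-Gym is built as a deep simulation of real enterprise operations. It covers eight core enterprise domains (calendar, customer service management, cloud storage, email, human resources, IT service management, team collaboration, and cross-domain hybrid) and contains 1,150 tasks designed by domain experts. The dataset uses containerized sandboxes: each task runs against MCP servers backed by realistic synthetic databases, in a resettable environment that keeps experiments reproducible. Task data is organized into structured fields such as the system prompt, user instruction, verifiers, and tool-set configuration. The verifiers are executable SQL scripts that score an agent on the final environment state rather than on matching an action sequence. The configuration modes are oracle (only the required tools) and variants that add 5, 10, or 15 distractor tools, which simulate planning difficulty under different tool-retrieval scenarios.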
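The distractor modes can be pictured as the oracle tool list plus a fixed number of tools sampled at random from the rest of the tool pool. The sketch below illustrates that idea only. The authors' actual sampling procedure is not described in this document, and the tool names, the `build_tool_set` helper, the seed handling, and the final shuffle are all assumptions.

```python
import random


def build_tool_set(oracle_tools, tool_pool, n_distractors, seed=0):
    """Oracle tools plus `n_distractors` tools sampled from the rest of the pool."""
    oracle = list(oracle_tools)
    candidates = sorted(set(tool_pool) - set(oracle))
    if n_distractors > len(candidates):
        raise ValueError("not enough distractor candidates in the pool")
    rng = random.Random(seed)
    tools = oracle + rng.sample(candidates, n_distractors)
    # Assumption: shuffle so list position does not reveal the oracle tools.
    rng.shuffle(tools)
    return tools


# Hypothetical tool names, for illustration only.
pool = [f"tool_{i:02d}" for i in range(20)]
oracle = ["tool_03", "tool_07"]

for mode, k in [("oracle", 0), ("plus_5_tools", 5), ("plus_10_tools", 10), ("plus_15_tools", 15)]:
    tools = build_tool_set(oracle, pool, k, seed=42)
    print(mode, len(tools))
```

Every mode still contains the oracle tools, so a drop in success rate between modes can be attributed to tool selection rather than to missing capabilities.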

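Usage

EnterpriseOps-Gym is flexible to use and straightforward to extend. Users first load the dataset with the Hugging Face `datasets` library, for example `load_dataset('ServiceNow-AI/EnterpriseOps-Gym', 'oracle', split='teams')`, selecting task instances by domain (split) and mode (configuration subset). Each instance contains fields such as the task identifier, system prompt, user instruction, verifier scripts, and MCP server configuration, and can be used directly for agent inference and execution. The evaluation framework lives in the open-source code repository; it supports several orchestration strategies (ReAct, Planner-ReAct, and Decomposing Planner) and connects to major LLM providers including Anthropic, OpenAI, Google Gemini, and DeepSeek. Ray provides large-scale parallel execution, and score reports are generated automatically with per-task and per-mode breakdowns, which helps researchers analyze agent failure modes or develop new orchestration algorithms. The released Docker sandbox infrastructure also lets users add new domains, tasks, or verifiers.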
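Because modes are configurations and domains are splits, covering the whole release means iterating over every (mode, domain) pair. A minimal sketch follows, using the config and split names from the dataset card. The helper names `all_subsets` and `load_subset` are made up for this example, and the download call is left commented out because it needs the `datasets` package and network access.

```python
from itertools import product

REPO_ID = "ServiceNow-AI/EnterpriseOps-Gym"
MODES = ("oracle", "plus_5_tools", "plus_10_tools", "plus_15_tools")
DOMAINS = ("calendar", "csm", "drive", "email", "hr", "hybrid", "itsm", "teams")


def all_subsets():
    """Every (mode, domain) pair listed on the dataset card."""
    return list(product(MODES, DOMAINS))


def load_subset(mode, domain):
    """Load one subset; requires the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so all_subsets() works offline

    return load_dataset(REPO_ID, mode, split=domain)


if __name__ == "__main__":
    print(len(all_subsets()))  # 4 modes x 8 domains
    # Uncomment to download one subset and inspect a task:
    # ds = load_subset("oracle", "teams")
    # print(ds[0]["task_id"], ds[0]["selected_tools"])
```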

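Background and Challenges

Background Overview

Enterprise operations have long been held back by low levels of process automation and by the difficulty of coordinating multiple systems. In 2026, the ServiceNow research team, with Shiva Krishna Reddy Malay and co-authors, released the EnterpriseOps-Gym dataset. It focuses on evaluating the planning and tool-calling abilities of large language models (LLMs) in complex enterprise scenarios, covers 8 core domains including calendar, human resources, and IT service management, and contains 1,115 expert-designed tasks. Built on containerized sandbox environments and live databases, the dataset uses SQL verifiers to score agents on the final environment state, which moves beyond the limits of traditional static question-answering benchmarks. Its release matters for research on enterprise-grade intelligent automation and gives industry and academia a standardized evaluation framework.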

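Current Challenges

The core challenge EnterpriseOps-Gym addresses stems from the inherent complexity of enterprise operations: in an environment with 512 tools and 164 database tables, agents must carry out long-horizon multi-step reasoning that averages 9.15 steps and reaches 34 steps, while strictly observing access control and policy compliance requirements. Building the dataset posed several technical difficulties: designing highly realistic synthetic databases and maintaining complex referential-integrity constraints, developing executable SQL-based verifiers for outcome-oriented evaluation, and simulating tool-retrieval noise under tool-set configurations such as oracle and plus_5_tools. The best current model reaches only a 34.1% success rate, which shows how much room for research and improvement remains.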
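Referential integrity is one reason tool calls have to happen in the right order: a record cannot point at a parent row that does not exist yet. The sketch below shows the effect with Python's built-in `sqlite3` module. The two-table schema and the `assign_incident` helper are hypothetical, and the benchmark's own tables and database engine are not shown in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE incidents (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        assigned_to INTEGER NOT NULL REFERENCES users(id)
    );
    """
)


def assign_incident(title, user_id):
    """Try to create an incident; return True if the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO incidents (title, assigned_to) VALUES (?, ?)",
                (title, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(assign_incident("VPN outage", 1))  # rejected: user 1 does not exist yet
with conn:
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ana')")
print(assign_incident("VPN outage", 1))  # accepted: the reference is now valid
```

An agent working against such a schema has to discover or create the parent record first, which is the kind of data dependency that lengthens task plans.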

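Common Scenarios

Typical Use Cases

EnterpriseOps-Gym provides a benchmark platform for agent planning and tool use in enterprise operations. The dataset covers eight core business domains, from calendar management, customer service, and document collaboration to human resources and IT service management, and contains 1,115 expert-designed task instances in total. Each task is a complex workflow that requires multi-step reasoning, strict adherence to enterprise policy, and precise tool calls. Four tool-set configuration modes (oracle and the distractor-augmented modes) let researchers systematically evaluate how well LLM agents plan and execute in multi-tool environments.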
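Comparing the four modes comes down to grouping task outcomes by mode and computing a success rate for each group. A minimal sketch of such a breakdown is shown here. The record layout and the `success_by_mode` helper are assumptions rather than the evaluation framework's own report format, and the outcomes are made up.

```python
from collections import defaultdict


def success_by_mode(results):
    """Map each mode to the fraction of its task runs that passed."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["mode"]] += 1
        passed[r["mode"]] += bool(r["passed"])
    return {mode: passed[mode] / total[mode] for mode in total}


# Made-up outcomes for illustration, not benchmark results.
results = [
    {"task_id": "t1", "mode": "oracle", "passed": True},
    {"task_id": "t2", "mode": "oracle", "passed": False},
    {"task_id": "t1", "mode": "plus_5_tools", "passed": False},
    {"task_id": "t2", "mode": "plus_5_tools", "passed": False},
]
print(success_by_mode(results))
```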

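Academic Problems Addressed

The dataset addresses a key gap in the evaluation of LLM agents: the lack of dynamic, multi-step benchmarks that reflect the complexity of real enterprise environments. Traditional static question-answering evaluations cannot capture an agent's state reasoning over long tasks, its policy compliance, or its tool selection. By introducing outcome-oriented evaluation based on SQL verification, EnterpriseOps-Gym requires agents to reach the correct final environment state rather than mechanically imitate an action sequence, and it provides a standardized framework for studying long-horizon planning, tool-calling robustness, and decision-making under policy constraints. The best model's success rate of only 34.1% shows how much room current techniques have to improve.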
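Outcome-based evaluation means that any action sequence ending in the right database state passes, and any sequence that leaves a condition unmet fails. The sketch below imitates this with an in-memory `sqlite3` database. The ticket table, the (query, expected value) verifier format, and the rule that every condition must hold are assumptions made for illustration, not the benchmark's actual verifier schema.

```python
import sqlite3


def fresh_db():
    """A tiny stand-in for a resettable task database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, state TEXT, assignee TEXT)")
    conn.execute("INSERT INTO tickets VALUES (1, 'new', NULL)")
    conn.commit()
    return conn


# Hypothetical verifier format: a single-value query plus the value expected.
VERIFIERS = [
    ("SELECT state FROM tickets WHERE id = 1", "resolved"),
    ("SELECT assignee FROM tickets WHERE id = 1", "ana"),
    ("SELECT COUNT(*) FROM tickets", 1),
]


def task_passed(conn):
    """Assumed rule: the task passes only if every verification condition holds."""
    return all(conn.execute(sql).fetchone()[0] == expected for sql, expected in VERIFIERS)


def run(actions):
    """Reset the database, apply the agent's actions, then check the final state."""
    conn = fresh_db()
    for sql in actions:
        conn.execute(sql)
    return task_passed(conn)


one_step = ["UPDATE tickets SET state = 'resolved', assignee = 'ana' WHERE id = 1"]
two_steps = [
    "UPDATE tickets SET assignee = 'ana' WHERE id = 1",
    "UPDATE tickets SET state = 'resolved' WHERE id = 1",
]
incomplete = ["UPDATE tickets SET state = 'resolved' WHERE id = 1"]

print(run(one_step), run(two_steps), run(incomplete))
```

The first two plans differ in length and order but reach the same final state, so both pass; the third leaves the assignee unset and fails.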

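Practical Applications

In practice, the dataset supports evaluation and optimization during the development of enterprise-grade intelligent assistants. In IT service management, for example, it can be used to evaluate how well an agent handles ticket assignment, permission approval, and cross-system collaboration; in human resources, it can verify how accurately an agent carries out employee onboarding, attendance management, and policy lookup. The built-in containerized sandbox environment also supports safely isolated, repeatable experiments, giving enterprises a reliable engineering basis for risk assessment and capability testing before they deploy an automated operations assistant.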

Recent Research on the Dataset