ServiceNow-AI/EnterpriseOps-Gym
Source: Hugging Face (updated 2026-04-30, indexed 2026-04-05)
Download link: https://hf-mirror.com/datasets/ServiceNow-AI/EnterpriseOps-Gym
---
license: apache-2.0
dataset_info:
- config_name: oracle
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 287514
    num_examples: 61
  - name: csm
    num_bytes: 1463009
    num_examples: 103
  - name: drive
    num_bytes: 469510
    num_examples: 64
  - name: email
    num_bytes: 466399
    num_examples: 67
  - name: hr
    num_bytes: 1691979
    num_examples: 102
  - name: hybrid
    num_bytes: 1270996
    num_examples: 88
  - name: itsm
    num_bytes: 1422432
    num_examples: 103
  - name: teams
    num_bytes: 1305140
    num_examples: 61
  download_size: 991194
  dataset_size: 8376979
- config_name: plus_10_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 291906
    num_examples: 59
  - name: csm
    num_bytes: 1465984
    num_examples: 102
  - name: drive
    num_bytes: 481171
    num_examples: 64
  - name: email
    num_bytes: 478847
    num_examples: 67
  - name: hr
    num_bytes: 1714711
    num_examples: 102
  - name: hybrid
    num_bytes: 1287591
    num_examples: 88
  - name: itsm
    num_bytes: 1447378
    num_examples: 103
  - name: teams
    num_bytes: 1145920
    num_examples: 52
  download_size: 982687
  dataset_size: 8313508
- config_name: plus_15_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 296762
    num_examples: 59
  - name: csm
    num_bytes: 1473824
    num_examples: 102
  - name: drive
    num_bytes: 486100
    num_examples: 64
  - name: email
    num_bytes: 484265
    num_examples: 67
  - name: hr
    num_bytes: 1726316
    num_examples: 102
  - name: hybrid
    num_bytes: 1295972
    num_examples: 88
  - name: itsm
    num_bytes: 1460550
    num_examples: 103
  - name: teams
    num_bytes: 1150823
    num_examples: 52
  download_size: 989853
  dataset_size: 8374612
- config_name: plus_5_tools
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: selected_tools
    list: string
  - name: restricted_tools
    list: 'null'
  - name: mcp_endpoint
    dtype: string
  - name: number_of_runs
    dtype: int64
  - name: reset_database_between_runs
    dtype: bool
  - name: gym_servers_config
    dtype: string
  - name: verifiers
    dtype: string
  splits:
  - name: calendar
    num_bytes: 286286
    num_examples: 59
  - name: csm
    num_bytes: 1450583
    num_examples: 102
  - name: drive
    num_bytes: 475636
    num_examples: 64
  - name: email
    num_bytes: 473047
    num_examples: 67
  - name: hr
    num_bytes: 1703152
    num_examples: 102
  - name: hybrid
    num_bytes: 1279121
    num_examples: 88
  - name: itsm
    num_bytes: 1434621
    num_examples: 103
  - name: teams
    num_bytes: 1140547
    num_examples: 52
  download_size: 981152
  dataset_size: 8242993
configs:
- config_name: oracle
  data_files:
  - split: calendar
    path: oracle/calendar-*
  - split: csm
    path: oracle/csm-*
  - split: drive
    path: oracle/drive-*
  - split: email
    path: oracle/email-*
  - split: hr
    path: oracle/hr-*
  - split: hybrid
    path: oracle/hybrid-*
  - split: itsm
    path: oracle/itsm-*
  - split: teams
    path: oracle/teams-*
- config_name: plus_10_tools
  data_files:
  - split: calendar
    path: plus_10_tools/calendar-*
  - split: csm
    path: plus_10_tools/csm-*
  - split: drive
    path: plus_10_tools/drive-*
  - split: email
    path: plus_10_tools/email-*
  - split: hr
    path: plus_10_tools/hr-*
  - split: hybrid
    path: plus_10_tools/hybrid-*
  - split: itsm
    path: plus_10_tools/itsm-*
  - split: teams
    path: plus_10_tools/teams-*
- config_name: plus_15_tools
  data_files:
  - split: calendar
    path: plus_15_tools/calendar-*
  - split: csm
    path: plus_15_tools/csm-*
  - split: drive
    path: plus_15_tools/drive-*
  - split: email
    path: plus_15_tools/email-*
  - split: hr
    path: plus_15_tools/hr-*
  - split: hybrid
    path: plus_15_tools/hybrid-*
  - split: itsm
    path: plus_15_tools/itsm-*
  - split: teams
    path: plus_15_tools/teams-*
- config_name: plus_5_tools
  data_files:
  - split: calendar
    path: plus_5_tools/calendar-*
  - split: csm
    path: plus_5_tools/csm-*
  - split: drive
    path: plus_5_tools/drive-*
  - split: email
    path: plus_5_tools/email-*
  - split: hr
    path: plus_5_tools/hr-*
  - split: hybrid
    path: plus_5_tools/hybrid-*
  - split: itsm
    path: plus_5_tools/itsm-*
  - split: teams
    path: plus_5_tools/teams-*
---
<div align="center">
<h1><img src="assets/csmgym.png" alt="Logo" width="48" style="vertical-align:middle; margin-right:8px;" /> EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</h1>
<p>
<a href="https://enterpriseops-gym.github.io/"><img src="https://img.shields.io/badge/Website-blue?logo=google-chrome&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2603.13594"><img src="https://img.shields.io/badge/Paper-red?logo=arxiv&logoColor=white" /></a>
<a href="https://github.com/ServiceNow/EnterpriseOps-Gym"><img src="https://img.shields.io/badge/GitHub-black?logo=github" /></a>
</p>
<p><i>EnterpriseOps-Gym is a containerized, resettable enterprise simulation benchmark for evaluating LLM agents on stateful, multi-step planning and tool use across realistic enterprise workflows</i></p>
</div>
<div align="center"><img src="assets/teaser.png" alt="EnterpriseOps-Gym Overview" width="80%" /></div>
## About
**EnterpriseOps-Gym** is a large-scale benchmark for evaluating the agentic planning and tool-use capabilities of LLM agents across enterprise operations. It comprises **1,150 expert-curated tasks** spanning **8 enterprise domains**, each running against live containerized MCP servers backed by realistic, fully synthetic databases.
Unlike static QA benchmarks, EnterpriseOps-Gym evaluates agents on **final environment state** using SQL verifiers, so agents are rewarded for achieving the correct outcome rather than for following a rigid action sequence. Tasks require long-horizon multi-step reasoning, strict policy compliance, and precise tool invocation under complex data dependencies.
> **Best model performance: 34.1% success rate** - leaving significant headroom for future research.
## Key Features
- 🛠️ **512 tools** across 8 enterprise domains
- 🗄️ **164 database tables** with avg 1.7 foreign-key dependencies per table
- 🔢 **9.15 avg steps** per task (up to 34), with **5.3 avg verification conditions**
- 📏 **89k avg context length** per task
- 🔒 Tasks enforce **access control, policy compliance, and referential integrity**
- ✅ Evaluation is **outcome-based** via executable SQL verifiers — not action-sequence matching
- 🐳 Fully **containerized** sandbox — reproducible and isolated per task run
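
To make the outcome-based idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `tickets` table, the `VERIFIER_SQL` query, and the `run_verifier` helper are hypothetical illustrations; they are not the benchmark's actual schema or verifier format.

```python
import sqlite3

def run_verifier(conn, query, expected):
    """Return True when the first column of the query's first row equals `expected`."""
    row = conn.execute(query).fetchone()
    return row is not None and row[0] == expected

# Hypothetical environment state: a single open ticket with no assignee.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, assignee TEXT)")
conn.execute("INSERT INTO tickets (id, status, assignee) VALUES (1, 'open', NULL)")

# Hypothetical verifier: exactly one row must match the desired final state.
VERIFIER_SQL = (
    "SELECT COUNT(*) FROM tickets "
    "WHERE id = 1 AND status = 'resolved' AND assignee = 'alice'"
)

passed_before = run_verifier(conn, VERIFIER_SQL, 1)

# The agent may take any path to the goal; here it issues two separate updates.
conn.execute("UPDATE tickets SET assignee = 'alice' WHERE id = 1")
conn.execute("UPDATE tickets SET status = 'resolved' WHERE id = 1")

passed_after = run_verifier(conn, VERIFIER_SQL, 1)
print(passed_before, passed_after)
```

Because the check reads only the final table contents, a single combined `UPDATE` would pass the same verifier. That is the sense in which the evaluation does not depend on the action sequence.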
## Evaluation Framework
The evaluation code is available at [ServiceNow/EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym).
The framework supports:
- **Multiple orchestrators**: ReAct, Planner-ReAct, Decomposing Planner
- **Multiple LLM providers**: Anthropic, OpenAI, Azure OpenAI, Google Gemini, DeepSeek, vLLM, and more
- **Parallel execution** via [Ray](https://www.ray.io/) for large-scale runs
- **Automatic scoring** with per-task and per-mode breakdowns
```python
from datasets import load_dataset
ds = load_dataset("ServiceNow-AI/EnterpriseOps-Gym", "oracle", split="teams")
```
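
The snippet above loads a single subset. As a rough sketch, the full grid of modes and domains declared in the dataset card can be enumerated as follows. The helper names `iter_subsets` and `load_subset` are assumptions made for illustration, and `load_subset` needs the `datasets` package and network access, so it is defined but not called here.

```python
from itertools import product

# Config and split names as listed in the dataset card metadata.
CONFIGS = ["oracle", "plus_5_tools", "plus_10_tools", "plus_15_tools"]
SPLITS = ["calendar", "csm", "drive", "email", "hr", "hybrid", "itsm", "teams"]

def iter_subsets(configs=CONFIGS, splits=SPLITS):
    """Yield every (config, split) pair."""
    yield from product(configs, splits)

def load_subset(config, split):
    """Load one subset; imports `datasets` lazily so the grid can be built offline."""
    from datasets import load_dataset
    return load_dataset("ServiceNow-AI/EnterpriseOps-Gym", config, split=split)

pairs = list(iter_subsets())
print(len(pairs), pairs[0])
```

Calling `load_subset(*pairs[0])` would then fetch the `oracle` / `calendar` subset.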
## Domain Information
The dataset is organized by **domain** (split) and **mode** (configuration subset).
### Domains
| Domain | Tasks | Avg Steps | Max Steps | Tools |
|--------|------:|----------:|----------:|------:|
| Calendar | 100 | 7.05 | 17 | 37 |
| CSM | 186 | 12.10 | 27 | 89 |
| Drive | 105 | 8.68 | 29 | 55 |
| Email | 104 | 6.25 | 22 | 79 |
| HR | 184 | 10.54 | 34 | 89 |
| ITSM | 181 | 9.00 | 31 | 93 |
| Teams | 100 | 9.41 | 18 | 70 |
| Hybrid | 155 | 7.79 | 19 | Multi-domain |
| **Total** | **1,115** | **9.15** | **34** | **512** |
### Modes (Tool-Set Configurations)
Each mode controls the set of tools exposed to the agent, simulating realistic tool-retrieval scenarios:
| Mode | Description |
|------|-------------|
| `oracle` | Only the exact tools needed for the task |
| `plus_5_tools` | Oracle tools + 5 randomly sampled distractor tools |
| `plus_10_tools` | Oracle tools + 10 randomly sampled distractor tools |
| `plus_15_tools` | Oracle tools + 15 randomly sampled distractor tools |
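
The sketch below illustrates what "oracle tools + N randomly sampled distractor tools" could look like in code. It is an assumption-laden illustration: the function `build_tool_set`, the tool names, the fixed seed, and the choice to sample from the pool minus the oracle tools are all hypothetical, and the benchmark's actual sampling procedure is not described in this card.

```python
import random

def build_tool_set(oracle_tools, tool_pool, num_distractors, seed=0):
    """Return the oracle tools followed by `num_distractors` sampled distractors."""
    # Distractors come only from tools the task does not need.
    candidates = sorted(set(tool_pool) - set(oracle_tools))
    if num_distractors > len(candidates):
        raise ValueError("not enough non-oracle tools to sample from")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return list(oracle_tools) + rng.sample(candidates, num_distractors)

# Hypothetical tool names, for illustration only.
oracle = ["create_event", "list_events"]
pool = oracle + [f"tool_{i}" for i in range(20)]

plus_5 = build_tool_set(oracle, pool, 5)
print(len(plus_5))
```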
## Field Descriptions
Each row in the dataset corresponds to one task instance and contains the following fields:
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `string` | Unique identifier for the task |
| `domain` | `string` | Domain name (e.g., `teams`, `csm`, `hr`) |
| `system_prompt` | `string` | Agent role definition and domain-specific policies |
| `user_prompt` | `string` | Natural language task instruction |
| `verifiers` | `string` (JSON) | Array of SQL-based outcome verification scripts that check final environment state |
| `gym_servers_config` | `string` (JSON) | MCP server configuration(s) specifying which containerized gym server(s) to connect to |
| `selected_tools` | `list[string]` | Names of tools available to the agent in this mode |
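
Since `verifiers` and `gym_servers_config` are stored as JSON strings, they need decoding before use. A minimal sketch follows. The `decode_task` helper and the contents of `example_row` are made up for illustration, and the real inner structure of these two fields may differ.

```python
import json

JSON_FIELDS = ("verifiers", "gym_servers_config")

def decode_task(row):
    """Return a copy of `row` with the JSON-encoded string fields parsed."""
    task = dict(row)
    for field in JSON_FIELDS:
        task[field] = json.loads(task[field])
    return task

# Hypothetical row; the field names follow the table above, the values are invented.
example_row = {
    "task_id": "demo-001",
    "domain": "teams",
    "system_prompt": "You are a collaboration assistant.",
    "user_prompt": "Rename the channel.",
    "verifiers": json.dumps([{"query": "SELECT 1"}]),
    "gym_servers_config": json.dumps([{"name": "teams"}]),
    "selected_tools": ["rename_channel"],
}

task = decode_task(example_row)
print(type(task["verifiers"]).__name__, task["selected_tools"])
```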
## Example Use Cases
**EnterpriseOps-Gym** can be used for:
- **Benchmarking LLM agents** on realistic enterprise workflows across IT, HR, CRM, and collaboration domains
- **Evaluating tool-use and planning** under long-horizon, multi-step, policy-constrained settings
- **Studying tool retrieval robustness** by comparing oracle vs. distractor-augmented tool modes
- **Developing new orchestration strategies** — the framework natively supports ReAct, Planner-ReAct, and Decomposing Planner
- **Studying failure modes** of state-of-the-art models on high-complexity enterprise tasks (best model: 34.1%)
- **Extending the benchmark** with new domains, tasks, or verifiers using the released Docker sandbox infrastructure
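
For the tool-retrieval comparison mentioned above, one simple approach is to aggregate pass/fail outcomes per mode. The sketch below assumes results are available as `(mode, success)` pairs. The `success_rate_by_mode` helper and the sample records are invented for illustration; they are not benchmark results and not the evaluation framework's output format.

```python
from collections import defaultdict

def success_rate_by_mode(results):
    """Map each mode to the fraction of its task runs that succeeded."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for mode, success in results:
        totals[mode] += 1
        passed[mode] += int(success)
    return {mode: passed[mode] / totals[mode] for mode in totals}

# Invented records, for illustration only.
results = [
    ("oracle", True),
    ("oracle", False),
    ("plus_5_tools", False),
    ("plus_5_tools", False),
]

rates = success_rate_by_mode(results)
print(rates)
```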
## Citation
```bibtex
@misc{malay2026enterpriseopsgymenvironmentsevaluationsstateful,
title={EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings},
author={Shiva Krishna Reddy Malay and Shravan Nayak and Jishnu Sethumadhavan Nair and Sagar Davasam and Aman Tiwari and Sathwik Tejaswi Madhusudhan and Sridhar Krishna Nemala and Srinivas Sunkara and Sai Rajeswar},
year={2026},
eprint={2603.13594},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.13594},
}
```
Provider: ServiceNow-AI
## Dataset Introduction
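### Construction
EnterpriseOps-Gym is built on an in-depth simulation of real enterprise operations. It covers eight core enterprise domains (calendar, customer service management, cloud storage, email, human resources, IT service management, team collaboration, and cross-domain hybrid) and contains 1,150 tasks designed by domain experts. The dataset uses containerized sandboxes: each task runs against MCP servers backed by realistic synthetic databases, with a resettable environment that keeps experiments reproducible. Task data is organized into structured fields such as the system prompt, user instruction, verifiers, and tool-set configuration. The verifiers are executable SQL scripts that score an agent on the final environment state rather than on matching an action sequence. The configuration modes comprise `oracle` (only the required tools) and variants that add 5, 10, or 15 distractor tools, simulating planning difficulty under different tool-retrieval scenarios.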
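### Usage
EnterpriseOps-Gym is flexible to use and easy to extend. First, load the dataset with the Hugging Face library, e.g. `load_dataset('ServiceNow-AI/EnterpriseOps-Gym', 'oracle', split='teams')`, selecting task instances by domain (split) and mode (configuration subset). Each instance contains the task identifier, system prompt, user instruction, verifier scripts, and MCP server configuration, and can be used directly for agent inference and execution. The evaluation framework lives in the open-source repository. It supports several orchestration strategies (ReAct, Planner-ReAct, and Decomposing Planner) and connects to mainstream LLM providers such as Anthropic, OpenAI, Google Gemini, and DeepSeek. It uses Ray for large-scale parallel execution and automatically produces score reports broken down by task and by mode, which helps researchers analyze agent failure modes or develop new orchestration algorithms. In addition, users can add new domains, tasks, or verifiers through the Docker sandbox infrastructure.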
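## Background and Challenges
### Background Overview
Enterprise operations have long been held back by low levels of process automation and the difficulty of coordinating multiple systems. In 2026, a ServiceNow research team including Shiva Krishna Reddy Malay and colleagues released the EnterpriseOps-Gym dataset. It focuses on evaluating the planning and tool-calling abilities of large language models (LLMs) in complex enterprise scenarios, covering 8 core domains such as calendar, human resources, and IT service management, with 1,115 expert-designed tasks. The dataset relies on containerized sandbox environments and live databases, and uses SQL verifiers to assess agents by final environment state, moving beyond the limits of traditional static QA benchmarks. Its release is significant for research on enterprise-grade intelligent automation and gives industry and academia a standardized evaluation framework.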
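### Current Challenges
The core challenge EnterpriseOps-Gym addresses stems from the inherent complexity of enterprise operations: agents must complete long-horizon, multi-step reasoning averaging 9.15 steps (up to 34) in an environment with 512 tools and 164 database tables, while strictly complying with access-control and policy requirements. In building the dataset, the team faced several technical difficulties: designing highly realistic synthetic databases while maintaining complex referential-integrity constraints, developing executable SQL-based verifiers for outcome-oriented evaluation, and simulating tool-retrieval noise under tool-set configurations such as `oracle` and `plus_5_tools`. The best current model reaches only a 34.1% success rate, which shows how much room for research and improvement remains.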
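## Common Scenarios
### Classic Usage Scenarios
EnterpriseOps-Gym provides a benchmark platform for agent planning and tool use in enterprise operations. The dataset covers eight core business domains, from calendar management, customer service, and document collaboration to human resources and IT service management, with 1,115 expert-designed task instances in total. Each task is a complex workflow that requires multi-step reasoning, strict adherence to enterprise policy, and precise tool calls. Four tool-set configuration modes (`oracle` plus the distractor-augmented modes) let researchers systematically evaluate how well LLM agents plan and execute in multi-tool environments.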
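### Academic Problems Addressed
The dataset addresses a key challenge in evaluating LLM agents: the lack of dynamic, multi-step benchmarks that reflect the complexity of real enterprise environments. Traditional static QA evaluation cannot capture an agent's state reasoning over long tasks, its policy compliance, or its tool selection. By introducing outcome-oriented evaluation based on SQL verification, EnterpriseOps-Gym requires agents to reach the correct final environment state rather than mechanically imitate an action sequence. This provides a standardized evaluation framework for studying long-horizon planning, tool-calling robustness, and decision-making under policy constraints. The best model's success rate of only 34.1% shows how much room current techniques have to improve.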
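### Practical Applications
In practice, the dataset can support evaluation and tuning during the development of enterprise assistant systems. In IT service management, for example, it can be used to assess how well an agent handles ticket assignment, permission approval, and cross-system collaboration. In HR scenarios, it can verify the accuracy of an agent carrying out employee onboarding, attendance management, and policy lookup. The built-in containerized sandbox also supports safely isolated, repeatable experiments, giving enterprises a reliable engineering basis for risk assessment and capability testing before they deploy an automated operations assistant.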
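## Recent Research on the Dataset
### Latest Research Directions
Recent research around EnterpriseOps-Gym focuses on evaluating and improving the agentic planning and tool-use abilities of large language models (LLMs) in large-scale enterprise settings, particularly with respect to state dependence, multi-step reasoning, and complex policy constraints. With 512 tools covering 8 business domains (including calendar, CSM, HR, and ITSM) and a final-state evaluation mechanism based on SQL verifiers, the benchmark moves beyond traditional static QA benchmarks. Active topics include the robustness of orchestration strategies such as ReAct and Planner-ReAct on long task sequences (9.15 steps on average, up to 34) and the effect of additional tools (the `plus_5_tools` through `plus_15_tools` modes) on agent behavior. The best current model reaches only a 34.1% success rate. This underscores the importance of improving agents' decision reliability and adaptability in highly complex, policy-strict environments such as enterprise automation, IT operations management, and cross-department collaboration, and leaves wide room for further research.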
The content above was collected and summarized by 遇见数据集.



