ZixuanKe/evovling_tools
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ZixuanKe/evovling_tools
Resource description:
---
license: apache-2.0
task_categories:
- other
language:
- en
tags:
- agents
- tool-use
- evolving-benchmark
- enterprise-ops
pretty_name: EnterpriseOps Evolving Tool Benchmark
configs:
- config_name: calendar_v1
  data_files:
  - split: train
    path: calendar/v1/train.jsonl
  - split: test
    path: calendar/v1/test.jsonl
- config_name: calendar_v2
  data_files:
  - split: train
    path: calendar/v2/train.jsonl
  - split: test
    path: calendar/v2/test.jsonl
- config_name: calendar_v3
  data_files:
  - split: train
    path: calendar/v3/train.jsonl
  - split: test
    path: calendar/v3/test.jsonl
- config_name: csm_v1
  data_files:
  - split: train
    path: csm/v1/train.jsonl
  - split: test
    path: csm/v1/test.jsonl
- config_name: csm_v2
  data_files:
  - split: train
    path: csm/v2/train.jsonl
  - split: test
    path: csm/v2/test.jsonl
- config_name: csm_v3
  data_files:
  - split: train
    path: csm/v3/train.jsonl
  - split: test
    path: csm/v3/test.jsonl
- config_name: csm_v4
  data_files:
  - split: train
    path: csm/v4/train.jsonl
  - split: test
    path: csm/v4/test.jsonl
- config_name: drive_v1
  data_files:
  - split: train
    path: drive/v1/train.jsonl
  - split: test
    path: drive/v1/test.jsonl
- config_name: drive_v2
  data_files:
  - split: train
    path: drive/v2/train.jsonl
  - split: test
    path: drive/v2/test.jsonl
- config_name: drive_v3
  data_files:
  - split: train
    path: drive/v3/train.jsonl
  - split: test
    path: drive/v3/test.jsonl
- config_name: email_v1
  data_files:
  - split: train
    path: email/v1/train.jsonl
  - split: test
    path: email/v1/test.jsonl
- config_name: email_v2
  data_files:
  - split: train
    path: email/v2/train.jsonl
  - split: test
    path: email/v2/test.jsonl
- config_name: email_v3
  data_files:
  - split: train
    path: email/v3/train.jsonl
  - split: test
    path: email/v3/test.jsonl
- config_name: email_v4
  data_files:
  - split: train
    path: email/v4/train.jsonl
  - split: test
    path: email/v4/test.jsonl
- config_name: email_v5
  data_files:
  - split: train
    path: email/v5/train.jsonl
  - split: test
    path: email/v5/test.jsonl
- config_name: email_v6
  data_files:
  - split: train
    path: email/v6/train.jsonl
  - split: test
    path: email/v6/test.jsonl
- config_name: hr_v1
  data_files:
  - split: train
    path: hr/v1/train.jsonl
  - split: test
    path: hr/v1/test.jsonl
- config_name: hr_v2
  data_files:
  - split: train
    path: hr/v2/train.jsonl
  - split: test
    path: hr/v2/test.jsonl
- config_name: hr_v3
  data_files:
  - split: train
    path: hr/v3/train.jsonl
  - split: test
    path: hr/v3/test.jsonl
- config_name: hr_v4
  data_files:
  - split: train
    path: hr/v4/train.jsonl
  - split: test
    path: hr/v4/test.jsonl
- config_name: hr_v5
  data_files:
  - split: train
    path: hr/v5/train.jsonl
  - split: test
    path: hr/v5/test.jsonl
- config_name: hybrid_v1
  data_files:
  - split: train
    path: hybrid/v1/train.jsonl
  - split: test
    path: hybrid/v1/test.jsonl
- config_name: hybrid_v2
  data_files:
  - split: train
    path: hybrid/v2/train.jsonl
  - split: test
    path: hybrid/v2/test.jsonl
- config_name: hybrid_v3
  data_files:
  - split: train
    path: hybrid/v3/train.jsonl
  - split: test
    path: hybrid/v3/test.jsonl
- config_name: hybrid_v4
  data_files:
  - split: train
    path: hybrid/v4/train.jsonl
  - split: test
    path: hybrid/v4/test.jsonl
- config_name: itsm_v1
  data_files:
  - split: train
    path: itsm/v1/train.jsonl
  - split: test
    path: itsm/v1/test.jsonl
- config_name: itsm_v2
  data_files:
  - split: train
    path: itsm/v2/train.jsonl
  - split: test
    path: itsm/v2/test.jsonl
- config_name: itsm_v3
  data_files:
  - split: train
    path: itsm/v3/train.jsonl
  - split: test
    path: itsm/v3/test.jsonl
- config_name: teams_v1
  data_files:
  - split: train
    path: teams/v1/train.jsonl
  - split: test
    path: teams/v1/test.jsonl
- config_name: teams_v2
  data_files:
  - split: train
    path: teams/v2/train.jsonl
  - split: test
    path: teams/v2/test.jsonl
- config_name: teams_v3
  data_files:
  - split: train
    path: teams/v3/train.jsonl
  - split: test
    path: teams/v3/test.jsonl
- config_name: teams_v4
  data_files:
  - split: train
    path: teams/v4/train.jsonl
  - split: test
    path: teams/v4/test.jsonl
---
# Evolving Tool Benchmark (EnterpriseOps-Gym v7)
Each domain ships as a sequence of versions `V1, V2, ..., VK` that simulate
a real-world tool universe growing over time:
- **Tools accumulate**: `C_1 ⊆ C_2 ⊆ ... ⊆ C_K` — each version adds new tools on top of the previous one.
- **Tasks are partitioned per stage** into `adapt` (used here as `train`, e.g. for in-context examples / fine-tuning) and `test` splits.
- **Frequency-driven anchoring** uses real co-occurrence statistics so early versions contain the most popular tools.
- The schedule is **adaptively** built to satisfy growth-rate and minimum task-count constraints.
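The accumulation property `C_1 ⊆ C_2 ⊆ ... ⊆ C_K` can be checked mechanically once the per-version tool universes are in hand. The following is a minimal sketch, not part of the benchmark code: the function name `first_nesting_violation` and the toy tool names are hypothetical, and the input is assumed to be one list of tool names per version, in version order.

```python
from typing import Iterable, Optional, Sequence


def first_nesting_violation(universes: Sequence[Iterable[str]]) -> Optional[int]:
    """Return the 0-based index k where C_k is not a subset of C_{k+1}.

    Returns None when every version's tool set is contained in the next one.
    """
    sets = [set(u) for u in universes]
    for k in range(len(sets) - 1):
        if not sets[k] <= sets[k + 1]:
            return k
    return None


# Toy universes with made-up tool names (not taken from the dataset).
toy_universes = [
    ["send_email"],
    ["send_email", "list_events"],
    ["send_email", "list_events", "create_ticket"],
]
print(first_nesting_violation(toy_universes))
```

On real data, each entry would presumably be the `cummulative_tools` value of any row from the corresponding version of one domain.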
## Layout
Each domain (`calendar`, `csm`, `drive`, `email`, `hr`, `hybrid`, `itsm`, `teams`)
has 3-6 versions. Each `(domain, version)` pair is a **config**, with `train`
and `test` splits:
```
<repo>/
├── calendar/
│   ├── v1/
│   │   ├── train.jsonl   # adapt tasks at V1
│   │   └── test.jsonl    # test tasks at V1
│   ├── v2/
│   └── v3/
├── csm/
│   ├── v1/ ... v4/
├── drive/
├── email/
├── hr/
├── hybrid/
├── itsm/
└── teams/
```
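The mapping from this layout to config names and file paths is regular, so it can be generated rather than typed out. The sketch below assumes the per-domain version counts listed in the card's `configs` metadata; the helper names `config_names` and `data_path` are illustrative and not part of any library.

```python
# Number of versions per domain, read off the `configs` list in the card metadata.
VERSIONS_PER_DOMAIN = {
    "calendar": 3,
    "csm": 4,
    "drive": 3,
    "email": 6,
    "hr": 5,
    "hybrid": 4,
    "itsm": 3,
    "teams": 4,
}


def config_names(domain: str) -> list:
    """List the config names for one domain, in version order."""
    return [f"{domain}_v{k}" for k in range(1, VERSIONS_PER_DOMAIN[domain] + 1)]


def data_path(config: str, split: str) -> str:
    """Map a config name and split to its file path inside the repo."""
    domain, version = config.rsplit("_", 1)
    return f"{domain}/{version}/{split}.jsonl"


for name in config_names("csm"):
    print(name, data_path(name, "train"), data_path(name, "test"))
```

Each generated name can be passed as the second argument to `load_dataset` to walk through a domain's versions in order.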
## Usage
```python
from datasets import load_dataset
# One config = one (domain, version) pair
ds = load_dataset("ZixuanKe/evovling_tools", "calendar_v1")
train_ds = ds["train"]
test_ds = ds["test"]
# Or load a single split directly:
train_ds = load_dataset("ZixuanKe/evovling_tools", "calendar_v1", split="train")
test_ds = load_dataset("ZixuanKe/evovling_tools", "csm_v3", split="test")
```
## Row schema
Every row contains the original task config plus metadata columns:
| field | type | description |
| --- | --- | --- |
| `domain` | str | one of `calendar, csm, drive, email, hr, hybrid, itsm, teams` |
| `version` | str | `v1`, `v2`, ... (1-indexed; matches the `V1, V2, ...` schedule in the source manifest) |
| `split` | str | `train` (=adapt) or `test` |
| `task_id` | str | original task id, stable across versions (use to join the same task at multiple stages) |
| `oracle_tools` | list[str] | minimal ground-truth tool list from the source `selected_tools` field (order and any duplicates preserved as-is) |
| `system_prompt` | str | system prompt for the agent |
| `user_prompt` | str | user request the agent must satisfy |
| `cummulative_tools` | list[str] | the **cumulative** tool universe `C_k` at the assigned stage (what the agent sees, includes distractors) |
| `mcp_endpoint` | str | MCP HTTP endpoint, e.g. `/mcp` |
| `gym_servers_config` | list[dict] | per-server MCP config (URL, seed DB, user info) |
| `verifiers` | list[dict] | DB-state / API-state verifiers used to grade the agent |
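Two things the schema makes easy are separating distractors from oracle tools and following one task across versions via `task_id`. The sketch below works on plain dict rows with the column names from the table above; the helper names and every value in the toy rows are made up for illustration.

```python
from collections import defaultdict


def distractor_tools(row: dict) -> list:
    """Tools visible to the agent that are not in the oracle list (order preserved)."""
    oracle = set(row["oracle_tools"])
    return [tool for tool in row["cummulative_tools"] if tool not in oracle]


def group_by_task(rows) -> dict:
    """Index rows as {task_id: {version: row}} so one task can be followed across stages."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["task_id"]][row["version"]] = row
    return dict(grouped)


# Toy rows with made-up values; only the column names come from the schema.
toy_rows = [
    {"task_id": "t1", "version": "v1", "oracle_tools": ["send_email"],
     "cummulative_tools": ["send_email", "list_events"]},
    {"task_id": "t1", "version": "v2", "oracle_tools": ["send_email"],
     "cummulative_tools": ["send_email", "list_events", "create_ticket"]},
]

for version, row in sorted(group_by_task(toy_rows)["t1"].items()):
    print(version, distractor_tools(row))
```

In this toy case the distractor list grows from one version to the next while the oracle tools stay fixed.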
## Evaluating an evolving agent
[TODO]
Provider:
ZixuanKe



