Justin1233/RobustBench-TC
Source: Hugging Face | Updated: 2026-04-11 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Justin1233/RobustBench-TC
Resource description:
---
annotations_creators:
- expert-generated
- machine-generated
language:
- en
license: apache-2.0
multilinguality: monolingual
pretty_name: "RobustBench-TC: Unified Perturbation Benchmark for Tool-Calling Agents"
size_categories:
- 10K<n<100K
source_datasets:
- BFCL V3
- API-Bank
- ToolAlpaca
- RoTBench
- ToolEyes
- ACEBench
tags:
- tool-calling
- function-calling
- robustness
- perturbation
- benchmark
- agents
- MDP
- reinforcement-learning
task_categories:
- text-generation
task_ids:
- text2text-generation
dataset_info:
  features:
  - name: id
    dtype: string
  - name: benchmark
    dtype: string
  - name: category
    dtype: string
  - name: perturbation
    struct:
    - name: type
      dtype: string
    - name: mdp_category
      dtype: string
    - name: variant
      dtype: string
  - name: conversation
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: tools
    list:
    - name: name
      dtype: string
    - name: description
      dtype: string
    - name: parameters
      dtype: string
  - name: golden_answers
    list:
    - name: name
      dtype: string
    - name: parameters
      dtype: string
  - name: eval_config
    struct:
    - name: method
      dtype: string
  splits:
  - name: bfcl_v3
    num_examples: 36354
  - name: apibank
    num_examples: 10074
  - name: acebench
    num_examples: 15191
  - name: tooleyes
    num_examples: 4579
  - name: toolalpaca
    num_examples: 1938
  - name: rotbench
    num_examples: 1785
configs:
- config_name: all
  data_files: "unified_benchmark/**/*.jsonl"
- config_name: bfcl_v3
  data_files: "unified_benchmark/bfcl_v3/**/*.jsonl"
- config_name: apibank
  data_files: "unified_benchmark/apibank/**/*.jsonl"
- config_name: acebench
  data_files: "unified_benchmark/acebench/**/*.jsonl"
- config_name: toolalpaca
  data_files: "unified_benchmark/toolalpaca/**/*.jsonl"
- config_name: rotbench
  data_files: "unified_benchmark/rotbench/**/*.jsonl"
- config_name: tooleyes
  data_files: "unified_benchmark/tooleyes/**/*.jsonl"
- config_name: eval
  data_files: "datasets/api_eval/*.jsonl"
- config_name: train_toolrl
  data_files: "datasets/train_toolrl/*.jsonl"
---
# RobustBench-TC: Unified Perturbation Benchmark for Tool-Calling Agents
## Overview
RobustBench-TC is the first systematic robustness benchmark for tool-calling AI agents.
It formalizes tool use as a Markov Decision Process (S, O, A, T, R) and applies 22
perturbation operators across 4 MDP categories (observation, action, transition, reward).
**69,921 total samples** across 6 source benchmarks.
## Repository Structure
This dataset contains three components:
### 1. Full Benchmark (`unified_benchmark/`)
69,921 samples across 6 benchmarks with 22 perturbation types. Use for comprehensive evaluation.
### 2. Eval Subset (`datasets/api_eval/`)
Lightweight evaluation set for API-based testing (e.g., GPT-4o, Claude).
- **200 unique IDs** sampled from 5 benchmarks (excludes ACEBench)
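- 1 clean + up to 16 static perturbations per ID = **3,145 static samples** (below the 200 × 17 = 3,400 maximum, so some ID/perturbation pairs are absent)
- 6 transition types via runtime injection, one call per ID per type = **1,200 additional API calls**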
- **Total: ~4,345 API calls** for full evaluation
| Benchmark | IDs |
|-----------|-----|
| BFCL V3 | 32 |
| API-Bank | 74 |
| ToolEyes | 51 |
| ToolAlpaca| 22 |
| RoTBench | 21 |
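The call budget quoted above can be re-derived from the per-benchmark ID counts. The sketch below is only a bookkeeping check; the function and variable names are illustrative and not part of the repository.

```python
# Per-benchmark ID counts copied from the table above.
EVAL_IDS = {"BFCL V3": 32, "API-Bank": 74, "ToolEyes": 51, "ToolAlpaca": 22, "RoTBench": 21}
STATIC_SAMPLES = 3_145   # clean + static perturbation samples, as stated in this card
TRANSITION_TYPES = 6     # injected at runtime, one call per ID per type


def eval_call_budget(ids_per_benchmark, static_samples, transition_types):
    """Return the number of IDs, runtime-injected calls, and total API calls."""
    total_ids = sum(ids_per_benchmark.values())
    transition_calls = total_ids * transition_types
    return {
        "ids": total_ids,
        "transition_calls": transition_calls,
        "total_calls": static_samples + transition_calls,
    }


budget = eval_call_budget(EVAL_IDS, STATIC_SAMPLES, TRANSITION_TYPES)
print(budget)  # {'ids': 200, 'transition_calls': 1200, 'total_calls': 4345}
```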
### 3. Training Data (`datasets/train_toolrl/`)
4,000 samples sourced from ToolRL's training set (ToolACE 2000 + Hammer Masked 1000 + xLAM 1000),
converted to UnifiedSample format with perturbations applied.
Three experiment groups for controlled comparison:
| Group | File | Content | Purpose |
|-------|------|---------|---------|
| **A** | `clean.jsonl` | 4,000 clean samples | ToolRL baseline reproduction |
| **B** | `perturbed.jsonl` | 4,000 perturbed samples | Our method |
| **C** | `mixed.jsonl` | 2,000 clean + 2,000 perturbed | Ablation |
Group B perturbation distribution (tool-calling samples only, 3,518 in total):
- **Reward (60%)**: CD, TD, CD_NT, TD_NT, CD_AB, TD_AB (2,111 samples)
- **Observation (25%)**: realistic_typos, query_paraphrase, paraphrase_tool_description, paraphrase_parameter_description (880 samples)
- **Action (15%)**: same_name A-E, redundant (527 samples)
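As a quick consistency check, the rounded percentages can be recomputed from the three sample counts. This is a minimal sketch; the helper name is illustrative.

```python
# Group B sample counts per MDP category, copied from the list above.
GROUP_B_COUNTS = {"reward": 2111, "observation": 880, "action": 527}


def category_shares(counts):
    """Return the total and each category's share in percent, rounded to one decimal."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = category_shares(GROUP_B_COUNTS)
print(total)   # 3518
print(shares)  # {'reward': 60.0, 'observation': 25.0, 'action': 15.0}
```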
## MDP-Based Perturbation Taxonomy
| MDP Category | Perturbation Types | Count | Method |
|---------------|-------------------|-------|----------|
| Observation | realistic_typos, query_paraphrase, paraphrase_tool_description, paraphrase_parameter_description | 4 | LLM |
| Action | same_name (A-E), redundant | 6 | Rule/LLM |
| Transition | timeout, rate_limit, auth_error, server_error, malformed_response, schema_drift | 6 | Runtime |
| Reward | CD, TD, CD_NT, TD_NT, CD_AB, TD_AB | 6 | Rule |
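The taxonomy table can be encoded as a small lookup, which is handy for grouping results by MDP category. This is a sketch rather than the benchmark's own code: the `same_name_A` to `same_name_E` spellings are an assumption about how the five variants are named, and the helper names are illustrative.

```python
# MDP category -> perturbation types, following the table above.
TAXONOMY = {
    "observation": [
        "realistic_typos",
        "query_paraphrase",
        "paraphrase_tool_description",
        "paraphrase_parameter_description",
    ],
    # Assumed spelling for the five same_name variants (A-E).
    "action": [f"same_name_{v}" for v in "ABCDE"] + ["redundant"],
    "transition": [
        "timeout", "rate_limit", "auth_error",
        "server_error", "malformed_response", "schema_drift",
    ],
    "reward": ["CD", "TD", "CD_NT", "TD_NT", "CD_AB", "TD_AB"],
}

# Inverted view: perturbation type -> MDP category.
CATEGORY_OF = {ptype: cat for cat, ptypes in TAXONOMY.items() for ptype in ptypes}


def static_types(taxonomy):
    """Types applied offline; transition faults are injected at runtime instead."""
    return [p for cat, ptypes in taxonomy.items() if cat != "transition" for p in ptypes]


print(len(CATEGORY_OF))             # 22
print(len(static_types(TAXONOMY)))  # 16
print(CATEGORY_OF["schema_drift"])  # transition
```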
## Source Benchmarks
| Benchmark | Samples | Eval Method | Turn Type |
|-----------|---------|-------------|-----------|
| BFCL V3 | 36,354 | exact_match | single+multi |
| ACEBench | 15,191 | exact_match | single+multi |
| API-Bank | 10,074 | exact_match | single |
| ToolEyes | 4,579 | gpt_judge | single |
| ToolAlpaca| 1,938 | gpt_judge | single |
| RoTBench | 1,785 | rule_based | single |
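Because `gpt_judge` evaluation needs extra model calls, it can help to know how many samples fall under each evaluation method. The sketch below groups the table's counts by method; the dictionary keys reuse the config names from the front matter, and the helper name is illustrative.

```python
# (sample count, eval method) per benchmark, copied from the table above.
BENCHMARKS = {
    "bfcl_v3": (36_354, "exact_match"),
    "acebench": (15_191, "exact_match"),
    "apibank": (10_074, "exact_match"),
    "tooleyes": (4_579, "gpt_judge"),
    "toolalpaca": (1_938, "gpt_judge"),
    "rotbench": (1_785, "rule_based"),
}


def samples_by_method(benchmarks):
    """Sum sample counts per evaluation method."""
    totals = {}
    for count, method in benchmarks.values():
        totals[method] = totals.get(method, 0) + count
    return totals


totals = samples_by_method(BENCHMARKS)
print(totals)                # {'exact_match': 61619, 'gpt_judge': 6517, 'rule_based': 1785}
print(sum(totals.values()))  # 69921
```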
## Key Findings
| MDP Category | Avg Accuracy Drop | Most Severe |
|---|---|---|
| Transition | 33.73% | timeout (33.73%) |
| Reward | 28.71% | CD_AB (37.82%) |
| Observation | 4.85% | paraphrase (8.23%) |
| Action | 1.18% | redundant (5.68%) |
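Transition faults show the largest drop in the table above, and they are the one category applied at runtime rather than stored in files. The sketch below shows one way such injection could work: a wrapper replaces the first response of a tool with an error payload. It is not the benchmark's implementation; the payload shapes, the `with_transition_fault` helper, and the stand-in `calculate_area` tool are all assumptions made for illustration.

```python
import json

# Hypothetical fault payloads; the benchmark's real error formats are not shown in this card.
FAULTS = {
    "timeout": {"error": "timeout", "message": "Request timed out"},
    "rate_limit": {"error": "rate_limit", "message": "Too many requests"},
    "auth_error": {"error": "auth_error", "message": "Invalid credentials"},
    "server_error": {"error": "server_error", "message": "Internal server error"},
}


def with_transition_fault(tool_fn, fault_type, fail_first_n=1):
    """Wrap tool_fn so that its first `fail_first_n` calls return an injected fault."""
    if fault_type not in FAULTS:
        raise ValueError(f"unknown fault type: {fault_type}")
    calls = {"n": 0}

    def wrapped(**kwargs):
        calls["n"] += 1
        if calls["n"] <= fail_first_n:
            return json.dumps(FAULTS[fault_type])
        return tool_fn(**kwargs)

    return wrapped


def calculate_area(base, height):
    """Stand-in tool (triangle area) used only for this demonstration."""
    return json.dumps({"area": 0.5 * base * height})


flaky = with_transition_fault(calculate_area, "timeout")
first = json.loads(flaky(base=10, height=5))   # injected fault
second = json.loads(flaky(base=10, height=5))  # real tool result
print(first["error"], second["area"])  # timeout 25.0
```

An agent that retries after the first response would recover here, which is the kind of behavior a transition perturbation is meant to probe.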
## Data Format
Each sample is a JSON line (JSONL):
```json
{
  "id": "bfcl_v3__BFCL_v3_simple__simple_0",
  "benchmark": "bfcl_v3",
  "category": "simple",
  "perturbation": {
    "type": "realistic_typos",
    "mdp_category": "observation"
  },
  "conversation": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "tools": [
    {"name": "calculate_area", "description": "...", "parameters": {...}}
  ],
  "golden_answers": [
    {"name": "calculate_area", "parameters": {"base": 10, "height": 5}}
  ],
  "eval_config": {"method": "exact_match"}
}
```
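The front matter declares `parameters` as a string, while the example above shows it as an object, so a consumer may want to accept both. The sketch below parses one JSONL line and compares a predicted call against `golden_answers`. The order-insensitive comparison is an assumed reading of `exact_match`, not the benchmark's documented scorer, and both helper names are illustrative.

```python
import json


def normalize_call(call):
    """Return a tool call with `parameters` as a dict, decoding JSON strings if needed."""
    params = call.get("parameters", {})
    if isinstance(params, str):
        params = json.loads(params) if params else {}
    return {"name": call["name"], "parameters": params}


def exact_match(predicted, golden):
    """Compare two lists of tool calls, ignoring call order and key order (assumed semantics)."""
    def key(call):
        return json.dumps(normalize_call(call), sort_keys=True)
    return sorted(map(key, predicted)) == sorted(map(key, golden))


# Build one JSONL line shaped like the example above, with string-encoded parameters.
line = json.dumps({
    "id": "bfcl_v3__BFCL_v3_simple__simple_0",
    "perturbation": {"type": "realistic_typos", "mdp_category": "observation"},
    "golden_answers": [
        {"name": "calculate_area", "parameters": json.dumps({"base": 10, "height": 5})}
    ],
    "eval_config": {"method": "exact_match"},
})

record = json.loads(line)
prediction = [{"name": "calculate_area", "parameters": {"height": 5, "base": 10}}]
matched = exact_match(prediction, record["golden_answers"])
print(matched)  # True
```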
## Usage
### Load full benchmark
```python
from datasets import load_dataset
ds = load_dataset("Justin1233/RobustBench-TC", "all")
```
### Load a single benchmark
```python
from datasets import load_dataset
ds = load_dataset("Justin1233/RobustBench-TC", "bfcl_v3")
```
### Load eval subset
```python
from datasets import load_dataset
ds = load_dataset("Justin1233/RobustBench-TC", "eval")
```
### Load training data
```python
from datasets import load_dataset
ds = load_dataset("Justin1233/RobustBench-TC", "train_toolrl")
perturbed = ds["perturbed"] # 4000 perturbed samples for GRPO
```
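Once samples are loaded, they can be sliced by MDP category or tallied by perturbation type. The sketch below works on plain dictionaries so that it runs offline; the rows are stand-ins carrying only the fields used here, and the helper names are illustrative.

```python
from collections import Counter


def in_category(example, category):
    """True if the sample's perturbation belongs to the given MDP category."""
    return example["perturbation"]["mdp_category"] == category


def count_by_type(rows):
    """Tally samples per perturbation type."""
    return Counter(row["perturbation"]["type"] for row in rows)


# Stand-in rows; real rows carry the full schema from the Data Format section.
rows = [
    {"id": "a", "perturbation": {"type": "realistic_typos", "mdp_category": "observation"}},
    {"id": "b", "perturbation": {"type": "CD", "mdp_category": "reward"}},
    {"id": "c", "perturbation": {"type": "CD", "mdp_category": "reward"}},
]

reward_rows = [row for row in rows if in_category(row, "reward")]
type_counts = count_by_type(rows)
print([row["id"] for row in reward_rows])  # ['b', 'c']
print(type_counts["CD"])                   # 2
```

With a loaded `datasets.Dataset`, the same predicate could likely be passed to `Dataset.filter`, for example `ds.filter(lambda ex: in_category(ex, "reward"))`, assuming the `perturbation` struct is exposed as a dictionary.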
### Run evaluation CLI
```bash
# Full eval on GPT-4o (~4,345 API calls)
python scripts/run_eval.py --model gpt-4o --api-key $OPENAI_API_KEY
# Dry run to check cost
python scripts/run_eval.py --model gpt-4o --dry-run
# Specific perturbations only
python scripts/run_eval.py --model gpt-4o --perturbations clean CD TD realistic_typos
```
## Directory Structure
```
.
├── unified_benchmark/ # Full benchmark (69,921 samples)
│ ├── bfcl_v3/
│ │ ├── baseline/ (17 JSONL files)
│ │ ├── observation/ (4 perturbation types)
│ │ ├── action/ (6 perturbation types)
│ │ ├── reward/ (6 perturbation types)
│ │ └── transition/ (runtime marker)
│ ├── acebench/ (same structure)
│ ├── apibank/ (same structure)
│ ├── toolalpaca/ (same structure)
│ ├── rotbench/ (same structure)
│ ├── tooleyes/ (same structure)
│ └── manifest.json
│
├── datasets/
│ ├── api_eval/ # Eval subset (3,145 samples)
│ │ ├── clean.jsonl (200 samples)
│ │ ├── realistic_typos.jsonl ... TD_AB.jsonl
│ │ └── (16 perturbation files)
│ │
│ ├── train_toolrl/ # Training data (ToolRL-sourced)
│ │ ├── clean.jsonl (4,000 — baseline reproduction)
│ │ ├── perturbed.jsonl (4,000 — our method)
│ │ ├── mixed.jsonl (4,000 — ablation)
│ │ └── metadata.json
│ │
│ └── metadata.json
│
├── robustbench_tc.py # HuggingFace loader
├── scripts/
│ ├── run_eval.py # Evaluation CLI
│ ├── build_eval_and_train_datasets.py
│ └── build_train_from_toolrl.py
└── dataset_card.yaml
```
## Citation
```bibtex
@inproceedings{robustbench-tc2026,
title={RobustBench-TC: A Unified Perturbation Benchmark for Tool-Calling Agents},
year={2026},
}
```
## License
Apache 2.0
Provider:
Justin1233
Summary
Dataset Introduction

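Construction
RobustBench-TC is built by a systematic procedure. It integrates six source benchmarks, including BFCL V3, API-Bank, and ToolAlpaca, for a total of 69,921 samples. Its central idea is to formalize tool calling as a Markov Decision Process and to design 22 perturbation operators across four categories: observation, action, transition, and reward. Annotations are both expert-generated and machine-generated. The data is organized by purpose into the full unified benchmark, a lightweight evaluation subset, and a training subset for reinforcement learning.
Usage
The dataset can be loaded with the Hugging Face `datasets` library, either as the whole benchmark or one source benchmark at a time. The bundled evaluation CLI runs automated tests against models such as GPT-4o, supports a dry run for cost estimation, and can restrict evaluation to selected perturbation types. For training, the dedicated subset provides clean, perturbed, and mixed versions, which serve baseline reproduction, validation of the new method, and ablation respectively. This modular layout supports both large-scale benchmarking and targeted model training.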
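Background and Challenges
Background
RobustBench-TC was released in 2026 by Justin1233 and collaborators to systematically evaluate the robustness of tool-calling agents. It formalizes tool calling as a Markov Decision Process, integrates 69,921 samples from six benchmarks such as BFCL V3 and API-Bank, and defines 22 perturbation operators covering the four MDP dimensions of observation, action, transition, and reward. Described as the first robustness benchmark for tool-calling agents, it offers a standardized framework for assessing agent reliability in realistic settings and supports quantitative research on embodied intelligence and tool learning.
Current Challenges
The dataset addresses the core challenge of evaluating the robustness of tool-calling agents in dynamic environments, at two levels. At the problem level, agents must cope with several kinds of interference, including semantic perturbations of tool descriptions, abnormal API responses, and misleading reward signals; accuracy drops by 33.73% on average under transition perturbations, which exposes how fragile current models are to runtime failures. At the construction level, the team had to unify data formats across benchmarks, design the perturbation operators systematically, and inject dynamic perturbations such as timeouts and rate limits at runtime; a standardized data schema and a runtime injection mechanism make results comparable across benchmarks.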
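Common Scenarios
Academic Problems Addressed
The dataset addresses the lack of a systematic robustness evaluation standard in research on tool-calling agents. By formalizing tool calling as a Markov Decision Process and defining a perturbation taxonomy over observation, action, transition, and reward, it provides a reproducible basis for measuring how models degrade under non-ideal conditions. It establishes a perturbation evaluation methodology for tool calling and shows that perturbation types differ in impact; for example, transition perturbations cause an average accuracy drop of 33.73%, which points to a key direction for later work on robustness.
Practical Applications
In deployment, RobustBench-TC serves as a validation tool for building reliable tool-calling systems. Teams can use the training subset for adversarial training so that models better handle noisy user input, API service failures, and ambiguous tool descriptions. In high-risk domains such as financial trading and medical diagnosis, the benchmark can help assess how well an agent tolerates network latency, authentication errors, or server failures, which supports stable and safe execution of complex workflows.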
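Recent Research
Latest Research Directions
Robustness evaluation is becoming a central topic in research on tool-calling agents. By formalizing tool calling as a Markov Decision Process and systematically introducing four categories of perturbation operators, RobustBench-TC provides a unified benchmark for measuring agent stability in complex environments. Current work focuses on using the benchmark to study how reinforcement learning policies generalize under perturbation, especially how performance degrades in extreme cases such as distorted reward functions and abnormal state transitions. This line of work aims to improve embodied agents' tolerance of noise and adversarial interference in deployment and provides an evaluation basis for building reliable and safe automated systems.
The above content was collected and summarized by 遇见数据集.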



