Justin1233/RobustBench-TC

Name: Justin1233/RobustBench-TC
Creator: Justin1233
Published: 2026-04-11 08:22:37
License: Apache 2.0 (per the dataset card)

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Justin1233/RobustBench-TC


Resource description:

---
annotations_creators:
- expert-generated
- machine-generated
language:
- en
license: apache-2.0
multilinguality: monolingual
pretty_name: "RobustBench-TC: Unified Perturbation Benchmark for Tool-Calling Agents"
size_categories:
- 10K<n<100K
source_datasets:
- BFCL V3
- API-Bank
- ToolAlpaca
- RoTBench
- ToolEyes
- ACEBench
tags:
- tool-calling
- function-calling
- robustness
- perturbation
- benchmark
- agents
- MDP
- reinforcement-learning
task_categories:
- text-generation
task_ids:
- text2text-generation
dataset_info:
  features:
  - name: id
    dtype: string
  - name: benchmark
    dtype: string
  - name: category
    dtype: string
  - name: perturbation
    struct:
    - name: type
      dtype: string
    - name: mdp_category
      dtype: string
    - name: variant
      dtype: string
  - name: conversation
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: tools
    list:
    - name: name
      dtype: string
    - name: description
      dtype: string
    - name: parameters
      dtype: string
  - name: golden_answers
    list:
    - name: name
      dtype: string
    - name: parameters
      dtype: string
  - name: eval_config
    struct:
    - name: method
      dtype: string
  splits:
  - name: bfcl_v3
    num_examples: 36354
  - name: apibank
    num_examples: 10074
  - name: acebench
    num_examples: 15191
  - name: tooleyes
    num_examples: 4579
  - name: toolalpaca
    num_examples: 1938
  - name: rotbench
    num_examples: 1785
configs:
- config_name: all
  data_files: "unified_benchmark/**/*.jsonl"
- config_name: bfcl_v3
  data_files: "unified_benchmark/bfcl_v3/**/*.jsonl"
- config_name: apibank
  data_files: "unified_benchmark/apibank/**/*.jsonl"
- config_name: acebench
  data_files: "unified_benchmark/acebench/**/*.jsonl"
- config_name: toolalpaca
  data_files: "unified_benchmark/toolalpaca/**/*.jsonl"
- config_name: rotbench
  data_files: "unified_benchmark/rotbench/**/*.jsonl"
- config_name: tooleyes
  data_files: "unified_benchmark/tooleyes/**/*.jsonl"
- config_name: eval
  data_files: "datasets/api_eval/*.jsonl"
- config_name: train_toolrl
  data_files: "datasets/train_toolrl/*.jsonl"
---

# RobustBench-TC: Unified Perturbation Benchmark for Tool-Calling Agents

## Overview

RobustBench-TC is the first systematic robustness benchmark for tool-calling AI agents, formalizing tool use as a Markov Decision Process (S, O, A, T, R) and applying 22 perturbation operators across 4 MDP categories.

**69,921 total samples** across 6 source benchmarks.

## Repository Structure

This dataset contains three components:

### 1. Full Benchmark (`unified_benchmark/`)

69,921 samples across 6 benchmarks with 22 perturbation types. Use for comprehensive evaluation.

### 2. Eval Subset (`datasets/api_eval/`)

Lightweight evaluation set for API-based testing (e.g., GPT-4o, Claude).

- **200 unique IDs** sampled from 5 benchmarks (excludes ACEBench)
- 1 clean + up to 16 static perturbations per ID, **3,145 static samples** in total
- 6 transition types via runtime injection on each of the 200 IDs = **1,200 additional API calls**
- **Total: ~4,345 API calls** for full evaluation

| Benchmark  | IDs |
|------------|-----|
| BFCL V3    | 32  |
| API-Bank   | 74  |
| ToolEyes   | 51  |
| ToolAlpaca | 22  |
| RoTBench   | 21  |

### 3. Training Data (`datasets/train_toolrl/`)

4,000 samples sourced from ToolRL's training set (ToolACE 2000 + Hammer Masked 1000 + xLAM 1000), converted to UnifiedSample format with perturbations applied.

Three experiment groups for controlled comparison:

| Group | File | Content | Purpose |
|-------|------|---------|---------|
| **A** | `clean.jsonl` | 4,000 clean samples | ToolRL baseline reproduction |
| **B** | `perturbed.jsonl` | 4,000 perturbed samples | Our method |
| **C** | `mixed.jsonl` | 2,000 clean + 2,000 perturbed | Ablation |

Group B perturbation distribution (tool-calling samples only):

- **Reward (60%)**: CD, TD, CD_NT, TD_NT, CD_AB, TD_AB (2,111 samples)
- **Observation (25%)**: realistic_typos, query_paraphrase, paraphrase_tool_description, paraphrase_parameter_description (880 samples)
- **Action (15%)**: same_name A-E, redundant (527 samples)

## MDP-Based Perturbation Taxonomy

| MDP Category | Perturbation Types | Count | Method |
|--------------|--------------------|-------|--------|
| Observation | realistic_typos, query_paraphrase, paraphrase_tool_description, paraphrase_parameter_description | 4 | LLM |
| Action | same_name (A-E), redundant | 6 | Rule/LLM |
| Transition | timeout, rate_limit, auth_error, server_error, malformed_response, schema_drift | 6 | Runtime |
| Reward | CD, TD, CD_NT, TD_NT, CD_AB, TD_AB | 6 | Rule |

## Source Benchmarks

| Benchmark  | Samples | Eval Method | Turn Type    |
|------------|---------|-------------|--------------|
| BFCL V3    | 36,354  | exact_match | single+multi |
| ACEBench   | 15,191  | exact_match | single+multi |
| API-Bank   | 10,074  | exact_match | single       |
| ToolEyes   | 4,579   | gpt_judge   | single       |
| ToolAlpaca | 1,938   | gpt_judge   | single       |
| RoTBench   | 1,785   | rule_based  | single       |

## Key Findings

| MDP Category | Avg Accuracy Drop | Most Severe |
|---|---|---|
| Transition | 33.73% | timeout (33.73%) |
| Reward | 28.71% | CD_AB (37.82%) |
| Observation | 4.85% | paraphrase (8.23%) |
| Action | 1.18% | redundant (5.68%) |

## Data Format

Each sample is a JSON line (JSONL):

```json
{
  "id": "bfcl_v3__BFCL_v3_simple__simple_0",
  "benchmark": "bfcl_v3",
  "category": "simple",
  "perturbation": {
    "type": "realistic_typos",
    "mdp_category": "observation"
  },
  "conversation": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "tools": [
    {"name": "calculate_area", "description": "...", "parameters": {...}}
  ],
  "golden_answers": [
    {"name": "calculate_area", "parameters": {"base": 10, "height": 5}}
  ],
  "eval_config": {"method": "exact_match"}
}
```

## Usage

### Load full benchmark

```python
from datasets import load_dataset

ds = load_dataset("Justin1233/RobustBench-TC", "all")
```

### Load a single benchmark

```python
from datasets import load_dataset

ds = load_dataset("Justin1233/RobustBench-TC", "bfcl_v3")
```

### Load eval subset

```python
from datasets import load_dataset

ds = load_dataset("Justin1233/RobustBench-TC", "eval")
```

### Load training data

```python
from datasets import load_dataset

ds = load_dataset("Justin1233/RobustBench-TC", "train_toolrl")
perturbed = ds["perturbed"]  # 4000 perturbed samples for GRPO
```

### Run evaluation CLI

```bash
# Full eval on GPT-4o (~4,345 API calls)
python scripts/run_eval.py --model gpt-4o --api-key "$OPENAI_API_KEY"

# Dry run to check cost
python scripts/run_eval.py --model gpt-4o --dry-run

# Specific perturbations only
python scripts/run_eval.py --model gpt-4o --perturbations clean CD TD realistic_typos
```

## Directory Structure

```
.
├── unified_benchmark/                 # Full benchmark (69,921 samples)
│   ├── bfcl_v3/
│   │   ├── baseline/                  (17 JSONL files)
│   │   ├── observation/               (4 perturbation types)
│   │   ├── action/                    (6 perturbation types)
│   │   ├── reward/                    (6 perturbation types)
│   │   └── transition/                (runtime marker)
│   ├── acebench/                      (same structure)
│   ├── apibank/                       (same structure)
│   ├── toolalpaca/                    (same structure)
│   ├── rotbench/                      (same structure)
│   ├── tooleyes/                      (same structure)
│   └── manifest.json
│
├── datasets/
│   ├── api_eval/                      # Eval subset (3,145 samples)
│   │   ├── clean.jsonl                (200 samples)
│   │   ├── realistic_typos.jsonl ... TD_AB.jsonl
│   │   └── (16 perturbation files)
│   │
│   ├── train_toolrl/                  # Training data (ToolRL-sourced)
│   │   ├── clean.jsonl                (4,000, baseline reproduction)
│   │   ├── perturbed.jsonl            (4,000, our method)
│   │   ├── mixed.jsonl                (4,000, ablation)
│   │   └── metadata.json
│   │
│   └── metadata.json
│
├── robustbench_tc.py                  # HuggingFace loader
├── scripts/
│   ├── run_eval.py                    # Evaluation CLI
│   ├── build_eval_and_train_datasets.py
│   └── build_train_from_toolrl.py
└── dataset_card.yaml
```

## Citation

```bibtex
@inproceedings{robustbench-tc2026,
  title={RobustBench-TC: A Unified Perturbation Benchmark for Tool-Calling Agents},
  year={2026},
}
```

## License

Apache 2.0
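
As an illustration of the `exact_match` method named in `eval_config`, the sketch below compares predicted tool calls with `golden_answers`. It is an approximation under stated assumptions, not the logic of `scripts/run_eval.py`, which the card does not show. The helper names are hypothetical, `parameters` is accepted either as a JSON string (as the schema declares) or as an object (as the example record shows), and ignoring call order is an assumption.

```python
import json
from typing import Any, Dict, List, Tuple

ToolCall = Dict[str, Any]


def normalize_call(call: ToolCall) -> Tuple[str, str]:
    """Return (tool name, canonical JSON of parameters) for one tool call."""
    params = call.get("parameters", {})
    if isinstance(params, str):
        # The schema stores parameters as a string; decode it when present.
        params = json.loads(params) if params.strip() else {}
    return call["name"], json.dumps(params, sort_keys=True)


def exact_match(predicted: List[ToolCall], golden: List[ToolCall]) -> bool:
    """True when both lists contain the same calls (order ignored: assumption)."""
    return sorted(map(normalize_call, predicted)) == sorted(map(normalize_call, golden))


golden = [{"name": "calculate_area", "parameters": {"base": 10, "height": 5}}]
good = [{"name": "calculate_area", "parameters": '{"height": 5, "base": 10}'}]
bad = [{"name": "calculate_area", "parameters": {"base": 10, "height": 6}}]

print(exact_match(good, golden))  # True
print(exact_match(bad, golden))   # False
```

Canonicalizing parameters with `sort_keys=True` keeps key order from affecting the comparison; it does not reconcile type differences such as `10` versus `10.0`.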

Provider:

Justin1233

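Collected summary

Dataset introduction

Construction

RobustBench-TC was built with a systematic method. It consolidates six source benchmarks (BFCL V3, API-Bank, ToolAlpaca, RoTBench, ToolEyes, and ACEBench) into 69,921 samples. Its core contribution is to formalize tool calling as a Markov Decision Process and to derive from it 22 perturbation operators covering four categories: observation, action, transition, and reward. Annotations are both expert-generated and machine-generated. The data is split by purpose into the full unified benchmark, a lightweight evaluation subset, and a training subset for reinforcement learning, so that it covers both broad benchmarking and targeted training.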
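
The taxonomy described above can be written down as a small lookup table. This is a minimal sketch: the perturbation names come from the dataset card's taxonomy table, while the dict layout, the `same_name_A` to `same_name_E` spellings, and the helper function are illustrative assumptions rather than identifiers confirmed by the dataset.

```python
from typing import Dict, List

# Perturbation types per MDP category, as listed in the dataset card.
# The "same_name_X" spellings are an assumed encoding of "same_name (A-E)".
TAXONOMY: Dict[str, List[str]] = {
    "observation": [
        "realistic_typos",
        "query_paraphrase",
        "paraphrase_tool_description",
        "paraphrase_parameter_description",
    ],
    "action": [f"same_name_{variant}" for variant in "ABCDE"] + ["redundant"],
    "transition": [
        "timeout",
        "rate_limit",
        "auth_error",
        "server_error",
        "malformed_response",
        "schema_drift",
    ],
    "reward": ["CD", "TD", "CD_NT", "TD_NT", "CD_AB", "TD_AB"],
}

# Transition perturbations are injected at run time rather than stored as files.
RUNTIME_CATEGORIES = {"transition"}


def mdp_category(perturbation_type: str) -> str:
    """Look up which MDP category a perturbation type belongs to."""
    for category, types in TAXONOMY.items():
        if perturbation_type in types:
            return category
    raise KeyError(f"unknown perturbation type: {perturbation_type}")


total = sum(len(types) for types in TAXONOMY.values())
static = sum(
    len(types) for cat, types in TAXONOMY.items() if cat not in RUNTIME_CATEGORIES
)
print(total, static)  # 22 16
```

The two counts match the card: 22 operators overall, of which 16 are static and 6 are applied through runtime injection.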

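Usage

The dataset can be used in several ways. The Hugging Face `datasets` library loads either the whole benchmark or a single source benchmark. The bundled evaluation CLI runs automated tests against models such as GPT-4o, with a dry-run mode for cost estimation and an option for selecting specific perturbation types. For training, the dedicated subset provides clean, perturbed, and mixed versions, intended for baseline reproduction, the proposed method, and ablation respectively. This modular layout supports both large-scale benchmarking and concrete model training and algorithm work.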
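
To make the record layout concrete, here is a minimal sketch that parses JSONL lines with the standard library and filters them by perturbation metadata. The inline records are invented for illustration; the field names follow the data format shown in the dataset card, and treating a missing `perturbation` field as "no perturbation" is an assumption.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, List, Optional

Sample = Dict[str, Any]


def parse_jsonl(lines: Iterable[str]) -> Iterator[Sample]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select(
    samples: Iterable[Sample],
    mdp_category: Optional[str] = None,
    perturbation_type: Optional[str] = None,
) -> List[Sample]:
    """Keep samples whose perturbation metadata matches the given filters."""
    selected = []
    for sample in samples:
        meta = sample.get("perturbation") or {}  # assumption: may be absent
        if mdp_category is not None and meta.get("mdp_category") != mdp_category:
            continue
        if perturbation_type is not None and meta.get("type") != perturbation_type:
            continue
        selected.append(sample)
    return selected


# Invented records that mimic the documented fields.
SAMPLE_LINES = [
    '{"id": "a", "benchmark": "bfcl_v3", "perturbation": {"type": "realistic_typos", "mdp_category": "observation"}}',
    '{"id": "b", "benchmark": "apibank", "perturbation": {"type": "CD", "mdp_category": "reward"}}',
    '{"id": "c", "benchmark": "bfcl_v3", "perturbation": {"type": "TD", "mdp_category": "reward"}}',
]

samples = list(parse_jsonl(SAMPLE_LINES))
reward_only = select(samples, mdp_category="reward")
per_benchmark = Counter(s["benchmark"] for s in reward_only)
print(len(reward_only), dict(per_benchmark))
```

The same `select` helper works on records obtained any other way, for example rows iterated from a loaded split, as long as they expose the same keys.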

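Background and challenges

Background overview

RobustBench-TC was released in 2026 by Justin1233 and collaborators, against a background of rapid progress in the tool-calling abilities of AI agents, to evaluate the robustness of such agents systematically. It formalizes tool calling as a Markov Decision Process, integrates 69,921 samples from six established benchmarks including BFCL V3 and API-Bank, and applies 22 perturbation operators covering the four MDP dimensions of observation, action, transition, and reward. The dataset card describes it as the first systematic robustness benchmark for tool-calling agents. It offers a standardized framework for assessing agent reliability under realistic conditions and supports quantitative research on tool learning.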

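Current challenges

The dataset targets the core challenge of evaluating the robustness of tool-calling agents in dynamic environments, at two levels. At the problem level, agents must cope with several kinds of disturbance: semantic perturbations of tool descriptions, abnormal API responses, and misleading reward signals. Accuracy falls by 33.73% on average under transition perturbations, which exposes how fragile current models are to runtime failures. At the construction level, the team had to unify data formats across benchmarks, design the perturbation operators systematically, and inject dynamic perturbations such as timeouts and rate limits at run time. A standardized data schema and a runtime injection mechanism make results comparable across benchmarks.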
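
Runtime injection of transition perturbations can be pictured as a wrapper around whatever function executes a tool call. The sketch below is hypothetical: the six fault names come from the dataset card, but the wrapper, the fault payloads, and the key-renaming rule for `schema_drift` are assumptions made for illustration and are not taken from the benchmark's code.

```python
import json
from typing import Any, Callable, Dict

ToolFn = Callable[[str, Dict[str, Any]], str]

# Assumed payloads returned in place of the real tool response.
FAULT_RESPONSES = {
    "timeout": '{"error": "timeout", "message": "The request timed out."}',
    "rate_limit": '{"error": "rate_limit", "status": 429}',
    "auth_error": '{"error": "auth_error", "status": 401}',
    "server_error": '{"error": "server_error", "status": 500}',
    "malformed_response": '{"result": ',  # deliberately truncated JSON
}
KNOWN_FAULTS = set(FAULT_RESPONSES) | {"schema_drift"}


def inject_transition(execute_tool: ToolFn, fault: str, fail_on_call: int = 1) -> ToolFn:
    """Wrap a tool executor so that one chosen call returns a faulty response."""
    if fault not in KNOWN_FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    calls = 0

    def wrapped(name: str, arguments: Dict[str, Any]) -> str:
        nonlocal calls
        calls += 1
        if calls != fail_on_call:
            return execute_tool(name, arguments)
        if fault == "schema_drift":
            # Assumed drift: same values, renamed top-level keys.
            payload = json.loads(execute_tool(name, arguments))
            return json.dumps({f"{key}_v2": value for key, value in payload.items()})
        return FAULT_RESPONSES[fault]

    return wrapped


def fake_tool(name: str, arguments: Dict[str, Any]) -> str:
    """Stand-in tool backend used only for this example."""
    return json.dumps({"area": arguments["base"] * arguments["height"] / 2})


flaky = inject_transition(fake_tool, "timeout")
args = {"base": 10, "height": 5}
print(flaky("calculate_area", args))  # faulty first response
print(flaky("calculate_area", args))  # normal response on retry
```

Failing only one call lets an evaluation check whether the agent retries or recovers; which call is perturbed in the real benchmark is not stated in the card.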

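Common scenarios

Academic problems addressed

The dataset addresses the lack of a systematic robustness evaluation standard in research on tool-calling agents. By formalizing tool calling as a Markov Decision Process and organizing perturbations into a taxonomy across observation, action, transition, and reward, it gives a reproducible basis for measuring how models degrade under non-ideal conditions. It shows that perturbation types affect models very differently: transition perturbations cause an average accuracy drop of 33.73%, against 1.18% for action perturbations, which indicates where robustness work is most needed.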
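
A per-category summary like the one quoted above can be computed from clean and perturbed accuracies. The sketch below assumes that "accuracy drop" means the absolute difference in percentage points between clean and perturbed accuracy; the dataset card does not define the metric, so this is an assumption, and the numbers in the example are made up.

```python
from statistics import mean
from typing import Dict, List, Tuple


def accuracy(outcomes: List[bool]) -> float:
    """Percentage of correct outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def category_summary(
    clean_acc: float, perturbed_acc: Dict[str, float]
) -> Tuple[float, str, float]:
    """Return (average drop, most severe perturbation, its drop).

    Drops are absolute percentage points: clean minus perturbed (assumption).
    """
    drops = {ptype: clean_acc - acc for ptype, acc in perturbed_acc.items()}
    worst = max(drops, key=drops.get)
    return mean(drops.values()), worst, drops[worst]


# Made-up accuracies, for illustration only.
clean = accuracy([True, True, True, True, False])  # 80.0
avg_drop, worst_type, worst_drop = category_summary(clean, {"CD": 60.0, "TD": 70.0})
print(avg_drop, worst_type, worst_drop)  # 15.0 CD 20.0
```

If the benchmark instead reports relative drops, the subtraction would be divided by the clean accuracy; the rest of the summary logic stays the same.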

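Practical applications

In deployment, RobustBench-TC serves as a validation tool for building reliable tool-calling systems. Teams can train on its perturbed training subset to make models more resilient to noisy user input, failing API services, and ambiguous tool descriptions. In high-stakes domains such as financial trading or medical diagnosis, the benchmark can help assess how an agent tolerates network latency, authentication errors, or server failures, which matters for the stability and safety of complex workflows.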

Recent research on the dataset