ToolRM-train-data
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/ibm-research/ToolRM-train-data
下载链接
链接失效反馈官方服务:
资源简介:
<h1 align="center">ToolRM Training Dataset</h1>
<div align="center">
<a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="Static Badge" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a>
<a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="Static Badge" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a>
</div>
## 📖 Dataset Description
This is a version of the training data utilized for ToolRM, a collection of outcome reward models specifically designed for evaluating and improving function-calling capabilities in large language models. It consists of ~459K examples, where each example includes a user-assistant conversation, available tool specifications, and a pair of correct and incorrect tool calls. The incorrect calls were generated by prompting 9 open-source language models on queries from three public datasets. Reward Models trained on this dataset were found to result in an average improvement of up to 25% in downstream task performance, enhance robustness to input noise, and enable data-efficient fine-tuning through reward-guided filtering.
## 📊 Dataset Statistics
- **Total Training Samples**: 458,575
- **Composition**:
- Single-turn interactions: 256,851 samples
- Multi-turn interactions: 159,757 samples
- Irrelevance cases: 41,967 samples
- **Source Datasets**: [APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue), [xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k)
- **Generator Models**: 9 permissively-licensed open-weight models
## 🗂️ Dataset Schema
The dataset contains the following fields:
| Field | Type | Description |
|-------|------|-------------|
| `uuid` | str | Unique identifier for each training sample |
| `dataset_name` | str | Source dataset from which the sample was derived |
| `conversation` | list | Conversation between user and assistant |
| `tools` | str | Catalog of available function specifications |
| `tool_calls_correct` | str | Ground-truth correct tool invocations for the given conversation |
| `tool_calls_incorrect` | str | Incorrect tool invocations generated by the model pool |
| `generator_model` | str | Identifier of the model that produced the incorrect tool call |
*Note: `tools`, `tool_calls_correct`, and `tool_calls_incorrect` fields have been serialized. While loading the dataset, convert them into JSON objects using `json.loads`*
## ⚙️ Data Generation Methodology
### Generator Model Pool
The incorrect tool calls were generated using the following models:
- **Granite Series**: [granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct), [granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct), [granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling)
- **SmolLM**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- **Mistral Series**: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3), [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
- **GPT-OSS Series**: [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)
### Data Collection Process
1. **Source Datasets**: We start with publicly available function-calling datasets that cover a wide range of interaction patterns
2. **Obfuscation**: Function and parameter names were replaced with random strings, and schema keys were reordered to prevent models from regurgitating the training data
3. **Generation**: Each sample is processed through the model pool to generate function calls
4. **Verification**: The generated outputs are compared against ground-truth annotations to identify incorrect calls
5. **Filtering**: We keep only the incorrect generations, selecting up to three incorrect samples per query to maintain diversity while avoiding over-representation
## 🎯 Benchmark
In a Best-of-N setting, we found that ToolRM significantly improves performance over Greedy decoding, Majority Voting, and Schema Validation baselines.
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 />
</div>
For reward-guided data filtering, we found that a model fine-tuned with 8K top-ranked samples by ToolRM outperforms the model fine-tuned with the entire training dataset of 16K samples.
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 />
</div>
More experiments and a detailed discussion of the results can be found in the paper.
## 📚 Citation
If you use this dataset in your research, please cite:
```
@misc{agarwal2025toolrmoutcomereward,
title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models},
author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi},
year={2025},
eprint={2509.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.11963},
}
```
<h1 align="center">ToolRM训练数据集</h1>
<div align="center">
<a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="静态徽章" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a>
<a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="静态徽章" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a>
</div>
## 📖 数据集描述
本数据集为ToolRM所用训练数据的一个版本,ToolRM是一系列专为评估与优化大语言模型(Large Language Model,LLM)函数调用能力而设计的结果奖励模型。该数据集包含约45.9万个样本,每个样本均包含用户-助手对话、可用工具规范,以及一组正确与错误的工具调用序列。其中错误的工具调用序列由9个开源语言模型基于三个公开数据集的查询生成。经该数据集训练的奖励模型可使下游任务性能平均提升最高达25%,增强对输入噪声的鲁棒性,并可通过奖励引导的筛选实现数据高效的微调。
## 📊 数据集统计
- **总训练样本数**:458575
- **样本构成**:
- 单轮交互样本:256851个
- 多轮交互样本:159757个
- 无关样本:41967个
- **源数据集**:[APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)、[Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)、[xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k)
- **生成模型**:9个采用宽松许可协议的开源权重模型
## 🗂️ 数据集Schema
该数据集包含以下字段:
| 字段名 | 数据类型 | 描述 |
|-------|------|-------------|
| `uuid` | 字符串 | 每个训练样本的唯一标识符 |
| `dataset_name` | 字符串 | 该样本所属的源数据集名称 |
| `conversation` | 列表 | 用户与助手之间的对话内容 |
| `tools` | 字符串 | 可用函数规范的目录 |
| `tool_calls_correct` | 字符串 | 对应对话的真实正确工具调用序列 |
| `tool_calls_incorrect` | 字符串 | 由模型池生成的错误工具调用序列 |
| `generator_model` | 字符串 | 生成错误工具调用的模型标识符 |
*注:`tools`、`tool_calls_correct` 与 `tool_calls_incorrect` 字段已序列化。加载数据集时,请使用 `json.loads` 将其转换为JSON对象*
## ⚙️ 数据生成方法
### 生成模型池
错误的工具调用序列由以下模型生成:
- **Granite系列**:[granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct)、[granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct)、[granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling)
- **SmolLM系列**:[SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)、[SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- **Mistral系列**:[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)、[Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
- **GPT-OSS系列**:[gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)、[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)
### 数据收集流程
1. **源数据集选取**:我们从覆盖多种交互模式的公开函数调用数据集起步
2. **混淆处理**:将函数与参数名称替换为随机字符串,并重新排列Schema键的顺序,以防止模型直接复现训练数据
3. **调用生成**:将每个样本送入模型池以生成函数调用序列
4. **结果验证**:将生成的输出与真实标注进行比对,以识别错误的调用序列
5. **样本筛选**:仅保留错误的生成结果,每个查询最多选取3个错误样本以保持多样性,同时避免样本分布失衡
## 🎯 基准测试
在Best-of-N设置下,我们发现ToolRM相较于贪心解码、多数投票与Schema验证等基线模型,性能提升显著。
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 />
</div>
在奖励引导的数据筛选场景中,我们发现使用ToolRM排序得到的8000个高排名样本微调的模型,性能优于使用全部16000个训练样本微调的模型。
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 />
</div>
更多实验与结果的详细讨论可参阅相关论文。
## 📚 引用
如果您在研究中使用该数据集,请引用以下文献:
@misc{agarwal2025toolrmoutcomereward,
title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models},
author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi},
year={2025},
eprint={2509.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.11963},
}
提供机构:
maas
创建时间:
2025-11-01



