ToolRM-train-data

Name: ToolRM-train-data
Creator: maas
Published: 2025-12-05 16:55:49
License: 暂无描述

魔搭社区2025-12-05 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/ibm-research/ToolRM-train-data

下载链接

链接失效反馈

官方服务：

资源简介：

<h1 align="center">ToolRM Training Dataset</h1> <div align="center"> <a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="Static Badge" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a> <a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="Static Badge" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a> </div> ## 📖 Dataset Description This is a version of the training data utilized for ToolRM, a collection of outcome reward models specifically designed for evaluating and improving function-calling capabilities in large language models. It consists of ~459K examples, where each example includes a user-assistant conversation, available tool specifications, and a pair of correct and incorrect tool calls. The incorrect calls were generated by prompting 9 open-source language models on queries from three public datasets. Reward Models trained on this dataset were found to result in an average improvement of up to 25% in downstream task performance, enhance robustness to input noise, and enable data-efficient fine-tuning through reward-guided filtering. ## 📊 Dataset Statistics - **Total Training Samples**: 458,575 - **Composition**: - Single-turn interactions: 256,851 samples - Multi-turn interactions: 159,757 samples - Irrelevance cases: 41,967 samples - **Source Datasets**: [APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue), [xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) - **Generator Models**: 9 permissively-licensed open-weight models ## 🗂️ Dataset Schema The dataset contains the following fields: | Field | Type | Description | |-------|------|-------------| | `uuid` | str | Unique identifier for each training sample | | `dataset_name` | str | Source dataset from which the sample was derived | | `conversation` | list | Conversation between user and assistant | | `tools` | str | Catalog of available function specifications | | `tool_calls_correct` | str | Ground-truth correct tool invocations for the given conversation | | `tool_calls_incorrect` | str | Incorrect tool invocations generated by the model pool | | `generator_model` | str | Identifier of the model that produced the incorrect tool call | *Note: `tools`, `tool_calls_correct`, and `tool_calls_incorrect` fields have been serialized. While loading the dataset, convert them into JSON objects using `json.loads`* ## ⚙️ Data Generation Methodology ### Generator Model Pool The incorrect tool calls were generated using the following models: - **Granite Series**: [granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct), [granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct), [granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling) - **SmolLM**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) - **Mistral Series**: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3), [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) - **GPT-OSS Series**: [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) ### Data Collection Process 1. **Source Datasets**: We start with publicly available function-calling datasets that cover a wide range of interaction patterns 2. **Obfuscation**: Function and parameter names were replaced with random strings, and schema keys were reordered to prevent models from regurgitating the training data 3. **Generation**: Each sample is processed through the model pool to generate function calls 4. **Verification**: The generated outputs are compared against ground-truth annotations to identify incorrect calls 5. **Filtering**: We keep only the incorrect generations, selecting up to three incorrect samples per query to maintain diversity while avoiding over-representation ## 🎯 Benchmark In a Best-of-N setting, we found that ToolRM significantly improves performance over Greedy decoding, Majority Voting, and Schema Validation baselines. <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 /> </div> For reward-guided data filtering, we found that a model fine-tuned with 8K top-ranked samples by ToolRM outperforms the model fine-tuned with the entire training dataset of 16K samples. <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 /> </div> More experiments and a detailed discussion of the results can be found in the paper. ## 📚 Citation If you use this dataset in your research, please cite: ``` @misc{agarwal2025toolrmoutcomereward, title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models}, author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi}, year={2025}, eprint={2509.11963}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2509.11963}, } ```

<h1 align="center">ToolRM训练数据集</h1> <div align="center"> <a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="静态徽章" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a> <a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="静态徽章" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a> </div> ## 📖 数据集描述本数据集为ToolRM所用训练数据的一个版本，ToolRM是一系列专为评估与优化大语言模型（Large Language Model，LLM）函数调用能力而设计的结果奖励模型。该数据集包含约45.9万个样本，每个样本均包含用户-助手对话、可用工具规范，以及一组正确与错误的工具调用序列。其中错误的工具调用序列由9个开源语言模型基于三个公开数据集的查询生成。经该数据集训练的奖励模型可使下游任务性能平均提升最高达25%，增强对输入噪声的鲁棒性，并可通过奖励引导的筛选实现数据高效的微调。 ## 📊 数据集统计 - **总训练样本数**：458575 - **样本构成**： - 单轮交互样本：256851个 - 多轮交互样本：159757个 - 无关样本：41967个 - **源数据集**：[APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)、[Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)、[xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) - **生成模型**：9个采用宽松许可协议的开源权重模型 ## 🗂️ 数据集Schema 该数据集包含以下字段： | 字段名 | 数据类型 | 描述 | |-------|------|-------------| | `uuid` | 字符串 | 每个训练样本的唯一标识符 | | `dataset_name` | 字符串 | 该样本所属的源数据集名称 | | `conversation` | 列表 | 用户与助手之间的对话内容 | | `tools` | 字符串 | 可用函数规范的目录 | | `tool_calls_correct` | 字符串 | 对应对话的真实正确工具调用序列 | | `tool_calls_incorrect` | 字符串 | 由模型池生成的错误工具调用序列 | | `generator_model` | 字符串 | 生成错误工具调用的模型标识符 | *注：`tools`、`tool_calls_correct` 与 `tool_calls_incorrect` 字段已序列化。加载数据集时，请使用 `json.loads` 将其转换为JSON对象* ## ⚙️ 数据生成方法 ### 生成模型池错误的工具调用序列由以下模型生成： - **Granite系列**：[granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct)、[granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct)、[granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling) - **SmolLM系列**：[SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)、[SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) - **Mistral系列**：[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)、[Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) - **GPT-OSS系列**：[gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)、[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) ### 数据收集流程 1. **源数据集选取**：我们从覆盖多种交互模式的公开函数调用数据集起步 2. **混淆处理**：将函数与参数名称替换为随机字符串，并重新排列Schema键的顺序，以防止模型直接复现训练数据 3. **调用生成**：将每个样本送入模型池以生成函数调用序列 4. **结果验证**：将生成的输出与真实标注进行比对，以识别错误的调用序列 5. **样本筛选**：仅保留错误的生成结果，每个查询最多选取3个错误样本以保持多样性，同时避免样本分布失衡 ## 🎯 基准测试在Best-of-N设置下，我们发现ToolRM相较于贪心解码、多数投票与Schema验证等基线模型，性能提升显著。 <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 /> </div> 在奖励引导的数据筛选场景中，我们发现使用ToolRM排序得到的8000个高排名样本微调的模型，性能优于使用全部16000个训练样本微调的模型。 <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 /> </div> 更多实验与结果的详细讨论可参阅相关论文。 ## 📚 引用如果您在研究中使用该数据集，请引用以下文献： @misc{agarwal2025toolrmoutcomereward, title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models}, author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi}, year={2025}, eprint={2509.11963}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2509.11963}, }

提供机构：

maas

创建时间：

2025-11-01

5,000+

优质数据集

54 个

任务类型

进入经典数据集