five

ToolRM-train-data

收藏
魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/ibm-research/ToolRM-train-data
下载链接
链接失效反馈
官方服务:
资源简介:
<h1 align="center">ToolRM Training Dataset</h1> <div align="center"> <a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="Static Badge" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a> <a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="Static Badge" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a> </div> ## 📖 Dataset Description This is a version of the training data utilized for ToolRM, a collection of outcome reward models specifically designed for evaluating and improving function-calling capabilities in large language models. It consists of ~459K examples, where each example includes a user-assistant conversation, available tool specifications, and a pair of correct and incorrect tool calls. The incorrect calls were generated by prompting 9 open-source language models on queries from three public datasets. Reward Models trained on this dataset were found to result in an average improvement of up to 25% in downstream task performance, enhance robustness to input noise, and enable data-efficient fine-tuning through reward-guided filtering. ## 📊 Dataset Statistics - **Total Training Samples**: 458,575 - **Composition**: - Single-turn interactions: 256,851 samples - Multi-turn interactions: 159,757 samples - Irrelevance cases: 41,967 samples - **Source Datasets**: [APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue), [xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) - **Generator Models**: 9 permissively-licensed open-weight models ## 🗂️ Dataset Schema The dataset contains the following fields: | Field | Type | Description | |-------|------|-------------| | `uuid` | str | Unique identifier for each training sample | | `dataset_name` | str | Source dataset from which the sample was derived | | `conversation` | list | Conversation between user and assistant | | `tools` | str | Catalog of available function specifications | | `tool_calls_correct` | str | Ground-truth correct tool invocations for the given conversation | | `tool_calls_incorrect` | str | Incorrect tool invocations generated by the model pool | | `generator_model` | str | Identifier of the model that produced the incorrect tool call | *Note: `tools`, `tool_calls_correct`, and `tool_calls_incorrect` fields have been serialized. While loading the dataset, convert them into JSON objects using `json.loads`* ## ⚙️ Data Generation Methodology ### Generator Model Pool The incorrect tool calls were generated using the following models: - **Granite Series**: [granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct), [granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct), [granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling) - **SmolLM**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) - **Mistral Series**: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3), [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) - **GPT-OSS Series**: [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) ### Data Collection Process 1. **Source Datasets**: We start with publicly available function-calling datasets that cover a wide range of interaction patterns 2. **Obfuscation**: Function and parameter names were replaced with random strings, and schema keys were reordered to prevent models from regurgitating the training data 3. **Generation**: Each sample is processed through the model pool to generate function calls 4. **Verification**: The generated outputs are compared against ground-truth annotations to identify incorrect calls 5. **Filtering**: We keep only the incorrect generations, selecting up to three incorrect samples per query to maintain diversity while avoiding over-representation ## 🎯 Benchmark In a Best-of-N setting, we found that ToolRM significantly improves performance over Greedy decoding, Majority Voting, and Schema Validation baselines. <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 /> </div> For reward-guided data filtering, we found that a model fine-tuned with 8K top-ranked samples by ToolRM outperforms the model fine-tuned with the entire training dataset of 16K samples. <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 /> </div> More experiments and a detailed discussion of the results can be found in the paper. ## 📚 Citation If you use this dataset in your research, please cite: ``` @misc{agarwal2025toolrmoutcomereward, title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models}, author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi}, year={2025}, eprint={2509.11963}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2509.11963}, } ```

<h1 align="center">ToolRM训练数据集</h1> <div align="center"> <a width="150" style="display: inline-block" href="https://arxiv.org/abs/2509.11963"><img alt="静态徽章" src="https://img.shields.io/badge/arxiv-2509.11963-red?logo=arxiv"></a> <a width="150" style="display: inline-block" href="https://huggingface.co/datasets/ibm-research/fc-reward-bench"><img alt="静态徽章" src="https://img.shields.io/badge/HF-fc--reward--bench-green?logo=huggingface"></a> </div> ## 📖 数据集描述 本数据集为ToolRM所用训练数据的一个版本,ToolRM是一系列专为评估与优化大语言模型(Large Language Model,LLM)函数调用能力而设计的结果奖励模型。该数据集包含约45.9万个样本,每个样本均包含用户-助手对话、可用工具规范,以及一组正确与错误的工具调用序列。其中错误的工具调用序列由9个开源语言模型基于三个公开数据集的查询生成。经该数据集训练的奖励模型可使下游任务性能平均提升最高达25%,增强对输入噪声的鲁棒性,并可通过奖励引导的筛选实现数据高效的微调。 ## 📊 数据集统计 - **总训练样本数**:458575 - **样本构成**: - 单轮交互样本:256851个 - 多轮交互样本:159757个 - 无关样本:41967个 - **源数据集**:[APIGen](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)、[Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)、[xlam-irrelevance](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) - **生成模型**:9个采用宽松许可协议的开源权重模型 ## 🗂️ 数据集Schema 该数据集包含以下字段: | 字段名 | 数据类型 | 描述 | |-------|------|-------------| | `uuid` | 字符串 | 每个训练样本的唯一标识符 | | `dataset_name` | 字符串 | 该样本所属的源数据集名称 | | `conversation` | 列表 | 用户与助手之间的对话内容 | | `tools` | 字符串 | 可用函数规范的目录 | | `tool_calls_correct` | 字符串 | 对应对话的真实正确工具调用序列 | | `tool_calls_incorrect` | 字符串 | 由模型池生成的错误工具调用序列 | | `generator_model` | 字符串 | 生成错误工具调用的模型标识符 | *注:`tools`、`tool_calls_correct` 与 `tool_calls_incorrect` 字段已序列化。加载数据集时,请使用 `json.loads` 将其转换为JSON对象* ## ⚙️ 数据生成方法 ### 生成模型池 错误的工具调用序列由以下模型生成: - **Granite系列**:[granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct)、[granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct)、[granite-20b-functioncalling](https://huggingface.co/ibm-granite/granite-20b-functioncalling) - **SmolLM系列**:[SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)、[SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) - **Mistral系列**:[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)、[Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) - **GPT-OSS系列**:[gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)、[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) ### 数据收集流程 1. **源数据集选取**:我们从覆盖多种交互模式的公开函数调用数据集起步 2. **混淆处理**:将函数与参数名称替换为随机字符串,并重新排列Schema键的顺序,以防止模型直接复现训练数据 3. **调用生成**:将每个样本送入模型池以生成函数调用序列 4. **结果验证**:将生成的输出与真实标注进行比对,以识别错误的调用序列 5. **样本筛选**:仅保留错误的生成结果,每个查询最多选取3个错误样本以保持多样性,同时避免样本分布失衡 ## 🎯 基准测试 在Best-of-N设置下,我们发现ToolRM相较于贪心解码、多数投票与Schema验证等基线模型,性能提升显著。 <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/m-I-B9TSRKq-CtpuQWW5C.png" width=800 /> </div> 在奖励引导的数据筛选场景中,我们发现使用ToolRM排序得到的8000个高排名样本微调的模型,性能优于使用全部16000个训练样本微调的模型。 <div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6229237ed94a4a3d5efbacb5/Dq3_-yPlvOFxQTjf_Mi2a.png" width=800 /> </div> 更多实验与结果的详细讨论可参阅相关论文。 ## 📚 引用 如果您在研究中使用该数据集,请引用以下文献: @misc{agarwal2025toolrmoutcomereward, title={ToolRM: Outcome Reward Models for Tool-Calling Large Language Models}, author={Mayank Agarwal and Ibrahim Abdelaziz and Kinjal Basu and Merve Unuvar and Luis A. Lastras and Yara Rizk and Pavan Kapanipathi}, year={2025}, eprint={2509.11963}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2509.11963}, }
提供机构:
maas
创建时间:
2025-11-01
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作