When2Call

Name: When2Call
Creator: maas
Published: 2025-12-04 16:32:29
License: 暂无描述

魔搭社区2025-12-04 更新2025-05-03 收录

下载链接：

https://modelscope.cn/datasets/nv-community/When2Call

下载链接

链接失效反馈

官方服务：

资源简介：

# When2Call 💾 <a href="https://github.com/NVIDIA/When2Call">Github</a>&nbsp&nbsp | &nbsp&nbsp 📄 <a href="https://aclanthology.org/2025.naacl-long.174/">Paper</a> ## Dataset Description: When2Call is a benchmark designed to evaluate tool-calling decision-making for large language models (LLMs), including when to generate a tool call, when to ask follow-up questions, when to admit the question can't be answered with the tools provided, and what to do if the question seems to require tool use but a tool call can't be made. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. The dataset offers a training set for When2Call and leverages the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerable improvement over traditional fine-tuning for tool calling. This dataset is ready for commercial use. Evaluation code and the synthetic data generation scripts used to generate the datasets can be found in the [GitHub repo](https://github.com/NVIDIA/When2Call). ## Load Test/Train Data: ### Test data The test set has two files: Multi-Choice Question evaluation (`mcq`) and LLM-as-a-judge (`llm_judge`), which is a subset of the MCQ evaluation set so you can download the two datasets as a single `DatasetDict`. ```python >>> from datasets import load_dataset >>> test_ds = load_dataset("nvidia/When2Call", "test") >>> test_ds DatasetDict({ llm_judge: Dataset({ features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers', 'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'], num_rows: 300 }) mcq: Dataset({ features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers', 'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'], num_rows: 3652 }) }) ``` ### Train data When2Call has two training datasets available: one for Supervised Fine-Tuning (`train_sft`) and one for Preference Tuning such as DPO (`train_pref`). As `train_pref` has additional fields `chosen_response` and `rejected_response`, these datasets need to be loaded separately. #### SFT ```python >>> train_sft_ds = load_dataset("nvidia/When2Call", "train_sft") >>> train_sft_ds DatasetDict({ train: Dataset({ features: ['tools', 'messages'], num_rows: 15000 }) ``` #### Preference Tuning ```python >>> train_pref_ds = load_dataset("nvidia/When2Call", "train_pref") >>> train_pref_ds DatasetDict({ train: Dataset({ features: ['tools', 'messages', 'chosen_response', 'rejected_response'], num_rows: 9000 }) }) ``` ## Dataset Owner: NVIDIA Corporation ## Dataset Creation Date: September 2024 ## License/Terms of Use: This dataset is licensed under a Creative Commons Attribution 4.0 International License available at https://creativecommons.org/licenses/by/4.0/legalcode. ## Intended Usage: Evaluate and train LLMs’ tool-calling capabilities. ## Dataset Characterization ** Data Collection Method * Synthetic ** Labeling Method * Automated ## Dataset Format Text (.jsonl) ## Dataset Quantification The training dataset contains examples consisting of tool specification(s), user input, multiple choices, and the expected response. The preference dataset includes chosen and rejected responses instead of expected responses. The test dataset follows the same format. - Training set - SFT dataset: 15,000 examples - Preference dataset: 9,000 examples - Test set - MCQ dataset: 3,652 examples - LLM-as-a-judge dataset: 300 examples Measurement of total data storage: 56MB ## References: - [Hayley Ross, Ameya Sunil Mahabaleshwarka, Yoshi Suhara, “When2Call: When (not) to Call Tools”, NAACL 2025.](https://aclanthology.org/2025.naacl-long.174/) ## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

# When2Call 💾 <a href="https://github.com/NVIDIA/When2Call">GitHub</a>&nbsp&nbsp | &nbsp&nbsp 📄 <a href="https://aclanthology.org/2025.naacl-long.174/">论文</a> ## 数据集描述 When2Call是一款专为评估大语言模型（Large Language Models, LLMs）工具调用决策能力而设计的基准数据集，涵盖生成工具调用请求、发起追问、表明无法通过现有工具回答问题，以及当问题看似需要工具调用却无法实际发起调用时的应对策略等场景。我们发现，当前最先进的工具调用型大语言模型在When2Call基准数据集上仍存在显著的优化空间，这也凸显了该基准的重要价值。本数据集提供了When2Call的训练集，并利用基准的多项选择特性构建了偏好优化训练框架，该框架在工具调用任务上的表现远超传统微调方法。本数据集可用于商业用途。数据集的评估代码与合成数据生成脚本可在[GitHub仓库](https://github.com/NVIDIA/When2Call)中获取。 ## 加载测试与训练数据 ### 测试数据测试集包含两个文件：多项选择问答评估集（`mcq`）与LLM作为裁判集（`llm_judge`），其中后者是MCQ评估集的子集，因此你可以将两个数据集作为单个`DatasetDict`下载。 python >>> from datasets import load_dataset >>> test_ds = load_dataset("nvidia/When2Call", "test") >>> test_ds DatasetDict({ llm_judge: Dataset({ features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers', 'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'], num_rows: 300 }) mcq: Dataset({ features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers', 'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'], num_rows: 3652 }) }) ### 训练数据 When2Call提供两类训练数据集：一类用于监督微调（Supervised Fine-Tuning，`train_sft`），另一类用于偏好微调（如DPO，`train_pref`）。由于`train_pref`包含额外的`chosen_response`与`rejected_response`字段，因此需要分别加载这两类数据集。 #### 监督微调 python >>> train_sft_ds = load_dataset("nvidia/When2Call", "train_sft") >>> train_sft_ds DatasetDict({ train: Dataset({ features: ['tools', 'messages'], num_rows: 15000 }) #### 偏好微调 python >>> train_pref_ds = load_dataset("nvidia/When2Call", "train_pref") >>> train_pref_ds DatasetDict({ train: Dataset({ features: ['tools', 'messages', 'chosen_response', 'rejected_response'], num_rows: 9000 }) }) ### 数据集所有者 NVIDIA公司（NVIDIA Corporation） ### 数据集创建日期 2024年9月 ### 使用许可条款本数据集采用知识共享署名4.0国际许可协议（Creative Commons Attribution 4.0 International License）授权，详情参见https://creativecommons.org/licenses/by/4.0/legalcode。 ### 预期用途用于评估与训练大语言模型的工具调用能力。 ### 数据集特征 ** 数据收集方式 * 合成数据 ** 标注方式 * 自动标注 ### 数据集格式文本格式（.jsonl） ### 数据集量化信息训练数据集的样本包含工具说明、用户输入、多项选择项与预期回复。偏好微调数据集则以被选中的回复与被拒绝的回复替代预期回复，测试数据集采用相同格式。 - 训练集 - 监督微调数据集：15,000条样本 - 偏好微调数据集：9,000条样本 - 测试集 - MCQ数据集：3,652条样本 - LLM作为裁判数据集：300条样本总数据存储量：56MB ### 参考文献 - [Hayley Ross, Ameya Sunil Mahabaleshwarka, Yoshi Suhara, “When2Call: When (not) to Call Tools”, NAACL 2025.](https://aclanthology.org/2025.naacl-long.174/) ### 伦理考量 NVIDIA认为，可信人工智能是一项共同责任，我们已建立相关政策与实践规范，以支持各类人工智能应用的开发。开发者在按照本服务条款下载或使用本数据集时，应与内部模型团队协作，确保模型符合相关行业与应用场景的要求，并防范可能出现的产品滥用问题。如需报告安全漏洞或NVIDIA人工智能相关问题，请访问[此处](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交。

提供机构：

maas

创建时间：

2025-04-29

5,000+

优质数据集

54 个

任务类型

进入经典数据集