Daring-Anteater

Name: Daring-Anteater
Creator: maas
Published: 2025-12-04 16:21:16
License: 暂无描述

魔搭社区2025-12-04 更新2025-01-25 收录

下载链接：

https://modelscope.cn/datasets/nv-community/Daring-Anteater

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card Daring-Anteater is a comprehensive dataset for instruction tuning, covering a wide range of tasks and scenarios. The majority of the dataset is synthetically generated using NVIDIA proprietary models and [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), while the remaining samples are sourced from [FinQA](https://finqasite.github.io/), [wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions), and commercially-friendly subsets of [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). This dataset is used in [HelpSteer2 paper](https://arxiv.org/abs/2406.08673), resulting in a solid SFT model for further preference tuning. We open-source this dataset to promote reproducibility. ## Dataset The dataset consists of four columns: 1. conversations: user and assistant turns in a conversational format 2. mask: the turns that losses are not calculated on ("User" by default) 3. system: system prompt 4. dataset: dataset source Details of the data blend are as follows: | Data Source | Number of samples | License | |:-----------------------------|:----------------|:-----| | synthetic_conv | 82450 | CC-BY-4.0 | | synthetic_roleplay | 2996 | CC-BY-4.0 | | synthetic_math | 3000 | CC-BY-4.0 | | synthetic_precise_instruction_following | 1500 | CC-BY-4.0 | | synthetic_json_format_following | 1499 | CC-BY-4.0 | | synthetic_complex_instruction | 1500 | CC-BY-4.0 | | open_platypus_commercial | 6000 | CC-BY-4.0/Apache-2.0/MIT | | FinQA | 300 | CC-BY-4.0 | | wikitablequestions | 287 | CC-BY-4.0 | ## License We open-source our synthetic subsets under the CC-BY-4.0 license. All other subsets are also under permissive licenses, making the dataset usable for commercial purposes as long as you follow the terms of the licenses. ## Contact E-Mail: [Jiaqi Zeng](mailto:jiaqiz@nvidia.com) ## Citation If you find this dataset useful, please cite the following works ```bibtex @misc{wang2024helpsteer2, title={HelpSteer2: Open-source dataset for training top-performing reward models}, author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev}, year={2024}, eprint={2406.08673}, archivePrefix={arXiv}, primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'} } ```

# 数据集卡片 Daring-Anteater 是一款用于指令微调（Instruction Tuning）的综合数据集，覆盖广泛的任务与场景。该数据集的绝大多数样本由NVIDIA专有模型与 [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) 合成生成，剩余样本则源自 [FinQA](https://finqasite.github.io/)、[wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions) 以及 [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) 的商用友好子集。本数据集已在 [HelpSteer2 论文](https://arxiv.org/abs/2406.08673) 中得到使用，可用于训练可靠的监督微调（Supervised Fine-Tuning, SFT）模型以开展进一步的偏好微调。我们开源该数据集旨在推动研究的可复现性。 ## 数据集该数据集包含四列数据： 1. **conversations（对话）**：对话格式下的用户与助手交互轮次 2. **mask（掩码）**：无需计算损失的交互轮次（默认设置为"User"） 3. **system（系统提示）**：系统提示词 4. **dataset（数据集来源）**：样本所属的数据集来源数据混合方案详情如下： | 数据来源 | 样本数量 | 授权协议 | |:-----------------------------|:----------------|:-----| | synthetic_conv（合成对话） | 82450 | CC-BY-4.0 | | synthetic_roleplay（合成角色扮演） | 2996 | CC-BY-4.0 | | synthetic_math（合成数学任务） | 3000 | CC-BY-4.0 | | synthetic_precise_instruction_following（合成精准指令遵循） | 1500 | CC-BY-4.0 | | synthetic_json_format_following（合成JSON格式遵循） | 1499 | CC-BY-4.0 | | synthetic_complex_instruction（合成复杂指令） | 1500 | CC-BY-4.0 | | open_platypus_commercial（商用友好型Open-Platypus子集） | 6000 | CC-BY-4.0/Apache-2.0/MIT | | FinQA | 300 | CC-BY-4.0 | | wikitablequestions | 287 | CC-BY-4.0 | ## 授权协议我们将所有合成子集以 CC-BY-4.0 协议开源。其余子集均采用宽松授权协议，因此只要遵循各协议的条款，本数据集即可用于商业用途。 ## 联系方式电子邮箱：[Jiaqi Zeng](mailto:jiaqiz@nvidia.com) ## 引用若您发现本数据集对研究有所帮助，请引用以下文献： bibtex @misc{wang2024helpsteer2, title={HelpSteer2: Open-source dataset for training top-performing reward models}, author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev}, year={2024}, eprint={2406.08673}, archivePrefix={arXiv}, primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'} }

提供机构：

maas

创建时间：

2025-01-20

5,000+

优质数据集

54 个

任务类型

进入经典数据集