five

Daring-Anteater

收藏
魔搭社区2025-12-04 更新2025-01-25 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Daring-Anteater
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card Daring-Anteater is a comprehensive dataset for instruction tuning, covering a wide range of tasks and scenarios. The majority of the dataset is synthetically generated using NVIDIA proprietary models and [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), while the remaining samples are sourced from [FinQA](https://finqasite.github.io/), [wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions), and commercially-friendly subsets of [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). This dataset is used in [HelpSteer2 paper](https://arxiv.org/abs/2406.08673), resulting in a solid SFT model for further preference tuning. We open-source this dataset to promote reproducibility. ## Dataset The dataset consists of four columns: 1. conversations: user and assistant turns in a conversational format 2. mask: the turns that losses are not calculated on ("User" by default) 3. system: system prompt 4. dataset: dataset source Details of the data blend are as follows: | Data Source | Number of samples | License | |:-----------------------------|:----------------|:-----| | synthetic_conv | 82450 | CC-BY-4.0 | | synthetic_roleplay | 2996 | CC-BY-4.0 | | synthetic_math | 3000 | CC-BY-4.0 | | synthetic_precise_instruction_following | 1500 | CC-BY-4.0 | | synthetic_json_format_following | 1499 | CC-BY-4.0 | | synthetic_complex_instruction | 1500 | CC-BY-4.0 | | open_platypus_commercial | 6000 | CC-BY-4.0/Apache-2.0/MIT | | FinQA | 300 | CC-BY-4.0 | | wikitablequestions | 287 | CC-BY-4.0 | ## License We open-source our synthetic subsets under the CC-BY-4.0 license. All other subsets are also under permissive licenses, making the dataset usable for commercial purposes as long as you follow the terms of the licenses. ## Contact E-Mail: [Jiaqi Zeng](mailto:jiaqiz@nvidia.com) ## Citation If you find this dataset useful, please cite the following works ```bibtex @misc{wang2024helpsteer2, title={HelpSteer2: Open-source dataset for training top-performing reward models}, author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev}, year={2024}, eprint={2406.08673}, archivePrefix={arXiv}, primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'} } ```

# 数据集卡片 Daring-Anteater 是一款用于指令微调(Instruction Tuning)的综合数据集,覆盖广泛的任务与场景。该数据集的绝大多数样本由NVIDIA专有模型与 [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) 合成生成,剩余样本则源自 [FinQA](https://finqasite.github.io/)、[wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions) 以及 [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) 的商用友好子集。 本数据集已在 [HelpSteer2 论文](https://arxiv.org/abs/2406.08673) 中得到使用,可用于训练可靠的监督微调(Supervised Fine-Tuning, SFT)模型以开展进一步的偏好微调。我们开源该数据集旨在推动研究的可复现性。 ## 数据集 该数据集包含四列数据: 1. **conversations(对话)**:对话格式下的用户与助手交互轮次 2. **mask(掩码)**:无需计算损失的交互轮次(默认设置为"User") 3. **system(系统提示)**:系统提示词 4. **dataset(数据集来源)**:样本所属的数据集来源 数据混合方案详情如下: | 数据来源 | 样本数量 | 授权协议 | |:-----------------------------|:----------------|:-----| | synthetic_conv(合成对话) | 82450 | CC-BY-4.0 | | synthetic_roleplay(合成角色扮演) | 2996 | CC-BY-4.0 | | synthetic_math(合成数学任务) | 3000 | CC-BY-4.0 | | synthetic_precise_instruction_following(合成精准指令遵循) | 1500 | CC-BY-4.0 | | synthetic_json_format_following(合成JSON格式遵循) | 1499 | CC-BY-4.0 | | synthetic_complex_instruction(合成复杂指令) | 1500 | CC-BY-4.0 | | open_platypus_commercial(商用友好型Open-Platypus子集) | 6000 | CC-BY-4.0/Apache-2.0/MIT | | FinQA | 300 | CC-BY-4.0 | | wikitablequestions | 287 | CC-BY-4.0 | ## 授权协议 我们将所有合成子集以 CC-BY-4.0 协议开源。其余子集均采用宽松授权协议,因此只要遵循各协议的条款,本数据集即可用于商业用途。 ## 联系方式 电子邮箱:[Jiaqi Zeng](mailto:jiaqiz@nvidia.com) ## 引用 若您发现本数据集对研究有所帮助,请引用以下文献: bibtex @misc{wang2024helpsteer2, title={HelpSteer2: Open-source dataset for training top-performing reward models}, author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev}, year={2024}, eprint={2406.08673}, archivePrefix={arXiv}, primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'} }
提供机构:
maas
创建时间:
2025-01-20
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作