Daring-Anteater
收藏魔搭社区2025-12-04 更新2025-01-25 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Daring-Anteater
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card
Daring-Anteater is a comprehensive dataset for instruction tuning, covering a wide range of tasks and scenarios. The majority of the dataset is synthetically generated using NVIDIA proprietary models and [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), while the remaining samples are sourced from [FinQA](https://finqasite.github.io/), [wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions), and commercially-friendly subsets of [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
This dataset is used in [HelpSteer2 paper](https://arxiv.org/abs/2406.08673), resulting in a solid SFT model for further preference tuning. We open-source this dataset to promote reproducibility.
## Dataset
The dataset consists of four columns:
1. conversations: user and assistant turns in a conversational format
2. mask: the turns that losses are not calculated on ("User" by default)
3. system: system prompt
4. dataset: dataset source
Details of the data blend are as follows:
| Data Source | Number of samples | License |
|:-----------------------------|:----------------|:-----|
| synthetic_conv | 82450 | CC-BY-4.0 |
| synthetic_roleplay | 2996 | CC-BY-4.0 |
| synthetic_math | 3000 | CC-BY-4.0 |
| synthetic_precise_instruction_following | 1500 | CC-BY-4.0 |
| synthetic_json_format_following | 1499 | CC-BY-4.0 |
| synthetic_complex_instruction | 1500 | CC-BY-4.0 |
| open_platypus_commercial | 6000 | CC-BY-4.0/Apache-2.0/MIT |
| FinQA | 300 | CC-BY-4.0 |
| wikitablequestions | 287 | CC-BY-4.0 |
## License
We open-source our synthetic subsets under the CC-BY-4.0 license. All other subsets are also under permissive licenses, making the dataset usable for commercial purposes as long as you follow the terms of the licenses.
## Contact
E-Mail: [Jiaqi Zeng](mailto:jiaqiz@nvidia.com)
## Citation
If you find this dataset useful, please cite the following works
```bibtex
@misc{wang2024helpsteer2,
title={HelpSteer2: Open-source dataset for training top-performing reward models},
author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
year={2024},
eprint={2406.08673},
archivePrefix={arXiv},
primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'}
}
```
# 数据集卡片
Daring-Anteater 是一款用于指令微调(Instruction Tuning)的综合数据集,覆盖广泛的任务与场景。该数据集的绝大多数样本由NVIDIA专有模型与 [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) 合成生成,剩余样本则源自 [FinQA](https://finqasite.github.io/)、[wikitablequestions](https://huggingface.co/datasets/Stanford/wikitablequestions) 以及 [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) 的商用友好子集。
本数据集已在 [HelpSteer2 论文](https://arxiv.org/abs/2406.08673) 中得到使用,可用于训练可靠的监督微调(Supervised Fine-Tuning, SFT)模型以开展进一步的偏好微调。我们开源该数据集旨在推动研究的可复现性。
## 数据集
该数据集包含四列数据:
1. **conversations(对话)**:对话格式下的用户与助手交互轮次
2. **mask(掩码)**:无需计算损失的交互轮次(默认设置为"User")
3. **system(系统提示)**:系统提示词
4. **dataset(数据集来源)**:样本所属的数据集来源
数据混合方案详情如下:
| 数据来源 | 样本数量 | 授权协议 |
|:-----------------------------|:----------------|:-----|
| synthetic_conv(合成对话) | 82450 | CC-BY-4.0 |
| synthetic_roleplay(合成角色扮演) | 2996 | CC-BY-4.0 |
| synthetic_math(合成数学任务) | 3000 | CC-BY-4.0 |
| synthetic_precise_instruction_following(合成精准指令遵循) | 1500 | CC-BY-4.0 |
| synthetic_json_format_following(合成JSON格式遵循) | 1499 | CC-BY-4.0 |
| synthetic_complex_instruction(合成复杂指令) | 1500 | CC-BY-4.0 |
| open_platypus_commercial(商用友好型Open-Platypus子集) | 6000 | CC-BY-4.0/Apache-2.0/MIT |
| FinQA | 300 | CC-BY-4.0 |
| wikitablequestions | 287 | CC-BY-4.0 |
## 授权协议
我们将所有合成子集以 CC-BY-4.0 协议开源。其余子集均采用宽松授权协议,因此只要遵循各协议的条款,本数据集即可用于商业用途。
## 联系方式
电子邮箱:[Jiaqi Zeng](mailto:jiaqiz@nvidia.com)
## 引用
若您发现本数据集对研究有所帮助,请引用以下文献:
bibtex
@misc{wang2024helpsteer2,
title={HelpSteer2: Open-source dataset for training top-performing reward models},
author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
year={2024},
eprint={2406.08673},
archivePrefix={arXiv},
primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'}
}
提供机构:
maas
创建时间:
2025-01-20



