Fineweb-Instruct
收藏魔搭社区2025-12-05 更新2025-02-08 收录
下载链接:
https://modelscope.cn/datasets/TIGER-Lab/Fineweb-Instruct
下载链接
链接失效反馈官方服务:
资源简介:
We convert the pre-training corpus from Fineweb-Edu (https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) to instruction following format. We select a subset with quality filter and then use GPT-4 to extract instruction-following pairs. The dataset contains roughly 16M instruction pairs. The basic concept is similar to MAmmoTH2 (https://arxiv.org/abs/2405.03548).

## Citation
If you use dataset useful, please cite the following paper:
```
@article{yue2024mammoth2,
title={MAmmoTH2: Scaling Instructions from the Web},
author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
journal={arXiv preprint arXiv:2405.03548},
year={2024}
}
```
我们将来自Fineweb-Edu(https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)的预训练语料库转换为指令遵循格式。我们通过质量过滤选取了一个子集,随后使用GPT-4提取指令遵循样本对。本数据集共包含约1600万条指令样本对,其核心设计理念与MAmmoTH2(https://arxiv.org/abs/2405.03548)相近。

## 引用
若您使用本数据集,请引用如下论文:
@article{yue2024mammoth2,
title={MAmmoTH2:从网络中扩展指令},
author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
journal={arXiv预印本 arXiv:2405.03548},
year={2024}
}
提供机构:
maas
创建时间:
2025-02-03



