web_instruct
Source: ModelScope community. Updated: 2025-11-14. Indexed: 2024-08-31.
Download link: https://modelscope.cn/datasets/mrsteamedbun/web_instruct
# 🦣 MAmmoTH2: Scaling Instructions from the Web
Project Page: [https://tiger-ai-lab.github.io/MAmmoTH2/](https://tiger-ai-lab.github.io/MAmmoTH2/)
Paper: [https://arxiv.org/pdf/2405.03548](https://arxiv.org/pdf/2405.03548)
Code: [https://github.com/TIGER-AI-Lab/MAmmoTH2](https://github.com/TIGER-AI-Lab/MAmmoTH2)
## WebInstruct (Subset)
This repo contains part of the dataset used in "MAmmoTH2: Scaling Instructions from the Web". This subset comes mostly from forums such as Stack Exchange and consists of high-quality question-answer data intended to improve LLM performance through instruction tuning.
## License
- For the data from "mathstackexchange" and "stackexchange", we use the Apache-2.0 license. You are free to share and adapt it for any purpose.
- For the data from "socratic", we use the CC BY-NC 4.0 license, in accordance with https://socratic.org/terms. You are free to share and adapt it, but only for non-commercial purposes.
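
Because the two licenses differ in what they allow, it can help to separate records by source before any commercial use. The following is a minimal sketch, not part of the official tooling: it assumes each record carries a `source` field holding one of the three labels quoted above, which this card does not confirm, and the sample records are made up.

```python
from typing import Iterable

# License per source label, as stated in the License section.
LICENSE_BY_SOURCE = {
    "mathstackexchange": "Apache-2.0",
    "stackexchange": "Apache-2.0",
    "socratic": "CC BY-NC 4.0",
}


def commercially_usable(records: Iterable[dict], source_key: str = "source") -> list[dict]:
    """Keep only records whose source is listed as Apache-2.0.

    `source_key` is an assumed field name; adjust it to the real schema.
    Records with a missing or unknown source are dropped to stay conservative.
    """
    return [
        r for r in records
        if LICENSE_BY_SOURCE.get(r.get(source_key)) == "Apache-2.0"
    ]


# Made-up records, for illustration only.
sample = [
    {"source": "mathstackexchange", "question": "q1"},
    {"source": "socratic", "question": "q2"},
    {"source": "stackexchange", "question": "q3"},
]
kept = commercially_usable(sample)
print([r["question"] for r in kept])
```

Dropping records with an unrecognized source is a deliberate choice here: it avoids treating unlabeled data as permissively licensed.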
## Fields in our dataset
The fields `orig_question` and `orig_answer` hold the question-answer pairs extracted from the recalled documents. The fields `question` and `answer` hold the refined versions of those extracted pairs.
Regarding the data source:
1. mathstackexchange: https://math.stackexchange.com/.
2. stackexchange: including https://physics.stackexchange.com/, https://biology.stackexchange.com/, https://chemistry.stackexchange.com/, https://cs.stackexchange.com/.
3. Socratic: the data is originally from https://socratic.org/.
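
The four fields described above can be modeled as a simple record type. The sketch below is illustrative only: the `WebInstructRecord` type, the `to_training_pair` helper, and the sample values are hypothetical, and the real dataset may contain additional fields not listed in this card.

```python
from typing import TypedDict


class WebInstructRecord(TypedDict):
    """The four fields named in this card (other fields may exist)."""
    orig_question: str  # question as extracted from the recalled document
    orig_answer: str    # answer as extracted from the recalled document
    question: str       # refined question
    answer: str         # refined answer


def to_training_pair(record: WebInstructRecord, use_refined: bool = True) -> tuple[str, str]:
    """Return an (instruction, response) pair for instruction tuning.

    By default the refined fields are used; set `use_refined=False`
    to fall back to the raw extracted text.
    """
    if use_refined:
        return record["question"], record["answer"]
    return record["orig_question"], record["orig_answer"]


# Made-up example record, not taken from the dataset.
example: WebInstructRecord = {
    "orig_question": "whats the derivative of x^2 ??",
    "orig_answer": "2x",
    "question": "What is the derivative of x^2?",
    "answer": "Using the power rule, the derivative of x^2 is 2x.",
}
instruction, response = to_training_pair(example)
print(instruction)
print(response)
```

Keeping both versions side by side makes it easy to compare training on refined text against training on the raw extraction.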
## Size of different sources
| Domain | Size | Subjects |
|:---------------------|:---------|:------------------------------------------------------------------------------------------|
| MathStackExchange | 1484630 | Mathematics |
| ScienceStackExchange | 317209 | Physics, Biology, Chemistry, Computer Science |
| Socratic | 533384 | Mathematics, Science, Humanities |
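
As a quick check on the table, the three sizes can be summed to get the size of this subset and each source's share. The numbers below are copied from the table; the variable names are only for illustration.

```python
# Sizes copied from the table above.
sizes = {
    "MathStackExchange": 1_484_630,
    "ScienceStackExchange": 317_209,
    "Socratic": 533_384,
}

total = sum(sizes.values())  # 2,335,223 pairs in this subset
shares = {name: count / total for name, count in sizes.items()}

for name, count in sizes.items():
    print(f"{name:<22}{count:>12,}{shares[name]:>8.1%}")
print(f"{'Total':<22}{total:>12,}")
```

The subset therefore holds about 2.3 million pairs, with MathStackExchange contributing a little under two thirds of them.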
## Dataset Construction
We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning domains such as math and science. Our three-step pipeline consists of recalling documents from Common Crawl, extracting question-answer pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name the curated dataset WebInstruct.
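
The three steps can be pictured with a toy example. This is only a sketch of the control flow, not the authors' implementation: keyword matching, a `Q:`/`A:` regular expression, and whitespace cleanup are simple stand-ins chosen for illustration, and every function name, keyword, and document here is hypothetical. See the paper and code linked above for the actual method.

```python
import re


def recall(documents: list[str], keywords: tuple[str, ...]) -> list[str]:
    """Step 1 (toy stand-in): keep documents mentioning any keyword."""
    return [d for d in documents if any(k in d.lower() for k in keywords)]


# Toy pattern: a "Q:" line followed by an "A:" line.
QA_PATTERN = re.compile(r"Q:\s*(?P<q>.+?)\s*A:\s*(?P<a>.+?)(?=\s*Q:|\Z)", re.DOTALL)


def extract_pairs(document: str) -> list[dict]:
    """Step 2 (toy stand-in): pull raw question-answer pairs from a document."""
    return [
        {"orig_question": m["q"], "orig_answer": m["a"]}
        for m in QA_PATTERN.finditer(document)
    ]


def _squash(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def refine(pair: dict) -> dict:
    """Step 3 (toy stand-in): add cleaned-up `question` and `answer` fields."""
    return {
        **pair,
        "question": _squash(pair["orig_question"]),
        "answer": _squash(pair["orig_answer"]),
    }


# Made-up documents standing in for crawled web pages.
docs = [
    "Math forum\nQ: What is 2 + 2?\nA: It is   4.\nQ: Is 7 prime?\nA: Yes.",
    "Today's recipe: tomato soup.",
]
pairs = [
    refine(p)
    for d in recall(docs, keywords=("math", "prime"))
    for p in extract_pairs(d)
]
for p in pairs:
    print(p["question"], "->", p["answer"])
```

Each stage keeps the raw text alongside the cleaned text, mirroring the `orig_question`/`orig_answer` and `question`/`answer` fields of the dataset.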

## Citation
```
@article{yue2024mammoth2,
title={MAmmoTH2: Scaling Instructions from the Web},
author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
journal={arXiv preprint arXiv:2405.03548},
year={2024}
}
```
Provider: maas
Created: 2024-08-26



