
web_instruct

ModelScope community, updated 2025-11-14, indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/mrsteamedbun/web_instruct
Resource description:
# 🦣 MAmmoTH2: Scaling Instructions from the Web

Project Page: [https://tiger-ai-lab.github.io/MAmmoTH2/](https://tiger-ai-lab.github.io/MAmmoTH2/)

Paper: [https://arxiv.org/pdf/2405.03548](https://arxiv.org/pdf/2405.03548)

Code: [https://github.com/TIGER-AI-Lab/MAmmoTH2](https://github.com/TIGER-AI-Lab/MAmmoTH2)

## WebInstruct (Subset)

This repo contains a partial release of the dataset used in "MAmmoTH2: Scaling Instructions from the Web". The data in this subset comes mostly from forums such as Stack Exchange. The subset contains high-quality data intended to boost LLM performance through instruction tuning.

## License

- For the data from "mathstackexchange" and "stackexchange", we use the Apache-2.0 license. You are free to share and adapt it for any purpose.
- For the data from "socratic", we use the CC BY-NC 4.0 license, in accordance with https://socratic.org/terms. You are free to share and adapt it, but only for non-commercial purposes.

## Fields in our dataset

The fields `orig_question` and `orig_answer` are the question-answer pairs extracted from the recalled documents. The fields `question` and `answer` are the refined versions of those extracted pairs.

Regarding the data sources:

1. mathstackexchange: https://math.stackexchange.com/.
2. stackexchange: including https://physics.stackexchange.com/, https://biology.stackexchange.com/, https://chemistry.stackexchange.com/, https://cs.stackexchange.com/.
3. Socratic: the data is originally from https://socratic.org/.

## Size of different sources

| Domain               | Size     | Subjects                                      |
|:---------------------|:---------|:----------------------------------------------|
| MathStackExchange    | 1484630  | Mathematics                                   |
| ScienceStackExchange | 317209   | Physics, Biology, Chemistry, Computer Science |
| Socratic             | 533384   | Mathematics, Science, Humanities              |

## Dataset Construction

We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning domains such as math and science. Our three-step pipeline consists of recalling documents from Common Crawl, extracting Q-A pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name our curated dataset WebInstruct.

![Project Framework](https://tiger-ai-lab.github.io/MAmmoTH2/static/images/teaser.jpg)

## Citation

```
@article{yue2024mammoth2,
  title={MAmmoTH2: Scaling Instructions from the Web},
  author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
  journal={arXiv preprint arXiv:2405.03548},
  year={2024}
}
```
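The license terms and source sizes described in this card can be combined into a small helper for deciding which records are usable commercially. The following is a minimal sketch, not part of the official release: it assumes each record carries a `source` key holding one of the three source names, which this card does not confirm (it only documents `orig_question`, `orig_answer`, `question`, and `answer`). The function names are hypothetical.

```python
# Sizes copied from the "Size of different sources" table in this card.
SOURCE_SIZES = {
    "MathStackExchange": 1_484_630,
    "ScienceStackExchange": 317_209,
    "Socratic": 533_384,
}

# License per source, as stated in the License section of this card.
LICENSES = {
    "mathstackexchange": "Apache-2.0",
    "stackexchange": "Apache-2.0",
    "socratic": "CC BY-NC 4.0",
}


def commercial_use_allowed(source: str) -> bool:
    """Return True when the source's license permits commercial use."""
    return LICENSES[source.lower()] == "Apache-2.0"


def filter_commercial(records):
    """Keep only records whose (assumed) `source` key is Apache-2.0 licensed."""
    return [r for r in records if commercial_use_allowed(r["source"])]


def source_shares(sizes):
    """Return each source's fraction of the total number of pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    # Toy records; the `source` key is an assumption made for illustration.
    sample = [
        {"question": "q1", "answer": "a1", "source": "mathstackexchange"},
        {"question": "q2", "answer": "a2", "source": "socratic"},
    ]
    print(len(filter_commercial(sample)))
    print(sum(SOURCE_SIZES.values()))
    print(source_shares(SOURCE_SIZES))
```

Keeping the license table separate from the filtering logic means a change in terms for one source only requires editing the `LICENSES` mapping.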

Provider:
maas
Created:
2024-08-26