five

WebInstructSub

收藏
魔搭社区2026-04-28 更新2024-06-01 收录
下载链接:
https://modelscope.cn/datasets/swift/WebInstructSub
下载链接
链接失效反馈
官方服务:
资源简介:
# 🦣 MAmmoTH2: Scaling Instructions from the Web Project Page: [https://tiger-ai-lab.github.io/MAmmoTH2/](https://tiger-ai-lab.github.io/MAmmoTH2/) Paper: [https://arxiv.org/pdf/2405.03548](https://arxiv.org/pdf/2405.03548) Code: [https://github.com/TIGER-AI-Lab/MAmmoTH2](https://github.com/TIGER-AI-Lab/MAmmoTH2) ## WebInstruct (Subset) This repo contains the partial dataset used in "MAmmoTH2: Scaling Instructions from the Web". This partial data is coming mostly from the forums like stackexchange. This subset contains very high-quality data to boost LLM performance through instruction tuning. ## License - For the data from "mathstackexchange" and "stackexchange", we use Apache-2.0 license. You are free to share and adapt for any purposes. - For the data from "socratic", we use CC BY-NC 4.0 license according to https://socratic.org/terms. You are free to share and adapt, but only for non-commercial purposes. ## Fields in our dataset The field `orig_question' and `orig_answer' are the extracted question-answer pairs from the recalled documents. The `question' and `answer' are the refined version of the extracted question/answer pairs. Regarding the data source: 1. mathstackexchange: https://math.stackexchange.com/. 2. stackexchange: including https://physics.stackexchange.com/, https://biology.stackexchange.com/, https://chemistry.stackexchange.com/, https://cs.stackexchange.com/. 3. Socratic: the data is originally from https://socratic.org/. ## Size of different sources | Domain | Size | Subjects | |:---------------------|:---------|:------------------------------------------------------------------------------------------| | MathStackExchange | 1484630 | Mathematics | | ScienceStackExchange | 317209 | Physics, Biology, Chemistry, Computer Science | | Socratic | 533384 | Mathematics, Science, Humanties | ## Dataset Construction We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning various domains like math and science. Our three-step pipeline involves recalling documents from Common Crawl, extracting Q-A pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name our curated dataset as WebInstruct. ![Project Framework](https://tiger-ai-lab.github.io/MAmmoTH2/static/images/teaser.jpg) ## Citation ``` @article{yue2024mammoth2, title={MAmmoTH2: Scaling Instructions from the Web}, author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu}, journal={Advances in Neural Information Processing Systems}, year={2024} } ```

# 🦣 MAmmoTH2:从网络扩展指令数据集 项目主页:[https://tiger-ai-lab.github.io/MAmmoTH2/](https://tiger-ai-lab.github.io/MAmmoTH2/) 论文:[https://arxiv.org/pdf/2405.03548](https://arxiv.org/pdf/2405.03548) 代码仓库:[https://github.com/TIGER-AI-Lab/MAmmoTH2](https://github.com/TIGER-AI-Lab/MAmmoTH2) ## WebInstruct(子集) 本仓库包含论文《MAmmoTH2:从网络扩展指令数据集》中使用的部分数据集。该部分数据主要源自Stack Exchange(堆栈交换)类论坛,此子集包含高质量数据,可用于指令微调以提升大语言模型(Large Language Model)的性能。 ## 许可协议 - 对于源自mathstackexchange和stackexchange的数据,采用Apache-2.0许可协议。您可自由共享并改编该数据用于任意用途。 - 对于源自Socratic的数据,依据https://socratic.org/terms 采用CC BY-NC 4.0许可协议。您可自由共享并改编,但仅可用于非商业用途。 ## 数据集字段说明 本数据集包含`orig_question`与`orig_answer`字段,二者为从召回文档中提取的原始问答对;`question`与`answer`字段则为提取后的问答对经过精炼优化后的版本。 关于数据来源: 1. mathstackexchange:https://math.stackexchange.com/ 2. stackexchange:涵盖https://physics.stackexchange.com/、https://biology.stackexchange.com/、https://chemistry.stackexchange.com/、https://cs.stackexchange.com/ 3. Socratic:数据最初源自https://socratic.org/ ## 各数据源规模 | 领域 | 数据量 | 所属学科 | |:---------------------|:---------|:------------------------------------------------------------------------------------------| | MathStackExchange | 1484630 | 数学 | | ScienceStackExchange | 317209 | 物理学、生物学、化学、计算机科学 | | Socratic | 533384 | 数学、科学、人文科学 | ## 数据集构建方案 我们提出从网络中挖掘指令数据的研究思路。我们认为,海量高质量的指令数据广泛分布于网络语料库中,覆盖数学、科学等众多领域。我们的三阶段流水线包含以下步骤:从Common Crawl语料库中召回文档、提取问答对,以及对提取内容进行质量精炼。该方法可生成1000万条指令-响应对,为现有数据集提供了一种可扩展的替代方案。我们将该精选数据集命名为WebInstruct。 ![项目框架](https://tiger-ai-lab.github.io/MAmmoTH2/static/images/teaser.jpg) ## 引用格式 @article{yue2024mammoth2, title={MAmmoTH2: Scaling Instructions from the Web}, author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu}, journal={Advances in Neural Information Processing Systems}, year={2024} }
提供机构:
maas
创建时间:
2024-06-05
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作