MANTA-1M
Source: ModelScope community (updated 2025-10-31, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/LGAI-EXAONE/MANTA-1M
<p align="center">
<img src="Manta.png" alt="Manta" width="50%">
</p>
## **Abstract**
We introduce **MANTA**, an automated pipeline that generates high-quality, large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach generates query-response pairs effectively with minimal human intervention. Extensive experiments on 8B-scale LLMs show that models fine-tuned on MANTA-1M significantly outperform those fine-tuned on data from other large-scale dataset generation methods, particularly on knowledge-intensive benchmarks such as MMLU and MMLU-Pro, while also performing better across a broad range of tasks. Moreover, MANTA scales by continuously integrating new web corpus data, which enables expansion into knowledge-intensive domains.
## **Dataset Details**
This dataset was generated by [**EXAONE-3.5-32B-Instruct**](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-32B-Instruct) using the MANTA method. Please refer to our paper for implementation details.
The dataset is divided into 11 major categories, with their respective proportions as follows. These proportions naturally reflect the domain distribution of documents on the web, as the instructions were created based on information extracted from a large-scale web source.
| Domain | Share (%) |
| --- | --- |
| Mathematics | 17.37% |
| Social Sciences | 21.21% |
| Natural Sciences | 22.39% |
| Engineering | 5.31% |
| Economics and Business | 4.32% |
| Computer Science and Coding | 24.82% |
| Arts | 3.03% |
| Philosophy, Religion | 0.97% |
| History | 0.83% |
| Literature | 0.83% |
| Languages | 0.40% |
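
To compare the domain mix of a subset against the table above, a tally along the following lines may help. This is a minimal sketch, not part of the MANTA pipeline: the card does not document the dataset schema, so the `domain` column name is an assumption, and the rows below are toy placeholders and not real MANTA-1M records.

```python
from collections import Counter

# Hypothetical column name: the card does not document the schema, so
# confirm the real field with `dataset.column_names` before using this.
DOMAIN_FIELD = "domain"


def domain_shares(rows, field=DOMAIN_FIELD):
    """Return each domain's share of `rows` as a percentage, largest first."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    # An empty input yields an empty dict because there is nothing to iterate.
    return {name: round(100 * n / total, 2) for name, n in counts.most_common()}


# Toy rows standing in for dataset records (not real MANTA-1M data).
toy_rows = [
    {"domain": "Mathematics"},
    {"domain": "Natural Sciences"},
    {"domain": "Natural Sciences"},
    {"domain": "History"},
]

shares = domain_shares(toy_rows)
print(shares)
```

The function accepts any iterable of dict-like records, so it should also work on a split loaded with `load_dataset` once the actual column name is known.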
Additionally, to support quality control, we have annotated each example with a complexity score using the method described in [1].
[1] Yuan, Weizhe, et al. "NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions." *arXiv preprint arXiv:2502.13124* (2025).
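
One possible use of the complexity annotation is to keep only examples at or above a chosen score. The sketch below illustrates that idea under stated assumptions: the `complexity` field name, the `query` field, the score scale, and the threshold are all hypothetical, since the card does not specify them, and the rows are toy placeholders.

```python
# Hypothetical field name: the card does not say which column stores the
# complexity annotation, so inspect the dataset's columns before relying on it.
COMPLEXITY_FIELD = "complexity"


def keep_complex(rows, min_score, field=COMPLEXITY_FIELD):
    """Yield rows whose complexity score is present and at least `min_score`."""
    for row in rows:
        score = row.get(field)
        # Rows without an annotation are skipped rather than treated as zero.
        if score is not None and score >= min_score:
            yield row


# Toy rows standing in for dataset records (not real MANTA-1M data).
toy_rows = [
    {"query": "q1", "complexity": 3},
    {"query": "q2", "complexity": 7},
    {"query": "q3"},  # no annotation
    {"query": "q4", "complexity": 5},
]

selected = list(keep_complex(toy_rows, min_score=5))
print([row["query"] for row in selected])
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` instead of iterating manually.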
## **Usage**
```python
from datasets import load_dataset
dataset = load_dataset("LGAI-EXAONE/MANTA-1M")
```
## **Citation**
```json
```
## **License**
This dataset is released under the **CC-BY-NC-4.0** License.
## **Contact**
LG AI Research Technical Support: [**contact_us@lgresearch.ai**](mailto:contact_us@lgresearch.ai)
Provider: maas
Created: 2025-09-19



