JustcallmeJo/MANTA_1M_ENG_KO_70_30

Name: JustcallmeJo/MANTA_1M_ENG_KO_70_30
Creator: JustcallmeJo
Published: 2026-02-06 13:41:23
License: 暂无描述

Hugging Face2026-02-06 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/JustcallmeJo/MANTA_1M_ENG_KO_70_30

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-nc-4.0 task_categories: - question-answering language: - en size_categories: - 100K<n<1M dataset_info: features: - name: conversations list: - name: content dtype: string - name: role dtype: string - name: lang dtype: string - name: type dtype: string splits: - name: train num_bytes: 161010265 num_examples: 40830 download_size: 83408235 dataset_size: 161010265 configs: - config_name: MANTA_1M data_files: - split: train path: data/train.parquet - config_name: default data_files: - split: train path: data/train-* --- <p align="center"> <img src="Manta.png" alt="Manta" width="50%"> </p> ## **Abstract** We introduce **MANTA**, an automated pipeline that generates high-quality large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach enables highly effective query-response generation with minimal human intervention. Extensive experiments on 8B-scale LLMs demonstrate that fine-tuning on the MANTA-1M dataset significantly outperforms other massive dataset generation methodologies, particularly in knowledge-intensive tasks such as MMLU and MMLU-Pro, while also delivering superior performance across a broad spectrum of tasks. Moreover, MANTA supports seamless scalability by allowing the continuous integration of web corpus data, enabling expansion into domains requiring intensive knowledge. ## **Dataset Details** This dataset is generated by [**EXAONE-3.5-32B-Instruct**](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-32B-Instruct) using MANTA method. Please refer to our paper for implementation details. The dataset is divided into 11 major categories, with their respective proportions as follows. These proportions naturally reflect the domain distribution of documents on the web, as the instructions were created based on information extracted from a large-scale web source. | Domain | percent % | | --- | --- | | Mathematics | 17.37% | | Social Sciences | 21.21% | | Natural Sciences | 22.39% | | Engineering | 5.31% | | Economics and Business | 4.32% | | Computer Science and Coding | 24.82% | | Arts | 3.03% | | Philosophy, Religion | 0.97% | | History | 0.83% | | Literature | 0.83% | | Languages | 0.40% | Additionally, to ensure the quality of each dataset, we have annotated them with complexity scores using the method described in [1]. [1] Yuan, Weizhe, et al. "Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions." *arXiv preprint arXiv:2502.13124* (2025). ## **Usage** ```python from datasets import load_dataset dataset = load_dataset("LGAI-EXAONE/MANTA-1M") ``` ## **Citation** ```json ``` ## **License** This dataset is released under the **CC-BY-NC-4.0** License. ## **Contact** LG AI Research Technical Support: [**contact_us@lgresearch.ai**](mailto:contact_us@lgresearch.ai)

提供机构：

JustcallmeJo

5,000+

优质数据集

54 个

任务类型

进入经典数据集