
LGAI-EXAONE/MANTA-1M

Source: Hugging Face | Updated: 2026-04-04 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/LGAI-EXAONE/MANTA-1M
Resource description:
The MANTA-1M dataset is a large-scale instruction fine-tuning dataset generated automatically from massive web corpora while preserving their diversity and scalability. The method extracts structured syllabi from web documents and uses high-performance LLMs to generate query-response pairs with minimal human intervention. In extensive experiments on 8B-scale LLMs, models fine-tuned on MANTA-1M significantly outperform those trained on data from other large-scale dataset generation methods, particularly on knowledge-intensive benchmarks such as MMLU and MMLU-Pro, and also perform better across a broad range of tasks. MANTA also scales readily: web corpus data can be integrated continuously, which allows expansion into knowledge-intensive domains.
Provider:
LGAI-EXAONE
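
For readers who want to inspect the data, the following is a minimal sketch of one way to load it, assuming the third-party `datasets` library is installed and the repository is publicly accessible. The helper names `load_manta` and `peek` are hypothetical and are not part of the dataset or of any library. The listing does not state the split names or column names, so the sketch avoids assuming either. The optional `endpoint` argument relies on the `HF_ENDPOINT` environment variable and is only an assumption about how a mirror such as the one linked above might be used.

```python
import os
from itertools import islice

REPO_ID = "LGAI-EXAONE/MANTA-1M"  # dataset identifier taken from the listing


def load_manta(streaming=True, endpoint=None):
    """Load the dataset; returns a dict-like object keyed by split name.

    Hypothetical helper. With streaming=True, records are fetched lazily
    instead of downloading the full dataset first.
    """
    if endpoint is not None:
        # Assumption: huggingface_hub reads HF_ENDPOINT when it is first
        # imported, so this only takes effect if set before that import.
        os.environ["HF_ENDPOINT"] = endpoint
    from datasets import load_dataset  # third-party: pip install datasets
    return load_dataset(REPO_ID, streaming=streaming)


def peek(splits, n=1):
    """Return the first n records of every split, whatever the columns are."""
    return {name: list(islice(iter(split), n)) for name, split in splits.items()}
```

Calling `peek(load_manta())` would then return the first record of each available split, which is a quick way to discover the actual split and field names before writing any fine-tuning code.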