LLM360/MegaMath
Source: Hugging Face | Updated: 2025-04-09 | Indexed: 2025-05-31
Download link:
https://hf-mirror.com/datasets/LLM360/MegaMath
Description:
MegaMath is an open math pre-training dataset curated by the LLM360 Team, containing over 300 billion tokens. It has three main components: mathematical documents re-extracted from the web, high-quality math-related code, and synthetic data consisting of QA-style text and interleaved text-code blocks. It is presented as the largest open math pre-training dataset to date, with higher quality and better performance than existing datasets. It is intended for training current language models and has shown performance improvements on multiple downstream tasks.
Provider:
LLM360