UltraData-Math

Name: UltraData-Math
Creator: maas
Published: 2026-05-16 15:50:00
License: No description provided

ModelScope community · Updated 2026-05-16 · Indexed 2026-05-03

Download link:

https://modelscope.cn/datasets/OpenBMB/UltraData-Math


Description:

# UltraData-Math

<div align="center">
  <img src="assets/ultradata-math-logo.png" width="600"/>
</div>

<p align="center">
<a href="https://huggingface.co/datasets/openbmb/UltraData-Math">🤗 Dataset</a> | <a href="https://github.com/UltraData-OpenBMB/UltraData-Math">💻 Source Code</a> | <a href="https://huggingface.co/datasets/openbmb/UltraData-Math/blob/main/README_ZH.md">🇨🇳 中文 README</a>
</p>

***UltraData-Math*** is a large-scale, high-quality mathematical pre-training dataset totaling **290B+ tokens** across three progressive tiers: **L1** (170.5B tokens, web corpus), **L2** (33.7B tokens, quality-selected), and **L3** (88B tokens, multi-format refined). It is designed to systematically enhance mathematical reasoning in LLMs. It has been applied to the mathematical pre-training of the [MiniCPM Series](https://huggingface.co/collections/openbmb/minicpm4) models. It was introduced in the paper [Data Science and Technology Towards AGI Part I: Tiered Data Management](https://huggingface.co/papers/2602.09003).

## 🆕 What's New

- **[2026.02.09]**: **UltraData-Math**, a large-scale high-quality mathematical pre-training dataset with 290B+ tokens across three progressive tiers (L1/L2-preview/L3), is now available on Hugging Face. Released as part of the [UltraData](https://ultradata.openbmb.cn/) ecosystem. 🔥🔥🔥
- **[2026.02.10]**: **UltraData-Math** tops the Hugging Face Datasets Trending list, reaching the #1 spot! ⭐️⭐️⭐️

## 📚 Introduction

High-quality pre-training data is crucial for enhancing the mathematical reasoning capabilities of large language models (LLMs).
However, existing approaches to building mathematical pre-training data have the following shortcomings:

- **HTML Parsing**: General parsers (such as trafilatura and readability) are designed mainly for news and article pages. They lack specialized handling for mathematical formulas and similar content, so formula structure is often damaged or lost. Mathematical discussions on forum-like pages are also difficult to extract completely.
- **Data Quality**: Existing datasets generally lack a systematic quality grading mechanism, so high-value mathematical content is mixed with low-quality noise.
- **Data Diversity**: Mainstream datasets mostly originate from textbooks or competition question banks and lack the mathematical discussions and application scenarios found in real web pages. Synthetic data tends to come in a single format, which makes it hard to cover diverse needs such as multi-turn dialogues and multi-style expression.

To address these issues, we propose ***UltraData-Math***, a large-scale, high-quality pre-training dataset for mathematical reasoning tasks. The dataset is built on the [UltraData](https://ultradata.openbmb.cn/blog/position-paper) L0-L4 Tiered Data Management Framework and covers four progressive levels of that framework (L0 through L3):

- **L0 Raw Data**: Develops a mathematical parser based on *magic-html*, combined with *w3m* layout-preserving rendering and multi-level fallback strategies, standardizing MathML, KaTeX, and AsciiMath into LaTeX format.
- **L1 Filtered Data**: Cleans noise through heuristic rules and performs document-level deduplication.
- **L2 Selected Data**: Uses proprietary large models to annotate seed data and distills the annotations into a lightweight embedding classifier for efficient quality grading of the full corpus.
- **L3 Refined Data**: Produces structured content with clear reasoning through rewriting, synthetic generation, and refinement in formats such as Q&A, multi-turn dialogues, multi-style rewriting, and knowledge-grounded textbooks.
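
As a rough illustration of the L1 tier in the list above, heuristic cleaning and document-level deduplication can be sketched as follows. This is a minimal sketch and not the released pipeline: the function names, the `MIN_CHARS` threshold, and the use of exact SHA-256 hashing for deduplication are all assumptions made for the example, since the description here does not state the actual rules or the deduplication method.

```python
import hashlib
import re

# Zero-width and BOM characters that often survive HTML extraction.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# Three or more consecutive newlines, i.e. runs of blank lines.
_MANY_NEWLINES = re.compile(r"\n{3,}")

MIN_CHARS = 200  # hypothetical threshold; the source does not give one


def clean(text: str) -> str:
    """Strip invisible characters and collapse runs of blank lines."""
    text = _INVISIBLE.sub("", text)
    text = _MANY_NEWLINES.sub("\n\n", text)
    return text.strip()


def filter_and_dedup(docs, min_chars=MIN_CHARS):
    """Yield cleaned documents that are long enough and not exact duplicates."""
    seen = set()
    for doc in docs:
        text = clean(doc)
        if len(text) < min_chars:
            continue  # too short to carry useful context
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical to an earlier document after cleaning
        seen.add(digest)
        yield text
```

Exact hashing only catches documents that are identical after cleaning. Catching near-duplicates would need a fuzzy technique such as MinHash, which this sketch does not attempt.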

Experiments show that on the MiniCPM-1.2B architecture, ***UltraData-Math*** scores **37.02** on the MATH500 benchmark, an improvement of **+3.62pp** over Nemotron-CC 4plus, and **61.79** on GSM8K, an improvement of **+3.34pp**, while maintaining code generation and general knowledge capabilities.

***UltraData-Math*** has been applied to the mathematical pre-training of the [MiniCPM Series](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b) models.

- **[UltraData-Math-L1](https://huggingface.co/datasets/openbmb/UltraData-Math)**: Large-scale high-quality mathematical pre-training dataset, containing 170.5B tokens of web mathematical corpus.
- **[UltraData-Math-L2](https://huggingface.co/datasets/openbmb/UltraData-Math-L2)**: High-quality mathematical pre-training dataset selected by the quality model, containing 33.7B tokens of high-quality web mathematical corpus.
- **[UltraData-Math-L3](https://huggingface.co/datasets/openbmb/UltraData-Math-L3)**: High-quality refined mathematical dataset, containing 88B tokens of multi-format refined data (Q&A, multi-turn dialogues, knowledge textbooks, etc.).

## 🏗️ Data Processing Pipeline

To overcome the quality and diversity limitations of existing mathematical datasets, we established a fine-grained grading standard centered on "mathematical content integrity" and "information density". ***UltraData-Math*** adopts the **L0-L4 Tiered Data Management Framework** proposed in the [UltraData](https://ultradata.openbmb.cn/blog/position-paper) paper. Standardized level definitions allow mathematical data assets to be managed in an orderly way and to flow efficiently between stages. Each successive level represents higher data purity and mathematical value, and corresponds to a more refined degree of processing.
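
To make the tier sizes concrete, the token counts quoted in the dataset list above can be tallied in a few lines. This is only a bookkeeping sketch: the numbers are taken from that list, while the variable names are invented for the example.

```python
# Token counts in billions, as stated in the dataset list above.
TIER_TOKENS_B = {"L1": 170.5, "L2": 33.7, "L3": 88.0}

# Sum over the three released tiers; consistent with the "290B+" headline figure.
total_b = sum(TIER_TOKENS_B.values())

# Share of L1 tokens that remain after L2 quality selection.
l2_keep_ratio = TIER_TOKENS_B["L2"] / TIER_TOKENS_B["L1"]

print(f"total: {total_b:.1f}B tokens")    # total: 292.2B tokens
print(f"L2 / L1: {l2_keep_ratio:.1%}")    # L2 / L1: 19.8%
```

In other words, the three tiers add up to about 292.2B tokens, and the L2 selection keeps roughly one fifth of the L1 tokens.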

<div align="center">
  <img src="assets/ultradata-math-pipeline.png" width="900"/>
</div>

### L0: Raw Data Parsing and Standardization

**Goal**: Address the poor support of general HTML parsers for mathematical formulas and maximize the preservation of mathematical semantics in web pages.

The L0 phase mainly processes raw web data obtained from sources such as Common Crawl. Given the specificity of mathematical web pages, we develop specialized parsing strategies through the [UltraData-Math-Parser](https://huggingface.co/spaces/openbmb/UltraData-Math-L0-Parser) instead of directly using general parsers like trafilatura or readability.

- **Unified Parsing Mode**: Automatically identifies page types to ensure complete content extraction as much as possible.
- **Multi-level Fallback Strategy**: To prevent data loss due to parsing failures, we implement a multi-level fallback mechanism to ensure text content is captured even if structured parsing fails.
- **Mathematical Formula Standardization**: We unify different mathematical expressions in web pages into standard LaTeX format, achieving data format normalization for unified model learning.

### L1: Heuristic Cleaning and Filtering

**Goal**: Remove format noise and improve data readability and standardization.

After obtaining text containing complete mathematical formulas, we clean the L0 data through a series of heuristic rules:

- **Format Repair**:
  - Clean invisible characters, garbled text, and unnatural continuous line breaks.
  - Remove irrelevant web noise such as navigation bars, footers, ad pop-ups, and "read more".
- **Content Filtering**:
  - *Length Filtering*: Remove overly short text fragments, which usually lack context and are difficult to support effective mathematical reasoning training.
  - *Language Identification*: Ensure the dataset is composed mainly of high-quality English and Chinese mathematical content.
  - *Document Deduplication*: Perform deduplication at the document level to prevent duplicate content from biasing model training.

### L2: Selection Based on Quality Models

**Goal**: Identify core corpora with high value from massive data.

Although L1 data has a clean format, the content quality varies. The L2 phase introduces a model-based quality assessment system:

- **Seed Data Annotation**: Use proprietary large models to score a portion of seed data across multiple dimensions.
- **Classifier Training and Distillation**: Train lightweight embedding classifiers based on annotated data to equip them with the ability to identify high-value mathematical content.
- **Full-scale Inference**: Use the trained classifier to score and screen L1 data in full.
  - *Retention*: Content containing detailed problem-solving steps, mathematical concept explanations, and high-level academic discussions.
  - *Exclusion*: Simple stacking of nouns, meaningless lists of numbers, juvenile content, or noise from non-mathematical fields.

### L3: Refined Data

**Goal**: Produce structured content with clear reasoning and explicit educational intent through rewriting, synthetic generation, and refinement, achieving textbook-quality standards and ensuring maximum learnability.

Natural web data is mostly declarative text, lacking structured reasoning steps and diverse pedagogical formats. To enhance the model's chain-of-thought (CoT) capabilities and multi-turn interaction skills, we build the L3 refined data layer through the [UltraData-Math-Generator](https://huggingface.co/spaces/openbmb/UltraData-Math-L3-Generator):

- **Q&A Pair Generation**: Use high-performance models to rewrite declarative documents into "Question-Answer" pairs, constructing QA-style data with explicit reasoning steps.
- **Multi-turn Dialogue Synthesis**: Simulate "Teacher-Student" tutoring scenarios to generate multi-turn dialogue data containing follow-up questions, corrections, and guidance.
- **Multi-style Rewriting**: Rewrite single-source data into multiple styles (such as rigorous textbook style, competition problem-solving style, intuitive popular science style) to improve model generalization.
- **Knowledge Point Textbook Generation**: Generate systematic textbook-like content based on specific knowledge points to ensure the model masters core mathematical concepts.
- **Format Repair and Enhancement**: Fix formatting issues in the source data (e.g., broken LaTeX formulas, inconsistent notation) and enhance content coherence to achieve textbook-quality standards.

Based on the above methodology, we produce the following ***UltraData-Math*** datasets:

| Dataset | # Tokens | # Documents |
|:---|:---:|:---:|
| UltraData-Math-L1 | 170.5B | 85.6M |
| UltraData-Math-L2-preview | 33.7B | 14.98M |
| UltraData-Math-L3 | 88B | 81.4M |

## 🚀 Quick Start

You can load the dataset directly from Hugging Face:

```python
from datasets import load_dataset

# Load UltraData-Math-L1
ds_l1 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L1")

# Load UltraData-Math-L2-preview
ds_l2 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L2-preview")

# Load UltraData-Math-L3 (default: Conversation-Synthetic)
ds_l3 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L3-Conversation-Synthetic")

# Other L3 configs:
# - UltraData-Math-L3-Multi-Style-Synthetic
# - UltraData-Math-L3-QA-Synthetic
# - UltraData-Math-L3-Textbook-Exercise-Synthetic
```

## 📈 Experimental Results

We evaluated data quality using the **Decay Verification** method: a **MiniCPM-1.2B** base model (pre-trained on 1.3T tokens with the **MiniCPM3-4B** tokenizer) is further pre-trained on **~100B tokens**, of which 30% is the target data and 70% is general data. We used [OpenCompass](https://github.com/open-compass/opencompass) as our evaluation framework.
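
The 30% / 70% mixture used in the setup above can be pictured as a weighted interleave of two document streams. The sketch below is a minimal illustration and not the authors' training pipeline: `mix_streams` and its parameters are hypothetical names, and it samples per document, whereas the ratio in the text is stated over tokens.

```python
import random
from itertools import count, islice


def mix_streams(target, general, target_ratio=0.3, seed=0):
    """Yield (source, doc) pairs, drawing from `target` with probability `target_ratio`."""
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    target, general = iter(target), iter(general)
    while True:
        if rng.random() < target_ratio:
            name, stream = "target", target
        else:
            name, stream = "general", general
        try:
            doc = next(stream)
        except StopIteration:
            return  # stop when the chosen stream is exhausted
        yield name, doc


# Usage with two endless dummy streams standing in for real corpora.
sample = list(islice(mix_streams(count(), count()), 10_000))
target_share = sum(name == "target" for name, _ in sample) / len(sample)
print(f"target share: {target_share:.3f}")
```

With Hugging Face `datasets`, a similar effect can be obtained with `interleave_datasets` and its `probabilities` argument. Matching the ratio over tokens rather than documents would additionally require weighting by document length.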
Evaluation benchmarks include:

- **General English:** MMLU, ARC-E, ARC-C, BigBench Hard (BBH), CommonSenseQA, HellaSwag, OpenbookQA, PIQA, SIQA, Winogrande
- **General Chinese:** C-Eval, CMMLU
- **Math Reasoning:** MATH500, GSM8K, Math-Bench, R-Bench-Math
- **Code Reasoning:** MBPP, HumanEval

### Effectiveness of L0 Parsing Strategy

To fairly compare different parsing strategies, we conducted experiments on a data subset sampled from the **2023-2024** distribution. We re-parsed the raw HTML from this source using different parsers. This comparison demonstrates the **effectiveness of our L0 Parser** against other parsers.

<div align="center">
  <img src="assets/ultradata-math-l0-parser-comparison.png" width="700"/>
</div>

### Pipeline Effectiveness (L1 vs L2 vs L3)

To validate the effectiveness of our L0-L3 tiered framework, we conducted ablation studies comparing models trained on different tiers of UltraData-Math. Unlike the L0 parser comparison above (which used a 2023-2024 subset), these results are based on the **full dataset**. Results demonstrate that higher-tier data (L3) significantly boosts mathematical reasoning (MATH500, GSM8K) and general capabilities.

<div align="center">
  <img src="assets/ultradata-math-l1l2l3-comparison.png" width="700"/>
</div>

### Full Evaluation Results

To compare against existing public mathematical pre-training datasets, we trained models independently on each dataset using the same model architecture and training budget (~100B tokens). The baselines include [Nemotron-CC-Math](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1), [MegaMath-Web-Pro](https://huggingface.co/datasets/LLM360/MegaMath), and [FineMath](https://huggingface.co/datasets/HuggingFaceTB/finemath).
All models are evaluated under identical conditions for a fair comparison:

<div align="center">
  <img src="assets/ultradata-math-full-comparison.png" width="700"/>
</div>

## ❤️ Acknowledgements

- **L0 Parsing Layer**: [magic-html](https://github.com/opendatalab/magic-html), [w3m](http://w3m.sourceforge.net/), [trafilatura](https://github.com/adbar/trafilatura)
- **L3 Synthesis Layer**: [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct), [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B), [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
- **Seed Data**: [Nemotron-CC-Math](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1), [MegaMath](https://huggingface.co/datasets/LLM360/MegaMath), [FineMath](https://huggingface.co/datasets/HuggingFaceTB/finemath)

## 📖 Citation

If you find **UltraData-Math** useful in your research, please consider citing:

```bibtex
@misc{ultradata-math,
  title={UltraData-Math},
  author={Chuyue Zhou and Hongya Lyu and Xinle Lin and Hengyu Zhao and Junshao Guo and Xueren Zhang and Shuaikang Xue and Qiang Ma and Jie Zhou and Yudong Wang and Zhiyuan Liu},
  year={2026},
  url={https://huggingface.co/datasets/openbmb/UltraData-Math},
  publisher={Hugging Face}
}
```

## 📜 License

This project is licensed under the [Apache 2.0](./LICENSE) license.


Provider:

maas

Created:

2026-02-10
