swan07/math-tiers

Name: swan07/math-tiers
Creator: swan07
Published: 2026-03-06 14:59:09
License: cc-by-4.0

Source: Hugging Face. Updated: 2026-03-06. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/swan07/math-tiers


Description:

---
license: cc-by-4.0
language:
- en
tags:
- pretraining
- numerical-reasoning
- tiered-corpus
size_categories:
- n>1T
---

# Math-Tiers: A Tiered Pretraining Corpus for Studying Numerical Reasoning

A large-scale English pretraining corpus split into three tiers by mathematical content density. Designed for controlled experiments studying how data composition during pretraining affects numerical reasoning in language models.

## Tiers

| Tier | Description | Shards | Size | Est. Tokens | Sources |
|------|-------------|--------|------|-------------|---------|
| **T0** | Pure narrative: no digits, number words, or math | 648 | 542 GB | ~113B | RedPajama-Book, PleIAs/English-PD, Project Gutenberg, Institutional Books, FineWeb |
| **T1** | Everyday numeric language: blocks formal math only | 1,216 | 314 GB | ~66B | allenai/c4 (English) |
| **T2** | Full math content: unfiltered | 751 | 580 GB | ~121B | HuggingFaceTB/finemath (finemath-3plus) |
| **Total** | | **2,615** | **1,437 GB** | **~300B** | |

## Format

Each tier is stored as sharded JSONL files: `T0/T0_0000.jsonl`, `T1/T1_0000.jsonl`, `T2/T2_0000.jsonl`, etc.

Each line is a JSON object with:

```json
{"text": "...", "source": "english-pd", "token_estimate": 1234}
```

- `text`: The filtered document text
- `source`: Origin dataset identifier
- `token_estimate`: Approximate whitespace-split token count

## Filtering

All tiers use sentence-level filtering: documents are split into sentences (NLTK punkt), individual sentences matching the blocklist are removed, and remaining sentences are rejoined. This preserves more text than paragraph-level filtering.

### T0 Blocklist (aggressive: removes all numeric content)

- **Digits**: All characters 0-9
- **Operators**: `+ - * / = ^ % < >` and Unicode math symbols
- **Fraction characters**: `½ ¼ ¾` etc.
- **Number words**: zero through trillion, ordinals (first–twelfth), once/twice/thrice, half/quarter/double/triple/dozen
- **Math terms**: equation, variable, polynomial, derivative, integral, theorem, eigenvalue, topology, etc.
- **Patterns**: LaTeX math (`$...$`, `\frac{}`, `\sum`, `\int`, etc.)

### T1 Blocklist (moderate: removes formal math only)

- **No digit or operator blocking**: everyday numbers pass through
- **Math terms**: equation, variable, polynomial, derivative, integral, theorem, eigenvalue, topology, etc.
- **Patterns**: LaTeX math expressions

### T2 Blocklist

None. All content from finemath-3plus is included.

## Intended Use

This corpus supports a pretraining experiment with the following design:

1. **Base model**: Train from scratch on T0 (pure narrative) for 60B tokens
2. **Model 0**: Continue base on T0 (held-out shards) for 20B tokens
3. **Model 1**: Continue base on T1 (everyday numeric) for 20B tokens
4. **Model 2**: Continue base on T2 (full math) for 20B tokens

Comparing Models 0/1/2 isolates the effect of mathematical content exposure during the second training phase, controlling for total compute and training procedure.

## Sources

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (book subset)
- [PleIAs/English-PD](https://huggingface.co/datasets/PleIAs/English-PD)
- [manu/project_gutenberg](https://huggingface.co/datasets/manu/project_gutenberg)
- [institutional/institutional-books-1.0](https://huggingface.co/datasets/institutional/institutional-books-1.0)
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- [allenai/c4](https://huggingface.co/datasets/allenai/c4)
- [HuggingFaceTB/finemath](https://huggingface.co/datasets/HuggingFaceTB/finemath) (finemath-3plus config)
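
## Example: sentence-level filtering sketch

The sentence-level filtering described under Filtering (split into sentences, drop those matching the tier's blocklist, rejoin the rest) might look roughly like the sketch below. This is not the dataset's actual pipeline. It assumes a naive regex splitter in place of NLTK punkt, heavily abbreviated blocklists, and hypothetical names (`BLOCKLISTS`, `filter_document`). It also applies the operator list literally, so any sentence containing a hyphen is dropped under T0, which may or may not match the real filter.

```python
import re

# Assumption: a naive stand-in for NLTK punkt. Splits on whitespace that
# follows '.', '!' or '?'. The real pipeline uses punkt instead.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Abbreviated, illustrative patterns; the real blocklists are longer.
DIGITS_OPS = r"[0-9+\-*/=^%<>½¼¾]"
NUMBER_WORDS = (
    r"\b(?:zero|one|two|three|ten|hundred|thousand|million|billion|trillion"
    r"|first|second|third|once|twice|thrice|half|quarter|double|triple|dozen)\b"
)
MATH_TERMS = (
    r"\b(?:equation|variable|polynomial|derivative|integral"
    r"|theorem|eigenvalue|topology)\b"
)
LATEX = r"\$[^$]+\$|\\frac\b|\\sum\b|\\int\b"

BLOCKLISTS = {
    # T0: digits, operators, number words, math terms, and LaTeX.
    "T0": re.compile("|".join([DIGITS_OPS, NUMBER_WORDS, MATH_TERMS, LATEX]), re.IGNORECASE),
    # T1: formal math only; everyday digits and operators pass through.
    "T1": re.compile("|".join([MATH_TERMS, LATEX]), re.IGNORECASE),
    # T2: no blocklist.
    "T2": None,
}


def filter_document(text: str, tier: str) -> str:
    """Drop every sentence that matches the tier's blocklist, then rejoin."""
    pattern = BLOCKLISTS[tier]
    if pattern is None:
        return text
    sentences = SENTENCE_SPLIT.split(text.strip())
    kept = [s for s in sentences if not pattern.search(s)]
    return " ".join(kept)


doc = "She walked home. He bought 3 apples. The theorem was proved."
print(filter_document(doc, "T0"))  # She walked home.
print(filter_document(doc, "T1"))  # She walked home. He bought 3 apples.
print(filter_document(doc, "T2"))  # unchanged
```

Because filtering happens per sentence, a single numeric sentence removes only itself rather than the whole paragraph or document around it.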

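
## Example: reading records and summing token estimates

The record layout described under Format (one JSON object per line with `text`, `source`, and `token_estimate`) can be read with the standard library alone. The sketch below is illustrative only. It assumes `token_estimate` is computed as `len(text.split())`, which is one reading of "approximate whitespace-split token count". The helper names (`make_record`, `iter_records`, `total_tokens`) and the `"gutenberg"` source identifier are hypothetical. An in-memory buffer stands in for a shard such as `T0/T0_0000.jsonl`.

```python
import io
import json
from typing import Iterable, Iterator


def make_record(text: str, source: str) -> dict:
    """Build one record; token_estimate is assumed to be a whitespace split count."""
    return {"text": text, "source": source, "token_estimate": len(text.split())}


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_tokens(lines: Iterable[str]) -> int:
    """Sum token_estimate over all records in a shard."""
    return sum(rec["token_estimate"] for rec in iter_records(lines))


# In-memory stand-in for a shard file; with real data, pass an open file handle.
shard = io.StringIO(
    json.dumps(make_record("She walked home along the river.", "english-pd")) + "\n"
    + json.dumps(make_record("The garden was quiet.", "gutenberg")) + "\n"
)
print(total_tokens(shard))  # 10
```

Summing `token_estimate` across shards is one way to pick out enough shards for a token budget such as the 60B and 20B phases listed under Intended Use, keeping in mind that whitespace counts only approximate a model tokenizer's counts.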
