
swan07/math-tiers

Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/swan07/math-tiers
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- pretraining
- numerical-reasoning
- tiered-corpus
size_categories:
- n>1T
---

# Math-Tiers: A Tiered Pretraining Corpus for Studying Numerical Reasoning

A large-scale English pretraining corpus split into three tiers by mathematical content density. Designed for controlled experiments studying how data composition during pretraining affects numerical reasoning in language models.

## Tiers

| Tier | Description | Shards | Size | Est. Tokens | Sources |
|------|-------------|--------|------|-------------|---------|
| **T0** | Pure narrative: no digits, number words, or math | 648 | 542 GB | ~113B | RedPajama-Book, PleIAs/English-PD, Project Gutenberg, Institutional Books, FineWeb |
| **T1** | Everyday numeric language: blocks formal math only | 1,216 | 314 GB | ~66B | allenai/c4 (English) |
| **T2** | Full math content: unfiltered | 751 | 580 GB | ~121B | HuggingFaceTB/finemath (finemath-3plus) |
| **Total** | | **2,615** | **1,437 GB** | **~300B** | |

## Format

Each tier is stored as sharded JSONL files: `T0/T0_0000.jsonl`, `T1/T1_0000.jsonl`, `T2/T2_0000.jsonl`, etc.

Each line is a JSON object with:

```json
{"text": "...", "source": "english-pd", "token_estimate": 1234}
```

- `text`: The filtered document text
- `source`: Origin dataset identifier
- `token_estimate`: Approximate whitespace-split token count

## Filtering

All tiers use sentence-level filtering: documents are split into sentences (NLTK punkt), individual sentences matching the blocklist are removed, and remaining sentences are rejoined. This preserves more text than paragraph-level filtering.

### T0 Blocklist (aggressive: removes all numeric content)

- **Digits**: All characters 0-9
- **Operators**: `+ - * / = ^ % < >` and Unicode math symbols
- **Fraction characters**: `½ ¼ ¾` etc.
- **Number words**: zero through trillion, ordinals (first–twelfth), once/twice/thrice, half/quarter/double/triple/dozen
- **Math terms**: equation, variable, polynomial, derivative, integral, theorem, eigenvalue, topology, etc.
- **Patterns**: LaTeX math (`$...$`, `\frac{}`, `\sum`, `\int`, etc.)

### T1 Blocklist (moderate: removes formal math only)

- **No digit or operator blocking**: everyday numbers pass through
- **Math terms**: equation, variable, polynomial, derivative, integral, theorem, eigenvalue, topology, etc.
- **Patterns**: LaTeX math expressions

### T2 Blocklist

None. All content from finemath-3plus is included.

## Intended Use

This corpus supports a pretraining experiment with the following design:

1. **Base model**: Train from scratch on T0 (pure narrative) for 60B tokens
2. **Model 0**: Continue base on T0 (held-out shards) for 20B tokens
3. **Model 1**: Continue base on T1 (everyday numeric) for 20B tokens
4. **Model 2**: Continue base on T2 (full math) for 20B tokens

Comparing Models 0/1/2 isolates the effect of mathematical content exposure during the second training phase, controlling for total compute and training procedure.

## Sources

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (book subset)
- [PleIAs/English-PD](https://huggingface.co/datasets/PleIAs/English-PD)
- [manu/project_gutenberg](https://huggingface.co/datasets/manu/project_gutenberg)
- [institutional/institutional-books-1.0](https://huggingface.co/datasets/institutional/institutional-books-1.0)
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- [allenai/c4](https://huggingface.co/datasets/allenai/c4)
- [HuggingFaceTB/finemath](https://huggingface.co/datasets/HuggingFaceTB/finemath) (finemath-3plus config)
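
## Example: Sentence-Level Filtering (Sketch)

The sentence-level filtering described in the Filtering section can be illustrated with a minimal sketch. This is not the pipeline used to build the corpus. The term lists below are heavily abbreviated and hypothetical (the real lists are summarized with "etc." above), Unicode math symbols are omitted, the regular expressions are guesses at how the blocklists might be expressed, and a simple regex splitter stands in for NLTK punkt so the snippet has no external dependencies. All function and constant names are illustrative.

```python
import json
import re

# Hypothetical, abbreviated term lists; the actual blocklists are longer.
MATH_TERMS = ["equation", "variable", "polynomial", "derivative",
              "integral", "theorem", "eigenvalue", "topology"]
NUMBER_WORDS = ["zero", "one", "two", "three", "first", "second",
                "once", "twice", "half", "quarter", "dozen", "trillion"]


def _word_pattern(words):
    """Build a whole-word alternation for a list of terms."""
    return r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"


# Assumed LaTeX pattern: inline math with no space just inside the dollar
# signs (so prices such as "$5 and $10" are not mistaken for math), plus a
# few common commands.
LATEX = r"\$[^\s$](?:[^$]*[^\s$])?\$|\\(?:frac|sum|int)\b"

# T1: formal math terms and LaTeX only.
T1_BLOCK = re.compile(_word_pattern(MATH_TERMS) + "|" + LATEX, re.IGNORECASE)

# T0: additionally digits, ASCII operators, fraction characters, number words.
# Note that blocking "-" and "/" also drops sentences with hyphens or slashes.
T0_BLOCK = re.compile(
    r"[0-9]|[+\-*/=^%<>]|[½¼¾]|"
    + _word_pattern(MATH_TERMS + NUMBER_WORDS) + "|" + LATEX,
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive stand-in for NLTK punkt: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def filter_document(text, blocklist):
    """Drop every sentence that matches the blocklist, then rejoin the rest."""
    kept = [s for s in split_sentences(text) if not blocklist.search(s)]
    return " ".join(kept)


def filter_record(line, blocklist):
    """Filter one JSONL line and recompute the whitespace token estimate."""
    record = json.loads(line)
    record["text"] = filter_document(record["text"], blocklist)
    record["token_estimate"] = len(record["text"].split())
    return record


if __name__ == "__main__":
    doc = ("The river ran quietly past the mill. "
           "She bought 3 apples for the trip. "
           "Solve the equation before dinner.")
    print("T0:", filter_document(doc, T0_BLOCK))
    print("T1:", filter_document(doc, T1_BLOCK))
```

With the sample document, the T0 pattern keeps only the first sentence, while the T1 pattern also keeps the sentence containing the digit and drops only the one mentioning "equation". To follow the description more closely, `split_sentences` could be swapped for `nltk.sent_tokenize`, which requires the punkt resources to be downloaded first.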

Provider:
swan07