apat1n/UltraData-Math-filtered
Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/apat1n/UltraData-Math-filtered
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- math
- grpo
- rl
- filtered
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: problem
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_examples: 4318243
---
# UltraData-Math-filtered (v5)
Filtered subset of [UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math) for GRPO (Group Relative Policy Optimization) training on math problems.
**Original dataset**: [UltraData/UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math) (18.6M samples)
## Filtering Pipeline (v5)
Original: **18.6M** samples -> Filtered: **4.3M** samples (4,318,243 rows, about 23% kept)
### Filters Applied
1. **Answer cleaning**: Strip LaTeX delimiters, trailing junk, degree symbols
2. **Answer validation**: Must contain digits, <=30 chars, no variables, no multi-value, no units, no prose, no broken LaTeX
3. **LaTeX verification**: Answers must parse with sympy's `parse_latex`, i.e. be valid mathematical expressions
4. **Problem filtering**: No multi-part, no proofs, no sketching, no rewrite tasks, length 30-2000 chars
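
The card lists the filter rules but not their implementation, so the following is only a minimal sketch of how such a pipeline could look. Every function name, regular expression, and keyword list here is an assumption made for illustration; only the thresholds (30 characters for answers, 30 to 2000 for problems) and the use of sympy's `parse_latex` come from the list above.

```python
import re
from typing import Optional

MAX_ANSWER_CHARS = 30                      # from the card: answers <=30 chars
MIN_PROBLEM_CHARS, MAX_PROBLEM_CHARS = 30, 2000

# Hypothetical keyword list; the real rejection patterns are not published here.
_REJECT_PROBLEM_RE = re.compile(
    r"\b(prove|show that|sketch|rewrite)\b|\(\s*[a-b]\s*\)", re.IGNORECASE
)

def clean_answer(raw: str) -> str:
    """Filter 1: strip trailing junk, LaTeX delimiters, and degree symbols."""
    s = raw.strip().rstrip(" .,;:")
    for left, right in (("$$", "$$"), ("$", "$"), (r"\(", r"\)"), (r"\[", r"\]")):
        if s.startswith(left) and s.endswith(right) and len(s) >= len(left) + len(right):
            s = s[len(left):len(s) - len(right)].strip()
            break
    s = re.sub(r"\^\s*\{?\s*\\circ\s*\}?|°", "", s)
    return s.rstrip(" .,;:")

def is_valid_answer(ans: str) -> bool:
    """Filter 2: digits required, short, single-valued, no stray letters."""
    if not ans or len(ans) > MAX_ANSWER_CHARS:
        return False
    if not re.search(r"\d", ans):
        return False
    if ans.count("{") != ans.count("}"):       # crude broken-LaTeX check
        return False
    if re.search(r"[,;]|\\pm", ans):           # assumed multi-value markers
        return False
    # After removing LaTeX commands, any leftover letter is treated as a
    # variable, unit, or prose.
    without_commands = re.sub(r"\\[A-Za-z]+", "", ans)
    return not re.search(r"[A-Za-z]", without_commands)

def parses_as_latex(ans: str) -> bool:
    """Filter 3: needs sympy and the antlr4 runtime; ImportError propagates."""
    from sympy.parsing.latex import parse_latex
    try:
        parse_latex(ans)
        return True
    except ImportError:
        raise
    except Exception:
        return False

def is_valid_problem(problem: str) -> bool:
    """Filter 4: length bounds plus keyword-based task rejection."""
    if not MIN_PROBLEM_CHARS <= len(problem) <= MAX_PROBLEM_CHARS:
        return False
    return _REJECT_PROBLEM_RE.search(problem) is None

def keep_example(problem: str, raw_answer: str, check_latex: bool = False) -> Optional[str]:
    """Return the cleaned answer if the row passes every filter, else None."""
    ans = clean_answer(raw_answer)
    if not is_valid_problem(problem) or not is_valid_answer(ans):
        return None
    if check_latex and not parses_as_latex(ans):
        return None
    return ans
```

Running the cheap regex checks before the `parse_latex` call keeps the expensive parser off rows that would be rejected anyway; whether the actual pipeline orders its steps this way is not stated.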
### Dataset Format
| Column | Type | Description |
|--------|------|-------------|
| problem | str | Math problem text |
| answer | str | Clean single-value answer (number, fraction, or LaTeX expression) |
### Usage
```python
from datasets import load_dataset
ds = load_dataset("apat1n/UltraData-Math-filtered")
```
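
The card does not say how the `answer` column is meant to be scored during GRPO training. As one possible illustration, the sketch below pairs a simple exact-match reward against the reference answer with the group-relative advantage normalization that GRPO is named for. The function names, the exact-match rule, and the choice of population standard deviation are assumptions, not part of this dataset or any particular training library.

```python
from statistics import mean, pstdev
from typing import List

def exact_match_reward(predicted: str, reference: str) -> float:
    """1.0 if the model's final answer equals the dataset answer, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled completions for a problem whose reference answer is "148".
reference = "148"
predictions = ["148", "150", " 148 ", "12"]
rewards = [exact_match_reward(p, reference) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the answers are short single values, string comparison is a workable first reward; a symbolic equivalence check would be more forgiving of formatting differences such as `0.5` versus `\frac{1}{2}`.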
### Filter Version History
| Version | Samples | Key Changes |
|---------|---------|-------------|
| v1 | 13.8M | Basic answer cleaning |
| v2 | 8.5M | Variable rejection, unit filtering |
| v3 | 7.1M | Degree symbol handling, tighter patterns |
| v4 | 6.4M | Problem-type filtering (reject proofs, multi-part) |
| v5 | 4.3M | sympy LaTeX validation, stricter problem filters |
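
To see how much each version removes, the sample counts in the table can be turned into retention rates. This is a small sketch using the rounded counts from the table (in millions) and the 18.6M original size; the helper name `retention_table` is made up for illustration, and the results are only as precise as the rounded inputs.

```python
from typing import Dict, List, Tuple

ORIGINAL_M = 18.6  # original dataset size in millions of samples
VERSIONS_M = {"v1": 13.8, "v2": 8.5, "v3": 7.1, "v4": 6.4, "v5": 4.3}

def retention_table(original: float, versions: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (version, share of original kept, share of previous version kept)."""
    rows = []
    previous = original
    for name, count in versions.items():
        rows.append((name, count / original, count / previous))
        previous = count
    return rows

for name, vs_original, vs_previous in retention_table(ORIGINAL_M, VERSIONS_M):
    print(f"{name}: {vs_original:.1%} of original, {vs_previous:.1%} of previous version")
```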
### Credits
Based on [UltraData/UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math).
Provider:
apat1n



