
apat1n/UltraData-Math-boxed

Source: Hugging Face. Updated 2026-03-30. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/apat1n/UltraData-Math-boxed
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- math
- grpo
- rl
- boxed
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files: data/train-*
  default: true
- config_name: normalized
  data_files: normalized/train-*
---

# UltraData-Math-boxed

High-confidence subset of [UltraData/UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math) containing only problems with `\boxed{}` answers.

**Original dataset**: [UltraData/UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math) (53M samples across 2 configs)

## Variants

| Config | Samples | Description |
|--------|---------|-------------|
| **default** | 390,082 | Raw `\boxed{}` extractions, minimal cleanup |
| **normalized** | 217,049 | Filtered + validated: no units, no variables, no multi-value, sympy LaTeX verified |

## Why boxed-only?

UltraData-Math is synthetic textbook data where only ~2-7% of solutions contain `\boxed{}` final answers. The remaining ~93% require regex extraction from solution text (e.g. grabbing the last `= something`), which produces ~1-5% wrong ground truth labels.

For RL training (GRPO), incorrect ground truth creates false learning signals. This dataset keeps only the high-confidence `\boxed{}` extractions for clean reward signal.

## Normalized Variant Filters

1. **Answer cleaning**: Strip LaTeX delimiters, trailing junk, degree symbols
2. **Answer validation**: Must contain digits, <=30 chars, no variables, no multi-value, no units, no prose, no broken LaTeX
3. **LaTeX verification** via sympy `parse_latex`
4. **Problem filtering**: No multi-part, no proofs, no sketching, no rewrite tasks, length 30-2000 chars

## Stats

| Source Config | Total Rows | With boxed | Rate |
|---|---|---|---|
| L3-Textbook-Exercise-Synthetic | 26M | ~411k | 1.6% |
| L3-QA-Synthetic | 27M | ~1.87M | 6.9% |
| **After dedup** | **53M** | **390k** | **0.7%** |
| **After normalization** | - | **217k** | - |

## Dataset Format

| Column | Type | Description |
|--------|------|-------------|
| problem | str | Math problem text |
| answer | str | Answer extracted from `\boxed{}` in the solution |

## Usage

```python
from datasets import load_dataset

# Raw boxed extractions (390k)
ds = load_dataset("apat1n/UltraData-Math-boxed")

# Normalized + filtered (217k)
ds = load_dataset("apat1n/UltraData-Math-boxed", "normalized")
```

## Credits

Based on [UltraData/UltraData-Math](https://huggingface.co/datasets/UltraData/UltraData-Math).
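
## Extraction sketch (illustrative)

The card describes pulling answers out of `\boxed{}`, validating them (digits present, <=30 chars, no variables or units, no multi-value), and using them as a reward signal for GRPO. The following is a minimal sketch of how such steps might look, not the pipeline actually used to build this dataset. The function names `extract_boxed`, `looks_like_clean_answer`, and `boxed_reward` are hypothetical, the letter and comma checks are crude stand-ins for the real filters, and the sympy `parse_latex` verification step is omitted.

```python
import re
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    # Hypothetical helper: return the contents of the last \boxed{...},
    # matching nested braces, or None if absent or unbalanced.
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # closing brace never found


def looks_like_clean_answer(answer: str, max_len: int = 30) -> bool:
    # Assumed approximation of the "answer validation" filter.
    if len(answer) > max_len:
        return False
    if not any(c.isdigit() for c in answer):
        return False
    if "," in answer or ";" in answer:
        return False  # crude multi-value check (also rejects "1,000")
    # Drop LaTeX commands such as \frac, then reject any leftover letters,
    # which would indicate variables, units, or prose.
    without_commands = re.sub(r"\\[a-zA-Z]+", "", answer)
    return re.search(r"[A-Za-z]", without_commands) is None


def boxed_reward(completion: str, answer: str) -> float:
    # Hypothetical exact-match reward: 1.0 if the model's boxed answer
    # equals the ground truth string, else 0.0.
    predicted = extract_boxed(completion)
    return 1.0 if predicted is not None and predicted == answer.strip() else 0.0
```

For example, `extract_boxed(r"So $x = \boxed{\frac{3}{4}}$.")` yields the string `\frac{3}{4}`, which passes `looks_like_clean_answer`, while an answer such as `5 cm` is rejected because of the unit. Exact string matching is the simplest possible reward; a real setup would likely compare answers after normalization or symbolic parsing.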
Provider:
apat1n