devvrit/polaris_filtered_nemotron_easy_math_verifiable
收藏Hugging Face2026-03-02 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/devvrit/polaris_filtered_nemotron_easy_math_verifiable
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: reasoning_content
dtype: string
- name: role
dtype: string
- name: expected_answer
dtype: string
splits:
- name: train
num_bytes: 1130496842
num_examples: 268631
download_size: 590666190
dataset_size: 1130496842
license: cc-by-4.0
task_categories:
- text-generation
tags:
- math
- reasoning
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# Polaris-Filtered Nemotron Easy Math (Verifiable)
A filtered subset of [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) (`low` / easy split), retaining only non-TIR samples with verifiable boxed answers.
## Filtering Pipeline
1. **Remove TIR / tool-use samples** — drop any sample that contains Python code blocks (`\`\`\`python`, `<|python_start|>`, `<tool_call>`) or has a non-empty `tools`/`tool` field.
2. **Polaris decontamination** — remove samples whose user prompt shares any 15-gram overlap with problems in [POLARIS-Project/Polaris-Dataset-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K).
3. **Answer verification** — keep only samples where the `\boxed{}` answer in the assistant response matches the `expected_answer` field (verified via [math-verify](https://pypi.org/project/math-verify/)).
## Dataset Statistics
| Metric | Value |
|--------|------:|
| Total samples | 268,631 |
| Min tokens | 99 |
| Max tokens | 31,802 |
| Mean tokens | 1,561.8 |
| Median tokens | 1,416.0 |
| P25 | 972 |
| P75 | 1,975 |
| P90 | 2,647 |
| P95 | 3,118 |
| P99 | 4,186 |
*Token counts measured with [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (full conversation including reasoning + content).*
## Sequence Length Distribution
| Bucket | Count | Percentage |
|--------|------:|----------:|
| <2k | 203,591 | 75.8% |
| 2k-4k | 61,494 | 22.9% |
| 4k-6k | 3,281 | 1.2% |
| 6k-8k | 229 | 0.1% |
| 8k-10k | 30 | 0.0% |
| 10k-12k | 5 | 0.0% |
| 12k-14k | 0 | 0.0% |
| 14k-16k | 0 | 0.0% |
| 16k-18k | 0 | 0.0% |
| >18k | 1 | 0.0% |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("devvrit/polaris_filtered_nemotron_easy_math_verifiable", split="train")
```
## Source
- **Base dataset:** [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) — `data/low.jsonl`
- **Decontamination reference:** [POLARIS-Project/Polaris-Dataset-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K)
提供机构:
devvrit



