devvrit/polaris_filtered_nemotron_easy_math_verifiable

Name: devvrit/polaris_filtered_nemotron_easy_math_verifiable
Creator: devvrit
Published: 2026-03-02 23:44:17
License: 暂无描述

Hugging Face2026-03-02 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/devvrit/polaris_filtered_nemotron_easy_math_verifiable

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: messages list: - name: content dtype: string - name: reasoning_content dtype: string - name: role dtype: string - name: expected_answer dtype: string splits: - name: train num_bytes: 1130496842 num_examples: 268631 download_size: 590666190 dataset_size: 1130496842 license: cc-by-4.0 task_categories: - text-generation tags: - math - reasoning configs: - config_name: default data_files: - split: train path: data/train-* --- # Polaris-Filtered Nemotron Easy Math (Verifiable) A filtered subset of [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) (`low` / easy split), retaining only non-TIR samples with verifiable boxed answers. ## Filtering Pipeline 1. **Remove TIR / tool-use samples** — drop any sample that contains Python code blocks (`\`\`\`python`, `<|python_start|>`, `<tool_call>`) or has a non-empty `tools`/`tool` field. 2. **Polaris decontamination** — remove samples whose user prompt shares any 15-gram overlap with problems in [POLARIS-Project/Polaris-Dataset-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K). 3. **Answer verification** — keep only samples where the `\boxed{}` answer in the assistant response matches the `expected_answer` field (verified via [math-verify](https://pypi.org/project/math-verify/)). ## Dataset Statistics | Metric | Value | |--------|------:| | Total samples | 268,631 | | Min tokens | 99 | | Max tokens | 31,802 | | Mean tokens | 1,561.8 | | Median tokens | 1,416.0 | | P25 | 972 | | P75 | 1,975 | | P90 | 2,647 | | P95 | 3,118 | | P99 | 4,186 | *Token counts measured with [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (full conversation including reasoning + content).* ## Sequence Length Distribution | Bucket | Count | Percentage | |--------|------:|----------:| | <2k | 203,591 | 75.8% | | 2k-4k | 61,494 | 22.9% | | 4k-6k | 3,281 | 1.2% | | 6k-8k | 229 | 0.1% | | 8k-10k | 30 | 0.0% | | 10k-12k | 5 | 0.0% | | 12k-14k | 0 | 0.0% | | 14k-16k | 0 | 0.0% | | 16k-18k | 0 | 0.0% | | >18k | 1 | 0.0% | ## Usage ```python from datasets import load_dataset ds = load_dataset("devvrit/polaris_filtered_nemotron_easy_math_verifiable", split="train") ``` ## Source - **Base dataset:** [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) — `data/low.jsonl` - **Decontamination reference:** [POLARIS-Project/Polaris-Dataset-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K)

提供机构：

devvrit

5,000+

优质数据集

54 个

任务类型

进入经典数据集