annakosovskaia/NuminaMath-1.5-RL-Verifiable-cleaned
Hugging Face · Updated 2026-05-18 · Indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/annakosovskaia/NuminaMath-1.5-RL-Verifiable-cleaned
Resource description:
---
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: clean
  data_files:
  - split: train
    path: clean/train-*
dataset_info:
- config_name: all
  features:
  - name: problem
    dtype: large_string
  - name: solution
    dtype: large_string
  - name: problem_type
    dtype: large_string
  - name: question_type
    dtype: large_string
  - name: problem_is_valid
    dtype: large_string
  - name: solution_is_valid
    dtype: large_string
  - name: source
    dtype: large_string
  - name: synthetic
    dtype: bool
  - name: is_verifiable_final_answer_task
    dtype: bool
  - name: is_coherent_solution
    dtype: bool
  - name: is_complete
    dtype: bool
  - name: has_final_answer
    dtype: bool
  - name: confidence
    dtype: large_string
  - name: validation_raw
    dtype: large_string
  - name: answer
    dtype: large_string
  - name: problem_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 161718940
    num_examples: 100050
  download_size: 63184810
  dataset_size: 161718940
- config_name: clean
  features:
  - name: problem
    dtype: large_string
  - name: solution
    dtype: large_string
  - name: problem_type
    dtype: large_string
  - name: question_type
    dtype: large_string
  - name: problem_is_valid
    dtype: large_string
  - name: solution_is_valid
    dtype: large_string
  - name: source
    dtype: large_string
  - name: synthetic
    dtype: bool
  - name: is_verifiable_final_answer_task
    dtype: bool
  - name: is_coherent_solution
    dtype: bool
  - name: is_complete
    dtype: bool
  - name: has_final_answer
    dtype: bool
  - name: confidence
    dtype: large_string
  - name: validation_raw
    dtype: large_string
  - name: answer
    dtype: large_string
  - name: problem_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 132728968
    num_examples: 81147
  download_size: 51808498
  dataset_size: 132728968
---
# NuminaMath-1.5-RL-Verifiable (cleaned)
A cleaned and validated subset of [`nlile/NuminaMath-1.5-RL-Verifiable`](https://huggingface.co/datasets/nlile/NuminaMath-1.5-RL-Verifiable).
Final answers were LLM-validated and re-extracted for use in math RL / SFT pipelines.
## What changed
Starting from `nlile/NuminaMath-1.5-RL-Verifiable` (131,063 rows), we applied:
### 1. Regex/structural cleanup (`prepare_numina.py`)
**Stripped problem-number prefixes:**
- `Problem 3.`, `Problem A2.`, `Problem N1.`, `Problem 15`
- `Task 1.`, `Task A-1.1.`, `Task Condition`
- `## Problem ...`, `## Task ...`, `## Aufgabe 1`, `## Zadatak B-1.1.`, `## Subject I`, `## Exercise 5`, `## Condition of the problem`
- `# 15.`, `# 6.1. Condition:`, `# Task 4. (10 points)`
- `A3.`, `B2.`, `NT1`, `NT12`, `1A.`, `2B.`
- `A 1.`, `NT 3.` (letter+space+digit)
- `96.2.`, `03.4.` (multi-level numbers)
- `XXXVIII OM - II - Zadanie 4`, `LIV OM - II - Task 3`, `L OM - I - Problem 8` (competition headers)
- `[4 points]`, `(7 points)`, `II. (5 points)`
- `(Option 1)`, `[u]Round 5[/u]`
- Topic tags: `[ Arithmetic. Mental calculation, etc.]`, `[ Decimal numeral system ]`
- 3-letter country codes followed by newline: `MLD`, `ALB`, `SAU`, `EST-`
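To make the prefix families above concrete, here is a minimal sketch of how such stripping could be done with regular expressions. It is not the contents of `prepare_numina.py`, which this card does not show; the function name `strip_problem_prefix` and the four patterns are hypothetical and cover only a few of the listed cases.

```python
import re

# Hypothetical patterns for a subset of the prefix families listed above.
# The real prepare_numina.py is not reproduced here and may differ.
_PROBLEM_PREFIXES = [
    # Whole header lines such as "## Task A-1.1." or "# Task 4. (10 points)"
    r"#{1,2}\s*(?:Problem|Task|Aufgabe|Zadatak|Exercise)\b[^\n]*\n",
    # "Problem 3.", "Problem A2.", "Task 1.", "Task A-1.1."
    r"(?:Problem|Task)\s+[A-Z]?-?\d+(?:\.\d+)*\.?\s*",
    # "[4 points]", "(7 points)"
    r"[\[(]\d+\s+points?[\])]\s*",
    # "A3.", "NT 3.", "1A." (the trailing dot is required so that text like
    # "A 5 by 5 grid" is left alone)
    r"(?:[A-Z]{1,2}\s?\d{1,2}|\d[A-Z])\.\s+",
]
_PREFIX_RE = re.compile(r"^\s*(?:" + "|".join(_PROBLEM_PREFIXES) + r")")


def strip_problem_prefix(problem: str) -> str:
    """Remove leading numbering/header prefixes until none is left."""
    previous = None
    while previous != problem:
        previous = problem
        problem = _PREFIX_RE.sub("", problem, count=1)
    return problem.strip()
```

Looping until the text stops changing handles stacked prefixes such as `A3. [4 points] ...` without needing one pattern per combination.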
**Stripped solution prefixes:**
- `Solution.`, `Solution 1.`, `Solution:`, `SOLUTION.`
- `# Solution.`, `## Solution`, `## Solution 1:`, `1. Solution.`, `2. Solution.`
- `[Solution]`, `【Solution】` (bracketed markers)
- `Answer: ...` / `Answer N.` prefix at start (to avoid teaching answer-first generation)
- `22. Answer: 13\nSolution. ...` (problem-number + leaked answer + Solution chain)
**Stripped solution inline / trailing markers:**
- `Detailed Explanation:`, `Detailed Solution:`
- Repeated translation artifacts (`Certainly, here is the translation: ---`)
- Trailing citation tails (e.g. `Kuznetsov Differentiation Problem 17-10`)
- Trailing `Answer: ...` line at end of solution (redundant — answer is in `answer` column)
- Trailing grading rubrics (`Evaluation Criteria:`, `Award N points`)
**Dropped rows** where the solution was shorter than 30 characters after cleanup (these were "answer-only solutions" with no reasoning content).
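The solution-side cleanup and the 30-character cutoff can be sketched the same way. The regexes and the name `clean_solution` below are assumptions for illustration and handle only a leading `Solution` marker and a trailing `Answer: ...` line, not every marker listed above.

```python
import re
from typing import Optional

MIN_SOLUTION_CHARS = 30  # cutoff stated in the text above

# Hypothetical: optional "1. " and "## ", an optional opening bracket, the
# word "Solution", an optional number, then a delimiter (., :, closing
# bracket, or end of line). Requiring the delimiter keeps sentences such as
# "Solution of the equation ..." intact.
_LEADING = re.compile(
    r"^\s*(?:\d+\.\s*)?(?:#{1,2}\s*)?[\[【]?(?:Solution|SOLUTION)"
    r"(?:\s+\d+)?(?:[.:\]】]|\s*\n)\s*"
)
# Hypothetical: a final line of the form "Answer: ..." or "Answer. ..."
_TRAILING_ANSWER = re.compile(r"\n\s*Answer\s*[:.][^\n]*\s*$")


def clean_solution(solution: str) -> Optional[str]:
    """Strip the markers; return None if too little text remains."""
    text = _LEADING.sub("", solution, count=1)
    text = _TRAILING_ANSWER.sub("", text).strip()
    return text if len(text) >= MIN_SOLUTION_CHARS else None
```

Returning `None` lets a caller drop the row in the same pass that cleans it.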
**Dropped rows with missing-image references:** `[asy]`, `\includegraphics`, `Fig.`, `Figure N`, `as shown in the figure`, `see diagram`, `in the diagram above`, image filenames (`*.jpg`, `*.png`, ...), xy-pic diagram commands (`\spos`, `\xymatrix`).
**Dropped rows** where the `problem` field actually contained a solution/answer (e.g. starts with `Solution.` or `Answer:`).
**Dropped multi-part problems** (`a) ... b) ...`, `1) ... 2) ...`) — answer extraction is unreliable for these.
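The three drop rules above amount to a row-level predicate. The sketch below is one possible shape for it, with hypothetical names and a reduced marker list; it also assumes that image references are checked in both the problem and the solution, which the card does not state.

```python
import re

# Hypothetical detectors covering a subset of the markers listed above.
_IMAGE_REF = re.compile(
    r"\[asy\]|\\includegraphics|\bFig\.|\bFigure\s+\d+|as shown in the figure"
    r"|see diagram|in the diagram above|\.(?:jpg|jpeg|png|gif)\b|\\xymatrix",
    re.IGNORECASE,
)
# Problem text that actually starts with a solution or an answer.
_LEAKED = re.compile(r"^\s*(?:Solution\.|Answer:)")
# "a) ... b) ..." or "1) ... 2) ..."; whitespace is required before the label,
# so parenthesised choices like "(a)" do not count.
_MULTIPART = re.compile(
    r"(?:^|\s)a\)\s.*?\sb\)\s|(?:^|\s)1\)\s.*?\s2\)\s", re.DOTALL
)


def should_drop(problem: str, solution: str) -> bool:
    """True if the row hits any of the three drop rules."""
    return bool(
        _IMAGE_REF.search(problem)
        or _IMAGE_REF.search(solution)  # assumption: solutions are checked too
        or _LEAKED.match(problem)
        or _MULTIPART.search(problem)
    )
```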
### 2. LLM-based quality validation (Qwen3-32B)
Each row was scored on:
- `is_verifiable_final_answer_task` — not a "prove that..." task
- `is_coherent_solution` — internally consistent
- `is_complete` — no truncation, no dangling references, no missing announced steps
- `has_final_answer` — explicit final answer stated
- `confidence` — `high` / `medium` / `low`
<details>
<summary>Validation prompt</summary>
```
You are evaluating a math competition solution for dataset quality.
You will be given a PROBLEM and a SOLUTION. Evaluate the solution and return a JSON object.
Return ONLY a JSON object with these fields:
{
"is_verifiable_final_answer_task": <bool>,
"is_coherent_solution": <bool>,
"is_complete": <bool>,
"has_final_answer": <bool>,
"confidence": "high" | "medium" | "low"
}
Field definitions:
- is_verifiable_final_answer_task: the problem asks for a specific answer (value, set, expression, count) that can be verified — NOT a pure "prove that..." or "show that..." task
- is_coherent_solution: the solution steps follow logically from the problem; internally consistent and not self-contradictory
- is_complete: no truncation mid-sentence or mid-equation; no announced steps/equations/expressions that are then absent ("we get:", "substituting:", followed by nothing); no references to figures or diagrams not present in the text; no dangling references like "from equation (2)" where equation (2) was never shown; argument reaches a conclusion
- has_final_answer: the solution explicitly states a final answer (boxed answer, "the answer is X", "therefore X = ...") — false if it ends without a clearly stated result
- confidence: high = easy to assess, criteria are clear-cut; medium = one or two criteria are borderline; low = hard to assess overall
No explanation, no markdown, just the JSON object.
```
</details>
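Since the prompt asks for a bare JSON object, the raw model output (kept in `validation_raw`) still has to be parsed and type-checked before it can fill the boolean columns. The following is a minimal sketch of such a parser; `parse_validation` is a hypothetical helper, not the code used to build this dataset.

```python
import json
import re
from typing import Optional

BOOL_FIELDS = (
    "is_verifiable_final_answer_task",
    "is_coherent_solution",
    "is_complete",
    "has_final_answer",
)
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def parse_validation(raw: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' and validate it."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Every boolean field must be present and be a real JSON boolean.
    if not all(isinstance(obj.get(key), bool) for key in BOOL_FIELDS):
        return None
    confidence = obj.get("confidence")
    if not isinstance(confidence, str) or confidence not in CONFIDENCE_LEVELS:
        return None
    return {**{key: obj[key] for key in BOOL_FIELDS}, "confidence": confidence}
```

Returning `None` on any malformed output makes it easy to retry or discard a row instead of silently storing partial scores.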
### 3. Multi-form answer re-extraction (Qwen3-32B → Qwen3-235B)
The original `answer` field was often incorrectly extracted, so we re-extracted it from scratch. The final answer was taken directly from the solution in 1-3 semantically equivalent forms (e.g. `["x \in \{1, 3\}", "\{1, 3\}"]`). The first pass was run by Qwen3-32B; its output was then verified and corrected by Qwen3-235B-A22B-Instruct-2507.
<details>
<summary>Extraction prompt (Qwen3-32B, first pass)</summary>
```
Return the final answer of this math solution. Provide 1-3 SEMANTICALLY different forms (skip trivial syntactic rewrites — those are normalized downstream).
Useful variants:
• multiple choice: "a)" and "True"
• parametric: "x = k\pi, k \in \mathbb{Z}" and "\{k\pi : k \in \mathbb{Z}\}"
• pair vs two values: "(2, 3)" and "2, 3"
• binomial: "\binom{n}{k}" and "C_n^k" and "C(n,k)"
Do NOT add variants for: \frac vs \dfrac, \sqrt{3} vs \sqrt 3, 0.5 vs 1/2, \left( vs (, \text{abc} vs abc, x=5 vs 5, \{1,3\} vs 1,3, \{(2,3)\} vs (2,3) — these are auto-normalized.
Rules: preserve LaTeX as in the solution. No "Answer:", no \boxed{}, no explanations. One form per line. If no answer: NONE
PROBLEM:
{problem}
SOLUTION:
{solution}
```
</details>
<details>
<summary>Verification + correction prompt (Qwen3-235B, second pass)</summary>
```
You are correcting extracted answer variants for a math problem. The variants below were produced by a weaker model and may contain mistakes. Look at PROBLEM and SOLUTION, decide what the true final answer is, then output a clean list of variants — at most 3 SEMANTICALLY EQUIVALENT forms of that single true answer.
Strict rules:
1. If a list/set is a SINGLE answer (e.g. "n can be 12, 14, 18, 22, or 32"), output it as ONE variant: "12, 14, 18, 22, 32" or "\{12, 14, 18, 22, 32\}". Do NOT split list members across multiple variant lines.
2. DROP every variant that is:
- prose / explanation in words (e.g. "There are 33 even numbers", "Total of 27 solutions")
- a wrong / different numerical value not supported by the solution
- an alternative hypothesis the model wasn't sure about
- a sub-result, intermediate step, or unrelated object (e.g. listing example values when the answer is a count)
- a trivial syntactic rewrite (\frac vs \dfrac, 0.5 vs 1/2, \{1,3\} vs 1,3 — auto-normalized)
3. KEEP variants that are SEMANTICALLY DIFFERENT FORMS of the same answer:
- "(2, 3)" and "2, 3" (pair vs values)
- "x = k\pi, k \in \mathbb{Z}" and "\{k\pi : k \in \mathbb{Z}\}" (parametric)
- "a)" and "True" and "1" (multiple choice)
- "\binom{n}{k}" and "C_n^k" (alternate notation)
Output one variant per line. Preserve LaTeX. No explanations, no labels, no "Answer:". If no valid variants: NONE
PROBLEM:
{problem}
SOLUTION:
{solution}
CURRENT VARIANTS:
{variants}
```
</details>
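Both prompts return one variant per line, or `NONE`, while the `answer` column stores a JSON list. A conversion step along the following lines would bridge the two; `variants_to_answer` is a hypothetical name and the deduplication is an added assumption.

```python
import json
from typing import Optional

MAX_VARIANTS = 3  # the prompts allow at most three forms


def variants_to_answer(model_output: str) -> Optional[str]:
    """Turn one-variant-per-line output into a JSON list string, or None."""
    variants = []
    for line in model_output.splitlines():
        line = line.strip()
        # Skip blanks, the NONE sentinel, and exact duplicates (assumption).
        if line and line != "NONE" and line not in variants:
            variants.append(line)
    if not variants:
        return None
    return json.dumps(variants[:MAX_VARIANTS])
```

`json.dumps` takes care of escaping the LaTeX backslashes, so `json.loads` on the stored string returns the variants unchanged.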
## Schema
| Column | Type | Description |
|---|---|---|
| `problem` | str | Cleaned problem statement |
| `solution` | str | Cleaned solution / chain of thought |
| `answer` | str (JSON list) | 1-3 equivalent answer forms; `None` if not extractable |
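| `problem_type`, `question_type`, `problem_is_valid`, `solution_is_valid`, `source`, `synthetic` | as in source | Inherited from the source dataset |
| `is_verifiable_final_answer_task` | bool | Problem asks for a checkable answer, not a "prove that..." task |
| `is_coherent_solution` | bool | Solution is internally consistent |
| `is_complete` | bool | No truncation, dangling references, or missing announced steps |
| `has_final_answer` | bool | Solution explicitly states a final answer |
| `confidence` | str | `high` / `medium` / `low` |
| `validation_raw` | str | Raw LLM output from the validation step (debugging) |
| `problem_id` | int | Integer problem identifier |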
## Stats
- **Total rows (`all` config):** 100,050 (down from 131,063)
- **Rows in the `clean` config:** 81,147
- **With canonical `answer`:** 96,049 (96.0% of the `all` config)
## Usage
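```python
from datasets import load_dataset
import json

# The repo defines two configs ("all" and "clean"), so pass one explicitly.
ds = load_dataset("annakosovskaia/NuminaMath-1.5-RL-Verifiable-cleaned", "all", split="train")
row = ds[0]
# `answer` is a JSON-encoded list of equivalent forms, or None if not extractable.
variants = json.loads(row["answer"]) if row["answer"] is not None else []
```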
For RL/SFT, filter to high-quality rows:
```python
ds = ds.filter(
    lambda x: x["is_verifiable_final_answer_task"]
    and x["is_coherent_solution"]
    and x["is_complete"]
    and x["has_final_answer"]
    and x["confidence"] in {"high", "medium"}
    and x["answer"] is not None
)
```
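In an RL loop, the multiple answer forms are typically used as alternative targets for a reward check. The sketch below shows one simple way to do that. The normalization is a rough stand-in, since the normalizer mentioned in the prompts is not described in this card, and `matches_any` is a hypothetical helper.

```python
import json
import re


def _normalize(ans: str) -> str:
    """Rough normalization (assumption): unwrap a boxed answer, drop the
    left/right sizing commands, map dfrac to frac, remove whitespace."""
    ans = ans.strip()
    boxed = re.fullmatch(r"\\boxed\{(.*)\}", ans, re.DOTALL)
    if boxed:
        ans = boxed.group(1)
    # The lookahead keeps commands such as \leftarrow untouched.
    ans = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", ans)
    ans = ans.replace("\\dfrac", "\\frac")
    return re.sub(r"\s+", "", ans)


def matches_any(prediction: str, answer_json) -> bool:
    """True if the prediction equals any stored variant after normalization."""
    if answer_json is None:  # rows without an extractable answer
        return False
    variants = json.loads(answer_json) or []
    pred = _normalize(prediction)
    return any(pred == _normalize(variant) for variant in variants)
```

For example, `matches_any(model_answer, row["answer"])` could serve as a binary reward; a symbolic equivalence checker would be a stricter replacement for the string comparison used here.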
## License
Same as the source dataset.
Provider:
annakosovskaia



