openeurollm/Nemotron-Post-Training-Dataset-v2-decontaminated
Hugging Face · Updated 2026-03-29 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/openeurollm/Nemotron-Post-Training-Dataset-v2-decontaminated
Resource description:
---
dataset_info:
  features:
  - name: uuid
    dtype: string
  - name: license
    dtype: string
  - name: generator
    dtype: string
  - name: version
    dtype: string
  - name: category
    dtype: string
  - name: reasoning
    dtype: string
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: chat
    num_bytes: 5970038196
    num_examples: 627641
  - name: code
    num_bytes: 978716238
    num_examples: 174689
  - name: math
    num_bytes: 506730327
    num_examples: 239151
  - name: multilingual_de
    num_bytes: 18888129433
    num_examples: 1015124
  - name: multilingual_es
    num_bytes: 16270803251
    num_examples: 935571
  - name: multilingual_fr
    num_bytes: 18229229848
    num_examples: 1001360
  - name: multilingual_it
    num_bytes: 18722006630
    num_examples: 1016377
  - name: multilingual_ja
    num_bytes: 18011107778
    num_examples: 975002
  - name: stem
    num_bytes: 807481013
    num_examples: 354942
  download_size: 98384242714
  dataset_size: 98384242714
configs:
- config_name: default
  data_files:
  - split: chat
    path: data/chat-*
  - split: code
    path: data/code-*
  - split: math
    path: data/math-*
  - split: multilingual_de
    path: data/multilingual_de-*
  - split: multilingual_es
    path: data/multilingual_es-*
  - split: multilingual_fr
    path: data/multilingual_fr-*
  - split: multilingual_it
    path: data/multilingual_it-*
  - split: multilingual_ja
    path: data/multilingual_ja-*
  - split: stem
    path: data/stem-*
license: cc-by-4.0
language:
- en
- de
- it
- fr
- es
- ja
extra_gated_fields:
  Company: text
  Institutional Email: text
decontamination:
  source_dataset: nvidia/Nemotron-Post-Training-Dataset-v2
  benchmarks:
  - path: HuggingFaceH4/MATH-500
    subset: default
    split: test
  - path: HuggingFaceH4/aime_2024
    subset: default
    split: train
  - path: math-ai/aime25
    subset: default
    split: test
  - path: math-ai/amc23
    subset: default
    split: test
  - path: daman1209arora/jeebench
    subset: default
    split: test
  - path: Idavidrein/gpqa
    subset: gpqa_diamond
    split: train
  - path: ali-elganzory/livecodebench-code_generation_lite
    subset: release_v6
    split: test
  - path: openai/openai_humaneval
    subset: openai_humaneval
    split: test
  - path: google-research-datasets/mbpp
    subset: full
    split: train+test+validation+prompt
  - path: google/IFEval
    subset: default
    split: train
  - path: tatsu-lab/alpaca_eval
    subset: alpaca_eval
    split: eval
  - path: lmarena-ai/arena-hard-auto
    subset: default
    split: train
  contamination_stats:
  - subset: default
    split: stem
    total: 355000
    removed: 58
  - subset: default
    split: chat
    total: 627720
    removed: 79
  - subset: default
    split: math
    total: 239467
    removed: 316
  - subset: default
    split: code
    total: 175000
    removed: 311
  - subset: default
    split: multilingual_ja
    total: 975202
    removed: 200
  - subset: default
    split: multilingual_de
    total: 1015314
    removed: 190
  - subset: default
    split: multilingual_it
    total: 1016503
    removed: 126
  - subset: default
    split: multilingual_es
    total: 935704
    removed: 133
  - subset: default
    split: multilingual_fr
    total: 1001504
    removed: 144
---
## Decontamination
This dataset is a decontaminated version of [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).
### Benchmarks used
- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl'])
### Decontamination settings
<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>
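As a rough illustration of how these two settings might interact, the sketch below flags a training document when at least half of a benchmark item's word-level 8-grams also occur in the document. The whitespace tokenization, the lowercasing, the direction of the ratio (share of the benchmark item's n-grams) and the helper names `ngrams` and `is_contaminated` are all assumptions made for this example; the card does not specify the exact matching procedure.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams of `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document, benchmark_item, n=8, threshold=0.5):
    """Flag `document` if it contains at least `threshold` of the benchmark item's n-grams.

    Assumption: the ratio is taken over the benchmark item's n-grams,
    not over the document's.
    """
    bench = ngrams(benchmark_item, n)
    if not bench:  # benchmark item shorter than n tokens: nothing to match
        return False
    overlap = len(bench & ngrams(document, n))
    return overlap / len(bench) >= threshold


# Made-up texts, only to exercise the functions.
benchmark = "find the sum of all positive integers n such that n squared plus one is prime"
copied = "Problem: find the sum of all positive integers n such that n squared plus one is prime. Show work."
unrelated = "write a function that reverses a linked list in place and returns the new head node"

print(is_contaminated(copied, benchmark))
print(is_contaminated(unrelated, benchmark))
```

Under these assumptions, a document is removed as soon as any single benchmark item reaches the threshold, which would explain why one document can count against several benchmarks in the per-split table.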
### Split and benchmark details
<table>
<thead>
<tr>
<th>Subset</th>
<th>Split</th>
<th>Docs in split (dataset)</th>
<th>Benchmark</th>
<th>Contaminated (dataset)</th>
<th>Contamination rate (dataset)</th>
<th>Docs (benchmark)</th>
<th>Contaminated (benchmark)</th>
<th>Contamination rate (benchmark)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="108">default</td>
<td rowspan="12">chat</td>
<td rowspan="12">627,720</td>
<td>MATH500</td>
<td>47</td>
<td>0.0075%</td>
<td>500</td>
<td>12</td>
<td>2.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>2</td>
<td>0.0003%</td>
<td>1055</td>
<td>1</td>
<td>0.0948%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>6</td>
<td>0.0010%</td>
<td>974</td>
<td>10</td>
<td>1.03%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>17</td>
<td>0.0027%</td>
<td>805</td>
<td>4</td>
<td>0.4969%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>7</td>
<td>0.0011%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">code</td>
<td rowspan="12">175,000</td>
<td>MATH500</td>
<td>70</td>
<td>0.0400%</td>
<td>500</td>
<td>7</td>
<td>1.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>56</td>
<td>0.0320%</td>
<td>1055</td>
<td>9</td>
<td>0.8531%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>1</td>
<td>0.0006%</td>
<td>164</td>
<td>3</td>
<td>1.83%</td>
</tr>
<tr>
<td>MBPP</td>
<td>170</td>
<td>0.0971%</td>
<td>974</td>
<td>30</td>
<td>3.08%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0006%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>14</td>
<td>0.0080%</td>
<td>750</td>
<td>5</td>
<td>0.6667%</td>
</tr>
<tr>
<td rowspan="12">math</td>
<td rowspan="12">239,467</td>
<td>MATH500</td>
<td>244</td>
<td>0.1019%</td>
<td>500</td>
<td>53</td>
<td>10.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>1</td>
<td>0.0004%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>26</td>
<td>0.0109%</td>
<td>40</td>
<td>9</td>
<td>22.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>20</td>
<td>0.0084%</td>
<td>515</td>
<td>7</td>
<td>1.36%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>5</td>
<td>0.0021%</td>
<td>974</td>
<td>3</td>
<td>0.3080%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>9</td>
<td>0.0038%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>11</td>
<td>0.0046%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">multilingual_de</td>
<td rowspan="12">1,015,314</td>
<td>MATH500</td>
<td>138</td>
<td>0.0136%</td>
<td>500</td>
<td>33</td>
<td>6.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>3</td>
<td>0.0003%</td>
<td>164</td>
<td>4</td>
<td>2.44%</td>
</tr>
<tr>
<td>MBPP</td>
<td>31</td>
<td>0.0031%</td>
<td>974</td>
<td>48</td>
<td>4.93%</td>
</tr>
<tr>
<td>IFEval</td>
<td>1</td>
<td>0.0001%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>15</td>
<td>0.0015%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0002%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">multilingual_es</td>
<td rowspan="12">935,704</td>
<td>MATH500</td>
<td>98</td>
<td>0.0105%</td>
<td>500</td>
<td>22</td>
<td>4.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>27</td>
<td>0.0029%</td>
<td>974</td>
<td>34</td>
<td>3.49%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>7</td>
<td>0.0007%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0002%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">multilingual_fr</td>
<td rowspan="12">1,001,504</td>
<td>MATH500</td>
<td>110</td>
<td>0.0110%</td>
<td>500</td>
<td>26</td>
<td>5.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0001%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>1</td>
<td>0.0001%</td>
<td>164</td>
<td>3</td>
<td>1.83%</td>
</tr>
<tr>
<td>MBPP</td>
<td>23</td>
<td>0.0023%</td>
<td>974</td>
<td>43</td>
<td>4.41%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>5</td>
<td>0.0005%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>5</td>
<td>0.0005%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">multilingual_it</td>
<td rowspan="12">1,016,503</td>
<td>MATH500</td>
<td>96</td>
<td>0.0094%</td>
<td>500</td>
<td>25</td>
<td>5.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0001%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>21</td>
<td>0.0021%</td>
<td>974</td>
<td>32</td>
<td>3.29%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>8</td>
<td>0.0008%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>1</td>
<td>0.0001%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">multilingual_ja</td>
<td rowspan="12">975,202</td>
<td>MATH500</td>
<td>156</td>
<td>0.0160%</td>
<td>500</td>
<td>45</td>
<td>9.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>2</td>
<td>0.0002%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>1</td>
<td>0.0001%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0002%</td>
<td>164</td>
<td>1</td>
<td>0.6098%</td>
</tr>
<tr>
<td>MBPP</td>
<td>26</td>
<td>0.0027%</td>
<td>974</td>
<td>42</td>
<td>4.31%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>9</td>
<td>0.0009%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>4</td>
<td>0.0004%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">stem</td>
<td rowspan="12">355,000</td>
<td>MATH500</td>
<td>39</td>
<td>0.0110%</td>
<td>500</td>
<td>6</td>
<td>1.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>1</td>
<td>0.0003%</td>
<td>974</td>
<td>1</td>
<td>0.1027%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>17</td>
<td>0.0048%</td>
<td>805</td>
<td>6</td>
<td>0.7453%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>1</td>
<td>0.0003%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
</tbody>
</table>
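Each row reports two rates with different denominators. As a worked example using the `math` / MATH500 row above, the dataset-side rate divides contaminated training documents by the split size, while the benchmark-side rate divides matched benchmark items by the benchmark size. The function name `rate_percent` is introduced only for this illustration.

```python
def rate_percent(contaminated, total):
    """Contamination rate expressed as a percentage."""
    return 100.0 * contaminated / total


# Values copied from the math / MATH500 row of the table.
dataset_rate = rate_percent(244, 239_467)  # contaminated docs / docs in split
benchmark_rate = rate_percent(53, 500)     # matched benchmark items / benchmark docs

print(f"{dataset_rate:.4f}%")
print(f"{benchmark_rate:.2f}%")
```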
### Dataset summary
<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>6,341,414</td></tr>
<tr><td>Contaminated documents (removed)</td><td>1,557</td></tr>
<tr><td>Documents after decontamination</td><td>6,339,857</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.0246%</td></tr>
</tbody>
</table>
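The summary figures can be reproduced from the per-split `total` and `removed` counts listed in the card metadata. The short check below is only a sanity computation over those published numbers; the variable names are illustrative.

```python
# (total, removed) per split, copied from the contamination_stats metadata.
stats = {
    "stem": (355_000, 58),
    "chat": (627_720, 79),
    "math": (239_467, 316),
    "code": (175_000, 311),
    "multilingual_ja": (975_202, 200),
    "multilingual_de": (1_015_314, 190),
    "multilingual_it": (1_016_503, 126),
    "multilingual_es": (935_704, 133),
    "multilingual_fr": (1_001_504, 144),
}

total = sum(t for t, _ in stats.values())
removed = sum(r for _, r in stats.values())
remaining = total - removed
rate = 100.0 * removed / total

print(total, removed, remaining, f"{rate:.4f}%")
```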
---
# Nemotron-Post-Training-Dataset-v2 Release
## Data Overview
This dataset extends NVIDIA’s post-training dataset releases with SFT and RL data in five target languages: Spanish, French, German, Italian and Japanese. The data supports improvements to the math, code, general reasoning, and instruction-following capabilities of [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base), in support of the release of [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2).
NVIDIA-Nemotron-Nano-9B is a family of large language models (LLMs) that consists of the [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base) and [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) models. They are successors to [Nemotron-H-8B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K) and [Nemotron-H-8B-Reasoning-128K](https://huggingface.co/nvidia/Nemotron-H-8B-Reasoning-128K), created with commercial use in mind.
The NVIDIA-Nemotron-Nano-9B-v2-Reasoning model is aligned for human chat preferences and tasks. The reasoning model supports a context length of 128K tokens.
For this latest model, NVIDIA also released a pre-training dataset: [Nemotron-Pre-Training-Dataset](https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35).
This dataset release represents a significant move forward in openness and transparency in model development and improvement. By releasing the training set, in addition to the training technique, tools and final model weights, NVIDIA supports both the re-creation and the improvement of our approach.
## Data distribution
| Category | Value |
|----------------|-------------|
| math | 239467 |
| code | 175000 |
| stem | 355000 |
| chat | 627720 |
| multilingual_ja | 975202 |
| multilingual_de | 1015314 |
| multilingual_it | 1016503 |
| multilingual_es | 935704 |
| multilingual_fr | 1001504 |
## Filtering the data
Users can download subsets of the data based on the metadata schema described above. The following example script downloads the code and math splits:
```
from datasets import load_dataset
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", "SFT", split=["code", "math"])
```
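The script above targets the original NVIDIA repository. For the decontaminated copy described on this card, a loading helper might look like the sketch below. The repository id is taken from the card title and the split names from its metadata; the helper name `load_splits`, the choice of streaming mode, and the assumption that an authenticated Hugging Face session is needed (the metadata lists gated-access fields) are not confirmed by the card.

```python
REPO_ID = "openeurollm/Nemotron-Post-Training-Dataset-v2-decontaminated"

# Split names as listed in the card metadata.
KNOWN_SPLITS = {
    "chat", "code", "math", "stem",
    "multilingual_de", "multilingual_es", "multilingual_fr",
    "multilingual_it", "multilingual_ja",
}


def load_splits(names, streaming=True):
    """Load the requested splits and return them as a {split_name: dataset} dict."""
    unknown = [name for name in names if name not in KNOWN_SPLITS]
    if unknown:
        raise ValueError(f"Unknown split(s): {unknown}")
    # Imported lazily so that name validation works without the library installed.
    from datasets import load_dataset
    return {
        name: load_dataset(REPO_ID, split=name, streaming=streaming)
        for name in names
    }


if __name__ == "__main__":
    # Requires network access and, presumably, accepted gating terms.
    subsets = load_splits(["code", "math"])
    print(next(iter(subsets["code"]))["category"])
```

Streaming is used here because the metadata reports a dataset size of roughly 98 GB; passing `streaming=False` would download the selected splits in full.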
## Prompts
Prompts were either sourced from public and open corpora or synthetically generated. All responses were synthetically generated by public and open models.
The prompts were extracted and then filtered for quality and complexity, or generated to meet quality and complexity requirements. Filtering included removing inconsistent prompts, prompts with answers that are easy to guess, and prompts with incorrect syntax.
## Responses
Responses were synthetically generated by a variety of models. Some prompts have responses for both the reasoning-on and reasoning-off modes, to train the model to distinguish between the two modes. The reasoning traces are presented only in English, not the target language, as most of the pre-training corpus is in English.
The table below gives the aggregated sample counts for the models that were used to create this dataset.
Note that multiple models may have been used to produce a single record, so the mapping is not always 1:1.
| Model | Number of Samples |
| :--- | :--- |
| [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | 5,713,694 |
| [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 3,928,913 |
| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 627,720 |
| [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ) | 1,015,314 |
| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | 627,720 |
## License/Terms of Use
The dataset records the license type for each sample. The dataset is predominantly CC-BY-4.0, with a small subset of prompts from WildChat under an ODC-BY license and a small subset of prompts from StackOverflow under a CC-BY-SA license.
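Because the license is recorded per sample, a consumer can keep only the records whose license they accept. The sketch below works on plain dictionaries that mimic the card's schema (`uuid`, `license`, `category`, `messages`); the example records and the exact license strings are made up for illustration and may not match the values stored in the dataset.

```python
from collections import Counter

# Hypothetical records; the real license strings may be spelled differently.
records = [
    {"uuid": "a1", "license": "cc-by-4.0", "category": "math", "messages": []},
    {"uuid": "b2", "license": "odc-by", "category": "chat", "messages": []},
    {"uuid": "c3", "license": "cc-by-sa", "category": "code", "messages": []},
    {"uuid": "d4", "license": "cc-by-4.0", "category": "code", "messages": []},
]


def keep_licenses(rows, allowed):
    """Return only the rows whose `license` field is in `allowed` (case-insensitive)."""
    allowed = {name.lower() for name in allowed}
    return [row for row in rows if row["license"].lower() in allowed]


license_counts = Counter(row["license"] for row in records)
cc_by_only = keep_licenses(records, {"CC-BY-4.0"})

print(license_counts)
print([row["uuid"] for row in cc_by_only])
```

It is worth inspecting the distinct `license` values in the real data first, since the filter only matches strings it is given.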
This dataset contains synthetic data created using [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ), [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). If this dataset is used to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, such AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE).
**Data Developer:** NVIDIA
### Use Case:
Developers training foundation LLM models.
### Release Date:
8/20/2025
## Data Version
2.0 (8/20/2025)
## Intended use
The Nemotron Post-Training Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate.
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Data Opt-Out:
NVIDIA has undertaken legal review to ensure that the dataset contains no confidential information, PII, or copyrighted material. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.
## Citation
If you found this dataset useful, please cite the dataset and the model below:
```
@software{NemotronPostTrainingDatasetV2,
author = {Nathawani, Dhruv and Ding, Shuoyang and Lavrukhin, Vitaly and Gitman, Igor and Majumdar, Somshubra and Bakhturina, Evelina and Ginsburg, Boris and Polak Scowcroft, Jane},
title = {{Nemotron-Post-Training-Dataset-v2}},
version = {2.0},
publisher = {{NVIDIA}},
year = {2025}, month = aug,
url = {https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2}
}
```
```
@misc{nvidia2025nvidianemotronnano2,
title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
author={NVIDIA},
year={2025},
eprint={2508.14444},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14444},
}
```
Provided by:
openeurollm



