
openeurollm/Nemotron-Post-Training-Dataset-v2-decontaminated

Source: Hugging Face. Updated 2026-03-29. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/openeurollm/Nemotron-Post-Training-Dataset-v2-decontaminated
Description:
---
dataset_info:
  features:
  - name: uuid
    dtype: string
  - name: license
    dtype: string
  - name: generator
    dtype: string
  - name: version
    dtype: string
  - name: category
    dtype: string
  - name: reasoning
    dtype: string
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: chat
    num_bytes: 5970038196
    num_examples: 627641
  - name: code
    num_bytes: 978716238
    num_examples: 174689
  - name: math
    num_bytes: 506730327
    num_examples: 239151
  - name: multilingual_de
    num_bytes: 18888129433
    num_examples: 1015124
  - name: multilingual_es
    num_bytes: 16270803251
    num_examples: 935571
  - name: multilingual_fr
    num_bytes: 18229229848
    num_examples: 1001360
  - name: multilingual_it
    num_bytes: 18722006630
    num_examples: 1016377
  - name: multilingual_ja
    num_bytes: 18011107778
    num_examples: 975002
  - name: stem
    num_bytes: 807481013
    num_examples: 354942
  download_size: 98384242714
  dataset_size: 98384242714
configs:
- config_name: default
  data_files:
  - split: chat
    path: data/chat-*
  - split: code
    path: data/code-*
  - split: math
    path: data/math-*
  - split: multilingual_de
    path: data/multilingual_de-*
  - split: multilingual_es
    path: data/multilingual_es-*
  - split: multilingual_fr
    path: data/multilingual_fr-*
  - split: multilingual_it
    path: data/multilingual_it-*
  - split: multilingual_ja
    path: data/multilingual_ja-*
  - split: stem
    path: data/stem-*
license: cc-by-4.0
language:
- en
- de
- it
- fr
- es
- ja
extra_gated_fields:
  Company: text
  Institutional Email: text
decontamination:
  source_dataset: nvidia/Nemotron-Post-Training-Dataset-v2
  benchmarks:
  - path: HuggingFaceH4/MATH-500
    subset: default
    split: test
  - path: HuggingFaceH4/aime_2024
    subset: default
    split: train
  - path: math-ai/aime25
    subset: default
    split: test
  - path: math-ai/amc23
    subset: default
    split: test
  - path: daman1209arora/jeebench
    subset: default
    split: test
  - path: Idavidrein/gpqa
    subset: gpqa_diamond
    split: train
  - path: ali-elganzory/livecodebench-code_generation_lite
    subset: release_v6
    split: test
  - path: openai/openai_humaneval
    subset: openai_humaneval
    split: test
  - path: google-research-datasets/mbpp
    subset: full
    split: train+test+validation+prompt
  - path: google/IFEval
    subset: default
    split: train
  - path: tatsu-lab/alpaca_eval
    subset: alpaca_eval
    split: eval
  - path: lmarena-ai/arena-hard-auto
    subset: default
    split: train
  contamination_stats:
  - subset: default
    split: stem
    total: 355000
    removed: 58
  - subset: default
    split: chat
    total: 627720
    removed: 79
  - subset: default
    split: math
    total: 239467
    removed: 316
  - subset: default
    split: code
    total: 175000
    removed: 311
  - subset: default
    split: multilingual_ja
    total: 975202
    removed: 200
  - subset: default
    split: multilingual_de
    total: 1015314
    removed: 190
  - subset: default
    split: multilingual_it
    total: 1016503
    removed: 126
  - subset: default
    split: multilingual_es
    total: 935704
    removed: 133
  - subset: default
    split: multilingual_fr
    total: 1001504
    removed: 144
---

## Decontamination

This dataset is a decontaminated version of [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).
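
The card reports only two decontamination parameters (an n-gram size of 8 and a match threshold of 0.5, listed under "Decontamination settings") and does not describe the matching procedure itself. The following is a minimal sketch of one common way such a filter can work, not the pipeline actually used here. It assumes lowercased whitespace tokenization, and it assumes the threshold is the fraction of a benchmark item's 8-grams that also occur in a training document; both are assumptions, and all function names are hypothetical.

```python
from typing import List, Set, Tuple

NGRAM_SIZE = 8         # from the card's "Decontamination settings"
MATCH_THRESHOLD = 0.5  # from the card; its exact interpretation is assumed here


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of `text` (assumed tokenization: lowercase + split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document: str, benchmark_item: str,
                    n: int = NGRAM_SIZE,
                    threshold: float = MATCH_THRESHOLD) -> bool:
    """Flag `document` if enough of the benchmark item's n-grams appear in it."""
    bench = ngrams(benchmark_item, n)
    if not bench:  # benchmark item shorter than n tokens: nothing to match
        return False
    overlap = len(bench & ngrams(document, n)) / len(bench)
    return overlap >= threshold


def decontaminate(documents: List[str], benchmark_items: List[str]) -> List[str]:
    """Keep only the documents that match no benchmark item."""
    return [
        doc for doc in documents
        if not any(is_contaminated(doc, item) for item in benchmark_items)
    ]


if __name__ == "__main__":
    # Made-up strings for illustration; not taken from any benchmark.
    benchmark = ["what is the sum of the first ten positive even integers in total"]
    docs = [
        "Question: what is the sum of the first ten positive even integers in total please answer",
        "explain how photosynthesis converts light into chemical energy in plants and algae today",
    ]
    print(decontaminate(docs, benchmark))
```

Under these assumptions a document is dropped as soon as it matches any single benchmark item, which is consistent with the card reporting one "removed" count per split while listing per-benchmark hits separately.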

### Benchmarks used

- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl'])

### Decontamination settings

<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>

### Split and benchmark details

<table>
<thead>
<tr> <th>Subset</th> <th>Split</th> <th>Docs in split (dataset)</th> <th>Benchmark</th> <th>Contaminated (dataset)</th> <th>Contamination rate (dataset)</th> <th>Docs (benchmark)</th> <th>Contaminated (benchmark)</th> <th>Contamination rate (benchmark)</th> </tr>
</thead>
<tbody>
<tr> <td rowspan="108">default</td> <td rowspan="12">chat</td> <td rowspan="12">627,720</td> <td>MATH500</td> <td>47</td> <td>0.0075%</td> <td>500</td> <td>12</td> <td>2.40%</td> </tr>
<tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>JEEBench</td>
<td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>2</td> <td>0.0003%</td> <td>1055</td> <td>1</td> <td>0.0948%</td> </tr> <tr> <td>HumanEval</td> <td>0</td> <td>0.0000%</td> <td>164</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>MBPP</td> <td>6</td> <td>0.0010%</td> <td>974</td> <td>10</td> <td>1.03%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>17</td> <td>0.0027%</td> <td>805</td> <td>4</td> <td>0.4969%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>7</td> <td>0.0011%</td> <td>750</td> <td>2</td> <td>0.2667%</td> </tr> <tr> <td rowspan="12">code</td> <td rowspan="12">175,000</td> <td>MATH500</td> <td>70</td> <td>0.0400%</td> <td>500</td> <td>7</td> <td>1.40%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>56</td> <td>0.0320%</td> <td>1055</td> <td>9</td> <td>0.8531%</td> </tr> <tr> <td>HumanEval</td> <td>1</td> <td>0.0006%</td> <td>164</td> <td>3</td> <td>1.83%</td> </tr> <tr> <td>MBPP</td> <td>170</td> <td>0.0971%</td> <td>974</td> <td>30</td> <td>3.08%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>1</td> <td>0.0006%</td> <td>805</td> <td>1</td> <td>0.1242%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>14</td> <td>0.0080%</td> <td>750</td> <td>5</td> 
<td>0.6667%</td> </tr> <tr> <td rowspan="12">math</td> <td rowspan="12">239,467</td> <td>MATH500</td> <td>244</td> <td>0.1019%</td> <td>500</td> <td>53</td> <td>10.60%</td> </tr> <tr> <td>AIME24</td> <td>1</td> <td>0.0004%</td> <td>30</td> <td>1</td> <td>3.33%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>26</td> <td>0.0109%</td> <td>40</td> <td>9</td> <td>22.50%</td> </tr> <tr> <td>JEEBench</td> <td>20</td> <td>0.0084%</td> <td>515</td> <td>7</td> <td>1.36%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>0</td> <td>0.0000%</td> <td>164</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>MBPP</td> <td>5</td> <td>0.0021%</td> <td>974</td> <td>3</td> <td>0.3080%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>9</td> <td>0.0038%</td> <td>805</td> <td>2</td> <td>0.2484%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>11</td> <td>0.0046%</td> <td>750</td> <td>2</td> <td>0.2667%</td> </tr> <tr> <td rowspan="12">multilingual_de</td> <td rowspan="12">1,015,314</td> <td>MATH500</td> <td>138</td> <td>0.0136%</td> <td>500</td> <td>33</td> <td>6.60%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> 
<td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>3</td> <td>0.0003%</td> <td>164</td> <td>4</td> <td>2.44%</td> </tr> <tr> <td>MBPP</td> <td>31</td> <td>0.0031%</td> <td>974</td> <td>48</td> <td>4.93%</td> </tr> <tr> <td>IFEval</td> <td>1</td> <td>0.0001%</td> <td>541</td> <td>1</td> <td>0.1848%</td> </tr> <tr> <td>AlpacaEval</td> <td>15</td> <td>0.0015%</td> <td>805</td> <td>3</td> <td>0.3727%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>2</td> <td>0.0002%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr> <tr> <td rowspan="12">multilingual_es</td> <td rowspan="12">935,704</td> <td>MATH500</td> <td>98</td> <td>0.0105%</td> <td>500</td> <td>22</td> <td>4.40%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>0</td> <td>0.0000%</td> <td>164</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>MBPP</td> <td>27</td> <td>0.0029%</td> <td>974</td> <td>34</td> <td>3.49%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>7</td> <td>0.0007%</td> <td>805</td> <td>2</td> <td>0.2484%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>2</td> <td>0.0002%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr> <tr> <td rowspan="12">multilingual_fr</td> <td rowspan="12">1,001,504</td> <td>MATH500</td> <td>110</td> <td>0.0110%</td> <td>500</td> <td>26</td> <td>5.20%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> 
<td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>1</td> <td>0.0001%</td> <td>515</td> <td>1</td> <td>0.1942%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>1</td> <td>0.0001%</td> <td>164</td> <td>3</td> <td>1.83%</td> </tr> <tr> <td>MBPP</td> <td>23</td> <td>0.0023%</td> <td>974</td> <td>43</td> <td>4.41%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>5</td> <td>0.0005%</td> <td>805</td> <td>2</td> <td>0.2484%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>5</td> <td>0.0005%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr> <tr> <td rowspan="12">multilingual_it</td> <td rowspan="12">1,016,503</td> <td>MATH500</td> <td>96</td> <td>0.0094%</td> <td>500</td> <td>25</td> <td>5.00%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>1</td> <td>0.0001%</td> <td>515</td> <td>1</td> <td>0.1942%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>0</td> <td>0.0000%</td> <td>164</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>MBPP</td> <td>21</td> <td>0.0021%</td> <td>974</td> <td>32</td> <td>3.29%</td> </tr> <tr> <td>IFEval</td> <td>0</td> 
<td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>8</td> <td>0.0008%</td> <td>805</td> <td>3</td> <td>0.3727%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>1</td> <td>0.0001%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr> <tr> <td rowspan="12">multilingual_ja</td> <td rowspan="12">975,202</td> <td>MATH500</td> <td>156</td> <td>0.0160%</td> <td>500</td> <td>45</td> <td>9.00%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>2</td> <td>0.0002%</td> <td>30</td> <td>1</td> <td>3.33%</td> </tr> <tr> <td>AMC23</td> <td>1</td> <td>0.0001%</td> <td>40</td> <td>1</td> <td>2.50%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>HumanEval</td> <td>2</td> <td>0.0002%</td> <td>164</td> <td>1</td> <td>0.6098%</td> </tr> <tr> <td>MBPP</td> <td>26</td> <td>0.0027%</td> <td>974</td> <td>42</td> <td>4.31%</td> </tr> <tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AlpacaEval</td> <td>9</td> <td>0.0009%</td> <td>805</td> <td>2</td> <td>0.2484%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>4</td> <td>0.0004%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr> <tr> <td rowspan="12">stem</td> <td rowspan="12">355,000</td> <td>MATH500</td> <td>39</td> <td>0.0110%</td> <td>500</td> <td>6</td> <td>1.20%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AMC23</td> <td>0</td> <td>0.0000%</td> <td>40</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> 
<td>515</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>GPQADiamond</td> <td>0</td> <td>0.0000%</td> <td>198</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>LiveCodeBench</td> <td>0</td> <td>0.0000%</td> <td>1055</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>HumanEval</td> <td>0</td> <td>0.0000%</td> <td>164</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>MBPP</td> <td>1</td> <td>0.0003%</td> <td>974</td> <td>1</td> <td>0.1027%</td> </tr>
<tr> <td>IFEval</td> <td>0</td> <td>0.0000%</td> <td>541</td> <td>0</td> <td>0.0000%</td> </tr>
<tr> <td>AlpacaEval</td> <td>17</td> <td>0.0048%</td> <td>805</td> <td>6</td> <td>0.7453%</td> </tr>
<tr> <td>Arena-Hard-v2.0</td> <td>1</td> <td>0.0003%</td> <td>750</td> <td>1</td> <td>0.1333%</td> </tr>
</tbody>
</table>

### Dataset summary

<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>6,341,414</td></tr>
<tr><td>Contaminated documents (removed)</td><td>1,557</td></tr>
<tr><td>Documents after decontamination</td><td>6,339,857</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.0246%</td></tr>
</tbody>
</table>

---

# Nemotron-Post-Training-Dataset-v2 Release

## Data Overview

This dataset adds to NVIDIA’s post-training dataset releases with an extension of SFT and RL data into five target languages: Spanish, French, German, Italian and Japanese. The data supports improvements to the math, code, general reasoning, and instruction-following capabilities of [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base), in support of the release of [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2).

NVIDIA-Nemotron-Nano-9B is a family of large language models (LLMs) that consists of the [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base) and [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) models.
They are successors of [Nemotron-H-8B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K) and [Nemotron-H-8B-Reasoning-128K](https://huggingface.co/nvidia/Nemotron-H-8B-Reasoning-128K), created with commercial use in mind. The NVIDIA-Nemotron-Nano-9B-v2-Reasoning model is aligned for human chat preferences and tasks. The reasoning model supports a context length of 128K tokens.

For this latest model, NVIDIA also released a pre-training dataset: [Nemotron-Pre-Training-Dataset](https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35)

This dataset release represents a significant move forward in openness and transparency in model development and improvement. By releasing the training set, in addition to the training technique, tools and final model weights, NVIDIA supports both the re-creation and the improvement of our approach.

## Data distribution

| Category | Value |
|----------------|-------------|
| math | 239467 |
| code | 175000 |
| stem | 355000 |
| chat | 627720 |
| multilingual_ja | 975202 |
| multilingual_de | 1015314 |
| multilingual_it | 1016503 |
| multilingual_es | 935704 |
| multilingual_fr | 1001504 |

## Filtering the data

Users can download subsets of the data based on the metadata schema described above. An example script for downloading the code and math splits follows:

```
from datasets import load_dataset

# Load only the "code" and "math" splits of the upstream dataset.
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", "SFT", split=["code", "math"])
```

## Prompts

Prompts were either sourced from public and open corpora or synthetically generated. All responses were synthetically generated by public and open models.

The prompts were extracted and then filtered for quality and complexity, or generated to meet quality and complexity requirements. Filtering included removing inconsistent prompts, prompts with answers that are easy to guess, and prompts with incorrect syntax.
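
As a cross-check on the Data distribution table above, the short sketch below recomputes the figures in the "Dataset summary" table from the per-split `total` and `removed` counts in the card's `contamination_stats`, and compares the remainder with each split's `num_examples`. All numbers are copied from this card; the variable names are illustrative only.

```python
# Per-split (total before decontamination, removed, num_examples after),
# copied from the card's metadata.
split_counts = {
    "math":            (239_467,   316,   239_151),
    "code":            (175_000,   311,   174_689),
    "stem":            (355_000,    58,   354_942),
    "chat":            (627_720,    79,   627_641),
    "multilingual_ja": (975_202,   200,   975_002),
    "multilingual_de": (1_015_314, 190, 1_015_124),
    "multilingual_it": (1_016_503, 126, 1_016_377),
    "multilingual_es": (935_704,   133,   935_571),
    "multilingual_fr": (1_001_504, 144, 1_001_360),
}

# Each split's remaining size should equal total minus removed.
for name, (total, removed, kept) in split_counts.items():
    assert total - removed == kept, name

total_docs = sum(total for total, _, _ in split_counts.values())
removed_docs = sum(removed for _, removed, _ in split_counts.values())
remaining_docs = total_docs - removed_docs
rate_percent = 100 * removed_docs / total_docs

print(f"Total documents:       {total_docs:,}")
print(f"Removed documents:     {removed_docs:,}")
print(f"After decontamination: {remaining_docs:,}")
print(f"Contamination rate:    {rate_percent:.4f}%")
```

The per-split removals are smaller than the sum of the per-benchmark "Contaminated (dataset)" counts would suggest in some splits, which is what one would expect if a single document can match more than one benchmark.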

## Responses

Responses were synthetically generated by a variety of models, with some prompts containing responses for both reasoning-on and reasoning-off modes, to train the model to distinguish between the two modes. The reasoning traces are presented only in English, not the target language, as most of the pre-training corpus is in English.

The table below gives the aggregated counts for the models that were used in the creation of this dataset. Note that multiple models may have been used to produce a single record, so the mapping is not always 1:1.

| Model | Number of Samples |
| :--- | :--- |
| [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | 5,713,694 |
| [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 3,928,913 |
| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 627,720 |
| [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ) | 1,015,314 |
| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | 627,720 |

## License/Terms of Use

The dataset contains information about license type on a per-sample basis. The dataset is predominantly CC-BY-4.0, with a small subset of prompts from Wildchat having an ODC-BY license and a small subset of prompts from StackOverflow having a CC-BY-SA license.

This dataset contains synthetic data created using [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ), [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507).
If this dataset is used to create, train, fine-tune, or otherwise improve an AI model that is distributed or made available, that AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE).

**Data Developer:** NVIDIA

### Use Case: <br>
Developers training foundation LLM models. <br>

### Release Date: <br>
8/20/2025 <br>

## Data Version

2.0 (8/20/2025)

## Intended use

The Nemotron Post-Training Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Data Opt-Out:

NVIDIA has undertaken legal review to ensure there is no confidential information, PII, or copyrighted material. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.

## Citation

If you found this dataset useful, please cite the dataset and the model below:

```
@software{NemotronPostTrainingDatasetV2,
  author = {Nathawani, Dhruv and Ding, Shuoyang and Lavrukhin, Vitaly and Gitman, Igor and Majumdar, Somshubra and Bakhturina, Evelina and Ginsburg, Boris and Polak Scowcroft, Jane},
  title = {{Nemotron-Post-Training-Dataset-v2}},
  version = {2.0},
  publisher = {{NVIDIA}},
  year = {2025},
  month = aug,
  url = {https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2}
}
```

```
@misc{nvidia2025nvidianemotronnano2,
  title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
  author={NVIDIA},
  year={2025},
  eprint={2508.14444},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.14444},
}
```
Provider: openeurollm