openeurollm/smoltalk2-decontaminated
Hugging Face · Updated 2026-03-29 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/openeurollm/smoltalk2-decontaminated
Resource description:
---
dataset_info:
- config_name: Mid
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: source
    dtype: string
  splits:
  - name: Llama_Nemotron_Post_Training_Dataset_reasoning_r1
    num_bytes: 61572047251
    num_examples: 3642011
  - name: OpenThoughts3_1.2M
    num_bytes: 56323337343
    num_examples: 1134737
  download_size: 117895384594
  dataset_size: 117895384594
- config_name: Preference
  features:
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: prompt
    dtype: string
  - name: chat_template_kwargs
    struct:
    - name: custom_instructions
      dtype: string
    - name: enable_thinking
      dtype: bool
    - name: python_tools
      list: string
    - name: xml_tools
      list: string
  - name: source
    dtype: string
  splits:
  - name: llama_3.1_tulu_3_8b_preference_mixture_no_think
    num_bytes: 1470085457
    num_examples: 230233
  - name: tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think
    num_bytes: 4555578318
    num_examples: 216130
  download_size: 6025663775
  dataset_size: 6025663775
- config_name: SFT
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chat_template_kwargs
    struct:
    - name: custom_instructions
      dtype: string
    - name: enable_thinking
      dtype: bool
    - name: python_tools
      list: string
    - name: xml_tools
      list: string
  - name: source
    dtype: string
  splits:
  - name: LongAlign_64k_Qwen3_32B_yarn_131k_think
    num_bytes: 515878388
    num_examples: 7472
  - name: LongAlign_64k_context_lang_annotated_lang_6_no_think
    num_bytes: 397857852
    num_examples: 6193
  - name: Mixture_of_Thoughts_science_no_think
    num_bytes: 130205353
    num_examples: 86068
  - name: OpenHermes_2.5_no_think
    num_bytes: 585305066
    num_examples: 384845
  - name: OpenThoughts3_1.2M_no_think_no_think
    num_bytes: 1214383875
    num_examples: 434925
  - name: OpenThoughts3_1.2M_think
    num_bytes: 56255327596
    num_examples: 1133157
  - name: aya_dataset_Qwen3_32B_think
    num_bytes: 60159585
    num_examples: 15221
  - name: hermes_function_calling_v1_no_think
    num_bytes: 44611215
    num_examples: 8929
  - name: multi_turn_reasoning_if_think
    num_bytes: 421547201
    num_examples: 28194
  - name: s1k_1.1_think
    num_bytes: 24001489
    num_examples: 778
  - name: smolagents_toolcalling_traces_think
    num_bytes: 199988999
    num_examples: 9013
  - name: smoltalk_everyday_convs_reasoning_Qwen3_32B_think
    num_bytes: 11738101
    num_examples: 2057
  - name: smoltalk_multilingual8_Qwen3_32B_think
    num_bytes: 1900277001
    num_examples: 244712
  - name: smoltalk_multilingual_8languages_lang_5_no_think
    num_bytes: 564874295
    num_examples: 254023
  - name: smoltalk_smollm3_everyday_conversations_no_think
    num_bytes: 1954696
    num_examples: 2259
  - name: smoltalk_smollm3_explore_instruct_rewriting_no_think
    num_bytes: 14184832
    num_examples: 30388
  - name: smoltalk_smollm3_smol_magpie_ultra_no_think
    num_bytes: 2815314088
    num_examples: 405942
  - name: smoltalk_smollm3_smol_rewrite_no_think
    num_bytes: 89542052
    num_examples: 53250
  - name: smoltalk_smollm3_smol_summarize_no_think
    num_bytes: 229069587
    num_examples: 96050
  - name: smoltalk_smollm3_systemchats_30k_no_think
    num_bytes: 89594706
    num_examples: 33966
  - name: smoltalk_systemchats_Qwen3_32B_think
    num_bytes: 123488154
    num_examples: 27423
  - name: table_gpt_Qwen3_32B_think
    num_bytes: 77654360
    num_examples: 13178
  - name: table_gpt_no_think
    num_bytes: 31079108
    num_examples: 13182
  - name: tulu_3_sft_personas_instruction_following_no_think
    num_bytes: 59822207
    num_examples: 29944
  - name: xlam_traces_no_think
    num_bytes: 96764391
    num_examples: 59932
  download_size: 65954624197
  dataset_size: 65954624197
configs:
- config_name: Mid
  data_files:
  - split: Llama_Nemotron_Post_Training_Dataset_reasoning_r1
    path: Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1-*
  - split: OpenThoughts3_1.2M
    path: Mid/OpenThoughts3_1.2M-*
- config_name: Preference
  data_files:
  - split: llama_3.1_tulu_3_8b_preference_mixture_no_think
    path: Preference/llama_3.1_tulu_3_8b_preference_mixture_no_think-*
  - split: tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think
    path: Preference/tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think-*
- config_name: SFT
  data_files:
  - split: LongAlign_64k_Qwen3_32B_yarn_131k_think
    path: SFT/LongAlign_64k_Qwen3_32B_yarn_131k_think-*
  - split: LongAlign_64k_context_lang_annotated_lang_6_no_think
    path: SFT/LongAlign_64k_context_lang_annotated_lang_6_no_think-*
  - split: Mixture_of_Thoughts_science_no_think
    path: SFT/Mixture_of_Thoughts_science_no_think-*
  - split: OpenHermes_2.5_no_think
    path: SFT/OpenHermes_2.5_no_think-*
  - split: OpenThoughts3_1.2M_no_think_no_think
    path: SFT/OpenThoughts3_1.2M_no_think_no_think-*
  - split: OpenThoughts3_1.2M_think
    path: SFT/OpenThoughts3_1.2M_think-*
  - split: aya_dataset_Qwen3_32B_think
    path: SFT/aya_dataset_Qwen3_32B_think-*
  - split: hermes_function_calling_v1_no_think
    path: SFT/hermes_function_calling_v1_no_think-*
  - split: multi_turn_reasoning_if_think
    path: SFT/multi_turn_reasoning_if_think-*
  - split: s1k_1.1_think
    path: SFT/s1k_1.1_think-*
  - split: smolagents_toolcalling_traces_think
    path: SFT/smolagents_toolcalling_traces_think-*
  - split: smoltalk_everyday_convs_reasoning_Qwen3_32B_think
    path: SFT/smoltalk_everyday_convs_reasoning_Qwen3_32B_think-*
  - split: smoltalk_multilingual8_Qwen3_32B_think
    path: SFT/smoltalk_multilingual8_Qwen3_32B_think-*
  - split: smoltalk_multilingual_8languages_lang_5_no_think
    path: SFT/smoltalk_multilingual_8languages_lang_5_no_think-*
  - split: smoltalk_smollm3_everyday_conversations_no_think
    path: SFT/smoltalk_smollm3_everyday_conversations_no_think-*
  - split: smoltalk_smollm3_explore_instruct_rewriting_no_think
    path: SFT/smoltalk_smollm3_explore_instruct_rewriting_no_think-*
  - split: smoltalk_smollm3_smol_magpie_ultra_no_think
    path: SFT/smoltalk_smollm3_smol_magpie_ultra_no_think-*
  - split: smoltalk_smollm3_smol_rewrite_no_think
    path: SFT/smoltalk_smollm3_smol_rewrite_no_think-*
  - split: smoltalk_smollm3_smol_summarize_no_think
    path: SFT/smoltalk_smollm3_smol_summarize_no_think-*
  - split: smoltalk_smollm3_systemchats_30k_no_think
    path: SFT/smoltalk_smollm3_systemchats_30k_no_think-*
  - split: smoltalk_systemchats_Qwen3_32B_think
    path: SFT/smoltalk_systemchats_Qwen3_32B_think-*
  - split: table_gpt_Qwen3_32B_think
    path: SFT/table_gpt_Qwen3_32B_think-*
  - split: table_gpt_no_think
    path: SFT/table_gpt_no_think-*
  - split: tulu_3_sft_personas_instruction_following_no_think
    path: SFT/tulu_3_sft_personas_instruction_following_no_think-*
  - split: xlam_traces_no_think
    path: SFT/xlam_traces_no_think-*
decontamination:
  source_dataset: HuggingFaceTB/smoltalk2
  benchmarks:
  - path: HuggingFaceH4/MATH-500
    subset: default
    split: test
  - path: HuggingFaceH4/aime_2024
    subset: default
    split: train
  - path: math-ai/aime25
    subset: default
    split: test
  - path: math-ai/amc23
    subset: default
    split: test
  - path: daman1209arora/jeebench
    subset: default
    split: test
  - path: Idavidrein/gpqa
    subset: gpqa_diamond
    split: train
  - path: ali-elganzory/livecodebench-code_generation_lite
    subset: release_v6
    split: test
  - path: openai/openai_humaneval
    subset: openai_humaneval
    split: test
  - path: google-research-datasets/mbpp
    subset: full
    split: train+test+validation+prompt
  - path: google/IFEval
    subset: default
    split: train
  - path: tatsu-lab/alpaca_eval
    subset: alpaca_eval
    split: eval
  - path: lmarena-ai/arena-hard-auto
    subset: default
    split: train
  contamination_stats:
  - subset: SFT
    split: LongAlign_64k_Qwen3_32B_yarn_131k_think
    total: 7526
    removed: 54
  - subset: SFT
    split: OpenThoughts3_1.2M_think
    total: 1133524
    removed: 367
  - subset: SFT
    split: aya_dataset_Qwen3_32B_think
    total: 15222
    removed: 1
  - subset: SFT
    split: multi_turn_reasoning_if_think
    total: 84651
    removed: 23
  - subset: SFT
    split: s1k_1.1_think
    total: 835
    removed: 57
  - subset: SFT
    split: smolagents_toolcalling_traces_think
    total: 9079
    removed: 66
  - subset: SFT
    split: smoltalk_everyday_convs_reasoning_Qwen3_32B_think
    total: 4114
    removed: 0
  - subset: SFT
    split: smoltalk_multilingual8_Qwen3_32B_think
    total: 244736
    removed: 24
  - subset: SFT
    split: smoltalk_systemchats_Qwen3_32B_think
    total: 27436
    removed: 13
  - subset: SFT
    split: table_gpt_Qwen3_32B_think
    total: 13201
    removed: 23
  - subset: SFT
    split: LongAlign_64k_context_lang_annotated_lang_6_no_think
    total: 6249
    removed: 56
  - subset: SFT
    split: Mixture_of_Thoughts_science_no_think
    total: 86110
    removed: 42
  - subset: SFT
    split: OpenHermes_2.5_no_think
    total: 384900
    removed: 55
  - subset: SFT
    split: OpenThoughts3_1.2M_no_think_no_think
    total: 435193
    removed: 268
  - subset: SFT
    split: hermes_function_calling_v1_no_think
    total: 16292
    removed: 32
  - subset: SFT
    split: smoltalk_multilingual_8languages_lang_5_no_think
    total: 254047
    removed: 24
  - subset: SFT
    split: smoltalk_smollm3_everyday_conversations_no_think
    total: 8880
    removed: 1
  - subset: SFT
    split: smoltalk_smollm3_explore_instruct_rewriting_no_think
    total: 30391
    removed: 3
  - subset: SFT
    split: smoltalk_smollm3_smol_magpie_ultra_no_think
    total: 1220529
    removed: 901
  - subset: SFT
    split: smoltalk_smollm3_smol_rewrite_no_think
    total: 53262
    removed: 12
  - subset: SFT
    split: smoltalk_smollm3_smol_summarize_no_think
    total: 96061
    removed: 11
  - subset: SFT
    split: smoltalk_smollm3_systemchats_30k_no_think
    total: 106622
    removed: 31
  - subset: SFT
    split: table_gpt_no_think
    total: 13203
    removed: 21
  - subset: SFT
    split: tulu_3_sft_personas_instruction_following_no_think
    total: 29970
    removed: 26
  - subset: SFT
    split: xlam_traces_no_think
    total: 59962
    removed: 30
  - subset: Mid
    split: Llama_Nemotron_Post_Training_Dataset_reasoning_r1
    total: 3644790
    removed: 2779
  - subset: Mid
    split: OpenThoughts3_1.2M
    total: 1135104
    removed: 367
  - subset: Preference
    split: llama_3.1_tulu_3_8b_preference_mixture_no_think
    total: 230501
    removed: 268
  - subset: Preference
    split: tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think
    total: 216385
    removed: 255
---
## Decontamination
This dataset is a decontaminated version of [HuggingFaceTB/smoltalk2](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2).
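As a minimal sketch of how one split might be loaded, assuming the repository id `openeurollm/smoltalk2-decontaminated` (taken from the title of this card) and the third-party `datasets` library: the helper names `shard_glob` and `load_split` are hypothetical and only illustrate the config/split layout declared in the card metadata.

```python
# Hypothetical helpers; only the repo id, config names, and split names
# come from this card. Everything else is an illustrative assumption.
REPO_ID = "openeurollm/smoltalk2-decontaminated"


def shard_glob(config: str, split: str) -> str:
    """Return the shard pattern used in the card's `configs` section,
    e.g. `SFT/s1k_1.1_think-*`."""
    return f"{config}/{split}-*"


def load_split(config: str, split: str, streaming: bool = True):
    """Load one split of one config (Mid, Preference, or SFT).

    Streaming is the default here because some splits are tens of
    gigabytes according to the `num_bytes` fields in the metadata.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name=config, split=split, streaming=streaming)
```

For example, `load_split("SFT", "s1k_1.1_think")` would request the smallest SFT split; the call needs network access to the Hugging Face Hub or a mirror.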
### Benchmarks used
- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl'])
### Decontamination settings
<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>
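
The card gives only the two parameters above and does not describe the matching procedure. The following is a rough sketch of one common n-gram overlap check, not the pipeline actually used here. It assumes lowercase whitespace tokenization and assumes the threshold applies to the fraction of a benchmark item's 8-grams that also occur in a training document; all function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

NGRAM_SIZE = 8         # "N-gram size" from the settings table
MATCH_THRESHOLD = 0.5  # "Match threshold" from the settings table


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams (assumption: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(doc: str, benchmark_item: str, n: int = NGRAM_SIZE) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(doc, n)) / len(bench)


def is_contaminated(
    doc: str,
    benchmark_items: Iterable[str],
    n: int = NGRAM_SIZE,
    threshold: float = MATCH_THRESHOLD,
) -> bool:
    """Flag the document if any benchmark item reaches the overlap threshold."""
    return any(overlap_ratio(doc, item, n) >= threshold for item in benchmark_items)
```

Under these assumptions, a document that quotes a benchmark question in full is flagged, while one that shares only a single 8-gram with a longer question is not.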
### Split and benchmark details
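
The two rate columns in the table of this section appear to be simple ratios: contaminated documents over documents in the split, and contaminated benchmark items over benchmark size. The sketch below reproduces a few of the printed values. The display rule (four decimals below 1%, two decimals otherwise) is inferred from the table and is an assumption, as is the function name.

```python
def contamination_rate(contaminated: int, total: int) -> str:
    """Format contaminated/total as a percentage string.

    Assumed display rule (inferred from the table, not documented):
    four decimals when the rate is below 1%, two decimals otherwise.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * contaminated / total
    return f"{pct:.4f}%" if pct < 1.0 else f"{pct:.2f}%"


# Mid / Llama_Nemotron_Post_Training_Dataset_reasoning_r1 vs. MATH500
print(contamination_rate(426, 3_644_790))  # dataset side
print(contamination_rate(48, 500))         # benchmark side
```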
<table>
<thead>
<tr>
<th>Subset</th>
<th>Split</th>
<th>Docs in split (dataset)</th>
<th>Benchmark</th>
<th>Contaminated (dataset)</th>
<th>Contamination rate (dataset)</th>
<th>Docs (benchmark)</th>
<th>Contaminated (benchmark)</th>
<th>Contamination rate (benchmark)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="24">Mid</td>
<td rowspan="12">Llama_Nemotron_Post_Training_Dataset_reasoning_r1</td>
<td rowspan="12">3,644,790</td>
<td>MATH500</td>
<td>426</td>
<td>0.0117%</td>
<td>500</td>
<td>48</td>
<td>9.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>2</td>
<td>0.0001%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>10</td>
<td>0.0003%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>61</td>
<td>0.0017%</td>
<td>515</td>
<td>11</td>
<td>2.14%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0001%</td>
<td>164</td>
<td>2</td>
<td>1.22%</td>
</tr>
<tr>
<td>MBPP</td>
<td>2102</td>
<td>0.0577%</td>
<td>974</td>
<td>308</td>
<td>31.62%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>143</td>
<td>0.0039%</td>
<td>805</td>
<td>15</td>
<td>1.86%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>33</td>
<td>0.0009%</td>
<td>750</td>
<td>5</td>
<td>0.6667%</td>
</tr>
<tr>
<td rowspan="12">OpenThoughts3_1.2M</td>
<td rowspan="12">1,135,104</td>
<td>MATH500</td>
<td>267</td>
<td>0.0235%</td>
<td>500</td>
<td>32</td>
<td>6.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0001%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>10</td>
<td>0.0009%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>52</td>
<td>0.0046%</td>
<td>974</td>
<td>6</td>
<td>0.6160%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>25</td>
<td>0.0022%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>13</td>
<td>0.0011%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="24">Preference</td>
<td rowspan="12">llama_3.1_tulu_3_8b_preference_mixture_no_think</td>
<td rowspan="12">230,501</td>
<td>MATH500</td>
<td>61</td>
<td>0.0265%</td>
<td>500</td>
<td>8</td>
<td>1.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0004%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0004%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>135</td>
<td>0.0586%</td>
<td>974</td>
<td>109</td>
<td>11.19%</td>
</tr>
<tr>
<td>IFEval</td>
<td>4</td>
<td>0.0017%</td>
<td>541</td>
<td>2</td>
<td>0.3697%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>63</td>
<td>0.0273%</td>
<td>805</td>
<td>27</td>
<td>3.35%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>3</td>
<td>0.0013%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think</td>
<td rowspan="12">216,385</td>
<td>MATH500</td>
<td>57</td>
<td>0.0263%</td>
<td>500</td>
<td>9</td>
<td>1.80%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0005%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>128</td>
<td>0.0592%</td>
<td>974</td>
<td>106</td>
<td>10.88%</td>
</tr>
<tr>
<td>IFEval</td>
<td>6</td>
<td>0.0028%</td>
<td>541</td>
<td>2</td>
<td>0.3697%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>60</td>
<td>0.0277%</td>
<td>805</td>
<td>26</td>
<td>3.23%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>3</td>
<td>0.0014%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="300">SFT</td>
<td rowspan="12">LongAlign_64k_Qwen3_32B_yarn_131k_think</td>
<td rowspan="12">7,526</td>
<td>MATH500</td>
<td>36</td>
<td>0.4783%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>1</td>
<td>0.0133%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>2</td>
<td>0.0266%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>3</td>
<td>0.0399%</td>
<td>164</td>
<td>3</td>
<td>1.83%</td>
</tr>
<tr>
<td>MBPP</td>
<td>2</td>
<td>0.0266%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>3</td>
<td>0.0399%</td>
<td>541</td>
<td>2</td>
<td>0.3697%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>5</td>
<td>0.0664%</td>
<td>805</td>
<td>4</td>
<td>0.4969%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>5</td>
<td>0.0664%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">LongAlign_64k_context_lang_annotated_lang_6_no_think</td>
<td rowspan="12">6,249</td>
<td>MATH500</td>
<td>36</td>
<td>0.5761%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>2</td>
<td>0.0320%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>4</td>
<td>0.0640%</td>
<td>164</td>
<td>4</td>
<td>2.44%</td>
</tr>
<tr>
<td>MBPP</td>
<td>2</td>
<td>0.0320%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>3</td>
<td>0.0480%</td>
<td>541</td>
<td>2</td>
<td>0.3697%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>5</td>
<td>0.0800%</td>
<td>805</td>
<td>4</td>
<td>0.4969%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>6</td>
<td>0.0960%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">Mixture_of_Thoughts_science_no_think</td>
<td rowspan="12">86,110</td>
<td>MATH500</td>
<td>38</td>
<td>0.0441%</td>
<td>500</td>
<td>5</td>
<td>1.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0012%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>3</td>
<td>0.0035%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">OpenHermes_2.5_no_think</td>
<td rowspan="12">384,900</td>
<td>MATH500</td>
<td>46</td>
<td>0.0120%</td>
<td>500</td>
<td>6</td>
<td>1.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>2</td>
<td>0.0005%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>7</td>
<td>0.0018%</td>
<td>805</td>
<td>5</td>
<td>0.6211%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">OpenThoughts3_1.2M_no_think_no_think</td>
<td rowspan="12">435,193</td>
<td>MATH500</td>
<td>213</td>
<td>0.0489%</td>
<td>500</td>
<td>31</td>
<td>6.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0002%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0005%</td>
<td>164</td>
<td>2</td>
<td>1.22%</td>
</tr>
<tr>
<td>MBPP</td>
<td>39</td>
<td>0.0090%</td>
<td>974</td>
<td>5</td>
<td>0.5133%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>12</td>
<td>0.0028%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>4</td>
<td>0.0009%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">OpenThoughts3_1.2M_think</td>
<td rowspan="12">1,133,524</td>
<td>MATH500</td>
<td>267</td>
<td>0.0236%</td>
<td>500</td>
<td>32</td>
<td>6.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0001%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>10</td>
<td>0.0009%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>52</td>
<td>0.0046%</td>
<td>974</td>
<td>6</td>
<td>0.6160%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>25</td>
<td>0.0022%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>13</td>
<td>0.0011%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">aya_dataset_Qwen3_32B_think</td>
<td rowspan="12">15,222</td>
<td>MATH500</td>
<td>1</td>
<td>0.0066%</td>
<td>500</td>
<td>1</td>
<td>0.2000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">hermes_function_calling_v1_no_think</td>
<td rowspan="12">16,292</td>
<td>MATH500</td>
<td>30</td>
<td>0.1841%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>1</td>
<td>0.0061%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0061%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">multi_turn_reasoning_if_think</td>
<td rowspan="12">84,651</td>
<td>MATH500</td>
<td>0</td>
<td>0.0000%</td>
<td>500</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>6</td>
<td>0.0071%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>17</td>
<td>0.0201%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">s1k_1.1_think</td>
<td rowspan="12">835</td>
<td>MATH500</td>
<td>27</td>
<td>3.23%</td>
<td>500</td>
<td>12</td>
<td>2.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>36</td>
<td>4.31%</td>
<td>515</td>
<td>37</td>
<td>7.18%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>1</td>
<td>0.1198%</td>
<td>164</td>
<td>1</td>
<td>0.6098%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smolagents_toolcalling_traces_think</td>
<td rowspan="12">9,079</td>
<td>MATH500</td>
<td>66</td>
<td>0.7270%</td>
<td>500</td>
<td>18</td>
<td>3.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_everyday_convs_reasoning_Qwen3_32B_think</td>
<td rowspan="12">4,114</td>
<td>MATH500</td>
<td>0</td>
<td>0.0000%</td>
<td>500</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_multilingual8_Qwen3_32B_think</td>
<td rowspan="12">244,736</td>
<td>MATH500</td>
<td>20</td>
<td>0.0082%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0008%</td>
<td>164</td>
<td>1</td>
<td>0.6098%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0008%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_multilingual_8languages_lang_5_no_think</td>
<td rowspan="12">254,047</td>
<td>MATH500</td>
<td>20</td>
<td>0.0079%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0008%</td>
<td>164</td>
<td>1</td>
<td>0.6098%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0008%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_everyday_conversations_no_think</td>
<td rowspan="12">8,880</td>
<td>MATH500</td>
<td>0</td>
<td>0.0000%</td>
<td>500</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0113%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_explore_instruct_rewriting_no_think</td>
<td rowspan="12">30,391</td>
<td>MATH500</td>
<td>0</td>
<td>0.0000%</td>
<td>500</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>3</td>
<td>0.0099%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_smol_magpie_ultra_no_think</td>
<td rowspan="12">1,220,529</td>
<td>MATH500</td>
<td>155</td>
<td>0.0127%</td>
<td>500</td>
<td>37</td>
<td>7.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>1</td>
<td>0.0001%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AIME25</td>
<td>4</td>
<td>0.0003%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>1</td>
<td>0.0001%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0001%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>3</td>
<td>0.0002%</td>
<td>164</td>
<td>2</td>
<td>1.22%</td>
</tr>
<tr>
<td>MBPP</td>
<td>628</td>
<td>0.0515%</td>
<td>974</td>
<td>191</td>
<td>19.61%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>93</td>
<td>0.0076%</td>
<td>805</td>
<td>19</td>
<td>2.36%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>18</td>
<td>0.0015%</td>
<td>750</td>
<td>3</td>
<td>0.4000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_smol_rewrite_no_think</td>
<td rowspan="12">53,262</td>
<td>MATH500</td>
<td>4</td>
<td>0.0075%</td>
<td>500</td>
<td>1</td>
<td>0.2000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>8</td>
<td>0.0150%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_smol_summarize_no_think</td>
<td rowspan="12">96,061</td>
<td>MATH500</td>
<td>5</td>
<td>0.0052%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>6</td>
<td>0.0062%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_smollm3_systemchats_30k_no_think</td>
<td rowspan="12">106,622</td>
<td>MATH500</td>
<td>7</td>
<td>0.0066%</td>
<td>500</td>
<td>2</td>
<td>0.4000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>24</td>
<td>0.0225%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">smoltalk_systemchats_Qwen3_32B_think</td>
<td rowspan="12">27,436</td>
<td>MATH500</td>
<td>5</td>
<td>0.0182%</td>
<td>500</td>
<td>2</td>
<td>0.4000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>8</td>
<td>0.0292%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">table_gpt_Qwen3_32B_think</td>
<td rowspan="12">13,201</td>
<td>MATH500</td>
<td>17</td>
<td>0.1288%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>1</td>
<td>0.0076%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0076%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>4</td>
<td>0.0303%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">table_gpt_no_think</td>
<td rowspan="12">13,203</td>
<td>MATH500</td>
<td>17</td>
<td>0.1288%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>1</td>
<td>0.0076%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0076%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0151%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">tulu_3_sft_personas_instruction_following_no_think</td>
<td rowspan="12">29,970</td>
<td>MATH500</td>
<td>0</td>
<td>0.0000%</td>
<td>500</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>1</td>
<td>0.0033%</td>
<td>974</td>
<td>1</td>
<td>0.1027%</td>
</tr>
<tr>
<td>IFEval</td>
<td>8</td>
<td>0.0267%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>17</td>
<td>0.0567%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">xlam_traces_no_think</td>
<td rowspan="12">59,962</td>
<td>MATH500</td>
<td>16</td>
<td>0.0267%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>13</td>
<td>0.0217%</td>
<td>974</td>
<td>4</td>
<td>0.4107%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>1</td>
<td>0.0017%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
</tbody>
</table>
### Dataset summary
<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>9,568,775</td></tr>
<tr><td>Contaminated documents (removed)</td><td>5,810</td></tr>
<tr><td>Documents after decontamination</td><td>9,562,965</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.0607%</td></tr>
</tbody>
</table>
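
The summary figures can be checked against each other. The sketch below recomputes the post-decontamination count and the contamination rate from the two counts in the table; the variable names are illustrative and not part of any released tooling.

```python
# Counts copied from the summary table above.
total_documents = 9_568_775
contaminated_documents = 5_810

# Documents left once the contaminated ones are removed.
remaining_documents = total_documents - contaminated_documents

# Contamination rate as a percentage of the full dataset.
contamination_rate = 100 * contaminated_documents / total_documents

print(f"{remaining_documents:,}")    # 9,562,965
print(f"{contamination_rate:.4f}%")  # 0.0607%
```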
---
# SmolTalk2

## Dataset description
This dataset contains three subsets (`Mid`, `SFT`, `Preference`) that correspond to the three post-training phases of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B). Our [blog post](https://huggingface.co/blog/smollm3) describes how we used the data in each stage.
The specific weight of each subset is available in the training recipe in SmolLM's repository.
You can load specific splits of a subset with `load_dataset`:
```python
from datasets import load_dataset

# Load two splits of the SFT subset; a list of split names returns a list of datasets.
splits = ["Mixture_of_Thoughts_science_no_think", "table_gpt_no_think"]
ds_science, ds_table_gpt = load_dataset("HuggingFaceTB/smoltalk2", "SFT", split=splits)
```
## Dataset Composition
### Mid-Training (`Mid`)
The mid-training dataset has a total of 4.8M rows and is composed of 2 datasets that we decontaminate to remove samples present in the benchmarks used for evaluation.
The datasets are:
- [Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): 3.64M rows.
- [OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M): 1.14M rows.
### SFT (`SFT`)
The total mix consists of 25 datasets. We decontaminated them to remove samples present in the benchmarks used for evaluation, and we also removed samples containing emojis. We created the `chat_template_kwargs` column by extracting any system message or tool descriptions already present in the dataset.
We make a distinction between datasets with and without reasoning traces, denoted by the suffixes `think` and `no_think`, respectively. The 10 `think` datasets have a total of 1.5M rows, and the 15 `no_think` datasets have a total of 1.9M rows.
The `think` datasets are:
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (multilingual-8languages): 244736 rows generated with Qwen3-32B with the prompts in SmolTalk.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (everyday-conversations): 2057 rows generated with Qwen3-32B with the prompts in SmolTalk.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (systemchats-30k): 27436 rows generated with Qwen3-32B with the prompts in SmolTalk.
- LongAlign-64k-context-lang-annotated: 7526 rows generated with Qwen3-32B with the prompts in LongAlign-64k.
- [NEW] smolagents-toolcalling-traces: 9079 rows.
  - We generate tool calling data with reasoning traces using `deepseek-ai/DeepSeek-V3-0324`.
- [NEW] Multi-Turn IF: 28217 rows.
  - We follow [Multi-IF's approach](https://arxiv.org/abs/2410.15553) to generate multi-turn data. We source prompts from [Tulu 3 Personas IF](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), generate 2 verifiable turns using Qwen3-235B-A22B, and generate responses with Qwen3-32B in reasoning mode.
- [s1k-1.1](https://huggingface.co/datasets/open-r1/s1K-1.1): 835 rows.
- [OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M): 1133524 rows.
- [Aya](https://huggingface.co/datasets/CohereLabs/aya_dataset): 15222 rows generated with Qwen3-32B with the prompts in Aya.
- [Table-GPT](https://huggingface.co/datasets/LipengCS/Table-GPT): 13201 rows generated with Qwen3-32B with the prompts in Table-GPT.
The `no_think` datasets are:
- [NEW] [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (multilingual-8languages): 254047 rows.
  - Following the [Qwen 2.5 report](https://arxiv.org/pdf/2412.15115), we first translate the prompts in Smol-Magpie-Ultra and Smol-Constraints using Qwen to 8 languages (fr, es, it, pt, de, ar, ru, zh) while respecting local conventions (units, currency, etc.). We then use the model to generate answers for each translated instruction in the target language.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (everyday-conversations): 2260 rows.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (systemchats-30k): 33997 rows.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (smollm3_smol-magpie-ultra): 406843 rows.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (smollm3_explore-instruct-rewriting): 30391 rows.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (smollm3_smol-rewrite): 53262 rows.
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (smollm3_smol-summarize): 96061 rows.
- [Mixture of Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) (science): 86110 rows where we remove the reasoning trace.
- [Tulu 3 SFT Personas IF](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following): 29970 rows.
- [hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1): 8961 rows.
- [Table-GPT](https://huggingface.co/datasets/LipengCS/Table-GPT): 13203 rows.
- [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5): 384900 rows.
- [OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M): 435193 rows where we remove the reasoning trace.
- LongAlign-64k-context-lang-annotated (lang_6): 6249 rows. We filter [LongAlign](https://huggingface.co/datasets/THUDM/LongAlign-10k) for samples up to 64k tokens.
- xlam-traces: 59962 rows (listed in the SFT stats table as `xlam-traces_no_think`).
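
Because every name ending in `no_think` also ends in `think`, a plain suffix test on `think` matches both groups. Below is a minimal sketch of a safer classifier, assuming split names such as `table_gpt_no_think` and `table_gpt_Qwen3_32B_think`; `classify_split` is a hypothetical helper and not part of the `datasets` library.

```python
def classify_split(split_name: str) -> str:
    """Return "no_think", "think", or "unknown" based on the split-name suffix."""
    # Check the longer suffix first: "_no_think" also ends with "_think".
    if split_name.endswith("_no_think"):
        return "no_think"
    if split_name.endswith("_think"):
        return "think"
    return "unknown"


examples = ["table_gpt_no_think", "table_gpt_Qwen3_32B_think", "OpenThoughts3_1.2M"]
for name in examples:
    print(name, "->", classify_split(name))
```

Splits without either suffix, such as those in the `Mid` subset, fall through to `"unknown"` rather than being misfiled.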
### Preference Data (`Preference`)
We used two datasets to train SmolLM3-3B with APO, for a combined total of 447k rows. We generated the `think` dataset from the prompts of its `no_think` counterpart and decontaminated both using the same methods as in the other two stages. The datasets are:
- [Tulu 3 8B Preference Mixture (`no_think`)](https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-8b-preference-mixture): 231k rows.
- Tulu 3 8B Preference Mixture (`think`): 216k rows where we generate the chosen responses with Qwen3-32B and the rejected responses with Qwen3-0.6B.
## Dataset Stats
The tables below break down the training mix by dataset. The `Weight` column controls the number of examples we take from each dataset for training. You can find the full configuration files [here](https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm3).
### Mid-Training
| Dataset | Weight | # examples | % of examples | # tokens (M) | % of tokens | Avg. # turns | Avg. # tokens per example | Avg. # tokens in context | Avg. # tokens in response |
|---------------------------------------------------|----------|--------------|-----------------|----------------|---------------|----------------|-----------------------------|----------------------------|-----------------------------|
| Llama-Nemotron-Post-Training-Dataset_reasoning_r1 | 1 | 3644790 | 76.25 | 18707.9 | 53.19 | 2 | 5132.79 | 145 | 4987.79 |
| OpenThoughts3-1.2M | 1 | 1135104 | 23.75 | 16464.2 | 46.81 | 2 | 14504.5 | 219.68 | 14284.9 |
| Total | - | 4779894 | 100 | 35172.1 | 100 | 2 | 7358.34 | 162.73 | 7195.61 |
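
The `Total` row appears to aggregate over all examples rather than average the two rows. A quick check under that assumption, using only numbers from the table above (variable names are illustrative):

```python
# (examples, tokens in millions) per dataset, copied from the Mid-Training table.
rows = {
    "Llama-Nemotron-Post-Training-Dataset_reasoning_r1": (3_644_790, 18_707.9),
    "OpenThoughts3-1.2M": (1_135_104, 16_464.2),
}

total_examples = sum(n for n, _ in rows.values())
total_tokens_m = sum(t for _, t in rows.values())

# Average tokens per example over the whole mix, not the mean of the two row averages.
avg_tokens = total_tokens_m * 1e6 / total_examples

print(total_examples)            # 4779894
print(round(total_tokens_m, 1))  # 35172.1
print(round(avg_tokens))         # 7358
```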
### SFT
| Dataset | Weight | # examples | % of examples | # tokens (M) | % of tokens | Avg. # turns | Avg. # tokens per example | Avg. # tokens in context | Avg. # tokens in response |
|---------------------------------------------|----------|--------------|-----------------|----------------|---------------|----------------|-----------------------------|----------------------------|-----------------------------|
| smoltalk-smollm3_everyday-conversations_no_think | 1 | 2260 | 0.07 | 0.63 | 0 | 7.75 | 277.24 | 239.23 | 111.01 |
| smoltalk-smollm3_systemchats-30k_no_think | 1 | 33997 | 1 | 22.06 | 0.11 | 6.27 | 648.91 | 439.76 | 284.74 |
| tulu-3-sft-personas-instruction-following_no_think | 1 | 29970 | 0.89 | 13.83 | 0.07 | 2 | 461.46 | 136.72 | 397.74 |
| hermes-function-calling-v1_no_think | 1 | 8961 | 0.26 | 11.38 | 0.06 | 5.35 | 1270.06 | 1163.93 | 468.37 |
| smoltalk-smollm3_smol-magpie-ultra_no_think | 0.5 | 406843 | 12.03 | 619.05 | 3.21 | 6 | 1521.59 | 1072.52 | 522.07 |
| smoltalk-multilingual-8languages_lang_5_no_think | 1 | 254047 | 7.51 | 166.79 | 0.86 | 2 | 656.54 | 179.41 | 550.13 |
| table-gpt_no_think | 1 | 13203 | 0.39 | 11.49 | 0.06 | 2 | 870.39 | 787.81 | 155.58 |
| OpenHermes-2.5_no_think | 0.5 | 384900 | 11.38 | 158.23 | 0.82 | 2 | 411.1 | 269.39 | 214.71 |
| OpenThoughts3-1.2M_no_think_no_think | 0.4 | 435193 | 12.86 | 379.82 | 1.97 | 2 | 872.76 | 288.03 | 657.73 |
| Mixture-of-Thoughts_science_no_think | 1 | 86110 | 2.55 | 37.51 | 0.19 | 2 | 435.61 | 135.64 | 372.97 |
| smoltalk-smollm3_explore-instruct-rewriting_no_think | 1 | 30391 | 0.9 | 4.63 | 0.02 | 2 | 152.29 | 119.44 | 110.87 |
| smoltalk-smollm3_smol-rewrite_no_think | 1 | 53262 | 1.57 | 20.34 | 0.11 | 2 | 381.86 | 235.05 | 229.28 |
| smoltalk-smollm3_smol-summarize_no_think | 1 | 96061 | 2.84 | 51.82 | 0.27 | 2 | 539.47 | 442.18 | 182.86 |
| LongAlign-64k-context-lang-annotated_lang_6_no_think | 1 | 6249 | 0.18 | 95.78 | 0.5 | 2 | 15327.7 | 15126.2 | 274.55 |
| multi-turn-reasoning-if_think | 1 | 28217 | 0.83 | 97.62 | 0.51 | 6 | 3459.66 | 2404.17 | 1312.48 |
| smoltalk-everyday-convs-reasoning-Qwen3-32B_think | 1 | 2057 | 0.06 | 3.17 | 0.02 | 4 | 1539.37 | 393.76 | 1402.6 |
| smoltalk-systemchats-Qwen3-32B_think | 1 | 27436 | 0.81 | 29.84 | 0.15 | 2 | 1087.79 | 101.63 | 1059.73 |
| xlam-traces_no_think | 1 | 59962 | 1.77 | 29.4 | 0.15 | 2 | 490.25 | 431.42 | 455.84 |
| smolagents-toolcalling-traces_think | 1 | 9079 | 0.27 | 63.81 | 0.33 | 5.34 | 7028.12 | 6934.23 | 681.89 |
| s1k-1.1_think | 1 | 835 | 0.02 | 8.25 | 0.04 | 2 | 9876.31 | 387.87 | 9745.45 |
| LongAlign-64k-Qwen3-32B-yarn-131k_think | 1 | 7526 | 0.22 | 136.21 | 0.71 | 2 | 18099.2 | 16220.5 | 2135.73 |
| aya_dataset-Qwen3-32B_think | 1 | 15222 | 0.45 | 18.92 | 0.1 | 2 | 1242.73 | 301.34 | 1198.4 |
| smoltalk-multilingual8-Qwen3-32B_think | 0.3 | 244736 | 7.23 | 551.97 | 2.86 | 2 | 2255.38 | 363.63 | 2148.74 |
| OpenThoughts3-1.2M_think | 0.02 | 1133524 | 33.5 | 16734 | 86.74 | 2 | 14762.8 | 476.17 | 14543.6 |
| table-gpt-Qwen3-32B_think | 1 | 13201 | 0.39 | 25.92 | 0.13 | 2 | 1963.49 | 971.89 | 1248.6 |
| Total | - | 3383242 | 100 | 19292.4 | 100 | 2.58 | 5702.35 | 545.35 | 5317.08 |
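
If `Weight` is read as the fraction of each dataset's rows sampled for training (an assumption; the linked recipe files are the authoritative source), the effective contribution of a dataset is roughly `weight * # examples`. A sketch for a few rows of the table above:

```python
# (weight, number of examples) for a few rows of the SFT table.
sft_rows = {
    "OpenThoughts3-1.2M_think": (0.02, 1_133_524),
    "smoltalk-multilingual8-Qwen3-32B_think": (0.3, 244_736),
    "smoltalk-smollm3_smol-magpie-ultra_no_think": (0.5, 406_843),
    "table-gpt_no_think": (1, 13_203),
}

# Assumption: weight acts as a sampling fraction, so the expected number
# of sampled rows is weight * examples (rounded for display).
effective = {name: round(w * n) for name, (w, n) in sft_rows.items()}

for name, count in effective.items():
    print(f"{name}: ~{count:,} examples")
```

Under this reading, `OpenThoughts3-1.2M_think` holds 33.5% of the SFT rows but would contribute only about 22.7k sampled examples.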
### Preference Data
| Dataset | Weight | # examples | % of examples | Avg. # turns | Avg. # tokens in context | # tokens (M) (Chosen) | % of tokens (Chosen) | Avg. # tokens per example (Chosen) | Avg. # tokens in response (Chosen) |
|----------------------------------------------------------------|----------|--------------|-----------------|-------------------------|-------------------------------------|-------------------------|------------------------|--------------------------------------|--------------------------------------|
| llama_3.1_tulu_3_8b_preference_mixture_no_think | 0.5 | 230501 | 51.58 | 2 | 283.34 | 168.3 | 19.79 | 730.14 | 519.8 |
| tulu_3_8b_pref_mix_Qwen3_32B_Qwen3_0.6B_think | 0.25 | 216385 | 48.42 | 2 | 469.94 | 682.32 | 80.21 | 3153.27 | 2940.33 |
| Total | - | 446886 | 100 | 2 | 373.69 | 850.62 | 100 | 1903.44 | 1691.84 |
## License
All the new datasets (aya_dataset-Qwen3-32B, multi-turn-reasoning-if, smolagents-toolcalling-traces, smoltalk-everyday-convs-reasoning-Qwen3-32B, smoltalk-multilingual8-Qwen3-32B, smoltalk-systemchats-Qwen3-32B, table-gpt-Qwen3-32B, tulu_3_8b_pref_mix_qwen3_32b_qwen3_06b_think) are licensed under Apache 2.0. For the existing public datasets, please refer to the original dataset for the license.
Provider:
openeurollm



