openeurollm/Dolci-Think-SFT-32B-decontaminated
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/openeurollm/Dolci-Think-SFT-32B-decontaminated
Resource summary:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 77680944029
    num_examples: 2252837
  download_size: 77680944029
  dataset_size: 77680944029
license: odc-by
language:
- en
size_categories:
- 1M<n<10M
decontamination:
  source_dataset: allenai/Dolci-Think-SFT-32B
  benchmarks:
  - path: HuggingFaceH4/MATH-500
    subset: default
    split: test
  - path: HuggingFaceH4/aime_2024
    subset: default
    split: train
  - path: math-ai/aime25
    subset: default
    split: test
  - path: math-ai/amc23
    subset: default
    split: test
  - path: daman1209arora/jeebench
    subset: default
    split: test
  - path: Idavidrein/gpqa
    subset: gpqa_diamond
    split: train
  - path: ali-elganzory/livecodebench-code_generation_lite
    subset: release_v6
    split: test
  - path: openai/openai_humaneval
    subset: openai_humaneval
    split: test
  - path: google-research-datasets/mbpp
    subset: full
    split: train+test+validation+prompt
  - path: google/IFEval
    subset: default
    split: train
  - path: tatsu-lab/alpaca_eval
    subset: alpaca_eval
    split: eval
  - path: lmarena-ai/arena-hard-auto
    subset: default
    split: train
  contamination_stats:
  - subset: default
    split: train
    total: 2253684
    removed: 847
---
## Decontamination
This dataset is a decontaminated version of [allenai/Dolci-Think-SFT-32B](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-32B).
### Benchmarks used
- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train, data_files=['data/arena-hard-v2.0/question.jsonl'])
### Decontamination settings
<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>
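
The two settings above can be read as an n-gram overlap check. The following is a minimal sketch of that idea, not the pipeline actually used for this dataset: the card states only the n-gram size (8) and the match threshold (0.5), so the tokenizer (lowercased word tokens), the direction of the ratio (share of a benchmark item's n-grams found in a training document), the `>=` comparison, and all function names are assumptions made for illustration.

```python
import re

NGRAM_SIZE = 8         # "N-gram size" from the settings table
MATCH_THRESHOLD = 0.5  # "Match threshold" from the settings table


def tokenize(text):
    # Assumption: lowercase word tokens; the card does not specify a tokenizer.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n=NGRAM_SIZE):
    # Set of all contiguous n-token windows; empty if the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, document_text, n=NGRAM_SIZE):
    # Assumption: the ratio is the share of benchmark n-grams that also occur
    # in the training document.
    bench = ngrams(tokenize(benchmark_text), n)
    if not bench:
        return 0.0
    doc = ngrams(tokenize(document_text), n)
    return len(bench & doc) / len(bench)


def is_contaminated(benchmark_text, document_text,
                    n=NGRAM_SIZE, threshold=MATCH_THRESHOLD):
    # Assumption: a document is flagged when the ratio reaches the threshold.
    return overlap_ratio(benchmark_text, document_text, n) >= threshold
```

Under these assumptions, a training document that quotes a benchmark question verbatim scores 1.0 and is flagged, while a benchmark item shorter than eight tokens can never match because it yields no 8-grams.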
### Split and benchmark details
<table>
<thead>
<tr>
<th>Subset</th>
<th>Split</th>
<th>Docs in split (dataset)</th>
<th>Benchmark</th>
<th>Contaminated (dataset)</th>
<th>Contamination rate (dataset)</th>
<th>Docs (benchmark)</th>
<th>Contaminated (benchmark)</th>
<th>Contamination rate (benchmark)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="12">default</td>
<td rowspan="12">train</td>
<td rowspan="12">2,346,939</td>
<td>MATH500</td>
<td>325</td>
<td>0.0138%</td>
<td>500</td>
<td>68</td>
<td>13.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>16</td>
<td>0.0007%</td>
<td>40</td>
<td>3</td>
<td>7.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>35</td>
<td>0.0015%</td>
<td>1055</td>
<td>9</td>
<td>0.8531%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>32</td>
<td>0.0014%</td>
<td>164</td>
<td>9</td>
<td>5.49%</td>
</tr>
<tr>
<td>MBPP</td>
<td>300</td>
<td>0.0128%</td>
<td>974</td>
<td>119</td>
<td>12.22%</td>
</tr>
<tr>
<td>IFEval</td>
<td>37</td>
<td>0.0016%</td>
<td>541</td>
<td>17</td>
<td>3.14%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>79</td>
<td>0.0034%</td>
<td>805</td>
<td>29</td>
<td>3.60%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>23</td>
<td>0.0010%</td>
<td>750</td>
<td>6</td>
<td>0.8000%</td>
</tr>
</tbody>
</table>
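
As a sanity check, the rates in the table above can be recomputed from its counts. The sketch below copies the counts from the table; the variable and function names are illustrative only. Note that the dataset-side rates use the 2,346,939 figure from the "Docs in split (dataset)" column, which differs from the 2,253,684 total in the dataset summary; the card does not explain the difference.

```python
SPLIT_DOCS = 2_346_939  # "Docs in split (dataset)" from the table

# benchmark: (contaminated dataset docs, benchmark docs, contaminated benchmark docs)
ROWS = {
    "MATH500": (325, 500, 68),
    "AIME24": (0, 30, 0),
    "AIME25": (0, 30, 0),
    "AMC23": (16, 40, 3),
    "JEEBench": (0, 515, 0),
    "GPQADiamond": (0, 198, 0),
    "LiveCodeBench": (35, 1055, 9),
    "HumanEval": (32, 164, 9),
    "MBPP": (300, 974, 119),
    "IFEval": (37, 541, 17),
    "AlpacaEval": (79, 805, 29),
    "Arena-Hard-v2.0": (23, 750, 6),
}


def rate_percent(part, whole):
    # Share of `part` in `whole`, expressed as a percentage.
    return 100.0 * part / whole


def recompute(rows=ROWS, split_docs=SPLIT_DOCS):
    # Returns {benchmark: (dataset-side rate, benchmark-side rate)}, both in percent.
    return {
        name: (
            round(rate_percent(dataset_hits, split_docs), 4),
            round(rate_percent(bench_hits, bench_docs), 4),
        )
        for name, (dataset_hits, bench_docs, bench_hits) in rows.items()
    }


if __name__ == "__main__":
    for name, (dataset_rate, bench_rate) in recompute().items():
        print(f"{name}: dataset {dataset_rate:.4f}%, benchmark {bench_rate:.4f}%")
```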
### Dataset summary
<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>2,253,684</td></tr>
<tr><td>Contaminated documents (removed)</td><td>847</td></tr>
<tr><td>Documents after decontamination</td><td>2,252,837</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.0376%</td></tr>
</tbody>
</table>
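
The summary figures are internally consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers from the tables in this card; the names are illustrative. The per-benchmark counts sum to exactly 847, which is consistent with each removed document being attributed to a single benchmark, although the card does not state this explicitly.

```python
TOTAL_DOCS = 2_253_684   # "Total documents in dataset"
REMOVED_DOCS = 847       # "Contaminated documents (removed)"

# "Contaminated (dataset)" column of the per-benchmark table.
PER_BENCHMARK_REMOVED = {
    "MATH500": 325, "AIME24": 0, "AIME25": 0, "AMC23": 16,
    "JEEBench": 0, "GPQADiamond": 0, "LiveCodeBench": 35,
    "HumanEval": 32, "MBPP": 300, "IFEval": 37,
    "AlpacaEval": 79, "Arena-Hard-v2.0": 23,
}

remaining_docs = TOTAL_DOCS - REMOVED_DOCS
contamination_rate = round(100.0 * REMOVED_DOCS / TOTAL_DOCS, 4)  # percent
per_benchmark_sum = sum(PER_BENCHMARK_REMOVED.values())

if __name__ == "__main__":
    print(f"remaining: {remaining_docs:,}")
    print(f"rate: {contamination_rate:.4f}%")
    print(f"sum of per-benchmark counts: {per_benchmark_sum}")
```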
---
# Dolci-Think-SFT
Sources include a mixture of existing reasoning traces:
* [OpenThoughts 3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M) (Apache 2.0): Extended to 32K context length, with code prompts downsampled to a 16X multiple, for 941,164 total prompts. Access our version, Dolci OpenThoughts 3 here.
* [SYNTHETIC-2](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified) (Apache 2.0) via the SFT-Verified split, 104,568 prompts.
* [Nemotron Post-training dataset](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) (CC BY 4.0), code split only, 113,777 prompts.
New prompts and new reasoning traces from us (all ODC-BY-1.0):
* Dolci Think Persona IF: New precise instruction following prompts and traces created with [Nvidia's Nemotron Post-training Personas](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA). 220,530 prompts.
* Dolci Precise IF: New multi-constraint instruction following data building off Pyatkin, Valentina, et al. "[Generalizing Verifiable Instruction Following](https://arxiv.org/abs/2507.02833)." (2025). 135,722 prompts.
* [Dolci Think Python](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-Python): 466,676 prompts (subsampled from larger mix).
Existing prompts with new reasoning traces, largely repurposed from Tülu 3 / OLMo 2, with new traces generated by a mix of DeepSeek R1 and DeepSeek R1 0528:
* [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M) (ODC-BY-1.0), 76,209 prompts.
* [OpenAssistant Guanaco](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0), 6,647 prompts.
* [CoCoNot](https://huggingface.co/datasets/allenai/coconot) (ODC-BY-1.0), 9,549 prompts.
* [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix) (Apache 2.0), 36,673 prompts.
* [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) (ODC-BY-1.0), 40,002 prompts.
* [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) (Apache 2.0), 97,156 prompts.
* [TableGPT](https://huggingface.co/datasets/LipengCS/Table-GPT) (MIT), 4,973 prompts.
* Olmo Identity Prompts, 58 samples (we trained with 290, i.e. 5 repetitions per prompt, and uploaded a single repetition to Hugging Face).
The counts are smaller than the original prompt sources pulled from Tülu 3 / OLMo 2 because of more extensive data-quality filtering and topic filtering within the Azure API (blocked requests).
This dataset was used for 32B post-training; the [7B dataset](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B) is slightly different.
## Dataset Structure
Each example in the dataset contains the standard instruction-tuning fields, as follows:
- `id` (str): a unique identifier
- `messages` (list): the message format used for supervised fine-tuning (contains the user prompts and assistant responses)
- `source` (str): the source dataset for the given sample
Every datapoint contains the model's reasoning in `<think>...</think>` and NO `<answer>...</answer>` tags -- the answer follows directly after `</think>`.
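
The structure described above can be handled with a small helper that separates the reasoning from the final answer. This is a minimal sketch rather than an official loader: the function names are hypothetical, the role string `"assistant"` is an assumption (the card lists the `role` field but not its values), and `stream_examples` assumes the `datasets` library is installed and that the repository exposes a `train` split as declared in the metadata.

```python
def split_reasoning(content):
    """Split an assistant message into (reasoning, answer).

    The reasoning is the text inside <think>...</think>; the answer is
    whatever follows the closing tag. If the tags are missing, the whole
    message is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = content.find(open_tag)
    end = content.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", content.strip()
    reasoning = content[start + len(open_tag):end].strip()
    answer = content[end + len(close_tag):].strip()
    return reasoning, answer


def assistant_turns(example):
    # Assumption: assistant messages carry role == "assistant".
    return [
        split_reasoning(message["content"])
        for message in example["messages"]
        if message["role"] == "assistant"
    ]


def stream_examples(repo_id="openeurollm/Dolci-Think-SFT-32B-decontaminated"):
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset
    # Streaming avoids downloading the full dataset before iterating.
    return load_dataset(repo_id, split="train", streaming=True)
```

For example, `assistant_turns(next(iter(stream_examples())))` would return the reasoning and answer pairs of the first streamed example, assuming network access to the Hub.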
## Model Family
| **Stage** | **Olmo 3 7B Think** | **Olmo 3 32B Think** | **Olmo 3 7B Instruct** |
|--------------------------|-----------------------|------------------------|---------------------------|
| **Base Model** | [Olmo-3-7B](https://huggingface.co/allenai/Olmo-3-1025-7B) | [Olmo-3-32B](https://huggingface.co/allenai/Olmo-3-1125-32B) | [Olmo-3-7B](https://huggingface.co/allenai/Olmo-3-1025-7B) |
| **SFT** | [Olmo-3-7B-Think-SFT](https://huggingface.co/allenai/Olmo-3-7B-Think-SFT) | [Olmo-3-32B-Think-SFT](https://huggingface.co/allenai/Olmo-3-32B-Think-SFT) | [Olmo-3-7B-Instruct-SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT) |
| **DPO** | [Olmo-3-7B-Think-DPO](https://huggingface.co/allenai/Olmo-3-7B-Think-DPO) | [Olmo-3-32B-Think-DPO](https://huggingface.co/allenai/Olmo-3-32B-Think-DPO) | [Olmo-3-7B-Instruct-DPO](https://huggingface.co/allenai/Olmo-3-7B-Instruct-DPO) |
| **Final Models (RLVR)** | [Olmo-3-7B-Think](https://huggingface.co/allenai/Olmo-3-7B-Think) | [Olmo-3-32B-Think](https://huggingface.co/allenai/Olmo-3-32B-Think) | [Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) |
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
```
@misc{olmo2025olmo3,
title={Olmo 3},
author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2025},
eprint={2512.13961},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.13961},
}
```
Provider:
openeurollm



