tulu-3-sft-olmo-2-mixture
收藏魔搭社区2025-12-05 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/allenai/tulu-3-sft-olmo-2-mixture
下载链接
链接失效反馈官方服务:
资源简介:
*Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact.*
The OLMo v2 SFT mixture was used to train the [OLMo models](https://huggingface.co/collections/allenai/olmo-v2-models-6744f0938a9e7c6340140de8).
It contains 939,344 samples from the following sets:
- [CoCoNot](https://huggingface.co/datasets/allenai/coconot) (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024)
- [FLAN v2](https://github.com/google-research/FLAN/tree/main) via [`ai2-adapt-dev/flan_v2_converted`](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted), 89,982 prompts (Longpre et al., 2023)
- [No Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) (CC-BY-NC-4.0), 9,500 prompts (Rajani et al. 2023)
- [OpenAssistant Guanaco](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0), 7,132 prompts (Kopf et al., 2024)
- [Tulu 3 Persona MATH](https://huggingface.co/datasets/allenai/tulu-3-personas-math) (ODC-BY-1.0), 149,960 prompts
- [Tulu 3 Persona GSM](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade) (ODC-BY-1.0), 49,980 prompts
- [Tulu 3 Persona Python](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-code) (ODC-BY-1.0), 34,999 prompts
- [Tulu 3 Persona Algebra](https://huggingface.co/datasets/allenai/tulu-3-personas-algebra) (ODC-BY-1.0), 20,000 prompts
- [Tulu 3 Persona IF](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) (ODC-BY-1.0), 29,980 prompts
- [NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (Apache 2.0), 64,312 prompts (Beeching et al. 2024)
- [Tulu 3 WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix) (Apache 2.0), 50,000 prompts (Han et al., 2024)
- [Tulu 3 WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) (ODC-BY-1.0), 50,000 prompts (Wildteaming, 2024)
- [OLMo 2 Hardcoded](https://huggingface.co/datasets/allenai/olmo-2-hard-coded) (CC-BY-4.0), 240 prompts
- [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) (Apache 2.0), 100,000 prompts (Singh et al., 2024)
- [WildChat GPT-4](https://huggingface.co/datasets/allenai/WildChat-1M) (ODC-BY-1.0), 100,000 prompts (Zhao et al., 2024)
- [TableGPT](https://huggingface.co/datasets/LipengCS/Table-GPT) (MIT), 5,000 prompts (Zha et al., 2023)
- [SciRIFF](https://huggingface.co/datasets/allenai/SciRIFF) (ODC-BY-1.0), 10,000 prompts (Wadden et al., 2024)
- [Evol CodeAlpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) (Apache 2.0), 107,276 prompts (Luo et al., 2023)
## Dataset Structure
Each example in the dataset contains the standard instruction-tuning data points as follow:
- `id` (str): a unique identifier
- `messages` (list): message format used for supervised fine-tuning (this contains user prompt and assistant responses)
- `source` (str): the source dataset for the given sample
### Model Family
| **Stage** | **OLMo-2-1124-7B** | **OLMo-2-1124-13B** |
|----------------------|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| **Base Model** | [OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo2-7B-1124) | [OLMo-2-1124-13B](https://huggingface.co/allenai/OLMo2-13B-1124) |
| **SFT** | [OLMo-2-1124-7B-SFT](https://huggingface.co/allenai/OLMo-2-1124-7B-SFT) | [allenai/OLMo-2-1124-13B-SFT](https://huggingface.co/allenai/OLMo-2-1124-13B-SFT) |
| **DPO** | [OLMo-2-1124-7B-DPO](https://huggingface.co/allenai/OLMo-2-1124-7B-DPO) | [allenai/OLMo-2-1124-13B-DPO](https://huggingface.co/allenai/OLMo-2-1124-13B-DPO) |
## License
This dataset is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third party models that are subject to separate terms governing their use. For more information on license and terms, consult each subset linked above.
## Citation
If OLMo or any of the related materials were helpful to your work, please cite:
*请注意,本数据集集合采用ODC-BY-1.0许可协议;数据子集适用不同的许可条款。部分数据集子集仅可用于非商业用途。本混合数据集作为研究成果发布。
OLMo v2 监督微调(Supervised Fine-Tuning, SFT)混合数据集用于训练[OLMo模型](https://huggingface.co/collections/allenai/olmo-v2-models-6744f0938a9e7c6340140de8)。该数据集包含939,344条样本,源自以下数据集子集:
- CoCoNot数据集(CoCoNot):链接至https://huggingface.co/datasets/allenai/coconot,采用ODC-BY-1.0许可协议,包含10,983条提示词(prompt)(Brahman等,2024年)
- FLAN v2数据集(FLAN v2):通过`ai2-adapt-dev/flan_v2_converted`转换,链接至https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted,包含89,982条提示词(prompt)(Longpre等,2023年)
- No Robots数据集(No Robots):链接至https://huggingface.co/datasets/HuggingFaceH4/no_robots,采用CC-BY-NC-4.0许可协议,包含9,500条提示词(prompt)(Rajani等,2023年)
- OpenAssistant Guanaco数据集(OpenAssistant Guanaco):链接至https://huggingface.co/datasets/OpenAssistant/oasst1,采用Apache 2.0许可协议,包含7,132条提示词(prompt)(Kopf等,2024年)
- Tulu 3 Persona MATH数据集(Tulu 3 Persona MATH):链接至https://huggingface.co/datasets/allenai/tulu-3-personas-math,采用ODC-BY-1.0许可协议,包含149,960条提示词(prompt)
- Tulu 3 Persona GSM数据集(Tulu 3 Persona GSM):链接至https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade,采用ODC-BY-1.0许可协议,包含49,980条提示词(prompt)
- Tulu 3 Persona Python数据集(Tulu 3 Persona Python):链接至https://huggingface.co/datasets/allenai/tulu-3-sft-personas-code,采用ODC-BY-1.0许可协议,包含34,999条提示词(prompt)
- Tulu 3 Persona Algebra数据集(Tulu 3 Persona Algebra):链接至https://huggingface.co/datasets/allenai/tulu-3-personas-algebra,采用ODC-BY-1.0许可协议,包含20,000条提示词(prompt)
- Tulu 3 Persona IF数据集(Tulu 3 Persona IF):链接至https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following,采用ODC-BY-1.0许可协议,包含29,980条提示词(prompt)
- NuminaMath-TIR数据集(NuminaMath-TIR):链接至https://huggingface.co/datasets/AI-MO/NuminaMath-TIR,采用Apache 2.0许可协议,包含64,312条提示词(prompt)(Beeching等,2024年)
- Tulu 3 WildGuardMix数据集(Tulu 3 WildGuardMix):链接至https://huggingface.co/datasets/allenai/wildguardmix,采用Apache 2.0许可协议,包含50,000条提示词(prompt)(Han等,2024年)
- Tulu 3 WildJailbreak数据集(Tulu 3 WildJailbreak):链接至https://huggingface.co/datasets/allenai/wildjailbreak,采用ODC-BY-1.0许可协议,包含50,000条提示词(prompt)(Wildteaming,2024年)
- OLMo 2 Hardcoded数据集(OLMo 2 Hardcoded):链接至https://huggingface.co/datasets/allenai/olmo-2-hard-coded,采用CC-BY-4.0许可协议,包含240条提示词(prompt)
- Aya数据集(Aya):链接至https://huggingface.co/datasets/CohereForAI/aya_dataset,采用Apache 2.0许可协议,包含100,000条提示词(prompt)(Singh等,2024年)
- WildChat GPT-4数据集(WildChat GPT-4):链接至https://huggingface.co/datasets/allenai/WildChat-1M,采用ODC-BY-1.0许可协议,包含100,000条提示词(prompt)(Zhao等,2024年)
- TableGPT数据集(TableGPT):链接至https://huggingface.co/datasets/LipengCS/Table-GPT,采用MIT许可协议,包含5,000条提示词(prompt)(Zha等,2023年)
- SciRIFF数据集(SciRIFF):链接至https://huggingface.co/datasets/allenai/SciRIFF,采用ODC-BY-1.0许可协议,包含10,000条提示词(prompt)(Wadden等,2024年)
- Evol CodeAlpaca数据集(Evol CodeAlpaca):链接至https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1,采用Apache 2.0许可协议,包含107,276条提示词(prompt)(Luo等,2023年)
## 数据集结构
每条样本均包含标准的监督微调数据格式,具体如下:
- `id`(字符串类型):唯一标识符
- `messages`(列表类型):用于监督微调的消息格式,包含用户提示词与助手回复内容
- `source`(字符串类型):对应样本的来源数据集
### 模型家族
| **阶段** | **OLMo-2-1124-7B** | **OLMo-2-1124-13B** |
|----------------------|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| **基础模型** | [OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo2-7B-1124) | [OLMo-2-1124-13B](https://huggingface.co/allenai/OLMo2-13B-1124) |
| **监督微调(SFT)** | [OLMo-2-1124-7B-SFT](https://huggingface.co/allenai/OLMo-2-1124-7B-SFT) | [allenai/OLMo-2-1124-13B-SFT](https://huggingface.co/allenai/OLMo-2-1124-13B-SFT) |
| **直接偏好优化(DPO)** | [OLMo-2-1124-7B-DPO](https://huggingface.co/allenai/OLMo-2-1124-7B-DPO) | [allenai/OLMo-2-1124-13B-DPO](https://huggingface.co/allenai/OLMo-2-1124-13B-DPO) |
## 许可协议
本数据集采用ODC-BY-1.0许可协议,旨在用于研究与教育用途,需遵循艾伦人工智能研究所(Allen Institute for AI, Ai2)的[负责任使用指南](https://allenai.org/responsible-use)。本数据集包含由第三方模型生成的输出数据,此类数据受其自身独立使用条款约束。如需了解详细许可与使用条款,请查阅上文链接的各数据子集。
## 引用
若OLMo或其相关材料对你的研究工作有所帮助,请引用:
提供机构:
maas
创建时间:
2025-05-28
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集是一个混合监督微调数据集,包含来自多个子集的93万多个样本,主要用于训练OLMo模型。数据集采用多种许可证,部分子集限制非商业用途,适用于研究和教育目的。
以上内容由遇见数据集搜集并总结生成



