Dolci-Instruct-SFT
ModelScope Community. Updated 2025-12-05. Indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/allenai/Dolci-Instruct-SFT
Resource description:
# Dolci Instruct SFT Mixture
*Note that this collection is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).*
The Dolci Instruct SFT mixture was used to train [Olmo 3 7B Instruct SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT).
It contains 2,152,112 samples from the following sets:
Sources include a mixture of existing prompts:
- [OpenThoughts 3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M) (Apache 2.0): extended to 32K context length, with code prompts downsampled to a 16X multiple, for 941,166 total prompts; with reasoning traces removed for the instruct mixture, 99,268 prompts.
- [CoCoNot](https://huggingface.co/datasets/allenai/coconot) (ODC-BY-1.0), 10,957 prompts (Brahman et al., 2024)
- [FLAN v2](https://github.com/google-research/FLAN/tree/main) via [`ai2-adapt-dev/flan_v2_converted`](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted), 89,981 prompts (Longpre et al., 2023)
- [OpenAssistant Guanaco](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0), 7,132 prompts (Kopf et al., 2024)
- [Tulu 3 Persona MATH](https://huggingface.co/datasets/allenai/tulu-3-personas-math) (ODC-BY-1.0), 149,958 prompts
- [Tulu 3 Persona GSM](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade) (ODC-BY-1.0), 49,980 prompts
- [Tulu 3 Persona Python](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-code) (ODC-BY-1.0), 34,999 prompts
- [Tulu 3 Persona Algebra](https://huggingface.co/datasets/allenai/tulu-3-personas-algebra) (ODC-BY-1.0), 19,999 prompts
- [Tulu 3 WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix) (Apache 2.0), 49,373 prompts (Han et al., 2024)
- [Tulu 3 WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) (ODC-BY-1.0), 49,965 prompts (Wildteaming, 2024)
- [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) (Apache 2.0), 99,987 prompts (Singh et al., 2024)
- [TableGPT](https://huggingface.co/datasets/LipengCS/Table-GPT) (MIT), 5,000 prompts (Zha et al., 2023)
- [SciRIFF](https://huggingface.co/datasets/allenai/SciRIFF) (ODC-BY-1.0), 4,557 prompts (Wadden et al., 2024)
- [Evol CodeAlpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) (Apache 2.0), 107,270 prompts (Luo et al., 2023)
And new prompts from us:
- Dolci Tülu 3 Precise IF: 136,833 prompts.
- Dolci Instruct Python Algorithms: 186,345 prompts.
- WildChat with upgraded responses from GPT-4.1 (ODC-BY-1.0), 302,406 prompts (Zhao et al., 2024)
- Logic puzzles, 159,882 prompts.
- Verifiable reasoning, 310,572 prompts.
- New hardcoded data, 69 prompts.
- Dolci Instruct Tool Use, 227,579 prompts.
The counts are smaller than the original prompt sources pulled from Tülu 3 / OLMo 2 because of more extensive filtering for data quality and topic-based filtering within the Azure API (blocked requests).
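The per-source counts listed above can be cross-checked against the stated total with a short tally. The sketch below is illustrative only: the dictionary and variable names are assumptions, and the numbers are copied from the two lists in this card. Summed as listed, the counts come to 2,102,112, which is 50,000 fewer than the stated 2,152,112 samples; the card does not say where the difference comes from.

```python
# Prompt counts copied from the card's "existing prompts" list.
# The dictionary names are illustrative, not part of the dataset.
EXISTING = {
    "OpenThoughts 3": 99_268,
    "CoCoNot": 10_957,
    "FLAN v2": 89_981,
    "OpenAssistant Guanaco": 7_132,
    "Tulu 3 Persona MATH": 149_958,
    "Tulu 3 Persona GSM": 49_980,
    "Tulu 3 Persona Python": 34_999,
    "Tulu 3 Persona Algebra": 19_999,
    "Tulu 3 WildGuardMix": 49_373,
    "Tulu 3 WildJailbreak": 49_965,
    "Aya": 99_987,
    "TableGPT": 5_000,
    "SciRIFF": 4_557,
    "Evol CodeAlpaca": 107_270,
}

# Prompt counts copied from the card's "new prompts" list.
NEW = {
    "Dolci Tulu 3 Precise IF": 136_833,
    "Dolci Instruct Python Algorithms": 186_345,
    "WildChat (GPT-4.1 responses)": 302_406,
    "Logic puzzles": 159_882,
    "Verifiable reasoning": 310_572,
    "New hardcoded data": 69,
    "Dolci Instruct Tool Use": 227_579,
}

# Total stated at the top of the card.
STATED_TOTAL = 2_152_112

existing_total = sum(EXISTING.values())
new_total = sum(NEW.values())
listed_total = existing_total + new_total
gap = STATED_TOTAL - listed_total

print(f"existing sources: {existing_total:,}")
print(f"new sources:      {new_total:,}")
print(f"listed total:     {listed_total:,}")
print(f"stated total:     {STATED_TOTAL:,}")
print(f"difference:       {gap:,}")

# Share of the listed total for the three largest sources.
merged = {**EXISTING, **NEW}
for name, count in sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{name}: {count / listed_total:.1%}")
```

The three largest entries by count are Verifiable reasoning, WildChat, and Dolci Instruct Tool Use, so the new prompts account for most of the listed mixture.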
Provider:
maas
Created:
2025-11-21



