
Dolci-Instruct-SFT

ModelScope community, updated 2025-12-05, indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/allenai/Dolci-Instruct-SFT
Description:
# Dolci Instruct SFT Mixture

*Note that this collection is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).*

The Dolci Instruct SFT mixture was used to train [Olmo 3 7B Instruct SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT). It contains 2,152,112 samples from the following sets.

Sources include a mixture of existing prompts:

- [OpenThoughts 3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M) (Apache 2.0): extended to 32K context length, with code prompts downsampled to a 16X multiple, for 941,166 total prompts; reasoning traces removed for instruct, 99,268 prompts.
- [CoCoNot](https://huggingface.co/datasets/allenai/coconot) (ODC-BY-1.0), 10,957 prompts (Brahman et al., 2024)
- [FLAN v2](https://github.com/google-research/FLAN/tree/main) via [`ai2-adapt-dev/flan_v2_converted`](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted), 89,981 prompts (Longpre et al., 2023)
- [OpenAssistant Guanaco](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0), 7,132 prompts (Kopf et al., 2024)
- [Tulu 3 Persona MATH](https://huggingface.co/datasets/allenai/tulu-3-personas-math) (ODC-BY-1.0), 149,958 prompts
- [Tulu 3 Persona GSM](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade) (ODC-BY-1.0), 49,980 prompts
- [Tulu 3 Persona Python](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-code) (ODC-BY-1.0), 34,999 prompts
- [Tulu 3 Persona Algebra](https://huggingface.co/datasets/allenai/tulu-3-personas-algebra) (ODC-BY-1.0), 19,999 prompts
- [Tulu 3 WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix) (Apache 2.0), 49,373 prompts (Han et al., 2024)
- [Tulu 3 WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) (ODC-BY-1.0), 49,965 prompts (Wildteaming, 2024)
- [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) (Apache 2.0), 99,987 prompts (Singh et al., 2024)
- [TableGPT](https://huggingface.co/datasets/LipengCS/Table-GPT) (MIT), 5,000 prompts (Zha et al., 2023)
- [SciRIFF](https://huggingface.co/datasets/allenai/SciRIFF) (ODC-BY-1.0), 4,557 prompts (Wadden et al., 2024)
- [Evol CodeAlpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) (Apache 2.0), 107,270 prompts (Luo et al., 2023)

And new prompts from us:

- Dolci Tülu 3 Precise IF, 136,833 prompts
- Dolci Instruct Python Algorithms, 186,345 prompts
- WildChat with upgraded responses from GPT-4.1 (ODC-BY-1.0), 302,406 prompts (Zhao et al., 2024)
- Logic puzzles, 159,882 prompts
- Verifiable reasoning, 310,572 prompts
- New hardcoded data, 69 prompts
- Dolci Instruct Tool Use, 227,579 prompts

The counts are smaller than those of the original prompt sources pulled from Tülu 3 / OLMo 2 because of more extensive filtering for data quality and topic-based filtering within the Azure API (blocked requests).
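
The per-source counts above can be tallied to see how the mixture is composed. The sketch below is only an illustration, not part of the official card: the dictionary keys are shortened labels chosen here, and the helper names `tally` and `mixture_shares` are hypothetical. As transcribed from the lists above, the counts sum to 2,102,112, which is 50,000 below the stated total of 2,152,112; the card does not say where the difference comes from.

```python
# Per-source prompt counts copied from the lists above.
# Keys are shortened labels chosen for this sketch, not official subset names.
EXISTING_SOURCES = {
    "OpenThoughts 3": 99_268,
    "CoCoNot": 10_957,
    "FLAN v2": 89_981,
    "OpenAssistant Guanaco": 7_132,
    "Tulu 3 Persona MATH": 149_958,
    "Tulu 3 Persona GSM": 49_980,
    "Tulu 3 Persona Python": 34_999,
    "Tulu 3 Persona Algebra": 19_999,
    "Tulu 3 WildGuardMix": 49_373,
    "Tulu 3 WildJailbreak": 49_965,
    "Aya": 99_987,
    "TableGPT": 5_000,
    "SciRIFF": 4_557,
    "Evol CodeAlpaca": 107_270,
}

NEW_SOURCES = {
    "Dolci Tulu 3 Precise IF": 136_833,
    "Dolci Instruct Python Algorithms": 186_345,
    "WildChat (GPT-4.1 responses)": 302_406,
    "Logic puzzles": 159_882,
    "Verifiable reasoning": 310_572,
    "New hardcoded data": 69,
    "Dolci Instruct Tool Use": 227_579,
}

# Total sample count as stated in the card.
STATED_TOTAL = 2_152_112


def tally(*groups):
    """Sum the prompt counts across one or more source dictionaries."""
    return sum(sum(group.values()) for group in groups)


def mixture_shares(*groups):
    """Return each source's fraction of the listed total, largest first."""
    merged = {}
    for group in groups:
        merged.update(group)
    total = sum(merged.values())
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return {name: count / total for name, count in ranked}


if __name__ == "__main__":
    listed_total = tally(EXISTING_SOURCES, NEW_SOURCES)
    print(f"existing sources: {tally(EXISTING_SOURCES):,}")
    print(f"new sources:      {tally(NEW_SOURCES):,}")
    print(f"listed total:     {listed_total:,}")
    print(f"stated total:     {STATED_TOTAL:,}")
    print(f"difference:       {STATED_TOTAL - listed_total:,}")
    for name, share in mixture_shares(EXISTING_SOURCES, NEW_SOURCES).items():
        print(f"{share:6.2%}  {name}")
```

Shares are computed against the listed total rather than the stated one, so they always sum to 1 regardless of the gap.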

Provider:
maas
Created:
2025-11-21