five

tulu-v3.1-mix-preview-4096-OLMoE

收藏
魔搭社区2025-12-05 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE
下载链接
链接失效反馈
官方服务:
资源简介:
# OLMoE SFT Mix The SFT mix used is an expanded version of the [Tulu v2 SFT mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture-olmo-4096) with new additions for code, [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction), reasoning, [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), and instruction following, [No Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) and a subset of [Daring Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater). Please see the referenced datasets for the multiple licenses used in subsequent data. We do not introduce any new data with this dataset. Config for creation via [`open-instruct`](https://github.com/allenai/open-instruct/blob/main/open_instruct/mix_data.py): ``` dataset_mixer: allenai/tulu-v2-sft-mixture-olmo-4096: 1.0 HuggingFaceH4/no_robots: 1.0 meta-math/MetaMathQA: 0.25 m-a-p/CodeFeedback-Filtered-Instruction: 1.0 ai2-adapt-dev/daring-anteater-specialized: 1.0 max_seq_length: 4096 ``` Reanming code: ``` def rename_messages(example): messages = example["messages"] new_messages = [] for m in messages: new_messages.append({"role": m["role"], "content":m["content"].replace("OLMo","OLMoE")}) example["messages"] = new_messages return example ``` Related datasets (for updated list, see [collection](https://huggingface.co/collections/allenai/tulu-3-data-mixes-66a944d48990fafa62c2c18c)) | Version | Name | Summary | Max Length | Model Name | |---------|------|---------|------------|------------| | v1 | [allenai/tulu-v1-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture) | | | | | v2 | [allenai/tulu-v2-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | | - | | | v2 | [allenai/tulu-v2-sft-mixture-olmo-2048](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture-olmo-2048) | | 2048 | OLMo-2048 | | v3.0 | [allenai/tulu-v3.0-mix-preview-4096-OLMo](https://huggingface.co/datasets/allenai/tulu-v3.0-mix-preview-4096-OLMo) | Tulu 2 + Math/Code + No Robots| 4096 | OLMo | | v3.0 | [allenai/tulu-v3.0-mix-preview-4096-OLMoE](https://huggingface.co/datasets/allenai/tulu-v3.0-mix-preview-4096-OLMoE) | OLMoE Name| 4096 | OLMoE | | v3.1 | [**allenai/tulu-v3.1-mix-preview-4096-OLMoE**](https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE) | Add specialized Daring Anteater | 4096 | OLMoE |

# OLMoE 监督微调(Supervised Fine-Tuning,SFT)混合数据集 本次使用的SFT混合数据集是[Tulu v2 SFT混合数据集](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture-olmo-4096)的扩展版本,新增了代码相关数据集[CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction)、推理相关数据集[MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)、指令遵循数据集[No Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots),以及[Daring Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater)的子集。 有关本数据集后续所使用的多种许可协议,请参阅所引用的各原始数据集。本数据集未引入任何全新的自有数据。 本数据集通过[`open-instruct`](https://github.com/allenai/open-instruct/blob/main/open_instruct/mix_data.py)工具生成,对应的配置参数如下: dataset_mixer: allenai/tulu-v2-sft-mixture-olmo-4096: 1.0 HuggingFaceH4/no_robots: 1.0 meta-math/MetaMathQA: 0.25 m-a-p/CodeFeedback-Filtered-Instruction: 1.0 ai2-adapt-dev/daring-anteater-specialized: 1.0 max_seq_length: 4096 以下为消息重命名处理代码: python def rename_messages(example): messages = example["messages"] new_messages = [] for m in messages: new_messages.append({"role": m["role"], "content":m["content"].replace("OLMo","OLMoE")}) example["messages"] = new_messages return example 相关数据集(完整列表请参阅[数据集集合](https://huggingface.co/collections/allenai/tulu-3-data-mixes-66a944d48990fafa62c2c18c)): | 版本号 | 名称 | 摘要 | 最大序列长度 | 模型名称 | |---------|------|---------|------------|------------| | v1 | [allenai/tulu-v1-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture) | 无 | 无 | 无 | | v2 | [allenai/tulu-v2-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | 无 | - | 无 | | v2 | [allenai/tulu-v2-sft-mixture-olmo-2048](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture-olmo-2048) | 无 | 2048 | OLMo-2048 | | v3.0 | [allenai/tulu-v3.0-mix-preview-4096-OLMo](https://huggingface.co/datasets/allenai/tulu-v3.0-mix-preview-4096-OLMo) | Tulu 2 + 数学/代码 + No Robots | 4096 | OLMo | | v3.0 | [allenai/tulu-v3.0-mix-preview-4096-OLMoE](https://huggingface.co/datasets/allenai/tulu-v3.0-mix-preview-4096-OLMoE) | 适配OLMoE命名 | 4096 | OLMoE | | v3.1 | [**allenai/tulu-v3.1-mix-preview-4096-OLMoE**](https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE) | 新增专用版Daring Anteater数据集 | 4096 | OLMoE |
提供机构:
maas
创建时间:
2025-05-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作