Dolci-Instruct-RL
Source: ModelScope
Updated: 2026-01-06
Indexed: 2026-01-10
Download link: https://modelscope.cn/datasets/allenai/Dolci-Instruct-RL
Resource summary:
# Dolci-Instruct-RL
## Dataset Summary
**Dolci-Instruct-RL** is the reinforcement learning dataset used to train the *Olmo-3-7B-Instruct* model.
It contains **169,964** prompts spanning:
- Math
- Code
- Precise Instruction Following
- General Chat
The dataset aggregates multiple curated sources, applies extensive filtering, and produces a unified RL-ready prompt set.
---
## Dataset Composition
### **Total Samples:** 169,964
### **Original Dataset Contribution**
| Source Dataset | Count |
|----------------|-------|
| IF Multi-Constraint (IFBench/IFEval derived) | 37,568 |
| Multi-Subject RLVR ([paper](https://arxiv.org/abs/2503.23829v1)) | 18,971 |
| Tulu 3 Rewritten ([paper](https://arxiv.org/abs/2411.15124)) | 18,757 |
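| WildChat English General ([paper](https://arxiv.org/abs/2405.01470)) | 10,670 |

The three non-IF sources above (18,971 + 18,757 + 10,670 = 48,398) together account for the General RLVR Mix in the next table.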
### **Dataset Source Counts (Grouped Mixes)**
| Mix | Count |
|------|-------|
| General RLVR Mix | 48,398 |
| IF Multi-Constraint Mixture | 37,568 |
| AceCoder RLVR ([paper](https://arxiv.org/abs/2502.01718)) | 20,000 |
| OMEGA (Math) ([paper](https://arxiv.org/abs/2506.18880)) | 20,000 |
| ORZ Math (Open-Reasoner-Zero) ([paper](https://arxiv.org/abs/2503.24290)) | 14,000 |
| Polaris Math | 14,000 |
| MathSub-30K (KlearReasoner Math) ([paper](https://arxiv.org/abs/2508.07629)) | 8,998 |
| DAPO-Math ([paper](https://arxiv.org/abs/2503.14476)) | 7,000 |
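As a quick consistency check on the two tables, the sketch below re-adds the counts in Python and groups them by domain. The domain labels are an assumption taken from the "Data Sources & Description" section (Multi-Subject RLVR, Tulu 3 Rewritten and WildChat under general chat; the five math sources under math).

```python
from collections import Counter

# (mix name, domain, prompt count), copied from the grouped-mix table.
# Domain labels are assumed from the "Data Sources & Description" section.
MIXES = [
    ("General RLVR Mix", "general chat", 48_398),
    ("IF Multi-Constraint Mixture", "instruction following", 37_568),
    ("AceCoder RLVR", "code", 20_000),
    ("OMEGA (Math)", "math", 20_000),
    ("ORZ Math", "math", 14_000),
    ("Polaris Math", "math", 14_000),
    ("MathSub-30K", "math", 8_998),
    ("DAPO-Math", "math", 7_000),
]

# The three non-IF rows of the first table; their counts should add up
# to the "General RLVR Mix" row.
GENERAL_RLVR_SOURCES = {
    "Multi-Subject RLVR": 18_971,
    "Tulu 3 Rewritten": 18_757,
    "WildChat English General": 10_670,
}

# Overall total and per-domain totals.
total = sum(count for _, _, count in MIXES)
by_domain = Counter()
for _, domain, count in MIXES:
    by_domain[domain] += count

print(f"total: {total:,}")
for domain, count in by_domain.most_common():
    print(f"{domain:<24}{count:>8,}  {count / total:.1%}")
```

Run as-is, it confirms that the eight mixes sum to the stated 169,964 prompts and that math is the largest domain, at 63,998 prompts (roughly 38%).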
---
## Data Sources & Description
### **Instruction Following**
- Derived from IFBench-Train & IFEval-style prompts
- Strict multi-constraint format (up to 5 constraints)
- Normalized and filtered for safety and clarity
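The card does not publish the constraint schema or the reward used for these prompts. The following is a hypothetical sketch of how an IFEval-style multi-constraint prompt could be scored: each constraint is a small checker function, at most five are attached to a prompt, and the response earns credit only if every constraint holds. The constraint types (`max_words`, `include_keyword`, `all_lowercase`, `end_with`) and the all-or-nothing scoring are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical constraint representation; the dataset's real schema may differ.
@dataclass(frozen=True)
class Constraint:
    description: str
    check: Callable[[str], bool]


def max_words(n: int) -> Constraint:
    return Constraint(f"at most {n} words", lambda r: len(r.split()) <= n)


def include_keyword(word: str) -> Constraint:
    return Constraint(f"mention '{word}'", lambda r: word.lower() in r.lower())


def all_lowercase() -> Constraint:
    return Constraint("all lowercase", lambda r: r == r.lower())


def end_with(phrase: str) -> Constraint:
    return Constraint(f"end with '{phrase}'", lambda r: r.rstrip().endswith(phrase))


MAX_CONSTRAINTS = 5  # the card states prompts carry up to 5 constraints


def strict_reward(response: str, constraints: list[Constraint]) -> float:
    """Return 1.0 only if every constraint holds (assumed all-or-nothing scoring)."""
    if not 1 <= len(constraints) <= MAX_CONSTRAINTS:
        raise ValueError(f"expected between 1 and {MAX_CONSTRAINTS} constraints")
    return 1.0 if all(c.check(response) for c in constraints) else 0.0


constraints = [max_words(20), include_keyword("ocean"), all_lowercase(), end_with("done.")]
print(strict_reward("the ocean is wide and deep. done.", constraints))  # 1.0
print(strict_reward("The ocean is wide. done.", constraints))  # 0.0 (not lowercase)
```

A partial-credit variant would return the fraction of satisfied constraints instead; the card does not say which scheme was used.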
### **General Chat**
- **Tulu 3 Rewritten** prompts (clarified and F1-filtered)
- **WildChat English** (math and code prompts filtered out; character caps applied)
- **Multi-Subject RLVR** exam-style reasoning questions
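The card does not define the "F1" criterion applied to Tulu 3 Rewritten. One common definition is SQuAD-style token-overlap F1 between two texts (for example, a response and a reference answer), sketched below with a hypothetical keep threshold. Treat both the metric choice and the 0.5 cut-off as assumptions, not the documented filter.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric runs; a simplification of SQuAD normalization.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a perfect match; one empty as no match.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


F1_THRESHOLD = 0.5  # hypothetical cut-off


def keep(prediction: str, reference: str, threshold: float = F1_THRESHOLD) -> bool:
    return token_f1(prediction, reference) >= threshold


print(token_f1("The capital of France is Paris", "Paris is the capital of France"))  # 1.0
print(round(token_f1("Paris", "The capital is Paris"), 3))  # 0.4
```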
### **Math**
- **OMEGA** ([paper](https://arxiv.org/abs/2506.18880))
- **Open-Reasoner-Zero (ORZ)** ([paper](https://arxiv.org/abs/2503.24290))
- **DAPO-Math** ([paper](https://arxiv.org/abs/2503.14476))
- **MathSub-30K (KlearReasoner Math)** ([paper](https://arxiv.org/abs/2508.07629))
- **Polaris**
### **Code**
- **AceCoder** ([paper](https://arxiv.org/abs/2502.01718))
- Test-case–based RL prompts
- High-quality filtering via solution execution
- Some test cases synthesized programmatically
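The execution-based filtering described above can be approximated as follows: run each reference solution together with its test cases in a separate Python process and keep the prompt only if every test passes. The record fields (`prompt`, `solution`, `tests`) and the 5-second timeout are hypothetical; this is a sketch, not the actual AceCoder or Ai2 pipeline.

```python
import subprocess
import sys


def passes_tests(solution: str, tests: list[str], timeout_s: float = 5.0) -> bool:
    """Run a solution plus assert-style tests in a fresh interpreter; True if exit code is 0."""
    program = solution + "\n\n" + "\n".join(tests) + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs and infinite loops count as failures
    return result.returncode == 0


def filter_prompts(records: list[dict]) -> list[dict]:
    """Keep records whose reference solution passes all of its (non-empty) tests.

    Hypothetical record shape: {"prompt": str, "solution": str, "tests": [str, ...]}.
    """
    return [r for r in records if r["tests"] and passes_tests(r["solution"], r["tests"])]


records = [
    {"prompt": "Write add(a, b).", "solution": "def add(a, b):\n    return a + b",
     "tests": ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]},
    {"prompt": "Write sub(a, b).", "solution": "def sub(a, b):\n    return a + b",  # buggy
     "tests": ["assert sub(3, 1) == 2"]},
]
kept = filter_prompts(records)
print([r["prompt"] for r in kept])  # ['Write add(a, b).']
```

Launching code through `subprocess` is not a sandbox; a real pipeline would isolate execution (containers, resource limits, no network access).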
---
## Processing & Filtering
- **Keyword & topic filtering**
- **Character caps** (max 10 per character for WildChat)
- **F1-quality screening** for Tulu 3 rewritten prompts
- **Removal of math/code** from general-chat datasets
- **Execution-based filtering** for code datasets
- **Constraint normalization** for IF prompts
The final result is a clean, high-entropy, instruction-following RL dataset.
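The keyword filtering and math/code removal steps can be approximated with a pattern-based classifier like the one below. The regular expressions are hypothetical stand-ins, since the card does not publish its keyword lists; a production filter would likely need broader patterns or a trained classifier.

```python
import re

# Hypothetical patterns; the card does not publish its keyword lists.
CODE_PATTERNS = [
    r"```",                                  # fenced code
    r"\bdef \w+\(|\bclass \w+[:(]",          # Python definitions
    r"#include\s*<",                         # C/C++ headers
    r"\b(python|javascript|java|sql)\b",     # language names
]
MATH_PATTERNS = [
    r"\\frac|\\sqrt|\$[^$]+\$",              # LaTeX math
    r"\b(solve|equation|integral|derivative)\b",
    r"\d+\s*[-+*/^]\s*\d+\s*=",              # simple arithmetic like "3 + 4 ="
]

_CODE_RE = re.compile("|".join(CODE_PATTERNS), re.IGNORECASE)
_MATH_RE = re.compile("|".join(MATH_PATTERNS), re.IGNORECASE)


def is_math_or_code(prompt: str) -> bool:
    return bool(_CODE_RE.search(prompt) or _MATH_RE.search(prompt))


def filter_general_chat(prompts: list[str]) -> list[str]:
    """Drop prompts that look like math or code, keeping general chat."""
    return [p for p in prompts if not is_math_or_code(p)]


prompts = [
    "Write a short poem about autumn leaves.",
    "Solve the equation 2x + 3 = 11.",
    "Why does my Python script raise a KeyError?",
    "Give me tips for a job interview.",
]
print(filter_general_chat(prompts))
# ['Write a short poem about autumn leaves.', 'Give me tips for a job interview.']
```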
---
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with [Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
```bibtex
@misc{olmo2025olmo3,
title={Olmo 3},
author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2025},
eprint={2512.13961},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.13961},
}
```
Provided by:
maas
Created:
2025-12-10



