Webscale-RL
Source: ModelScope community (updated 2025-12-05, indexed 2025-10-11)
Download link: https://modelscope.cn/datasets/Salesforce/Webscale-RL
Description:
# Webscale-RL Dataset
[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)
## Dataset Description
**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data.
While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, effectively scaling RL data to pretraining levels while preserving the diversity of the original sources.

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.
## Data Pipeline
The pretraining-to-RL data pipeline includes four stages:
1. **Filter**: Pre-processes and filters raw materials for quality
2. **Identifier**: Identifies domain classification and target persona
3. **Generator**: Creates question-answer pairs based on identified personas
4. **Checker**: Validates generated content for quality and correctness
More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).
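
The four stages above can be read as a chain of per-document steps. The following is a minimal sketch of that control flow only, not the implementation in PretrainRL-pipeline: `RLSample`, `run_pipeline`, and the four callables (`keep`, `identify`, `generate`, `check`) are hypothetical names introduced for illustration, and the toy lambdas stand in for the model-based stages.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class RLSample:
    """Hypothetical container mirroring the fields listed in this card."""
    pretraining_text: str
    domain: str
    persona: str
    question: str
    answer: str


def run_pipeline(
    documents: Iterable[str],
    keep: Callable[[str], bool],                           # stage 1: Filter
    identify: Callable[[str], Tuple[str, str]],            # stage 2: Identifier -> (domain, persona)
    generate: Callable[[str, str, str], Tuple[str, str]],  # stage 3: Generator -> (question, answer)
    check: Callable[[RLSample], bool],                     # stage 4: Checker
) -> Iterator[RLSample]:
    """Yield only the samples that pass both the filter and the checker."""
    for text in documents:
        if not keep(text):
            continue
        domain, persona = identify(text)
        question, answer = generate(text, domain, persona)
        sample = RLSample(text, domain, persona, question, answer)
        if check(sample):
            yield sample


# Toy stand-ins (assumptions): real stages would call a model or classifier.
docs = ["", "Paris is the capital of France."]
samples = list(
    run_pipeline(
        docs,
        keep=lambda t: len(t.strip()) > 0,
        identify=lambda t: ("geography", "student"),
        generate=lambda t, d, p: ("What is the capital of France?", "Paris"),
        check=lambda s: s.answer in s.pretraining_text,
    )
)
print(len(samples))
```

Passing the stages in as callables keeps the sketch independent of any particular model or prompt; the empty document is dropped by the filter and the second one survives all four stages.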
## Dataset Sources
We release ~1.1M samples in the Webscale-RL dataset. In principle, our data pipeline can scale the dataset further, up to pretraining level. The dataset is constructed from the pretraining corpora below, following the recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).
| Source | Size | Domain |
|--------|------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |
**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.
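
As a quick consistency check, the per-source sizes in the table add up to the ~1.1M samples stated above. The short sketch below computes the total and each source's share; the counts are the rounded figures from the table, so the resulting shares are approximate.

```python
# Approximate per-source sample counts, copied from the table above.
source_sizes = {
    "DCLM": 550_000,
    "Wikipedia": 300_000,
    "MegaMath": 100_000,
    "OpenMathReasoning": 100_000,
    "OpenCodeReasoning": 50_000,
}

total = sum(source_sizes.values())
shares = {name: size / total for name, size in source_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total:,}")  # total: 1,100,000
```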
## Dataset Structure
Each sample in the dataset contains:
- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The target persona identified for the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer
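
A light sanity check on these fields can catch malformed rows before training. The sketch below assumes every field is a non-empty string, which this card does not state explicitly; `EXPECTED_FIELDS` and `missing_or_empty_fields` are hypothetical names used only for illustration.

```python
from typing import Any, List, Mapping

# Field names as listed in this card.
EXPECTED_FIELDS = ("pretraining_text", "domain", "persona", "question", "answer")


def missing_or_empty_fields(sample: Mapping[str, Any]) -> List[str]:
    """Return the expected fields that are absent, non-string, or blank."""
    problems = []
    for field in EXPECTED_FIELDS:
        value = sample.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(field)
    return problems


# Made-up example row, not taken from the dataset.
example = {
    "pretraining_text": "Paris is the capital of France.",
    "domain": "geography",
    "persona": "student",
    "question": "What is the capital of France?",
    "answer": "",
}
print(missing_or_empty_fields(example))  # ['answer']
```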
## Usage
```python
from datasets import load_dataset

# Download (or load from the local cache) the dataset
dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
```
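
Because each sample pairs a question with a ground-truth answer, a model response can be scored automatically. The sketch below shows one simple way to do that with a normalized exact-match reward. This is an illustrative assumption, not the reward used in the paper; `normalize` and `exact_match_reward` are hypothetical helpers, and the sample row is made up.

```python
import re
import string
from typing import Mapping


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_reward(prediction: str, sample: Mapping[str, str]) -> float:
    """Return 1.0 if the prediction matches the ground-truth answer after normalization, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(sample["answer"]) else 0.0


# Made-up sample with the same field names as the dataset.
sample = {"question": "What is the capital of France?", "answer": "Paris"}
print(exact_match_reward(" paris. ", sample))  # 1.0
print(exact_match_reward("Lyon", sample))      # 0.0
```

Exact match is strict: a correct but differently phrased answer scores 0.0, so free-form answers may call for a more lenient verifier.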
## Citation
If you use this dataset in your research, please cite:
```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
Provider: maas
Created: 2025-10-09



