Webscale-RL

Name: Webscale-RL
Creator: maas
Published: 2025-12-05 16:53:42
License: No description available

Source: ModelScope community. Updated 2025-12-05; indexed 2025-10-11.

Download link:

https://modelscope.cn/datasets/Salesforce/Webscale-RL

Resource description:

# Webscale-RL Dataset

[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)

## Dataset Description

**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data.

While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, effectively scaling RL data to pretraining levels while preserving the diversity of the original sources.

![Webscale-RL Pipeline](assets/webscale-rl-pipeline.png)

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.

## Data Pipeline

The pretraining-to-RL data pipeline includes four stages:

1. **Filter**: Pre-processes and filters raw materials for quality
2. **Identifier**: Identifies domain classification and target persona
3. **Generator**: Creates question-answer pairs based on identified personas
4. **Checker**: Validates generated content for quality and correctness

More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).

## Dataset Sources

We release ~1.1M samples in the Webscale-RL dataset. In principle, with our data pipeline, we can easily further scale up the dataset size to pretraining level. The Webscale-RL dataset is constructed from the below pretraining corpora, with the construction following the recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).

| Source | Size | Domain |
|--------|------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |

**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.

## Dataset Structure

Each sample in the dataset contains:

- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The persona of the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
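
Because each sample pairs a `question` with a ground-truth `answer`, one plausible way to use the data in RL training is a rule-based reward that checks a model's response against that answer. The sketch below is an illustration only and is not taken from the paper or the PretrainRL-pipeline code: the helper names (`normalize`, `exact_match_reward`, `domain_counts`), the normalization rule, and the toy samples (including their `domain` and `persona` values) are all assumptions. Only the five field names come from the dataset card.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding punctuation.

    This normalization rule is an assumption for illustration.
    """
    collapsed = re.sub(r"\s+", " ", text.lower())
    return collapsed.strip(string.punctuation + string.whitespace)


def exact_match_reward(prediction: str, sample: dict) -> float:
    """Return 1.0 when the normalized prediction equals the sample's answer."""
    return 1.0 if normalize(prediction) == normalize(sample["answer"]) else 0.0


def domain_counts(samples) -> Counter:
    """Count samples per value of the `domain` field."""
    return Counter(sample["domain"] for sample in samples)


# Hypothetical samples that only mimic the documented schema.
toy_samples = [
    {
        "pretraining_text": "Two plus two equals four.",
        "domain": "Mathematics",
        "persona": "student",
        "question": "What is 2 + 2?",
        "answer": "4",
    },
    {
        "pretraining_text": "Paris is the capital of France.",
        "domain": "Encyclopedia",
        "persona": "traveler",
        "question": "What is the capital of France?",
        "answer": "Paris",
    },
]

if __name__ == "__main__":
    print(exact_match_reward("  paris. ", toy_samples[1]))
    print(exact_match_reward("5", toy_samples[0]))
    print(domain_counts(toy_samples))
```

Exact matching is the simplest choice and will reject correct answers that are phrased differently; a real training setup would likely need a more tolerant checker. The same functions should work on rows returned by `load_dataset("Salesforce/Webscale-RL")`, since they rely only on the documented field names.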

Provider:

maas

Created:

2025-10-09
