Webscale-RL

Source: ModelScope community. Updated 2025-12-05. Indexed 2025-10-11.
Download link:
https://modelscope.cn/datasets/Salesforce/Webscale-RL
Description:
# Webscale-RL Dataset

[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)

## Dataset Description

**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data.

While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, scaling RL data to pretraining levels while preserving the diversity of the original sources.

![Webscale-RL Pipeline](assets/webscale-rl-pipeline.png)

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.

## Data Pipeline

The pretraining-to-RL data pipeline includes four stages:

1. **Filter**: Pre-processes the raw material and filters it for quality
2. **Identifier**: Identifies the domain classification and target persona
3. **Generator**: Creates question-answer pairs based on the identified personas
4. **Checker**: Validates the generated content for quality and correctness

More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).

## Dataset Sources

We release ~1.1M samples in the Webscale-RL dataset. In principle, the data pipeline can scale the dataset further, up to pretraining level. The dataset is constructed from the pretraining corpora below, following the recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).

| Source | Size (samples) | Domain |
|--------|----------------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |

**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.

## Dataset Structure

Each sample in the dataset contains:

- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The target persona identified for the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
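
## Example: Selecting Question-Answer Pairs by Domain

The fields listed under Dataset Structure can be used to slice the data before RL training. The following is a minimal sketch, not part of the official pipeline: the helper names `count_by_domain` and `iter_rl_pairs` are hypothetical, and the toy records (including their `domain` and `persona` values) are made-up placeholders that only mimic the documented field names, so the block runs without downloading anything.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

Sample = Dict[str, str]

# Made-up placeholder records that only mimic the documented field names.
# The `domain` and `persona` values are hypothetical, not taken from the dataset.
toy_samples = [
    {
        "pretraining_text": "The sum of two and three is five.",
        "domain": "math",
        "persona": "student",
        "question": "What is 2 + 3?",
        "answer": "5",
    },
    {
        "pretraining_text": "Paris is the capital of France.",
        "domain": "geography",
        "persona": "traveler",
        "question": "What is the capital of France?",
        "answer": "Paris",
    },
    {
        "pretraining_text": "A triangle has three sides.",
        "domain": "math",
        "persona": "teacher",
        "question": "How many sides does a triangle have?",
        "answer": "3",
    },
]


def count_by_domain(samples: Iterable[Sample]) -> Counter:
    """Count how many samples carry each `domain` value."""
    return Counter(sample["domain"] for sample in samples)


def iter_rl_pairs(
    samples: Iterable[Sample], domain: Optional[str] = None
) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) pairs, optionally restricted to one domain."""
    for sample in samples:
        if domain is None or sample["domain"] == domain:
            yield sample["question"], sample["answer"]


print(count_by_domain(toy_samples))
print(list(iter_rl_pairs(toy_samples, domain="math")))
```

With the real dataset, the same helpers should accept `dataset["train"]` from the Usage section, since each record is a dictionary with these keys. Passing `streaming=True` to `load_dataset` is one way to iterate without downloading all ~1.1M samples first.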

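
## Example: Scoring a Response Against the Ground-Truth Answer

Because each sample pairs a verifiable `question` with a ground-truth `answer`, a model response can be turned into a scalar reward. The sketch below uses normalized exact match as the verifier. This is an assumption chosen for simplicity, not the verification method used by the authors, and the function names `normalize_answer` and `exact_match_reward` are hypothetical.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return collapsed.rstrip(".!?").strip()


def exact_match_reward(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    if normalize_answer(prediction) == normalize_answer(ground_truth):
        return 1.0
    return 0.0


# Assumed usage: `prediction` would come from the policy model's response
# to sample["question"], and `ground_truth` from sample["answer"].
print(exact_match_reward("  Paris. ", "paris"))
print(exact_match_reward("London", "Paris"))
```

Exact match is strict: a correct answer phrased differently from the ground truth scores 0.0, so free-form answers would likely need a more tolerant checker.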
Provider:
maas
Created:
2025-10-09