recycling_the_web
Source: ModelScope community. Updated 2025-12-05; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/facebook/recycling_the_web
Resource description:
# Dataset Card for Recycling-The-Web Synthetic Data
We release 44.4B tokens of high-quality, model-filtered synthetic texts obtained via our [REcycling the Web with guIded REwrite (REWIRE)](https://arxiv.org/abs/2506.04689) approach.
The generation process involves taking all documents that are of moderate quality (i.e., having passed some rule-based filters),
using an LLM (Llama-3.3-70B-Instruct) to identify the purpose of the text content, and then asking the LLM to come up with an improved document conditioned on chain-of-thought reasoning.
Our approach specifically targets the vast quantity of low-quality documents that are somewhat informative but still not considered high-quality by existing filters.
We use the LLM's knowledge and reasoning capabilities to recycle these discarded documents and add them back to the training pool.
## Dataset Details
### Dataset Description
Curated by: Thao Nguyen
Language(s): Mostly English texts
License: CC-by-NC
The texts are outputs of Llama 3.3 and subject to the Llama 3.3 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE).
If you use the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name.
Third-party content pulled from other locations is subject to its own licenses, and you may have other legal obligations or restrictions that govern your use of that content.
### Dataset Sources
Paper: [Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models](https://arxiv.org/abs/2506.04689), COLM 2025
## Uses
### Direct Use
The data is intended for pre-training large language models (LLMs) and is designed with the goal of _complementing_ existing web-scraped texts.
## Dataset Creation
### Curation Rationale
The data was obtained by using an LLM to rewrite low-quality documents that have been discarded by quality filters, in order to make them useful for pre-training.
This helps increase the token availability and address the impending data bottleneck, as the growth of public human-generated texts has been lagging behind the increase in model capacity and training token budget.
Across different model scales, we find that mixing our synthetic data and high-quality web data consistently outperforms training on only the latter.
<img src="https://huggingface.co/datasets/facebook/recycling_the_web/resolve/main/main_figure.png" alt="Summary of performance improvement" width="350px">
### Source Data
#### Data Collection and Processing
We first gathered raw web documents from DCLM-RefinedWeb (Li et al., 2024), which is Common Crawl data that has passed the rule-based quality filters from RefinedWeb (Penedo et al., 2023)
(e.g., repetition filter, page length filter, and URL filter) and global deduplication, but has not gone through model-based filtering.
We then prompted Llama-3.3-70B-Instruct (Grattafiori et al., 2024) to perform chain-of-thought reasoning on the original web documents,
such as identifying the task or purpose of the text and reasoning about the steps needed to achieve that purpose, before generating an improved version of each document.
Refer to our paper for the full prompt we used. We applied this rewriting process to all documents in the starting pool (DCLM-RefinedWeb).
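As a rough illustration of the rewriting step, the sketch below wraps a document in a chain-of-thought style instruction and hands it to a text-generation callable. The prompt wording, the `PROMPT_TEMPLATE` name, and the `rewrite_document` helper are all hypothetical stand-ins written for this example; the actual prompt is the one given in the paper, and `generate` would be a call into Llama-3.3-70B-Instruct rather than the stub used here.

```python
from typing import Callable

# Hypothetical placeholder prompt. The real prompt is documented in the paper.
PROMPT_TEMPLATE = (
    "Read the web document below. First reason step by step: identify the "
    "task or purpose of the text and the steps needed to achieve it. "
    "Then write an improved version of the document.\n\n"
    "Document:\n{document}"
)


def rewrite_document(document: str, generate: Callable[[str], str]) -> str:
    """Build the rewrite prompt for one document and return the model output.

    `generate` is any function mapping a prompt string to generated text,
    for example a thin wrapper around an LLM inference endpoint.
    """
    prompt = PROMPT_TEMPLATE.format(document=document)
    return generate(prompt)


def fake_llm(prompt: str) -> str:
    """Stub model for demonstration: echoes the document with a marker."""
    return "IMPROVED: " + prompt.split("Document:\n", 1)[1]


rewritten = rewrite_document("how 2 bake bread: mix stuff, bake", fake_llm)
```

Passing the model in as a callable keeps the sketch independent of any particular inference library.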
To control the quality of the generations, we further performed model-based filtering using a fastText classifier, following DCLM (Li et al., 2024).
This data release contains _only_ the rewritten outputs that have been filtered, i.e. those in the top 10% of the generations based on fastText scores.
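The top-10% selection can be sketched as a simple rank-and-cut over per-document quality scores. This is a minimal illustration, not the released pipeline: the function name `keep_top_fraction` is made up for this example, and the scores are assumed to have been computed beforehand (for instance by a fastText classifier), with higher meaning better.

```python
import math


def keep_top_fraction(docs, scores, fraction=0.10):
    """Return the docs whose scores fall in the top `fraction` of the pool."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    # Keep at least one document so tiny pools do not come back empty.
    n_keep = max(1, math.floor(len(docs) * fraction))
    # Rank indices by score, highest first; ties keep their original order.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Restore the original document order among the kept items.
    kept = sorted(ranked[:n_keep])
    return [docs[i] for i in kept]


# Toy pool of 20 rewritten documents with increasing scores.
docs = [f"doc{i}" for i in range(20)]
scores = [i / 20 for i in range(20)]
top = keep_top_fraction(docs, scores)  # keeps the 2 highest-scoring docs
```

At real scale the cut would more likely be a score threshold estimated from a sample than a full sort, but the selection rule is the same.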
<img src="https://huggingface.co/datasets/facebook/recycling_the_web/resolve/main/REWIRE_pipeline.png" alt="REWIRE pipeline" width="800px">
## Citation
If you use data from Recycling The Web, please cite it with the following BibTeX entry:
```
@article{nguyen2025recycling,
title={Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models},
author={Nguyen, Thao and Li, Yang and Golovneva, Olga and Zettlemoyer, Luke and Oh, Sewoong and Schmidt, Ludwig and Li, Xian},
journal={arXiv preprint arXiv:2506.04689},
year={2025}
}
```
## Dataset Card Contact
Thao Nguyen (thaottn@cs.washington.edu)
Provider: maas
Created: 2025-08-28



