falcon-refinedweb
Hosted on ModelScope · updated 2026-01-06 · indexed 2025-10-11
Download link: https://modelscope.cn/datasets/tiiuae/falcon-refinedweb
# 📀 Falcon RefinedWeb
**Falcon RefinedWeb is a massive English web dataset built by [TII](https://www.tii.ae) and released under an ODC-By 1.0 license.**
See the 📓 [paper on arXiv](https://arxiv.org/abs/2306.01116) for more details.
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets, while relying only on web data.
RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.
This public extract should contain 500-650GT (billion tokens) depending on the tokenizer you use, and can be enhanced with the curated corpora of your choosing. It is about 500GB to download and requires 2.8TB of local storage once unpacked.
```python
from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")
```
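Because the full extract needs roughly 2.8TB once unpacked, it can be convenient to look at a few samples without downloading everything. The sketch below assumes the `streaming=True` mode of `datasets.load_dataset` and a default `train` split; the `take` and `stream_refinedweb` helpers are hypothetical conveniences, not part of the dataset's tooling.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def take(samples: Iterable[Any], n: int) -> list[Any]:
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(samples, n))


def stream_refinedweb() -> Iterator[dict]:
    """Lazily iterate over RefinedWeb without materializing it on disk.

    Assumes the extract exposes a single "train" split.
    """
    from datasets import load_dataset  # imported lazily: only needed when streaming

    return iter(load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True))


# Usage (requires network access and the `datasets` package):
# for sample in take(stream_refinedweb(), 3):
#     print(sample["url"])
```

Nothing is fetched until `stream_refinedweb()` is called, so the helpers can be imported safely in an offline environment.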
RefinedWeb is the main dataset we have used for training the [Falcon LLM](https://falconllm.tii.ae) models:
* It was used in conjunction with curated corpora to train Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b), two state-of-the-art open-source models.
* It was also used to train Falcon-RW-[1B](https://huggingface.co/tiiuae/falcon-rw-1b)/[7B](https://huggingface.co/tiiuae/falcon-rw-7b), two models trained on 350 billion tokens of RefinedWeb alone to demonstrate its quality compared to curated corpora.
# Dataset card for Falcon RefinedWeb
## Dataset Description
* **Homepage:** [falconllm.tii.ae](https://falconllm.tii.ae)
* **Paper:** [https://arxiv.org/abs/2306.01116](https://arxiv.org/abs/2306.01116)
* **Point of Contact:** [falconllm@tii.ae](mailto:falconllm@tii.ae)
### Dataset Summary
Falcon RefinedWeb was created to serve as an English large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).
It was built on top of CommonCrawl, leveraging stringent filtering and extensive deduplication.
### Supported Tasks and Leaderboards
RefinedWeb is intended to be used primarily as a pretraining dataset for large language models. Practitioners may leverage it for upstream evaluation with a validation loss, but we do not provide any canonical split.
### Languages
RefinedWeb primarily contains English text.
## Dataset Structure
### Data Instances
Each data instance corresponds to an individual web page which has been crawled, processed, and deduplicated against all other instances.
This public extract of RefinedWeb contains about 1B instances (968M individual web pages), for a total of 2.8TB of clean text data.
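As a rough sanity check, the figures quoted in this card imply an average page size and token density. The arithmetic below assumes decimal units (1TB = 10^12 bytes) and uses only numbers stated in this card; the results are back-of-the-envelope averages, not measured statistics.

```python
# Figures quoted in this card.
TEXT_BYTES = 2.8e12                      # 2.8TB of clean text (decimal units assumed)
PAGES = 968e6                            # 968M individual web pages
TOKENS_LOW, TOKENS_HIGH = 500e9, 650e9   # 500-650GT depending on the tokenizer

bytes_per_page = TEXT_BYTES / PAGES
tokens_per_page = (TOKENS_LOW / PAGES, TOKENS_HIGH / PAGES)
bytes_per_token = (TEXT_BYTES / TOKENS_HIGH, TEXT_BYTES / TOKENS_LOW)

print(f"~{bytes_per_page:,.0f} bytes per page")
print(f"~{tokens_per_page[0]:,.0f}-{tokens_per_page[1]:,.0f} tokens per page")
print(f"~{bytes_per_token[0]:.1f}-{bytes_per_token[1]:.1f} bytes per token")
```

Under these assumptions, an average page holds a little under 3KB of text and several hundred tokens.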
### Data Fields
* `content`: the processed and cleaned text contained in the page;
* `url`: the url of the webpage crawled to produce the sample;
* `timestamp`: timestamp of when the webpage was crawled by CommonCrawl;
* `dump`: the CommonCrawl dump the sample is a part of;
* `segment`: the CommonCrawl segment the sample is a part of;
* `image_urls`: a list of [`image_url`, `image_alt_text`] pairs, one for each image found in the content of the sample.
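To make the record layout concrete, the sketch below types one sample and extracts its image captions. The field names come from the list above; the Python types (in particular `timestamp` as a string), the `RefinedWebSample` and `image_captions` names, and the sample values are assumptions made up for illustration.

```python
from typing import TypedDict


class RefinedWebSample(TypedDict):
    """Hypothetical typing of one record; field names follow the card."""
    content: str
    url: str
    timestamp: str                 # assumed here to be an ISO-8601 string
    dump: str
    segment: str
    image_urls: list[list[str]]    # each element is [image_url, image_alt_text]


# Made-up sample for illustration only (not taken from the dataset).
sample: RefinedWebSample = {
    "content": "A short page about falcons.",
    "url": "https://example.com/falcons",
    "timestamp": "2023-01-01T00:00:00",
    "dump": "CC-MAIN-2023-06",
    "segment": "1234567890.12",
    "image_urls": [["https://example.com/falcon.jpg", "A falcon in flight"]],
}


def image_captions(record: RefinedWebSample) -> dict[str, str]:
    """Map each image URL to its alt text, skipping images without alt text."""
    return {url: alt for url, alt in record["image_urls"] if alt}


print(image_captions(sample))
```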
### Data Splits
We do not provide any canonical splits for RefinedWeb.
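Since no canonical split exists, practitioners who want a held-out set need to define their own. One possible approach, sketched below, is to hash each sample's `url` so that the assignment is deterministic and reproducible across runs. The `assign_split` helper and the default validation fraction are assumptions for illustration, not a recommendation from the dataset authors.

```python
import hashlib


def assign_split(url: str, validation_fraction: float = 0.001) -> str:
    """Deterministically assign a page to "train" or "validation" by hashing its URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "validation" if bucket < validation_fraction * 2**64 else "train"


urls = [f"https://example.com/page/{i}" for i in range(10_000)]
held_out = sum(assign_split(u, validation_fraction=0.1) == "validation" for u in urls)
print(f"{held_out} of {len(urls)} example URLs fall in the validation split")
```

Hashing the URL rather than sampling at random means the same page always lands in the same split, whatever order the data is read in.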
## Dataset Creation
### Curation Rationale
Falcon RefinedWeb is built on top of [CommonCrawl](https://commoncrawl.org), using the Macrodata Refinement Pipeline (MDR), which combines content extraction, filtering heuristics, and deduplication.
In designing RefinedWeb, we abided by the following philosophy:
* (1) **Scale first.** We intend MDR to produce datasets to be used to train 40-200B parameter models, thus requiring trillions of tokens [(Hoffmann et al., 2022)](https://arxiv.org/abs/2203.15556). For English-only RefinedWeb, we target a size of 3-6 trillion tokens. Specifically, we eschew any labour-intensive human curation process, and focus on CommonCrawl instead of disparate single-domain sources.
* (2) **Strict deduplication.** Inspired by the work of [Lee et al., 2021](https://arxiv.org/abs/2107.06499), which demonstrated the value of deduplication for large language models, we implement a rigorous deduplication pipeline. We combine both exact and fuzzy deduplication, and use strict settings leading to removal rates far higher than other datasets have reported.
* (3) **Neutral filtering.** To avoid introducing further undesirable biases into the model, we avoid using ML-based filtering outside of language identification ([Dodge et al., 2021](https://arxiv.org/abs/2104.08758); [Welbl et al., 2021](https://arxiv.org/abs/2109.07445)). We stick to simple rules and heuristics, and use only URL filtering for adult content.
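To make the "scale first" target concrete, one can apply the rule of thumb of roughly 20 training tokens per parameter that is commonly derived from Hoffmann et al. (2022). That ratio is an approximation assumed here for illustration, not a figure stated in this card, and `compute_optimal_tokens` is a hypothetical helper.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb, commonly attributed to Hoffmann et al. (2022)


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate number of training tokens for a model with n_params parameters."""
    return n_params * TOKENS_PER_PARAM


for n_params in (40e9, 200e9):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B parameters -> ~{tokens / 1e12:.1f}T tokens")
```

Under this assumption, a 200B parameter model calls for about 4 trillion tokens, which falls inside the 3-6 trillion token target mentioned above.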
During its development, we iterated on RefinedWeb by measuring the zero-shot performance of models trained on development versions of the dataset. Our main goal was to maximize the performance obtained, bridging the gap between curated and web data. We also manually audited samples to identify potential filtering improvements.
### Source Data
RefinedWeb is built from [CommonCrawl](https://commoncrawl.org) dumps. These dumps are constructed from crawling publicly available web pages.
### Data Collection and Preprocessing
We applied extensive preprocessing and cleaning of the data, using our Macrodata Refinement Pipeline.
We first filter URLs to remove adult content using a blocklist and a score system. We then use `trafilatura` to extract content from pages, and perform language identification with the `fastText` classifier from CCNet ([Wenzek et al., 2019](https://arxiv.org/abs/1911.00359)). After this first preprocessing stage, we filter data using heuristics from MassiveWeb ([Rae et al., 2021](https://arxiv.org/abs/2112.11446)), and our own line-wise corrections.
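The heuristic filtering stage can be pictured as a set of simple document-level rules. The sketch below is loosely modeled on the kind of rules described for MassiveWeb (word count, mean word length, symbol-to-word ratio, share of alphabetic words). The function name and every threshold are illustrative assumptions, not the settings used by the Macrodata Refinement Pipeline.

```python
def passes_quality_heuristics(
    text: str,
    min_words: int = 50,
    max_words: int = 100_000,
    min_mean_word_len: float = 3.0,
    max_mean_word_len: float = 10.0,
    max_symbol_ratio: float = 0.1,
    min_alpha_fraction: float = 0.8,
) -> bool:
    """Return True if a document passes a few simple rule-based quality checks."""
    words = text.split()
    if not words or not (min_words <= len(words) <= max_words):
        return False

    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_word_len <= mean_len <= max_mean_word_len):
        return False

    # Ratio of "#" and ellipsis symbols to words.
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / len(words) > max_symbol_ratio:
        return False

    # Fraction of words containing at least one alphabetic character.
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha / len(words) >= min_alpha_fraction


good = "The peregrine falcon is a widespread bird of prey. " * 10
bad = "### ... ### ... " * 40
print(passes_quality_heuristics(good), passes_quality_heuristics(bad))
```

Rules of this kind need no trained model, which is in keeping with the "neutral filtering" principle of avoiding ML-based quality classifiers.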
Finally, we run extensive deduplication: we remove URLs revisited across dumps, then perform fuzzy deduplication followed by exact substring deduplication.
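Fuzzy deduplication is commonly implemented with MinHash over word n-grams, which estimates the Jaccard similarity between documents. The sketch below illustrates that general technique in pure Python. The shingle size, number of hash functions, and helper names are assumptions for illustration and do not reproduce the pipeline's actual settings; exact substring deduplication is a separate step not shown here.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
doc_b = doc_a + " today"  # near-duplicate of doc_a
doc_c = "completely unrelated text about training large language models on filtered web data"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(f"a vs b: {estimated_jaccard(sig_a, sig_b):.2f}")
print(f"a vs c: {estimated_jaccard(sig_a, sig_c):.2f}")
```

In a real pipeline, signatures would be bucketed (for example with locality-sensitive hashing) so that near-duplicates can be found without comparing every pair of documents.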
### Annotations
We provide automatically collected annotations for the source `url`, `timestamp` of the crawl, original CommonCrawl `dump` and `segment` in which the document was found, and `image_urls` contained in the page.
### Personal and Sensitive Information
As RefinedWeb is built upon publicly available web pages, it may contain sensitive information such as emails, phone numbers, or IP addresses. We believe that deduplication may have helped reduce the prevalence of personally identifiable information (PII) in the dataset, but practitioners working with RefinedWeb should take care.
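One simple way to take care is to scan samples for obvious patterns before use. The regular expressions below are a deliberately naive sketch: they are assumptions for illustration, will miss many cases and produce false positives, do not cover phone numbers, and are not part of the RefinedWeb pipeline.

```python
import re

# Naive patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and IPv4-like strings with placeholder tags."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)


print(redact_pii("Contact admin@example.com from 192.0.2.1 for access."))
```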
## Considerations for Using the Data
### Social Impact of Dataset
With the open-source release of Falcon RefinedWeb, we aim to increase access to high-quality web data, which has typically been held private by model developers. We believe this release will in turn improve the accessibility and the spread of performant large language models.
### Discussion of Biases
As toxic or biased data is prevalent on the internet, it is likely our dataset contains such content. Notably, using the Perspective API, we estimated the prevalence of toxic content in the dataset to be similar to The Pile.
### Other Known Limitations
Despite our best efforts to filter content that does not qualify as natural language, and to deduplicate documents, our pipeline may still let through documents that are erroneous or redundant.
## Additional Information
### Licensing Information
This public extract is made available under an [ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) license; users should also abide by the [CommonCrawl ToU](https://commoncrawl.org/terms-of-use/).
### Citation Information
```
@article{refinedweb,
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
journal={arXiv preprint arXiv:2306.01116},
eprint={2306.01116},
eprinttype = {arXiv},
url={https://arxiv.org/abs/2306.01116},
year={2023}
}
```
### Opt-out request
RefinedWeb is based on [CommonCrawl](https://commoncrawl.org/). Their crawler honors opt-out requests in `robots.txt`; see the [CC FAQ](https://commoncrawl.org/big-picture/frequently-asked-questions/) for details.
To remove a document from RefinedWeb, please message falconllm@tii.ae.
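As an illustration of the `robots.txt` opt-out mentioned above, the sketch below uses Python's standard `urllib.robotparser` to check whether a given set of rules would block Common Crawl's crawler, which identifies itself as `CCBot`. The rules and the `example.com` URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that disallows Common Crawl's crawler site-wide.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ccbot_allowed = parser.can_fetch("CCBot", "https://example.com/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/page.html")
print(f"CCBot allowed: {ccbot_allowed}, other crawlers allowed: {other_allowed}")
```

Such rules only affect future crawls; pages already present in RefinedWeb need a removal request by email, as described above.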
### Contact
falconllm@tii.ae
Provider: maas
Created: 2025-10-03



