# 📚 FineWeb-Edu-score-2
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/wwRnEQydH9qdRtFofIE-A.png" alt="FineWeb-Edu: The finest collection of educational content the web has to offer">
</center>
> 1.3 trillion tokens of the finest educational data the 🌐 web has to offer
## What is it?
The 📚 FineWeb-Edu dataset comes in two versions, **1.3T tokens** ([FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)) and **5.4T tokens**, both made of educational web pages filtered from the 🍷 FineWeb dataset. This is the 5.4 trillion token version.
### Note: this version uses a lower educational score threshold of 2, which yields more documents but lower quality than the 1.3T version. For more details, check the FineWeb [blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
To enhance FineWeb's quality, we developed an [educational quality classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) using annotations generated by Llama3-70B-Instruct. We then used this classifier to retain only the most educational web pages. FineWeb-Edu outperforms FineWeb on popular benchmarks and shows the power of classifiers trained on synthetic data.
The [Dataset Curation](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu#dataset-curation) section details the process for creating the dataset.

## What is being released?
Along with the dataset, which includes all filtered CommonCrawl dumps since 2013, we also release the educational classifier used for the filtering as well as the code for training it and running inference at: https://github.com/huggingface/cosmopedia/tree/main/classification.
## Changelog
_Previous versions remain available in the branch `version name`._
- **v1.4.0 (11-07-2025):** Added 6 new snapshots: `CC-MAIN-2025-05`, `CC-MAIN-2025-08`, `CC-MAIN-2025-13`, `CC-MAIN-2025-18`, `CC-MAIN-2025-21`, and `CC-MAIN-2025-26` (January to June 2025)
- **v1.3.0 (31-01-2025):** Fixed an issue with some dumps where some documents hadn't been processed: `CC-MAIN-2024-10`, `CC-MAIN-2024-18`, `CC-MAIN-2024-22`, `CC-MAIN-2024-26`, `CC-MAIN-2024-30`, `CC-MAIN-2024-33`, `CC-MAIN-2024-38`, `CC-MAIN-2024-42`, `CC-MAIN-2024-46` -- they now contain more data (~330B additional tokens).
- **v1.2.0 (03-01-2025):** Added 9 new snapshots: `CC-MAIN-2024-18`, `CC-MAIN-2024-22`, `CC-MAIN-2024-26`, `CC-MAIN-2024-30`, `CC-MAIN-2024-33`, `CC-MAIN-2024-38`, `CC-MAIN-2024-42`, `CC-MAIN-2024-46`, `CC-MAIN-2024-51`, covering April to December 2024.
- **v1.0.0 (02-06-2024):** Initial version
## How to load the dataset
As with FineWeb, you can load the full dataset or a specific crawl/dump. Dumps have the format `CC-MAIN-(year)-(week number)`.
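The dump naming scheme can be handled programmatically. The following is a minimal sketch, assuming every dump name has a four-digit year and a two-digit week number, as in the names listed in the changelog; `parse_dump_name` and `dumps_in_year` are hypothetical helpers, not part of any library.

```python
import re
from typing import NamedTuple


class Dump(NamedTuple):
    year: int
    week: int


# Assumed pattern: "CC-MAIN-", a four-digit year, "-", a two-digit week number.
_DUMP_RE = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_dump_name(name: str) -> Dump:
    """Split a dump name such as "CC-MAIN-2024-10" into year and week."""
    match = _DUMP_RE.match(name)
    if match is None:
        raise ValueError(f"not a CC-MAIN dump name: {name!r}")
    return Dump(year=int(match.group(1)), week=int(match.group(2)))


def dumps_in_year(names: list[str], year: int) -> list[str]:
    """Return the dump names that belong to `year`, sorted by week number."""
    selected = [n for n in names if parse_dump_name(n).year == year]
    return sorted(selected, key=lambda n: parse_dump_name(n).week)
```

For example, `dumps_in_year(names, 2025)` could be used to select only the snapshots added in v1.4.0 before passing each name to one of the loaders below.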
### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)
```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# Option 1: read the full dataset
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu-score-2", glob_pattern="data/*/*.parquet", limit=1000)
# Option 2: read a single dump (this reassignment replaces the reader above, so keep only one)
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu-score-2/CC-MAIN-2024-10", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        # read documents from one dump
        ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu-score-2/CC-MAIN-2024-10", limit=1000),
        # keep only documents whose text mentions "hugging"
        LambdaFilter(lambda doc: "hugging" in doc.text),
        # write the remaining documents as JSONL
        JsonlWriter("some-output-path"),
    ],
    tasks=10,
)
pipeline_exec.run()
```
### Using `datasets`
```python
from datasets import load_dataset
fw = load_dataset("HuggingFaceFW/fineweb-edu-score-2", name="CC-MAIN-2024-10", split="train", streaming=True)
```
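Because the dataset is streamed, it is usually consumed lazily. The sketch below shows one way to pull a bounded number of matching rows from any iterable of row dictionaries. It assumes each row exposes its content under a `"text"` key (mirroring `doc.text` in the `datatrove` example); `take_matching` is a hypothetical helper, and the rows here are stand-ins rather than real dataset records.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_matching(rows: Iterable[dict], keyword: str, limit: int) -> Iterator[dict]:
    """Return an iterator over at most `limit` rows whose "text" contains `keyword`."""
    matches = (row for row in rows if keyword in row.get("text", ""))
    return islice(matches, limit)


# Stand-in rows; with the real dataset you would pass the streaming `fw` object instead.
sample_rows = [
    {"text": "hugging a tree is good for you"},
    {"text": "unrelated page"},
    {"text": "more hugging content"},
]
first_hit = list(take_matching(sample_rows, "hugging", limit=1))
```

Since both the generator and `islice` are lazy, iteration stops as soon as `limit` matches are found, which avoids downloading more of the stream than needed.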
## Dataset curation
A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of [Llama3](https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/), [Claude3](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) and [Phi3](https://arxiv.org/abs/2404.14219), but its large-scale impact on web data filtering hasn't been fully explored or published.
The highly popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper stating: “Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data”. Similarly, the Llama3 blog post notes: “We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.” However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by [Llama3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to create FineWeb-Edu.
### Annotation
We used [Llama3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to score 500k FineWeb samples for their educational quality on a scale from 0 to 5.
We explored various prompts and found that the additive scale by [Yuan et al.](https://arxiv.org/pdf/2401.10020) worked best. To avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages. The final prompt can be found in this blog post TODO.
We also experimented with different LLMs: Llama3-70B-Instruct, Mixtral-8x7B-Instruct, and Mixtral-8x22B-Instruct. Llama3 and Mixtral-8x22B produced similar scores, while Mixtral-8x7B tended to be more generous and did not fully adhere to the score scale. Verga et al. suggest using multiple LLMs as juries. We tried averaging the scores from the three models, but this shifted the distribution to the right because of the higher scores from Mixtral-8x7B. Training on a dataset filtered with a classifier trained on jury annotations performed worse than training on one filtered with a classifier trained on Llama3 annotations. We hypothesize that the jury-based approach retains more low-quality samples.
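The effect of one generous annotator on an averaged jury score can be illustrated with a toy example. This is only a sketch: the scores below are invented for illustration and are not taken from the actual annotations, and `jury_scores` and `kept` are hypothetical helpers.

```python
from statistics import mean

# Hypothetical 0-5 scores for the same five pages from three annotators.
# The third annotator is deliberately more generous; none of these numbers
# come from the real annotation runs.
strict_a = [0, 1, 2, 3, 4]
strict_b = [0, 1, 2, 3, 4]
generous = [2, 3, 5, 5, 5]


def jury_scores(*annotators):
    """Average the per-page scores across all annotators."""
    return [mean(page_scores) for page_scores in zip(*annotators)]


def kept(scores, threshold=3):
    """Count pages whose score reaches the threshold."""
    return sum(score >= threshold for score in scores)


jury = jury_scores(strict_a, strict_b, generous)
```

In this toy setup the third page scores 2 under either strict annotator but averages to 3 under the jury, so it passes a threshold of 3 only when the generous annotator is included.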
### Classifier training
We fine-tuned a BERT-like regression model on these annotations, starting from [Snowflake-arctic-embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m). When converted to a binary classifier, using a score of 3 as the threshold for keeping or removing files, the model achieved an F1 score of 82%. Classifying the 15T tokens of FineWeb took 6k H100 GPU hours.
The classifier is available at: [https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/)
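To make the regression-to-binary conversion concrete, here is a minimal sketch of how a raw regression output might be turned into a keep/remove label and how F1 is computed from such labels. The clamp-and-round step is an assumption about the post-processing, and the scores are made up; `to_label` and `f1` are hypothetical helpers.

```python
def to_label(score: float, threshold: int = 3) -> bool:
    """Clamp a raw score to [0, 5], round it, and compare with the threshold.

    Assumption: the regression output is clamped and rounded before
    thresholding. Python's round() sends exact .5 ties to the nearest even integer.
    """
    int_score = int(round(max(0.0, min(score, 5.0))))
    return int_score >= threshold


def f1(predicted: list[bool], actual: list[bool]) -> float:
    """F1 score of the positive ("keep") class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up model outputs and reference annotations for five documents.
raw_outputs = [0.4, 2.6, 3.2, 1.9, 4.8]
annotations = [0, 3, 2, 2, 5]

predicted = [to_label(s) for s in raw_outputs]
actual = [a >= 3 for a in annotations]
toy_f1 = f1(predicted, actual)
```

Here two documents are correctly kept, one is kept by mistake, and none is missed, so precision is 2/3, recall is 1, and the toy F1 is 0.8.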
### Filtering and results
**Note**: You can find more details about the ablations and results in [the FineWeb blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. Although using a threshold higher than 3 improves performance on knowledge and reasoning intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA.
We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3T educational tokens. Our ablation demonstrated that this refined dataset surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. The plot below compares FineWeb-Edu to other web datasets:

To retain more tokens, we also experimented with a less strict threshold of 2 instead of 3. While less performant than threshold 3, it still outperformed FineWeb and preserved 5.4T tokens. We release these two datasets as [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [FineWeb-Edu-score-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2) along with the [classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
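The trade-off between the two thresholds can be sketched on toy data. The example assumes each document carries an integer score and a token count under the hypothetical keys `int_score` and `token_count`; the documents and numbers are invented and do not reflect the real score distribution.

```python
def retained_tokens(docs, threshold):
    """Sum the token counts of documents whose integer score is at least `threshold`."""
    return sum(doc["token_count"] for doc in docs if doc["int_score"] >= threshold)


# Invented documents: most tokens sit in low-scoring pages.
toy_docs = [
    {"int_score": 0, "token_count": 500},
    {"int_score": 1, "token_count": 300},
    {"int_score": 2, "token_count": 120},
    {"int_score": 3, "token_count": 60},
    {"int_score": 4, "token_count": 20},
]

total = sum(doc["token_count"] for doc in toy_docs)
share_at_2 = retained_tokens(toy_docs, 2) / total  # 200 of 1000 tokens
share_at_3 = retained_tokens(toy_docs, 3) / total  # 80 of 1000 tokens
```

Lowering the threshold from 3 to 2 keeps every document that threshold 3 keeps plus the score-2 documents, which is why the threshold-2 release is larger but of lower average quality.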
You will find all the ablation models in [this collection](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32). The FineWeb-Edu ablation model (trained on 350B tokens) is available at [https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu](https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu).
## Considerations for Using the Data
This section is copied from the parent dataset: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
### Social Impact of Dataset
With the release of this dataset we aim to make model training more accessible to the machine learning community at large.
While multiple open-weights models with strong performance have been publicly released in the past, more often than not these releases are not accompanied by the corresponding training dataset. This is unfortunate, as the specificities and characteristics of a dataset have been shown to have a very large impact on model performance. Because creating a high-quality training dataset is a fundamental requirement for training an LLM capable of excelling at downstream tasks, with 🍷 FineWeb we (a) make the dataset creation process more transparent by sharing our entire processing setup, including the codebase used, and (b) help alleviate the costs of dataset curation, both in time and in compute, for model creators by publicly releasing our dataset to the community.
### Discussion of Biases
Efforts were made to minimize the amount of NSFW and toxic content present in the dataset by employing filtering at the URL level. However, a significant number of documents in the final dataset could still be considered toxic or contain harmful content. As 🍷 FineWeb was sourced from the web as a whole, any harmful biases typically present in it may be reproduced in our dataset.
We deliberately avoided machine learning filtering methods that define text quality by similarity to a “gold” source such as Wikipedia, as well as toxicity classifiers, because these methods are known to [disproportionately remove content in specific dialects](https://aclanthology.org/D16-1120/) and [overclassify as toxic text related to specific social identities](https://arxiv.org/pdf/2109.07445.pdf), respectively.
### Other Known Limitations
As a consequence of some of the filtering steps applied, it is likely that code content is not prevalent in our dataset. If you are training a model that should also perform code tasks, we recommend you use 🍷 FineWeb with a code dataset, such as [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2). You should also consider complementing 🍷 FineWeb with specialized curated sources (such as Wikipedia), as they will likely have better formatting than the Wikipedia content included in 🍷 FineWeb (we did not tailor the processing to individual websites).
## Additional Information
### Licensing Information
The dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0** [license](https://opendatacommons.org/licenses/by/1-0/). The use of this dataset is also subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use).
### Future work
We plan to work on a better educational classifier to improve the quality of FineWeb-Edu.
### Citation Information
```
@software{lozhkov2024fineweb-edu,
author = {Lozhkov, Anton and Ben Allal, Loubna and von Werra, Leandro and Wolf, Thomas},
title = {FineWeb-Edu},
month = May,
year = 2024,
url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu}
}
```



